Semantic Cache

Return cached answers for semantically similar queries -- not just exact string matches. Uses a vector index (FAISS or ChromaDB) to find near-duplicate questions.

pip install 'omnicache-ai[vector-faiss]'

8.1 Standalone semantic cache

Build a semantic cache with your own embedding function and a cosine-similarity threshold.

import numpy as np
from omnicache_ai import SemanticCache
from omnicache_ai.backends.memory_backend import InMemoryBackend
from omnicache_ai.backends.vector_backend import FAISSBackend

# Provide your own embed function (wraps any embedder)
from langchain_openai import OpenAIEmbeddings
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

def embed(text: str) -> np.ndarray:
    return np.array(embedder.embed_query(text), dtype=np.float32)

cache = SemanticCache(
    exact_backend=InMemoryBackend(),
    vector_backend=FAISSBackend(dim=1536),
    embed_fn=embed,
    threshold=0.93,   # cosine similarity cutoff
)

# Populate
cache.set("What is the capital of France?", "Paris")
cache.set("Who wrote Hamlet?", "William Shakespeare")

# Exact match
print(cache.get("What is the capital of France?"))   # "Paris"

# Semantically similar — still hits
print(cache.get("What's France's capital city?"))     # "Paris"
print(cache.get("Which city is the capital of France?"))  # "Paris"

# Unrelated — cache miss
print(cache.get("What is the speed of light?"))       # None

tip

Tune the threshold parameter to balance between cache hit rate and answer relevance. A value of 0.93 works well for paraphrase detection; lower it to 0.88-0.90 for broader matching.

8.2 Semantic cache inside a LangChain chain

Use the semantic cache as a front-end for a LangChain LLM call so that rephrased questions hit the cache instead of the model.

import numpy as np
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from omnicache_ai import SemanticCache
from omnicache_ai.backends.memory_backend import InMemoryBackend
from omnicache_ai.backends.vector_backend import FAISSBackend

embedder = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini")

sem_cache = SemanticCache(
    exact_backend=InMemoryBackend(),
    vector_backend=FAISSBackend(dim=1536),
    embed_fn=lambda t: np.array(embedder.embed_query(t), dtype=np.float32),
    threshold=0.95,
)

def cached_ask(question: str) -> str:
    cached = sem_cache.get(question)
    if cached is not None:
        print("[CACHE HIT]")
        return cached
    answer = llm.invoke(question).content
    sem_cache.set(question, answer)
    return answer

print(cached_ask("What is RAG?"))
print(cached_ask("Can you explain RAG?"))          # hit
print(cached_ask("Describe retrieval augmented generation."))  # hit

8.3 Semantic cache with ChromaDB backend

Swap FAISS for ChromaDB for a persistent, production-ready vector store.

# pip install 'omnicache-ai[vector-chroma]'
import numpy as np
from omnicache_ai import SemanticCache
from omnicache_ai.backends.memory_backend import InMemoryBackend
from omnicache_ai.backends.vector_backend import ChromaBackend

cache = SemanticCache(
    exact_backend=InMemoryBackend(),
    vector_backend=ChromaBackend(collection_name="qa_cache", dim=1536),
    embed_fn=lambda t: np.random.rand(1536).astype(np.float32),  # replace with real embedder
    threshold=0.92,
)
cache.set("Explain transformers", "Transformers are a type of neural network...")
print(cache.get("What are transformers?"))

note

Replace the random embedding function with a real embedder (e.g., OpenAI, Sentence Transformers) for meaningful semantic matching.

8.1 Standalone semantic cache​

8.2 Semantic cache inside a LangChain chain​

8.3 Semantic cache with ChromaDB backend​

8.1 Standalone semantic cache

8.2 Semantic cache inside a LangChain chain

8.3 Semantic cache with ChromaDB backend