ResponseCache
Cache LLM completions so identical prompts never hit the API twice.
Overview
ResponseCache stores the full response object returned by an LLM provider, keyed by a composite of:
- model ID -- distinguishes responses from different models (e.g. gpt-4 vs claude-3).
- messages hash -- SHA-256 digest of the serialized message list.
- params hash -- SHA-256 digest of generation parameters (temperature, max_tokens, etc.).
The cache uses pickle for serialization, so any Python object that is picklable can be stored -- raw strings, Pydantic models, or full SDK response objects.
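As a quick illustration of what "picklable" means in practice, the sketch below round-trips a plain response dict through pickle and checks a generator, which cannot be pickled. The helper name is_picklable and the sample data are hypothetical, and the generator merely stands in for a streamed reply; whether a given SDK's streaming object is picklable is something to verify for that SDK.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if pickle.dumps(obj) succeeds (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


# A plain response payload survives a dumps/loads round trip unchanged.
response = {"role": "assistant", "content": "Bonjour", "usage": {"total_tokens": 12}}
restored = pickle.loads(pickle.dumps(response))

# A generator (used here as a stand-in for a streamed reply) cannot be pickled.
stream = (chunk for chunk in ["Bon", "jour"])

print(restored == response)
print(is_picklable(response), is_picklable(stream))
```

A check like this can be run before caching an unfamiliar response type. Keep in mind that pickle.loads can execute arbitrary code, so only unpickle data from backends you control.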
When to use: Any time you call an LLM with deterministic, non-personalized prompts -- classification, extraction, summarization of static content, tool-calling chains that replay the same messages.
If your prompt includes timestamps, random IDs, or per-user data, those values become part of the key and will prevent cache hits. Strip volatile fields before caching or move them into a separate context layer.
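One way to strip volatile fields is to normalize the message list before it reaches the cache. The following is a minimal sketch, assuming message content is a plain string and that the volatile values are ISO-8601 timestamps or UUIDs; strip_volatile and the VOLATILE pattern are hypothetical names, not part of omnicache_ai.

```python
import re

# Hypothetical pattern: ISO-8601 timestamps or UUIDs embedded in message text.
VOLATILE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
    r"|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def strip_volatile(messages):
    """Return a copy of messages with timestamps and UUIDs replaced by a placeholder.

    Assumes every message has a string "content" field.
    """
    return [
        {**msg, "content": VOLATILE.sub("<volatile>", msg["content"])}
        for msg in messages
    ]


a = [{"role": "user",
      "content": "[2024-05-01T12:00:00Z] req 123e4567-e89b-12d3-a456-426614174000: classify this"}]
b = [{"role": "user",
      "content": "[2025-01-09T08:30:15Z] req 9b2f1c3e-0d4a-4f6b-8c7d-1a2b3c4d5e6f: classify this"}]

# Both requests normalize to the same message list, so they would share a cache key.
print(strip_volatile(a) == strip_volatile(b))
```

Pass the normalized list to the cache calls so that requests differing only in volatile values map to the same entry.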
Usage
Basic get / set
from omnicache_ai import CacheManager, InMemoryBackend, CacheKeyBuilder
from omnicache_ai.layers.response_cache import ResponseCache
manager = CacheManager(
    backend=InMemoryBackend(),
    key_builder=CacheKeyBuilder(),
)
cache = ResponseCache(manager)
messages = [{"role": "user", "content": "Translate 'hello' to French."}]
# Store a response
cache.set(messages, "Bonjour", model_id="gpt-4", ttl=3600)
# Retrieve it
result = cache.get(messages, model_id="gpt-4")
print(result) # "Bonjour"
get_or_generate -- compute on miss
import openai

def call_llm(msgs):
    # Forward the message list to the provider and return the full response object.
    return openai.chat.completions.create(
        model="gpt-4", messages=msgs
    )
# Returns cached value if available; otherwise calls call_llm,
# caches the result, and returns it.
response = cache.get_or_generate(
    messages=messages,
    generate_fn=call_llm,
    model_id="gpt-4",
    ttl=3600,
)
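The compute-on-miss flow can be illustrated without the library or an API key. The class below is a simplified stand-in, not omnicache_ai's ResponseCache: it ignores TTL and params, uses a naive key, and exists only to show that the generate function runs once for repeated identical calls. TinyResponseCache and fake_llm are hypothetical names.

```python
import pickle


class TinyResponseCache:
    """Illustrative stand-in for the get_or_generate flow (no TTL, no params)."""

    def __init__(self):
        self._store = {}

    def get_or_generate(self, messages, generate_fn, model_id="default"):
        key = (model_id, repr(messages))  # simplified key, not the real hashing scheme
        if key in self._store:
            # Hit: deserialize and return without calling generate_fn.
            return pickle.loads(self._store[key])
        # Miss: generate, store the pickled bytes, return the fresh response.
        response = generate_fn(messages)
        self._store[key] = pickle.dumps(response)
        return response


calls = []


def fake_llm(msgs):
    """Record each invocation so the number of real 'API calls' is visible."""
    calls.append(msgs)
    return "Bonjour"


tiny = TinyResponseCache()
msgs = [{"role": "user", "content": "Translate 'hello' to French."}]
first = tiny.get_or_generate(msgs, fake_llm, model_id="gpt-4")
second = tiny.get_or_generate(msgs, fake_llm, model_id="gpt-4")

print(first, second, len(calls))
```

The second call is served from the store, so fake_llm is invoked only once.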
Invalidate all entries for a model
removed = cache.invalidate_model("gpt-4")
print(f"Removed {removed} cached responses")
How It Works
- Key construction -- _build_key hashes the message list and the params dict independently using SHA-256 (truncated to 16 hex chars), then delegates to CacheKeyBuilder.build(), which produces a key like omnicache:resp:a3f9b2c1d4e5f678.
- Serialization -- Responses are serialized with pickle.dumps on write and deserialized with pickle.loads on read.
- Storage -- The pickled bytes are passed to CacheManager.set(), which applies the TTL policy and routes to the configured backend.
- Tagging -- Each entry is tagged model:{model_id} by default, enabling bulk invalidation via invalidate_model().
messages + params
|
v
_build_key()
SHA-256(messages) + SHA-256(params)
|
v
CacheKeyBuilder.build("response", hash, extra={model, params_hash})
|
v
"omnicache:resp:a3f9b2c1d4e5f678"
API Reference
Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| manager | CacheManager | required | The underlying cache manager instance. |
Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
| get | messages: list, model_id: str = "default", params: dict \| None = None | Any \| None | Return the cached response, or None on miss. |
| set | messages: list, response: Any, model_id: str = "default", params: dict \| None = None, ttl: int \| None = None, tags: list[str] \| None = None | None | Store a response in the cache. Tags default to ["model:{model_id}"]. |
| get_or_generate | messages: list, generate_fn: Callable[[list], Any], model_id: str = "default", params: dict \| None = None, ttl: int \| None = None | Any | Return the cached response, or call generate_fn, cache the result, and return it. |
| invalidate_model | model_id: str | int | Remove all cached responses tagged with the given model. Returns the number of keys removed. |
Generation parameters are part of the cache key: the same prompt called with temperature=0.0 and later with temperature=0.7 produces two separate cache entries. Pass an identical params dict on both the write and the read to get a hit.
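To see why differing parameters land in different entries, here is a minimal sketch of a composite key built from a model ID, a messages hash, and a params hash. It assumes JSON serialization with sorted keys and an assumed way of combining the three parts; the real serialization and key layout are decided by _build_key and CacheKeyBuilder.build(), and short_sha256 and build_key below are hypothetical helpers.

```python
import hashlib
import json


def short_sha256(obj) -> str:
    """SHA-256 of a canonical JSON dump, truncated to 16 hex chars (assumed scheme)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def build_key(messages, model_id="default", params=None) -> str:
    """Combine model ID, messages hash, and params hash into one key (assumed layout)."""
    messages_hash = short_sha256(messages)
    params_hash = short_sha256(params or {})
    combined = short_sha256([model_id, messages_hash, params_hash])
    return f"omnicache:resp:{combined}"


msgs = [{"role": "user", "content": "Translate 'hello' to French."}]

key_cold = build_key(msgs, "gpt-4", {"temperature": 0.0})
key_warm = build_key(msgs, "gpt-4", {"temperature": 0.7})
key_other_model = build_key(msgs, "claude-3", {"temperature": 0.0})

# Same inputs give the same key; changing params or model gives a different one.
print(key_cold == build_key(msgs, "gpt-4", {"temperature": 0.0}))
print(key_cold != key_warm, key_cold != key_other_model)
```

Sorting dict keys before hashing means that params dicts with the same contents in a different order still hash identically in this sketch.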
Source
omnicache_ai/layers/response_cache.py