A company handling 8,000 monthly queries to an AI assistant can pay 800-1,600 PLN per month for tokens, with 40-55% of that spend coming from questions repeated in nearly identical form. A semantic cache eliminates this cost category without degrading response quality, provided you implement it with the right TTL and similarity threshold.
How semantic cache works step by step
The flow is as follows. The user’s query hits the cache layer before the model. The system generates its embedding (e.g., BGE-M3, text-embedding-3-small) and compares it with embeddings of previously stored queries in a vector database. If cosine similarity exceeds the configured threshold, the stored response is returned. Below the threshold, the query proceeds to the model, and the new (query, response) pair is added to the cache with an assigned TTL.
Three parameters control the mechanism’s behavior:
Similarity threshold: Values of 0.92-0.96 work for most customer service deployments. A lower threshold increases hit-rate but raises the risk of semantically close pairs receiving contextually mismatched responses. A higher threshold is safer but lets more queries bypass the cache.
TTL (Time to Live): Determines how long a response remains valid. Store hours may have a 24 h TTL, but product prices or order statuses require TTLs of 5-15 minutes or event-based invalidation after updates in the source system.
Context key: In multi-tenant systems (SaaS, multi-company support), the cache key should include the tenant ID. Without it, Company A might receive a response tailored to Company B’s data.
The physical architecture is simple: A vector database (e.g., Qdrant, Redis with vector module) stores embedding-response pairs, and the embedding model runs locally or via a fast API. Latency for cache hits is typically 10-30 ms compared to 300-1500 ms for full inference.
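The flow and the three parameters described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the `SemanticCache` class, its method names, and the linear scan are assumptions made for readability, and embeddings are passed in as plain lists so the sketch does not depend on any particular embedding model or vector database.

```python
import math
import time
from dataclasses import dataclass


@dataclass
class Entry:
    tenant_id: str          # context key: keeps tenants isolated
    embedding: list[float]  # embedding of the cached query
    response: str           # stored model response
    expires_at: float       # absolute expiry time (seconds)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache; a vector database would replace the scan."""

    def __init__(self, threshold: float = 0.94):
        self.threshold = threshold
        self.entries: list[Entry] = []

    def get(self, tenant_id, embedding, now=None):
        """Return the best matching response, or None on a cache miss."""
        now = time.time() if now is None else now
        best, best_score = None, -1.0
        for entry in self.entries:
            # Skip other tenants' entries and entries past their TTL.
            if entry.tenant_id != tenant_id or entry.expires_at <= now:
                continue
            score = cosine(embedding, entry.embedding)
            if score > best_score:
                best, best_score = entry, score
        if best is not None and best_score >= self.threshold:
            return best.response
        return None

    def put(self, tenant_id, embedding, response, ttl_seconds, now=None):
        """Store a (query embedding, response) pair with a TTL."""
        now = time.time() if now is None else now
        self.entries.append(Entry(tenant_id, embedding, response, now + ttl_seconds))
```

On a miss, the caller would send the query to the model and then call `put` with the new response. The optional `now` argument exists only to make expiry behavior easy to test.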
When semantic cache actually pays off
Not all query types are cacheable. The highest hit-rate comes from repetitive questions with infrequently changing answers.
Customer service is the classic case: “What are your hours?”, “How long does delivery take?”, “Can I return a product after 30 days?” These questions recur dozens of times weekly. A FAQ corpus of 50-100 Q&A pairs with a 24 h TTL yields a 50-65% hit-rate for stable product lines.
Technical documentation and internal knowledge bases are a second strong use case. Developer queries about configuration, API parameters, or installation steps are repetitive, and answers rarely change (a bulk invalidation by version tag suffices when the software is updated).
Onboarding assistants for new employees are a third scenario: Hundreds of users go through the same set of questions about procedures, systems, and permissions.
When the cache doesn’t help or does harm: Queries requiring fresh data (stock prices, real-time order statuses), personalized queries with user-specific data, and long conversational dialogues where session context alters the meaning of each subsequent message.
Table: Deployment scenario vs. hit-rate and risk
| Scenario | Estimated hit-rate | Main risk | Recommended TTL |
|---|---|---|---|
| Customer service FAQ (hours, delivery, returns) | 50-65% | Outdated hours after seasonal changes | 24 h + event-based invalidation |
| Technical documentation / knowledge base | 40-55% | Old doc version after update | Per version tag |
| HR onboarding assistant | 45-60% | Outdated procedures after policy changes | 7 days |
| Product prices / inventory levels | 5-15% | Customer sees incorrect price | Event-based invalidation (webhook) |
| Order statuses, transactional data | < 5% | Mismatch with ERP system | Do not cache |
| Personalized recommendations | 10-25% | False-positives across profiles | Key per user-id |
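One way to make the table actionable is to encode it as a per-category policy that the cache layer consults before storing anything. The sketch below is an assumption about how such a config could look: the category names, the `CachePolicy` fields, and the choice to refuse caching for unknown categories are all hypothetical, while the TTL values mirror the table rows.

```python
from dataclasses import dataclass
from typing import Optional

HOUR = 3600
DAY = 24 * HOUR


@dataclass(frozen=True)
class CachePolicy:
    cacheable: bool
    ttl_seconds: Optional[int]  # None: the table gives no time-based TTL
    invalidation: str           # "event", "version", "ttl", "none", "unspecified"
    key_scope: str = "tenant"   # "tenant" or "user"


# Hypothetical category names; values follow the table above.
POLICIES = {
    "faq": CachePolicy(True, DAY, "event"),
    "docs": CachePolicy(True, None, "version"),
    "hr_onboarding": CachePolicy(True, 7 * DAY, "ttl"),
    "prices_inventory": CachePolicy(True, None, "event"),
    "order_status": CachePolicy(False, None, "none"),
    "recommendations": CachePolicy(True, None, "unspecified", key_scope="user"),
}

# Assumption: anything not explicitly classified bypasses the cache.
DO_NOT_CACHE = CachePolicy(False, None, "none")


def policy_for(category: str) -> CachePolicy:
    """Look up the caching policy for a query category."""
    return POLICIES.get(category, DO_NOT_CACHE)
```

Defaulting to "do not cache" is the conservative choice here: a query that was never classified is treated like transactional data rather than like FAQ content.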
Risks and how to control them
False-positive hits are the most serious operational risk. At a 0.90 threshold, the query “Is product X available in blue?” might hit a cached response for “Is product Y available in red?” if sentence structures are semantically similar. Solution: Monitor false-positive rate (FPR) on a test set and raise the threshold if FPR exceeds 1-2%.
Stale responses erode customer trust more than a slow assistant. A company that changes its return policy from 14 to 30 days but fails to invalidate the cache will send incorrect information for hours. Event-based invalidation (webhook from CMS or CRM → DELETE cache by tag) is more critical than TTL alone.
Different answers to the same question can occur when the model updates but the cache retains old responses. Include the model version ID in the cache key or flush the cache after each update.
Increased operational complexity: Cache is an additional component that can fail (vector database downtime), requires monitoring, and consumes memory. Before deploying, assess whether your hit-rate justifies the overhead using the ROI calculator.
Useful metrics to monitor: hit-rate (target: > 35% for FAQ), FPR on test set (target: < 2%), average cache vs. model latency, percentage of responses invalidated before TTL. Learn more about AI system monitoring in the article on AI agent quality monitoring.
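The operational metrics listed above need only a few counters. The following sketch assumes a hypothetical `CacheMetrics` class that the cache layer updates on every query and every entry removal. It also assumes that the share of early invalidations is measured against all removed entries, which is one reasonable reading of that metric rather than a fixed definition. FPR is left out because it is measured offline on a labelled test set.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CacheMetrics:
    hit_latencies_ms: list = field(default_factory=list)
    miss_latencies_ms: list = field(default_factory=list)
    expired_by_ttl: int = 0
    invalidated_early: int = 0

    def record_query(self, hit: bool, latency_ms: float) -> None:
        """Record one query and whether it was served from the cache."""
        target = self.hit_latencies_ms if hit else self.miss_latencies_ms
        target.append(latency_ms)

    def record_removal(self, before_ttl: bool) -> None:
        """Record an entry leaving the cache, by invalidation or by TTL."""
        if before_ttl:
            self.invalidated_early += 1
        else:
            self.expired_by_ttl += 1

    def hit_rate(self) -> float:
        total = len(self.hit_latencies_ms) + len(self.miss_latencies_ms)
        return len(self.hit_latencies_ms) / total if total else 0.0

    def early_invalidation_share(self) -> float:
        removed = self.invalidated_early + self.expired_by_ttl
        return self.invalidated_early / removed if removed else 0.0

    def mean_latency_ms(self, hit: bool):
        """Average latency for cache hits (hit=True) or model calls (hit=False)."""
        data = self.hit_latencies_ms if hit else self.miss_latencies_ms
        return mean(data) if data else None
```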
Invalidation strategy: TTL isn’t enough
TTL alone only partially solves staleness. In practice, you need three mechanisms working in parallel.
Event-based invalidation: A webhook or domain event from the source system (CMS, ERP, CRM) hits the cache management layer and removes entries tied to a specific tag (e.g., product:P-123, policy:returns). This is the only effective method for irregularly changing data.
Version-based invalidation: Every update of the LLM or the knowledge base corpus clears the cache entirely or per segment. Cache keys include a document version hash, so old entries stop matching after an update (hash mismatch).
TTL as a safety net: Set a TTL (at least 1 h) even for “rarely changing” data. It protects against missed invalidation events (webhook failure, deployment bug).
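The three mechanisms can live in one small store. The sketch below is an illustration under stated assumptions: `TaggedStore` and `namespace` are hypothetical names, entries are looked up by an exact `entry_id` instead of by embedding similarity (omitted to keep the focus on invalidation), and the tag format follows the `policy:returns` example above.

```python
import hashlib
import time


def namespace(tenant_id: str, model_version: str, corpus_hash: str) -> str:
    """Key prefix that changes whenever the model or corpus version changes."""
    raw = f"{tenant_id}|{model_version}|{corpus_hash}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


class TaggedStore:
    def __init__(self):
        self.entries = {}  # entry_id -> (namespace, response, expires_at)
        self.by_tag = {}   # tag -> set of entry_ids

    def put(self, entry_id, ns, response, tags, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self.entries[entry_id] = (ns, response, now + ttl_seconds)
        for tag in tags:
            self.by_tag.setdefault(tag, set()).add(entry_id)

    def get(self, entry_id, ns, now=None):
        now = time.time() if now is None else now
        item = self.entries.get(entry_id)
        if item is None:
            return None
        stored_ns, response, expires_at = item
        # Version mismatch or elapsed TTL both count as a miss.
        if stored_ns != ns or expires_at <= now:
            return None
        return response

    def invalidate_tag(self, tag: str) -> int:
        """Event-based invalidation: drop every entry carrying the tag."""
        removed = 0
        for entry_id in self.by_tag.pop(tag, set()):
            if self.entries.pop(entry_id, None) is not None:
                removed += 1
        return removed
```

A webhook handler for a changed return policy would call `invalidate_tag("policy:returns")`. A model upgrade needs no explicit flush, because lookups under the new namespace simply stop matching old entries, and the TTL covers whatever both mechanisms miss.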
We detail knowledge management patterns in RAG, including versioning and incremental reindexing, in the article on RAG knowledge updates.
Implementation step by step
Minimal stack for semantic cache in customer service: An embedding model (BGE-M3 locally or API), Qdrant as a vector database with metadata filtering (tenant, tag, version), Redis for TTL and invalidation tag storage, and a lightweight orchestrator (n8n or custom code) tying the layers together.
Deploy in three phases: First, run in shadow mode (log hits and misses but always respond via the model) to measure realistic hit-rate before production. Second, calibrate the threshold: Test values of 0.90, 0.92, 0.94, 0.96 on a validation set with manually scored pairs. Third, gradually enable cache for categories with the highest hit-rate and lowest staleness risk.
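The calibration phase can be run offline against pairs logged in shadow mode. The sketch below assumes each validation pair has been reduced to a similarity score and a manual label saying whether the cached response would have been correct. It also assumes FPR means the share of should-not-match pairs that would still be served from the cache; the article does not fix a definition, so treat this as one possible reading. The function names are hypothetical.

```python
def evaluate(pairs, threshold):
    """Return (hit_rate, fpr) for labelled (similarity, is_correct) pairs."""
    if not pairs:
        return 0.0, 0.0
    hits = sum(1 for sim, _ in pairs if sim >= threshold)
    negatives = sum(1 for _, ok in pairs if not ok)
    false_hits = sum(1 for sim, ok in pairs if sim >= threshold and not ok)
    hit_rate = hits / len(pairs)
    fpr = false_hits / negatives if negatives else 0.0
    return hit_rate, fpr


def sweep(pairs, candidates=(0.90, 0.92, 0.94, 0.96), max_fpr=0.02):
    """Evaluate each candidate; pick the lowest one that keeps FPR in bounds."""
    results = {t: evaluate(pairs, t) for t in candidates}
    acceptable = [t for t in sorted(candidates) if results[t][1] <= max_fpr]
    chosen = acceptable[0] if acceptable else None
    return results, chosen
```

Choosing the lowest acceptable threshold maximizes hit-rate within the FPR budget. If no candidate qualifies, `chosen` is `None`, which suggests testing higher thresholds or keeping the category out of the cache.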
For detailed token cost optimization patterns, including prompt caching at the model API level (complementary to semantic cache), see the article on LLM token cost optimization. Semantic cache and AI ticket classification and routing work well as a shared layer before the model router, as we analyze in AI agent maintenance costs.
FAQ
What similarity threshold should I set initially?
Start with 0.94 and test on a set of 50-100 pairs with manually assessed accuracy. If the false-positive rate exceeds 2%, raise it to 0.96. If the hit-rate is below 20% and your data has low noise, try 0.92. There’s no one-size-fits-all value.
Does semantic cache work for conversational chatbots (multi-turn)?
With limitations. Caching entire conversation contexts is impractical (low hit-rate, large keys). A more effective approach is to cache only the first query in a session or fact-based questions (FAQ-like) that appear during the conversation and don’t depend on prior turns.
How do I measure if the cache actually reduces costs?
Track two counters: total system queries and queries passed to the model. Hit-rate = cache hits / total queries, where cache hits = total queries minus model queries. Cost without cache = total queries × average token cost per query. Cost with cache = model queries × average token cost per query + fixed cache infrastructure cost (embedding + vector DB). The difference is your actual savings, offset by the cost of maintaining the additional component.
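The formulas above translate directly into a small calculation. In this sketch the function name and parameters are hypothetical. The example figures combine the article's opening numbers (8,000 queries and 800 PLN per month, so roughly 0.10 PLN per query, with a 40% hit-rate) with an assumed fixed cache cost of 100 PLN, which is purely illustrative.

```python
def cache_savings(total_queries, model_queries, cost_per_query, cache_fixed_cost):
    """Return hit-rate and costs with and without the cache for one period."""
    hit_rate = (total_queries - model_queries) / total_queries if total_queries else 0.0
    cost_without = total_queries * cost_per_query
    cost_with = model_queries * cost_per_query + cache_fixed_cost
    return {
        "hit_rate": hit_rate,
        "cost_without_cache": cost_without,
        "cost_with_cache": cost_with,
        "savings": cost_without - cost_with,
    }


# 8,000 queries, 4,800 of which still reach the model (40% hit-rate).
report = cache_savings(
    total_queries=8000,
    model_queries=4800,
    cost_per_query=0.10,     # PLN, derived from 800 PLN / 8,000 queries
    cache_fixed_cost=100.0,  # PLN, assumed for illustration only
)
print(report)
```

Under these assumptions the savings come to roughly 220 PLN per month. A negative result means the fixed cost of the cache outweighs the tokens it saves at the measured hit-rate.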
Is semantic cache compliant with GDPR (RODO) if responses contain personal data?
Cached responses should not include personal data (names, order numbers, specific user addresses). Cache is suitable for generic responses (policies, procedures, FAQ). Personalized queries with user data should bypass the cache or be cached with a per-user key and TTL aligned with retention policies. Conduct a DPIA for any scenario where responses might indirectly contain sensitive data.
Are semantic cache and RAG the same technology?
No. RAG retrieves document fragments as context for the model’s response. Semantic cache stores ready-made responses and returns them without model involvement for similar queries. Both can work together: RAG provides knowledge for the first response, and the cache stores that response for subsequent similar queries. The article on semantic search and embeddings covers the vector layer common to both approaches.