The cheapest token is the one you never generate. In traditional web development, caching (CDN, Redis, Browser Cache) is standard practice. You wouldn't query your primary SQL database for the same static homepage content 50 times a second.
Yet, in AI engineering, developers routinely send the exact same queries to OpenAI or Anthropic, paying full price every time. In 2026, if you aren’t caching, you are voluntarily overpaying. Here is a comprehensive guide to the three layers of AI Caching.
Layer 1: Semantic Caching (The Heavy Lifter)
Traditional caching is deterministic: it matches exact string keys ("weather:paris" == "weather:paris"). But LLM users are human. They might ask:
"How do I reset my password?"
"I forgot my password, can you help me change it?"
"Password reset steps please."
A string cache would treat these as three unique misses. Semantic Caching treats them as one hit.
How It Works
Embed: When a query comes in, generate a vector embedding (using a cheap model like text-embedding-3-small).
Search: Query your cache database (Redis/Valkey/Qdrant) for any existing vector with a Cosine Similarity > 0.90 (or your chosen threshold).
Hit: If found, return the stored LLM response from the previous query.
Miss: If not found, call the LLM, and then store the result + embedding in the cache.
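Here is a minimal sketch of that loop, assuming the OpenAI embeddings API and a plain in-memory list standing in for Redis/Valkey/Qdrant; call_llm() is a placeholder for your actual completion call, and the 0.90 threshold is the one mentioned above.
Python
import numpy as np
from openai import OpenAI

client = OpenAI()
SIMILARITY_THRESHOLD = 0.90

# In-memory stand-in for your vector store: (embedding, cached response) pairs.
semantic_cache: list[tuple[np.ndarray, str]] = []

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str) -> str:
    q_vec = embed(query)                          # Embed
    for vec, cached_response in semantic_cache:   # Search
        if cosine(vec, q_vec) >= SIMILARITY_THRESHOLD:
            return cached_response                # Hit: no LLM call, no LLM bill
    response = call_llm(query)                    # Miss: pay for the LLM once...
    semantic_cache.append((q_vec, response))      # ...then store result + embedding
    return response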
The Economics:
LLM Call Cost: $0.02
Embedding + Cache Lookup Cost: $0.0001
Latency: 50ms (Cache) vs 2,000ms (LLM)
Result: If you have a 30% hit rate (common in support bots), you slash your bill by nearly 30% instantly.
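As a quick back-of-the-envelope check, plugging those figures into a blended cost per query:
Python
LLM_COST = 0.02      # $ per uncached LLM call
CACHE_COST = 0.0001  # $ for embedding + lookup, paid on every query
HIT_RATE = 0.30

# Every query pays for the lookup; only misses pay for the LLM.
blended = CACHE_COST + (1 - HIT_RATE) * LLM_COST
print(f"${blended:.4f} per query vs ${LLM_COST:.4f} uncached "
      f"({1 - blended / LLM_COST:.0%} saved)")
# -> $0.0141 per query vs $0.0200 uncached (29% saved)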
Layer 2: Prompt Caching (The Infrastructure Strategy)
In mid-2024, providers like Anthropic and OpenAI introduced native Prompt Caching. This handles the "System Prompt" problem.
If you have an Agent with a complex persona, it might have a 5,000-token System Prompt containing:
2,000 tokens of Brand Voice guidelines.
2,000 tokens of Database Schema definitions.
1,000 tokens of "Few-Shot" examples.
Without caching, you send these 5,000 tokens with every single request, paying full price. With Prompt Caching, the API treats this prefix as a "Cacheable Artifact."
First Call: You pay full price to write the cache (~$15/1M tokens).
Subsequent Calls (within 5 min): You pay a drastically reduced "Read" price (~$1.50/1M tokens)—often a 90% discount.
Implementation Strategy: Place your static content (Persona, Guidelines) at the very start of your message array and your dynamic content (User Query, RAG context) at the end. The cache breaks at the first token that changes, so keep the static prefix identical across requests and as long as possible.
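With the Anthropic SDK, for example, you mark the end of the static prefix with a cache_control block. A sketch under those assumptions (the model name and the placeholder strings are illustrative; OpenAI, by contrast, applies its prompt caching automatically to long repeated prefixes without any annotation):
Python
import anthropic

# Static prefix (placeholders for the ~5,000 tokens described above).
BRAND_VOICE = "..."
DB_SCHEMA = "..."
FEW_SHOT_EXAMPLES = "..."

client = anthropic.Anthropic()

def ask(user_query: str, rag_context: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": BRAND_VOICE + DB_SCHEMA + FEW_SHOT_EXAMPLES,
                # Everything up to this marker is cached and billed at the
                # "Read" rate on subsequent calls within the cache window.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            # Dynamic suffix: never part of the cached prefix.
            {"role": "user", "content": f"{rag_context}\n\n{user_query}"}
        ],
    )
    return response.content[0].text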
Layer 3: Tool Output Caching (The Executor Strategy)
Agents love to use tools. They will obsessively check the weather, look up stock prices, or query your SQL database. If an agent checks the weather for "New York" five times in a 10-minute session, that is wasteful.
Wrap your Tool Execution logic in a standard cache with a Time-To-Live (TTL).
Python
from cachetools import TTLCache, cached
from langchain_core.tools import tool

@tool
@cached(TTLCache(maxsize=1024, ttl=600))  # cache results for 10 minutes
def get_weather(city: str) -> str:
    """Fetch the current weather for a city."""
    # Expensive API call to WeatherProvider
    return api.call(city)
This does not save you LLM tokens (the agent still generates the tool call itself), but it saves you Time and Downstream API Costs. Speed is a feature.
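One design note on the sketch above: cachetools keys the cache on the function arguments, so get_weather("New York") and get_weather("Boston") are stored as separate entries. Make sure every argument that actually changes the result is part of the tool's signature, otherwise one input's stale answer will be served for another.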
The Risk: Cache Invalidation
The famously hard problem in computer science applies here too: you must handle invalidation.
Semantic Drift: If your product pricing changes, your cached answer "The price is $50" is now a liability. You must wipe the cache on deployment.
Personalization: Do not semantically cache queries that rely on user-specific data ("What is my balance?"). If you must cache per-user answers, store a UserID or TenantID alongside each entry and apply it as a metadata filter on every lookup, so data never leaks between users.
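One way to enforce that isolation, sketched here with a Qdrant-backed semantic cache (the collection name, payload field, and threshold are illustrative): store the tenant ID in each cached entry's payload and filter on it at lookup time.
Python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")

def cache_lookup(query_vector: list[float], tenant_id: str) -> str | None:
    """Search the semantic cache, restricted to one tenant's entries."""
    hits = qdrant.search(
        collection_name="semantic_cache",  # illustrative collection name
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))]
        ),
        limit=1,
        score_threshold=0.90,  # same similarity cutoff as Layer 1
    )
    return hits[0].payload["response"] if hits else None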
Conclusion
Caching is the hallmark of a mature engineering team. In the AI Gold Rush, everyone is focused on "Intelligence." But "Efficiency" is what keeps you in business. Start with Semantic Caching for your FAQ-style interactions, enable Prompt Caching for your heavy agents, and enjoy the 40% margin improvement.
All in One Place
Atler Pilot decodes your cloud spend story by bringing monitoring, automation, and intelligent insights together for faster and better cloud operations.

