In 2023, every developer built a RAG (Retrieval Augmented Generation) app. The tutorial is simple:
PDF -> Text
Text -> Chunks
Chunks -> Embeddings (OpenAI ada-002)
Embeddings -> Vector DB (Pinecone)
Query -> KNN Search -> GPT-4.
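For reference, that entire tutorial pipeline fits in a screenful. Here is a minimal sketch, assuming the openai Python package and a plain in-memory NumPy array standing in for Pinecone (the chunk contents are illustrative):
Python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Chunks produced by the PDF -> Text -> Chunks steps (contents are illustrative)
chunks = [
    "The liability cap is limited to 12 months of fees.",
    "The contract term is 24 months.",
]

def embed(texts):
    # text-embedding-ada-002 returns one 1536-dimensional vector per input string
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

chunk_vectors = embed(chunks)

# Query -> KNN search (cosine similarity) -> stuff the best chunk into the GPT-4 prompt
question = "What is the liability cap?"
query_vector = embed([question])[0]
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_chunk = chunks[int(np.argmax(scores))]

answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Context: {top_chunk}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)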
This "Naïve RAG" works for simple questions like "What is the capital of France?" because the answer is explicit and dense. It fails catastrophically for complex queries like "Compare the liability clauses in the 2023 and 2024 contracts and highlight the risk changes."
This post is a comprehensive guide to building Production RAG—the kind that legal firms and banks actually use.
Part 1: The Chunking Problem
Standard tutorials use CharacterTextSplitter (chunk size = 500, overlap = 50). This splits the text blindly every 500 characters. It creates chunks that start in the middle of a sentence and end in the middle of a table.
Technique A: Recursive Splitting
Instead of splitting blindly, we split by structure. We look for paragraph breaks (\n\n). If that's too big, we look for sentences (.). If the segment is still too large, then we look for words. This ensures that we never break a cohesive semantic unit. A paragraph is a "Thought." We want to index "Thoughts," not "Strings."
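Here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter, assuming the langchain-text-splitters package; the file path is illustrative, and the separator order mirrors the paragraph -> sentence -> word fallback described above:
Python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the raw text extracted from the PDF (path is illustrative)
with open("contract.txt") as f:
    document_text = f.read()

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", ". ", " "],  # paragraphs first, then sentences, then words
    chunk_size=500,
    chunk_overlap=50,
)

chunks = splitter.split_text(document_text)  # list of chunks that respect structure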
Technique B: Semantic Chunking (Advanced)
We stream the text and calculate the cosine distance between consecutive sentences.
Sentence 1: "The cat sat on the mat." (Topic: Cats)
Sentence 2: "He purred." (Topic: Cats)
Sentence 3: "The stock market crashed." (Topic: Finance)
The embedding distance between S2 and S3 is huge. This triggers a "Breakpoint." We cut the chunk there. This ensures that each chunk contains exactly one semantic idea. This technique (popularized by Greg Kamradt) drastically improves retrieval quality for mixed-topic documents.
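Here is a minimal sketch of the breakpoint logic, assuming sentence-transformers for embeddings; the naive sentence splitting and the 0.5 similarity threshold are illustrative, not tuned values:
Python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text, threshold=0.5):
    # Naive sentence split; production code would use a proper sentence tokenizer
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    vectors = model.encode(sentences)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = vectors[i - 1], vectors[i]
        similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if similarity < threshold:  # big topic jump -> breakpoint
            chunks.append(". ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(". ".join(current))
    return chunks

print(semantic_chunks("The cat sat on the mat. He purred. The stock market crashed."))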
Part 2: Why Vectors Are Not Enough (Hybrid Search)
Vector Search (Semantic) is great for concepts. "Show me things about happiness" matches "Joy," "Elation," "Smile." The vector captures the "Vibe." But Vector Search is terrible for Exact Matches (Keywords). If you search for "Error Code 503", a Vector DB might return "Error Code 404" because they are semantically similar (both are errors). But in Engineering, they are opposites.
The Solution: Hybrid Search (RRF). We run two searches in parallel:
Dense: Vector Search (captures meaning).
Sparse: BM25 Search (captures keywords like '503').
We combine the results using Reciprocal Rank Fusion (RRF). If a document appears in the top 5 of both lists, it gets a massive score boost. This effectively acts as a dynamic filter.
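Here is a minimal sketch of the fusion step, assuming you already have the two ranked lists of document IDs; k=60 is the constant commonly used in the RRF literature, and the example IDs are illustrative:
Python
def reciprocal_rank_fusion(dense_ranked, sparse_ranked, k=60):
    """Merge two ranked lists of doc IDs into one fused ranking."""
    scores = {}
    for ranked_list in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked_list, start=1):
            # Documents that rank highly in both lists accumulate a large score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranked = ["doc_7", "doc_2", "doc_9"]   # vector search results (illustrative)
sparse_ranked = ["doc_2", "doc_7", "doc_4"]  # BM25 results (illustrative)
print(reciprocal_rank_fusion(dense_ranked, sparse_ranked))  # doc_2 and doc_7 rise to the top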
Part 3: The Re-Ranking Layer
Retrieval (getting 100 docs from 1 million) is fast but inaccurate. Ranking (sorting those 100 docs) is slow but accurate.
The standard pattern is the "Two-Stage Pipeline."
Stage 1 (Bi-Encoder): Use a Vector DB to get the top 50 candidates. This uses a pre-computed dot product and is ultra-fast (< 50 ms).
Stage 2 (Cross-Encoder): Use a re-ranker model (like Cohere Rerank v3 or BGE-Reranker) to score each (Query, Document) pair. The cross-encoder looks at every word interaction. It is computationally expensive (it typically requires a GPU) but filters out the false positives, boosting precision from ~60% to ~90%.
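Here is a minimal sketch of Stage 2 with an open-source cross-encoder, assuming the sentence-transformers package and the BAAI/bge-reranker-base checkpoint from Hugging Face; the candidate list is illustrative and would normally come from Stage 1:
Python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

query = "What is the liability cap?"
candidates = [  # top-50 docs from the bi-encoder stage (illustrative)
    "The liability cap is limited to 12 months of fees.",
    "The office cafeteria menu changes weekly.",
]

# The cross-encoder scores every (query, document) pair jointly
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]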
Part 4: GraphRAG (The Knowledge Graph)
Sometimes, the answer isn't in a chunk. It's in the relationship between chunks. Query: "How does the CEO's bonus policy affect the Q3 layoffs?" Chunk A: "CEO Bonus Policy..." Chunk B: "Q3 Layoffs..." A Vector search might find A and B, but it won't understand the causal link.
Microsoft GraphRAG: we extract entities (CEO, Bonus, Q3, Layoffs) and their relationships (Affects, Causes, During) into a Knowledge Graph (Neo4j). When we query, we traverse the graph. We can say: "Find all entities connected to Layoffs within 2 hops." This surfaces the Bonus Policy even if the two chunks don't share keywords.
Part 5: Code Implementation
Here is how to implement Hybrid Search + Reranking using Python and Weaviate.
Python
import weaviate
import cohere

# 0. Connect to Weaviate (v3 client syntax) and Cohere
client = weaviate.Client("http://localhost:8080")
co = cohere.Client("COHERE_API_KEY")

# 1. Hybrid Search (Weaviate): alpha=0.5 gives equal weight to vector and keyword scores
response = (
    client.query.get("Article", ["content", "title"])
    .with_hybrid(
        query="What is the liability cap?",
        alpha=0.5,
    )
    .with_limit(50)
    .do()
)
initial_docs = response["data"]["Get"]["Article"]

# 2. Re-Ranking (Cohere Rerank v3)
rerank_results = co.rerank(
    model="rerank-english-v3.0",
    query="What is the liability cap?",
    documents=[doc["content"] for doc in initial_docs],
    top_n=5,
)

# 3. Context Construction: map the reranked indices back to the original documents
final_context = "\n".join(
    initial_docs[result.index]["content"] for result in rerank_results.results
)
Part 6: Advanced Contextual Retrieval (Anthropic)
In late 2024, Anthropic introduced a new technique: Contextual Embeddings. The problem: A chunk might say "The revenue was $5M." The vector doesn't know which company that refers to. The context was lost when we split the PDF.
The Fix: Before embedding, we use an LLM to prepend the document title/summary to every chunk.
Original Chunk: "The revenue was $5M."
Enriched Chunk: "[Document: Q3 Apple Earnings] The revenue was $5M."
Now the vector contains the company name. Retrieval accuracy improves by ~20%. Indexing costs more (because you run an LLM call on every chunk), but retrieval quality is far better.
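Here is a minimal sketch of the enrichment step, assuming the anthropic Python package and a Haiku-class model for cheap per-chunk calls; the prompt wording is illustrative, not Anthropic's exact prompt:
Python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def contextualize(chunk, document_summary):
    # Ask the LLM for one situating sentence to prepend before embedding
    prompt = (
        f"Document summary: {document_summary}\n\n"
        f"Chunk: {chunk}\n\n"
        "Write one short sentence situating this chunk within the document, "
        "to be prepended to the chunk before embedding."
    )
    message = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    context = message.content[0].text.strip()
    return f"{context} {chunk}"

enriched = contextualize("The revenue was $5M.", "Q3 Apple Earnings report")
# -> something like "This figure comes from the Q3 Apple Earnings report. The revenue was $5M."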
Part 7: Evaluation (RAGAS)
How do you know if your RAG is good? You need RAGAS (Retrieval Augmented Generation Assessment). It measures three metrics:
Faithfulness: Does the answer come from the context, or did the LLM hallucinate it?
Answer Relevance: Does the answer actually address the user's question?
Context Precision: Did the retrieval find the right chunks?
You run this as a CI/CD test. If your RAGAS score drops, you don't deploy.
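Here is a minimal sketch of such a test, assuming the ragas and datasets packages; the example rows and the 0.8 gate are illustrative, and RAGAS calls an LLM judge under the hood (OpenAI by default), so an API key is needed:
Python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# A tiny evaluation set in the column schema RAGAS expects (rows are illustrative)
eval_dataset = Dataset.from_dict({
    "question": ["What is the liability cap?"],
    "answer": ["The liability cap is 12 months of fees."],
    "contexts": [["Clause 9: liability is capped at 12 months of fees."]],
    "ground_truth": ["Liability is capped at 12 months of fees."],
})

result = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

# Fail the CI job if quality regresses (threshold is illustrative)
assert result["faithfulness"] > 0.8, f"Faithfulness regression: {result}"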
Part 8: Future Outlook (Long Context vs RAG)
Gemini 1.5 Pro has a 2 Million Token context window. Can we just dump the entire database into the prompt and kill RAG? No.
Cost: Putting 1M tokens into the prompt for every query costs $5. RAG costs $0.05.
Latency: Processing 1M tokens takes 30 seconds. RAG takes 2 seconds.
Performance: "Lost in the Middle" phenomenon. LLMs are bad at finding a needle in a 2M token haystack.
RAG is effectively "Dynamic Context Caching." It works because the "Working Memory" of an LLM is expensive, while the "Long Term Memory" of a Vector DB is cheap. This economic fundamental will not change.
Part 9: Implementation Checklist
Implement Hybrid Search: Don't rely on Vectors alone.
Add Re-ranking: This is the easiest way to boost precision by ~20%.
Use Semantic Chunking: Stop cutting sentences in half.
Measure with RAGAS: Stop guessing if your bot is good.
Deep Dive: GraphRAG (The Knowledge Graph Revolution)
Vector DBs fail at "Multi-Hop Reasoning." Query: "How does the CEO's strategy affect the Engineering Budget?" Vectors find "CEO strategy" and "Engineering Budget" separately; they miss the link between them. GraphRAG extracts entities (CEO, Budget) and relationships (affects) into a Knowledge Graph (Neo4j) and traverses that graph to answer relational questions. It is 10x more expensive to build, but dramatically more accurate for complex, multi-hop logic.
Python
# LlamaIndex Knowledge Graph Construction
# Turning text into triplets (Subject, Predicate, Object)
from llama_index.core import KnowledgeGraphIndex, Settings, SimpleDirectoryReader
from llama_index.llms.gemini import Gemini

# Use Gemini as the triplet-extraction LLM (assumes GOOGLE_API_KEY is set)
Settings.llm = Gemini(model="models/gemini-1.5-pro")

# Load the source documents; the LLM extracts triplets like (Elon Musk, owns, X)
documents = SimpleDirectoryReader("./data").load_data()
index = KnowledgeGraphIndex.from_documents(
    documents,
    max_triplets_per_chunk=2,
)

# Querying the graph
response = index.as_query_engine().query("Who owns X?")
print(response)
Part 10: Glossary
Embeddings: Converting text into a vector of numbers (e.g. [0.1, 0.5, ...]).
RRF: Reciprocal Rank Fusion. An algorithm to merge keyword and vector search results.
Chunks: Small segments of text used for retrieval.
GraphRAG: Using Knowledge Graphs to perform retrieval based on relationships.
Conclusion
Building RAG is easy. Tuning RAG is an art. It requires understanding the full stack: from the linguistics of Chunking to the mathematics of Vectors to the economics of LLMs. Start with Hybrid Search. Add Re-ranking. Then, and only then, look at GraphRAG.