Building and Evaluating Advanced RAG
Notes from the Building and Evaluating Advanced RAG course on DeepLearning.AI — Jerry Liu (LlamaIndex) & Anupam Datta (TruEra).
Table of Contents
- Advanced RAG Pipeline Overview
- RAG Triad of Metrics
- Sentence-Window Retrieval
- Auto-Merging Retrieval
- Evaluation with TruLens
- Summary & Choosing the Right Technique
Advanced RAG Pipeline Overview
Why Naive RAG Falls Short
A naive RAG pipeline — embed documents in fixed-size chunks, retrieve top-k, stuff into prompt — works for simple cases but breaks down on:
| Problem | Symptom |
|---|---|
| Chunks too small | Retrieved chunk lacks enough context; LLM gives incomplete answers |
| Chunks too large | Noisy retrieval; relevant detail diluted by surrounding text |
| Embedding granularity mismatch | You retrieve at chunk level but need surrounding paragraphs |
| No evaluation | No way to know if retrieval or generation is failing |
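The fixed-size chunking step of the naive baseline can be sketched as below. This is a minimal illustration that counts whitespace-separated words rather than model tokens; `chunk_words`, `chunk_size`, and `overlap` are hypothetical names chosen for this example, not LlamaIndex API.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into fixed-size word chunks; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


text = "one two three four five six seven eight nine ten eleven twelve"
chunks = chunk_words(text, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Changing `chunk_size` shows both failure modes from the table: small values cut sentences apart, large values bundle unrelated text into a single chunk.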
Advanced RAG Landscape
Query
│
▼
[Pre-Retrieval] — query rewriting, HyDE, query expansion
│
▼
[Retrieval] — sentence-window, auto-merging, hybrid search
│
▼
[Post-Retrieval] — re-ranking, context compression
│
▼
[Generation] — LLM with retrieved context
│
▼
[Evaluation] — RAG triad (TruLens / RAGAS)
LlamaIndex Building Blocks
| Component | Role |
|---|---|
| SimpleDirectoryReader | Load documents from disk |
| SentenceSplitter | Chunk documents with overlap |
| VectorStoreIndex | Build vector index over chunks |
| RetrieverQueryEngine | Orchestrate retrieval + generation |
| NodePostprocessor | Post-process retrieved nodes (rerank, filter) |
RAG Triad of Metrics
The RAG Triad (from TruEra) defines three independent axes of quality — all three must be high for a trustworthy RAG system.
        Answer Relevance
               ↑
               │
Context ───────┼─────── Groundedness
Relevance      │
               │
   (query, context, response)
1. Context Relevance
“Is the retrieved context actually relevant to the query?”
- Measures whether the retriever is surfacing useful chunks
- Failure mode: retriever returns plausible-sounding but off-topic chunks
- Score: fraction of retrieved context sentences relevant to the query
2. Groundedness
“Is the LLM’s answer supported by the retrieved context?”
- Also called faithfulness — measures hallucination
- Failure mode: LLM generates confident statements not found in context
- Score: fraction of claims in the response that can be traced back to context
3. Answer Relevance
“Does the answer actually address what the user asked?”
- Measures end-to-end response quality relative to the original query
- Failure mode: technically grounded response that doesn’t answer the question
- Score: semantic similarity between response and query intent
Why All Three Together?
| Scenario | Context Rel. | Groundedness | Answer Rel. | Problem |
|---|---|---|---|---|
| Hallucination | ✅ High | ❌ Low | ❌ Low | LLM ignores context |
| Wrong retrieval | ❌ Low | ✅ High | ❌ Low | Faithful to wrong docs |
| Off-topic answer | ✅ High | ✅ High | ❌ Low | Doesn’t answer the question |
| Good RAG | ✅ High | ✅ High | ✅ High | — |
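The scenario table can be read as a small decision procedure. The sketch below is one possible encoding, assuming scores in [0, 1] and a hypothetical cut-off of 0.7 for "high"; the function name `diagnose` and the threshold are illustrative and not part of TruLens.

```python
def diagnose(context_rel: float, groundedness: float, answer_rel: float,
             threshold: float = 0.7) -> str:
    """Map three triad scores to the failure scenarios in the table."""
    ctx_ok = context_rel >= threshold
    grd_ok = groundedness >= threshold
    ans_ok = answer_rel >= threshold
    if ctx_ok and grd_ok and ans_ok:
        return "good RAG"
    if not ctx_ok:
        # Retrieval is upstream of everything else, so check it first.
        return "wrong retrieval: fix the retriever first"
    if not grd_ok:
        return "hallucination: the LLM is not using the context"
    return "off-topic answer: grounded but does not address the query"


print(diagnose(0.9, 0.2, 0.3))
```

Checking context relevance first is a design choice: when retrieval is poor, the two downstream scores say little about the generator.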
Sentence-Window Retrieval
The Problem
Small chunks embed well (each chunk is semantically coherent), but retrieved chunks often lack enough surrounding context for the LLM to synthesize a good answer.
The Technique
Embed small, retrieve big.
- Split documents into small units (1–3 sentences each)
- Each sentence node stores its surrounding window (e.g., ±2 sentences on each side)
- At index time: embed only the small sentence, so query matching stays precise
- When sending to the LLM: replace the matched sentence with its full window
Document: ... [S1] [S2] [S3] [S4] [S5] [S6] [S7] ...
                              ↑
                        Retrieved (S4)
Window sent to LLM: [S2] [S3] [S4] [S5] [S6]
(window_size=2 on each side)
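The "embed small, retrieve big" idea can be sketched without any library. In the example below, word overlap stands in for embedding similarity, and all names (`split_sentences`, `build_window_nodes`, `retrieve_window`) are hypothetical helpers written for this illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_window_nodes(sentences: list[str], window_size: int = 2) -> list[dict]:
    """One node per sentence; each node also carries its surrounding window."""
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({"text": sentence, "window": " ".join(sentences[lo:hi])})
    return nodes


def word_overlap(a: str, b: str) -> int:
    """Toy relevance score, a stand-in for embedding similarity."""
    return len(set(re.findall(r"\w+", a.lower())) & set(re.findall(r"\w+", b.lower())))


def retrieve_window(query: str, nodes: list[dict]) -> str:
    """Match on the single sentence, but return the wider window."""
    best = max(nodes, key=lambda node: word_overlap(query, node["text"]))
    return best["window"]


doc = ("Cats sleep a lot. Dogs enjoy long walks. HNSW is a graph index. "
       "It searches layer by layer. Recall is tunable. Bananas are yellow. Rain is wet.")
nodes = build_window_nodes(split_sentences(doc), window_size=1)
window = retrieve_window("What is HNSW?", nodes)
print(window)
```

The match is made on the sentence that mentions HNSW, while the text handed to the LLM also includes the neighbouring sentence that explains how it searches.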
LlamaIndex Implementation
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
# Build nodes with sentence windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
# `documents` is assumed to be loaded already (e.g. with SimpleDirectoryReader)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
# At query time: replace the matched sentence with its window
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)
query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[postproc],
)
When to Use
- Documents with dense, information-rich prose
- When individual sentences are semantically meaningful on their own
- When you want tight retrieval precision but broader LLM context
Auto-Merging Retrieval
The Problem
Chunking is arbitrary — a topic may span multiple adjacent chunks. If multiple chunks from the same parent section are retrieved, it’s better to send the full parent than redundant fragments.
The Technique
Build a hierarchy; merge up when children dominate.
- Parse documents into a hierarchical tree — large parent chunks containing smaller leaf chunks
- Index and embed only the leaf nodes (small chunks → precise retrieval)
- At query time: if enough leaf children of a parent are retrieved, merge them into the parent node
Document
└── Chapter (parent, ~512 tokens)
    ├── Paragraph 1 (leaf, ~128 tokens) ← retrieved
    ├── Paragraph 2 (leaf, ~128 tokens) ← retrieved
    ├── Paragraph 3 (leaf, ~128 tokens) ← retrieved
    └── Paragraph 4 (leaf, ~128 tokens)
If 3/4 leaves retrieved → merge into Chapter node
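The merge rule in the tree above can be sketched in plain Python. This is a simplified, single-level illustration (a real hierarchy may merge repeatedly up several levels); `auto_merge`, `parent_of`, `children_of`, and `ratio_thresh` are hypothetical names for this example.

```python
from collections import defaultdict


def auto_merge(retrieved_leaves: list[str],
               parent_of: dict[str, str],
               children_of: dict[str, list[str]],
               ratio_thresh: float = 0.5) -> list[str]:
    """Replace sibling leaves with their parent when enough of them were retrieved."""
    hits = defaultdict(list)
    for leaf in retrieved_leaves:
        hits[parent_of[leaf]].append(leaf)

    merged, seen = [], set()
    for leaf in retrieved_leaves:
        parent = parent_of[leaf]
        ratio = len(hits[parent]) / len(children_of[parent])
        result = parent if ratio > ratio_thresh else leaf
        if result not in seen:  # keep retrieval order, drop duplicates
            seen.add(result)
            merged.append(result)
    return merged


children_of = {
    "chapter_1": ["p1", "p2", "p3", "p4"],
    "chapter_2": ["p5", "p6", "p7", "p8"],
}
parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}

merged = auto_merge(["p1", "p2", "p3", "p6"], parent_of, children_of)
print(merged)
```

Three of four leaves under `chapter_1` were retrieved (ratio 0.75), so they collapse into the parent; the single hit under `chapter_2` (ratio 0.25) stays a leaf.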
LlamaIndex Implementation
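from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore
# Build hierarchy: 2048 → 512 → 128 tokens
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
# `documents` is assumed to be loaded already (e.g. with SimpleDirectoryReader)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
# Store all nodes (parents + leaves) so parents can be looked up at merge time
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
# Index and embed only the leaf nodes
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
# Retrieve leaves, auto-merge to parents
base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context,
    verbose=True,
    simple_ratio_thresh=0.4,  # merge when more than 40% of a parent's children are retrieved
)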
When to Use
- Long documents with nested structure (chapters, sections, paragraphs)
- When queries often span a sub-topic that exists within one section
- To reduce redundancy when multiple overlapping chunks get retrieved
Evaluation with TruLens
Why Automated Evaluation Matters
Manual evaluation doesn’t scale. TruLens provides automatic, LLM-assisted evaluation using the RAG Triad — letting you compare pipeline configurations systematically.
Setup
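import numpy as np
from trulens_eval import Feedback, Tru, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI
tru = Tru()
provider = fOpenAI()
# Text of the retrieved source nodes, used as the "context" argument below
# (selectors follow the trulens_eval API; adjust if your installed version differs)
context = TruLlama.select_source_nodes().node.text
# Define feedback functions
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(context)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_answer_relevance = Feedback(provider.relevance).on_input_output()
# Wrap your query engine
tru_recorder = TruLlama(
    query_engine,
    app_id="my-rag-v1",
    feedbacks=[f_groundedness, f_context_relevance, f_answer_relevance],
)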
Running Eval
eval_questions = ["What is a vector database?", "How does HNSW work?"]
with tru_recorder as recording:
    for q in eval_questions:
        query_engine.query(q)
tru.get_leaderboard(app_ids=["my-rag-v1"])
Reading the Leaderboard
| Pipeline | Context Rel. | Groundedness | Answer Rel. | Latency |
|---|---|---|---|---|
| Naive RAG | 0.62 | 0.71 | 0.65 | 1.2s |
| Sentence Window | 0.79 | 0.83 | 0.80 | 1.5s |
| Auto-Merging | 0.81 | 0.88 | 0.82 | 1.8s |
Compare configurations and iterate — TruLens shows which component is failing, not just overall quality.
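To make the leaderboard less of a black box, the sketch below shows how such a table could be derived from per-question feedback scores: average each metric per pipeline, then flag the weakest one. The records are invented purely for illustration, and `leaderboard` and `weakest_metric` are hypothetical helpers, not TruLens functions.

```python
from collections import defaultdict
from statistics import mean

METRICS = ("context_relevance", "groundedness", "answer_relevance")

# Hypothetical per-question scores for two pipelines (made-up numbers).
records = [
    {"app_id": "naive", "context_relevance": 0.5, "groundedness": 0.75, "answer_relevance": 0.75},
    {"app_id": "naive", "context_relevance": 0.25, "groundedness": 0.75, "answer_relevance": 0.5},
    {"app_id": "window", "context_relevance": 0.75, "groundedness": 1.0, "answer_relevance": 0.75},
    {"app_id": "window", "context_relevance": 0.75, "groundedness": 0.25, "answer_relevance": 1.0},
]


def leaderboard(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Average every metric per app_id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["app_id"]].append(row)
    return {
        app_id: {m: mean(r[m] for r in app_rows) for m in METRICS}
        for app_id, app_rows in grouped.items()
    }


def weakest_metric(scores: dict[str, float]) -> str:
    """Name of the lowest-scoring metric, i.e. the likely bottleneck."""
    return min(scores, key=scores.get)


board = leaderboard(records)
for app_id, scores in board.items():
    print(app_id, scores, "-> weakest:", weakest_metric(scores))
```

In this toy data the naive pipeline is limited by context relevance, which points at the retriever, while the second pipeline is limited by groundedness, which points at generation.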
TruLens Dashboard
tru.run_dashboard() # Opens at http://localhost:8501
Provides per-question drill-down: which context chunks were retrieved, which claims were grounded, where the pipeline failed.
Summary & Choosing the Right Technique
Comparison
| Technique | How It Works | Best For |
|---|---|---|
| Naive RAG | Fixed chunks, embed, retrieve | Prototyping, simple corpora |
| Sentence Window | Embed sentences, retrieve windows | Dense prose; high retrieval precision |
| Auto-Merging | Hierarchical chunks; merge if dominant | Long structured docs; reducing redundancy |
| Hybrid Search | Dense + sparse retrieval | Mixed semantic + keyword queries |
Iterative Improvement Loop
1. Define eval questions
2. Baseline: measure RAG Triad on naive pipeline
3. Identify bottleneck (context relevance? groundedness?)
4. Apply targeted technique
5. Re-evaluate; compare leaderboard
6. Repeat
Key Design Decisions
| Decision | Naive Default | Advanced Options |
|---|---|---|
| Chunk size | Fixed (512 tokens) | Sentence-level, hierarchical |
| Retrieval | Dense kNN | Hybrid, sentence window, auto-merge |
| Post-processing | None | Reranking (CohereRerank, ColBERT) |
| Evaluation | Manual spot-check | TruLens / RAGAS automated triad |
Key Takeaways
| Concept | Key Idea |
|---|---|
| RAG Triad | Context Relevance + Groundedness + Answer Relevance — all three must be high |
| Sentence Window | Embed small → precise match; retrieve window → richer LLM context |
| Auto-Merging | Embed leaves → fast retrieval; merge to parents → reduces fragmentation |
| TruLens | Automated LLM-based evaluation; per-component failure diagnosis |
| Iterative Eval | Measure → identify bottleneck → apply technique → re-measure |
Course: Building and Evaluating Advanced RAG — DeepLearning.AI
Vector Databases: from Embeddings to Applications
Notes from the Vector Databases: from Embeddings to Applications course on DeepLearning.AI — Sebastian Witalec (Weaviate).
Table of Contents
- Introduction
- How to Obtain Vector Representations of Data
- Search for Similar Vectors
- Approximate Nearest Neighbors
- Vector Databases in LLM Applications
- Sparse vs Dense Vectors & Hybrid Search
- Application: Multilingual Search
Introduction
What is RAG and why do we need it?
AI models aren’t trained on recent or proprietary data. Retrieval-augmented generation (RAG) tackles this problem by retrieving relevant documents at query time and passing them to the model as context.
Why vector databases?
Traditional databases store exact values and search with exact matches or range queries. AI applications need something different — the ability to search by meaning, not exact text.
A vector database stores data as high-dimensional numerical vectors (embeddings) and enables fast similarity search over those vectors. This is the foundation of semantic search, RAG pipelines, recommendation systems, and more.
How to Obtain Vector Representations of Data
What Is an Embedding?
An embedding is a fixed-length numerical vector that encodes the semantic meaning of a piece of data (text, image, audio, etc.). Semantically similar items end up close together in vector space. Vector embedding captures the underlying meaning of the data.
Encoding-decoding architecture
Embeddings can be learned with an encoder-decoder architecture. For example, an MNIST image is 28 × 28 pixels, so it has 28 * 28 = 784 dimensions. The encoder compresses it 784 -> 256 -> 128 -> 2, and the decoder expands it 2 -> 128 -> 256 -> 784 and tries to re-create the image. The reconstruction won’t be perfect on the first pass, so the process repeats, and each pass adjusts the model’s weights to reduce the reconstruction error. The compressed middle layer (2 values in this example) is the embedding.
Two similar images or texts end up close together in vector space:
"dog" → [0.12, -0.84, 0.23, ..., 0.55] # 1536-dim vector
"puppy" → [0.13, -0.81, 0.25, ..., 0.52] # close to "dog"
"car" → [-0.72, 0.44, -0.11, ..., 0.08] # far from "dog"
Embedding Models
| Model | Dimensions | Notes |
|---|---|---|
| text-embedding-ada-002 (OpenAI) | 1536 | General purpose, widely used |
| all-MiniLM-L6-v2 (Sentence Transformers) | 384 | Fast, open-source |
| multi-qa-mpnet-base-dot-v1 | 768 | Optimized for QA retrieval |
| CLIP (OpenAI) | 512 | Multimodal: text + images |
How to calculate distance (similarity) between two embeddings:
The four most popular distance metrics used with vector databases:
- Euclidean Distance: The length of the shortest path between two points or vectors.
- Manhattan Distance: The distance between two points when movement is constrained to one axis at a time.
- Dot Product: Measures the magnitude of the projection of one vector onto the other, scaled by the length of the other vector.
- Cosine: Measures the difference in direction between vectors.
Dot product and cosine distance are commonly used in NLP to evaluate how similar two sentence embeddings are.
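The four metrics can be written out directly. The sketch below uses plain Python lists and tiny 2-D vectors chosen so the results are easy to check by hand; the function names are just for this illustration.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: list[float], b: list[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)


a = [0.0, 3.0]
b = [4.0, 0.0]   # orthogonal to a
c = [0.0, 6.0]   # same direction as a, twice the length

print(euclidean(a, b), manhattan(a, b), dot(a, b), cosine_distance(a, b))
print(euclidean(a, c), cosine_distance(a, c))
```

Vectors `a` and `c` point the same way, so their cosine distance is 0 even though their Euclidean distance is 3: cosine looks only at direction, while Euclidean and Manhattan are sensitive to length.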
Search for Similar Vectors
Distance Metrics
The core of similarity search is a distance function between two vectors:
| Metric | Formula | Best For |
|---|---|---|
| Euclidean (L2) | $\sqrt{\sum (a_i - b_i)^2}$ | When magnitude matters |
| Cosine Similarity | $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ | Normalized text embeddings |
| Dot Product | $\sum a_i b_i$ | When vectors are pre-normalized |
Cosine similarity is the most common choice for text embeddings: it measures the angle between vectors and ignores magnitude, so “dog” and “puppy” are close whenever their vectors point in a similar direction, regardless of vector length.
Brute-Force (Exact) Search
Also called k-Nearest Neighbor (kNN) — compare query vector against every stored vector and return the $k$ closest.
- Accuracy: 100% exact
- Complexity: $O(n \cdot d)$ per query — linear in dataset size $n$ and vector dimension $d$
- Problem: Infeasible for millions of vectors at low latency
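A brute-force search is short enough to write out in full. The sketch below uses Euclidean distance via `math.dist` and made-up 2-D vectors; `knn` is a hypothetical helper name.

```python
import heapq
import math


def knn(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k stored vectors closest to the query.

    Every stored vector is compared with the query, so the cost grows
    linearly with both the number of vectors and their dimension.
    """
    scored = ((math.dist(query, vec), idx) for idx, vec in enumerate(vectors))
    return [idx for _, idx in heapq.nsmallest(k, scored)]


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2], [10.0, 0.0]]
nearest = knn([1.0, 1.0], vectors, k=2)
print(nearest)
```

The loop over all stored vectors is the part that ANN indexes try to avoid.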
Approximate Nearest Neighbors
ANN algorithms trade a small accuracy loss for dramatic speed gains — instead of finding the exact nearest neighbors, they find very close neighbors in milliseconds.
Key ANN Algorithms
| Algorithm | Approach | Notes |
|---|---|---|
| HNSW (Hierarchical Navigable Small World) | Graph-based | Best accuracy/speed tradeoff; default in most vector DBs |
| IVF (Inverted File Index) | Clustering | Quantizes space into Voronoi cells; scales well |
| PQ (Product Quantization) | Compression | Reduces memory by compressing vectors |
| FAISS | Meta’s library combining the above | Open source, widely used |
HNSW Intuition
HNSW builds a layered graph where higher layers have coarser connections and lower layers have fine-grained connections — like zooming in on a map. Search starts at the top (few nodes, big jumps) and drills down to find the actual nearest neighbors.
Layer 2 (sparse): A ——— C ——— F
Layer 1: A — B — C — D — E — F
Layer 0 (dense): A-a-B-b-C-c-D-d-E-e-F-f
↑ query lands here
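The "start coarse, then zoom in" search can be sketched on the three layers drawn above. This is a heavily simplified toy, assuming 1-D node positions and a single greedy walker per layer; real HNSW uses multi-dimensional vectors, several links per node, and a candidate list controlled by `ef`. All names here are hypothetical.

```python
def chain(names: list[str]) -> dict[str, list[str]]:
    """Link each node to its immediate left and right neighbour."""
    neighbors = {name: [] for name in names}
    for left, right in zip(names, names[1:]):
        neighbors[left].append(right)
        neighbors[right].append(left)
    return neighbors


# Assumed 1-D positions for the twelve nodes in the diagram: A=0, a=1, B=2, ...
order = ["A", "a", "B", "b", "C", "c", "D", "d", "E", "e", "F", "f"]
points = {name: float(i) for i, name in enumerate(order)}

layers = [
    chain(["A", "C", "F"]),                 # layer 2 (sparse)
    chain(["A", "B", "C", "D", "E", "F"]),  # layer 1
    chain(order),                           # layer 0 (dense)
]


def greedy_search(query: float, entry: str, neighbors: dict[str, list[str]]) -> str:
    """Move to the closest neighbour until no neighbour is closer than the current node."""
    current = entry
    while True:
        best = min(neighbors[current], key=lambda n: abs(points[n] - query), default=current)
        if abs(points[best] - query) < abs(points[current] - query):
            current = best
        else:
            return current


def layered_search(query: float, entry: str = "A") -> str:
    current = entry
    for neighbors in layers:  # top layer first, then drill down
        current = greedy_search(query, current, neighbors)
    return current


result = layered_search(6.8)
print(result)
```

For the query 6.8, the walker reaches C on the sparse layer, D on the middle layer, and finally d (position 7) on the dense layer, visiting far fewer nodes than a scan of all twelve.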
Recall vs. Speed Tradeoff
- Recall@k — fraction of true top-k neighbors returned by ANN vs. exact search
- Typical production target: Recall > 0.95 while achieving < 10ms latency
- Tuning HNSW: ef_construction (build quality) and ef (search quality)
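Recall@k is simple to compute once the exact answer is known, for instance from a brute-force run on a sample of queries. A minimal sketch, with made-up result ids and a hypothetical `recall_at_k` helper:

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int], k: int) -> float:
    """Fraction of the true top-k neighbours that the ANN search also returned."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / len(truth)


exact = [7, 2, 9, 4, 1]    # ids from exact kNN (illustrative)
approx = [7, 9, 2, 5, 1]   # ids from an ANN index (illustrative)

recall = recall_at_k(approx, exact, k=5)
print(recall)
```

Order inside the top-k does not matter here; only membership counts, so the swapped positions of ids 2 and 9 cost nothing, while the missed id 4 lowers recall to 0.8.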
Vector Databases in LLM Applications
What Is a Vector Database?
A vector database is purpose-built for storing, indexing, and querying embeddings at scale. Beyond raw ANN, vector databases add:
- CRUD operations (add, update, delete objects with metadata)
- Metadata filtering (pre/post-filter by structured fields)
- Multi-tenancy and access control
- Persistence (unlike in-memory FAISS)
- Horizontal scaling
Popular Vector Databases
| Database | Notes |
|---|---|
| Weaviate | Open source; schema-based; strong GraphQL API; built-in ML modules |
| Pinecone | Managed; simple API; popular for production |
| Chroma | Lightweight; great for local dev and notebooks |
| Qdrant | Open source; Rust-based; flexible filtering |
| pgvector | Postgres extension — adds vector search to existing Postgres |
| Milvus | Open source; designed for billion-scale datasets |
RAG with a Vector Database
The retrieval step in a RAG pipeline:
User Query
│
▼
Embed query → query vector
│
▼
Vector DB similarity search → top-k relevant chunks
│
▼
Inject chunks into LLM prompt as context
│
▼
LLM generates grounded answer
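The flow above can be sketched end to end up to the LLM call. In this toy version a bag-of-words count stands in for a real embedding model and a Python list stands in for the vector database; `embed`, `retrieve`, and `build_prompt` are hypothetical helpers, and the prompt wording is just an example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Similarity search: rank chunks by cosine similarity to the query."""
    query_vec = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


chunks = [
    "HNSW is a graph-based index for approximate nearest neighbor search.",
    "Bananas are rich in potassium.",
    "A vector database stores embeddings and supports similarity search.",
]
query = "What does a vector database store?"
top = retrieve(query, chunks, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting string is what would be sent to the LLM; the unrelated chunk about bananas never reaches the prompt.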
Weaviate Quickstart (Python)
import weaviate
# Uses the weaviate-client v3 API (weaviate.Client); the v4 client has a different interface
client = weaviate.Client("http://localhost:8080")
# Define schema
class_obj = {
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(class_obj)
# Add data
client.data_object.create({"content": "Vector databases enable semantic search."}, "Article")
# Semantic search
result = (
    client.query
    .get("Article", ["content"])
    .with_near_text({"concepts": ["similarity search"]})
    .with_limit(3)
    .do()
)
Sparse vs Dense Vectors & Hybrid Search
Dense vs Sparse Embeddings
| | Dense | Sparse |
|---|---|---|
| Representation | All dimensions have values (e.g., 768 floats) | Most dimensions are 0 (e.g., BM25 term weights) |
| Captures | Semantic meaning | Exact keyword matches |
| Model type | Neural (sentence transformers) | TF-IDF, BM25 |
| Strength | Handles synonyms, paraphrases | High precision for exact terms |
| Weakness | May miss rare keywords | Misses semantic meaning |
Hybrid Search
Combine dense + sparse retrieval to get the best of both worlds:
Dense score (semantic) + α × Sparse score (keyword) → final ranking
Reciprocal Rank Fusion (RRF) is a common fusion method — ranks results from both systems and merges them without needing to tune a weight $\alpha$.
When to use hybrid: When queries may contain specific product names, codes, or jargon that dense-only search would miss, but you still want semantic generalization.
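Reciprocal Rank Fusion is compact enough to show in full: each document earns 1 / (k + rank) from every ranking it appears in, and the sums decide the final order. In the sketch below the constant k = 60 is a commonly used default rather than something the notes specify, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists using rank positions only, not raw scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["d1", "d2", "d3"]    # from vector search (illustrative)
sparse_ranking = ["d3", "d1", "d4"]   # from keyword search (illustrative)

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Because only ranks are used, dense and sparse scores never have to be put on a common scale; documents that appear in both lists (d1, d3) rise above those found by only one retriever.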
Application: Multilingual Search
Neural embedding models like paraphrase-multilingual-MiniLM-L12-v2 embed text from 50+ languages into the same vector space — enabling cross-lingual search:
- Query in English → retrieve French, Spanish, German documents
- No translation step needed
- Works because the model is trained on parallel multilingual corpora
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
en_query = model.encode("What is a vector database?")
es_doc = model.encode("Una base de datos vectorial almacena embeddings.")
# These will be semantically close despite different languages;
# cosine similarity puts a number on how close
similarity = util.cos_sim(en_query, es_doc)
print(similarity)
Key Takeaways
| Concept | Key Idea |
|---|---|
| Embedding | Fixed-length vector encoding semantic meaning; similar items are nearby in vector space |
| Cosine Similarity | Most common metric for text; measures angle, not magnitude |
| Exact kNN | 100% accurate but $O(n \cdot d)$ per query, which is impractical at scale |
| ANN (HNSW) | Near-exact results in milliseconds; tunable recall/speed tradeoff |
| Vector Database | Manages embeddings at scale: indexing, CRUD, filtering, persistence |
| Hybrid Search | Dense + sparse retrieval; handles semantic meaning AND exact keywords |
| Multilingual | Single multilingual model enables cross-language search without translation |
Course: Vector Databases: from Embeddings to Applications — DeepLearning.AI