RAG & Retrieval
Retrieval-Augmented Generation — embedding, chunking, indexing, and hybrid search.
Advanced RAG Techniques
Course Notes

Building and Evaluating Advanced RAG

Notes from the Building and Evaluating Advanced RAG course on DeepLearning.AI — Jerry Liu (LlamaIndex) & Anupam Datta (TruEra).


Table of Contents

  1. Advanced RAG Pipeline Overview
  2. RAG Triad of Metrics
  3. Sentence-Window Retrieval
  4. Auto-Merging Retrieval
  5. Evaluation with TruLens
  6. Summary & Choosing the Right Technique

Advanced RAG Pipeline Overview

Why Naive RAG Falls Short

A naive RAG pipeline — embed documents in fixed-size chunks, retrieve top-k, stuff into prompt — works for simple cases but breaks down on:

| Problem | Symptom |
| --- | --- |
| Chunks too small | Retrieved chunk lacks enough context; LLM gives incomplete answers |
| Chunks too large | Noisy retrieval; relevant detail diluted by surrounding text |
| Embedding granularity mismatch | You retrieve at chunk level but need surrounding paragraphs |
| No evaluation | No way to know if retrieval or generation is failing |
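
To make the chunk-size trade-off concrete, here is a minimal sketch of naive fixed-size chunking with overlap. It is an illustration only: it counts whitespace-separated words rather than model tokens, and the names `chunk_words`, `chunk_size`, and `overlap` are assumptions, not part of any library.

```python
def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into fixed-size word chunks that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1):
        print(chunk)
```

A small `chunk_size` gives precise but context-poor chunks; a large one gives the opposite, which is the tension the techniques in these notes try to resolve.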

Advanced RAG Landscape

Query
  │
  ▼
[Pre-Retrieval] — query rewriting, HyDE, query expansion
  │
  ▼
[Retrieval] — sentence-window, auto-merging, hybrid search
  │
  ▼
[Post-Retrieval] — re-ranking, context compression
  │
  ▼
[Generation] — LLM with retrieved context
  │
  ▼
[Evaluation] — RAG triad (TruLens / RAGAS)

LlamaIndex Building Blocks

| Component | Role |
| --- | --- |
| SimpleDirectoryReader | Load documents from disk |
| SentenceSplitter | Chunk documents with overlap |
| VectorStoreIndex | Build vector index over chunks |
| RetrieverQueryEngine | Orchestrate retrieval + generation |
| NodePostprocessor | Post-process retrieved nodes (rerank, filter) |

RAG Triad of Metrics

The RAG Triad (from TruEra) defines three independent axes of quality — all three must be high for a trustworthy RAG system.

          Answer Relevance
               ↑
               │
  Context  ────┼──── Groundedness
  Relevance    │
               │
         (query, context, response)

1. Context Relevance

“Is the retrieved context actually relevant to the query?”

  • Measures whether the retriever is surfacing useful chunks
  • Failure mode: retriever returns plausible-sounding but off-topic chunks
  • Score: fraction of retrieved context sentences relevant to the query

2. Groundedness

“Is the LLM’s answer supported by the retrieved context?”

  • Also called faithfulness — measures hallucination
  • Failure mode: LLM generates confident statements not found in context
  • Score: fraction of claims in the response that can be traced back to context

3. Answer Relevance

“Does the answer actually address what the user asked?”

  • Measures end-to-end response quality relative to the original query
  • Failure mode: technically grounded response that doesn’t answer the question
  • Score: semantic similarity between response and query intent

Why All Three Together?

| Scenario | Context Rel. | Groundedness | Answer Rel. | Problem |
| --- | --- | --- | --- | --- |
| Hallucination | ✅ High | ❌ Low | ❌ Low | LLM ignores context |
| Wrong retrieval | ❌ Low | ✅ High | ❌ Low | Faithful to wrong docs |
| Off-topic answer | ✅ High | ✅ High | ❌ Low | Doesn't answer the question |
| Good RAG | ✅ High | ✅ High | ✅ High | None |
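
The scenarios above can be read as a simple decision procedure. The sketch below is one hypothetical way to encode it; the `diagnose` function and the 0.7 threshold are assumptions for illustration, not something TruLens provides.

```python
def diagnose(context_rel: float, groundedness: float, answer_rel: float,
             threshold: float = 0.7) -> str:
    """Name the weakest link given three triad scores in [0, 1].

    Checks run in pipeline order: a retrieval failure is reported first
    because it usually drags the downstream scores with it.
    """
    if context_rel < threshold:
        return "retrieval: retrieved context is not relevant to the query"
    if groundedness < threshold:
        return "generation: answer is not supported by the context"
    if answer_rel < threshold:
        return "generation: answer is grounded but does not address the question"
    return "ok: all three scores are above the threshold"


if __name__ == "__main__":
    print(diagnose(0.9, 0.3, 0.4))  # the hallucination row
    print(diagnose(0.3, 0.9, 0.2))  # the wrong-retrieval row
```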

Sentence-Window Retrieval

The Problem

Small chunks embed well (each chunk is semantically coherent), but retrieved chunks often lack enough surrounding context for the LLM to synthesize a good answer.

The Technique

Embed small, retrieve big.

  1. Split documents into small sentence-level nodes (1–3 sentences each)
  2. Each sentence node stores its surrounding window (e.g., ±2 sentences on each side) as metadata
  3. At index time: embed only the small sentence, so matching against the query stays precise
  4. At query time: retrieve the best-matching sentences, then replace each one with its full window before sending it to the LLM

Document: ... [S1] [S2] [S3] [S4] [S5] [S6] [S7] ...
                          ↑
                    Retrieved (S4)

Window sent to LLM: [S2] [S3] [S4] [S5] [S6]
                    (window_size=2 on each side)
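
The windowing steps above can be sketched without any library. This is a simplified illustration, assuming a naive regex sentence splitter; `build_sentence_windows` and the dictionary keys are hypothetical names chosen to echo the diagram, not the LlamaIndex internals.

```python
import re


def build_sentence_windows(text: str, window_size: int = 2) -> list[dict]:
    """Return one node per sentence, each carrying its surrounding window."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)                   # clamp at document start
        hi = min(len(sentences), i + window_size + 1)  # clamp at document end
        nodes.append({
            "text": sentence,                          # what gets embedded
            "window": " ".join(sentences[lo:hi]),      # what the LLM sees
        })
    return nodes


if __name__ == "__main__":
    nodes = build_sentence_windows("S1. S2. S3. S4. S5. S6. S7.", window_size=2)
    print(nodes[3]["text"])
    print(nodes[3]["window"])
```

Only `text` would be embedded; after retrieval, the node's `window` replaces it in the prompt.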

LlamaIndex Implementation

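from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Load documents ("./data" is a placeholder path)
documents = SimpleDirectoryReader("./data").load_data()

# Build nodes with sentence windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents)

# Index the sentence nodes; only the single sentence is embedded
index = VectorStoreIndex(nodes)

# At query time: replace the matched sentence with its window
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[postproc],
)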

When to Use

  • Documents with dense, information-rich prose
  • When individual sentences are semantically meaningful on their own
  • When you want tight retrieval precision but broader LLM context

Auto-Merging Retrieval

The Problem

Chunking is arbitrary — a topic may span multiple adjacent chunks. If multiple chunks from the same parent section are retrieved, it’s better to send the full parent than redundant fragments.

The Technique

Build a hierarchy; merge up when children dominate.

  1. Parse documents into a hierarchical tree — large parent chunks containing smaller leaf chunks
  2. Index and embed only the leaf nodes (small chunks → precise retrieval)
  3. At query time: if enough leaf children of a parent are retrieved, merge them into the parent node

Document
  └── Chapter (parent, ~512 tokens)
        ├── Paragraph 1 (leaf, ~128 tokens)  ← retrieved
        ├── Paragraph 2 (leaf, ~128 tokens)  ← retrieved
        ├── Paragraph 3 (leaf, ~128 tokens)  ← retrieved
        └── Paragraph 4 (leaf, ~128 tokens)

If 3/4 leaves retrieved → merge into Chapter node
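
The merge rule can be sketched in plain Python. This is a single-level simplification under stated assumptions: `auto_merge`, `parent_of`, `children_of`, and `ratio_thresh` are hypothetical names, and a real retriever would repeat the step up the hierarchy.

```python
from collections import defaultdict


def auto_merge(retrieved_leaves: list[str],
               parent_of: dict[str, str],
               children_of: dict[str, list[str]],
               ratio_thresh: float = 0.5) -> list[str]:
    """Replace retrieved leaves with their parent when enough siblings were hit."""
    # Group the retrieved leaves by parent.
    hits = defaultdict(set)
    for leaf in retrieved_leaves:
        hits[parent_of[leaf]].add(leaf)

    result = []
    for leaf in retrieved_leaves:
        parent = parent_of[leaf]
        ratio = len(hits[parent]) / len(children_of[parent])
        node = parent if ratio > ratio_thresh else leaf
        if node not in result:  # keep retrieval order, drop duplicates
            result.append(node)
    return result


if __name__ == "__main__":
    children_of = {"chapter": ["p1", "p2", "p3", "p4"],
                   "appendix": ["q1", "q2", "q3", "q4"]}
    parent_of = {c: p for p, cs in children_of.items() for c in cs}
    print(auto_merge(["p1", "q1", "p2", "p3"], parent_of, children_of))
```

In the example, three of the four `chapter` leaves are retrieved (ratio 0.75), so they collapse into `chapter`, while the lone `appendix` leaf stays as it is.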

LlamaIndex Implementation

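from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# `documents` is assumed to be a list of loaded Document objects,
# e.g. from SimpleDirectoryReader(...).load_data()

# Build hierarchy: 2048 → 512 → 128 tokens
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)

nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Store all nodes (parents + leaves) so parents can be looked up at merge time
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)

# Embed and index only the leaf nodes
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# Retrieve leaves, auto-merge to parents
base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context,
    verbose=True,
    simple_ratio_thresh=0.4,  # merge when more than 40% of a parent's children are retrieved
)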

When to Use

  • Long documents with nested structure (chapters, sections, paragraphs)
  • When queries often span a sub-topic that exists within one section
  • To reduce redundancy when multiple overlapping chunks get retrieved

Evaluation with TruLens

Why Automated Evaluation Matters

Manual evaluation doesn’t scale. TruLens provides automatic, LLM-assisted evaluation using the RAG Triad — letting you compare pipeline configurations systematically.

Setup

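import numpy as np
from trulens_eval import Feedback, Tru, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI

# Written against the legacy trulens_eval package used in the course;
# class and method names differ between releases.
tru = Tru()
provider = fOpenAI()

# Selector for the text of the retrieved source nodes
context = TruLlama.select_source_nodes().node.text

# Define feedback functions and tell each one which parts of the record to read
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

# Wrap your query engine
tru_recorder = TruLlama(
    query_engine,
    app_id="my-rag-v1",
    feedbacks=[f_groundedness, f_context_relevance, f_answer_relevance],
)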

Running Eval

eval_questions = ["What is a vector database?", "How does HNSW work?"]

with tru_recorder as recording:
    for q in eval_questions:
        query_engine.query(q)

tru.get_leaderboard(app_ids=["my-rag-v1"])

Reading the Leaderboard

| Pipeline | Context Rel. | Groundedness | Answer Rel. | Latency |
| --- | --- | --- | --- | --- |
| Naive RAG | 0.62 | 0.71 | 0.65 | 1.2s |
| Sentence Window | 0.79 | 0.83 | 0.80 | 1.5s |
| Auto-Merging | 0.81 | 0.88 | 0.82 | 1.8s |

Compare configurations and iterate — TruLens shows which component is failing, not just overall quality.

TruLens Dashboard

tru.run_dashboard()  # Opens at http://localhost:8501

Provides per-question drill-down: which context chunks were retrieved, which claims were grounded, where the pipeline failed.


Summary & Choosing the Right Technique

Comparison

| Technique | How It Works | Best For |
| --- | --- | --- |
| Naive RAG | Fixed chunks, embed, retrieve | Prototyping, simple corpora |
| Sentence Window | Embed sentences, retrieve windows | Dense prose; high retrieval precision |
| Auto-Merging | Hierarchical chunks; merge if dominant | Long structured docs; reducing redundancy |
| Hybrid Search | Dense + sparse retrieval | Mixed semantic + keyword queries |

Iterative Improvement Loop

1. Define eval questions
2. Baseline: measure RAG Triad on naive pipeline
3. Identify bottleneck (context relevance? groundedness?)
4. Apply targeted technique
5. Re-evaluate; compare leaderboard
6. Repeat

Key Design Decisions

| Decision | Naive Default | Advanced Options |
| --- | --- | --- |
| Chunk size | Fixed (512 tokens) | Sentence-level, hierarchical |
| Retrieval | Dense kNN | Hybrid, sentence window, auto-merge |
| Post-processing | None | Reranking (CohereRerank, ColBERT) |
| Evaluation | Manual spot-check | TruLens / RAGAS automated triad |

Key Takeaways

| Concept | Key Idea |
| --- | --- |
| RAG Triad | Context Relevance + Groundedness + Answer Relevance; all three must be high |
| Sentence Window | Embed small → precise match; retrieve window → richer LLM context |
| Auto-Merging | Embed leaves → fast retrieval; merge to parents → reduces fragmentation |
| TruLens | Automated LLM-based evaluation; per-component failure diagnosis |
| Iterative Eval | Measure → identify bottleneck → apply technique → re-measure |

Course: Building and Evaluating Advanced RAG — DeepLearning.AI

Vector Databases & Embeddings
Course Notes

Vector Databases: from Embeddings to Applications

Notes from the Vector Databases: from Embeddings to Applications course on DeepLearning.AI — Sebastian Witalec (Weaviate).


Table of Contents

  1. Introduction
  2. How to Obtain Vector Representations of Data
  3. Search for Similar Vectors
  4. Approximate Nearest Neighbors
  5. Vector Databases in LLM Applications
  6. Sparse vs Dense Vectors & Hybrid Search
  7. Application: Multilingual Search

Introduction

What is RAG, and why do we need it?

AI models are not trained on recent or proprietary data. Retrieval-augmented generation (RAG) tackles this problem by retrieving relevant documents at query time and passing them to the model as context.

Why vector databases?

Traditional databases store exact values and search with exact matches or range queries. AI applications need something different — the ability to search by meaning, not exact text.

A vector database stores data as high-dimensional numerical vectors (embeddings) and enables fast similarity search over those vectors. This is the foundation of semantic search, RAG pipelines, recommendation systems, and more.


How to Obtain Vector Representations of Data

What Is an Embedding?

An embedding is a fixed-length numerical vector that encodes the semantic meaning of a piece of data (text, image, audio, etc.). Semantically similar items end up close together in vector space.

Encoder-decoder architecture

Embeddings can be learned with an encoder-decoder (autoencoder) architecture. For example, an MNIST image is 28 × 28 pixels, so it has 28 × 28 = 784 input dimensions. The encoder compresses it 784 → 256 → 128 → 2, and the decoder expands it 2 → 128 → 256 → 784 and tries to re-create the image. The first reconstruction will not be good, so the process repeats over many training steps, each time adjusting the model's parameters (the weights) to reduce the difference between the input and the reconstruction. The compressed vector in the middle (here, 2 dimensions) serves as the embedding.

Two similar images or texts end up close together in the embedding space:

"dog" → [0.12, -0.84, 0.23, ..., 0.55]  # 1536-dim vector
"puppy" → [0.13, -0.81, 0.25, ..., 0.52]  # close to "dog"
"car" → [-0.72, 0.44, -0.11, ..., 0.08]  # far from "dog"

Embedding Models

| Model | Dimensions | Notes |
| --- | --- | --- |
| text-embedding-ada-002 (OpenAI) | 1536 | General purpose, widely used |
| all-MiniLM-L6-v2 (Sentence Transformers) | 384 | Fast, open-source |
| multi-qa-mpnet-base-dot-v1 | 768 | Optimized for QA retrieval |
| CLIP (OpenAI) | 512 | Multimodal (text + images) |

How to calculate the distance (similarity) between two embeddings:

The four most popular distance metrics used in the context of vector databases:

  • Euclidean Distance: The length of the shortest path between two points or vectors.
  • Manhattan Distance: Distance between two points if one was constrained to move only along one axis at a time.
  • Dot Product: Measures the magnitude of the projection of one vector onto the other.
  • Cosine: Measures the difference in directionality between vectors.

Dot product and cosine distance are commonly used in NLP to evaluate how similar two sentence embeddings are.
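
The four metrics above are short enough to write out directly. The following is a minimal pure-Python sketch for illustration (the function names are my own); in practice a vector database or NumPy would compute these in bulk.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance: sum of per-axis absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dot(a, b):
    """Dot product: grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [6.0, 8.0]  # same direction, different magnitude
    print(euclidean(a, b), manhattan(a, b), dot(a, b), cosine_similarity(a, b))
```

Note that `a` and `b` point the same way, so their cosine similarity is at its maximum even though their Euclidean distance is not zero.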


Search for Similar Vectors

Distance Metrics

The core of similarity search is a distance function between two vectors:

| Metric | Formula | Best For |
| --- | --- | --- |
| Euclidean (L2) | $\sqrt{\sum_i (a_i - b_i)^2}$ | When magnitude matters |
| Cosine Similarity | $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ | Normalized text embeddings |
| Dot Product | $\sum_i a_i b_i$ | When vectors are pre-normalized |

Cosine similarity is the most common choice for text embeddings — it measures the angle between vectors and ignores magnitude, so “dog” and “puppy” are close regardless of how many times they appear in training data.

Brute-force search, also called exact k-Nearest Neighbor (kNN) search, compares the query vector against every stored vector and returns the $k$ closest.

  • Accuracy: 100% exact
  • Complexity: $O(n \cdot d)$ per query — linear in dataset size $n$ and vector dimension $d$
  • Problem: Infeasible for millions of vectors at low latency
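
The exhaustive kNN search described above fits in a few lines. This is a minimal sketch assuming Euclidean distance and plain Python lists; `knn` is an illustrative name, and the loop over every stored vector is exactly the $O(n \cdot d)$ cost noted in the list.

```python
import heapq
import math


def knn(query, vectors, k=3):
    """Return (index, distance) for the k stored vectors closest to `query`."""
    def dist(v):
        return math.sqrt(sum((q - x) ** 2 for q, x in zip(query, v)))

    # One distance computation per stored vector: n vectors of d dimensions.
    scored = ((dist(v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


if __name__ == "__main__":
    stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.5]]
    print(knn([0.0, 0.0], stored, k=2))
```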

Approximate Nearest Neighbors

ANN algorithms trade a small accuracy loss for dramatic speed gains — instead of finding the exact nearest neighbors, they find very close neighbors in milliseconds.

Key ANN Algorithms

| Algorithm | Approach | Notes |
| --- | --- | --- |
| HNSW (Hierarchical Navigable Small World) | Graph-based | Best accuracy/speed tradeoff; default in most vector DBs |
| IVF (Inverted File Index) | Clustering | Quantizes space into Voronoi cells; scales well |
| PQ (Product Quantization) | Compression | Reduces memory by compressing vectors |
| FAISS | Meta's library combining the above | Open source, widely used |

HNSW Intuition

HNSW builds a layered graph where higher layers have coarser connections and lower layers have fine-grained connections — like zooming in on a map. Search starts at the top (few nodes, big jumps) and drills down to find the actual nearest neighbors.

Layer 2 (sparse):  A ——— C ——— F
Layer 1:           A — B — C — D — E — F
Layer 0 (dense):   A-a-B-b-C-c-D-d-E-e-F-f
                           ↑ query lands here

Recall vs. Speed Tradeoff

  • Recall@k — fraction of true top-k neighbors returned by ANN vs. exact search
  • Typical production target: Recall > 0.95 while achieving < 10ms latency
  • Tuning HNSW: ef_construction (build quality) and ef (search quality)
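
Recall@k is straightforward to measure once you have both result lists for the same query. A minimal sketch, assuming each list holds document ids ordered best-first (`recall_at_k` is an illustrative name):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the ANN search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


if __name__ == "__main__":
    exact = [1, 2, 3, 4]   # ids from exhaustive search
    approx = [1, 2, 9, 4]  # ids from an ANN index; id 3 was missed
    print(recall_at_k(approx, exact, k=4))
```

Averaging this value over a set of test queries gives the recall figure to compare against the production target.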

Vector Databases in LLM Applications

What Is a Vector Database?

A vector database is purpose-built for storing, indexing, and querying embeddings at scale. Beyond raw ANN, vector databases add:

  • CRUD operations (add, update, delete objects with metadata)
  • Metadata filtering (pre/post-filter by structured fields)
  • Multi-tenancy and access control
  • Persistence (unlike in-memory FAISS)
  • Horizontal scaling

| Database | Notes |
| --- | --- |
| Weaviate | Open source; schema-based; strong GraphQL API; built-in ML modules |
| Pinecone | Managed; simple API; popular for production |
| Chroma | Lightweight; great for local dev and notebooks |
| Qdrant | Open source; Rust-based; flexible filtering |
| pgvector | Postgres extension; adds vector search to existing Postgres |
| Milvus | Open source; designed for billion-scale datasets |

RAG with a Vector Database

The retrieval step in a RAG pipeline:

User Query
    │
    ▼
Embed query → query vector
    │
    ▼
Vector DB similarity search → top-k relevant chunks
    │
    ▼
Inject chunks into LLM prompt as context
    │
    ▼
LLM generates grounded answer
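
The flow above can be walked through end to end with a toy stand-in for each component. Everything here is an assumption made for illustration: `embed` uses word counts instead of a neural model, `retrieve` is an exhaustive scan instead of a vector database, and the prompt wording is arbitrary.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Stand-in for the vector DB: rank every chunk by similarity to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Inject the top-k chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


if __name__ == "__main__":
    corpus = [
        "vector databases store embeddings",
        "cats sleep all day",
        "hnsw is a graph index for vector search",
    ]
    print(build_prompt("how do vector databases work", corpus))
```

The resulting string is what would be sent to the LLM in the final step.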

Weaviate Quickstart (Python)

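import os

import weaviate

# Uses the v3 Python client API (weaviate.Client); the v4 client has a different interface.
# The text2vec-openai vectorizer needs an OpenAI key, passed here as a request header
# (reading it from the OPENAI_API_KEY environment variable is an assumption).
client = weaviate.Client(
    "http://localhost:8080",
    additional_headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

# Define schema
class_obj = {
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(class_obj)

# Add data (Weaviate vectorizes the object on insert)
client.data_object.create({"content": "Vector databases enable semantic search."}, "Article")

# Semantic search: return the 3 articles closest in meaning to the concept
result = (
    client.query
    .get("Article", ["content"])
    .with_near_text({"concepts": ["similarity search"]})
    .with_limit(3)
    .do()
)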

Sparse vs Dense Vectors & Hybrid Search

Dense vs Sparse Embeddings

| Aspect | Dense | Sparse |
| --- | --- | --- |
| Representation | All dimensions have values (e.g., 768 floats) | Most dimensions are 0 (e.g., BM25 term weights) |
| Captures | Semantic meaning | Exact keyword matches |
| Model type | Neural (sentence transformers) | TF-IDF, BM25 |
| Strength | Handles synonyms, paraphrases | High precision for exact terms |
| Weakness | May miss rare keywords | Misses semantic meaning |

Hybrid search combines dense and sparse retrieval to get the best of both worlds:

Dense score (semantic) + α × Sparse score (keyword) → final ranking

Reciprocal Rank Fusion (RRF) is a common fusion method — ranks results from both systems and merges them without needing to tune a weight $\alpha$.
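
RRF can be sketched in a few lines. Each document scores $\sum 1/(k + \text{rank})$ over the lists it appears in; the constant `k = 60` is the value commonly used in the literature, while the function name and the example ids are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one, using ranks only (no raw scores)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # ranking from vector search
    sparse = ["b", "d", "a"]  # ranking from keyword (BM25) search
    print(reciprocal_rank_fusion([dense, sparse]))
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is why no weight has to be tuned.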

When to use hybrid: When queries may contain specific product names, codes, or jargon that dense-only search would miss, but you still want semantic generalization.


Application: Multilingual Search

Neural embedding models like paraphrase-multilingual-MiniLM-L12-v2 embed text from 50+ languages into the same vector space, which enables cross-lingual search:

  • Query in English → retrieve French, Spanish, German documents
  • No translation step needed
  • Works because the model is trained on parallel multilingual corpora

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

en_query = model.encode("What is a vector database?")
es_doc = model.encode("Una base de datos vectorial almacena embeddings.")

# These will be semantically close despite different languages;
# a cosine similarity near 1 means the two texts are close in meaning
similarity = util.cos_sim(en_query, es_doc).item()
print(similarity)

Key Takeaways

| Concept | Key Idea |
| --- | --- |
| Embedding | Fixed-length vector encoding semantic meaning; similar items are nearby in vector space |
| Cosine Similarity | Most common metric for text; measures angle, not magnitude |
| Exact kNN | 100% accurate but $O(n \cdot d)$ per query; impractical at scale |
| ANN (HNSW) | Near-exact results in milliseconds; tunable recall/speed tradeoff |
| Vector Database | Manages embeddings at scale: indexing, CRUD, filtering, persistence |
| Hybrid Search | Dense + sparse retrieval; handles semantic meaning AND exact keywords |
| Multilingual | Single multilingual model enables cross-language search without translation |

Course: Vector Databases: from Embeddings to Applications — DeepLearning.AI