RAG & Retrieval · Mansi Sheth

Advanced RAG Techniques

Course Notes

Building and Evaluating Advanced RAG

Notes from the Building and Evaluating Advanced RAG course on DeepLearning.AI — Jerry Liu (LlamaIndex) & Anupam Datta (TruEra).

Advanced RAG Pipeline Overview
RAG Triad of Metrics
Sentence-Window Retrieval
Auto-Merging Retrieval
Evaluation with TruLens
Summary & Choosing the Right Technique

Advanced RAG Pipeline Overview

Why Naive RAG Falls Short

A naive RAG pipeline — embed documents in fixed-size chunks, retrieve top-k, stuff into prompt — works for simple cases but breaks down on:

Problem	Symptom
Chunks too small	Retrieved chunk lacks enough context; LLM gives incomplete answers
Chunks too large	Noisy retrieval; relevant detail diluted by surrounding text
Embedding granularity mismatch	You retrieve at chunk level but need surrounding paragraphs
No evaluation	No way to know if retrieval or generation is failing
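To make the chunk-size problem concrete, here is a minimal sketch of the fixed-size chunking step in a naive pipeline. The function name `fixed_size_chunks` and the word-based sizing are illustrative assumptions (real pipelines usually count tokens); the point is only to show how a fixed boundary can cut a sentence away from the subject it refers to.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` words; neighbors share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "HNSW is a graph-based index. It trades a little recall for a large speedup."
chunks = fixed_size_chunks(text, chunk_size=5)
for chunk in chunks:
    print(repr(chunk))
```

The second chunk ("It trades a little recall") no longer says what "It" is, which is the "chunks too small" symptom from the table. Adding overlap softens the cut but does not remove it.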

Advanced RAG Landscape

Query
  │
  ▼
[Pre-Retrieval] — query rewriting, HyDE, query expansion
  │
  ▼
[Retrieval] — sentence-window, auto-merging, hybrid search
  │
  ▼
[Post-Retrieval] — re-ranking, context compression
  │
  ▼
[Generation] — LLM with retrieved context
  │
  ▼
[Evaluation] — RAG triad (TruLens / RAGAS)

LlamaIndex Building Blocks

Component	Role
`SimpleDirectoryReader`	Load documents from disk
`SentenceSplitter`	Chunk documents with overlap
`VectorStoreIndex`	Build vector index over chunks
`RetrieverQueryEngine`	Orchestrate retrieval + generation
`NodePostprocessor`	Post-process retrieved nodes (rerank, filter)

RAG Triad of Metrics

The RAG Triad (from TruEra) defines three independent axes of quality — all three must be high for a trustworthy RAG system.

          Answer Relevance
               ↑
               │
  Context  ────┼──── Groundedness
  Relevance    │
               │
         (query, context, response)

1. Context Relevance

“Is the retrieved context actually relevant to the query?”

Measures whether the retriever is surfacing useful chunks
Failure mode: retriever returns plausible-sounding but off-topic chunks
Score: fraction of retrieved context sentences relevant to the query

2. Groundedness

“Is the LLM’s answer supported by the retrieved context?”

Also called faithfulness — measures hallucination
Failure mode: LLM generates confident statements not found in context
Score: fraction of claims in the response that can be traced back to context

3. Answer Relevance

“Does the answer actually address what the user asked?”

Measures end-to-end response quality relative to the original query
Failure mode: technically grounded response that doesn’t answer the question
Score: semantic similarity between response and query intent

Why All Three Together?

Scenario	Context Rel.	Groundedness	Answer Rel.	Problem
Hallucination	✅ High	❌ Low	❌ Low	LLM ignores context
Wrong retrieval	❌ Low	✅ High	❌ Low	Faithful to wrong docs
Off-topic answer	✅ High	✅ High	❌ Low	Doesn’t answer the question
Good RAG	✅ High	✅ High	✅ High	—
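The table can be read as a small decision procedure. The sketch below is an illustration, not TruLens code: `fraction_true`, `diagnose`, the 0.7 threshold, and the example judgments are all assumptions made for this example. It scores two of the axes as fractions (as described above) and maps the three scores onto the scenarios in the table.

```python
def fraction_true(judgments: list[bool]) -> float:
    """Share of items judged True, e.g. relevant context sentences or supported claims."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def diagnose(context_rel: float, groundedness: float, answer_rel: float,
             threshold: float = 0.7) -> str:
    """Map triad scores to a scenario; the threshold is an arbitrary assumption."""
    if context_rel < threshold:
        return "wrong retrieval"      # the retriever is the first thing to fix
    if groundedness < threshold:
        return "hallucination"        # context was fine, the answer ignored it
    if answer_rel < threshold:
        return "off-topic answer"     # grounded, but not what was asked
    return "good RAG"


# Hypothetical per-sentence and per-claim judgments for one query
context_rel = fraction_true([True, True, True, False])   # 3 of 4 context sentences relevant
groundedness = fraction_true([True, False, False])        # 1 of 3 claims supported
print(diagnose(context_rel, groundedness, answer_rel=0.4))
```

Retrieval is checked first because a low context score makes the two downstream scores hard to interpret.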

Sentence-Window Retrieval

The Problem

Small chunks embed well (each chunk is semantically coherent), but retrieved chunks often lack enough surrounding context for the LLM to synthesize a good answer.

The Technique

Embed small, retrieve big.

Split documents into small sentence-level nodes (typically a single sentence each)
Each sentence node stores its surrounding window in metadata (e.g., ±2 sentences on each side)
At index time: embed only the small sentence, so retrieval matches precisely
At query time: after retrieval, replace each matched sentence with its full window before sending it to the LLM

Document: ... [S1] [S2] [S3] [S4] [S5] [S6] [S7] ...
                          ↑
                    Retrieved (S4)

Window sent to LLM: [S2] [S3] [S4] [S5] [S6]
                    (window_size=2 on each side)
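The diagram above can be reproduced in a few lines of plain Python. This is a sketch of the idea only, not the LlamaIndex implementation; the function name `build_sentence_windows` and the dictionary layout are assumptions chosen for illustration.

```python
def build_sentence_windows(sentences: list[str], window_size: int) -> list[dict]:
    """Create one node per sentence and attach its surrounding window as metadata."""
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)                   # clamp at document start
        end = min(len(sentences), i + window_size + 1)    # clamp at document end
        nodes.append({
            "text": sentence,                             # what gets embedded
            "window": " ".join(sentences[start:end]),     # what the LLM sees
        })
    return nodes


sentences = [f"S{i}" for i in range(1, 8)]   # S1 ... S7
nodes = build_sentence_windows(sentences, window_size=2)

retrieved = nodes[3]                         # pretend the retriever matched S4
print(retrieved["text"], "->", retrieved["window"])
```

Windows are clamped at the document edges, so the first and last sentences get shorter windows.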

LlamaIndex Implementation

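from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Build nodes with sentence windows
# window_size=3 keeps 3 sentences on each side of the matched sentence
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# `documents` is assumed to be loaded already (e.g., via SimpleDirectoryReader)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time: replace the matched sentence with its window
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[postproc],
)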

When to Use

Documents with dense, information-rich prose
When individual sentences are semantically meaningful on their own
When you want tight retrieval precision but broader LLM context

Auto-Merging Retrieval

The Problem

Chunking is arbitrary — a topic may span multiple adjacent chunks. If multiple chunks from the same parent section are retrieved, it’s better to send the full parent than redundant fragments.

The Technique

Build a hierarchy; merge up when children dominate.

Parse documents into a hierarchical tree — large parent chunks containing smaller leaf chunks
Index and embed only the leaf nodes (small chunks → precise retrieval)
At query time: if enough leaf children of a parent are retrieved, merge them into the parent node

Document
  └── Chapter (parent, ~512 tokens)
        ├── Paragraph 1 (leaf, ~128 tokens)  ← retrieved
        ├── Paragraph 2 (leaf, ~128 tokens)  ← retrieved
        ├── Paragraph 3 (leaf, ~128 tokens)  ← retrieved
        └── Paragraph 4 (leaf, ~128 tokens)

If 3/4 leaves retrieved → merge into Chapter node
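The merge rule can be sketched in plain Python. This is a simplified, single-level illustration and not the LlamaIndex retriever itself; `auto_merge`, the `parent_of` / `children_of` mappings, and the node names are hypothetical. A real retriever would repeat the step up the hierarchy.

```python
from collections import defaultdict


def auto_merge(retrieved_leaves: list[str],
               parent_of: dict[str, str],
               children_of: dict[str, list[str]],
               ratio_thresh: float = 0.5) -> list[str]:
    """Replace leaves by their parent when enough siblings were retrieved together."""
    hits = defaultdict(list)
    for leaf in retrieved_leaves:
        hits[parent_of[leaf]].append(leaf)

    merged = []
    for parent, leaves in hits.items():
        ratio = len(leaves) / len(children_of[parent])
        if ratio > ratio_thresh:
            merged.append(parent)      # children dominate: send the whole parent
        else:
            merged.extend(leaves)      # too few siblings: keep the leaves as-is
    return merged


children_of = {
    "chapter": ["p1", "p2", "p3", "p4"],
    "appendix": ["a1", "a2", "a3"],
}
parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}

result = auto_merge(["p1", "p2", "a1", "p3"], parent_of, children_of)
print(result)
```

Here 3 of the chapter's 4 leaves were retrieved (ratio 0.75), so they collapse into the chapter node, while the single appendix leaf stays a leaf.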

LlamaIndex Implementation

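from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore

# Build hierarchy: 2048 → 512 → 128 tokens
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)

# `documents` is assumed to be loaded already (e.g., via SimpleDirectoryReader)
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Store all nodes (parents + leaves) so parents can be looked up at merge time
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)

# Index and embed only the leaf nodes
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# Retrieve leaves, auto-merge to parents
base_retriever = index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context,
    verbose=True,
    simple_ratio_thresh=0.4,  # merge if more than 40% of a parent's children are retrieved
)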

When to Use

Long documents with nested structure (chapters, sections, paragraphs)
When queries often span a sub-topic that exists within one section
To reduce redundancy when multiple overlapping chunks get retrieved

Evaluation with TruLens

Why Automated Evaluation Matters

Manual evaluation doesn’t scale. TruLens provides automatic, LLM-assisted evaluation using the RAG Triad — letting you compare pipeline configurations systematically.

Setup

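import numpy as np
from trulens_eval import Feedback, Tru, TruLlama
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI

# Note: these names match older trulens_eval releases; newer versions have moved them
tru = Tru()
provider = fOpenAI()

# Retrieved context = text of the source nodes returned by the query engine
context = TruLlama.select_source_nodes().node.text

# Define feedback functions and tell each one which inputs to score
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(context)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
f_answer_relevance = Feedback(provider.relevance).on_input_output()

# Wrap your query engine
tru_recorder = TruLlama(
    query_engine,
    app_id="my-rag-v1",
    feedbacks=[f_groundedness, f_context_relevance, f_answer_relevance],
)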

Running Eval

eval_questions = ["What is a vector database?", "How does HNSW work?"]

with tru_recorder as recording:
    for q in eval_questions:
        query_engine.query(q)

tru.get_leaderboard(app_ids=["my-rag-v1"])

Reading the Leaderboard

Pipeline	Context Rel.	Groundedness	Answer Rel.	Latency
Naive RAG	0.62	0.71	0.65	1.2s
Sentence Window	0.79	0.83	0.80	1.5s
Auto-Merging	0.81	0.88	0.82	1.8s

Compare configurations and iterate — TruLens shows which component is failing, not just overall quality.

TruLens Dashboard

tru.run_dashboard()  # Opens at http://localhost:8501

Provides per-question drill-down: which context chunks were retrieved, which claims were grounded, where the pipeline failed.

Summary & Choosing the Right Technique

Comparison

Technique	How It Works	Best For
Naive RAG	Fixed chunks, embed, retrieve	Prototyping, simple corpora
Sentence Window	Embed sentences, retrieve windows	Dense prose; high retrieval precision
Auto-Merging	Hierarchical chunks; merge if dominant	Long structured docs; reducing redundancy
Hybrid Search	Dense + sparse retrieval	Mixed semantic + keyword queries

Iterative Improvement Loop

Define eval questions
Baseline: measure RAG Triad on naive pipeline
Identify bottleneck (context relevance? groundedness?)
Apply targeted technique
Re-evaluate; compare leaderboard
Repeat

Key Design Decisions

Decision	Naive Default	Advanced Options
Chunk size	Fixed (512 tokens)	Sentence-level, hierarchical
Retrieval	Dense kNN	Hybrid, sentence window, auto-merge
Post-processing	None	Reranking (CohereRerank, ColBERT)
Evaluation	Manual spot-check	TruLens / RAGAS automated triad

Key Takeaways

Concept	Key Idea
RAG Triad	Context Relevance + Groundedness + Answer Relevance — all three must be high
Sentence Window	Embed small → precise match; retrieve window → richer LLM context
Auto-Merging	Embed leaves → fast retrieval; merge to parents → reduces fragmentation
TruLens	Automated LLM-based evaluation; per-component failure diagnosis
Iterative Eval	Measure → identify bottleneck → apply technique → re-measure

Course: Building and Evaluating Advanced RAG — DeepLearning.AI

Vector Databases & Embeddings

Course Notes

Vector Databases: from Embeddings to Applications

Notes from the Vector Databases: from Embeddings to Applications course on DeepLearning.AI — Sebastian Witalec (Weaviate).

Introduction
How to Obtain Vector Representations of Data
Search for Similar Vectors
Approximate Nearest Neighbors
Vector Databases in LLM Applications
Sparse vs Dense Vectors & Hybrid Search
Application: Multilingual Search

Introduction

What is RAG, and why do we need it?

AI models aren’t trained on recent or proprietary data. Retrieval-augmented generation (RAG) tackles this problem by retrieving relevant data at query time and passing it to the model as context.

Why vector databases?

Traditional databases store exact values and search with exact matches or range queries. AI applications need something different — the ability to search by meaning, not exact text.

A vector database stores data as high-dimensional numerical vectors (embeddings) and enables fast similarity search over those vectors. This is the foundation of semantic search, RAG pipelines, recommendation systems, and more.

How to Obtain Vector Representations of Data

What Is an Embedding?

An embedding is a fixed-length numerical vector that encodes the semantic meaning of a piece of data (text, image, audio, etc.). Semantically similar items end up close together in vector space. Vector embedding captures the underlying meaning of the data.

Encoder-decoder architecture

Embedding models are built on an encoder-decoder architecture. For example, an MNIST image is 28 × 28 pixels, which gives 28 * 28 = 784 dimensions. The encoder compresses 784 -> 256 -> 128 -> 2, and the decoder expands 2 -> 128 -> 256 -> 784 to try to re-create the image. The reconstruction won’t be perfect on the first pass, so the process repeats, and each round adjusts the model’s weights until the output closely matches the input. The compressed vector in the middle is the embedding.

Two similar images or pieces of text end up close together:

"dog" → [0.12, -0.84, 0.23, ..., 0.55]  # 1536-dim vector
"puppy" → [0.13, -0.81, 0.25, ..., 0.52]  # close to "dog"
"car" → [-0.72, 0.44, -0.11, ..., 0.08]  # far from "dog"

Embedding Models

Model	Dimensions	Notes
`text-embedding-ada-002` (OpenAI)	1536	General purpose, widely used
`all-MiniLM-L6-v2` (Sentence Transformers)	384	Fast, open-source
`multi-qa-mpnet-base-dot-v1`	768	Optimized for QA retrieval
`CLIP` (OpenAI)	512	Multimodal — text + images

How to calculate distance (similarity) between two embeddings

The four most popular distance metrics used with vector databases:

Euclidean Distance: The length of the shortest path between two points or vectors.
Manhattan Distance: The distance between two points when movement is constrained to one axis at a time.
Dot Product: Measures the projection of one vector onto the other, scaled by the other vector’s magnitude.
Cosine: Measures the difference in directionality between vectors.

Dot product and cosine distance are commonly used in NLP to evaluate how similar two sentence embeddings are.
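The four metrics are short enough to write out directly. The sketch below uses only the standard library and two made-up 2-dimensional vectors (real embeddings have hundreds of dimensions); the function names are chosen for this example.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: list[float], b: list[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by both lengths, so only direction matters
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0]
b = [4.0, 6.0]

print(euclidean(a, b))          # 5.0
print(manhattan(a, b))          # 7.0
print(dot(a, b))                # 16.0
print(cosine_similarity(a, b))  # close to 1: nearly the same direction
```

Note that Euclidean and Manhattan are distances (smaller means more similar), while dot product and cosine similarity are similarities (larger means more similar).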

Search for Similar Vectors

Distance Metrics

The core of similarity search is a distance function between two vectors:

Metric	Formula	Best For
Euclidean (L2)	$\sqrt{\sum (a_i - b_i)^2}$	When magnitude matters
Cosine Similarity	$\frac{a \cdot b}{\|a\| \|b\|}$	Normalized text embeddings
Dot Product	$\sum a_i b_i$	When vectors are pre-normalized

Cosine similarity is the most common choice for text embeddings: it measures the angle between vectors and ignores magnitude, so “dog” and “puppy” score as similar whenever their vectors point in nearly the same direction, whatever their lengths.

Brute-Force (Exact) Search

Also called k-Nearest Neighbor (kNN) — compare query vector against every stored vector and return the $k$ closest.

Accuracy: 100% exact
Complexity: $O(n \cdot d)$ per query — linear in dataset size $n$ and vector dimension $d$
Problem: Infeasible for millions of vectors at low latency
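A brute-force search is just a loop over every stored vector. The sketch below assumes cosine similarity as the metric and uses three made-up 2-dimensional vectors rather than real embeddings; `knn` and the store layout are illustrative names.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Exact search: score the query against every stored vector, keep the k best."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in store.items())
    return [key for _, key in heapq.nlargest(k, scored)]


store = {
    "dog": [0.9, 0.1],
    "puppy": [0.8, 0.2],
    "car": [0.1, 0.9],
}
top2 = knn([1.0, 0.0], store, k=2)
print(top2)
```

Every query touches all $n$ vectors and all $d$ dimensions, which is where the $O(n \cdot d)$ cost comes from.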

Approximate Nearest Neighbors

ANN algorithms trade a small accuracy loss for dramatic speed gains — instead of finding the exact nearest neighbors, they find very close neighbors in milliseconds.

Key ANN Algorithms

Algorithm	Approach	Notes
HNSW (Hierarchical Navigable Small World)	Graph-based	Best accuracy/speed tradeoff; default in most vector DBs
IVF (Inverted File Index)	Clustering	Quantizes space into Voronoi cells; scales well
PQ (Product Quantization)	Compression	Reduces memory by compressing vectors
FAISS	Meta’s library combining the above	Open source, widely used

HNSW Intuition

HNSW builds a layered graph where higher layers have coarser connections and lower layers have fine-grained connections — like zooming in on a map. Search starts at the top (few nodes, big jumps) and drills down to find the actual nearest neighbors.

Layer 2 (sparse):  A ——— C ——— F
Layer 1:           A — B — C — D — E — F
Layer 0 (dense):   A-a-B-b-C-c-D-d-E-e-F-f
                           ↑ query lands here
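The search half of that picture can be sketched as a greedy walk repeated per layer. This is a heavily simplified illustration, not a full HNSW: the graph is hand-built from the diagram (index construction is skipped), points are 1-dimensional, and only a single candidate is tracked. The node positions and the helper names `chain`, `greedy_search`, and `layered_search` are assumptions for this example.

```python
def chain(names: list[str]) -> dict[str, list[str]]:
    """Build an adjacency list that links each node to its left and right neighbor."""
    graph = {name: [] for name in names}
    for left, right in zip(names, names[1:]):
        graph[left].append(right)
        graph[right].append(left)
    return graph


def greedy_search(layer, position, entry, query):
    """Move to the closest neighbor until no neighbor is closer than the current node."""
    current = entry
    while True:
        best = min(layer[current], key=lambda n: abs(position[n] - query), default=current)
        if abs(position[best] - query) >= abs(position[current] - query):
            return current
        current = best


def layered_search(layers, position, entry, query):
    """Run the greedy walk on each layer, top to bottom, reusing the result as entry point."""
    current = entry
    for layer in layers:
        current = greedy_search(layer, position, current, query)
    return current


names = ["A", "a", "B", "b", "C", "c", "D", "d", "E", "e", "F", "f"]
position = {name: float(i) for i, name in enumerate(names)}   # A=0, a=1, ..., f=11

layers = [
    chain(["A", "C", "F"]),                  # layer 2: sparse, long jumps
    chain(["A", "B", "C", "D", "E", "F"]),   # layer 1
    chain(names),                            # layer 0: every point
]
nearest = layered_search(layers, position, entry="A", query=6.6)
print(nearest)
```

The walk goes A to C on the top layer, C to D on the middle layer, and D to d on the bottom layer, touching only a handful of nodes instead of all twelve.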

Recall vs. Speed Tradeoff

Recall@k — fraction of true top-k neighbors returned by ANN vs. exact search
Typical production target: Recall > 0.95 while achieving < 10ms latency
Tuning HNSW: ef_construction (build quality) and ef (search quality)
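Recall@k is easy to compute once you have both result lists for the same query. A minimal sketch, with made-up result IDs and a hypothetical function name `recall_at_k`:

```python
def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Fraction of the true top-k neighbors that the ANN index also returned in its top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


exact = ["a", "b", "c", "d"]    # ground truth from brute-force search
approx = ["a", "c", "e", "b"]   # what the ANN index returned
print(recall_at_k(approx, exact, k=4))
```

Order inside the top-k does not matter for this metric; only membership does.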

Vector Databases in LLM Applications

What Is a Vector Database?

A vector database is purpose-built for storing, indexing, and querying embeddings at scale. Beyond raw ANN, vector databases add:

CRUD operations (add, update, delete objects with metadata)
Metadata filtering (pre/post-filter by structured fields)
Multi-tenancy and access control
Persistence (unlike in-memory FAISS)
Horizontal scaling

Popular Vector Databases

Database	Notes
Weaviate	Open source; schema-based; strong GraphQL API; built-in ML modules
Pinecone	Managed; simple API; popular for production
Chroma	Lightweight; great for local dev and notebooks
Qdrant	Open source; Rust-based; flexible filtering
pgvector	Postgres extension — adds vector search to existing Postgres
Milvus	Open source; designed for billion-scale datasets

RAG with a Vector Database

The retrieval step in a RAG pipeline:

User Query
    │
    ▼
Embed query → query vector
    │
    ▼
Vector DB similarity search → top-k relevant chunks
    │
    ▼
Inject chunks into LLM prompt as context
    │
    ▼
LLM generates grounded answer
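The "inject chunks into the prompt" step in the flow above is plain string assembly. The sketch below shows one possible layout; the function name `build_prompt` and the prompt wording are assumptions, not a format prescribed by the course or by any library.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the user question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


chunks = [
    "A vector database stores data as high-dimensional vectors.",
    "HNSW is a graph-based index for approximate nearest neighbor search.",
]
prompt = build_prompt("What is a vector database?", chunks)
print(prompt)
```

Numbering the chunks makes it easier to trace claims in the answer back to a specific piece of retrieved context.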

Weaviate Quickstart (Python)

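import weaviate

# Uses the weaviate-client v3 API (weaviate.Client); the v4 client has a different interface
client = weaviate.Client("http://localhost:8080")

# Define schema
# The text2vec-openai vectorizer needs an OpenAI API key available to Weaviate
class_obj = {
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "content", "dataType": ["text"]}]
}
client.schema.create_class(class_obj)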

# Add data
client.data_object.create({"content": "Vector databases enable semantic search."}, "Article")

# Semantic search
result = (
    client.query
    .get("Article", ["content"])
    .with_near_text({"concepts": ["similarity search"]})
    .with_limit(3)
    .do()
)

Sparse vs Dense Vectors & Hybrid Search

Dense vs Sparse Embeddings

	Dense	Sparse
Representation	All dimensions have values (e.g., 768 floats)	Most dimensions are 0 (e.g., BM25 term weights)
Captures	Semantic meaning	Exact keyword matches
Model type	Neural (sentence transformers)	TF-IDF, BM25
Strength	Handles synonyms, paraphrases	High precision for exact terms
Weakness	May miss rare keywords	Misses semantic meaning

Hybrid Search

Combine dense + sparse retrieval to get the best of both worlds:

Dense score (semantic) + α × Sparse score (keyword) → final ranking

Reciprocal Rank Fusion (RRF) is a common fusion method — ranks results from both systems and merges them without needing to tune a weight $\alpha$.
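RRF only needs the rank positions from each retriever, not their raw scores. In the sketch below each document earns 1 / (k + rank) from every list it appears in; k = 60 is the constant commonly used for RRF, and the document IDs and the function name are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists; documents ranked high in multiple lists rise to the top."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["d1", "d2", "d3"]    # from vector similarity search
sparse_ranking = ["d3", "d1", "d4"]   # from keyword (e.g., BM25) search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents d1 and d3 appear in both lists, so they outrank d2 and d4, which each appear in only one.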

When to use hybrid: When queries may contain specific product names, codes, or jargon that dense-only search would miss, but you still want semantic generalization.

Application: Multilingual Search

Neural embedding models like paraphrase-multilingual-MiniLM-L12-v2 embed text from 50+ languages into the same vector space — enabling cross-lingual search:

Query in English → retrieve French, Spanish, German documents
No translation step needed
Works because the model is trained on parallel multilingual corpora

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

en_query = model.encode("What is a vector database?")
es_doc = model.encode("Una base de datos vectorial almacena embeddings.")

# These will be semantically close despite different languages
similarity = util.cos_sim(en_query, es_doc)  # 1x1 tensor holding the cosine similarity
print(float(similarity))

Key Takeaways

Concept	Key Idea
Embedding	Fixed-length vector encoding semantic meaning; similar items are nearby in vector space
Cosine Similarity	Most common metric for text; measures angle, not magnitude
Exact kNN	100% accurate but $O(n \cdot d)$ per query; impractical at scale
ANN (HNSW)	Near-exact results in milliseconds; tunable recall/speed tradeoff
Vector Database	Manages embeddings at scale: indexing, CRUD, filtering, persistence
Hybrid Search	Dense + sparse retrieval; handles semantic meaning AND exact keywords
Multilingual	Single multilingual model enables cross-language search without translation

Course: Vector Databases: from Embeddings to Applications — DeepLearning.AI
