How a RAG System Answers a Question, Step by Step

Five stages turn a user question into a grounded answer, and three settings (top-k, chunk size, and similarity threshold) shape what retrieval returns.

Apr 29, 2026 · 3 min read

The research-assistant analogy

Imagine a sharp research assistant with no memory of your company. Hand them a question and they will guess, often confidently and wrongly. Hand them the question plus the three most relevant pages from your filing cabinet, and they can write a grounded answer that cites those pages.

RAG (Retrieval-Augmented Generation) is exactly that workflow, automated. The model is the assistant. Your filing cabinet is a vector store. The "find the three relevant pages" step is retrieval.

The five stages

┌────────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ 1. Ingest  │→ │ 2. Embed │→ │ 3. Store │→ │4.Retrieve│→ │5.Generate│
└────────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘

Ingest — pull source docs (PDFs, wiki, support tickets), strip boilerplate, split into chunks of 200–800 tokens.
Embed — pass each chunk through an embedding model. You get a vector (a list of ~384–3072 numbers) that captures meaning.
Store — write (id, vector, original_text, metadata) into a vector database (pgvector, Pinecone, Qdrant, Weaviate…).
Retrieve — at query time, embed the user's question and find the top-k chunks with the highest cosine similarity.
Generate — stuff those chunks into the LLM prompt as context, ask the question, and read the cited answer.
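
The five stages above can be sketched end to end in a few lines. This is a toy illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, words stand in for tokens, a plain Python list stands in for the vector database, and every function name here is an assumption made for the example. The sketch stops at building the prompt, since the final LLM call depends on your provider.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Ingest: split text into chunks of at most max_words words (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Embed: toy bag-of-words vector (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_store(docs):
    """Store: keep (id, vector, original_text, metadata) records in a plain list."""
    store = []
    for source, text in docs.items():
        for n, chunk in enumerate(chunk_text(text)):
            store.append({"id": f"{source}#{n}", "vector": embed(chunk),
                          "text": chunk, "metadata": {"source": source}})
    return store


def retrieve(store, question, k=3):
    """Retrieve: embed the question and return the k most similar (score, record) pairs."""
    q = embed(question)
    scored = [(cosine(q, rec["vector"]), rec) for rec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, hits):
    """Generate (first half): assemble the context-stuffed prompt for the LLM."""
    context = "\n".join(f"[{rec['id']}] {rec['text']}" for _, rec in hits)
    return (f"Answer using only the context below and cite chunk ids.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


# Made-up example documents.
docs = {
    "refunds.md": "Refunds are issued within 14 days of purchase when the item is unused.",
    "shipping.md": "Standard shipping takes 3 to 5 business days within the EU.",
}
store = build_store(docs)
hits = retrieve(store, "How long do refunds take?", k=1)
prompt = build_prompt("How long do refunds take?", hits)
print(prompt)
```

Tagging each chunk with its id inside the prompt is what makes citations possible later: the model can only point at a chunk it can name.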

The knobs that matter most

Chunk size — too small, you fragment ideas; too large, you waste context. 300–500 tokens is a sane default for prose, smaller for code.
Top-k — how many chunks to retrieve. 3–8 is typical. More chunks = more context cost and more chance of distracting the model.
Similarity threshold — drop chunks below a score floor instead of always returning k results. Lets the system say "I do not have anything relevant" gracefully.
Reranking — a second model re-scores the top 20 to pick the best 5. Big quality lift, modest cost.
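
A rough sketch of how top-k, the similarity threshold, and reranking interact. The candidate scores and texts are made up, the default values are arbitrary, and `overlap_score` is a hypothetical stand-in for a real reranker model; only the control flow is the point.

```python
def select_chunks(scored, k=5, threshold=0.3):
    """Keep at most k candidates whose similarity meets the threshold.

    scored is a list of (similarity, chunk_text) pairs from the vector search.
    An empty result means "nothing relevant", so the caller can say so
    instead of handing weak context to the model.
    """
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


def rerank(question, candidates, score_fn, keep=5):
    """Re-score candidates with a second scorer and keep the best few."""
    rescored = [(score_fn(question, text), text) for _, text in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:keep]


def overlap_score(question, text):
    """Hypothetical reranker stand-in: fraction of question words found in the chunk."""
    q_words = set(question.lower().split())
    return len(q_words & set(text.lower().split())) / len(q_words)


candidates = [
    (0.82, "refunds are issued within 14 days"),
    (0.41, "refund requests need an order number"),
    (0.12, "our office is closed on public holidays"),
]
selected = select_chunks(candidates, k=2, threshold=0.3)  # drops the 0.12 chunk
nothing = select_chunks(candidates, k=2, threshold=0.9)   # empty: nothing relevant
best = rerank("when are refunds issued", selected, overlap_score, keep=1)
```

With a strict threshold the function returns an empty list, which is the signal to answer "I do not have anything relevant" rather than pass weak context along.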

Why RAG beats fine-tuning for facts

Need	Better choice
Up-to-date information	RAG (just re-embed)
Private/changing knowledge	RAG
New tone, style, format	Fine-tune
New skills (e.g., function-calling style)	Fine-tune

Most "make the LLM know our docs" problems are RAG problems.

Common failure modes

Bad chunks — splitting mid-sentence loses context. Use semantic or recursive splitters.
Embedding drift — querying with one embedding model, indexing with another. The vectors live in different spaces. Always match.
Stale index — docs change, vectors do not. Build re-embedding into your CI or a nightly job.
No citations — if the LLM can't say which chunk it used, you cannot debug it. Always force citations.
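
Two of these failures, mismatched embedding models and a stale index, can be caught with a little bookkeeping stored next to the index. The sketch below is one possible approach, not a feature of any particular vector database: the `index_meta` layout, the model name, and the function names are all assumptions for illustration.

```python
import hashlib


def content_hash(text):
    """Fingerprint of a document's text, used to detect changes since indexing."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_model(index_meta, query_model):
    """Refuse to query an index that was built with a different embedding model."""
    if index_meta["embedding_model"] != query_model:
        raise ValueError(
            f"index built with {index_meta['embedding_model']!r}, "
            f"query uses {query_model!r}"
        )


def stale_docs(index_meta, current_docs):
    """Return ids of docs that are new or whose text changed since indexing."""
    indexed = index_meta["doc_hashes"]
    return sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed.get(doc_id) != content_hash(text)
    )


index_meta = {
    "embedding_model": "model-a",  # hypothetical model name
    "doc_hashes": {"faq.md": content_hash("Refunds take 14 days.")},
}
current_docs = {
    "faq.md": "Refunds take 30 days.",       # edited since indexing
    "pricing.md": "Plans start at 10 EUR.",  # never indexed
}
check_model(index_meta, "model-a")            # same model, no error
to_reembed = stale_docs(index_meta, current_docs)
print(to_reembed)
```

A nightly job could run `stale_docs` and re-embed only the ids it returns, which keeps re-indexing cheap when most documents have not changed.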

From the field

Almost every "the RAG bot is dumb" complaint I've debugged turned out to be retrieval, not the model: the right chunk was never fetched, so the model answered from thin air. That's why I evaluate the two halves separately. First ask "did we retrieve the chunk that actually contains the answer?", a number you can measure without the LLM at all, and only then judge the generation. Teams that conflate the two burn weeks tuning prompts when their real problem is chunking or how the query is embedded. Fix retrieval first; the generation is usually fine once it's handed the right context.
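
A minimal sketch of that retrieval-only check, assuming you have a small labelled set of questions paired with the id of the chunk that contains the answer. The questions, chunk ids, and the canned `fake_retrieve` function are invented for the example; in practice you would pass in your real retriever.

```python
def hit_rate_at_k(eval_set, retrieve_ids, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k retrieved ids."""
    hits = 0
    for question, gold_id in eval_set:
        if gold_id in retrieve_ids(question)[:k]:
            hits += 1
    return hits / len(eval_set)


# Hypothetical labelled set: (question, id of the chunk that contains the answer).
eval_set = [
    ("How long do refunds take?", "refunds.md#0"),
    ("Do you ship to Norway?", "shipping.md#2"),
    ("How do I reset my password?", "account.md#1"),
    ("Can I change my invoice address?", "billing.md#4"),
]

# Stand-in for a real retriever: canned ranked chunk ids per question.
canned = {
    "How long do refunds take?": ["refunds.md#0", "refunds.md#1"],
    "Do you ship to Norway?": ["shipping.md#0", "shipping.md#2"],
    "How do I reset my password?": ["billing.md#0", "faq.md#3"],
    "Can I change my invoice address?": ["billing.md#4", "billing.md#1"],
}


def fake_retrieve(question):
    """Return the ranked chunk ids for a question (no LLM involved)."""
    return canned[question]


at_1 = hit_rate_at_k(eval_set, fake_retrieve, k=1)
at_2 = hit_rate_at_k(eval_set, fake_retrieve, k=2)
print(f"hit rate@1 = {at_1}, hit rate@2 = {at_2}")
```

If this number is low, no amount of prompt tuning will help, because the answer never reaches the model. If it is high and answers are still poor, the generation step is the place to look.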
