
How a RAG System Answers a Question, Step by Step

Five stages turn a user question into a grounded answer. Adjust top-k, chunk size, and similarity threshold to see retrieval shape the result.


The research-assistant analogy

Imagine a sharp research assistant with no memory of your company. Hand them a question alone and they will guess, and some of those guesses will be confident fabrications (hallucinations). Hand them the question plus the three most relevant pages from your filing cabinet, and they can write a grounded answer that cites those pages.

RAG — Retrieval-Augmented Generation — is exactly that workflow, automated. The model is the assistant. Your filing cabinet is a vector store. The "find the three relevant pages" step is retrieval.

The five stages

┌────────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ 1. Ingest  │→ │ 2. Embed │→ │ 3. Store │→ │4.Retrieve│→ │5.Generate│
└────────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘
  1. Ingest — pull source docs (PDFs, wiki, support tickets), strip boilerplate, split into chunks of 200–800 tokens.
  2. Embed — pass each chunk through an embedding model. You get a vector (a list of ~384–3072 numbers) that captures meaning.
  3. Store — write (id, vector, original_text, metadata) into a vector database (pgvector, Pinecone, Qdrant, Weaviate…).
  4. Retrieve — at query time, embed the user's question and find the top-k chunks with the highest cosine similarity.
  5. Generate — insert those chunks into the LLM prompt as context, ask the question, and instruct the model to cite the chunks it used.
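
The five stages above can be sketched end to end in a few lines. This is a minimal illustration under loud assumptions, not a production pipeline: `embed` is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is a plain Python list, chunk size is counted in words rather than tokens, and every function name and sample document here is hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Stage 1 (Ingest): split text into chunks of at most max_words words.
    Words stand in for tokens here; a real splitter would count tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Stage 2 (Embed): toy sparse bag-of-words vector.
    A stand-in for a real embedding model, which captures meaning, not just words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_store(docs):
    """Stage 3 (Store): keep (id, vector, original_text, metadata) per chunk."""
    store = []
    for source, text in docs.items():
        for n, piece in enumerate(chunk(text)):
            store.append({"id": f"{source}#{n}", "vector": embed(piece),
                          "text": piece, "metadata": {"source": source}})
    return store

def retrieve(store, question, top_k=3):
    """Stage 4 (Retrieve): embed the question, return the top_k (score, record) pairs."""
    q = embed(question)
    scored = [(cosine(q, rec["vector"]), rec) for rec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def build_prompt(question, hits):
    """Stage 5 (Generate): assemble the prompt that would be sent to the LLM."""
    context = "\n".join(f"[{rec['id']}] {rec['text']}" for _, rec in hits)
    return ("Answer using only the context below and cite chunk ids in brackets.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# Hypothetical sample documents.
DOCS = {
    "refunds.md": "Refunds are issued within 14 days of purchase.",
    "shipping.md": "Orders ship from the Rotterdam warehouse every weekday.",
}
store = build_store(DOCS)
question = "How many days do I have for refunds?"
hits = retrieve(store, question, top_k=3)
prompt = build_prompt(question, hits)
print(prompt)
```

The prompt carries each chunk's id so the model has something concrete to cite; the call to the LLM itself is left out because it depends on the provider.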

The knobs that matter most

  • Chunk size — too small, you fragment ideas; too large, you waste context. 300–500 tokens is a sane default for prose, smaller for code.
  • Top-k — how many chunks to retrieve. 3–8 is typical. More chunks = more context cost and more chance of distracting the model.
  • Similarity threshold — drop chunks below a score floor instead of always returning k results. Lets the system say "I do not have anything relevant" gracefully.
  • Reranking — a second model re-scores the top 20 to pick the best 5. Big quality lift, modest cost.
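
As a rough sketch of how top-k, the similarity threshold, and reranking interact, the snippet below works on candidates that already carry a similarity score. The function names, the default values, and the word-overlap scorer are all assumptions made for illustration; in practice the second-stage scorer would be a dedicated reranking model.

```python
import re

def select(candidates, top_k=5, min_score=0.3):
    """Apply the score floor first, then keep at most top_k (score, text) pairs.
    May return fewer than top_k results, or none at all."""
    kept = [c for c in candidates if c[0] >= min_score]
    kept.sort(key=lambda c: c[0], reverse=True)
    return kept[:top_k]

def rerank(question, candidates, score_fn, fetch_n=20, keep_n=5):
    """Re-score the best fetch_n candidates with a second scorer and keep keep_n."""
    pool = sorted(candidates, key=lambda c: c[0], reverse=True)[:fetch_n]
    rescored = [(score_fn(question, text), text) for _, text in pool]
    rescored.sort(key=lambda c: c[0], reverse=True)
    return rescored[:keep_n]

def overlap_score(question, text):
    """Toy second-stage scorer: fraction of question words found in the chunk.
    Hypothetical stand-in for a real reranking model."""
    q = set(re.findall(r"[a-z0-9]+", question.lower()))
    t = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(q & t) / len(q) if q else 0.0

# Hypothetical first-stage results: (similarity score, chunk text).
candidates = [
    (0.82, "Refunds are issued within 14 days."),
    (0.41, "Refund requests need an order number."),
    (0.12, "Our office is closed on public holidays."),
]

relevant = select(candidates, top_k=2, min_score=0.3)
nothing = select(candidates, top_k=2, min_score=0.9)
best = rerank("What order number do refund requests need?", candidates,
              overlap_score, fetch_n=20, keep_n=1)

print(relevant)
print(nothing or "I do not have anything relevant")
print(best)
```

With the floor raised to 0.9 the selection comes back empty, which is the signal to answer "I do not have anything relevant" instead of passing weak chunks to the model. The rerank call also shows the second scorer promoting a chunk that ranked second on raw similarity.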

Why RAG beats fine-tuning for facts

Need                                       Better choice
-----------------------------------------  -------------------
Up-to-date information                     RAG (just re-embed)
Private/changing knowledge                 RAG
New tone, style, format                    Fine-tune
New skills (e.g., function-calling style)  Fine-tune

Most "make the LLM know our docs" problems are RAG problems.

Common failure modes

  • Bad chunks — splitting mid-sentence loses context. Use semantic or recursive splitters.
  • Embedding drift — querying with one embedding model, indexing with another. The vectors live in different spaces. Always match.
  • Stale index — docs change, vectors do not. Build re-embedding into your CI or a nightly job.
  • No citations — if the LLM can't say which chunk it used, you cannot debug it. Always force citations.
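
Three of these failure modes can be guarded against with small checks. The sketch below is one possible shape, assuming the index records which embedding model built it and a content hash per document; the class, method, and model names are hypothetical, and the citation check assumes answers cite chunk ids in square brackets.

```python
import hashlib
import re

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class ChunkIndex:
    """Bookkeeping around a vector index (the vectors themselves are omitted)."""

    def __init__(self, model_name):
        self.model_name = model_name  # embedding model used at indexing time
        self.hashes = {}              # doc_id -> content hash at embed time

    def record(self, doc_id, text):
        """Remember what the document looked like when it was embedded."""
        self.hashes[doc_id] = content_hash(text)

    def check_query_model(self, query_model):
        """Embedding drift guard: refuse queries embedded with another model."""
        if query_model != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, query uses {query_model!r}")

    def stale_docs(self, current_docs):
        """Stale index guard: ids whose text changed (or is new) since embedding."""
        return sorted(doc_id for doc_id, text in current_docs.items()
                      if self.hashes.get(doc_id) != content_hash(text))

def cited_ids(answer):
    """Citation guard: extract chunk ids written as [id] in the model's answer."""
    return set(re.findall(r"\[([^\[\]]+)\]", answer))

index = ChunkIndex("embed-model-a")  # hypothetical model name
index.record("refunds.md", "Refunds are issued within 14 days.")
index.record("shipping.md", "Orders ship every weekday.")

current = {
    "refunds.md": "Refunds are issued within 30 days.",  # edited after indexing
    "shipping.md": "Orders ship every weekday.",
}
print(index.stale_docs(current))
print(cited_ids("You have 14 days [refunds.md#0]."))
```

A nightly job or CI step could call `stale_docs` and re-embed only the ids it returns, and an answer for which `cited_ids` is empty can be rejected or retried.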
Engr Mejba Ahmed
