The research-assistant analogy
Imagine a sharp research assistant with no memory of your company. Hand them a question about your internal processes and they will likely hallucinate an answer. Hand them the question plus the three most relevant pages from your filing cabinet, and they can write a well-grounded answer that cites those pages.
RAG — Retrieval-Augmented Generation — is exactly that workflow, automated. The model is the assistant. Your filing cabinet is a vector store. The "find the three relevant pages" step is retrieval.
The five stages
┌────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ 1. Ingest │→ │ 2. Embed │→ │ 3. Store │→ │4.Retrieve│→ │5.Generate│
└────────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
- Ingest — pull source docs (PDFs, wiki, support tickets), strip boilerplate, split into chunks of 200–800 tokens.
- Embed — pass each chunk through an embedding model. You get a vector (a list of ~384–3072 numbers) that captures meaning.
- Store — write `(id, vector, original_text, metadata)` into a vector database (pgvector, Pinecone, Qdrant, Weaviate…).
- Retrieve — at query time, embed the user's question and find the top-k chunks with the highest cosine similarity.
- Generate — stuff those chunks into the LLM prompt as context, ask the question, and have the model answer with citations to the chunks it used.
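To make the five stages concrete, here is a minimal sketch of the whole pipeline using only the Python standard library. It rests on several assumptions that are not part of any real system: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, the "vector database" is a plain list, words stand in for tokens, and all function names and sample documents are hypothetical. The final step only builds the prompt; sending it to an LLM is left out.

```python
import hashlib
import math
import re

DIM = 256  # toy dimensionality; real embedding models use hundreds to thousands


def chunk_text(text, max_words=40):
    """Ingest: split text into chunks of at most max_words words (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Embed: toy hashed bag-of-words vector, L2-normalised. NOT a real embedding model."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity; inputs are already unit length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


def build_index(docs):
    """Store: keep (id, vector, original_text, metadata) records in a plain list."""
    index = []
    for source, text in docs.items():
        for n, chunk in enumerate(chunk_text(text)):
            index.append({"id": f"{source}#{n}", "vector": embed(chunk),
                          "text": chunk, "metadata": {"source": source}})
    return index


def retrieve(index, question, k=3):
    """Retrieve: embed the question and return the k most similar records."""
    q = embed(question)
    return sorted(index, key=lambda r: cosine(q, r["vector"]), reverse=True)[:k]


def build_prompt(question, records):
    """Generate (prompt half only): put retrieved chunks, tagged by id, ahead of the question."""
    context = "\n".join(f"[{r['id']}] {r['text']}" for r in records)
    return (f"Answer using only the context and cite chunk ids.\n\n"
            f"{context}\n\nQuestion: {question}")


# Hypothetical sample documents.
docs = {
    "vacation.md": "Employees accrue vacation days monthly. Unused vacation days roll over once.",
    "expenses.md": "Submit expense reports within thirty days. Attach receipts to expense reports.",
}
index = build_index(docs)
question = "How do vacation days roll over?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real deployment each function would be swapped for a proper component (a tokenizer-aware splitter, an embedding model, a vector database), but the data flow between the stages stays the same.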
The knobs that matter most
- Chunk size — too small, you fragment ideas; too large, you waste context. 300–500 tokens is a sane default for prose, smaller for code.
- Top-k — how many chunks to retrieve. 3–8 is typical. More chunks = more context cost and more chance of distracting the model.
- Similarity threshold — drop chunks below a score floor instead of always returning k results. Lets the system say "I do not have anything relevant" gracefully.
- Reranking — a second model re-scores the top 20 to pick the best 5. Big quality lift, modest cost.
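The top-k, threshold, and reranking knobs can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the similarity scores are made-up numbers, the function names are hypothetical, and `overlap_scorer` is a toy word-overlap function standing in for a real reranking model.

```python
def top_k_with_floor(scored, k=5, min_score=0.3):
    """Keep at most k (chunk_id, score) pairs, dropping anything below min_score."""
    kept = [pair for pair in scored if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


def overlap_scorer(question):
    """Toy stand-in for a reranking model: fraction of question words found in the chunk."""
    q_words = set(question.lower().split())

    def score(chunk_text):
        return len(q_words & set(chunk_text.lower().split())) / len(q_words)

    return score


def rerank(candidates, rerank_score, keep=5):
    """Re-score a wide candidate list with a second scorer and keep the best few."""
    return sorted(candidates, key=rerank_score, reverse=True)[:keep]


# Made-up similarity scores from a first-pass vector search.
scored = [("a", 0.82), ("b", 0.41), ("c", 0.12), ("d", 0.55)]
print(top_k_with_floor(scored, k=2, min_score=0.3))  # best two above the floor
print(top_k_with_floor(scored, k=5, min_score=0.9))  # empty: nothing relevant

candidates = [
    "refund policy for annual plans",
    "how to reset a password",
    "annual plans renew each year",
]
print(rerank(candidates, overlap_scorer("refund for annual plans"), keep=2))
```

An empty result from `top_k_with_floor` is the signal to answer "I do not have anything relevant" instead of passing weak context to the model.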
Why RAG beats fine-tuning for facts
| Need | Better choice |
|---|---|
| Up-to-date information | RAG (just re-embed) |
| Private/changing knowledge | RAG |
| New tone, style, format | Fine-tune |
| New skills (e.g., function-calling style) | Fine-tune |
Most "make the LLM know our docs" problems are RAG problems.
Common failure modes
- Bad chunks — splitting mid-sentence loses context. Use semantic or recursive splitters.
- Embedding drift — querying with one embedding model, indexing with another. The vectors live in different spaces. Always match.
- Stale index — docs change, vectors do not. Build re-embedding into your CI or a nightly job.
- No citations — if the LLM can't say which chunk it used, you cannot debug it. Always force citations.
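Several of these failures can be caught with cheap checks. The sketch below assumes a hypothetical index manifest (a dict recording which embedding model built the index and a content hash per document) and assumes the model cites chunks as bracketed ids such as `[faq#0]`. The model names `model-a` and `model-b`, the manifest layout, and the function names are all placeholders for illustration.

```python
import hashlib
import re


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_ids(docs, manifest):
    """Return ids of docs that are new or whose text changed since they were embedded."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if manifest["hashes"].get(doc_id) != content_hash(text))


def check_model(manifest, query_model):
    """Refuse to query an index built with a different embedding model."""
    if manifest["embedding_model"] != query_model:
        raise ValueError(
            f"index built with {manifest['embedding_model']!r}, query uses {query_model!r}")


def unknown_citations(answer, retrieved_ids):
    """Return bracketed ids cited in the answer that were never retrieved."""
    return set(re.findall(r"\[([^\]]+)\]", answer)) - set(retrieved_ids)


# Hypothetical current docs and a manifest written at the last indexing run.
docs = {"faq": "Refunds take five days.", "sla": "Uptime target is 99.9%."}
manifest = {
    "embedding_model": "model-a",
    "hashes": {
        "faq": content_hash("Refunds take five days."),
        "sla": content_hash("Uptime target is 99%."),  # text has changed since
    },
}

print(stale_ids(docs, manifest))   # docs that need re-embedding
check_model(manifest, "model-a")   # same model: passes silently
try:
    check_model(manifest, "model-b")
except ValueError as err:
    print("blocked:", err)

answer = "Refunds take five days [faq#0], see also [hr#2]."
print(unknown_citations(answer, {"faq#0", "sla#0"}))
```

A check like `stale_ids` fits naturally into the CI or nightly job mentioned above: re-embed only what it returns, then rewrite the manifest.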