Your team needs to build a system where an LLM answers questions about your company's internal documentation, which is updated weekly. A colleague proposes fine-tuning the model on the docs. What are the tradeoffs between fine-tuning and RAG here, and what would you recommend?

Question

Accepted Answer

What each approach does

Fine-tuning: continue training on your domain data to update model weights. Knowledge is encoded into parameters. Inference is identical to the base model — no retrieval step.

RAG (Retrieval-Augmented Generation): at query time, retrieve relevant document chunks from a vector store and inject them into the prompt as context. Model weights are unchanged. Knowledge lives in an external, queryable store.
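
To make the retrieve-then-inject flow concrete, here is a minimal sketch. It assumes a toy bag-of-words `embed` in place of a real embedding model and a plain list in place of a vector store. The function names and all sample chunks other than the refund one are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the query and keep the best k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Inject the retrieved chunks into the prompt; model weights are untouched.
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}"
    )

# Hypothetical documentation chunks.
CHUNKS = [
    "Refunds are accepted within 30 days of purchase.",
    "The VPN client must be updated every quarter.",
    "Expense reports are due on the first Monday of each month.",
]

question = "How many days do I have to request a refund?"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)
```

The resulting `prompt` is what would be sent to the unchanged base model. In a real system `embed` would call an embedding model and `retrieve` would query a vector index rather than scan a list.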

Why fine-tuning is often the wrong default for knowledge

Fine-tuning teaches style, format, and behavior well. It teaches facts unreliably. Weights encode information in a distributed, superimposed way: there is no dedicated "slot" for "our refund policy is 30 days." Facts from different documents interfere with each other in weight space. The model can also state a fine-tuned fact wrongly with full confidence, because the training loss gives no per-fact signal about whether a given fact was learned correctly.

Fine-tuning on 1,000 documents is not like adding them to memory. It nudges billions of parameters slightly toward patterns in those documents. At inference time, the model interpolates from those patterns rather than retrieving the original text.

For knowledge that updates weekly, fine-tuning requires retraining on each update cycle — expensive, slow, and prone to catastrophic forgetting of previous knowledge.

When fine-tuning wins

- Teaching output format: JSON schema adherence, structured citations, specific response templates
- Domain vocabulary and tone: medical, legal, or code-heavy domains where the base model's prior is weak and format matters as much as content
- Behavioral changes: how to handle follow-up questions, when to refuse, how to use tools
- Stable knowledge that requires internal reasoning: structured reference tables the model needs to reason over deeply, not just retrieve

When RAG wins

- Knowledge changes frequently (weekly doc updates → just re-index, no retraining)
- Corpus is too large to encode in model weights reliably
- Answers must be verifiable with citations (retrieval provides provenance; weight-encoded facts cannot be traced back to a source document)
- Multiple knowledge domains with different access controls (filter at retrieval time, not training time)
- Latency budget allows retrieval step (~50-200ms for vector search)
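
The "just re-index" point for weekly updates can be sketched as an incremental plan: hash each document, compare against the hashes stored at the last indexing run, and re-embed only what changed. This is one possible approach, not a prescribed one; `plan_reindex`, the file names, and the sample texts are all hypothetical.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Decide which documents to re-embed and which to drop from the index.

    indexed:      doc_id -> content hash recorded at the last indexing run
    current_docs: doc_id -> current document text
    """
    # New or changed documents need their chunks re-embedded and upserted.
    upsert = [
        doc_id for doc_id, text in current_docs.items()
        if indexed.get(doc_id) != content_hash(text)
    ]
    # Documents that no longer exist should be removed from the index.
    delete = [doc_id for doc_id in indexed if doc_id not in current_docs]
    return {"upsert": sorted(upsert), "delete": sorted(delete)}

# Hypothetical state from last week's indexing run.
last_index = {
    "refunds.md": content_hash("Refunds are accepted within 14 days."),
    "vpn.md": content_hash("Update the VPN client quarterly."),
    "legacy.md": content_hash("Old onboarding checklist."),
}
# Hypothetical docs this week: one edited, one unchanged, one new, one removed.
this_week = {
    "refunds.md": "Refunds are accepted within 30 days.",
    "vpn.md": "Update the VPN client quarterly.",
    "onboarding.md": "New hires receive a laptop on day one.",
}

plan = plan_reindex(last_index, this_week)
```

Here only `refunds.md` and `onboarding.md` would be re-embedded, and `legacy.md` would be deleted from the store. No model training is involved at any point.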

Combined approach

Production systems often use both: fine-tune for behavior and format; RAG for knowledge. Example: fine-tune to learn citation style and structured output format; RAG to retrieve the actual content to cite.

RAG failure modes to engineer around

Retrieval misses: query embedding doesn't semantically match document embedding even when the answer is there. Fix: hybrid search (BM25 + dense retrieval), query rewriting, HyDE (generate a hypothetical ideal answer and embed that for retrieval).
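
One common way to combine BM25 and dense results in hybrid search is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions, so the two scoring scales never need to be calibrated against each other. The sketch below assumes both retrievers have already returned ranked document IDs. The IDs are hypothetical, and `k=60` is the conventional default constant, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each list contributes 1 / (k + rank) for every document it contains.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of a keyword (BM25) retriever and a dense retriever.
bm25_ranking = ["refunds.md", "billing.md", "vpn.md"]
dense_ranking = ["returns-faq.md", "refunds.md", "billing.md"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives in the fused list instead of being dropped.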

Context overwhelm: 20 retrieved chunks = 8k tokens of context → model loses track of content in the middle. Fix: rerank before injecting (cross-encoder scores query-chunk relevance), take top-3 instead of top-20.
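
The rerank-then-truncate step might look like the sketch below. A real system would plug a cross-encoder in as the `score` callable. Here `overlap_score` is a deliberately crude token-overlap stand-in so the example runs without a model, and the candidate chunks are hypothetical.

```python
import re
from typing import Callable

def overlap_score(query: str, chunk: str) -> float:
    # Stand-in for a cross-encoder: fraction of query tokens found in the chunk.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0

def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float] = overlap_score,
    top_k: int = 3,
) -> list[str]:
    # Score every (query, chunk) pair, then keep only the top_k chunks.
    ranked = sorted(candidates, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_k]

# Hypothetical first-stage retrieval results, in arbitrary order.
candidates = [
    "Office plants are watered on Fridays.",
    "Our refund policy allows returns within 30 days.",
    "Refund requests go through the billing portal.",
    "The policy handbook is reviewed every year.",
    "Parking permits renew every 90 days.",
]

top = rerank("refund policy days", candidates, top_k=3)
```

Only the three chunks in `top` are injected into the prompt, which keeps the context short enough that the relevant passage is not buried in the middle.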

Hallucination despite retrieval: model answers from weights rather than context, or confabulates from partially relevant chunks. Fix: explicit grounding instructions ("answer only from the provided context"), abstention training, semantic similarity check between answer and retrieved context.
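
A rough sketch of the grounding instruction plus a post-hoc check follows. It assumes lexical token overlap as a cheap proxy for the semantic similarity check. The threshold, function names, and sample strings are hypothetical, and a production check would more likely use embeddings or an entailment model.

```python
import re

GROUNDING_INSTRUCTION = (
    "Answer only from the provided context. "
    "If the context does not contain the answer, say you do not know."
)
ABSTAIN_MESSAGE = "I could not find this in the documentation."

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def support_ratio(answer: str, context_chunks: list[str]) -> float:
    # Fraction of answer tokens that appear somewhere in the retrieved context.
    answer_tokens = tokens(answer)
    context_tokens = set().union(*(tokens(c) for c in context_chunks))
    return len(answer_tokens & context_tokens) / len(answer_tokens) if answer_tokens else 0.0

def guard(answer: str, context_chunks: list[str], threshold: float = 0.6) -> str:
    # Replace poorly supported answers with an abstention message.
    if support_ratio(answer, context_chunks) < threshold:
        return ABSTAIN_MESSAGE
    return answer

context = ["Refunds are accepted within 30 days of purchase."]
grounded = guard("Refunds are accepted within 30 days.", context)
ungrounded = guard(
    "Refunds are accepted within 90 days and include free shipping.", context
)
```

A lexical check like this catches answers that drift far from the context, but it will miss a single wrong number inside an otherwise well-supported sentence, so it complements the grounding instruction and abstention training rather than replacing them.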

Decision framework

| Question | Points to |
|---|---|
| Knowledge updates frequently? | RAG |
| Need citations/provenance? | RAG |
| Want to change output format/behavior? | Fine-tuning |
| Strict latency budget (no retrieval step)? | Fine-tuning |
| Corpus too large for context window? | RAG |
| Facts are stable, reasoning is the hard part? | Fine-tuning |
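
The table can be read as a simple checklist. The sketch below encodes it, with the caveat that the signal names are made up for illustration and that returning "both" when signals conflict is a simplification of the combined approach described earlier.

```python
# Hypothetical signal names, one per table row.
RAG_SIGNALS = {
    "knowledge_updates_frequently",
    "needs_citations",
    "corpus_too_large_for_context",
}
FINE_TUNING_SIGNALS = {
    "change_format_or_behavior",
    "no_latency_for_retrieval",
    "stable_facts_hard_reasoning",
}

def recommend(signals: set[str]) -> str:
    unknown = signals - RAG_SIGNALS - FINE_TUNING_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    wants_rag = bool(signals & RAG_SIGNALS)
    wants_ft = bool(signals & FINE_TUNING_SIGNALS)
    if wants_rag and wants_ft:
        return "both"  # fine-tune for behavior and format, RAG for knowledge
    if wants_rag:
        return "RAG"
    if wants_ft:
        return "fine-tuning"
    return "undecided"

# The scenario in the question: internal docs that change weekly.
weekly_docs = recommend({"knowledge_updates_frequently"})
with_format = recommend({"knowledge_updates_frequently", "change_format_or_behavior"})
```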