When fine-tuning makes sense (and when RAG is enough)

When a company decides to deploy its own model, sooner or later this question arises: is feeding knowledge via RAG enough, or does the model need fine-tuning? Both approaches have existed for years, but in 2026 the boundary between them is sharper than ever—and confusing them costs weeks of work and tens of thousands of zlotys.

What the difference looks like in practice

RAG doesn’t change the model. It retrieves relevant fragments from your knowledge base and injects them into the context before each response. The model reads these fragments and answers based on them, with citations. Knowledge lives outside the model, so tomorrow you can update the database without any retraining.
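The retrieve-then-inject loop described above can be sketched in a few lines. This is a minimal illustration, not a production design: the `KNOWLEDGE_BASE` contents, the function names, and the bag-of-words similarity are all assumptions made for the example, and a real deployment would typically use embeddings and a vector database instead.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; in production this would live in a vector database.
KNOWLEDGE_BASE = {
    "pricing.md": "The Standard plan costs 49 PLN per month per user.",
    "returns.md": "Customers can return products within 30 days of delivery.",
    "support.md": "Support is available on weekdays from 8:00 to 16:00.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, top_k=2):
    """Return the top_k (score, doc_id, text) tuples for the question."""
    q = Counter(tokenize(question))
    scored = [
        (cosine(q, Counter(tokenize(text))), doc_id, text)
        for doc_id, text in KNOWLEDGE_BASE.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


def build_prompt(question, top_k=2):
    """Inject the retrieved fragments, tagged with their source, before the question."""
    fragments = retrieve(question, top_k)
    context = "\n".join(f"[{doc_id}] {text}" for _, doc_id, text in fragments)
    return (
        "Answer using only the fragments below and cite the source in brackets.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


print(build_prompt("How much does the Standard plan cost?"))
```

Adding or editing an entry in `KNOWLEDGE_BASE` changes what `retrieve` returns on the very next call, with no training step involved. The model itself is never touched.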

Fine-tuning changes the model’s weights. You train it on your own input-output examples and save the resulting changes in the model itself. After fine-tuning, the model generates text differently even without additional context. The change is built into the trained checkpoint: to undo it you have to go back to the earlier version of the model or retrain.

Key sentence: RAG changes what the model knows; fine-tuning changes how the model behaves.

Three situations where fine-tuning is justified

Below are three concrete cases where fine-tuning delivers value that RAG can’t replicate:

1. Permanent output style and format. When your system must generate reports in a strictly defined template (e.g., specific XML, legal format, industry notation) and no prompt keeps the output consistent across thousands of calls, fine-tuning embeds the format in the weights. Example: a system generating technical descriptions according to ISO standards, where deviations cause regulatory issues.

2. Specialized jargon and domain terminology. A general model knows the word “dekretacja” only in an accounting context. If your process uses abbreviations, acronyms, and terminology the model didn’t see in pretraining, a few hundred fine-tuning examples will teach it to interpret and generate them correctly. RAG can provide definitions, but it won’t change deep contextual understanding.

3. Cost and latency reduction through specialization. A small model (7B-14B) trained for a specific task (e.g., only customer service intent classification) is many times cheaper in inference than a large general model. If your system performs millions of calls monthly on one narrow task, fine-tuning a smaller model can pay off within months. Calculate this using the inference calculator.
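The payback claim in the third case comes down to simple arithmetic. The sketch below shows one way to estimate it. Every number in the usage example (call volume, per-call prices, training and maintenance cost) is a made-up placeholder, not a benchmark or a quote; substitute your own figures.

```python
def breakeven_months(
    calls_per_month,
    large_cost_per_call,
    small_cost_per_call,
    one_time_finetune_cost,
    monthly_maintenance=0.0,
):
    """Months until inference savings cover the one-time fine-tuning cost.

    Returns None when the small model never pays for itself, i.e. when the
    monthly saving does not exceed the ongoing maintenance cost.
    """
    monthly_saving = (
        calls_per_month * (large_cost_per_call - small_cost_per_call)
        - monthly_maintenance
    )
    if monthly_saving <= 0:
        return None
    return one_time_finetune_cost / monthly_saving


# Placeholder figures in PLN, chosen only to illustrate the calculation.
months = breakeven_months(
    calls_per_month=2_000_000,
    large_cost_per_call=0.02,
    small_cost_per_call=0.002,
    one_time_finetune_cost=80_000,
    monthly_maintenance=4_000,
)
print(f"Break-even after about {months:.1f} months")
```

With these placeholder values the saving is roughly 32,000 PLN per month, so the training cost is recovered in about two and a half months. At a thousand calls per month the same function returns `None`, which is the more common outcome for low-volume systems.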

Four situations where fine-tuning is a mistake

It’s worth knowing when NOT to choose fine-tuning, as this is the more common error:

1. “We want the model to know our documents.” This is exactly what RAG is for. Fine-tuning on documents isn’t factual memory—the model can still hallucinate facts, just in your specific style. RAG with a vector database and source citations is the right answer.

2. Knowledge changes frequently. If your data updates weekly (pricing, regulations, offers), fine-tuning is unsuitable—every change would require retraining. RAG updates by adding new documents to the database.

3. You have little training data. Fine-tuning without enough high-quality examples leads to overfitting or regression in the model’s general abilities. The minimum is a few hundred good input-output pairs; realistically, several thousand for repeatable results. If you don’t have that much data, RAG plus prompt engineering is a cheaper start.

4. Budget and time are limited. Fine-tuning requires GPU infrastructure, training data, experiments, evaluation, and maintaining multiple model versions. It’s not a one-time cost. A RAG pilot can launch in weeks for a fraction of the effort.

Decision table: RAG or fine-tuning

Criterion	RAG	Fine-tuning
Fresh or frequently updated data	yes	no
Permanent output style and format	partially (prompt)	yes
Specialized domain jargon	partially	yes
Implementation cost	low	high
Time to first results	weeks	months
Update without retraining	yes	no
Citable sources in response	yes	no
Latency reduction for narrow tasks	no	yes
Risk of factual hallucinations	low (with threshold)	medium
Required data volume	little (documents)	large (training pairs)

Practical rule: start with RAG and measure the results. If after two or three weeks the problem turns out to be not “what the model knows” but “how the model behaves”, revisit fine-tuning.
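The table and the rule above can be condensed into a small decision helper. This is a deliberately simplified sketch: the `UseCase` fields, the `MIN_TRAINING_PAIRS` cutoff of 300 (our reading of "a few hundred"), and the order of the checks are assumptions for illustration, not a substitute for measuring a pilot.

```python
from dataclasses import dataclass

# Assumed cutoff standing in for "a few hundred good input-output pairs".
MIN_TRAINING_PAIRS = 300


@dataclass
class UseCase:
    data_changes_often: bool            # pricing, regulations, offers
    needs_citations: bool               # answers must point to sources
    needs_fixed_format_or_jargon: bool  # strict template or domain terminology
    high_volume_narrow_task: bool       # millions of calls on one narrow task
    training_pairs: int                 # good input-output examples available


def recommend(case):
    """Map a use case to RAG, fine-tuning, or a hybrid, following the table."""
    wants_knowledge = case.data_changes_often or case.needs_citations
    wants_behavior = case.needs_fixed_format_or_jargon or case.high_volume_narrow_task

    if wants_behavior and case.training_pairs < MIN_TRAINING_PAIRS:
        return "RAG + prompt engineering (collect more training pairs first)"
    if wants_knowledge and wants_behavior:
        return "hybrid"
    if wants_behavior:
        return "fine-tuning"
    return "RAG"  # the default starting point


print(recommend(UseCase(True, True, False, False, training_pairs=0)))
print(recommend(UseCase(False, False, True, False, training_pairs=2000)))
```

The helper defaults to RAG on purpose: fine-tuning is only suggested when a behavior problem is present and enough training pairs exist to address it.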

How fine-tuning works in practice

If after the above analysis fine-tuning is the right decision, the process looks like this:

Collect training pairs. Each example is an input (prompt, context) and output (correct answer). Quality matters more than quantity—three hundred precise examples beat three thousand mediocre ones.
Choose a base model. Smaller models (7B, 13B) train faster and cost less. Large 70B+ models are rarely used for fine-tuning outside the biggest organizations.
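LoRA / QLoRA technique. Full fine-tuning of all weights is rarely necessary. LoRA freezes the base weights and trains only small low-rank adapter matrices, reducing GPU costs by an order of magnitude while preserving most of the effect. QLoRA additionally keeps the frozen base model in 4-bit quantized form to cut memory use further.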
Evaluation. A test set (hold-out) must be separate from training data from the start. Measure task-specific metrics (F1 for classification, ROUGE for generation), not just subjective impressions.
Versioning. Every trained checkpoint is a new model version with a date, dataset, and evaluation results. Without this, you won’t know which model to deploy or how to revert to a previous one.
Maintenance. A fine-tuned model goes stale as the facts and inputs it was trained on change. Establish a retraining policy, e.g., quarterly or when evaluation results fall below a threshold.
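To see why the LoRA step in the list above is so much cheaper, it helps to count trainable parameters. For one weight matrix of shape `d_out × d_in`, full fine-tuning updates `d_out * d_in` values, while LoRA with rank `r` trains two thin matrices with `r * (d_in + d_out)` values in total. The 4096 × 4096 layer size and rank 8 below are illustrative assumptions, not a recommendation for any particular model.

```python
def full_params(d_in, d_out):
    """Trainable values when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable values for a LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative layer size and rank (assumptions for the example).
d_in, d_out, rank = 4096, 4096, 8

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tuning: {full:,} trainable values per matrix")
print(f"LoRA (r={rank}):   {lora:,} trainable values per matrix")
print(f"reduction factor: {full // lora}x")
```

This counts trainable parameters only. Actual GPU savings also depend on optimizer state, activations, and whether the frozen base model is quantized, so treat the ratio as an upper-level intuition rather than a cost estimate.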

It’s easier to plan this after filling out the agent blueprint—it helps visualize where fine-tuning and RAG fit into the architecture.

Hybrid: fine-tuning plus RAG

The best production deployments often combine both approaches. The most common pattern we see:

Fine-tuning handles style, format, and voice (the model speaks like your brand, generates in your template).
RAG provides fresh facts with each call (the model doesn’t hallucinate current pricing because it simply receives it in context).

The hybrid requires careful router architecture to decide when to enrich context and when to rely on fine-tuned knowledge. This is one of the patterns we build as part of a custom AI assistant for clients.
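A router of this kind can start out very simple. The sketch below uses a keyword rule to decide whether a question touches facts that change often. The `VOLATILE_TOPICS` set, the function names, and the injected `retrieve` callable are all hypothetical. Production routers are often a small classifier rather than a keyword list.

```python
import re

# Hypothetical list of topics whose facts change often and must come from RAG.
VOLATILE_TOPICS = {"price", "pricing", "cost", "offer", "regulation", "deadline"}


def needs_fresh_facts(question):
    """True when the question mentions a topic that should be answered from context."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return bool(words & VOLATILE_TOPICS)


def route(question, retrieve):
    """Decide whether to enrich the context before calling the fine-tuned model.

    `retrieve` is any callable that takes a question and returns a list of
    fragments; it is passed in so the router stays independent of the store.
    """
    if needs_fresh_facts(question):
        return {"path": "rag", "context": retrieve(question), "question": question}
    return {"path": "model_only", "context": [], "question": question}


def fake_retrieve(question):
    """Stand-in retriever used only for this example."""
    return ["price list, current version"]


print(route("What is the current price of the Standard plan?", fake_retrieve)["path"])
print(route("Rewrite this note in our report template.", fake_retrieve)["path"])
```

Either way the request goes to the same fine-tuned model; the router only controls whether retrieved fragments are added to the context first.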

Cost and regulatory considerations

Fine-tuning and inference of a trained model have implications for security and regulations. A few key facts to consider before deciding:

If you train a model on personal data, GDPR applies, and a DPIA is likely required. Training data “enters” the model’s weights in a way that’s hard to audit—you can’t easily exercise the right to erasure as with RAG, where deleting a document from the database suffices.

Under the AI Act, high-risk systems must document training data and methodology. Fine-tuning on customer data in classification systems (e.g., credit scoring, recruitment) requires additional controls and auditability.

For sensitive data, we prefer self-hosting—the model trains and runs in your infrastructure, and PII never leaves the organization.

FAQ

When does fine-tuning make sense, and when is RAG enough?

Fine-tuning makes sense when the problem is a fixed output style, specialized domain jargon, or the need for cheaper inference on a narrow task. RAG is enough when the problem is access to fresh factual knowledge, which is the most common case in Polish companies. Before starting training, check whether a good prompt with RAG context solves the problem more cheaply.

How much does fine-tuning a model cost?

Cost depends on model size, number of examples, and chosen technique. Training a small 7B model with LoRA on a few hundred examples takes hours on a GPU and has relatively low cloud costs. Large 70B+ models and full fine-tuning require weeks of engineering work plus infrastructure costs. Calculate your case using the inference calculator or discuss it as part of a pilot.

Does fine-tuning eliminate hallucinations?

No. Fine-tuning embeds style and behavior but doesn’t provide reliable factual memory. The model may “acquire” facts from training data but still hallucinates when asked about things outside it. RAG with citations and a confidence threshold (escalating to a human when no relevant fragment is found) is the main defense against hallucination in production systems.
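The confidence threshold mentioned above can be as simple as a cutoff on the retrieval score. In this sketch the `0.35` threshold, the action names, and the `(score, source)` tuple shape are assumptions for illustration; the right cutoff depends on your retriever and has to be tuned on real queries.

```python
def answer_or_escalate(scored_fragments, threshold=0.35):
    """Answer only when at least one fragment clears the relevance threshold.

    `scored_fragments` is a list of (similarity_score, source_id) tuples as
    returned by a retriever. Below the threshold the request is handed to a
    human instead of letting the model guess.
    """
    relevant = [(score, source) for score, source in scored_fragments if score >= threshold]
    if not relevant:
        return {"action": "handoff_to_human", "fragments": []}
    ordered = [source for _, source in sorted(relevant, reverse=True)]
    return {"action": "answer_with_citations", "fragments": ordered}


# Nothing relevant was found: escalate rather than answer.
print(answer_or_escalate([(0.12, "returns.md"), (0.20, "support.md")]))

# Two fragments clear the threshold: answer and cite them, best match first.
print(answer_or_escalate([(0.80, "pricing.md"), (0.10, "returns.md"), (0.50, "faq.md")]))
```

The point of the design is that "no relevant fragment" is treated as a distinct outcome with its own path, not as an empty context passed silently to the model.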

Can I train a model on customer data?

You can, but it requires legal caution. Personal data in the training set is subject to GDPR and likely requires a DPIA. After training, removing specific data from the model’s weights is technically difficult, complicating the right to erasure. We recommend a data audit with a lawyer before starting and choosing an architecture where PII for training stays in your infrastructure. The article AI Act and GDPR 2026 details obligations.

Where do I start if I want to implement fine-tuning?

Start by collecting good training pairs, not by choosing infrastructure. Identify 200-500 specific input-output examples that illustrate the expected model behavior. Immediately set aside 10-20% as a hold-out for evaluation. Only with this data ready should you plan infrastructure and timelines. The agent blueprint is helpful for mapping out the entire system architecture before diving into training details.
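Setting aside the hold-out can be done deterministically, so that an example never moves between training and evaluation as the dataset grows. The sketch below assigns each pair to a split by hashing its input. The `input`/`output` field names and the 15% fraction (within the 10-20% range above) are assumptions for the example.

```python
import hashlib


def split_of(example_input, holdout_fraction=0.15):
    """Assign an example to 'holdout' or 'train' based on a hash of its input."""
    digest = hashlib.sha256(example_input.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "holdout" if bucket < holdout_fraction else "train"


def split_pairs(pairs, holdout_fraction=0.15):
    """Split a list of {'input': ..., 'output': ...} dicts into (train, holdout)."""
    train, holdout = [], []
    for pair in pairs:
        if split_of(pair["input"], holdout_fraction) == "holdout":
            holdout.append(pair)
        else:
            train.append(pair)
    return train, holdout


# Synthetic placeholder pairs; real pairs come from your own process.
pairs = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(400)]
train, holdout = split_pairs(pairs)
print(len(train), "training pairs,", len(holdout), "hold-out pairs")
```

Because the split depends only on the input text, rerunning it after adding new examples leaves every existing example where it was, which keeps evaluation results comparable across model versions. Note that exact duplicates land in the same split, but near-duplicates do not, so deduplicate before splitting.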