This is one of the first questions when deploying AI in a company: how to make the model respond based on your knowledge, not general information. There are two paths—and most often, they’re confused or the more expensive one is chosen unnecessarily.
RAG: search first, then answer
#RAG (retrieval-augmented generation) first retrieves relevant snippets from your database, then instructs the model to answer only based on them, with citations. Knowledge lives outside the model—in a vector database—so:
- you update data without retraining the model,
- answers include citable sources (fewer hallucinations),
- if the match is weak, the system escalates to a human instead of fabricating.
Embeddings are computed locally with the BGE-M3 model, so the content to be embedded never leaves your infrastructure.
Fine-tuning: change the model’s behavior
#Fine-tuning further trains the model on your examples and alters its weights—it cements style, tone, and output format. This is powerful when you need a consistent "voice" or a highly specific format that prompting alone can’t enforce. But it’s costly and not suited for fresh facts: new knowledge would require another training cycle.
When to use which
#| Criterion | RAG | Fine-tuning |
|---|---|---|
| Fresh/up-to-date data | yes | no |
| Deployment cost | low | high |
| Update without retraining | yes | no |
| Style/behavior control | partial | full |
| Hallucination risk | low | medium |
| Citable sources | yes | no |
Rule of thumb: if the problem is access to knowledge (customers can’t find answers)—use RAG. If the problem is consistent style/format—use fine-tuning. Often, the optimal solution is a hybrid: RAG brings in facts, light fine-tuning cements the voice. Walk through the specifics in the decision tree.
What we build with RAG
#RAG is the foundation of Concierge RAG—an assistant on your knowledge with citations, multilingual support, and escalation to humans. The same pattern powers multilingual help desks and document intelligence.
Try it live
#Paste your own text and ask a question—you’ll see RAG with live citations (same sandbox as in the playground: zero retention, PII masked).
FAQ
#RAG or fine-tuning—what to choose at the start?
#Most often, RAG. It’s cheaper, updatable without retraining, and provides citable sources. Fine-tuning makes sense when you need a permanent change in style or format, not fresh facts. Many deployments start with RAG and add light fine-tuning only when a consistent "voice" is critical.
Does RAG require sending data to the cloud?
#Not necessarily. We keep embeddings and search local (BGE-M3 + Qdrant), and only a masked prompt—without PII—goes to the cloud. Sensitive data and entire on-prem deployments never leave your infrastructure.
Does fine-tuning reduce hallucinations?
#Not like RAG. Fine-tuning cements style, but the model can still "fabricate" when it lacks facts. RAG with citations and a confidence threshold (escalation to a human when the match is weak) is the primary defense against hallucination.