We regularly see the same pattern: a company hooks up one large cloud-based model to every operation, from simple address concatenation to contract analysis. It works as long as the bill is small. Then traffic grows, latency accumulates, and costs scale linearly with query volume. The problem is almost never that the model is “too weak.” It’s that the same model handles both tasks that a model ten times cheaper could do and tasks that actually require the power. Matching the model to the task is the cheapest lever in architecture, and the one most often overlooked.
Three axes of trade-off: size, latency, cost
Every model choice is a negotiation between three variables that can’t be maximized simultaneously:
- Size (quality): Larger models handle ambiguity, multi-step reasoning, and long context better. But “better” doesn’t mean “needed”: for tasks with a single correct answer, excess power is wasted budget.
- Latency: Larger models respond more slowly. In live chat, users feel every second; in overnight batch processing, latency is irrelevant. Where the task runs completely changes the priority of this axis.
- Cost of inference: In the cloud, it grows with input/output tokens and model class; with self-hosting, it’s hardware amortization spread over volume. Switching to a small model often changes unit cost by an order of magnitude.
The principle we apply: Don’t ask “which model is best,” but “what’s the lowest quality threshold at which the task still works correctly”—and choose the cheapest model above that threshold.
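The selection rule can be written down in a few lines. The sketch below is only an illustration: the `Candidate` type, the model names, and every number are hypothetical, and `measured_quality` is assumed to come from your own evaluation set rather than from a vendor benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    relative_cost: float     # cost per call relative to the smallest model (1.0)
    measured_quality: float  # accuracy on your own evaluation set, 0..1


def cheapest_above_threshold(candidates, threshold):
    """Return the cheapest candidate that meets the quality threshold, or None."""
    eligible = [c for c in candidates if c.measured_quality >= threshold]
    return min(eligible, key=lambda c: c.relative_cost, default=None)


# Illustrative numbers only; replace them with your own measurements.
candidates = [
    Candidate("small", 1.0, 0.93),
    Candidate("medium", 3.0, 0.96),
    Candidate("large", 10.0, 0.98),
]

choice = cheapest_above_threshold(candidates, threshold=0.95)
print(choice.name if choice else "no model meets the threshold")
```

With a threshold of 0.95 the small model is rejected and the medium one is picked, even though the large one scores higher. If nothing qualifies, the function returns `None`, which is a signal to revisit the task definition rather than to lower the bar silently.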
Task-model matrix: where to start
Most business tasks fit into a few archetypes. The matrix below is a starting point, not gospel: the thresholds depend on your data and required accuracy. Cost ranges are relative (small model = 1× baseline):
| Task Type | Required Size | Latency Priority | Relative Cost | Comment |
|---|---|---|---|---|
| Field extraction (data-extraction) | Small | Medium | 1× | Output has structure—small model + schema validation suffices |
| Classification / ticket routing | Small | High | 1× | Categories are finite; a large model is overpaying |
| Chat / customer service | Medium | Very High | 2–4× | Fluidity and tone matter; speed trumps maximum quality |
| Document summarization | Medium | Low | 2–4× | Often batched, so latency is secondary |
| Multi-step reasoning / analysis | Large | Low | 8–15× | Here, a large model pays off—errors cost more than tokens |
| Code generation / complex logic | Large | Medium | 8–15× | Requires consistency over long context |
Practical observation from deployments: The first two rows—extraction and classification—often account for 60–80% of all calls in a typical system. If everything goes through a large model, most of the bill funds work that doesn’t need that power.
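To see what that observation means for the bill, here is a rough back-of-the-envelope calculation. The traffic shares and cost multipliers are assumptions picked from inside the ranges in the matrix (medium = 3×, large = 10×, extraction plus classification = 70% of calls); they are not measurements from any specific deployment.

```python
# Hypothetical relative cost per call for each model class (small = 1x baseline).
COST = {"small": 1.0, "medium": 3.0, "large": 10.0}

# Declarative task -> model class matrix, following the table above.
MATRIX = {
    "extraction": "small",
    "classification": "small",
    "chat": "medium",
    "summarization": "medium",
    "reasoning": "large",
}

# Assumed share of calls per task type; the shares sum to 1.0.
TRAFFIC = {
    "extraction": 0.40,
    "classification": 0.30,
    "chat": 0.15,
    "summarization": 0.05,
    "reasoning": 0.10,
}


def cost_per_call(traffic, model_for_task):
    """Average relative cost of one call for a given traffic mix and routing."""
    return sum(share * COST[model_for_task[task]] for task, share in traffic.items())


all_large = cost_per_call(TRAFFIC, {task: "large" for task in TRAFFIC})
routed = cost_per_call(TRAFFIC, MATRIX)
saving = 1 - routed / all_large
print(f"all-large: {all_large:.2f}x, routed: {routed:.2f}x, saving: {saving:.0%}")
```

Under these assumed numbers the average call drops from 10× to about 2.3× the baseline. The exact figure will differ for your traffic mix, but the shape of the result is the point: the cheap, frequent tasks dominate the total.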
When a small model is truly enough
Small models are often treated as a “budget” choice. That’s a misunderstanding. A small model is the right choice when the task has a narrowly defined goal and verifiable output:
- Output has a verifiable structure. If you expect JSON with a known schema, structured output plus validation catches malformed output regardless of model size. A small model plus a validator plus one repair attempt is often cheaper than, and just as reliable as, a large model without validation.
- Domain is narrow. Classifying tickets into 8 categories doesn’t require world knowledge; it requires distinguishing 8 patterns. A small model steered by a well-tuned prompt does this well. More in the article on classification and ticket routing.
- Task is repetitive and batched. Processing 50,000 records with a small model instead of a large one can mean the difference between an acceptable bill and one that kills the project.
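The “small model + validator + one fix” idea from the first point can be sketched as follows. Everything named here is hypothetical: `call_model` stands for whatever client function you use, and `REQUIRED_FIELDS` is a made-up invoice schema checked by hand with the standard library instead of a schema library.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float)}


def validate(raw):
    """Return (data, None) if raw is JSON matching the schema, else (None, error)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            return None, f"missing field: {name}"
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(data[name], bool) or not isinstance(data[name], expected):
            return None, f"wrong type for field: {name}"
    return data, None


def extract_with_one_fix(call_model, prompt):
    """Call the model, validate the reply, and allow exactly one repair attempt."""
    data, error = validate(call_model(prompt))
    if data is not None:
        return data
    repair_prompt = (
        f"{prompt}\n\nYour previous answer was rejected ({error}). "
        "Return only valid JSON."
    )
    data, error = validate(call_model(repair_prompt))
    if data is None:
        raise ValueError(f"output still invalid after one fix: {error}")
    return data


# Stub standing in for a small model: the first reply is incomplete, the second is valid.
replies = iter([
    '{"invoice_number": "FV/1"}',
    '{"invoice_number": "FV/1", "total": 120.5}',
])
result = extract_with_one_fix(lambda prompt: next(replies), "Extract invoice fields.")
print(result)
```

The validator only checks shape and types. Whether the extracted values are actually right is a separate question that needs measurement on labeled examples.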
Where small models fail: Tasks requiring inference from multiple premises, maintaining consistency over long documents, or handling true ambiguity. There, saving on the model backfires with errors that someone must fix manually. The boundary is set by measurement, not intuition—see monitoring AI agent quality.
Self-hosted vs. cloud
This is a question of economics and data, not ideology. Both make sense under different conditions:
| Criterion | Cloud (API) | Self-hosted |
|---|---|---|
| Entry cost | Zero | Hardware + setup upfront |
| Unit cost at low volume | Lower | Higher (hardware doesn’t amortize) |
| Unit cost at high, steady volume | Higher | Lower and predictable |
| Data residency | Data leaves the company (unless masked) | Data stays in the corporate network |
| Access to flagship models | Immediate | Limited to open-weight models |
At low or variable traffic, the cloud wins: there is no entry cost and you pay for what you use. At steady, high load or with sensitive data, self-hosting starts to win on cost and ensures data doesn’t leave the infrastructure. The crossover point depends on volume, so we calculate it from real load, not from the hardware’s peak capacity. We expand on data privacy in self-hosted LLM and GDPR (RODO), and on choosing a vector database in how to choose a vector database.
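The crossover point itself is simple arithmetic once the inputs are known. The sketch below assumes a deliberately simplified cost model: cloud cost is purely per call, and self-hosting is a fixed monthly amount (hardware amortization plus operations) plus a small marginal cost per call. All figures are hypothetical placeholders.

```python
def breakeven_calls_per_month(fixed_monthly, cloud_cost_per_call, selfhost_cost_per_call):
    """Monthly call volume above which self-hosting is cheaper, or None if it never is."""
    margin = cloud_cost_per_call - selfhost_cost_per_call
    if margin <= 0:
        return None  # the cloud is at least as cheap per call, so there is no crossover
    return fixed_monthly / margin


def monthly_costs(calls, fixed_monthly, cloud_cost_per_call, selfhost_cost_per_call):
    """Return (cloud, self_hosted) monthly cost for a given real call volume."""
    cloud = calls * cloud_cost_per_call
    self_hosted = fixed_monthly + calls * selfhost_cost_per_call
    return cloud, self_hosted


# Hypothetical inputs: 3000 per month fixed, 0.004 vs 0.001 per call.
threshold = breakeven_calls_per_month(3000.0, 0.004, 0.001)
print(f"break-even at roughly {threshold:,.0f} calls per month")

for calls in (200_000, 2_000_000):
    cloud, self_hosted = monthly_costs(calls, 3000.0, 0.004, 0.001)
    cheaper = "cloud" if cloud < self_hosted else "self-hosted"
    print(f"{calls:>9,} calls: cloud={cloud:,.0f} self-hosted={self_hosted:,.0f} -> {cheaper}")
```

Feed it the volume you actually observe, not the throughput the hardware could sustain in theory. A machine sized for peak that idles most of the month pushes the real break-even far higher.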
In practice, most mature deployments are hybrid: a small local model for extraction and classification (cheap, private, fast), a flagship cloud model only for tasks that truly need it. On combining this with a knowledge base, see company GPT on a knowledge base.
Router pattern: single entry, per-task decision
Instead of assigning models rigidly, we route all calls through one LLM router. This single entry point to the model layer decides for each task where to direct it, and delivers three things at once:
- Model selection per task. Classification goes to a small model, contract analysis to a large one. The decision is declarative (task→model matrix), not scattered across code.
- Fallback and degradation. If the target model is unavailable or overloaded, the router switches to a backup model instead of returning an error to the user.
- Telemetry and cost control. Since everything passes through one point, we measure cost and latency per task, see where money actually goes, and can set budgets and overload limits.
The router is also the natural place for masking personal data before the cloud and enforcing the rule “structured output always validated.” Without this layer, every model change means rewriting code in multiple places; with it—changing one rule. That’s why the router is the first thing we set up in any multi-model system, not an afterthought.
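A minimal sketch of such a router, under stated assumptions: the route table, the model names, and the stub backends are all hypothetical, and each backend is assumed to be a plain callable that takes a prompt and either returns text or raises. A production router would add timeouts, budgets, masking, and output validation on top of this skeleton.

```python
import time
from dataclasses import dataclass, field

# Declarative task -> models table: first entry is the target, the rest are fallbacks.
ROUTES = {
    "classification": ["small-local", "medium-cloud"],
    "contract_analysis": ["large-cloud", "medium-cloud"],
}


@dataclass
class Router:
    backends: dict  # model name -> callable(prompt) -> str
    routes: dict    # task type -> ordered list of model names
    telemetry: list = field(default_factory=list)

    def call(self, task_type, prompt):
        """Send the prompt to the first model on the route that answers."""
        if task_type not in self.routes:
            raise KeyError(f"no route for task type: {task_type}")
        last_error = None
        for model in self.routes[task_type]:
            started = time.perf_counter()
            try:
                reply = self.backends[model](prompt)
            except Exception as exc:  # unavailable or overloaded: try the next model
                last_error = exc
                self._record(task_type, model, started, ok=False)
                continue
            self._record(task_type, model, started, ok=True)
            return reply
        raise RuntimeError(f"all models failed for {task_type}") from last_error

    def _record(self, task_type, model, started, ok):
        """Append one telemetry row per attempt, successful or not."""
        self.telemetry.append({
            "task": task_type,
            "model": model,
            "seconds": time.perf_counter() - started,
            "ok": ok,
        })


def unavailable(prompt):
    raise ConnectionError("model overloaded")


# Stub backends standing in for real model clients.
router = Router(
    backends={
        "small-local": lambda prompt: "billing",
        "medium-cloud": lambda prompt: "analysis from the backup model",
        "large-cloud": unavailable,
    },
    routes=ROUTES,
)

print(router.call("classification", "Where is my invoice?"))
print(router.call("contract_analysis", "Review clause 7."))
for row in router.telemetry:
    print(row["task"], row["model"], row["ok"])
```

Swapping a model means editing one entry in `ROUTES`; the calling code only ever names the task type. The telemetry rows are the raw material for per-task cost and latency reports.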
FAQ
Isn’t it simpler to use one large model for everything?
Simpler at the start, more expensive at scale. One large model for every operation yields a bill that grows linearly with traffic and higher latency where it’s not needed. A router matching the model to the task is usually the biggest single cost saver, as it shifts 60–80% of calls to a model an order of magnitude cheaper.
How do you know a small model is enough?
By measurement, not gut feeling. We collect a set of real task examples, run them through small and large models, and compare accuracy on a blind sample. If the small model meets the quality threshold for a task with verifiable output (extraction, classification), it’s sufficient, and the cost and latency difference is in its favor.
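The comparison described above might look like this in its simplest form. The examples, labels, threshold, and both predictor stubs are invented for illustration; in practice the predictors would wrap real model calls, and the examples would be a held-out sample that was not used while tuning prompts.

```python
def accuracy(predict, examples):
    """Share of examples where the predicted label equals the expected label."""
    correct = sum(1 for text, expected in examples if predict(text) == expected)
    return correct / len(examples)


def compare_models(small_predict, large_predict, examples, threshold):
    """Measure both models on the same sample and check the small one against the bar."""
    small_acc = accuracy(small_predict, examples)
    large_acc = accuracy(large_predict, examples)
    return {
        "small": small_acc,
        "large": large_acc,
        "small_is_enough": small_acc >= threshold,
    }


# Hypothetical labeled sample of support tickets.
examples = [
    ("I was charged twice", "billing"),
    ("Cannot log in", "access"),
    ("Refund my last invoice", "billing"),
    ("Password reset link is broken", "access"),
]


def small_stub(text):
    # Stand-in for a small model: only recognizes the word "invoice".
    return "billing" if "invoice" in text.lower() else "access"


def large_stub(text):
    # Stand-in for a large model: recognizes more billing vocabulary.
    lowered = text.lower()
    return "billing" if "invoice" in lowered or "charged" in lowered else "access"


report = compare_models(small_stub, large_stub, examples, threshold=0.9)
print(report)
```

The decision depends only on whether the small model clears the threshold, not on whether the large one scores higher. A real sample needs far more than four examples for the accuracy numbers to mean anything.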
Is self-hosting always cheaper?
No. At low or variable volume, the cloud is cheaper because there’s no entry cost, while self-hosted hardware has to be amortized over traffic. Self-hosting wins at steady high load or when data can’t leave the company. The crossover point is calculated from real volume, not from the hardware’s peak capacity.
How does the router decide where to direct the task?
The simplest way is declaratively: a task→model matrix assigns each task type to a model class, and the router reads the type from call metadata. More advanced variants assess the difficulty of a specific query and escalate to a larger model only for hard cases. We start with the declarative variant because it’s predictable and easy to audit.
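For the escalating variant, one possible shape is shown below. It rests on a strong assumption that is not always available: that the small model can return some usable confidence score alongside its answer. The stubs, the 0.8 cutoff, and the word-count heuristic are purely illustrative.

```python
def answer_with_escalation(small_model, large_model, query, min_confidence=0.8):
    """Try the small model first; escalate when its confidence is below the cutoff."""
    answer, confidence = small_model(query)
    if confidence >= min_confidence:
        return {"answer": answer, "model": "small"}
    return {"answer": large_model(query), "model": "large"}


def small_stub(query):
    # Stand-in for a small model: pretend short queries are easy, long ones uncertain.
    if len(query.split()) <= 5:
        return "billing", 0.95
    return "billing", 0.40


def large_stub(query):
    # Stand-in for a large model call.
    return "legal"


easy = answer_with_escalation(small_stub, large_stub, "Where is my invoice?")
hard = answer_with_escalation(
    small_stub,
    large_stub,
    "The invoice references a clause we disputed in the contract",
)
print(easy)
print(hard)
```

This variant pays for the small model call even when it escalates, and its behavior is harder to audit than a static table, which is part of why the declarative matrix is the safer starting point.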
Will changing the model in the future be costly?
If models are hardcoded in multiple places, yes. If everything goes through a router, changing the model is a single rule change, not a refactor. That’s why we set up the router layer from the start: the model market changes faster than your architecture should, and the router isolates the system from those changes.