How to choose an AI model for the task (not the hype)

The most common question is “which AI model is best?” That’s the wrong question—like “which car is best?” without specifying whether you’re hauling cement or racing on a track. The better question: which model for this specific task, given my cost and my data?

There’s no single best model

Models differ by profile, not by “intelligence in general.” One starts responding in 0.4 s but is a smaller model. Another has a million-token context window but answers slowly. A third excels at writing code but struggles with summarization. Choosing “one for everything” means overpaying for simple tasks and falling short on complex ones.

That’s why we don’t bet on one model—we use a router that has an entire fleet at its disposal and picks the right tool for the problem.

Start with the task, not the model

First, define the task, then pick the model class. In practice, a few categories suffice:

Chat / knowledge assistant — an instruct model with a good balance of quality and latency.
Reasoning — a “thinking” model (see below), deployed intentionally where decision accuracy matters.
Code — a programming-specialized model; throughput matters because responses can be long.
Fast / cheap / classification — a small, lightning-fast model for intent routing, tagging, field extraction.
Vision — a multimodal model that understands images and text together.
Summarization — a non-“thinking” model that condenses without philosophizing.

We maintain this task→model map as a concrete routing matrix: each task has a primary and a backup model. The model atlas shows which model handles what, and the “How we build it” section shows how we assemble them into ready systems.
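A routing matrix of this kind can be sketched as a plain lookup table with a fallback loop. This is a minimal illustration, not our production router: the model names, the `ROUTING_MATRIX` contents, and the `call_model` callable are all hypothetical placeholders.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical task -> (primary, backup) map; model names are placeholders.
ROUTING_MATRIX: dict[str, tuple[str, str]] = {
    "chat": ("instruct-medium", "instruct-small"),
    "reasoning": ("thinking-large", "instruct-medium"),
    "code": ("code-specialist", "instruct-medium"),
    "classification": ("small-fast", "instruct-small"),
    "vision": ("multimodal", "multimodal-backup"),
    "summarization": ("instruct-small", "instruct-medium"),
}


def route(task: str, prompt: str,
          call_model: Callable[[str, str], str]) -> tuple[str, str]:
    """Try the primary model for a task, then the backup.

    Returns (model_name, answer). `call_model(model, prompt)` is an
    assumed client function that raises on timeout or overload.
    """
    if task not in ROUTING_MATRIX:
        raise KeyError(f"no route defined for task {task!r}")
    last_error: Exception | None = None
    for model in ROUTING_MATRIX[task]:
        try:
            answer = call_model(model, prompt)
        except Exception as exc:  # any client failure moves us down the chain
            last_error = exc
            continue
        if answer.strip():  # an empty response counts as a failure too
            return model, answer
    raise RuntimeError(f"all models failed for task {task!r}") from last_error
```

Keeping the map as data rather than as branching code means a model swap is a one-line change that can be reviewed and re-tested on its own.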

Names mislead: measure

Model names suggest speed and quality that may not exist. “Flash,” “pro,” and “large” are marketing, not measurement. From our own benchmarks: a model with “flash” in its name can deliver 0.6 tokens per second (very slow), while a large “671B” model hits 4.5 tokens per second, about 7.5 times faster. Had we trusted the names, we would have picked the slower model.

That’s why we select every model by measurement: time to first token (TTFT), throughput (tokens/s), real context window, and whether the model actually returns content in a given mode (for example, with reasoning on or off). The numbers on the model pages come from a live router, not datasheets.
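As a rough sketch of how TTFT and throughput can be measured, the function below times any iterable of streamed text chunks. It assumes each chunk is roughly one token and that the stream comes from whatever client you use; `measure_stream` is a hypothetical helper, not part of any specific SDK.

```python
from __future__ import annotations

import time
from typing import Iterable


def measure_stream(chunks: Iterable[str]) -> dict[str, float]:
    """Consume a stream of text chunks and report TTFT and throughput.

    Assumption: one chunk is about one token. Replace the counter with a
    real tokenizer count if your client yields larger pieces.
    Throughput is computed over the total elapsed time, TTFT included.
    """
    start = time.perf_counter()
    first_at = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        if first_at is None:
            first_at = time.perf_counter()
        count += 1
    elapsed = time.perf_counter() - start
    if first_at is None:
        # The model returned no content at all in this mode.
        return {"ttft_s": float("inf"), "tokens_per_s": 0.0, "tokens": 0}
    return {
        "ttft_s": first_at - start,
        "tokens_per_s": count / elapsed if elapsed > 0 else 0.0,
        "tokens": count,
    }
```

A result with `tokens` equal to 0 is the signal for “does not return content in this mode,” which is worth recording alongside the speed numbers.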

Will the model “hold up”? Test it on a golden set

Technical metrics (TTFT, tokens/s, context window) tell you whether a model is fast and stable, but not whether it’s substantively good for your task. To settle that, build a golden set: a few dozen representative cases from your own data plus a clear acceptance metric (e.g., answer accuracy with a citation, correctness of field extraction). A model only qualifies as “can handle it” once it passes that gate, and you repeat the same gate on every model change to catch regressions. We cover how to measure this in AI agent evaluation and in our methodology.
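A golden-set gate can be sketched in a few lines. Everything here is illustrative: `GoldenCase`, `pass_rate`, `passes_gate`, and the 0.9 default threshold are hypothetical names and values, and `is_correct` stands in for whatever acceptance metric fits your task.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    prompt: str    # representative input from your own data
    expected: str  # reference answer used by the acceptance metric


def pass_rate(cases: Sequence[GoldenCase],
              model_fn: Callable[[str], str],
              is_correct: Callable[[str, str], bool]) -> float:
    """Fraction of golden cases the model answers acceptably."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for c in cases if is_correct(model_fn(c.prompt), c.expected))
    return passed / len(cases)


def passes_gate(cases: Sequence[GoldenCase],
                model_fn: Callable[[str], str],
                is_correct: Callable[[str, str], bool],
                threshold: float = 0.9) -> bool:
    """True if the model meets the (assumed) acceptance threshold."""
    return pass_rate(cases, model_fn, is_correct) >= threshold
```

Running the same `passes_gate` call with the same cases after every model change is what turns the golden set into a regression check.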

“Thinking” models: when they’re worth it

Some modern models are “thinking” models: they conduct internal reasoning before responding. This is powerful for tough decisions, but costly and slow for simple ones. Worse, forced into regular chat, they can burn the entire token budget on reasoning and return empty responses.

The rule is simple: enable reasoning mode only for tasks that truly require it (analysis, agent step planning, tough choices). For chat, translations, code, and summarization, keep it off: responses are faster and cheaper, and the token budget goes to the answer rather than to reasoning. The router handles this automatically.
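The rule can be expressed as a small allow-list. This is only a sketch: the task labels and the `reasoning` request field are hypothetical and do not correspond to any particular provider’s API.

```python
from __future__ import annotations

# Assumed task labels for which internal reasoning is worth its cost.
REASONING_TASKS = frozenset({"analysis", "agent_planning", "tough_decision"})


def reasoning_enabled(task: str) -> bool:
    """Reasoning is opt-in: off unless the task is on the allow-list."""
    return task in REASONING_TASKS


def build_request(task: str, prompt: str) -> dict[str, object]:
    """Build a request payload with a hypothetical `reasoning` flag."""
    return {"prompt": prompt, "reasoning": reasoning_enabled(task)}
```

Making reasoning opt-in rather than opt-out means a new, unlisted task type defaults to the faster and cheaper path.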

Cost and data also drive model choice

Selection isn’t just about quality:

Cost: the cloud bills GPU runtime, so a slower or larger model means a pricier response. For a sense of scale from our own measurements, the same output is produced roughly 13× faster by a small model (about 59 tok/s) than by a flagship one (about 4.5 tok/s). Assuming comparable GPU pricing, that is about 13× less GPU time and a proportionally lower cost. The cheapest model that can handle the task wins.
Sensitive data: if you process regulated data (GDPR), keep some processing local by computing embeddings in-house and masking PII before anything goes to the cloud. Masking reduces risk, but full compliance also depends on the legal basis, the processing location (transfers outside the EEA), and the data processing agreement with the provider. For especially sensitive content (contracts, health data), the context itself can be sensitive, not just names. We discuss how to set this up in self-hosted LLM and GDPR.
Reliability: a single model can be temporarily overloaded, which is why every task has a fallback chain rather than a single point of failure.
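The cost arithmetic above can be checked directly: GPU time is output tokens divided by throughput. The sketch below also picks the cheapest candidate that passed a golden-set gate. The `Candidate` fields, the per-second price, and the 1,000-token output are illustrative assumptions, not measured billing data.

```python
from __future__ import annotations

from dataclasses import dataclass


def gpu_seconds(output_tokens: int, tokens_per_s: float) -> float:
    """GPU time needed to generate the output at a given throughput."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return output_tokens / tokens_per_s


@dataclass(frozen=True)
class Candidate:
    name: str               # placeholder model name
    tokens_per_s: float     # measured throughput
    price_per_gpu_s: float  # assumed billing rate, arbitrary currency
    passes_gate: bool       # result of the golden-set check


def cheapest_capable(candidates: list[Candidate], output_tokens: int) -> Candidate:
    """Return the lowest-cost candidate among those that passed the gate."""
    capable = [c for c in candidates if c.passes_gate]
    if not capable:
        raise ValueError("no candidate passed the golden-set gate")
    return min(
        capable,
        key=lambda c: gpu_seconds(output_tokens, c.tokens_per_s) * c.price_per_gpu_s,
    )


if __name__ == "__main__":
    small = gpu_seconds(1000, 59.0)
    flagship = gpu_seconds(1000, 4.5)
    # prints: small: 16.9 s, flagship: 222.2 s, ratio: 13.1x
    print(f"small: {small:.1f} s, flagship: {flagship:.1f} s, "
          f"ratio: {flagship / small:.1f}x")
```

Note that the fastest model is not selected if it fails the gate: capability filters first, cost decides second.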

Quick decision table

Your problem	Model class	What matters most
Customers can’t find answers	chat + RAG	quality, naturalness, citations
Need to make a tough decision	reasoning (thinking)	accuracy, context window
Code generation / refactoring	code	throughput, long output
Routing, tagging, extraction	fast / small	TTFT and tokens/s, low cost
Image/document analysis	vision (multimodal)	image + text understanding
Shortening long content	summarization	speed, no “philosophizing”
Is this model good enough?	any class	golden-set result + acceptance metric

If you want to walk through this with specifics for your case, we have an interactive stack selector: answer a few questions and get a layer-by-layer recommendation, including models.

Try it live

The example below runs a model through our secure sandbox—the same one used in the playground: PII masked, zero retention, same limits. Ask a question about model selection and see the response.


FAQ

Which AI model is best for a business?

None alone. The best is a router that assigns the cheapest model capable of handling each task. Chat, reasoning, code, vision, and summarization have different profiles, so they call for different models. Choosing “one for everything” either overpays for simple tasks or falls short on complex ones.

How do I know if a model fits the task?

By measurement, not by name. Check time to first token, throughput (tokens/s), real context window, and whether the model returns content in the given mode. Names like “flash” or “large” can be misleading: in our benchmarks, a “flash” model was slower than a large one.

When should I use “thinking” (reasoning) models?

Only for tasks that truly require reasoning: analysis, planning, tough decisions. For chat, translations, code, and summarization, disable reasoning mode; leaving it on is slower, more expensive, and can return empty responses when the task doesn’t need it.

Can I use one model to keep it simple?

You can, but it rarely pays off. One model for everything means overpaying for simple tasks and compromising quality on complex ones. A router with multiple models is cheaper and more reliable, and the complexity is handled by the routing layer, not by you.