The pilot ran for three weeks without major issues. The assistant responded in 1-2 seconds, customers didn’t complain, and results on the golden set looked promising. Then someone suggested a "minor" model update from the provider and tweaked a few sentences in the system prompt. The deploy went live on Friday afternoon. By Monday morning, the share of queries escalated to consultants had jumped from 8% to 23%, and token costs had risen by 40%. No one knew which change caused it, because both shipped in the same deploy.
This is a classic scenario we see regularly at Cashcrown. LLMOps—the operational layer for LLM applications—exists precisely to prevent such Fridays.
Why prototype and production are different systems
In a prototype, the model is the star. You check if it can answer your target questions, test a dozen prompts, and marvel at the results. In production, the model becomes one component of a system that must run stably for months, handle unexpected queries, and cost what you budgeted.
The difference isn’t in model quality. It’s in the layer that manages it.
| Dimension | Prototype | Production |
|---|---|---|
| Versioning | prompt in a notebook | artifact with version number and regression test |
| Change validation | a few examples checked by hand | golden set with automated gate |
| Deployment | prompt swap in code | shadow or canary with pass criteria |
| Monitoring | check when something breaks | alerts before the customer complains |
| Rollback | "we’ll revert manually" | version switch in under a minute |
| Deploy decision | intuition | gate result plus human approval |
You can skip any of these in a pilot. In production, every skip is a risk. We break down the difference between pilot and production in detail in the article on moving from AI pilot to production.
What to version and how to bundle into one artifact
The biggest mistake we see is versioning the prompt separately from the model. If the prompt changes by two sentences and the model updates from gpt-4o-2024-05-13 to a newer identifier, you effectively have two changes at once. When something breaks, you don’t know what to revert.
The solution is simple: the entire configuration gets one version number. Change anything inside, and it’s a new version, a new gate run, a new human decision. The version artifact includes:
- full text of the system prompt and templates (not a file link, but the content)
- exact model identifier (no `latest`, no generic name)
- generation parameters: temperature, top-p, token limit
- output schema if using structured output
- version of the knowledge base or RAG index the system works with
- routing rules if the model is selected dynamically by an inference router
Every response logged in production carries this version number. This lets you answer in minutes whether degradation started before or after a given deploy. Without it, the question turns into a multi-hour investigation. We detail this pattern in the article on versioning prompts and models.
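The bundle described above can be sketched as a single immutable record. This is a minimal illustration, not a prescribed format: the class name `VersionArtifact`, its field names, the `fingerprint` helper, and the `log_record` function are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class VersionArtifact:
    """Everything that influences model behavior, bundled under one version."""
    version: str                      # hypothetical label, e.g. "12"
    system_prompt: str                # full text, not a link to a file
    model_id: str                     # exact identifier, never "latest"
    temperature: float
    top_p: float
    max_tokens: int
    output_schema: Optional[dict] = None
    kb_index_version: Optional[str] = None
    routing_rules: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Short hash of the content (version label excluded).

        Two artifacts with the same label but different fingerprints
        mean something changed without a version bump.
        """
        payload = asdict(self)
        payload.pop("version")
        canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def log_record(artifact: VersionArtifact, request_id: str, response_text: str) -> dict:
    """Build a log entry that carries the artifact version with every response."""
    return {
        "request_id": request_id,
        "artifact_version": artifact.version,
        "artifact_fingerprint": artifact.fingerprint(),
        "response": response_text,
    }
```

Storing a content hash next to the version label is one possible design choice: it makes an edit that skipped the version bump visible in the logs.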
Golden set and regression gate
The golden set is a collection of representative queries with labeled criteria for good responses. The regression gate is the rule: no version artifact reaches production until it passes this set with results no worse than the previous version.
We start building the golden set with 80-150 real cases from logs, not made-up examples. It’s critical to include edge cases, difficult queries, and those that historically caused issues—they regress first after a prompt change.
For each case, we define acceptance criteria. Rarely is it exact text matching; more often, it’s a set of facts that must appear, a format to adhere to, or a list of things that shouldn’t be in the response. Automated evaluation runs on schema, regex, or an LLM-as-judge calibrated against human ratings.
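As a rough sketch of the rule-based part of such a check, the snippet below tests required facts, forbidden content, and format with plain substring and regex matching. The `Criteria` class and `check_response` function are hypothetical names, and substring matching is a deliberately crude stand-in; an LLM-as-judge step is not shown.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Criteria:
    must_include: list = field(default_factory=list)      # facts that must appear
    must_not_include: list = field(default_factory=list)  # content that must not appear
    format_regex: Optional[str] = None                     # format the response must match


def check_response(response: str, criteria: Criteria) -> list:
    """Return a list of failure reasons; an empty list means the case passes."""
    failures = []
    text = response.lower()
    for fact in criteria.must_include:
        if fact.lower() not in text:
            failures.append(f"missing: {fact}")
    for banned in criteria.must_not_include:
        if banned.lower() in text:
            failures.append(f"forbidden: {banned}")
    if criteria.format_regex and not re.search(criteria.format_regex, response):
        failures.append("format mismatch")
    return failures
```

Returning the reasons rather than a bare boolean makes it easier to show which cases worsened and why.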
Gate results aren’t just an aggregate number. The report shows scores per query category, examples that worsened compared to the previous version, and delta against the baseline. A 3-4 percentage point drop in one category can indicate real degradation for that customer group, even if the overall score looks stable.
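A per-category comparison of this kind might look like the sketch below. The function names and the `max_drop` tolerance are assumptions for illustration; the default of 0.0 mirrors the "no worse than the previous version" rule.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (category, passed) pairs, one per golden-set case."""
    totals, passed = defaultdict(int), defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {c: passed[c] / totals[c] for c in totals}


def regression_gate(baseline, candidate, max_drop=0.0):
    """Compare candidate to baseline per category.

    The gate fails if any baseline category drops by more than max_drop
    (a fraction, so 0.03 means 3 percentage points).
    """
    base, cand = pass_rates(baseline), pass_rates(candidate)
    deltas = {c: cand.get(c, 0.0) - base[c] for c in base}
    regressions = {c: d for c, d in deltas.items() if d < -max_drop}
    return {"passed": not regressions, "deltas": deltas, "regressions": regressions}
```

The report keeps the per-category deltas even when the gate passes, so the reviewer sees where scores moved.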
We cover the method for building golden sets and selecting metrics in the article on how to evaluate a RAG system. The principles are identical for agentic systems.
A human decision is mandatory at this stage. The automated gate says "scores didn’t drop," but it doesn’t assess whether prompt changes align with product intent. That’s a human call, not a rule.
Shadow and canary: safe deployment
Offline regression testing is necessary but insufficient. Real traffic has distributions the golden set doesn’t fully capture. That’s why major changes (model updates or prompt structure overhauls) go through shadow or canary.
Shadow runs the new version in parallel with the old one on real traffic. The customer gets the old version’s response; the new one generates results only for comparison. Zero exposure, full data on behavior with real queries. Shadow runs for at least 500-1000 queries or a few days, depending on traffic volume.
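The shadow pattern can be sketched in a few lines. Here `live_fn` and `shadow_fn` are hypothetical callables that wrap the old and new version, and `comparison_log` is any list-like sink. In a real service the shadow call would normally run asynchronously so it adds no latency for the customer; the sketch keeps it synchronous for readability.

```python
def answer_with_shadow(query, live_fn, shadow_fn, comparison_log):
    """Serve the live version; run the candidate on the same query for comparison only."""
    live_answer = live_fn(query)
    record = {"query": query, "live": live_answer, "shadow": None, "shadow_error": None}
    try:
        record["shadow"] = shadow_fn(query)
    except Exception as exc:
        # A failing candidate must never affect what the customer receives.
        record["shadow_error"] = repr(exc)
    comparison_log.append(record)
    return live_answer
```

The customer only ever sees `live_answer`; the logged pairs are what gets evaluated afterwards.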
Canary routes a portion of traffic (usually 5-10%) to the new version and compares quality and business metrics between the two groups. Pass thresholds are defined upfront: the new version takes all traffic only if it doesn’t degrade key metrics within a set time window. The decision rests with a human, not a script.
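One common way to split traffic for a canary is deterministic hashing, so the same user always lands in the same group. The sketch below assumes a stable `user_id` is available; the function name and the 5% default are illustrative, not taken from a specific router.

```python
import hashlib


def assign_version(user_id: str, canary_percent: int = 5) -> str:
    """Map a user to 'canary' or 'stable' deterministically.

    The hash spreads users evenly over 100 buckets; the first
    canary_percent buckets receive the new version.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Sticky assignment keeps a single conversation on one version, which makes the comparison between the two groups cleaner.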
Both methods cost extra, because you run two versions side by side, and shadow in particular doubles the model calls. For small prompt tweaks, offline regression testing is usually enough; shadow and canary are reserved for changes where the risk of silent degradation is real. The same logic applies to guardrails: changes to safety rules should also go through shadow before full deployment, as side effects are hard to predict offline.
Production monitoring: quality, cost, latency
After deployment, continuous oversight begins. You measure three dimensions together because each alone is incomplete.
Quality is the percentage of responses meeting golden set criteria on a production sample, plus indirect signals: escalation rate to humans, the share of "I don’t know" responses, and user ratings where collected. An increase of a few percentage points in the escalation rate is usually the first sign of drift, before anything else flags an issue.
Cost is measured per event, not as an aggregated monthly bill. The monthly bill grows with traffic and won’t tell you whether one channel has started generating responses that cost twice as much without adding value. Cost per event is the number where anomalies are immediately visible. We detail the approach to LLM cost optimization in the article on LLM cost monitoring.
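A minimal sketch of the calculation, assuming an "event" is one handled customer query that may span several model calls. The record fields (`event_id`, `channel`, `input_tokens`, `output_tokens`) and the per-1k-token prices are hypothetical.

```python
from collections import defaultdict


def cost_per_event_by_channel(calls, price_in_per_1k, price_out_per_1k):
    """Average cost of one event, broken down by channel.

    calls: list of dicts with event_id, channel, input_tokens, output_tokens.
    """
    event_cost = defaultdict(float)
    event_channel = {}
    for call in calls:
        cost = (call["input_tokens"] / 1000) * price_in_per_1k
        cost += (call["output_tokens"] / 1000) * price_out_per_1k
        event_cost[call["event_id"]] += cost  # sum all calls belonging to one event
        event_channel[call["event_id"]] = call["channel"]

    per_channel = defaultdict(list)
    for event_id, cost in event_cost.items():
        per_channel[event_channel[event_id]].append(cost)
    return {ch: sum(costs) / len(costs) for ch, costs in per_channel.items()}
```

Grouping by channel is what exposes the case where one channel becomes more expensive while the total bill still looks proportional to traffic.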
Latency is the p50 and p95 distribution, not the average. The average hides the long tail of slow responses that ruin the experience for customers with tough questions. Observability in an LLM system means logging every call with timestamps, model, input/output token counts, and artifact version number.
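To see why the average misleads, here is a small sketch that uses the nearest-rank definition of a percentile (one of several common definitions). The `percentile` helper and the sample latencies are illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 18 fast responses and 2 slow ones, in seconds
latencies = [1.0] * 18 + [9.0, 12.0]
mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

With this sample the mean is just under 2 seconds and p50 is 1 second, while p95 is 9 seconds: the slow tail only becomes visible in the upper percentile.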
How to choose metrics and alert thresholds to avoid drowning in false alarms is covered in a separate article on monitoring AI agent quality.
Rollback: predefined return path
Rollback isn’t an emergency plan. It’s a mandatory part of every deploy. Before the new version goes live, you have the answer ready: "What do we do if metrics drop below the threshold in an hour?"
The answer is simple because the version artifact is one package. Reverting is switching the pointer to the previous version, not rewriting the prompt from memory. It takes under a minute. The previous version must be ready to run, not just saved in git history.
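The pointer switch can be illustrated with a small in-memory registry. `ArtifactRegistry` and its methods are hypothetical; a real setup would persist this state, but the idea is the same: every released artifact stays ready to run, and rollback only moves the pointer.

```python
class ArtifactRegistry:
    """Keeps every released artifact ready to serve and tracks which one is active."""

    def __init__(self):
        self._artifacts = {}  # version -> full artifact, ready to run
        self._history = []    # promotion order; the last entry is the active version

    def register(self, version, artifact):
        self._artifacts[version] = artifact

    def promote(self, version):
        if version not in self._artifacts:
            raise KeyError(f"unknown version: {version}")
        self._history.append(version)

    def active(self):
        if not self._history:
            raise LookupError("no version has been promoted yet")
        version = self._history[-1]
        return version, self._artifacts[version]

    def rollback(self):
        """Switch the pointer back to the previously active version."""
        if len(self._history) < 2:
            raise LookupError("no previous version to roll back to")
        self._history.pop()
        return self.active()
```

Nothing is rebuilt during rollback, which is what keeps it a matter of seconds rather than a redeploy.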
We define three rollback triggers upfront:
- escalation rate exceeds the threshold by more than N percentage points for M minutes
- cost per event rises above the limit for a defined window
- p95 latency exceeds SLA for a continuous period
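The three triggers share one shape: a metric stays above its limit for a sustained window. A minimal sketch, assuming each metric is sampled at a fixed interval; the metric names, limits, and window length below are placeholders, not recommended values.

```python
def sustained_breach(samples, limit, window):
    """True if the most recent `window` samples all exceed `limit`."""
    if len(samples) < window:
        return False
    return all(value > limit for value in samples[-window:])


def fired_triggers(metrics, limits, window):
    """Return the names of metrics that breached their limit for the whole window.

    The result is input for a human rollback decision, not an automatic action.
    """
    return [
        name
        for name, samples in metrics.items()
        if sustained_breach(samples, limits[name], window)
    ]
```

Requiring a full window of breaches filters out single spikes, which helps keep the alerts meaningful.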
When a trigger fires, the rollback decision is human, but the mechanism is ready. No silent model swaps, no deploys without gates, no "we’ll see how it goes." Every change has an owner and a return path.
FAQ
What’s the difference between MLOps and classic DevOps for AI applications?
Classic DevOps ensures code correctness: unit tests, integration tests, code review. MLOps for LLMs ensures correct model behavior, which doesn’t follow directly from the code. A prompt can be syntactically correct yet behave differently after a one-sentence change. That’s why, alongside code tests, you need an evaluation gate on a golden set, shadow and canary instead of a simple deploy, and continuous quality monitoring on production traffic.
How large should the golden set be at the start?
Start with 80-150 cases from real logs or submissions, not made-up examples. Representativeness matters more than size: the set should cover main query classes, edge cases, and those that historically caused issues. Every new production incident is a candidate for the golden set. The set grows with the system.
Can you implement LLMOps without a cloud MLOps platform?
Yes. The basic version is versioning artifacts in git or a simple registry, a test script for the golden set in CI, call logs in Postgres or JSONL files, and a dashboard with a few charts. Commercial MLOps platforms speed things up but aren’t mandatory. The key is process discipline: a gate before deploy, human decision, and a ready rollback.
What to do when a model provider silently updates the model under the same name?
Pin the exact model version identifier instead of a generic name or a `latest` tag. If the provider doesn’t offer identifiers, run the golden set periodically even without changes on your side. Silent drift from the provider is one of the most common sources of regression that goes undetected for weeks.
When is shadow or canary needed, and when is offline regression testing enough?
For minor prompt changes, like tweaking format or tone, offline regression testing on the golden set is usually sufficient. Shadow and canary are reserved for model updates, prompt structure overhauls, or large-scale knowledge base updates. Rule of thumb: the larger the change scope and the harder it is to predict side effects, the more justified the cost of shadow over a simple offline gate.