LLM as a judge: how (not) to automate quality assessment

When a team asks us how to verify whether their AI assistant’s responses improve after each prompt change, the most common idea is: “let’s ask GPT-4 to evaluate.” This makes intuitive sense. The problem arises when the judge starts favoring longer responses regardless of their accuracy, or when the same rubric with different phrasing yields results that differ by 20 percentage points. At Cashcrown, we test LLM-as-a-judge on every new project and see the same error patterns regardless of which model acts as the judge. Below, we describe what actually works and what fails.

What LLM-as-a-judge is and when it makes sense

The principle is simple: instead of asking a human to evaluate a hundred model responses, you write a prompt with a rubric (faithfulness, conciseness, factual correctness) and delegate the assessment to another model. The judge returns a score or label, which you can aggregate into a metric.

This approach has real advantages. Having humans evaluate hundreds of responses daily is costly and slow. An LLM judge works in seconds, costs a fraction of that, and is reasonably consistent under a fixed prompt: if response A is better than B according to the rubric, it will usually say so on repeated runs. This is sufficient for comparing system variants. That’s precisely what LLM-as-a-judge is best suited for: A/B testing prompt variants, regression testing after changing the base model, and daily quality trend monitoring. It is not suited for certifying the quality of a specific response.

Use Case	Does LLM-as-a-judge work?	Note
Comparing prompt variant A vs B	Yes	Pairwise, not absolute scoring
Daily quality trend tracking	Yes	Calibration on human labels monthly
Evaluating faithfulness in online RAG	Yes (with caution)	Log context to verify flags
Certifying quality of a specific response	No	Requires human review
Legal, medical, HR decisions	No	Human-only

Four systematic biases that distort results

Research on LLM-as-a-judge (Meta/Stanford, 2023-2024) documented four recurring biases. Each alters the evaluation independently of the actual response quality.

Verbosity bias (favoring length). LLM judges tend to score longer, more elaborate responses higher, even when a shorter, more precise answer is objectively better. In practice: a system generating “fluff” instead of a relevant answer gets higher scores. Mitigation: the rubric must explicitly penalize unnecessary verbosity, or the judge should evaluate a response-question pair rather than the response alone.

Self-preference (preferring own outputs). A model used as a judge favors responses similar to what it would generate itself. GPT-4 as a judge gives higher scores to GPT-4 outputs than to other models. Claude as a judge behaves analogously. Mitigation: use a judge from a different model family than the evaluated model, or cross-verify pairwise evaluations.

Position bias (order effect). When a judge evaluates a pair (A, B), it tends to prefer the one seen first or last. Reversing the order on the same dataset yields different results. Mitigation: evaluate each pair in both orders and average, or use absolute scoring per response instead of pairwise.

Prompt sensitivity (phrasing sensitivity). A minor change in the rubric, e.g., switching from “rate 1 to 10” to “rate 1 to 5,” or adding the word “briefly” to the instruction, shifts score distributions by 15-25 percentage points. This means results from different rubric versions aren’t comparable. Mitigation: version the judge’s prompt like code and never compare results from different versions without recalibration.
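Versioning the judge’s prompt like code can be as simple as fingerprinting the exact rubric text and refusing to compare runs with different fingerprints. The sketch below is one possible approach, not a prescribed implementation; `prompt_version`, `JudgeRun`, and `score_delta` are hypothetical names introduced for illustration.

```python
import hashlib
from dataclasses import dataclass


def prompt_version(judge_prompt: str) -> str:
    """Short, stable fingerprint of the exact judge prompt text."""
    return hashlib.sha256(judge_prompt.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class JudgeRun:
    """Aggregated result of one evaluation run (hypothetical structure)."""
    version: str       # fingerprint of the rubric that produced the scores
    mean_score: float  # aggregated judge score for the run


def score_delta(baseline: JudgeRun, candidate: JudgeRun) -> float:
    """Difference in mean score, allowed only within one prompt version."""
    if baseline.version != candidate.version:
        raise ValueError(
            "Runs were scored with different judge prompts; "
            "recalibrate before comparing."
        )
    return candidate.mean_score - baseline.mean_score


# Even a small wording change produces a different version.
v1 = prompt_version("Rate the answer from 1 to 10.")
v2 = prompt_version("Rate the answer from 1 to 5.")
```

Storing the fingerprint next to every score makes an accidental comparison across rubric versions fail loudly instead of producing a misleading delta.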

How to build a judge you can trust

Calibration on human labels is the only hard anchor. Before deploying the judge, collect 100-200 question-response pairs with manual expert evaluations. Then check the Pearson correlation between the judge’s scores and human scores. A correlation below 0.70 means the judge measures something different than intended. Recalibrate the rubric or change the judge.
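The calibration check described above comes down to one number. A minimal sketch in plain Python, assuming judge and human scores are stored as two equally long lists on any numeric scale; the function names are illustrative, and the 0.70 threshold is the one used in this article.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def passes_calibration(
    judge_scores: list[float],
    human_scores: list[float],
    threshold: float = 0.70,
) -> bool:
    """True when the judge tracks the human labels at or above the threshold."""
    return pearson(judge_scores, human_scores) >= threshold
```

A judge that returns the same score for every response raises an error here rather than a correlation of zero, which is usually the more useful signal during calibration.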

Pairwise comparisons are more reliable than absolute scoring. Instead of asking “rate this response 1 to 10,” ask “which of these two responses better meets the criteria below.” Pairwise is less sensitive to rubric phrasing and yields more stable relative rankings, though it won’t tell you how good a response is in absolute terms.
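Because pairwise judging is exposed to position bias, a common pattern is to run every comparison in both orders. The sketch below assumes a hypothetical judge callable that receives the question and two responses and answers "first" or "second"; it does not call any real model API. Treating disagreement between the two orders as "inconsistent" is one design choice among several; averaging the two verdicts is another.

```python
from typing import Callable

# Hypothetical judge interface:
# (question, first_response, second_response) -> "first" or "second"
PairwiseJudge = Callable[[str, str, str], str]


def compare_both_orders(
    judge: PairwiseJudge, question: str, a: str, b: str
) -> str:
    """Return "A", "B", or "inconsistent" after judging in both orders."""
    forward = judge(question, a, b)   # A is shown first
    backward = judge(question, b, a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped with the order, so position decided it.
    return "inconsistent"
```

The share of "inconsistent" verdicts across a dataset is itself a rough estimate of how strongly position bias affects a given judge and rubric.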

Structured rubrics beat open-ended questions. Instead of “rate the quality of this response,” define specific dimensions: faithful to facts in context (yes/no), answers the question (yes/no), unnecessarily long (yes/no). Each dimension separately, with definitions for positive and negative cases. A judge configured via structured output enforces this format and prevents the evaluation from devolving into arbitrary text.
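A structured rubric can be mirrored by a small typed record plus a strict parser for the judge’s output. This is a sketch under the assumption that the judge returns a JSON object with one boolean per dimension; the field names are hypothetical and simply follow the three dimensions listed above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricVerdict:
    faithful_to_context: bool   # every claim is supported by the context
    answers_question: bool      # the response addresses what was asked
    unnecessarily_long: bool    # the response contains avoidable padding


REQUIRED_FIELDS = ("faithful_to_context", "answers_question", "unnecessarily_long")


def parse_verdict(raw_json: str) -> RubricVerdict:
    """Parse the judge output and reject anything that is not strict yes/no."""
    data = json.loads(raw_json)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for name in REQUIRED_FIELDS:
        if not isinstance(data.get(name), bool):
            raise ValueError(f"field {name!r} must be true or false")
    return RubricVerdict(**{name: data[name] for name in REQUIRED_FIELDS})
```

Rejecting free-text values such as "mostly" at parse time keeps ambiguous verdicts out of the aggregated metric.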


Calibration and maintenance over time

A judge isn’t a static component. As user query distributions change, its consistency with human labels declines. Treat recalibration like regular maintenance: every 4-6 weeks, sample 50 random production evaluations, assess them manually, and recalculate correlation. If it drops below the acceptance threshold, rebuild the rubric or collect a new calibration sample.

Maintain a fixed control set with manual labels. This is 50-100 pairs you don’t modify or show the judge as examples in the prompt. They serve solely to measure drift. When performance on the control set drops, treat it as a call to action, not as noise to ignore. How this fits into a broader observability system for the assistant is covered in our article on monitoring AI agent quality.
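Drift on the control set can be tracked with a single agreement figure per run. The sketch below assumes binary labels (acceptable or not) from both the judge and the human reviewers; the 0.05 tolerance is an illustrative assumption, not a value from our practice.

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of control-set items where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when agreement fell by more than the tolerance since the baseline."""
    return baseline - current > tolerance
```

Recording the baseline at deployment time and re-running the same control set on a schedule turns drift into a concrete alert condition.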

Log the judge’s reasoning alongside scores. Textual justifications are the only way to understand what the judge actually measures when results are surprising. Reading a few dozen justifications weekly often reveals systematic bias faster than correlation alone. Also check whether the judge is hallucinating justifications, i.e., citing things not present in the evaluated response.
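Part of the hallucination check can be automated when the judge quotes the response in its justification. The sketch below is a rough heuristic under that assumption: it extracts snippets in straight double quotes and reports those that do not occur in the evaluated response. It will miss paraphrased claims and other quote styles, so it supplements manual reading rather than replacing it.

```python
import re

# Matches text between straight double quotes only (a deliberate simplification).
QUOTE_PATTERN = re.compile(r'"([^"]+)"')


def unsupported_quotes(justification: str, response: str) -> list[str]:
    """Quoted snippets in the justification that do not occur in the response."""
    # Compare case-insensitively and ignore differences in whitespace.
    normalized_response = " ".join(response.lower().split())
    missing = []
    for quote in QUOTE_PATTERN.findall(justification):
        needle = " ".join(quote.lower().split())
        if needle and needle not in normalized_response:
            missing.append(quote)
    return missing
```

Any non-empty result is worth a manual look, since the judge may be scoring text that the assistant never produced.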

Where human review remains mandatory

LLM-as-a-judge is a tool for scale, not for final verdicts. Here are boundaries we don’t cross:

High-stakes decisions (employee termination, credit denial, medical diagnosis, legal opinion) require manual review regardless of judge quality. Guardrails in the system should automatically exclude such cases from automated paths and route them to humans. How to architect these boundaries is covered in our article on LLM output validation.

New domains without calibration data. If you lack a set of human labels for a new content category, you don’t know if the judge measures what you intend. Deploying a judge without calibration means accepting an unknown systematic bias.

Evaluating the judge itself. An LLM judge shouldn’t evaluate variants of its own prompt or configuration. This creates a self-reinforcing loop in which the winner is the variant closest to the judge’s style, not the objectively best one.
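The first two boundaries can be enforced with a routing rule placed in front of the judge. The sketch below assumes an upstream step that assigns each case a category label; the labels and the function name are hypothetical.

```python
# Hypothetical category labels for decisions that always need a human.
HIGH_STAKES = frozenset(
    {"employee_termination", "credit_denial", "medical_diagnosis", "legal_opinion"}
)


def review_path(category: str, calibrated_categories: frozenset[str]) -> str:
    """Decide whether a case may go to the LLM judge or must go to a human."""
    if category in HIGH_STAKES:
        return "human_review"  # never automated, regardless of judge quality
    if category not in calibrated_categories:
        return "human_review"  # no human labels yet, so judge bias is unknown
    return "llm_judge"
```

Defaulting unknown categories to human review means a new content type cannot reach the automated path before someone has calibrated the judge for it.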

For how these boundaries work in a comprehensive assistant evaluation, see our article on AI agent evaluation, testing, and benchmarks.

Integration with a broader evaluation pipeline

LLM-as-a-judge is one layer in an evaluation pipeline, not the whole pipeline. In the RAG systems we build, it works alongside retrieval metrics (recall@k, MRR) and specialized faithfulness evaluation. How these layers fit together is covered in our article on RAG quality evaluation. An LLM judge is particularly suited for assessing dimensions that deterministic metrics don’t cover: tone, style, completeness of explanation, and fit with the business context.
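For reference, the two retrieval metrics mentioned above are deterministic and need no judge. A minimal sketch, assuming documents are identified by string IDs and that relevance labels are available for each query:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("need at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(
    rankings: list[list[str]], relevant_sets: list[set[str]]
) -> float:
    """Average of 1/rank of the first relevant result; 0 when none is found."""
    if not rankings or len(rankings) != len(relevant_sets):
        raise ValueError("need one relevant set per non-empty list of rankings")
    total = 0.0
    for retrieved, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)
```

These numbers describe whether the right context was retrieved; the judge then covers what they cannot, such as whether the answer built on that context reads well.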

We treat the judge’s score as one of several signals, not the only one. If the judge flags a response as poor but users don’t escalate and CSAT is high, human signals win. Conversely, a high judge score with low CSAT means the judge measures the wrong dimension. Then we revisit the rubric.
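The reconciliation rule above fits in a few lines. This sketch assumes both signals have already been reduced to a pass/fail flag for the period under review; how those flags are derived (score thresholds, CSAT cut-offs) is left open, and the returned labels are hypothetical.

```python
def reconcile(judge_ok: bool, csat_ok: bool) -> str:
    """Decide what to do when the judge and user signals are compared."""
    if judge_ok == csat_ok:
        return "signals_agree"
    if csat_ok:
        # Judge flags poor quality, but users are satisfied: human signals win.
        return "trust_human_signals"
    # Judge is satisfied, but users are not: it measures the wrong dimension.
    return "revisit_rubric"
```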

FAQ

Does LLM-as-a-judge replace human review?

No. It replaces manual labeling at scale for variant comparisons and trend monitoring. For high-stakes decisions, new domains without calibration, and evaluations with legal or ethical consequences, human review remains mandatory. An automated judge complements manual review; it does not eliminate it.

Which model works best as a judge?

There’s no single answer; it depends on the domain and the evaluated models. General rule: the judge should come from a different family than the evaluated model to avoid self-preference. A stronger model isn’t always a better judge, because prompt sensitivity is a property of the architecture rather than of model size. Calibration on human labels matters more than model choice.

How often should the judge be recalibrated?

In our practice, recalibrating every 4-6 weeks suffices when the query distribution is stable. When rolling out new features, changing the knowledge base, or adding a new content category, recalibrate immediately, before the judge returns to production.

Is pairwise always better than absolute scoring?

Pairwise is more stable for comparing two system variants and less sensitive to rubric phrasing. Absolute scoring is needed when measuring absolute quality over time (weekly trends) or flagging responses below a threshold regardless of comparison. In practice, we use both: pairwise for A/B testing, absolute scoring for continuous monitoring.

What does a Pearson correlation below 0.70 mean during calibration?

It means the judge measures a different dimension than the human expert. This isn’t always the judge’s fault; it may indicate that the rubric poorly describes what the team cares about. Below 0.70, we don’t deploy the judge to production. Between 0.70 and 0.80, we deploy with limited scope and weekly justification audits. Above 0.80, the judge can serve as the primary quality signal with monthly recalibration.
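The thresholds above translate directly into a deployment rule. A minimal sketch; the function name and tier labels are hypothetical, and a correlation of exactly 0.80 is placed in the middle tier because only values above 0.80 qualify for the top one.

```python
def deployment_tier(correlation: float) -> str:
    """Map the judge-vs-human Pearson correlation to a deployment decision."""
    if correlation < 0.70:
        return "do_not_deploy"
    if correlation <= 0.80:
        return "limited_scope_weekly_audits"
    return "primary_signal_monthly_recalibration"
```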