#evaluation

4 posts

17/06/2026

How to Evaluate a RAG System: Retrieval Metrics, Faithfulness, and Golden Set

How to evaluate an end-to-end RAG system in 2026: recall@k and precision for retrieval, faithfulness and source attribution, building a golden set, and offline vs. online evaluation.

17/06/2026

How to measure embedding quality: recall@k, MRR, and domain benchmarks

How to evaluate embedding model quality on your own data in 2026: recall@k, MRR, nDCG, building a golden set, and pitfalls of offline and online evaluation.

17/06/2026

LLM as a judge: how (not) to automate quality assessment

LLM-as-a-judge in 2026: when automated quality assessment works, what systematic biases it introduces, and how to calibrate the judge before trusting it with production decisions.

17/06/2026

Prompt and Model Versioning: Regression Testing and Change Control in AI

How to manage changes in an AI system in 2026: prompt and model versioning, regression testing on a golden set, safe updates, changelog, and rollback.
