# LLM evaluation frameworks: how to evaluate an LLM app

> Offline vs online eval, LLM-as-judge pitfalls, golden sets, and regression eval — plus which framework (DeepEval, Ragas, MLflow, Arize Phoenix, OpenAI Evals) to pick, with our defaults.

**HTML version:** https://www.paiteq.com/blog/llm-eval-frameworks/
**Published:** 2026-06-11T02:50:18.084Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~10 min


---

Every team that ships an LLM feature eventually says the same sentence in a standup: "it passed our tests." Then a prompt tweak goes out on a Friday, the model provider rolls a silent update, and on Monday the support queue fills with answers that are confidently, fluently wrong. The tests passed because the tests were three hand-picked examples someone ran once in a notebook. That's not LLM evaluation. That's a vibe check with extra steps.

This guide is about how to actually evaluate an LLM application, not how to read a public model leaderboard. Those are different problems. Public benchmarks like MMLU and SWE-bench rank models against the field; they tell you almost nothing about whether *your* RAG pipeline, your agent, or your support bot does the job for your users. We've stood up evaluation harnesses across 200+ engagements, and the failure pattern is always the same: teams over-trust an offline test that ran once, skip the online loop entirely, and let an LLM-as-judge quietly grade their work without anyone checking the judge. So we'll walk through the two halves of evaluation, the metrics that actually catch bugs, the LLM-as-judge biases that corrupt your numbers, how to build a golden set you'll maintain, and which framework to reach for when. There's a runnable setup at the end you can copy this week.

## LLM evaluation in one paragraph, and why "it passed our tests" isn't an answer

LLM evaluation is the practice of measuring whether an LLM-powered feature produces outputs that are correct, safe, and useful for a specific task, in a way that's repeatable enough to compare across prompt versions, models, and time. The repeatable part is what separates evaluation from a demo. A demo proves the system can be right once; an eval proves it's right often enough, on inputs that look like production, and it tells you the moment that stops being true. Because LLMs are non-deterministic, the same prompt can produce a clean answer and a hallucinated one on consecutive calls, so a single passing run is evidence of almost nothing. "It passed our tests" usually means it passed a handful of inputs the author already knew the model handled well, which is the textbook definition of a test that can't fail. A real LLM evaluation runs against a representative set of inputs with a defined scoring method, produces a number you can track over time, and breaks loudly when a change makes things worse. So when someone asks how to evaluate LLM behavior in a way they can trust, the honest answer is: build that loop, not a notebook. Everything else in this guide is detail on how to build it.
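
To make "a representative set, a defined scoring method, a number, and a loud failure" concrete, here is a minimal sketch of that loop in plain Python. The names `pass_rate` and `assert_quality` are ours for illustration, `generate` stands in for whatever calls your model, and repeating each case a few times is one simple way (an assumption on our part, not the only one) to absorb non-determinism.

```python
from typing import Callable, Sequence


def pass_rate(
    generate: Callable[[str], str],
    score: Callable[[str, str], bool],
    cases: Sequence[tuple[str, str]],
    runs_per_case: int = 3,
) -> float:
    """Fraction of (case, run) pairs that pass the scoring function."""
    if not cases or runs_per_case < 1:
        raise ValueError("need at least one case and one run per case")
    passed = 0
    for prompt, expected in cases:
        # Repeat each case: a single passing run proves very little
        # when the model is non-deterministic.
        for _ in range(runs_per_case):
            if score(generate(prompt), expected):
                passed += 1
    return passed / (len(cases) * runs_per_case)


def assert_quality(rate: float, threshold: float) -> None:
    """Break loudly when the tracked number falls below the agreed floor."""
    if rate < threshold:
        raise AssertionError(
            f"pass rate {rate:.2%} is below threshold {threshold:.2%}"
        )
```

In practice `score` would be exact match, a rubric judge, or whatever fits the task; the point of the sketch is only that the output is a tracked number with a floor, not a one-off notebook run.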

## The two halves of LLM evaluation: offline eval vs online eval

There are two halves to LLM evaluation, and most teams build one and forget the other. Offline evaluation runs before you ship: you take a fixed set of inputs with known-good answers, run the current prompt and model against them, and score the outputs. It's fast, cheap, deterministic enough to gate a deploy, and it answers the question "is the new version at least as good as the old one on the cases I know about?" Online evaluation runs after you ship: you sample real production traffic, score a slice of it (with a judge model, a heuristic, or human review), and watch for drift. It answers a different and harder question: "is the system still good on inputs nobody anticipated?"

The loop matters more than either half alone. Online eval surfaces the failure modes your golden set never imagined; you triage those failures, label them, and fold them back into the offline set so the next deploy is gated on the bug you just found. Skip the online half and your offline eval slowly drifts away from reality, passing forever while users suffer. The honest rule of thumb: offline eval tells you the prompt works on yesterday's data, and only online eval tells you it works on tomorrow's users. If you're building an agent, the online half is non-negotiable, because agents fail in trajectories rather than single answers and you can only see that on real traces. The [multi-agent orchestration patterns](/blog/multi-agent-orchestration-patterns/) that production agents use only become observable when you evaluate at the trace level, which is squarely an online-eval problem.
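
The two mechanical pieces of that loop, sampling production traffic and folding labeled failures back into the offline set, can be sketched in a few lines. This is an illustration under our own assumptions: the function names, the `input` and `corrected_answer` keys, and hash-based sampling are all hypothetical choices, not something a particular framework prescribes.

```python
import hashlib


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the id."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Compare as integers so rate=1.0 always selects and rate=0.0 never does.
    return int.from_bytes(digest[:8], "big") < int(rate * 2**64)


def fold_failures(golden: dict[str, str], labeled_failures: list[dict]) -> int:
    """Add human-labeled production failures to the golden set.

    Returns how many new cases were added; existing inputs are left alone.
    """
    added = 0
    for failure in labeled_failures:
        prompt = failure["input"]
        if prompt not in golden:
            golden[prompt] = failure["corrected_answer"]
            added += 1
    return added
```

Hashing the trace id rather than calling a random number generator means the same trace is always in or out of the sample, which makes a judged score reproducible when you re-run the pipeline.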

## What you actually measure: task metrics, RAG metrics, and safety checks

"Measure quality" is not a metric. Before you pick a framework you need to know which specific signals catch your specific bugs, because an LLM evaluation metric that's perfect for a classifier is useless for a RAG answer and irrelevant for an agent. There are three families worth separating. Task metrics score whether the output did the job: exact match or F1 for extraction, a graded rubric for open-ended generation, tool-call accuracy for agents. RAG metrics score the retrieval-plus-generation chain specifically: faithfulness (is the answer grounded in the retrieved context, or did the model invent it?), context relevance, and answer relevance. Safety checks are pass/fail gates that run regardless of task: refusal of disallowed requests, absence of leaked PII, no prompt-injection compliance. Here's the set worth wiring up, with the framework that ships each one out of the box.

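For the task-metric family, exact match and token-level F1 are simple enough to write by hand, which is a useful way to see what they do and don't catch. The sketch below uses one common formulation of token F1 (bag-of-words overlap after lowercasing and whitespace splitting); the normalization is our assumption, and real extraction tasks often need stricter or looser rules.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace; deliberately simple."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Both metrics need a reference answer and both reward surface overlap, which is why they suit extraction and fall short for open-ended generation or RAG faithfulness.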
## Reference-based vs reference-free: golden sets, and when you can't have one

The deepest fork in LLM evaluation is whether you have a known-good answer to compare against. Reference-based evaluation compares the model's output to a labeled ground-truth answer in a golden set; it's the gold standard when you can build one, because the score is grounded in something a human verified. Reference-free evaluation scores an output with no ground truth, usually by judging it against a rubric or checking it against the retrieved context. You reach for reference-free when ground truth is impractical: open-ended generation where many answers are valid, or production traffic where nobody's had time to label the right answer. Most real systems use both, and the boundary is where a lot of teams quietly cheat themselves.

## LLM-as-a-judge, and the four biases that quietly corrupt it

LLM-as-a-judge, using a strong model like Claude Opus 4.7 or GPT-5 to grade outputs against a rubric, is the technique that makes modern LLM evaluation scale. It's also the most abused. A judge model is a model, which means it has biases, and if you don't control for them you're not measuring quality, you're measuring the judge's preferences and calling it quality. There are four biases worth naming because they show up in almost every naive setup, and each has a concrete control.


Controlled this way, a judge is genuinely useful. In 2026, a well-built LLM-as-judge with a pinned model and a written rubric reaches roughly 80% agreement with human labels on a clear-cut rubric, which is high enough to scale review across thousands of outputs and low enough that you still keep a human-labeled anchor set as the source of truth. The mistake is treating the judge's number as ground truth rather than as a high-throughput estimate that you periodically check against humans.
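
Checking the judge against the human-labeled anchor set is a small calculation. The sketch below computes raw agreement plus Cohen's kappa, which corrects agreement for what two raters would reach by chance; adding kappa is our suggestion rather than something the figures above rely on, and the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Share of items where the judge label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Raw agreement can look healthy when almost everything passes, because two raters who both say "pass" most of the time agree often by chance; kappa is the number that drops in that situation.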

## Building a golden set you'll actually maintain

A golden set is the labeled, held-out collection of inputs and known-good answers that your offline LLM evaluation runs against. It's the single highest-leverage asset in your eval program and the one teams most often build once and abandon. The failure mode is treating it as a one-time deliverable instead of a living artifact. A golden set that doesn't grow from production failures slowly stops resembling production, and a green offline score against a stale set is worse than no score because it manufactures false confidence. Treat the golden set like a codebase: version it, review changes, and grow it deliberately.

Start small, twenty to fifty hand-curated items that cover your core happy path plus the obvious edge cases, and resist the urge to generate a thousand synthetic examples that all look alike. A focused set of real, well-labeled cases beats a huge set of plausible-but-fake ones every time. The same discipline that an [AI readiness assessment](/blog/ai-readiness-assessment/) checks for, namely whether a team has the labeled data and the process to evaluate its own systems, is exactly the capability a maintained golden set represents. If you can't name who owns the set and when it last grew, you don't have an eval program, you have a folder.
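
One way to treat the golden set like a codebase is to store it as JSONL in the repository and stamp every eval run with a content hash of the set. The schema below (`item_id`, `prompt`, `expected`, `source`) and the function names are hypothetical; adapt the fields to whatever your cases actually need.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenItem:
    item_id: str
    prompt: str
    expected: str
    source: str  # e.g. "hand-curated" or "production-failure"


def to_jsonl(items: list[GoldenItem]) -> str:
    """Serialize in a stable order so diffs in review stay readable."""
    ids = [item.item_id for item in items]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate item_id in golden set")
    ordered = sorted(items, key=lambda item: item.item_id)
    lines = [json.dumps(asdict(item), sort_keys=True) for item in ordered]
    return "\n".join(lines) + "\n"


def set_version(items: list[GoldenItem]) -> str:
    """Short content hash; changes whenever any item is added or edited."""
    return hashlib.sha256(to_jsonl(items).encode("utf-8")).hexdigest()[:12]
```

Recording `set_version` next to each score makes it obvious when two numbers are not comparable because the set grew between runs, and the `source` field lets you see how much of the set came from production failures.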

## Regression evaluation: catching the prompt change that broke production

Regression evaluation is the practice of re-running your offline eval on every change and failing the change if quality drops. It's the discipline that turns a golden set from a one-time report into a guardrail. The trigger is any of three things: a prompt edit, a model swap, or a dependency change in the retrieval or tool layer. All three can silently degrade quality, and none of them show up in a unit test that only checks the code runs. The cheapest way to wire this is to make the eval a CI gate that runs on the pull request, exactly where you'd run unit tests. promptfoo is built for this: you declare your test cases and assertions in a config file, point it at the prompt and providers, and it fails the build when a metric regresses.
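
The gate logic itself is small, whichever tool runs it. The sketch below is a plain-Python illustration of "fail the change if quality drops", not promptfoo's own mechanism: it compares a candidate run's metrics to a stored baseline and returns a non-zero exit code on regression. The metric names and the 0.02 tolerance are placeholders we chose for the example.

```python
import sys


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """List every metric that dropped by more than `tolerance`."""
    failures = []
    for metric, old in sorted(baseline.items()):
        new = candidate.get(metric)
        if new is None:
            failures.append(f"{metric}: missing from candidate run")
        elif new < old - tolerance:
            failures.append(f"{metric}: {old:.3f} -> {new:.3f}")
    return failures


def gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> int:
    """Return a process exit code: 0 to pass the check, 1 to fail the build."""
    failures = regressions(baseline, candidate, tolerance)
    for line in failures:
        print("REGRESSION", line, file=sys.stderr)
    return 1 if failures else 0
```

A CI step would call something like `sys.exit(gate(baseline, candidate))`. The tolerance exists because judged metrics are noisy; set it from the run-to-run variance you actually observe rather than from this example.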

## The LLM evaluation framework landscape: DeepEval, Ragas, MLflow, Arize Phoenix, OpenAI Evals

The LLM evaluation framework landscape looks crowded until you sort it on two axes: does it run offline (pre-deploy, in a test harness) or online (on production traces), and does it score with assertion-style metrics (pass/fail unit evals) or with observability-style aggregates (dashboards over traffic)? Almost every LLM evaluation tool sits somewhere on that grid, and knowing where saves you from adopting a tracing platform when you wanted a unit-test library, or vice versa.

A quick read on the named tools, because the marketing pages blur together. DeepEval, from Confident AI, is the pytest-for-LLMs library: you write assertions with metrics like G-Eval and run them in your test suite. Ragas is the RAG specialist, with faithfulness and context-relevance metrics designed for retrieval pipelines. MLflow brings LLM evaluation into the same experiment-tracking workflow teams already use for classical ML. Arize Phoenix is open-source LLM observability with tracing and online evals; Galileo, Langfuse, and Helicone live in the same online/observability space with different emphases. OpenAI Evals is a registry-style harness for defining and running evals, strongest when you're inside the OpenAI ecosystem. Evidently AI and TruLens round out the open-source options, and UK AISI's Inspect is the rigorous choice when you need auditable, research-grade eval definitions.

## Which LLM evaluation framework to pick for which job (and our defaults)

Here's the opinion the vendor pages won't give you: pick the narrowest LLM evaluation framework that covers your job, not the platform that promises to cover everything. A single do-everything tool usually means you adopt none of its features well, because the surface area is too large to operationalize. Map the job to the tool, default to the obvious choice, and only reach for a platform when you genuinely need cross-team observability. This is the mapping we default to across engagements.

## Wiring LLM evaluation into CI and the production loop

A framework is only useful once it runs without anyone remembering to run it. The offline half lives in CI: the regression eval is a required check on every pull request that touches a prompt, a model identifier, or the retrieval layer, and it fails the build on a quality drop the same way a broken test does. The online half lives in the production pipeline: you sample a percentage of real traffic, attach traces, score a slice with a pinned judge model, and alert on drift. The two connect through the golden set, which grows from the failures the online loop surfaces. Here's what a trustworthy 2026 setup looks like once it's wired, and these are the dials worth checking before you trust a single dashboard.
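
"Alert on drift" can start as something very plain: a rolling pass rate over the most recent judged samples, compared to the baseline you measured at deploy time. The class below is a minimal sketch under that assumption; `DriftMonitor`, the window size, and the allowed drop are all illustrative, and an observability platform would normally do this for you.

```python
from collections import deque


class DriftMonitor:
    """Rolling pass rate over the most recent judged production samples."""

    def __init__(self, baseline: float, window: int = 200, max_drop: float = 0.05):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores: deque[float] = deque(maxlen=window)

    def record(self, passed: bool) -> bool:
        """Record one judged sample; return True when an alert should fire."""
        self.scores.append(1.0 if passed else 0.0)
        if len(self.scores) < self.scores.maxlen:
            # Stay quiet until the window is full, to avoid noisy early alerts.
            return False
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline - self.max_drop
```

The window and the allowed drop are the two dials here: a short window reacts quickly but fires on noise, and a long one is calmer but slower to notice a silent provider update.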

## A reproducible LLM evaluation setup you can run this week

Enough theory. Here are three concrete entry points into LLM evaluation, each runnable on your own data. The first is a Ragas faithfulness check for a RAG answer, the second is a DeepEval G-Eval rubric metric you can drop into pytest, and the third is the promptfoo CI config that gates a prompt change. Start with whichever matches the job you have today; you don't need all three on day one. A RAG faithfulness eval run with Ragas in 2026-Q1 flags unsupported claims at meaningfully higher recall than exact-match, at a per-1k-eval API spend in the single-digit dollars, which is cheap enough that there's no excuse to skip it.

## FAQ — LLM evaluation, in the practitioner's vocabulary

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/
