benchmarks · model-agnostic

Model benchmarks
Dated, reproducible, eval-first.

We publish dated benchmarks on RAG retrieval and agent reliability, with LLM-selection runs alongside. Every result is reproducible with the open-source paiteq/ai-eval-harness on a corpus you can inspect. Model-agnostic on principle: Claude and GPT alongside Gemini and open-source models, all scored on the same rubric. Our pilot RAG run logged 88% faithfulness on a 1,840-document corpus (2026-Q1); the agent harness logged 71% pass@1 across 100 tool-using tasks (2026-Q1).

methodology

How we benchmark.
Four rules. No exceptions.

Every benchmark we publish meets four rules. Anything that fails them doesn't get published.

Dated in the URL and H1

Undated benchmarks rot. Each slug carries the publish quarter so readers can tell what's current at a glance.

Reproducible by anyone

Code lives in paiteq/ai-eval-harness (MIT). Corpora mirror to huggingface.co/paiteq-ai. Run the harness; you should land inside our published confidence intervals.
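A rough sketch of what "land inside our published confidence intervals" can mean for a pass@1 score. The page does not say which interval method the harness uses, so the 95% Wilson score interval here is an assumption, and `wilson_interval` and `lands_inside` are hypothetical helper names, not part of paiteq/ai-eval-harness.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def lands_inside(rerun_passes: int, rerun_trials: int,
                 published_passes: int, published_trials: int) -> bool:
    """True if the rerun's pass rate falls inside the published run's interval."""
    lo, hi = wilson_interval(published_passes, published_trials)
    return lo <= rerun_passes / rerun_trials <= hi


# Published agent run: 71 passes out of 100 tasks (71% pass@1).
low, high = wilson_interval(71, 100)
print(f"95% interval: {low:.3f} to {high:.3f}")

# Hypothetical reruns, for illustration only.
print(lands_inside(68, 100, 71, 100))  # True
print(lands_inside(55, 100, 71, 100))  # False
```

With only 100 tasks the interval is wide, so a rerun a few points either side of 71% still counts as a successful reproduction under this assumption.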

Model-agnostic on one rubric

Claude, GPT, Gemini, and open-source models score on identical prompts and corpora. No model-family favouritism. Where one shines, we say why.

Cost on the same axis

Recall@5 and pass@1 are meaningless without $/1k queries on the same dated run. Every benchmark reports cost alongside quality.
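As a minimal sketch of reporting cost on the same axis as quality, the snippet below computes Recall@5 for one query and $/1k queries for a run, then prints them together. The function names, document IDs, and dollar figures are made up for illustration; they are not taken from the harness or from any published run.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def cost_per_1k_queries(total_cost_usd: float, num_queries: int) -> float:
    """Total spend for the run, normalised to 1,000 queries."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return total_cost_usd / num_queries * 1000


# Illustrative values only (hypothetical IDs and costs).
retrieved = ["d3", "d7", "d1", "d9", "d4", "d2"]  # ranked retriever output
relevant = {"d1", "d2"}                            # gold documents for the query

quality = recall_at_k(retrieved, relevant, k=5)
cost = cost_per_1k_queries(total_cost_usd=0.42, num_queries=120)

# Quality and cost are reported side by side for the same run.
print(f"Recall@5={quality:.2f}  $/1k queries={cost:.2f}")
```

Here "d2" sits at rank 6, so it misses the top-5 cut and Recall@5 is 0.50; a pricier setup that surfaced it would score higher on quality and higher on cost, which is the trade-off the paired figures are meant to expose.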

for buyers

Why we publish these.
Eval-first delivery, in public.

Most agencies pick a model because the founder likes it. We pick a model because the eval said so. These benchmarks are how we work, published so you can audit the methodology before you hire us. Engagements run as a discovery audit, then a 4-to-6-week pilot with weekly eval gates, then continuous delivery against the same rubric in production.