The thing that scares us is not the obvious miss. We can write a rule for that. What scares us is a confident-sounding answer that cites a real doc page and is still wrong because the model picked the page from the next product over. Show us how you measure that, or we're not signing.
A Claude RAG case study: cited answers
over 12,000 product-docs pages.
A developer-tooling SaaS needed an in-product answer surface that could reason across nine product surfaces and twelve thousand docs pages, return cited answers users could click to verify, and refuse out loud when it didn't know. We built it on Claude Sonnet 4.6 + Haiku 4.5, a hybrid pgvector + Algolia index, and a forced-JSON schema that rejects any answer without grounded citations. Seven weeks, shadow-first, with a synonym-hallucination kill point at week 5 that we used.
What this case study shows
A B2B SaaS developer-tooling company shipped a Claude RAG agent over 12,000 product-docs pages routing tier-1 support questions. Across n=3,400 over a 30-day shadow window (95% CI), the agent deflected 64% of docs-recoverable tickets at confidence at or above 0.8, with 0.92 groundedness and p95 2.1-second first-token latency. Stack: Claude Sonnet 4.6 plus Haiku 4.5 router, pgvector 0.7, BAAI bge-reranker-large, Cloudflare Workers, Langfuse, Algolia. Seven weeks discovery to production. This is one shape of a broader AI knowledge base engagement — RAG over your docs corpus with retrieval + reranking + grounded synthesis.
A docs-search bot
that lied with confidence.
A developer-tooling SaaS with 9 product surfaces and ~12,400 MDX docs pages. Keyword search couldn't reason across nested module hierarchies; the existing answer-bot hallucinated on synonyms; 41% of inbound tickets (n=1,200) were recoverable from a docs page the user just hadn't found. The support queue paid the bill.
today
with the agent
Two failure modes. Keyword search couldn't reason across nested module hierarchies. A question about transaction.refund.signature surfaced generic transaction pages but missed the signature-mismatch troubleshooting note three levels deeper. A prior off-the-shelf answer-bot ran a single embedding lookup + a generic prompt and was confidently wrong on synonyms (the corpus uses API key, secret token, and signing secret in different products for related-but-distinct concepts). Internal eval put the bot's groundedness at 0.82, with 18% confident-wrong answers citing a real doc page from the wrong product. Vendor pitches in the docs-search space were all turned down: no honest eval methodology, no audit log on the model's claims, no first-class refusal path.
Numbers from the customer's internal eval of the prior off-the-shelf bot, the cut directly above this build.
Claude RAG pipeline: six stages,
three outcome lanes.
Anchor-preserving chunker over the MDX corpus, hybrid pgvector + Algolia retrieval with RRF fusion, bge-reranker-large with a score-margin refusal gate, Haiku 4.5 intent routing, and a forced-JSON Sonnet 4.6 cited answer. Three outcome lanes: answer-in-product (≥0.8), suggest-and-open-ticket (0.5–0.8), refuse-and-route (<0.5). Diagram below.
Forced-JSON cited-answer schema
- we rejected
- Free-text answer with a downstream citation parser
- because
- Every claim has to point at an anchor_id matching the regex doc_[slug]#[anchor]. The Zod validator is the contract; the model can't hand-wave. If parsing fails, we retry once with a stricter prompt, then fail closed to a human ticket.
Two-model routing — Haiku 4.5 first, Sonnet 4.6 on real questions
- we rejected
- Single Sonnet model for every query
- because
- ≈ 28% of inbound queries are greetings, product-name lookups, or one-liner FAQs. Haiku handles them at a tenth the per-call cost; Sonnet reasons over the retrieved evidence on the rest. The routing decision itself is cached for 24h on canonical-form input.
Algolia as the lexical lane (not BM25 reinvention)
- we rejected
- Postgres tsvector BM25 alongside pgvector
- because
- The customer already paid for Algolia. Algolia's synonym table + typo tolerance was tuned by their docs team over four years. Reusing it as the lexical lane in RRF was a measurable +6 points on recall@5 vs a fresh tsvector index, with zero new ops surface.
Every component has a
separately measurable contract.
When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.
Cited-answer model
Labelled answer-correctness + groundedness on the frozen 1,260-query eval. Forced-JSON schema requires every claim to carry an anchor_id matching the regex. The model cannot reason past the validator.
Retrieval
Recall@5 on the frozen eval. RRF + reranker tuned against this number, not end-to-end accuracy.
Reranker
Top-1 precision on the held-out slice. Score-margin gate catches synonym hallucinations before Sonnet ever sees them.
Groundedness watchdog
LLM-as-judge 5% live sample. Wired to Sentry.
Refusal lane
Refusal-rate is by-design, not a failure mode.
Editorial drift
Anchor_id mismatches caught on PR-merge reindex.
The docs-RAG pipeline,
end to end.
One query enters at the left. It either renders an in-product answer with cited anchor links, opens a draft ticket with caveats, or fails closed and routes to a human. Hover any stage for its tool inventory and latency budget.
latency budgets above are p50/p95 on the production traffic mix · end-to-end p95 inside 2.6s for a streamed answer
Claude RAG stack: named tools,
named versions.
Everything in the build is a thing your platform team can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, and policies are checked into the customer's repo, not ours.
Production shape,
under the hood.
Latency is measured at the answer-surface boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 1,260-query set the CI gates on.
Per-stage P50 / P95 (ms · streamed)
| stage | p50 | p95 | tooling |
|---|---|---|---|
| Intent router · Haiku 4.5 | 320 | 560 | Anthropic API · ~480 in / ~60 out tokens · 24h canonical-form cache |
| Embedding (query) | 82 | 140 | voyage-3-large · 1,024 dim · batched |
| Hybrid retrieval (pgvector ∥ Algolia) | 186 | 320 | RRF k=60 · top-40 per lane → dedupe → top-40 unique |
| Cross-encoder rerank | 240 | 380 | BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-8 |
| Sonnet 4.6 cited answer (first token) | 920 | 1480 | Anthropic API · streamed · ~3,200 in / ~360 out tokens |
| Groundedness + render | 16 | 32 | Zod schema validation · Cloudflare Workers · streamed to client |
| Total to first token (end-to-end) | 1764 | 2092 | agent boundary — excludes client-side render of the citation chips |
- stage Intent router · Haiku 4.5p50 320p95 560tooling Anthropic API · ~480 in / ~60 out tokens · 24h canonical-form cache
- stage Embedding (query)p50 82p95 140tooling voyage-3-large · 1,024 dim · batched
- stage Hybrid retrieval (pgvector ∥ Algolia)p50 186p95 320tooling RRF k=60 · top-40 per lane → dedupe → top-40 unique
- stage Cross-encoder rerankp50 240p95 380tooling BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-8
- stage Sonnet 4.6 cited answer (first token)p50 920p95 1480tooling Anthropic API · streamed · ~3,200 in / ~360 out tokens
- stage Groundedness + renderp50 16p95 32tooling Zod schema validation · Cloudflare Workers · streamed to client
- stage Total to first token (end-to-end)p50 1764p95 2092tooling agent boundary — excludes client-side render of the citation chips
p50/p95 from a 30-day rolling window over n ≈ 84,000 production decisions. SLO is p95 ≤ 2,500 ms to first streamed token; current burn ≈ 84%.
Chunk size × overlap vs recall@5
Two curves on the same eval slice: sentence-anchored chunks with 80-token overlap versus naive splits without. Each datapoint is an actual eval cut, not a model projection. Picked value (signal-marker) is what we shipped.
- 512-token chunks · 80-token overlap · sentence-anchored
- same chunk size · no overlap · naive split
- value we shipped
eval cut · 800-query docs-recoverable slice of the frozen 1,260-item set · numbers are means over 3 random seeds, ± 0.012 SD
// docs-answer/schema/answer.ts
// Forced-JSON answer schema. Every claim must point to a doc-anchor id;
// the parser rejects answers without grounded evidence and we retry once
// with a stricter prompt, then fail closed (route to human support).
import { z } from "zod";
export const DocsAnswer = z.object({
answer: z.string().min(20).max(1200),
confidence: z.number().min(0).max(1),
citations: z.array(z.object({
claim: z.string().min(8).max(280),
anchor_id: z.string().regex(/^doc_[a-z0-9-]+#[a-z0-9-]+$/),
title: z.string(),
})).min(1).max(6),
refused: z.boolean().describe(
"True if the agent decided it cannot answer (out-of-scope, " +
"score-margin gate failed, or no citation could be grounded)."
),
});
export type DocsAnswer = z.infer<typeof DocsAnswer>;
Why is my webhook returning a 401 even though I copied the secret token from the dashboard?
Webhooks signed with the dashboard-issued secret token return 401 when the request body has been re-serialized between dispatch and verification 1 . The signature is computed over the raw bytes of the payload, so any middleware that re-encodes JSON (including most HTTP clients' default body parsers) will invalidate the signature 2 . You can either configure your framework to expose the raw body before parsing, or use the SDK's built-in verifier which accepts a parsed body and an `x-signature-payload` header that ships the canonical bytes 3 .
↑ what an in-product answer looks like at confidence ≥ 0.8 · below that, the same shape renders with a "draft only — open a ticket" banner; below 0.5 the surface refuses and routes to a human
Per-decision and monthly cost math
| line item | $ / answer | $ / month (≈ 84k answers) | note |
|---|---|---|---|
| Claude Haiku 4.5 · routing (all queries) | $0.0007 | $59 | 480 tokens × $1.00 / 1M + 60 × $5.00 / 1M |
| Claude Sonnet 4.6 · input (72% of queries) | $0.0096 | $580 | 3,200 tokens × $3.00 / 1M · only on routed queries |
| Claude Sonnet 4.6 · output (72% of queries) | $0.0054 | $326 | 360 tokens × $15.00 / 1M · streamed |
| voyage-3-large embeddings (avg query) | $0.00036 | $30 | ≈ 3,000 tokens × $0.12 / 1M |
| pgvector · RDS db.m6i.large (BAA-eligible) | — | $284 | Postgres 16 · embeddings + anchor-id index |
| Algolia · production tier (existing line) | — | $0 | already-paid line · reused as RRF lexical lane |
| g5.xlarge reranker (24/7 in VPC) | — | $378 | BAAI bge-reranker-large self-host |
| Cloudflare Workers · edge + audit shipping | — | $96 | Workers Paid + Workers AI gateway |
| Langfuse self-hosted (t3.medium) | — | $67 | trace store; 90-day hot / 7-yr cold |
| All-in monthly | ≈ $0.0227 | ≈ $1,820 | vs. ≈ $9,400 / mo to add one tier-1 support engineer |
- line item Claude Haiku 4.5 · routing (all queries)$ / answer $0.0007$ / month (≈ 84k answers) $59note 480 tokens × $1.00 / 1M + 60 × $5.00 / 1M
- line item Claude Sonnet 4.6 · input (72% of queries)$ / answer $0.0096$ / month (≈ 84k answers) $580note 3,200 tokens × $3.00 / 1M · only on routed queries
- line item Claude Sonnet 4.6 · output (72% of queries)$ / answer $0.0054$ / month (≈ 84k answers) $326note 360 tokens × $15.00 / 1M · streamed
- line item voyage-3-large embeddings (avg query)$ / answer $0.00036$ / month (≈ 84k answers) $30note ≈ 3,000 tokens × $0.12 / 1M
- line item pgvector · RDS db.m6i.large (BAA-eligible)$ / answer —$ / month (≈ 84k answers) $284note Postgres 16 · embeddings + anchor-id index
- line item Algolia · production tier (existing line)$ / answer —$ / month (≈ 84k answers) $0note already-paid line · reused as RRF lexical lane
- line item g5.xlarge reranker (24/7 in VPC)$ / answer —$ / month (≈ 84k answers) $378note BAAI bge-reranker-large self-host
- line item Cloudflare Workers · edge + audit shipping$ / answer —$ / month (≈ 84k answers) $96note Workers Paid + Workers AI gateway
- line item Langfuse self-hosted (t3.medium)$ / answer —$ / month (≈ 84k answers) $67note trace store; 90-day hot / 7-yr cold
- line item All-in monthly$ / answer ≈ $0.0227$ / month (≈ 84k answers) ≈ $1,820note vs. ≈ $9,400 / mo to add one tier-1 support engineer
Token costs use Anthropic's public May-2026 pricing: Sonnet 4.6 at $3 / 1M input + $15 / 1M output; Haiku 4.5 at $1 / 1M input + $5 / 1M output. Infra costs are AWS US-east-2 list price. Volume of 84k answers/month is steady-state after rollout, of which 28% terminate at Haiku without ever invoking Sonnet. That routing line is what makes the math reconcile.
What's in the frozen 1,260-query set
| category | items | what it checks | ci-gate threshold |
|---|---|---|---|
| Docs-recoverable golds | 800 | labelled answer + correct anchor_id on real (de-identified) past tickets | ≥ 0.90 recall@5 + groundedness |
| Not-docs-recoverable | 220 | agent must refuse · no answer surface · ticket draft only | ≥ 0.95 refusal recall |
| Adversarial (synonyms, jailbreaks, ambiguity) | 140 | synonym traps, prompt-injection attempts, intentionally ambiguous queries | 100% refusal on listed must-refuse · 0 jailbreaks |
| Routing-only (Haiku terminates) | 100 | greetings, product-list lookups, FAQ; Sonnet should never fire | ≥ 0.95 router accuracy |
- category Docs-recoverable goldsitems 800what it checks labelled answer + correct anchor_id on real (de-identified) past ticketsci-gate threshold ≥ 0.90 recall@5 + groundedness
- category Not-docs-recoverableitems 220what it checks agent must refuse · no answer surface · ticket draft onlyci-gate threshold ≥ 0.95 refusal recall
- category Adversarial (synonyms, jailbreaks, ambiguity)items 140what it checks synonym traps, prompt-injection attempts, intentionally ambiguous queriesci-gate threshold 100% refusal on listed must-refuse · 0 jailbreaks
- category Routing-only (Haiku terminates)items 100what it checks greetings, product-list lookups, FAQ; Sonnet should never fireci-gate threshold ≥ 0.95 router accuracy
Eval set is frozen: items only added, never edited. Docs lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; release engineer can over-ride with a signed CHANGELOG entry.
What runs every week,
and who owns it.
Production ops is part of the build, not an afterthought. Four controls keep the groundedness watchdog honest after cutover.
Watchdog review meeting
Every answer the user thumbs-down or the groundedness watchdog flagged sub-0.7 is opened. Drift that looks systematic (>3 same pattern/wk) becomes a JIRA ticket against the eval set.
Trace retention
Langfuse self-hosted. Searchable by confidence band, by user-feedback signal, by anchor_id match status.
On-call rotation
Two engineers per week. 99.5% pipeline-availability SLO + p95 ≤ 2.5s first-token SLO on the streamed answer surface.
Editorial-drift sweep
Sentry catches in-product render errors on citation chips. Mostly anchor_id rot from a renamed docs page. Fixed in the next PR-merge cycle.
The timeline
including the week we almost cut.
Five stages, milestone-billed. The week-5 shadow run caught a synonym-hallucination case. The model confidently returned the wrong product's answer because the cited anchor_id was real but from a different product surface. We halted promotion, tightened the validator, re-ran the groundedness eval, and only then cut over. The honest version of `7 weeks` includes the days we sat on our hands.
- Week 1
Discovery + eval set
One week with the support lead, the docs lead, and an SRE who owns the help-center. We sampled 1,200 closed tickets from the last 90 days and labelled each one `docs-recoverable` or `not-docs-recoverable` with the docs team. That sample produced the frozen 1,260-item eval set: 800 docs-recoverable queries with their correct doc-anchor target, 220 not-docs-recoverable queries (the agent must refuse), 140 adversarial queries (synonym traps, jailbreaks, intentional ambiguity), and 100 routing-only queries Haiku should handle without Sonnet at all.
Frozen 1,260-item eval set + acuity-shaped scoring rubric - Weeks 2–3
Corpus + dual-index build
Ingested 12,400 MDX pages into pgvector 0.7 with anchor-preserving chunking: 512 tokens per chunk, 80-token overlap, sentence-anchored, never split mid-code-fence. Each chunk minted an anchor_id of shape doc_[slug]#[anchor-slug], matching the regex the answer schema enforces downstream. Algolia indexed the same corpus on the lexical side, reusing the synonym table the docs team already maintains. RRF fusion at k=60 tuned on a held-out eval slice.
Hybrid retrieval at 0.93 recall@5 on the eval set - Week 4
Routing + cited-answer agent
Claude Haiku 4.5 routes intent: greeting / product-list / docs-question / out-of-scope. Docs-questions go to Claude Sonnet 4.6 with `response_format: json_schema` set to the DocsAnswer shape. The schema is the contract: every claim must cite an anchor_id, confidence is bounded 0–1, refusal is an explicit boolean. Cloudflare Workers wraps the whole pipeline for edge streaming + audit shipping into Langfuse.
End-to-end answer pipeline behind a beta flag - Week 5
Shadow run: synonym hallucination caught
Two-week shadow run against the existing answer-bot. Day 3 the support lead flagged a case: a user had asked about `API key`, the model retrieved chunks about `secret tokens` (the docs use both terms in different products), and confidently returned an answer pointing at the wrong product's auth flow. The grounded-citation regex had passed because the model picked a real anchor_id. It just wasn't the right one. We halted promotion, tightened the validator to require that the cited chunk's pathway-id match the routed product context, ran the groundedness eval LLM-as-judge against the full shadow slice, and only then promoted.
Groundedness eval lifted 0.88 → 0.95 · false-anchor rate cut 6.2% → 0.9%Walk-away point - Weeks 6–7
Cutover + groundedness watchdog
Promoted to the help-center search surface with the old keyword search retained in active-standby for 30 days. 5% of live answers sampled by an LLM-as-judge groundedness watchdog, with a manual review queue surfacing every disagreement to the docs team. The watchdog is also wired to Sentry: any answer flagged below 0.7 groundedness pages the on-call. The old answer-bot stays available behind a `legacy` query flag for the support team for 60 more days while diffs are reviewed.
Production cutover · groundedness watchdog + 5% live sample
How we know
it works.
The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 1,260. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.
Sample size for the headline deflection number (≈ 64% docs-recoverable tickets resolved at confidence ≥ 0.8) is n=3,400 user sessions across the 30-day shadow window; the figure is a 95% confidence interval, not a point estimate. Recall@5 baseline is the prior keyword-search bot on the same eval slice. False-anchor rate is the share of answers where every cited anchor_id was real but at least one belonged to a different product surface than the routed intent. That was the kill-point failure mode that drove the week-5 halt. Refusal rate is by-design; it is the share of queries where the agent decided it could not answer.
The four shapes we turn down
before scoping a pilot.
A Claude RAG agent built on these patterns will mislead users in any of the following situations. We turn down the engagement before a pilot is scoped.
Docs corpus isn't load-bearing
If under 30% of inbound questions are recoverable from existing docs, retrieval has nothing to retrieve. The right answer is to write the docs first; an agent over thin docs is a hallucination engine.
Anchor stability isn't guaranteed
If the docs team renames or restructures pages weekly without a redirect contract, citation links rot and the agent stops being verifiable. We require anchor_ids to be immutable or to honor redirects.
Confident-wrong answers can hurt users
A docs agent has a narrow blast radius. A regulated-domain agent (medical, legal, financial advice) has a wider one. We don't ship the same architecture into those domains without regulated-industry guardrails.
Team won't review the watchdog feed
If docs lead and support lead aren't going to open the weekly Langfuse review for the first six months, the groundedness watchdog drifts and nobody catches it. The eval set is necessary, not sufficient.
What buyers ask first.
Real answers, no hedging.
What is Claude RAG?
Why Claude Sonnet 4.6 + Haiku 4.5 instead of one model?
How accurate is the Claude RAG pipeline?
What does Claude RAG cost to run at this scale?
How long does it take to build a Claude RAG agent?
What is the difference between Claude RAG and Anthropic RAG?
Can we use Claude RAG with pgvector or do we need Pinecone?
When should we NOT ship a Claude RAG agent?
Where this case study
points back to.
Each link below covers a pillar that fed into this build, or that a similar build on your stack would draw from.
Claude Development
The Claude pillar: Sonnet 4.6 + Haiku 4.5 integration patterns, forced JSON, prompt caching, BAA-eligible deployment paths.
AI Agent Development
The agent pillar: ReAct, plan-and-execute, hierarchical multi-agent recipes. Same eval-first loop used on this docs RAG build.
AI Chatbot Development
Production chatbots on Claude + GPT: RAG, guardrails, citation-first answer surfaces.
All AI Case Studies
Six AI case studies: RAG, agents, voice, and chatbots. Same operator detail across every page.
AI Consulting
fixed-fee discovery audit. We map the workflow, scope the eval set, and tell you whether it's case-study-shaped.
AI Integration Services
Plug Claude into Zendesk, Intercom, Salesforce, Slack, and your existing docs source.
AI Software Development Company
How a product-docs RAG build fits inside a broader AI development services engagement: retrieval + chunking + reranker + UI shell + eval harness.
AI Knowledge Base
The umbrella service this case study demonstrates. RAG over product docs is one shape of a custom AI knowledge base. Same retrieval + reranker + eval harness, productized.
Want a case study like this
for your stack?
Book a fixed-fee discovery audit. We'll review the docs corpus, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. About one audit in five ends with `you don't need this. Buy the platform, here's the SOW for integration.`