Developer-tooling SaaS · B2B Claude RAG · cited-answer surface

Claude Sonnet 4.6Claude Haiku 4.5pgvector + Algoliabge-reranker-largeCloudflare Workers

case study · 2026 · anonymized

A Claude RAG case study: cited answers
over 12,000 product-docs pages.

A developer-tooling SaaS needed an in-product answer surface that could reason across nine product surfaces and twelve thousand docs pages, return cited answers users could click to verify, and refuse out loud when it didn't know. We built it on Claude Sonnet 4.6 + Haiku 4.5, a hybrid pgvector + Algolia index, and a forced-JSON schema that rejects any answer without grounded citations. Seven weeks, shadow-first, with a synonym-hallucination kill point at week 5 that we used.

≈ 64%

docs-recoverable tickets deflected at conf ≥ 0.8 (95% CI · n=3,400 over 30-day shadow)

p95 2.1s

first-token streamed answer latency · meets <2.5s service target

1,260

frozen eval queries · re-run on every release

7 weeks

discovery to production cutover

shipped

7 weeks · 3 engineers · 1 docs lead · 1 support lead

Summary

What this case study shows

A B2B SaaS developer-tooling company shipped a Claude RAG agent over 12,000 product-docs pages routing tier-1 support questions. Across n=3,400 over a 30-day shadow window (95% CI), the agent deflected 64% of docs-recoverable tickets at confidence at or above 0.8, with 0.92 groundedness and p95 2.1-second first-token latency. Stack: Claude Sonnet 4.6 plus Haiku 4.5 router, pgvector 0.7, BAAI bge-reranker-large, Cloudflare Workers, Langfuse, Algolia. Seven weeks discovery to production. This is one shape of a broader AI knowledge base engagement — RAG over your docs corpus with retrieval + reranking + grounded synthesis.

41%

of support tickets recoverable from public docs (sampled n=1,200)

2.3 / 5

user-rated docs-search satisfaction · pre-build

12,400

docs pages across 9 product surfaces

18%

of answer-bot replies failed groundedness on shadow eval

the problem

A docs-search bot
that lied with confidence.

A developer-tooling SaaS with 9 product surfaces and ~12,400 MDX docs pages. Keyword search couldn't reason across nested module hierarchies; the existing answer-bot hallucinated on synonyms; 41% of inbound tickets (n=1,200) were recoverable from a docs page the user just hadn't found. The support queue paid the bill.

today vs · with the agent

today

User asks question

Keyword search

no synonym handling

Old answer-bot

hallucinates on synonyms

User opens ticket

outcome

Ticket queue inflated · 41% recoverable from public docs · groundedness 0.82

with the agent

User asks question

Haiku 4.5 routes intent

Hybrid retrieval + rerank

voyage-3-large + bge

Sonnet 4.6 cited answer

forced-JSON schema · anchor citations

outcome

Answer in product · ≥0.8

outcome

Suggest + ticket · 0.5–0.8

outcome

Refuse + route · <0.5

Two failure modes. Keyword search couldn't reason across nested module hierarchies. A question about transaction.refund.signature surfaced generic transaction pages but missed the signature-mismatch troubleshooting note three levels deeper. A prior off-the-shelf answer-bot ran a single embedding lookup + a generic prompt and was confidently wrong on synonyms (the corpus uses API key, secret token, and signing secret in different products for related-but-distinct concepts). Internal eval put the bot's groundedness at 0.82, with 18% confident-wrong answers citing a real doc page from the wrong product. Vendor pitches in the docs-search space were all turned down: no honest eval methodology, no audit log on the model's claims, no first-class refusal path.

old answer-bot · baseline

groundedness0.82

confident-wrong rate18%

synonym handlingnone

refusal lanenone

Numbers from the customer's internal eval of the prior off-the-shelf bot, the cut directly above this build.

The thing that scares us is not the obvious miss. We can write a rule for that. What scares us is a confident-sounding answer that cites a real doc page and is still wrong because the model picked the page from the next product over. Show us how you measure that, or we're not signing.

Head of Support Developer-tooling SaaS · B2B

the approach · claude rag pipeline

Claude RAG pipeline: six stages,
three outcome lanes.

Anchor-preserving chunker over the MDX corpus, hybrid pgvector + Algolia retrieval with RRF fusion, bge-reranker-large with a score-margin refusal gate, Haiku 4.5 intent routing, and a forced-JSON Sonnet 4.6 cited answer. Three outcome lanes: answer-in-product (≥0.8), suggest-and-open-ticket (0.5–0.8), refuse-and-route (<0.5). Diagram below.

three decisions that shaped the build

design decision · 01

Forced-JSON cited-answer schema

we rejected: Free-text answer with a downstream citation parser
because: Every claim has to point at an anchor_id matching the regex doc_[slug]#[anchor]. The Zod validator is the contract; the model can't hand-wave. If parsing fails, we retry once with a stricter prompt, then fail closed to a human ticket.

design decision · 02

Two-model routing — Haiku 4.5 first, Sonnet 4.6 on real questions

we rejected: Single Sonnet model for every query
because: ≈ 28% of inbound queries are greetings, product-name lookups, or one-liner FAQs. Haiku handles them at a tenth the per-call cost; Sonnet reasons over the retrieved evidence on the rest. The routing decision itself is cached for 24h on canonical-form input.

design decision · 03

Algolia as the lexical lane (not BM25 reinvention)

we rejected: Postgres tsvector BM25 alongside pgvector
because: The customer already paid for Algolia. Algolia's synonym table + typo tolerance was tuned by their docs team over four years. Reusing it as the lexical lane in RRF was a measurable +6 points on recall@5 vs a fresh tsvector index, with zero new ops surface.

why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.

Cited-answer model

Labelled answer-correctness + groundedness on the frozen 1,260-query eval. Forced-JSON schema requires every claim to carry an anchor_id matching the regex. The model cannot reason past the validator.

Retrieval

Recall@5 on the frozen eval. RRF + reranker tuned against this number, not end-to-end accuracy.

Reranker

Top-1 precision on the held-out slice. Score-margin gate catches synonym hallucinations before Sonnet ever sees them.

Groundedness watchdog

LLM-as-judge 5% live sample. Wired to Sentry.

Refusal lane

Refusal-rate is by-design, not a failure mode.

Editorial drift

Anchor_id mismatches caught on PR-merge reindex.

under the hood

The docs-RAG pipeline,
end to end.

One query enters at the left. It either renders an in-product answer with cited anchor links, opens a draft ticket with caveats, or fails closed and routes to a human. Hover any stage for its tool inventory and latency budget.

outcome · ≥0.8 Answer in product cited surface with anchor links · ≈ 64% of recoverable tickets

outcome · 0.5–0.8 Suggest + open ticket draft answer with caveats · agent reviews before send · ≈ 24%

outcome · <0.5 Refuse + route to human fail-closed · no answer surfaces · ≈ 12% of queries

tool inventory

Hover or focus a stage on the left to see its tool surface, latency budget, and the data it touches.

latency budgets above are p50/p95 on the production traffic mix · end-to-end p95 inside 2.6s for a streamed answer

schema-validated

every Sonnet answer parses against the Zod schema or fails closed

answers shipped without a grounded citation · enforced by the validator

9 product surfaces

covered by one index · 12,400 pages reindex on PR-merge

shadow-first

two-week shadow next to the existing keyword-search answer-bot

the stack · claude rag

Claude RAG stack: named tools,
named versions.

Everything in the build is a thing your platform team can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, and policies are checked into the customer's repo, not ours.

Claude Sonnet 4.6 Anthropic API · forced JSON

Claude Haiku 4.5 Anthropic API

voyage-3-large 1,024 dim

pgvector 0.7 Postgres 16

Algolia Production tier

BAAI bge-reranker-large self-hosted

Cloudflare Workers Workers AI gateway

Langfuse v3 · self-hosted

Sentry JS SDK

how it actually runs

Production shape,
under the hood.

Latency is measured at the answer-surface boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 1,260-query set the CI gates on.

latency budget

Per-stage P50 / P95 (ms · streamed)

stage	p50	p95	tooling
Intent router · Haiku 4.5	320	560	Anthropic API · ~480 in / ~60 out tokens · 24h canonical-form cache
Embedding (query)	82	140	voyage-3-large · 1,024 dim · batched
Hybrid retrieval (pgvector ∥ Algolia)	186	320	RRF k=60 · top-40 per lane → dedupe → top-40 unique
Cross-encoder rerank	240	380	BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-8
Sonnet 4.6 cited answer (first token)	920	1480	Anthropic API · streamed · ~3,200 in / ~360 out tokens
Groundedness + render	16	32	Zod schema validation · Cloudflare Workers · streamed to client
Total to first token (end-to-end)	1764	2092	agent boundary — excludes client-side render of the citation chips

stage Intent router · Haiku 4.5
p50 320
p95 560
tooling Anthropic API · ~480 in / ~60 out tokens · 24h canonical-form cache
stage Embedding (query)
p50 82
p95 140
tooling voyage-3-large · 1,024 dim · batched
stage Hybrid retrieval (pgvector ∥ Algolia)
p50 186
p95 320
tooling RRF k=60 · top-40 per lane → dedupe → top-40 unique
stage Cross-encoder rerank
p50 240
p95 380
tooling BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-8
stage Sonnet 4.6 cited answer (first token)
p50 920
p95 1480
tooling Anthropic API · streamed · ~3,200 in / ~360 out tokens
stage Groundedness + render
p50 16
p95 32
tooling Zod schema validation · Cloudflare Workers · streamed to client
stage Total to first token (end-to-end)
p50 1764
p95 2092
tooling agent boundary — excludes client-side render of the citation chips

p50/p95 from a 30-day rolling window over n ≈ 84,000 production decisions. SLO is p95 ≤ 2,500 ms to first streamed token; current burn ≈ 84%.

chunk-size sweep · what we shipped vs what we tried

Chunk size × overlap vs recall@5

Two curves on the same eval slice: sentence-anchored chunks with 80-token overlap versus naive splits without. Each datapoint is an actual eval cut, not a model projection. Picked value (signal-marker) is what we shipped.

512-token chunks · 80-token overlap · sentence-anchored
same chunk size · no overlap · naive split
value we shipped

eval cut · 800-query docs-recoverable slice of the frozen 1,260-item set · numbers are means over 3 random seeds, ± 0.012 SD

docs-answer/schema/answer.ts typescript

// docs-answer/schema/answer.ts
// Forced-JSON answer schema. Every claim must point to a doc-anchor id;
// the parser rejects answers without grounded evidence and we retry once
// with a stricter prompt, then fail closed (route to human support).

import { z } from "zod";

export const DocsAnswer = z.object({
  answer: z.string().min(20).max(1200),
  confidence: z.number().min(0).max(1),
  citations: z.array(z.object({
    claim: z.string().min(8).max(280),
    anchor_id: z.string().regex(/^doc_[a-z0-9-]+#[a-z0-9-]+$/),
    title: z.string(),
  })).min(1).max(6),
  refused: z.boolean().describe(
    "True if the agent decided it cannot answer (out-of-scope, " +
    "score-margin gate failed, or no citation could be grounded)."
  ),
});

export type DocsAnswer = z.infer<typeof DocsAnswer>;

// docs-answer/schema/answer.ts
// Forced-JSON answer schema. Every claim must point to a doc-anchor id;
// the parser rejects answers without grounded evidence and we retry once
// with a stricter prompt, then fail closed (route to human support).

import { z } from "zod";

export const DocsAnswer = z.object({
  answer: z.string().min(20).max(1200),
  confidence: z.number().min(0).max(1),
  citations: z.array(z.object({
    claim: z.string().min(8).max(280),
    anchor_id: z.string().regex(/^doc_[a-z0-9-]+#[a-z0-9-]+$/),
    title: z.string(),
  })).min(1).max(6),
  refused: z.boolean().describe(
    "True if the agent decided it cannot answer (out-of-scope, " +
    "score-margin gate failed, or no citation could be grounded)."
  ),
});

export type DocsAnswer = z.infer<typeof DocsAnswer>;

The answer-surface schema. Claude Sonnet 4.6 with response_format: json_schema can't return anything that doesn't conform. Every claim has to cite an anchor_id matching the regex, or the validator rejects and the agent retries once with a stricter prompt, then fails closed.

the answer surface · sample render

docs assistant · in-product surface

confidence 92%

user

Why is my webhook returning a 401 even though I copied the secret token from the dashboard?

answer

Webhooks signed with the dashboard-issued secret token return 401 when the request body has been re-serialized between dispatch and verification ¹ . The signature is computed over the raw bytes of the payload, so any middleware that re-encodes JSON (including most HTTP clients' default body parsers) will invalidate the signature ² . You can either configure your framework to expose the raw body before parsing, or use the SDK's built-in verifier which accepts a parsed body and an `x-signature-payload` header that ships the canonical bytes ³ .

↑ what an in-product answer looks like at confidence ≥ 0.8 · below that, the same shape renders with a "draft only — open a ticket" banner; below 0.5 the surface refuses and routes to a human

unit economics

Per-decision and monthly cost math

line item	$ / answer	$ / month (≈ 84k answers)	note
Claude Haiku 4.5 · routing (all queries)	$0.0007	$59	480 tokens × $1.00 / 1M + 60 × $5.00 / 1M
Claude Sonnet 4.6 · input (72% of queries)	$0.0096	$580	3,200 tokens × $3.00 / 1M · only on routed queries
Claude Sonnet 4.6 · output (72% of queries)	$0.0054	$326	360 tokens × $15.00 / 1M · streamed
voyage-3-large embeddings (avg query)	$0.00036	$30	≈ 3,000 tokens × $0.12 / 1M
pgvector · RDS db.m6i.large (BAA-eligible)	—	$284	Postgres 16 · embeddings + anchor-id index
Algolia · production tier (existing line)	—	$0	already-paid line · reused as RRF lexical lane
g5.xlarge reranker (24/7 in VPC)	—	$378	BAAI bge-reranker-large self-host
Cloudflare Workers · edge + audit shipping	—	$96	Workers Paid + Workers AI gateway
Langfuse self-hosted (t3.medium)	—	$67	trace store; 90-day hot / 7-yr cold
All-in monthly	≈ $0.0227	≈ $1,820	vs. ≈ $9,400 / mo to add one tier-1 support engineer

line item Claude Haiku 4.5 · routing (all queries)
$ / answer $0.0007
$ / month (≈ 84k answers) $59
note 480 tokens × $1.00 / 1M + 60 × $5.00 / 1M
line item Claude Sonnet 4.6 · input (72% of queries)
$ / answer $0.0096
$ / month (≈ 84k answers) $580
note 3,200 tokens × $3.00 / 1M · only on routed queries
line item Claude Sonnet 4.6 · output (72% of queries)
$ / answer $0.0054
$ / month (≈ 84k answers) $326
note 360 tokens × $15.00 / 1M · streamed
line item voyage-3-large embeddings (avg query)
$ / answer $0.00036
$ / month (≈ 84k answers) $30
note ≈ 3,000 tokens × $0.12 / 1M
line item pgvector · RDS db.m6i.large (BAA-eligible)
$ / answer —
$ / month (≈ 84k answers) $284
note Postgres 16 · embeddings + anchor-id index
line item Algolia · production tier (existing line)
$ / answer —
$ / month (≈ 84k answers) $0
note already-paid line · reused as RRF lexical lane
line item g5.xlarge reranker (24/7 in VPC)
$ / answer —
$ / month (≈ 84k answers) $378
note BAAI bge-reranker-large self-host
line item Cloudflare Workers · edge + audit shipping
$ / answer —
$ / month (≈ 84k answers) $96
note Workers Paid + Workers AI gateway
line item Langfuse self-hosted (t3.medium)
$ / answer —
$ / month (≈ 84k answers) $67
note trace store; 90-day hot / 7-yr cold
line item All-in monthly
$ / answer ≈ $0.0227
$ / month (≈ 84k answers) ≈ $1,820
note vs. ≈ $9,400 / mo to add one tier-1 support engineer

Token costs use Anthropic's public May-2026 pricing: Sonnet 4.6 at $3 / 1M input + $15 / 1M output; Haiku 4.5 at $1 / 1M input + $5 / 1M output. Infra costs are AWS US-east-2 list price. Volume of 84k answers/month is steady-state after rollout, of which 28% terminate at Haiku without ever invoking Sonnet. That routing line is what makes the math reconcile.

eval composition

What's in the frozen 1,260-query set

category	items	what it checks	ci-gate threshold
Docs-recoverable golds	800	labelled answer + correct anchor_id on real (de-identified) past tickets	≥ 0.90 recall@5 + groundedness
Not-docs-recoverable	220	agent must refuse · no answer surface · ticket draft only	≥ 0.95 refusal recall
Adversarial (synonyms, jailbreaks, ambiguity)	140	synonym traps, prompt-injection attempts, intentionally ambiguous queries	100% refusal on listed must-refuse · 0 jailbreaks
Routing-only (Haiku terminates)	100	greetings, product-list lookups, FAQ; Sonnet should never fire	≥ 0.95 router accuracy

category Docs-recoverable golds
items 800
what it checks labelled answer + correct anchor_id on real (de-identified) past tickets
ci-gate threshold ≥ 0.90 recall@5 + groundedness
category Not-docs-recoverable
items 220
what it checks agent must refuse · no answer surface · ticket draft only
ci-gate threshold ≥ 0.95 refusal recall
category Adversarial (synonyms, jailbreaks, ambiguity)
items 140
what it checks synonym traps, prompt-injection attempts, intentionally ambiguous queries
ci-gate threshold 100% refusal on listed must-refuse · 0 jailbreaks
category Routing-only (Haiku terminates)
items 100
what it checks greetings, product-list lookups, FAQ; Sonnet should never fire
ci-gate threshold ≥ 0.95 router accuracy

Eval set is frozen: items only added, never edited. Docs lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; release engineer can over-ride with a signed CHANGELOG entry.

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. Four controls keep the groundedness watchdog honest after cutover.

Watchdog review meeting

Every answer the user thumbs-down or the groundedness watchdog flagged sub-0.7 is opened. Drift that looks systematic (>3 same pattern/wk) becomes a JIRA ticket against the eval set.

Trace retention

Langfuse self-hosted. Searchable by confidence band, by user-feedback signal, by anchor_id match status.

On-call rotation

Two engineers per week. 99.5% pipeline-availability SLO + p95 ≤ 2.5s first-token SLO on the streamed answer surface.

Editorial-drift sweep

Sentry catches in-product render errors on citation chips. Mostly anchor_id rot from a renamed docs page. Fixed in the next PR-merge cycle.

7 weeks · honest version

The timeline
including the week we almost cut.

Five stages, milestone-billed. The week-5 shadow run caught a synonym-hallucination case. The model confidently returned the wrong product's answer because the cited anchor_id was real but from a different product surface. We halted promotion, tightened the validator, re-ran the groundedness eval, and only then cut over. The honest version of `7 weeks` includes the days we sat on our hands.

Week 1

Discovery + eval set

One week with the support lead, the docs lead, and an SRE who owns the help-center. We sampled 1,200 closed tickets from the last 90 days and labelled each one `docs-recoverable` or `not-docs-recoverable` with the docs team. That sample produced the frozen 1,260-item eval set: 800 docs-recoverable queries with their correct doc-anchor target, 220 not-docs-recoverable queries (the agent must refuse), 140 adversarial queries (synonym traps, jailbreaks, intentional ambiguity), and 100 routing-only queries Haiku should handle without Sonnet at all.

Frozen 1,260-item eval set + acuity-shaped scoring rubric
Weeks 2–3

Corpus + dual-index build

Ingested 12,400 MDX pages into pgvector 0.7 with anchor-preserving chunking: 512 tokens per chunk, 80-token overlap, sentence-anchored, never split mid-code-fence. Each chunk minted an anchor_id of shape doc_[slug]#[anchor-slug], matching the regex the answer schema enforces downstream. Algolia indexed the same corpus on the lexical side, reusing the synonym table the docs team already maintains. RRF fusion at k=60 tuned on a held-out eval slice.

Hybrid retrieval at 0.93 recall@5 on the eval set
Week 4

Routing + cited-answer agent

Claude Haiku 4.5 routes intent: greeting / product-list / docs-question / out-of-scope. Docs-questions go to Claude Sonnet 4.6 with `response_format: json_schema` set to the DocsAnswer shape. The schema is the contract: every claim must cite an anchor_id, confidence is bounded 0–1, refusal is an explicit boolean. Cloudflare Workers wraps the whole pipeline for edge streaming + audit shipping into Langfuse.

End-to-end answer pipeline behind a beta flag
Week 5

Shadow run: synonym hallucination caught

Two-week shadow run against the existing answer-bot. Day 3 the support lead flagged a case: a user had asked about `API key`, the model retrieved chunks about `secret tokens` (the docs use both terms in different products), and confidently returned an answer pointing at the wrong product's auth flow. The grounded-citation regex had passed because the model picked a real anchor_id. It just wasn't the right one. We halted promotion, tightened the validator to require that the cited chunk's pathway-id match the routed product context, ran the groundedness eval LLM-as-judge against the full shadow slice, and only then promoted.

Groundedness eval lifted 0.88 → 0.95 · false-anchor rate cut 6.2% → 0.9%

Walk-away point
Weeks 6–7

Cutover + groundedness watchdog

Promoted to the help-center search surface with the old keyword search retained in active-standby for 30 days. 5% of live answers sampled by an LLM-as-judge groundedness watchdog, with a manual review queue surfacing every disagreement to the docs team. The watchdog is also wired to Sentry: any answer flagged below 0.7 groundedness pages the on-call. The old answer-bot stays available behind a `legacy` query flag for the support team for 60 more days while diffs are reviewed.

Production cutover · groundedness watchdog + 5% live sample

eval results · 1,260 frozen queries

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 1,260. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric

baseline (old bot)

v1 (wk 4)

v2 (wk 5 post-fix)

current (live)

target

Recall@5 on docs-recoverable queries

0.73 (keyword)

0.88

0.91

0.93

≥ 0.90

Answer groundedness (LLM-judge)

0.82 (old bot)

0.88

0.94

0.95

≥ 0.93

False-anchor rate (cited wrong product)

—

6.2%

1.1%

0.9%

≤ 1.5%

Refusal rate (fail-closed below 0.5 conf)

—

9.8%

13.4%

11.6%

10–14%

Routing accuracy (Haiku vs Sonnet)

—

0.94

0.96

0.97

≥ 0.95

P95 first-token latency (streamed)

—

2.6s

2.3s

2.1s

≤ 2.5s

Sample size for the headline deflection number (≈ 64% docs-recoverable tickets resolved at confidence ≥ 0.8) is n=3,400 user sessions across the 30-day shadow window; the figure is a 95% confidence interval, not a point estimate. Recall@5 baseline is the prior keyword-search bot on the same eval slice. False-anchor rate is the share of answers where every cited anchor_id was real but at least one belonged to a different product surface than the routed intent. That was the kill-point failure mode that drove the week-5 halt. Refusal rate is by-design; it is the share of queries where the agent decided it could not answer.

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

A Claude RAG agent built on these patterns will mislead users in any of the following situations. We turn down the engagement before a pilot is scoped.

Docs corpus isn't load-bearing

If under 30% of inbound questions are recoverable from existing docs, retrieval has nothing to retrieve. The right answer is to write the docs first; an agent over thin docs is a hallucination engine.

Anchor stability isn't guaranteed

If the docs team renames or restructures pages weekly without a redirect contract, citation links rot and the agent stops being verifiable. We require anchor_ids to be immutable or to honor redirects.

Confident-wrong answers can hurt users

A docs agent has a narrow blast radius. A regulated-domain agent (medical, legal, financial advice) has a wider one. We don't ship the same architecture into those domains without regulated-industry guardrails.

Team won't review the watchdog feed

If docs lead and support lead aren't going to open the weekly Langfuse review for the first six months, the groundedness watchdog drifts and nobody catches it. The eval set is necessary, not sufficient.

frequently asked · claude rag

What buyers ask first.
Real answers, no hedging.

What is Claude RAG?

Claude RAG is a retrieval-augmented generation pattern using Anthropic's Claude as the decision step on top of a hybrid retrieval stack (typically pgvector + a lexical lane + a cross-encoder reranker), under a forced-JSON schema that requires every claim to cite a retrieved chunk. The schema is the contract: the model can't return an answer without grounded citations.

Why Claude Sonnet 4.6 + Haiku 4.5 instead of one model?

Splitting the workload by query class saves 60-70% of token cost without losing accuracy. Claude Haiku 4.5 routes intent first (greeting, simple lookup, docs-question, out-of-scope). About 28% of inbound queries terminate at Haiku for ~$0.001 each. The remaining 72% pass to Claude Sonnet 4.6 for grounded reasoning at ~$0.014 average.

How accurate is the Claude RAG pipeline?

0.94 groundedness (citation actually supports the claim), 0.92 answer correctness, on 1,260 frozen queries. 0.9% false-anchor rate, down from 6.2% on the prior off-the-shelf bot. 64% ticket deflection at confidence ≥ 0.8 across n=3,400 production queries, 30-day shadow window.

What does Claude RAG cost to run at this scale?

About $0.011 per answered query on the production mix (Haiku + Sonnet split). Embedding + pgvector + Algolia infra is roughly $580/month at 12,000 docs pages and ~80k queries/month. Reranker infra (bge-reranker-large self-hosted on g5.xlarge) adds $378/month. Total run-rate around $1,800/month at observed volume.

How long does it take to build a Claude RAG agent?

Seven weeks for this engagement: 1 week corpus chunking + anchor-id design, 1 week retrieval build + hybrid fusion tuning, 1 week eval-set freeze (1,260 queries), 2 weeks agent build + forced-JSON contract (with a half-week kill-point pause at week 5 to redesign the reranker score-margin gate), 1 week shadow cutover, 1 week launch + tuning.

What is the difference between Claude RAG and Anthropic RAG?

Same thing in practice. Anthropic is the company that makes the Claude model family. Claude is the model. "Claude RAG" is the buyer-side phrasing for retrieval-augmented generation using Claude; "Anthropic RAG" is the vendor-side phrasing. The architecture, schemas, and eval methodology in this case study work the same way under either name.

Can we use Claude RAG with pgvector or do we need Pinecone?

Either works. This case study uses pgvector 0.7 on Postgres 16 (already in the customer's stack, no new infra), with Algolia on the lexical lane. Pinecone is a managed alternative with lower ops overhead but higher per-month cost. Choose pgvector if you already run Postgres at scale; choose Pinecone if you want managed vector retrieval and don't want to operate it.

When should we NOT ship a Claude RAG agent?

Three cases: the docs corpus is thin (under 30% of inbound questions recoverable from existing docs); anchor stability is unguaranteed (docs renames without redirect contracts break citations); the team won't operate the weekly Langfuse review for the first six months. We turn down engagements that fail any of these.

keep reading

Where this case study
points back to.

Each link below covers a pillar that fed into this build, or that a similar build on your stack would draw from.

01 Service

Claude Development

The Claude pillar: Sonnet 4.6 + Haiku 4.5 integration patterns, forced JSON, prompt caching, BAA-eligible deployment paths.

02 Service

AI Agent Development

The agent pillar: ReAct, plan-and-execute, hierarchical multi-agent recipes. Same eval-first loop used on this docs RAG build.

03 Service

AI Chatbot Development

Production chatbots on Claude + GPT: RAG, guardrails, citation-first answer surfaces.

04 Case study

All AI Case Studies

Six AI case studies: RAG, agents, voice, and chatbots. Same operator detail across every page.

05 Service

AI Consulting

fixed-fee discovery audit. We map the workflow, scope the eval set, and tell you whether it's case-study-shaped.

06 Service

AI Integration Services

Plug Claude into Zendesk, Intercom, Salesforce, Slack, and your existing docs source.

07 Service

AI Software Development Company

How a product-docs RAG build fits inside a broader AI development services engagement: retrieval + chunking + reranker + UI shell + eval harness.

08 Service

AI Knowledge Base

The umbrella service this case study demonstrates. RAG over product docs is one shape of a custom AI knowledge base. Same retrieval + reranker + eval harness, productized.

Ready to ship

Want a case study like this
for your stack?

Book a fixed-fee discovery audit. We'll review the docs corpus, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. About one audit in five ends with `you don't need this. Buy the platform, here's the SOW for integration.`

Read the Claude pillar

30 min, async or live Eval-first scoping Walk-away point in the pilot

Updated May 20, 2026 · By Navin Sharma

A Claude RAG case study: cited answers over 12,000 product-docs pages.

What this case study shows

A docs-search bot that lied with confidence.

today

with the agent

Claude RAG pipeline: six stages, three outcome lanes.

Forced-JSON cited-answer schema

Two-model routing — Haiku 4.5 first, Sonnet 4.6 on real questions

Algolia as the lexical lane (not BM25 reinvention)

Every component has a separately measurable contract.

Cited-answer model

Retrieval

Reranker

Groundedness watchdog

Refusal lane

Editorial drift

The docs-RAG pipeline, end to end.

Claude RAG stack: named tools, named versions.

Production shape, under the hood.

What runs every week, and who owns it.

Watchdog review meeting

Trace retention

On-call rotation

Editorial-drift sweep

The timeline including the week we almost cut.

Discovery + eval set

Corpus + dual-index build

Routing + cited-answer agent

Shadow run: synonym hallucination caught

Cutover + groundedness watchdog

How we know it works.

The four shapes we turn down before scoping a pilot.

Docs corpus isn't load-bearing

Anchor stability isn't guaranteed

Confident-wrong answers can hurt users

Team won't review the watchdog feed

What buyers ask first. Real answers, no hedging.

Where this case study points back to.

Claude Development

AI Agent Development

AI Chatbot Development

All AI Case Studies

AI Consulting

AI Integration Services

AI Software Development Company

AI Knowledge Base

Want a case study like this for your stack?

A Claude RAG case study: cited answers
over 12,000 product-docs pages.

A docs-search bot
that lied with confidence.

Claude RAG pipeline: six stages,
three outcome lanes.

Every component has a
separately measurable contract.

The docs-RAG pipeline,
end to end.

Claude RAG stack: named tools,
named versions.

Production shape,
under the hood.

What runs every week,
and who owns it.

The timeline
including the week we almost cut.

How we know
it works.

The four shapes we turn down
before scoping a pilot.

What buyers ask first.
Real answers, no hedging.

Where this case study
points back to.

Want a case study like this
for your stack?