RAG Development

RAG development services for systems that cite their sources.

Paiteq builds retrieval-augmented generation systems from corpus audit to live retrieval: production RAG development services on Pinecone, Qdrant, Weaviate, and pgvector. Hybrid retrieval, reranking, and citation-enforced answers. Refusal when the context is thin, not confident guesses dressed up as evidence.


Stack LlamaIndex · Pinecone · Cohere

Eval Recall · Faithfulness · Latency

Engage Pilot · Build · Rescue · Migrate

Compliance PII-scrubbed · SOC-2-ready

001 / SURFACES

Eight RAG surfaces we ship.

Each surface below is a workload we've shipped to production, with the corpus shape, the eval methodology, and the failure modes already worked out. See three in detail →

RAG pipeline development sorts cleanly by corpus shape, not by industry. A contracts-Q&A pipeline for a law firm reuses the same hierarchical chunking, hybrid retrieval, and citation-enforcement prompts as an enterprise RAG deployment over policy documents at a health-tech company. The integrations, freshness rules, and residency posture change; the retrieval pipeline shape doesn't. Sorting by corpus shape lets us reuse the eval harness and reranker tuning across clients. Sorting by industry, the way most competing pages organize themselves, hides where the engineering actually lives.

01 / DOCS ↗

Document Q&A

Ask your contracts, policies, manuals. Cited answers with paragraph-level provenance. Refusal when context is thin.

Cited · Refusal

02 / SUPPORT ↗

Support knowledge agent

RAG over docs + ticket archive. Deflects tier-1 cleanly, escalates with full context, not just a transcript dump.

Tier-1 · Tickets

03 / SEARCH ↗

Enterprise semantic search

BM25 + dense hybrid retrieval across your full corpus. Answers, not ten blue links, with an LLM knowledge base behind them.

Hybrid · Rerank

04 / CODE ↗

Code & repo Q&A

Repo-aware RAG for engineering teams. Symbol-graph + dense hybrid. Answers carry file:line citations.

Repo-aware

05 / LEGAL ↗

Contracts & compliance

Clause extraction, contract Q&A, redline review. PII-scrubbed at ingestion. A document-intelligence workload at its core.

PII-scrubbed

06 / RESEARCH ↗

Research synthesis

Multi-hop RAG over papers, market reports, internal notes. Citations to primary sources only, never fabricated page numbers.

Multi-hop

07 / EVAL ↗

Evaluation & re-ranking

Retrieval recall, faithfulness, context relevance. Measured weekly with RAGAS + Trulens, not eyeballed.

RAGAS · Trulens

08 / FRESHNESS ↗

Live + fresh corpora

Incremental embeddings, delete-and-replace, change-data-capture wired into your source systems. Stale-doc rate as a first-class metric.

CDC · Incremental
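A minimal sketch of incremental embedding with delete-and-replace, plus a stale-doc rate. It assumes a content hash is an acceptable change detector; the in-memory `index` dict stands in for a vector store, and `embed` is a hypothetical callable, not any specific provider's API.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_index(index: dict, source_docs: dict, embed) -> dict:
    """Bring `index` in line with `source_docs` and report what changed.

    index:       doc_id -> {"hash": str, "vector": list[float]}
    source_docs: doc_id -> current text in the source system
    embed:       hypothetical callable, text -> vector
    """
    stats = {"embedded": 0, "deleted": 0, "unchanged": 0}
    # Drop entries whose source document no longer exists.
    for doc_id in list(index):
        if doc_id not in source_docs:
            del index[doc_id]
            stats["deleted"] += 1
    # Re-embed only new or changed documents (delete-and-replace).
    for doc_id, text in source_docs.items():
        digest = content_hash(text)
        entry = index.get(doc_id)
        if entry is not None and entry["hash"] == digest:
            stats["unchanged"] += 1
            continue
        index[doc_id] = {"hash": digest, "vector": embed(text)}
        stats["embedded"] += 1
    return stats


def stale_doc_rate(index: dict, source_docs: dict) -> float:
    """Fraction of source docs whose indexed copy is missing or out of date."""
    if not source_docs:
        return 0.0
    stale = sum(
        1
        for doc_id, text in source_docs.items()
        if doc_id not in index or index[doc_id]["hash"] != content_hash(text)
    )
    return stale / len(source_docs)
```

In a real pipeline the change events would come from CDC rather than a full scan of `source_docs`; the hash comparison is only the simplest way to show the idea.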

SURFACE × INDUSTRY

Where we've shipped each RAG surface. Strength reflects production volume, not theoretical fit; empty cells mean we either haven't done it yet or the workload didn't justify a retrieval pipeline.

Rows (surfaces): Docs Q&A · Support / KB · Code repo · Legal / contract · Research / memo

Columns (industries): B2B SaaS · Health-tech · Mfg · Fin-tech · Legal · E-comm · Ed-tech · Logistics

Legend: Possible fit · Good fit · Primary vertical

[Grid: per-cell strengths are not recoverable in text form]

Heavier columns: legal, fintech, healthcare. The pattern is unsurprising: those are the industries where citations, refusal, and provenance pay back the hardest. Lighter columns: e-commerce and ed-tech, where workloads tilt toward generation and recommendation more than retrieval. The grid isn't a roadmap; it's a record. We'll talk you out of a RAG-shaped engagement if the workload is actually a generation problem (better fit for our generative AI practice) or a tool-use problem (better fit for autonomous tool-use agents).

One more pattern worth naming: about 25% of the engagements that start as RAG builds end up as RAG consulting work instead. The client comes in wanting a built pipeline; the corpus audit reveals that the source data isn't yet in a shape any pipeline can win against. The right next step is a six-week consulting engagement to get the source data structured (header conventions, OCR cleanup, deduplication, freshness pipeline) before any retrieval code ships. Sometimes we hand that work to your team; sometimes we run it ourselves. Either way, we say "not yet" on the build until the corpus is ready, because shipping a RAG against a broken corpus is how you end up calling Rescue six months later.

002 / SERVICES

RAG development services: pick where to start.

Four engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done. Full engagement-model breakdown below →

Choosing a RAG development services partner is mostly about choosing the right starting shape. Buyers who walk in with a scoped corpus and one question type ship to production around 75% of the time. Buyers who walk in with "we want a RAG over everything we have" ship around 30% of the time, usually after a re-scope. The four shapes below map onto the starting points we see most often; pick the one that matches what you actually have, not what you wish you had. Every shape is a RAG development service we've shipped 10+ times; the deliverables and gate criteria are locked in by repetition, not invented for your engagement. Each starts by scoping the LLM knowledge base boundary (what the system can and can't answer) before any ingest code ships.

01 / PILOT ↗

RAG Pilot

One corpus, one question shape, graded against a real eval set. Demo-ready in 2–4 weeks. Fixed scope.

2–4 wks · Fixed scope

02 / BUILD ↗

Production RAG Build

Full pipeline: ingestion, hybrid retrieval, reranker, eval gates, observability. 8–14 weeks. Fixed scope.

8–14 wks

03 / RESCUE ↗

RAG Rescue

Your RAG shipped but hallucinates, retrieves wrong, or can't keep up. We diagnose, fix, re-evaluate against your real data.

4–6 wks · Audit + fix

04 / MIGRATION ↗

Vector DB Migration

Pinecone → Qdrant, Weaviate → pgvector, etc. Dual-write, eval-validated cutover, zero downtime. 6–10 weeks.

6–10 wks

A practical decision tree: if the corpus is scoped but the retrieval quality is unproven, start with a Pilot. If you know the corpus works and you need production discipline (CDC, observability, eval gates), start with a Production RAG Build. If you've already shipped RAG and it hallucinates, retrieves wrong, or can't keep up, start with Rescue. If the only thing wrong with the current system is the vector store underneath it, start with Migration. Each of these is a real RAG development service we run on a fixed scope, and our RAG consulting practice will tell you to start narrow: moving up is cheap; over-scoping isn't. Week-by-week scope on each, further down →
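The decision tree above can be written down as a small function. This is only a sketch: the three boolean inputs and the function name are assumptions made for illustration, and a real scoping call weighs more than three yes/no answers.

```python
def pick_engagement(has_shipped_rag: bool,
                    only_vector_store_is_wrong: bool,
                    retrieval_quality_proven: bool) -> str:
    """Map the starting situation onto one of the four engagement shapes."""
    if has_shipped_rag:
        # An existing system either needs a new store or a full diagnosis.
        return "Migration" if only_vector_store_is_wrong else "Rescue"
    # Nothing shipped yet: prove retrieval quality first, then build.
    return "Production RAG Build" if retrieval_quality_proven else "Pilot"


print(pick_engagement(has_shipped_rag=False,
                      only_vector_store_is_wrong=False,
                      retrieval_quality_proven=False))
```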

003 / STACK

The retrieval stack: vector DBs, embedders, rerankers.

Stack choices follow the workload, not house preferences. Vector store, embedding model, and reranker are all benchmarked against your real eval set before we lock anything in.

LlamaIndex
LangChain
Pinecone
Qdrant
Weaviate
pgvector
Chroma
Cohere Rerank
Voyage
OpenAI Emb.
RAGAS
Trulens
BGE
Langfuse
Presidio
Unstructured

VECTOR DATABASE PICKS

For each store: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production RAG solutions on every one of these; the "when we don't" lines come from real builds, not theory. Vector database selection isn't a one-shot decision; we benchmark two candidates against your eval set on every Production Build.

Pinecone

Strengths

Managed, serverless, predictable. Sub-100ms p95 on most workloads up to ~100M vectors. Hybrid search and namespaces built in.

When We Pick

Mid-size corpora (5M–100M vectors) where the team has zero appetite to run infra. SaaS clients without a platform team. Anywhere predictable cost and SOC 2-ready hosting matter more than knob-tuning.

When We Don't

Cost-sensitive workloads above 100M vectors (the unit economics flip vs Qdrant self-hosted). Workloads needing exotic distance metrics or custom HNSW parameters Pinecone doesn't expose.

Paiteq Pattern

Default for clients without a platform team. We benchmark it against Qdrant on every Production Build, but Pinecone wins ~60% of those head-to-heads on operational simplicity.

Managed · Hybrid · Serverless

Qdrant

Strengths

Open-source with a Rust core. Excellent performance per dollar self-hosted. Strong filtering, payload indexing, and quantization options.

When We Pick

Self-hosted requirements (residency, regulated workloads). Large corpora where the dedicated infra cost beats managed pricing. Anywhere we need scalar / product quantization to cut RAM 4×.

When We Don't

Tiny corpora (<1M vectors), where the operational tax doesn't pay off. Teams with no Kubernetes capacity.

Paiteq Pattern

Our default for enterprise RAG deployments on AWS / GCP where data residency matters. We run Qdrant Cloud for managed convenience and self-hosted for regulated workloads; the API is the same either way.

Self-hosted · Quantization · Filterable

Weaviate

Strengths

Hybrid search and modular reranking are first-class. GraphQL API. Built-in multi-tenancy. Optional ML modules (vectorizer, reader).

When We Pick

Hybrid search heavy workloads where rerank logic and ranking modules need to ship with the store. Multi-tenant SaaS where each customer gets isolated namespaces.

When We Don't

Simpler workloads where pgvector or Pinecone covers it. Teams allergic to GraphQL: Weaviate's REST is fine, but the docs lean GraphQL.

Paiteq Pattern

We reach for Weaviate when the client's eval set shows hybrid search lifts recall by ≥10 points over pure vector. Modular reranker plug-ins shorten the build by a week or two.

Hybrid-first · GraphQL · Multi-tenant

pgvector

Strengths

Postgres extension. No extra system to operate. Joins between vectors and structured data. HNSW + IVFFlat indexes.

When We Pick

Corpora that fit on one Postgres node (typically <5M vectors). Teams that already run Postgres and don't want another service. Apps where vector + relational filtering matter together.

When We Don't

Above ~5M vectors with high QPS, index rebuild times and connection pool pressure start hurting. Workloads with very high write churn.

Paiteq Pattern

Default for early-stage SaaS clients. We migrate to Qdrant or Pinecone the first time an index rebuild crosses 90 seconds on production data; that's the usual tell.

Postgres-native · Joins · HNSW

EMBEDDERS & RERANKERS

Embedder choice usually matters more than vector store choice for retrieval recall, and it's the easier thing to swap later. A reranker on top of hybrid retrieval is the single highest-ROI add on most of the RAG pipeline development work we do. Hybrid search implementation, when it earns its keep on the eval set, lifts top-k precision by another 8–18 points.

OpenAI text-embedding-3-large

Strengths

Strong general-purpose baseline. 3072 dims (truncatable to 1536 / 512 / 256 via MRL). Stable API.

When We Pick

Default for English-first corpora and mixed-domain workloads. When the client has an OpenAI contract already and procurement won't add a vendor.

When We Don't

Multilingual corpora, where Voyage and BGE win our evals. Workloads where data residency rules out hosted embedding APIs.

Paiteq Pattern

Our day-one baseline; we benchmark against Voyage and BGE on every Production Build. About 40% of builds end up swapping to one of those.

3072 dims · MRL
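For readers unfamiliar with MRL-style truncation, a minimal client-side sketch: keep the leading dimensions of an embedding and re-normalise so cosine and dot-product scores stay comparable. The function name is illustrative, and whether a truncated vector holds recall on a given corpus is something only the eval set can answer.

```python
import math


def truncate_embedding(vector: list, dims: int) -> list:
    """Keep the first `dims` components and rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalise
    return [x / norm for x in head]


short = truncate_embedding([3.0, 4.0, 12.0], dims=2)
print(short)
```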

Voyage-3 / voyage-large

Strengths

Consistently tops the MTEB leaderboard for our evals. Strong long-context (32k) embedding. Per-domain models for code, finance, law.

When We Pick

Long documents where 32k context preserves more signal than chunking. Domain-specific corpora (legal, finance, code) where Voyage's domain models lift recall 5–12 points.

When We Don't

Hyper-cost-sensitive batch workloads, where OpenAI's batch pricing wins by ~30%. Workloads requiring on-prem embedding (use BGE instead).

Paiteq Pattern

Default for legal and clinical RAG. We've shipped contract-Q&A systems where Voyage-law improved retrieval recall from 78% to 91% over the OpenAI baseline.

MTEB top · Long-context · Domain

BGE / GTE (self-hosted)

Strengths

Open-weights. Run on your own GPU or CPU. BGE-large-en is the strongest open embedder we've benchmarked. Free at scale.

When We Pick

Regulated workloads where no data can leave the perimeter. Very high-volume batch embedding where hosted API cost dominates. BAAI-licensed deployments.

When We Don't

Tiny corpora, where the GPU isn't worth it. Teams without infrastructure to run an embedding service. Multilingual workloads where Voyage still wins.

Paiteq Pattern

We run BGE on a small CPU pool for healthcare and finance clients with strict residency. Throughput tuning matters: batch size and ONNX runtime get most of the gains.

Open-weights · Self-hosted · BAAI

Cohere Rerank 3

Strengths

Cross-encoder reranker that lifts top-k precision by 8–18 points on our evals over pure vector. Hosted API, sub-200ms.

When We Pick

Almost every Production RAG Build. Reranking on top-50 vector hits down to top-5 is the single highest-ROI add we make.

When We Don't

Very tight latency budgets (Cohere adds ~150–200ms). Pure offline batch where the reranker cost stacks up. Workloads where a self-hosted bge-reranker fits the same role for free.

Paiteq Pattern

Default. We benchmark Cohere Rerank 3 vs bge-reranker-large head-to-head on the eval set; Cohere usually wins but bge-reranker self-hosted is the strong runner-up.

Rerank · Cross-encoder · Hosted

Two patterns worth flagging on every RAG engagement. First, we benchmark two embedders against the eval set before locking the stack, usually OpenAI text-embedding-3-large against a domain-specific Voyage model. The eval set decides, not the leaderboard. Second, we default to including a reranker on every Production Build. The 150–200ms latency tax is almost always worth the 8–18 point precision lift. Skip it only when the latency budget rules it out, as with voice-RAG and other sub-700ms response targets. Our deeper take on chunking, including when fixed-size beats semantic, lives in our chunking strategy deep dive.
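The rerank step itself is small. A sketch of "top-50 in, top-5 out", where `score_pair` is a hypothetical stand-in for whatever cross-encoder is in use (a hosted reranker or a self-hosted one); the word-overlap scorer below is a toy used only to make the example run.

```python
def rerank(query: str, candidates: list, score_pair, keep: int = 5) -> list:
    """Score every (query, passage) pair and keep the best `keep` passages."""
    scored = [(score_pair(query, text), text) for text in candidates]
    # Highest score first; Python's sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def word_overlap(query: str, passage: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


hits = ["holiday schedule", "notice period", "termination notice period rules"]
print(rerank("termination notice period", hits, word_overlap, keep=2))
```

Because a cross-encoder scores each pair separately, latency grows with the candidate count, which is why the candidate pool is capped before reranking.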

CHUNKING DECISIONS

Chunking strategy is the choice that decides retrieval recall more than embedder choice does, and it's the one most teams get wrong on a first build. Three patterns cover ~90% of what we ship.

01 · Fixed-size + overlap

500–1,000 token chunks with 10–20% overlap. Default starting point for narrative text, support docs, knowledge bases, marketing-style content. Fastest to ship; usually the baseline every other chunker has to beat.

When it works: uniform-density text where paragraphs and sections are short. When it fails: structured documents (contracts, manuals) where headers carry semantic weight.
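A minimal sketch of fixed-size chunking with overlap, working on an already-tokenised list. The defaults (800 tokens, 120 overlap, i.e. 15%) are illustrative values inside the ranges quoted above, not tuned settings.

```python
def chunk_fixed(tokens: list, size: int = 800, overlap: int = 120) -> list:
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


print(chunk_fixed(list(range(10)), size=4, overlap=1))
```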

02 · Recursive + title-aware

Splits at heading boundaries first, then by paragraph, then by sentence, preserving document structure in the chunk. Hierarchical: each chunk knows its parent section. Our default for legal, regulatory, technical documentation.

When it works: structured docs with consistent heading conventions. When it fails: badly OCR'd PDFs where the heading detection breaks; the chunker falls back to fixed-size with a recall penalty of 8–15 points.
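A simplified sketch of the title-aware idea, assuming headings are Markdown-style lines starting with `#` (real corpora need format-specific heading detection). Each chunk carries its parent section trail; oversized sections fall back to paragraph splits, and the sentence-level split is left out for brevity.

```python
def chunk_by_headings(text: str, max_chars: int = 1200) -> list:
    """Split at headings first, then by paragraph; tag chunks with their section."""
    chunks = []
    path = []    # current heading trail, e.g. ["Agreement", "Termination"]
    buffer = []  # body lines collected under the current heading

    def flush():
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        # Sections over the limit are split on blank lines (paragraphs).
        if len(body) <= max_chars:
            parts = [body]
        else:
            parts = [p for p in body.split("\n\n") if p.strip()]
        for part in parts:
            chunks.append({"section": " > ".join(path), "text": part.strip()})

    for line in text.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section under its own heading trail
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]
            path.append(line.lstrip("#").strip())
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# Agreement\nIntro text.\n## Termination\nEither party may terminate.\n## Fees\nFees are due monthly."
for chunk in chunk_by_headings(doc):
    print(chunk)
```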

03 · Semantic / embedding-based

Splits on cosine-distance jumps between sentences, so chunks form where the topic shifts. Higher ingest cost (one embed per sentence pair), but lifts retrieval recall 5–12 points on conceptually dense content like research papers and analyst memos.

When it works: long-form analytical writing, research synthesis, ed-tech curriculum. When it fails: fragmentary content (chat logs, ticket threads) where the semantic signal is noisy.
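A minimal sketch of splitting on cosine-distance jumps. It takes sentence vectors as input so no embedding provider is assumed; the 0.5 threshold is a placeholder that would be tuned against the eval set.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(sentences: list, vectors: list, threshold: float = 0.5) -> list:
    """Start a new chunk wherever adjacent sentences are further apart than `threshold`."""
    if len(sentences) != len(vectors):
        raise ValueError("one vector per sentence is required")
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            groups.append([])  # topic shift: open a new chunk
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]


sents = ["A.", "B.", "C.", "D."]
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(sents, vecs))
```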

We benchmark two chunkers head-to-head against your eval set on every Production RAG Build, typically recursive + title-aware vs semantic, and the winner is locked at week 4. The losing strategy stays in the codebase as a fallback for sources that don't fit the primary pattern (badly OCR'd PDFs, mixed-format archives).

004 / PROCESS

Six steps to build a RAG pipeline, eval-first, every time.

The same process runs across a 4-week Pilot and a 14-week Production Build. The gates change in depth, not in shape. Every step has an explicit deliverable, a named owner, and a gate criterion: pass or rework, with no "we'll figure it out next sprint."

WEEK 1

Corpus audit

What you've got, in what shape, how fresh, what's allowed to leave your perimeter. We don't write any ingest code until this is signed off.

WEEK 2

Eval set

30–80 graded questions with reference answers + the supporting passages each answer should retrieve. Lands before any pipeline code.

WEEK 2–4

Ingestion

Chunking strategy, embedding model selection, store choice, freshness pipeline. Each choice benchmarked, not picked by vibe.

WEEK 4–6

Retrieval

Hybrid (BM25 + dense) plus rerank. Query rewriting where it earns its keep. Tuned against your eval set, not a public benchmark.

WEEK 6–8

Eval gates

Retrieval recall, answer faithfulness, context relevance, hallucination rate, p95 latency: all green before any production wire-up.

ONGOING

Running

Weekly eval, freshness monitoring, prompt + chunking iteration based on production logs. The eval set grows from sampled traces.

Corpus audit

We walk every source in your corpus before anyone proposes a chunking strategy. That means listing every source (Confluence, SharePoint, Postgres tables, S3 PDFs, the legacy DMS no one talks about), the rough doc count, the freshness cadence per source, the access policy, and what counts as in-scope vs out. Week-1 output is a corpus map: sources, volumes, formats, access patterns, and what's allowed to embed off-perimeter.

Owners: Paiteq RAG engineer + your data / IT owner. ~5 hours of their time spread across the week.

Gate: Corpus boundary signed off. If a source is fuzzy on access policy, we exclude it from the pilot; better to start narrow than discover a residency violation in week 7.

Eval set

30–80 graded questions, each with a reference answer and the passage(s) the retriever should pull. Your domain expert grades; we facilitate. Realistic edge cases matter more than easy ones: the eval set is what tells us when chunking strategy needs to change, so it has to surface real failure modes. We build it before any retrieval code ships.

Owners: Your domain expert (~8 hours) + Paiteq engineer facilitating. We've never had a RAG engagement where the eval set wasn't the single most valuable week of work.

Gate: Eval examples graded. If your team can't agree on a reference answer for an example, the spec isn't done; that's a scoping bug, not an eval bug.

Ingestion

Chunking strategy, embedder choice, vector store choice, and the freshness pipeline. Each gets a small head-to-head against the eval set: 2–3 chunking strategies (fixed-size / semantic / title-aware), 2 embedders, sometimes 2 stores. The winner isn't decided by a benchmark blog post; it's decided by your eval set.

Owners: Paiteq RAG engineer; weekly demo of intermediate scores to your team.

Gate: Baseline retrieval recall hit on the eval set. Below baseline, we revise chunking or embedder before moving to retrieval logic.

Retrieval

Hybrid retrieval (BM25 + dense, scored together), reranker layer (Cohere Rerank 3 or bge-reranker), query rewriting (HyDE, multi-query) where it lifts scores. Each layer is added only if it earns its keep on the eval set. We've shipped RAG systems where the reranker alone moved hallucination rate from 11% to 2%.

Owners: Paiteq RAG engineer; weekly score reviews with your team.

Gate: Retrieval recall + answer faithfulness both above target on the eval set.
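One way to score BM25 and dense results together is weighted fusion of min-max-normalised scores. This is a sketch of that particular technique, not necessarily the fusion used on any given build; `alpha` is the dense weight that would be tuned against the eval set, and the input dicts stand in for whatever the two retrievers return.

```python
def normalize(scores: dict) -> dict:
    """Min-max scale scores to [0, 1] so sparse and dense are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(bm25_scores: dict, dense_scores: dict,
                alpha: float = 0.5, k: int = 10) -> list:
    """Fuse sparse and dense scores; alpha=1.0 is pure dense, 0.0 pure BM25."""
    sparse = normalize(bm25_scores)
    dense = normalize(dense_scores)
    fused = {
        doc: alpha * dense.get(doc, 0.0) + (1.0 - alpha) * sparse.get(doc, 0.0)
        for doc in set(sparse) | set(dense)
    }
    # Sort by fused score, breaking ties by doc id for deterministic output.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))[:k]


bm25 = {"d1": 12.0, "d2": 3.0, "d3": 0.0}
dense = {"d2": 0.9, "d3": 0.8, "d4": 0.1}
print(hybrid_rank(bm25, dense, alpha=0.5))
```

A document found by only one retriever gets zero from the other, which is what lets an exact-match BM25 hit such as `d1` survive even when the dense side misses it.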

Eval gates

Five thresholds all green before any production wire-up: retrieval recall (did we pull the right passages?), context relevance (are they on-topic?), answer faithfulness (is the answer grounded?), hallucination rate (LLM-as-judge + human spot-check), and p95 latency. Hallucination disputes get a human spot-check from your domain expert; we don't let LLM-as-judge stand alone on the hard cases.

Owners: Paiteq RAG engineer + your domain expert on the human spot-check.

Gate: All five metrics green or the build doesn't deploy. Period.

Running

Four weeks of post-launch iteration are part of every Production RAG Build. Weekly eval runs, freshness drift checks (stale-document rate as a first-class metric), prompt + chunking iteration on edge cases. The eval set grows from sampled production traces every month; regression alarms fire when an upstream model change drops scores by >5 points.

Owners: Paiteq RAG engineer (decreasing % of time) + your team taking over ownership.

Gate: Ongoing. Weekly eval review continues for the duration of the engagement.
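The regression alarm can be as simple as comparing this week's scores with a stored baseline. A sketch, assuming scores are fractions between 0 and 1 and that higher is better; metrics where lower is better (hallucination rate, latency) would need the comparison flipped.

```python
def regressions(baseline: dict, current: dict, max_drop_points: float = 5.0) -> dict:
    """Return metrics whose score fell by more than `max_drop_points` percentage points."""
    alarms = {}
    for metric, before in baseline.items():
        drop = (before - current[metric]) * 100  # fractions -> percentage points
        if drop > max_drop_points:
            alarms[metric] = round(drop, 1)
    return alarms


last_week = {"recall": 0.90, "faithfulness": 0.95}
this_week = {"recall": 0.83, "faithfulness": 0.93}
print(regressions(last_week, this_week))
```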

Two things matter across every RAG system implementation we ship. First, the eval set lands in week 2, before any retrieval code. It is what tells us whether the chunking strategy is wrong, whether the embedder needs to change, whether the reranker is earning its latency budget. Without it, you're tuning blind. Second, Running is a real phase, not an afterthought. The first 4 weeks post-launch are part of every Build engagement: weekly eval review, freshness checks, prompt iteration on edge cases, regression alarms wired to your on-call. A RAG consultant in the room at week 2 almost never sees the hallucination complaint that arrives at week 12.

005 / DECISION

RAG vs. fine-tuning: when do you need which?

The most common scoping question we get. Most production systems use both: RAG for facts, a small fine-tune for style or domain vocabulary. Picking which to lean on first decides the engagement shape, and our RAG consulting practice runs this conversation at week 1 of every project.

Dimension	Fine-tuning	RAG
Grounds in	Static training data	Your live corpus
Freshness	Frozen at training	As fresh as your CDC pipeline
Setup cost	$$$$ (full fine-tune run)	$$ (chunk, embed, index)
Fine-tune runs on GPT-4o or Llama-3 70B easily reach $2,000–$8,000 for a first pass (compute + data prep + eval iteration). RAG infrastructure (chunking, embedding, and indexing a corpus) is typically under $500 for an initial pilot. The cost delta widens when the corpus changes: RAG re-indexes incrementally; fine-tuning re-runs the whole job.
Latency	Lower per turn	+200–600ms (retrieval + rerank)
Fine-tuning genuinely wins on latency: all knowledge is baked into weights, so the forward pass is the only cost. The RAG overhead is real: the +200–600ms is mostly the rerank stage (a cross-encoder scoring 50 candidates), not the ANN lookup itself (typically under 20ms on Qdrant or pgvector). For voice-RAG with a sub-700ms front-to-back budget, you often skip the reranker or run a self-hosted bge-reranker-large on the same box as the embedder.
Hallucination	Higher (no grounding)	Lower (refusal is possible)
RAG's hallucination advantage comes from two mechanisms: retrieved passages anchor the generation, and the system can refuse when no retrieved chunk clears the relevance threshold. Fine-tuned models have no equivalent circuit-breaker: if the question pattern resembles training data, the model will produce a fluent-sounding answer regardless of factual grounding. For regulated domains (legal, healthcare, financial), that refusal capability is often non-negotiable.
Best for	Style, voice, output format	Facts, lookups, citations
Eval surface	Output quality only	Recall + faithfulness + answer
A richer eval surface sounds like extra work, but it's actually a debugging advantage. When a RAG system regresses, recall vs faithfulness vs answer-quality scores tell you exactly where the pipeline broke. A fine-tuned model that degrades gives you one signal (output quality), and root-causing that to data, prompt, or model is harder. Most teams we work with under-invest in RAG evals initially and then thank themselves for the instrumentation at week 6.
Compose with	RAG, prompting	Fine-tune, prompting

Full breakdown: when RAG beats fine-tuning

Rule of thumb: if the answer needs to cite its source or stay fresh, you want RAG. If the answer needs to speak in your voice or your domain's jargon, you want fine-tuning. For anything in between, the decision tree below walks four diagnostic questions; most projects fit cleanly into one of five outcomes.

DECISION TREE

Answer four questions about the workload. We've used these same questions to right-size scope on every RAG engagement we've run.


006 / ARCHITECTURE

Four production RAG patterns.

Most production retrieval-augmented generation services reduce to one of four patterns. The taxonomy isn't ours (it's standard across the LlamaIndex and LangChain communities), but the deployment choices are where engineering judgment lives: when to pick which, what fails first, which eval metric becomes the anchor.

Pattern choice matters more than store choice or embedder choice for production retrieval recall. We've watched teams agonize over Pinecone vs Qdrant while running pattern 01 (naive RAG) on a corpus that needed pattern 02 (hybrid + rerank). The store decision was worth 1–3 points of recall; the pattern decision was worth 14. When you design a RAG system architecture, start with the pattern question ("what shape of retrieval does this corpus need?"), then pick the store and embedder that fit. Reverse that order and you'll rebuild at week 9, which is the call we get on most Rescue engagements. The LLM knowledge base scope (what the system answers vs refuses) belongs in week 2.

Naive RAG

The simplest pipeline. Query gets embedded, top-k passages come back from the vector store, the LLM generates an answer grounded in them. No rewriting, no rerank, no reflection. About 20% of our pilots ship in this shape, usually when the corpus is small and well-structured, and the question shape is narrow enough that recall is naturally high. Don't reach for the fancier patterns until the eval set says you need to. The naive pipeline ships in days, not weeks, and it's the baseline every other pattern has to beat.

Pick when

Corpora under ~1M vectors. Narrow, well-defined question shape. Retrieval recall already above 85% on the eval set with a single-strategy retriever. Pilots where the cost of complexity is higher than the precision lift.

Skip when

Recall is below ~80% (try hybrid before adding rerank). Hallucination rate above target. Multi-hop questions that need iterative retrieval. Voice-RAG where you can afford the rerank latency.

Stack

LlamaIndex · pgvector · OpenAI text-embedding-3-large · Claude Sonnet 4.6

A common scoping mistake on enterprise RAG projects: clients ask for pattern 03 (multi-step) when pattern 02 + a better embedder would have shipped in half the time. Each retrieval loop doubles latency cost and the eval surface widens. Default to pattern 02 (hybrid + rerank). Move up only when the eval set tells you to. About a third of "we need multi-step RAG" requests we audit end up landing back on pattern 02 once the hybrid search implementation is properly tuned. The deeper take on that call lives in our hybrid retrieval breakdown.

FAILURE MODES BY PATTERN

Most RAG systems we audit fail on the same handful of issues, and the symptoms line up with the pattern in use. Quick triage list:

01 · Naive RAG

Low recall (<75%): chunks too large, embedder weak on the domain, or BM25-eligible queries hitting only dense.
High latency: rare; usually the LLM, not retrieval. Check token count first.
Wrong-page citations: chunk boundaries broke mid-section; switch to recursive + title-aware.

02 · Hybrid + Rerank

Reranker latency spike: Cohere rate-limited or self-hosted bge-reranker over-loaded; cache or batch.
Hybrid weighting wrong: sparse pulling too much noise; tune the BM25/dense ratio against your eval set.
Domain drift: embedder trained on general English failing on legal / clinical vocabulary; swap to Voyage-domain.

03 · Multi-step

Loop never terminates: reflection prompt unable to recognise "good enough"; add a max-iterations cap + scoring gate.
Cost blow-up: each loop is another LLM call; budget per query, circuit-break above ceiling.
Latency tax: 2–3× over single-pass; only worth it when faithfulness lift on the eval set crosses 12+ points.

04 · Agentic RAG

Agent retrieves too much: tool-call budget unconstrained; cap retrievals per loop turn.
Citations drift: agent paraphrases passages losing provenance; force structured-output citations.
Scope creep into agent territory: if 80%+ of work is non-retrieval tool-use, the engagement belongs on the AI agent practice instead.
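The multi-step fixes above (max-iterations cap, scoring gate, per-query budget) fit in one loop. A sketch under stated assumptions: `retrieve`, `draft`, and `score` are hypothetical callables, `draft` and `score` are each counted as one LLM call, and the default limits are placeholders.

```python
def multi_step_answer(question, retrieve, draft, score,
                      max_iterations=3, good_enough=0.8, max_llm_calls=6):
    """Iterative retrieve-draft-score loop with three independent stop conditions."""
    context = []
    llm_calls = 0
    best_score, best_answer = 0.0, None
    query = question
    for _ in range(max_iterations):            # cap: the loop always terminates
        context.extend(retrieve(query))
        if llm_calls + 2 > max_llm_calls:      # circuit-break on per-query budget
            break
        answer = draft(question, context)
        quality = score(question, context, answer)
        llm_calls += 2
        if quality > best_score:
            best_score, best_answer = quality, answer
        if quality >= good_enough:             # scoring gate: stop when good enough
            break
        query = f"{question} {answer}"         # refine the next retrieval
    return {"answer": best_answer, "score": best_score, "llm_calls": llm_calls}


# Toy stand-ins so the loop can be exercised without a model.
result = multi_step_answer(
    "What is the notice period?",
    retrieve=lambda q: [q],
    draft=lambda q, ctx: f"draft{len(ctx)}",
    score=lambda q, ctx, a: 0.3 * len(ctx),
)
print(result["answer"], result["llm_calls"])
```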

007 / EVAL

Four eval dimensions on every RAG we ship.

Generic LLM eval frameworks miss RAG-specific failure modes. We score retrieval and generation separately, then together, and the eval set lands in week 2 of every engagement, before the first chunk is embedded.

88%

Retrieval recall

Did the retriever pull the passages that contain the answer? Scored against gold-passages in the eval set.

94%

Answer faithfulness

Is every claim grounded in a retrieved passage? RAGAS + LLM-as-judge, with human spot-check on the disputed 5%.

<3%

Hallucination rate

Claims no retrieved passage supports. Hard gate before production. Refusal is preferred over guessing.

<1.6s

P95 latency

Query-to-answer latency across retrieve, rerank, and generate. Reranker is the usual bottleneck. Voice-RAG targets sub-700ms.

Numbers shown are illustrative target ranges for new engagements until production eval data from anonymised builds is published.

EVAL GATES

The four gates aren't suggestions. All four must be green before we wire any RAG into production traffic. Each has an explicit methodology, a target, and a fail-state, codified before the first chunk is embedded.

01 Retrieval recall

≥88%

Gold-passages in the eval set; recall@k for k=10 and k=5. Re-graded weekly. Production traces sampled into the eval set monthly.

If <80%, retrieval logic gets rewritten: chunking, embedder, or hybrid weighting are all back on the table.

02 Answer faithfulness

≥94%

RAGAS + LLM-as-judge (Claude Sonnet 4.6) scoring whether every claim is supported by a retrieved passage. Human spot-check on the 5% disputed by the judge.

If <90%, citation-enforcement prompts get rewritten or the model gets demoted to refusal-only on low-confidence retrievals.

03 Hallucination rate

<3%

Claims that no retrieved passage supports. Hard gate before production wire-up. Refusal is the preferred failure mode, not a guess.

If ≥5%, we widen the refusal threshold and rerun. We've never shipped a RAG with hallucination above 3% on the eval set.

04 P95 latency

<1.6s

Full query-to-answer latency across embed, retrieve, rerank, generate. Reranker is the usual bottleneck. Voice-agent RAG targets sub-700ms with bge-reranker self-hosted.

If breached for >72h, we re-evaluate reranker placement or move to streaming generation.
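A sketch of how recall@k against gold passages and the four-gate check might be computed. The gate targets are the illustrative ones listed above; the data shapes and function names are assumptions for the example.

```python
def recall_at_k(retrieved_ids: list, gold_ids: list, k: int) -> float:
    """Share of gold passages that appear in the top-k retrieved ids."""
    if not gold_ids:
        raise ValueError("each eval question needs at least one gold passage")
    gold = set(gold_ids)
    return len(set(retrieved_ids[:k]) & gold) / len(gold)


def mean_recall_at_k(eval_rows: list, k: int) -> float:
    """Average recall@k over rows shaped like {"retrieved": [...], "gold": [...]}."""
    return sum(recall_at_k(r["retrieved"], r["gold"], k) for r in eval_rows) / len(eval_rows)


# Targets taken from the gate list above: (comparison, threshold).
GATES = {
    "retrieval_recall": (">=", 0.88),
    "answer_faithfulness": (">=", 0.94),
    "hallucination_rate": ("<", 0.03),
    "p95_latency_s": ("<", 1.6),
}


def failed_gates(metrics: dict, gates: dict = GATES) -> list:
    """Names of gates that are not green; an empty list means cleared to deploy."""
    failed = []
    for name, (op, target) in gates.items():
        value = metrics[name]
        ok = value >= target if op == ">=" else value < target
        if not ok:
            failed.append(name)
    return failed


print(recall_at_k(["p1", "p9", "p3"], ["p1", "p3"], k=2))
print(failed_gates({"retrieval_recall": 0.90, "answer_faithfulness": 0.95,
                    "hallucination_rate": 0.04, "p95_latency_s": 1.2}))
```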

Two methodology notes that matter. We use LLM-as-judge with Claude Sonnet 4.6 as the default faithfulness scorer because it produces the most consistent grades against human ground-truth on the eval sets we've shipped. Hallucination disputes (typically 5–8% of outputs) get a human spot-check by your domain expert; we never let LLM-as-judge stand alone for the hard cases. And the eval set grows during production: traces sampled monthly, regression alarms firing when an upstream model swap drops scores. An LLM knowledge base scored weekly degrades at a fraction of the rate of one eyeballed quarterly.

Default eval and observability stack we deploy:

RAGAS · Trulens · Langfuse · Promptfoo · LangSmith · DeepEval

007b / SECURITY · COMPLIANCE · COST

Security, compliance, and cost engineering for RAG.

Three concerns enterprise RAG buyers always raise before procurement. We address each one in the spec, not as a "we'll figure it out at the security review" promise. Most RAG projects we rescue had at least one of these three left for later.

Security & guardrails

Defense in depth for RAG, not a single classifier. Every production pipeline ships with PII scrubbing at ingest, citation enforcement at generation, and an adversarial eval set we re-run on every model or embedder swap.

PII scrubbing at ingest: Microsoft Presidio or your existing DLP runs on text before embedding. Embeddings store no raw PII by default; redaction tokens preserve structure where needed.
Citation enforcement: the LLM is prompted to ground every claim in a retrieved passage; outputs without citations get flagged or refused. We've shipped systems where 8–12% of queries get refused, and clients prefer that over confident wrong answers.
Prompt-injection defence: Llama Guard 3 or a custom classifier on inbound queries. Retrieved passages get a separate isolation prompt so a poisoned doc can't override system instructions.
Refusal threshold: if no passage scores above a tuned floor, the answer is "I don't have a grounded answer for that." Refusal is a first-class output, not a degraded one.
Output filtering: Presidio runs on the LLM's response to catch PII leakage. More than once, we've caught models hallucinating Social Security numbers that weren't in the corpus.
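The refusal threshold and citation enforcement can be sketched together. This is a minimal illustration, not our production guardrail: it assumes retrieval returns (passage ID, score) pairs, that the model cites passages inline as [id] before the sentence's closing punctuation, and that `generate` is a hypothetical callable wrapping the LLM call.

```python
import re
from typing import Callable, List, Tuple

REFUSAL = "I don't have a grounded answer for that."


def select_context(scored: List[Tuple[str, float]], floor: float) -> List[str]:
    """Keep passage IDs whose retrieval score clears the tuned floor."""
    return [pid for pid, score in scored if score >= floor]


def citations_valid(answer: str, allowed: List[str]) -> bool:
    """Every sentence must cite at least one passage, and only retrieved ones."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    for sentence in sentences:
        cited = re.findall(r"\[(\w+)\]", sentence)
        if not cited or any(c not in allowed for c in cited):
            return False
    return True


def guarded_answer(
    scored: List[Tuple[str, float]],
    floor: float,
    generate: Callable[[List[str]], str],
) -> str:
    """Refuse when context is thin or when the draft is not fully cited."""
    context = select_context(scored, floor)
    if not context:
        return REFUSAL
    draft = generate(context)
    return draft if citations_valid(draft, context) else REFUSAL


# Hypothetical retrieval scores and a stubbed generator standing in for the LLM.
scored = [("p1", 0.82), ("p2", 0.31)]
print(guarded_answer(scored, 0.5, lambda ctx: "Notice period is 30 days [p1]."))
print(guarded_answer(scored, 0.9, lambda ctx: "Notice period is 30 days [p1]."))
```

The second call refuses because no passage clears the 0.9 floor. A draft that cites a passage outside the retrieved set, or leaves any sentence uncited, is refused the same way.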

Compliance posture

Default posture covers most enterprise procurement bars. Regulated workloads (clinical, financial, EU) layer in additional controls, scoped into the SOW at week 1, not retrofitted at security review in week 12.

SOC 2-ready

Practices, not certified · default posture

HIPAA-aligned

PII-scrubbed prompts · BAAs · log redaction

GDPR / EU AI Act

EU residency · DPA · model-card disclosures

On-prem / VPC deployment available: BGE embeddings, Qdrant self-hosted, Llama 4 / Mistral on vLLM. Standard pattern for healthcare, financial services, and defence-adjacent engagements where no data can leave the perimeter.

Cost engineering

After engineering time, LLM token cost is usually the highest line item on production RAG, with embedding cost second. We model expected cost during corpus audit and cut it 40–70% on the average build through routing, caching, and quantization.

40–70%

Token-cost cut

Via routing easy queries to Haiku / 4o-mini after retrieval

85%

Cache hit

On stable system + retrieved-context prefixes (Anthropic prompt cache)

4×

Memory reduction

With Qdrant scalar / product quantization at <1% recall loss

Model routing: a classifier routes by query complexity. Easy lookups go to Haiku or 4o-mini at 1/20th the cost; hard ones go to the frontier model. Faithfulness holds via the eval gate.
Prompt caching: Anthropic / OpenAI prompt caching on stable system prompts and retrieved-context prefixes. 85%+ hit rate on most agents within two weeks of launch. Our hosted vs self-hosted decision piece breaks down where the savings land.
Quantization: Qdrant scalar / product quantization cuts RAM 4× with under 1% recall loss on most corpora. The single highest-ROI infra optimization on large vector indexes.
Batch embedding: OpenAI / Voyage batch APIs for re-embedding and corpus refresh. 50% cost cut vs sync, 5–10× throughput. The default for any ingest run above ~100k docs.
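A back-of-envelope sketch of the routing and quantization arithmetic. The assumptions are ours: easy and hard queries consume the same number of tokens, the small model costs 1/20th of the frontier model, vectors are stored as float32 before quantization and int8 after, and the 1,024-dimension figure is a placeholder. The function names are illustrative.

```python
def blended_cost(easy_share: float, cheap_ratio: float = 1 / 20) -> float:
    """Token cost relative to sending every query to the frontier model (1.0)."""
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    return (1 - easy_share) + easy_share * cheap_ratio


def vector_ram_bytes(n_vectors: int, dims: int, bytes_per_dim: int) -> int:
    """Raw vector payload size, ignoring index overhead."""
    return n_vectors * dims * bytes_per_dim


# Share of queries the classifier routes to the small model (hypothetical values).
for share in (0.45, 0.60, 0.70):
    print(f"{share:.0%} routed to the small model -> {1 - blended_cost(share):.1%} token-cost cut")

# float32 (4 bytes per dimension) vs int8 scalar quantization (1 byte per dimension).
full = vector_ram_bytes(4_200_000, 1024, 4)
quantized = vector_ram_bytes(4_200_000, 1024, 1)
print(full // quantized)  # 4
```

Under these assumptions, routing 60% of queries to the small model works out to a 57% cut, which is how a 40–70% band can arise from routing alone before caching is counted.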

All three concerns share a pattern: the discipline is in the spec, not in the build. We name the threat model, the compliance posture, and the cost band during corpus audit. The build executes against those targets. Security and cost aren't add-on phases that happen after retrieval recall is green; they're how it gets there.

008 / USE CASES

Where teams have shipped RAG.

Engagements anonymised. Industry and segment are real; metrics are real; brand names removed under standard NDA terms.

Use cases below are organised by corpus shape (contracts, tickets, research notes, code repos, regulations, equipment manuals), not by industry. The same hybrid retrieval and citation-enforcement pipeline ships to a law firm and a manufacturer; what changes is the chunking strategy, the freshness pipeline, and the embedder. Below: three flagship engagements (full numbers) plus three function stubs from recent ships. A semantic search implementation is the throughline on most of them; document-intelligence patterns recur across legal and ops.

Legal

Mid-market law · 80+ atty

Contracts Q&A over 11 years of MSAs

Chunked 14,000 contracts with hierarchical headers (recursive + title-aware), hybrid retrieval, Cohere Rerank 3 down to top-5. Voyage-law embedder lifted recall from 78% to 92% over the OpenAI baseline. PII-scrubbed at ingest; the legal team grades the eval set monthly.

92%

retrieval recall on the eval set

Support

Health-tech · enterprise

Knowledge agent over 18 months of tickets

RAG over product docs and a redacted ticket archive. Refuses cleanly on out-of-corpus questions; escalates clinical to a human with the agent's draft and retrieved passages attached. p95 latency 1.4s; hallucination rate held at <2% across the post-launch quarter.

0 %

p1 ticket volume

Research

VC fund · 35-person

Memo synthesis over public + private corpus

Multi-step RAG over SEC filings, press, and the fund's internal notes. Citations to primary sources only; the agent refuses gracefully when the corpus is thin on a target rather than synthesising plausible nonsense.

0 %

first-pass diligence ~

Code

Dev-tools SaaS · 50–200 emp

Repo-aware code Q&A across a 1.8M-line monorepo

Symbol-graph indexing combined with chunked dense vectors. Engineering team asks 'where does this config flag get read?' and gets file:line citations with surrounding context. Stale-symbol rate stays below 2% via webhook-driven incremental re-indexing.

min → 40s per repo question

Compliance

Fin services · 1,000+ emp

Regulatory Q&A across 6 jurisdictions

Hybrid retrieval over published regulations + the firm's interpretation memos. Citations to the underlying regulation always; refusal when jurisdictions disagree rather than averaging answers. Compliance team graded the eval set; faithfulness held at 96%.

8 days → min per memo

Ops

Mfg · 200+ emp

Equipment-manual Q&A for the maintenance floor

RAG over scanned PDF manuals (OCR via Unstructured + Tesseract), pgvector on a single Postgres node, corpus was 4.2M vectors. Maintenance engineers ask in plain English from a tablet; answers cite manual + page number, with photos when present.

0 %

Mean diagnosis time -

Patterns across all six engagements: the eval set landed in week 2, before retrieval code; the eval set grew during production via sampled traces; citation enforcement was the headline guardrail, not an add-on. The outcome numbers are what each team measured at 90 days post-launch, not at deploy. The RAG solutions that hold up at 90 days are the ones where the eval set was graded by a domain expert before the first chunk was embedded. Picking a partner that stays for that work is the most underrated criterion in vendor selection.

009 / ENGAGE

Four ways to start a RAG engagement.

Every RAG development services engagement is fixed-scope and fixed-duration. The first phase is small enough that stopping is a real option: about a third of our RAG Pilots end at the pilot for legitimate scoping reasons. It's cheap to discover in a Pilot that the corpus shape doesn't fit, and expensive to discover it 12 weeks in.

RAG Pilot · 2–4 weeks

RAG Pilot · 4 weeks · 4 phases

WEEK 1 Corpus audit

Corpus map + eval scope agreed

Corpus boundary signed off

WEEK 2 Eval set

30–80 graded examples + reference passages

Domain-expert grading complete

WEEK 3 Pipeline

Chunk + embed + retrieve baseline against eval

Baseline retrieval recall hit

WEEK 4 Demo + memo

Demo, scores report, next-phase recommendation

Production Build · 8–14 weeks

Production Build · 14 weeks · 6 phases

WEEK 1–2 Corpus + eval

Corpus map, eval set, stack lock

WEEK 3–5 Ingestion

Chunking + embedder benchmarked + locked

Baseline retrieval recall hit

WEEK 5–8 Retrieval

Hybrid + rerank + rewrite tuned to eval set

Recall + faithfulness above target

WEEK 8–10 Eval gates

Five metrics green vs target

All five green or no deploy

WEEK 10–12 Deploy

Auth, observability (Langfuse), CDC pipeline live

WEEK 12–14 Iteration

Weekly eval review, runbook, ownership transfer

RAG Rescue · 4–6 weeks

RAG Rescue · 6 weeks · 4 phases

WEEK 1 Eval audit

Current-system grading vs your eval set (we build one if absent)

WEEK 2 Failure-mode

Classified failures: chunking / retrieval / rerank / prompt / model

Failure breakdown reviewed

WEEK 3–5 Targeted fix

Each failure-mode addressed in order of recall lift expected

WEEK 6 Validation

Validated against your eval set; runbook updated

01 RAG Pilot Fixed scope

2–4 weeks

Prove the corpus works.

In scope

One corpus, one question shape
Eval set with 30–80 graded examples
Working prototype against your real docs
Demo + recommendation memo for the next phase

Out of scope

Production deploy
CDC / freshness pipeline
Multi-corpus orchestration

02 Production RAG Build Fixed scope

8–14 weeks

Full pipeline with eval gates.

In scope

All Pilot deliverables
Ingestion + CDC for freshness
Hybrid retrieval + reranker
Production wire-up, Langfuse observability, eval gates
Four weeks of post-launch iteration with weekly eval runs
On-call runbook and ownership transfer

03 RAG Rescue Fixed scope

4–6 weeks

Diagnose and fix a struggling RAG.

In scope

Eval audit on the current system (we build an eval set if absent)
Failure-mode classification (chunking · retrieval · rerank · prompt · model)
Targeted fixes in order of expected recall lift
Validated against your eval set; runbook updated

04 Vector DB Migration Fixed scope

6–10 weeks

Move stores, zero downtime.

In scope

Dual-write phase
Index parity checks against your eval set
Cutover playbook with rollback ready
Documented for handover

Two patterns worth flagging on RAG engagements specifically. The eval set is the deliverable, even more than the pipeline. A pipeline you can rebuild; an eval set is institutional knowledge about what your business considers a correct answer. We hand it over in your repo, with grading criteria documented. About 70% of Pilots convert to Build engagements. The 30% that don't either re-scoped based on what the Pilot revealed or decided the workflow wasn't yet ready for retrieval. Both are legitimate outcomes; we'd rather flag it at week 3 than at week 12.

WHO YOU WORK WITH

One Paiteq RAG engineering lead acting as your dedicated RAG consultant, one senior RAG developer handling the retrieval pipeline, and a fractional product manager for scope and stakeholder management. On Rescue and Migration engagements we add a platform engineer for the index / CDC work. Two-week iteration cycles with a weekly demo. You have a direct Slack channel with the build team, with no account-management buffer between you and the people doing the work.

On the client side, the engagement needs a domain expert to grade the eval set (~6 hours per week during weeks 1–3, then ~2 hours per week ongoing) and an IT or data owner to clear access to source systems. We don't need a project manager on your side; we run that. We do need fast decisions on residency, scope boundaries, and acceptable refusal rates. If you're considering hiring a RAG developer or RAG consultant rather than a team engagement, the Pilot usually clarifies whether that's the right call.

010 / FAQ

Common RAG questions.

RAG or fine-tuning: how do we decide?

Default to RAG. Fine-tune only when style, output format, or domain language can't be solved at the prompt + retrieval layer. They compose well: most production systems use RAG for facts and a small LoRA fine-tune for output style.

The clearest split: if the answer needs citations, freshness, or refusal-on-thin-context, RAG fits. If the answer is purely stylistic (tone, format, jargon the base model fumbles), fine-tuning fits. Hybrid is common; we scope both at week 2 of any Build.

The interactive picker above walks the decision in 3–4 questions. Our piece on when RAG beats fine-tuning has a deeper breakdown by workload type. Fine-tuning specifically lives in our LLM fine-tuning practice.

Which vector database should we pick?

Depends on five inputs: corpus size, residency requirements, ops capacity, latency budget, and whether hybrid search and reranking are first-class needs. Our usual call:

pgvector: corpora under ~5M vectors where the team already runs Postgres and joins between vectors and structured filters matter.
Pinecone: 5M–100M vectors, no ops appetite, SOC 2 hosted. Default for SaaS clients.
Qdrant: self-hosted deployments with residency requirements, very large corpora, or a need for quantization to cut RAM 4×.
Weaviate: hybrid-search-heavy workloads and multi-tenant SaaS where each tenant gets a namespace.

We benchmark two candidates against your real eval set on every Production Build before locking. The eval set, not vendor marketing, decides. Vector database work is part of every engagement, so selection isn't a one-shot decision.
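The usual call above can be sketched as a shortlist function that picks candidates to benchmark. This is an illustration only: the `Requirements` fields, the rule ordering, the 100M cut-off for "very large", and the Pinecone fallback are our assumptions for the example, and the eval-set benchmark still makes the final decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    vectors: int
    runs_postgres: bool = False
    must_self_host: bool = False
    hybrid_heavy: bool = False
    multi_tenant: bool = False


def shortlist(req: Requirements) -> List[str]:
    """Return up to two stores to benchmark against the eval set."""
    ranked: List[str] = []
    if req.must_self_host or req.vectors > 100_000_000:
        ranked.append("Qdrant")
    if req.hybrid_heavy or req.multi_tenant:
        ranked.append("Weaviate")
    if req.vectors < 5_000_000 and req.runs_postgres:
        ranked.append("pgvector")
    if 5_000_000 <= req.vectors <= 100_000_000 and not req.must_self_host:
        ranked.append("Pinecone")
    if not ranked:
        # Assumed fallback: the hosted default when no rule matches.
        ranked.append("Pinecone")
    return ranked[:2]


print(shortlist(Requirements(vectors=2_000_000, runs_postgres=True)))
print(shortlist(Requirements(vectors=50_000_000, must_self_host=True, hybrid_heavy=True)))
```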

How do you measure RAG quality?

Five dimensions, scored separately because they fail differently:

Retrieval recall: did we pull the right passages? Scored against gold-passages in the eval set.
Context relevance: are the retrieved passages on-topic, or off-topic noise that fits keyword-wise?
Answer faithfulness: is every claim grounded in a retrieved passage? RAGAS + LLM-as-judge, human spot-check on the disputed 5%.
Hallucination rate: claims with no retrieved support. Hard gate before deploy.
P95 latency: query to final token. Reranker is usually the bottleneck.

The default eval stack is RAGAS + Trulens + Langfuse. The eval set grows from production traces every month, with regression alarms if any metric drops by >5 points. Our piece on eval framework comparison covers when to reach for which tool.
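A minimal sketch of the regression alarm, assuming the percentage metrics are stored in points (0–100) per eval run. Treating a rise in hallucination rate as a "drop", leaving latency out because it is not measured in points, and the metric key names are all assumptions made for this example.

```python
from typing import Dict

# True means higher is better; for hallucination rate a rise is the regression.
HIGHER_IS_BETTER = {
    "recall": True,
    "context_relevance": True,
    "faithfulness": True,
    "hallucination_rate": False,
}


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 5.0,
) -> Dict[str, float]:
    """Metrics that worsened by more than max_drop points since the baseline run."""
    alarms = {}
    for name, higher_better in HIGHER_IS_BETTER.items():
        delta = current[name] - baseline[name]
        worsening = -delta if higher_better else delta
        if worsening > max_drop:
            alarms[name] = round(worsening, 2)
    return alarms


# Hypothetical scores before and after an upstream model swap.
baseline = {"recall": 92.0, "context_relevance": 88.0, "faithfulness": 96.0, "hallucination_rate": 2.0}
current = {"recall": 85.5, "context_relevance": 86.0, "faithfulness": 95.0, "hallucination_rate": 2.5}
print(regressions(baseline, current))  # {'recall': 6.5}
```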

What about freshness? Our docs change every day.

Incremental ingestion with change-data-capture from your source systems. New and changed documents are re-embedded and replaced in the store; deletes propagate. Stale-document rate is a first-class metric, not an afterthought: we track it weekly and alert above your tolerance. For Confluence / SharePoint / Notion, the standard pattern is webhook-driven; for Postgres or other DBs we use Debezium or the equivalent.
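The sync logic can be illustrated with a content-hash diff. Note the swap: this sketch compares snapshots instead of consuming webhook or Debezium change events, so it is a stand-in for the CDC feed, not an implementation of it. The function names and the dictionary shapes are assumptions for the example.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source: Dict[str, str], indexed: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """source maps doc ID to current text; indexed maps doc ID to the hash at last embed.

    Returns (doc IDs to re-embed, doc IDs to delete from the store).
    """
    to_embed = sorted(d for d, text in source.items() if indexed.get(d) != content_hash(text))
    to_delete = sorted(d for d in indexed if d not in source)
    return to_embed, to_delete


def stale_rate(source: Dict[str, str], indexed: Dict[str, str]) -> float:
    """Share of live source documents whose indexed copy is missing or outdated."""
    if not source:
        return 0.0
    to_embed, _ = plan_sync(source, indexed)
    return len(to_embed) / len(source)


# Hypothetical state: "b" was edited, "c" is new, "d" was deleted at the source.
source = {"a": "v1", "b": "v2 edited", "c": "new"}
indexed = {"a": content_hash("v1"), "b": content_hash("v2"), "d": content_hash("gone")}
print(plan_sync(source, indexed))  # (['b', 'c'], ['d'])
```

Tracking `stale_rate` after each sync gives the weekly number to alert on.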

Can you migrate us off Pinecone (or any other store)?

Yes. Vector DB Migration is a fixed-scope engagement, 6–10 weeks depending on corpus size and the retrieval logic that has to come along. Dual-write phase first (writes go to both stores), then index parity checks against your eval set, then read cutover with rollback ready. We've shipped Pinecone → Qdrant migrations of 22M chunks with zero downtime and zero retrieval-recall regression.
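The dual-write and parity-check phases might look like this in outline. It is a sketch under stated assumptions: the two stores are plain dictionaries standing in for real vector-store clients, `search_old` and `search_new` are hypothetical callables returning ranked document IDs, and the 0.8 overlap floor is an illustrative value, not a fixed standard.

```python
from typing import Callable, Dict, List


class DualWriter:
    """Send every write to both stores during the migration window."""

    def __init__(self, old_store: Dict[str, list], new_store: Dict[str, list]):
        self.old = old_store
        self.new = new_store

    def upsert(self, doc_id: str, vector: list) -> None:
        self.old[doc_id] = vector
        self.new[doc_id] = vector

    def delete(self, doc_id: str) -> None:
        self.old.pop(doc_id, None)
        self.new.pop(doc_id, None)


def topk_overlap(old_ids: List[str], new_ids: List[str]) -> float:
    """Share of the old store's top-k results that the new store also returns."""
    if not old_ids:
        return 1.0
    return len(set(old_ids) & set(new_ids)) / len(old_ids)


def parity_report(
    queries: List[str],
    search_old: Callable[[str, int], List[str]],
    search_new: Callable[[str, int], List[str]],
    k: int = 5,
    min_overlap: float = 0.8,
) -> Dict[str, float]:
    """Eval queries whose top-k results diverge between stores beyond the floor."""
    failures = {}
    for query in queries:
        overlap = topk_overlap(search_old(query, k), search_new(query, k))
        if overlap < min_overlap:
            failures[query] = overlap
    return failures


# Stubbed searches: the new store disagrees on the second query.
search_old = lambda q, k: ["a", "b", "c", "d", "e"][:k]
search_new = lambda q, k: (["a", "b", "c", "d", "e"] if q == "q1" else ["a", "b", "x", "y", "z"])[:k]
print(parity_report(["q1", "q2"], search_old, search_new))  # {'q2': 0.4}
```

An empty report on the full eval set is the signal to proceed to read cutover; any entry points at a query to investigate while both stores are still live.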

How do you handle PII, residency, and compliance?

PII scrubbing happens at ingest via Microsoft Presidio or your existing DLP, so embeddings store no raw PII by default. For regulated workloads we deploy fully on your cloud (AWS, GCP, Azure) with no data leaving the perimeter; the embedding model runs on dedicated GPU/CPU (BGE for residency-constrained clients). We follow SOC-2-ready practices (audit logs, least-privilege IAM, key rotation, encryption in transit and at rest) by default; HIPAA-aligned and GDPR / EU AI Act postures are scoped into the SOW for regulated engagements.

How do you prevent hallucination in production?

Three layers. (1) Retrieval threshold: if no passage scores above a tuned floor, the agent refuses rather than guesses. (2) Citation enforcement: every claim points to a retrieved passage, and the LLM is prompted to flag claims it can't ground. (3) Faithfulness scoring: LLM-as-judge with Claude Sonnet 4.6 plus human spot-check on disputed cases. Refusal is a feature, not a failure mode. We've shipped systems where 8–12% of queries get refused, and the business is happier with that than with confident wrong answers.

What does a RAG development services engagement cost?

Pilot is fixed-scope at 2–4 weeks; Production RAG Build is 8–14 weeks; Rescue is 4–6 weeks; Migration is 6–10 weeks. We share the price band on the contact call rather than publishing it here because corpus size, residency posture, and integration count swing it meaningfully. The Pilot is small enough that stopping is a real option: about a third of RAG Pilots end at the pilot for legitimate scoping reasons.

Do you build the eval set or do we?

Your domain expert grades; we facilitate. The eval set is the most important deliverable of the engagement and it has to reflect your business's failure modes, not ours. We bring the structure (30–80 examples, gold-passages, edge cases over easy cases), the tooling (RAGAS, Trulens, custom harness), and 4–6 hours of facilitation per week. Your domain expert grades the examples and signs off. After launch, we co-curate from sampled production traces monthly.

Where RAG connects.

RAG is the spine under most production LLM systems but rarely the whole engagement. The adjacent practices that usually show up in the same brief: autonomous agent production workloads (where retrieval is one tool the agent calls, not the whole loop), LLM application development (where the chat surface lives), and chatbot development when the deployment surface is conversational rather than agentic.

The single highest-value RAG deployment we ship is for AI healthcare software development (clinical-knowledge RAG, PHI-safe retrieval, citation enforcement on every answer, BAA-backed model hosting). For commerce buyers, AI for ecommerce RAG (product knowledge grounding, policy retrieval) is a strong runner-up. For mobile + RAG (in-app knowledge surfaces inside a Flutter or React Native app), the work flows through the same team that maintains GetWidget (the open-source Flutter UI library used by Flutter teams worldwide), so chunk-size calibration, retrieval latency budgets, and mobile rendering live in one bench. Buyer-facing decision walkthrough lives in the 2026 customer service chatbot guide, which has the RAG-vs-direct-LLM math. The broader engineering context lives on the Paiteq engineering practice page and the full AI development company homepage.

011 / Related practices

Adjacent services.

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

AI CONSULTING

AI Consulting

AI strategy, audits, roadmap.

012 / Start a project

Let's ground your AI in real data.

RAG Pilot in 2–4 weeks. Production Build in 8–14. Rescue in 4–6.

Talk to engineering Architecture review