RAG development services for systems that cite their sources.
Paiteq builds retrieval-augmented generation systems from corpus audit to live retrieval: production RAG development services on Pinecone, Qdrant, Weaviate, and pgvector. Hybrid retrieval, reranking, and citation-enforced answers. Refusal when the context is thin, not confident guesses dressed up as evidence.
Eight RAG surfaces we ship.
Each surface below is a workload we've shipped to production — with the corpus shape, the eval methodology, and the failure modes already worked out. See three in detail →
RAG pipeline development sorts cleanly by corpus shape, not by industry. A contracts-Q&A pipeline for a law firm reuses the same hierarchical chunking, hybrid retrieval, and citation-enforcement prompts as an enterprise RAG deployment over policy documents at a health-tech company. The integrations, freshness rules, and residency posture change; the shape of the retrieval pipeline doesn't. Sorting by corpus shape lets us reuse the eval harness and reranker tuning across clients. Sorting by industry, the way most competing pages organize themselves, hides where the engineering actually lives.
Where we've shipped each RAG surface. Strength reflects production volume, not theoretical fit — empty cells mean we either haven't done it yet or the workload didn't justify a retrieval pipeline.
Heavier columns: legal, fintech, healthcare. The pattern is unsurprising — those are the industries where citations, refusal, and provenance pay back the hardest. Lighter columns: e-commerce and ed-tech, where workloads tilt toward generation and recommendation more than retrieval. The grid isn't a roadmap; it's a record. We'll talk you out of a RAG-shaped engagement if the workload is actually a generation problem (better fit for our generative AI practice) or a tool-use problem (better fit for autonomous tool-use agents).
One more pattern worth naming: about 25% of the engagements that start as RAG end up being RAG consulting work instead. The client comes in wanting a built pipeline; the corpus audit reveals that the source data isn't yet in a shape any pipeline can win against. The right next step is a six-week consulting engagement to get the source data structured (header conventions, OCR cleanup, deduplication, freshness pipeline) before any retrieval code ships. Sometimes we hand that work to your team and sometimes we run it ourselves. Either way, we say "not yet" on the build until the corpus is ready, because shipping a RAG against a broken corpus is how you end up calling Rescue six months later.
RAG development services — pick where to start.
Four engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done. Full engagement-model breakdown below →
Choosing a RAG development services partner is mostly about choosing the right starting shape. Buyers who walk in with a scoped corpus and one question type ship to production around 75% of the time. Buyers who walk in with "we want a RAG over everything we have" ship around 30% of the time, usually after a re-scope. The four shapes below map onto the starting points we see most often; pick the one that matches what you actually have, not what you wish you had. Every shape is a RAG development service we've shipped 10+ times; the deliverables and gate criteria are locked in by repetition, not invented for your engagement. Each starts by scoping the LLM knowledge base boundary (what the system can and can't answer) before any ingest code ships.
A practical decision tree: if the corpus is scoped but the retrieval quality is unproven, start with a Pilot. If you know the corpus works and you need production discipline (CDC, observability, eval gates), start with a Production RAG Build. If you've already shipped RAG and it hallucinates, retrieves wrong, or can't keep up, start with Rescue. If the only thing wrong with the current system is the vector store underneath it, start with Migration. Each of these is a real RAG development service we run on a fixed scope, and our RAG consulting practice will tell you to start narrow: moving up is cheap; over-scoping isn't. Week-by-week scope on each, further down →
The retrieval stack: vector DBs, embedders, rerankers.
Stack choices follow the workload, not house preferences. Vector store, embedding model, and reranker are all benchmarked against your real eval set before we lock anything in.
- LlamaIndex
- LangChain
- Pinecone
- Qdrant
- Weaviate
- pgvector
- Chroma
- Cohere Rerank
- Voyage
- OpenAI Emb.
- RAGAS
- Trulens
- BGE
- Langfuse
- Presidio
- Unstructured
For each store: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production RAG solutions on every one of these, so the "when we don't" lines come from real builds, not theory. Vector database selection isn't a one-shot decision; we benchmark two candidates against your eval set on every Production Build.
Pinecone
Strongest at: Managed, serverless, predictable. Sub-100ms p95 on most workloads up to ~100M vectors. Hybrid search and namespaces built in.
When we pick it: Mid-size corpora (5M–100M vectors) where the team has zero appetite to run infra. SaaS clients without a platform team. Anywhere predictable cost and SOC 2-ready hosting matter more than knob-tuning.
When we don't: Cost-sensitive workloads above 100M vectors (the unit economics flip vs Qdrant self-hosted). Workloads needing exotic distance metrics or custom HNSW parameters Pinecone doesn't expose.
Paiteq pattern: Default for clients without a platform team. We benchmark it against Qdrant on every Production Build, but Pinecone wins ~60% of those head-to-heads on operational simplicity.
Qdrant
Strongest at: Open-source with a Rust core. Excellent performance per dollar self-hosted. Strong filtering, payload indexing, and quantization options.
When we pick it: Self-hosted requirements (residency, regulated workloads). Large corpora where the dedicated infra cost beats managed pricing. Anywhere we need scalar / product quantization to cut RAM 4×.
When we don't: Tiny corpora (<1M vectors), where the operational tax doesn't pay off. Teams with no Kubernetes capacity.
Paiteq pattern: Our default for enterprise RAG deployments on AWS / GCP where data residency matters. We run Qdrant Cloud for managed convenience and self-hosted for regulated workloads, both behind the same API.
Weaviate
Strongest at: Hybrid search and modular reranking are first-class. GraphQL API. Built-in multi-tenancy. Optional ML modules (vectorizer, reader).
When we pick it: Hybrid-search-heavy workloads where rerank logic and ranking modules need to ship with the store. Multi-tenant SaaS where each customer gets isolated namespaces.
When we don't: Simpler workloads where pgvector or Pinecone covers it. Teams allergic to GraphQL (Weaviate's REST is fine, but the docs lean GraphQL).
Paiteq pattern: We reach for Weaviate when the client's eval set shows hybrid search lifts recall by ≥10 points over pure vector. Modular reranker plug-ins shorten the build by a week or two.
pgvector
Strongest at: Postgres extension. No extra system to operate. Joins between vectors and structured data. HNSW + IVFFlat indexes.
When we pick it: Corpora that fit on one Postgres node (typically <5M vectors). Teams that already run Postgres and don't want another service. Apps where vector + relational filtering matter together.
When we don't: Above ~5M vectors with high QPS, where index rebuild times and connection pool pressure start hurting. Workloads with very high write churn.
Paiteq pattern: Default for early-stage SaaS clients. We migrate to Qdrant or Pinecone the first time index rebuild crosses 90 seconds on production data; that's the usual tell.
Embedder choice usually matters more than vector store choice for retrieval recall — and it's the easier thing to swap later. A reranker on top of hybrid retrieval is the single highest-ROI add on most of the RAG pipeline development work we do. Hybrid search implementation, when it earns its keep on the eval set, lifts top-k precision by another 8–18 points.
OpenAI text-embedding-3-large
Strongest at: Strong general-purpose baseline. 3072 dims (truncatable to 1536 / 512 / 256 via MRL). Stable API.
When we pick it: Default for English-first corpora and mixed-domain workloads. When the client has an OpenAI contract already and procurement won't add a vendor.
When we don't: Multi-lingual corpora, where Voyage and BGE win our evals. Workloads where data residency rules out hosted embedding APIs.
Paiteq pattern: Our day-one baseline; we benchmark against Voyage and BGE on every Production Build. About 40% of builds end up swapping to one of those.
Voyage
Strongest at: Consistently tops the MTEB leaderboard for our evals. Strong long-context (32k) embedding. Per-domain models for code, finance, law.
When we pick it: Long documents where 32k context preserves more signal than chunking. Domain-specific corpora (legal, finance, code) where Voyage's domain models lift recall 5–12 points.
When we don't: Hyper-cost-sensitive batch workloads, where OpenAI's batch pricing wins by ~30%. Workloads requiring on-prem embedding (use BGE instead).
Paiteq pattern: Default for legal and clinical RAG. We've shipped contract-Q&A systems where Voyage-law improved retrieval recall from 78% to 91% over the OpenAI baseline.
BGE
Strongest at: Open-weights. Run on your own GPU or CPU. BGE-large-en is the strongest open embedder we've benchmarked. Free at scale.
When we pick it: Regulated workloads where no data can leave the perimeter. Very high-volume batch embedding where hosted API cost dominates. BAAI-licensed deployments.
When we don't: Tiny corpora, where the GPU isn't worth it. Teams without infrastructure to run an embedding service. Multilingual workloads where Voyage still wins.
Paiteq pattern: We run BGE on a small CPU pool for healthcare and finance clients with strict residency. Throughput tuning matters: batch size and ONNX runtime get most of the gains.
Cohere Rerank
Strongest at: Cross-encoder reranker that lifts top-k precision by 8–18 points on our evals over pure vector. Hosted API, sub-200ms.
When we pick it: Almost every Production RAG Build. Reranking on top-50 vector hits down to top-5 is the single highest-ROI add we make.
When we don't: Very tight latency budgets (Cohere adds ~150–200ms). Pure offline batch where the reranker cost stacks up. Workloads where a self-hosted bge-reranker fits the same role for free.
Paiteq pattern: Default. We benchmark Cohere Rerank 3 vs bge-reranker-large head-to-head on the eval set; Cohere usually wins but bge-reranker self-hosted is the strong runner-up.
Two patterns worth flagging on every RAG engagement. First, we benchmark two embedders against the eval set before locking the stack — usually OpenAI text-embedding-3-large against a domain-specific Voyage model. The eval set decides, not the leaderboard. Second, we default to including a reranker on every Production Build. The 150–200ms latency tax is almost always worth the 8–18 point precision lift. Skip it only when latency budget rules it out — voice-RAG, sub-700ms response targets. Our deeper take on chunking — including when fixed-size beats semantic — lives in our chunking strategy deep dive.
Chunking strategy is the choice that decides retrieval recall more than embedder choice does, and it's the one most teams get wrong on a first build. Three patterns cover ~90% of what we ship.
01 · Fixed-size + overlap
500–1,000 token chunks with 10–20% overlap. Default starting point for narrative text — support docs, knowledge bases, marketing-style content. Fastest to ship; usually the baseline every other chunker has to beat.
When it works: uniform-density text where paragraphs and sections are short. When it fails: structured documents (contracts, manuals) where headers carry semantic weight.
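A minimal sketch of fixed-size chunking with overlap, assuming the text has already been tokenized into a list. The function name `chunk_fixed`, the 800-token window, and the 15% overlap are illustrative picks from within the ranges above, not a prescribed configuration.

```python
def chunk_fixed(tokens, size=800, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that share an overlap."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = round(size * overlap_ratio)   # tokens shared by neighbouring chunks
    step = max(1, size - overlap)           # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic tokens stand in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(2000)]
chunks = chunk_fixed(tokens)
print([len(c) for c in chunks])  # prints [800, 800, 640]
```

The last chunk is allowed to run short rather than being padded or merged; whether that is acceptable is something the eval set would have to confirm.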
02 · Recursive + title-aware
Splits at heading boundaries first, then by paragraph, then by sentence — preserving document structure in the chunk. Hierarchical: each chunk knows its parent section. Our default for legal, regulatory, technical documentation.
When it works: structured docs with consistent heading conventions. When it fails: badly OCR'd PDFs where the heading detection breaks — falls back to fixed-size with a recall penalty of 8–15 points.
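A rough sketch of heading-first splitting, assuming Markdown-style `#` headings (real corpora usually need a format-specific heading detector). The name `chunk_by_headings` and the `max_chars` limit are hypothetical, and the final sentence-level split is left out to keep the example short. Each chunk carries its parent section path, which is the "chunk knows its parent section" idea.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(text, max_chars=1200):
    """Split on headings first, then on paragraphs inside oversized sections."""
    chunks, path, buf = [], [], []

    def flush():
        body = "\n".join(buf).strip()
        buf.clear()
        if not body:
            return
        section = " > ".join(path)  # parent section path for this chunk
        piece = ""
        for para in re.split(r"\n\s*\n", body):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append({"section": section, "text": piece})
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        chunks.append({"section": section, "text": piece})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the section under the old path
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks


doc = """# MSA
Intro text.
## Termination
Either party may terminate.

Notice is 30 days.
## Liability
Capped at fees."""

sections = [c["section"] for c in chunk_by_headings(doc)]
print(sections)  # prints ['MSA', 'MSA > Termination', 'MSA > Liability']
```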
03 · Semantic / embedding-based
Splits on cosine-distance jumps between adjacent sentences, so chunks form where the topic shifts. Higher ingest cost (every sentence is embedded and compared with its neighbour), but it lifts retrieval recall 5–12 points on conceptually dense content like research papers and analyst memos.
When it works: long-form analytical writing, research synthesis, ed-tech curriculum. When it fails: fragmentary content (chat logs, ticket threads) where the semantic signal is noisy.
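A toy sketch of splitting on cosine-distance jumps. The bag-of-words `toy_embed` function is a stand-in for a real embedding model, and the 0.8 threshold is an arbitrary assumption; in practice both would be chosen against the eval set.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedder: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, embed=toy_embed, threshold=0.8):
    """Start a new chunk wherever adjacent sentences are far apart."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, vec) > threshold:
            chunks.append(current)   # topic shift: close the current chunk
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "The contract term is five years.",
    "The contract term renews yearly.",
    "Solar panels convert sunlight.",
    "Solar panels need sunlight.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

With these four sentences the only split lands between the contract sentences and the solar sentences, because that pair shares no words.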
We benchmark two chunkers head-to-head against your eval set on every Production RAG Build — typically recursive + title-aware vs semantic — and the winner is locked at week 4. The losing strategy stays in the codebase as a fallback for sources that don't fit the primary pattern (badly OCR'd PDFs, mixed-format archives).
Six steps to build a RAG pipeline — eval-first, every time.
The same process runs across a 4-week Pilot and a 14-week Production Build. The gates change in depth, not in shape. Every step has an explicit deliverable, a named owner, and a gate criterion — pass or rework, no "we'll figure it out next sprint."
Corpus audit
What you've got, in what shape, how fresh, what's allowed to leave your perimeter. We don't write any ingest code until this is signed off.
Eval set
30–80 graded questions with reference answers + the supporting passages each answer should retrieve. Lands before any pipeline code.
Ingestion
Chunking strategy, embedding model selection, store choice, freshness pipeline. Each choice benchmarked, not picked by vibe.
Retrieval
Hybrid (BM25 + dense) plus rerank. Query rewriting where it earns its keep. Tuned against your eval set, not a public benchmark.
Eval gates
Retrieval recall, answer faithfulness, context relevance, hallucination rate, p95 latency — all green before any production wire-up.
Running
Weekly eval, freshness monitoring, prompt + chunking iteration based on production logs. The eval set grows from sampled traces.
Corpus audit
We walk every source in your corpus before anyone proposes a chunking strategy. That means listing every source (Confluence, SharePoint, Postgres tables, S3 PDFs, the legacy DMS no one talks about), the rough doc count, the freshness cadence per source, the access policy, and what counts as in-scope vs out. Week-1 output is a corpus map: sources, volumes, formats, access patterns, and what's allowed to embed off-perimeter.
Eval set
30–80 graded questions, each with a reference answer and the passage(s) the retriever should pull. Your domain expert grades; we facilitate. Realistic edge cases matter more than easy ones — the eval set is what tells us when chunking strategy needs to change, so it has to surface real failure modes. We build it before any retrieval code ships.
Ingestion
Chunking strategy, embedder choice, vector store choice, and the freshness pipeline. Each gets a small head-to-head against the eval set — 2–3 chunking strategies (fixed-size / semantic / title-aware), 2 embedders, sometimes 2 stores. The winner isn't decided by a benchmark blog post; it's decided by your eval set.
Retrieval
Hybrid retrieval (BM25 + dense, scored together), reranker layer (Cohere Rerank 3 or bge-reranker), query rewriting (HyDE, multi-query) where it lifts scores. Each layer is added only if it earns its keep on the eval set. We've shipped RAG systems where the reranker alone moved hallucination rate from 11% to 2%.
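One common way to score BM25 and dense results together is reciprocal rank fusion (RRF). The sketch below is an assumption about how the fusion step could look, not a description of the pipeline shipped on any engagement; the document IDs are made up, and `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_7", "doc_2", "doc_9"]    # hypothetical sparse results
dense_hits = ["doc_2", "doc_4", "doc_7"]   # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # prints ['doc_2', 'doc_7', 'doc_4', 'doc_9']
```

The fused list would then be cut to the top 50 or so and handed to the reranker. Weighted score fusion is the main alternative, and it is the variant to use when the BM25/dense ratio needs tuning.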
Eval gates
Five thresholds all green before any production wire-up: retrieval recall (did we pull the right passages?), context relevance (are they on-topic?), answer faithfulness (is the answer grounded?), hallucination rate (LLM-as-judge + human spot-check), and p95 latency. Hallucination disputes get human spot-check from your domain expert — we don't let LLM-as-judge stand alone on the hard cases.
Running
Four weeks of post-launch iteration are part of every Production RAG Build. Weekly eval runs, freshness drift checks (stale-document rate as a first-class metric), prompt + chunking iteration on edge cases. The eval set grows from sampled production traces every month; regression alarms fire when an upstream model change drops scores by >5 points.
Two things matter across every RAG system implementation we ship. First, the eval set lands in week 2, before any retrieval code. It is what tells us whether the chunking strategy is wrong, whether the embedder needs to change, and whether the reranker is earning its latency budget. Without it, you're tuning blind. Second, running is a real phase, not an afterthought. The first 4 weeks post-launch are part of every Build engagement: weekly eval review, freshness checks, prompt iteration on edge cases, and regression alarms wired to your on-call. With a RAG consultant in the room at week 2, the hallucination complaint that would otherwise arrive at week 12 almost never does.
RAG vs. fine-tuning — when do you need which?
The most common scoping question we get. Most production systems use both: RAG for facts, a small fine-tune for style or domain vocabulary. Picking which to lean on first decides the engagement shape, and our RAG consulting practice runs this conversation at week 1 of every project.
| | Fine-tuning | RAG |
|---|---|---|
| Grounds in | Static training data | Your live corpus |
| Freshness | Frozen at training | As fresh as your CDC pipeline |
| Setup cost | $$$$ (full fine-tune run) | $$ (chunk, embed, index) |
| Latency | Lower per turn | +200–600ms (retrieval + rerank) |
| Hallucination | Higher (no grounding) | Lower (refusal is possible) |
| Best for | Style, voice, output format | Facts, lookups, citations |
| Eval surface | Output quality only | Recall + faithfulness + answer |
| Compose with | RAG, prompting | Fine-tune, prompting |

Setup cost: Fine-tune runs on GPT-4o or Llama-3 70B easily reach $2,000–$8,000 for a first pass (compute + data prep + eval iteration). RAG infrastructure (chunking, embedding, and indexing a corpus) is typically under $500 for an initial pilot. The cost delta widens when the corpus changes: RAG re-indexes incrementally; fine-tuning re-runs the whole job.

Latency: Fine-tuning genuinely wins on latency: all knowledge is baked into weights, so the forward pass is the only cost. The RAG overhead is real: the +200–600ms is mostly the rerank stage (a cross-encoder scoring 50 candidates), not the ANN lookup itself (typically under 20ms on Qdrant or pgvector). For voice-RAG with a sub-700ms end-to-end budget, you often skip the reranker or run a self-hosted bge-reranker-large on the same box as the embedder.

Hallucination: RAG's hallucination advantage comes from two mechanisms: retrieved passages anchor the generation, and the system can refuse when no retrieved chunk clears the relevance threshold. Fine-tuned models have no equivalent circuit-breaker: if the question pattern resembles training data, the model will produce a fluent-sounding answer regardless of factual grounding. For regulated domains (legal, healthcare, financial), that refusal capability is often non-negotiable.

Eval surface: A richer eval surface sounds like extra work, but it's actually a debugging advantage. When a RAG system regresses, recall vs faithfulness vs answer-quality scores tell you exactly where the pipeline broke. A fine-tuned model that degrades gives you one signal (output quality), and root-causing that to data, prompt, or model is harder. Most teams we work with under-invest in RAG evals initially and then thank themselves for the instrumentation at week 6.
Rule of thumb: if the answer needs to cite its source or stay fresh, you want RAG. If the answer needs to speak in your voice or your domain's jargon, you want fine-tuning. For anything in between, the decision tree below walks through four diagnostic questions; most projects fit cleanly into one of five outcomes.
Answer four questions about the workload. We've used these same questions to right-size scope on every RAG engagement we've run.
Four production RAG patterns.
Most production retrieval-augmented generation systems reduce to one of four patterns. The taxonomy isn't ours (it's standard across the LlamaIndex and LangChain communities), but the deployment choices are where engineering judgment lives: when to pick which, what fails first, which eval metric becomes the anchor.
Pattern choice matters more than store choice or embedder choice for production retrieval recall. We've watched teams agonize over Pinecone vs Qdrant while running pattern 01 (naive RAG) on a corpus that needed pattern 02 (hybrid + rerank). The store decision was worth 1–3 points of recall; the pattern decision was worth 14. When you design a RAG system architecture, start with the pattern question ("what shape of retrieval does this corpus need?"), then pick the store and embedder that fit. Reverse that order and you'll rebuild at week 9, which is the call we get on most Rescue engagements. The LLM knowledge base scope (what the system answers vs refuses) belongs in week 2.
Naive RAG
The simplest pipeline. Query gets embedded, top-k passages come back from the vector store, the LLM generates an answer grounded in them. No rewriting, no rerank, no reflection. About 20% of our pilots ship in this shape — usually when the corpus is small and well-structured, and the question shape is narrow enough that recall is naturally high. Don't reach for the fancier patterns until the eval set says you need to. The naive pipeline ships in days, not weeks, and it's the baseline every other pattern has to beat.
- When it fits: Corpora under ~1M vectors. Narrow, well-defined question shape. Retrieval recall already above 85% on the eval set with a single-strategy retriever. Pilots where the cost of complexity is higher than the precision lift.
- When it doesn't: Recall is below ~80% (try hybrid before adding rerank). Hallucination rate above target. Multi-hop questions that need iterative retrieval. Voice-RAG where you can afford the rerank latency.
Hybrid + Rerank
Our default for Production RAG Builds. Sparse retrieval (BM25) catches keyword matches that dense vectors miss; dense retrieval catches semantic matches BM25 misses; a reranker (Cohere Rerank 3 or bge-reranker-large) takes top-50 down to top-5 with a precision lift of 8–18 points on most evals. The latency cost is real (~150–200ms for the rerank step), but on every Production Build except voice-RAG it's worth paying. Hybrid search implementation is what separates a production-ready pipeline from a demo; about 60% of our production RAG solutions land here.
- When it fits: Mixed question types, some lookup-style, some semantic. Corpora where keyword precision matters (legal, code, technical docs with exact terms). Anywhere recall is the bottleneck on the eval set.
- When it doesn't: Voice-RAG with sub-700ms latency budgets (use self-hosted bge-reranker or skip rerank entirely). Tiny corpora where naive RAG already hits target. Workloads where the BM25 side adds zero recall; measure before adding.
Multi-step RAG
When a single retrieval pass isn't enough — multi-hop questions, ambiguous queries, sparse corpora where the first retrieval misses. The agent rewrites the query (HyDE or multi-query reformulation), retrieves, reflects on whether the retrieved context is sufficient, and loops back to retrieve again if it isn't. Lift on multi-hop questions is 12–25 points of faithfulness; cost is a 1.5–2.5× latency tax. Worth it for research workloads, regulatory Q&A, anything where one retrieval pass isn't going to cover the question.
- When it fits: Multi-hop questions where the answer needs evidence from passages that aren't co-located in the corpus. Research synthesis workloads. Regulatory Q&A across jurisdictions. Anywhere the eval set shows the second-most-likely answer is correct more often than the first.
- When it doesn't: Latency-sensitive paths. Most support / Q&A workloads where the question is narrow enough for single-pass. Cost-sensitive workloads, since each loop is another LLM call.
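The retrieve, reflect, and loop cycle described above can be sketched as a bounded loop. Everything here is a hypothetical skeleton: `retrieve`, `score_context`, and `rewrite` are placeholder callables standing in for the vector-store query, the reflection prompt, and the query rewriter, and the `max_iterations` cap and `good_enough` gate are assumed defaults.

```python
def multi_step_retrieve(query, retrieve, score_context, rewrite,
                        max_iterations=3, good_enough=0.7):
    """Retrieve, score the gathered context, and rewrite the query until
    the score clears the gate or the iteration cap is hit."""
    seen, context = set(), []
    current = query
    for iteration in range(1, max_iterations + 1):
        for passage in retrieve(current):
            if passage not in seen:          # keep each passage once
                seen.add(passage)
                context.append(passage)
        if score_context(query, context) >= good_enough:
            return {"context": context, "iterations": iteration, "sufficient": True}
        current = rewrite(query, context)    # reformulate for the next pass
    return {"context": context, "iterations": max_iterations, "sufficient": False}


# Toy stand-ins so the sketch runs end to end (all assumptions).
corpus = {
    "who signed": ["Acme signed the MSA."],
    "when did Acme sign": ["Signed on 2020-01-05."],
}

def retrieve(q):
    return corpus.get(q, [])

def score_context(q, ctx):
    return len(ctx) / 2          # pretend two passages are enough

def rewrite(q, ctx):
    return "when did Acme sign"

result = multi_step_retrieve("who signed", retrieve, score_context, rewrite)
print(result["iterations"], result["sufficient"])  # prints 2 True
```

Returning `sufficient: False` at the cap lets the caller refuse instead of answering from thin context, and the cap is also what bounds the per-query cost.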
Agentic RAG
Used when the workload needs retrieval as part of a broader autonomous task, not as the whole task. The agent treats the vector store as a tool it can call when grounding is needed, not as a fixed pre-step before generation. Right when context grounding matters but autonomy depth matters more: clinical decision support, contract review with multi-document reasoning, regulatory research across jurisdictions. Eval anchors shift: retrieval recall and answer faithfulness still matter, but task success rate becomes the headline. Agentic RAG sits at the boundary of this pillar and autonomous tool-use agents; when the autonomy is the bigger half of the workload, the engagement usually lives on that pillar instead.
- When it fits: Workload that's mostly autonomous task execution with retrieval as one tool among several. Multi-document reasoning where the agent has to decide what to retrieve next. Clinical or compliance workflows where citation enforcement and judgment both matter.
- When it doesn't: Pure retrieval Q&A (agentic adds 800–1500ms per loop turn for no gain). Latency budgets that don't tolerate iterative retrieval. Workloads where the agent's tool surface is mostly non-retrieval.
A common scoping mistake on enterprise RAG projects: clients ask for pattern 03 (multi-step) when pattern 02 + a better embedder would have shipped in half the time. Each retrieval loop roughly doubles latency cost, and the eval surface widens. Default to pattern 02 (hybrid + rerank). Move up only when the eval set tells you to. About a third of "we need multi-step RAG" requests we audit end up landing back on pattern 02 once the hybrid search implementation is properly tuned; the deeper take on that call lives in our hybrid retrieval breakdown.
Most RAG systems we audit fail on the same handful of issues — and the symptoms line up with the pattern in use. Quick triage list:
Naive RAG
- Low recall (<75%): chunks too large, embedder weak on the domain, or BM25-eligible queries hitting only dense.
- High latency: rare; usually the LLM, not retrieval. Check token count first.
- Wrong-page citations: chunk boundaries broke mid-section; switch to recursive + title-aware.
Hybrid + Rerank
- Reranker latency spike: Cohere rate-limited or self-hosted bge-reranker over-loaded; cache or batch.
- Hybrid weighting wrong: sparse pulling too much noise; tune the BM25/dense ratio against your eval set.
- Domain drift: embedder trained on general English failing on legal / clinical vocabulary; swap to Voyage-domain.
Multi-step RAG
- Loop never terminates: reflection prompt unable to recognise "good enough"; add a max-iterations cap + scoring gate.
- Cost blow-up: each loop is another LLM call; budget per query, circuit-break above ceiling.
- Latency tax: 2–3× over single-pass; only worth it when faithfulness lift on the eval set crosses 12+ points.
Agentic RAG
- Agent retrieves too much: tool-call budget unconstrained; cap retrievals per loop turn.
- Citations drift: agent paraphrases passages losing provenance; force structured-output citations.
- Scope creep into agent territory: if 80%+ of work is non-retrieval tool-use, the engagement belongs on the AI agent practice instead.
Four eval dimensions on every RAG we ship.
Generic LLM eval frameworks miss RAG-specific failure modes. We score retrieval and generation separately, then together — and the eval set lands in week 2 of every engagement, before the first chunk is embedded.
Retrieval recall: Did the retriever pull the passages that contain the answer? Scored against gold-passages in the eval set.
Answer faithfulness: Is every claim grounded in a retrieved passage? RAGAS + LLM-as-judge, with human spot-check on the disputed 5%.
Hallucination rate: Claims no retrieved passage supports. Hard gate before production. Refusal is preferred over guessing.
P95 latency: Query-to-answer latency across retrieve, rerank, and generate. Reranker is the usual bottleneck. Voice-RAG targets sub-700ms.
Numbers shown are illustrative target ranges for new engagements until production eval data from anonymised builds is published.
The four gates aren't suggestions. All four must be green before we wire any RAG into production traffic. Each has an explicit methodology, a target, and a fail-state — codified before the first chunk is embedded.
- 01 Retrieval recall ≥88%
Gold-passages in the eval set; recall@k for k=10 and k=5. Re-graded weekly. Production traces sampled into the eval set monthly.
If <80%, retrieval logic gets rewritten — chunking, embedder, or hybrid weighting all back on the table.
- 02 Answer faithfulness ≥94%
RAGAS + LLM-as-judge (Claude Sonnet 4.6) scoring whether every claim is supported by a retrieved passage. Human spot-check on the 5% disputed by the judge.
If <90%, citation-enforcement prompts get rewritten or the model gets demoted to refusal-only on low-confidence retrievals.
- 03 Hallucination rate <3%
Claims that no retrieved passage supports. Hard gate before production wire-up. Refusal is the preferred failure mode, not a guess.
If ≥5%, we widen the refusal threshold and rerun. We've never shipped a RAG with hallucination above 3% on the eval set.
- 04 P95 latency <1.6s
Full query-to-answer latency across embed, retrieve, rerank, generate. Reranker is the usual bottleneck. Voice-agent RAG targets sub-700ms with bge-reranker self-hosted.
If breached for >72h, we re-evaluate reranker placement or move to streaming generation.
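As a sketch of how the recall metric and the four gates could be checked in code: `recall_at_k` follows the standard definition (share of gold passages found in the top k), and the thresholds are the targets listed above. The data layout, function names, and sample numbers are assumptions for illustration.

```python
def recall_at_k(retrieved, gold, k):
    """Fraction of gold passages that appear in the top-k retrieved IDs."""
    if not gold:
        raise ValueError("each eval question needs at least one gold passage")
    top_k = set(retrieved[:k])
    return sum(1 for g in gold if g in top_k) / len(gold)


def mean_recall_at_k(eval_rows, k):
    return sum(recall_at_k(r["retrieved"], r["gold"], k) for r in eval_rows) / len(eval_rows)


# Gate targets taken from the list above; the key names are hypothetical.
GATES = {
    "retrieval_recall": lambda v: v >= 0.88,
    "answer_faithfulness": lambda v: v >= 0.94,
    "hallucination_rate": lambda v: v < 0.03,
    "p95_latency_s": lambda v: v < 1.6,
}


def failed_gates(metrics):
    """Names of gates that are not green for this eval run."""
    return [name for name, passes in GATES.items() if not passes(metrics[name])]


eval_rows = [
    {"retrieved": ["p1", "p9", "p3"], "gold": ["p1", "p3"]},
    {"retrieved": ["p4", "p5", "p6"], "gold": ["p6", "p8"]},
]
metrics = {
    "retrieval_recall": mean_recall_at_k(eval_rows, k=3),   # 0.75 on this toy set
    "answer_faithfulness": 0.95,
    "hallucination_rate": 0.02,
    "p95_latency_s": 1.2,
}
print(failed_gates(metrics))  # prints ['retrieval_recall']
```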
Two methodology notes that matter. First, we use LLM-as-judge with Claude Sonnet 4.6 as the default faithfulness scorer because it produces the most consistent grades against human ground-truth on the eval sets we've shipped. Hallucination disputes (typically 5–8% of outputs) get human spot-check by your domain expert; we never let LLM-as-judge stand alone for the hard cases. Second, the eval set grows during production: traces sampled monthly, regression alarms firing when an upstream model swap drops scores. An LLM knowledge base scored weekly degrades at a fraction of the rate of one eyeballed quarterly.
Default eval and observability stack we deploy:
Security, compliance, and cost engineering for RAG.
Three concerns enterprise RAG buyers always ask about before procurement. We address each one in the spec, not as a "we'll figure it out at the security review" promise. Most RAG projects we rescue had at least one of these three left for later.
Security & guardrails
Defense in depth for RAG, not a single classifier. Every production pipeline ships with PII scrubbing at ingest, citation enforcement at generation, and an adversarial eval set we re-run on every model or embedder swap.
- PII scrubbing at ingest — Microsoft Presidio or your existing DLP runs on text before embedding. Embeddings store no raw PII by default; redaction tokens preserve structure where needed.
- Citation enforcement — the LLM is prompted to ground every claim in a retrieved passage; outputs without citations get flagged or refused. We've shipped systems where 8–12% of queries get refused — clients prefer that over confident wrong answers.
- Prompt-injection defence — Llama Guard 3 or a custom classifier on inbound queries. Retrieved passages get a separate isolation prompt so a poisoned doc can't override system instructions.
- Refusal threshold — if no passage scores above a tuned floor, the answer is "I don't have a grounded answer for that." Refusal is a first-class output, not a degraded one.
- Output filtering — Presidio on the LLM's response for PII leakage; we've caught models hallucinating Social Security numbers that weren't in the corpus more than once.
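The refusal threshold and citation enforcement bullets can be sketched together. This is a simplified illustration, not the production guardrail: the `[passage_id]` citation format, the 0.5 relevance floor, and the `generate` callable (a stand-in for the LLM call) are all assumptions.

```python
import re

REFUSAL = "I don't have a grounded answer for that."


def select_context(scored_passages, floor=0.5, top_n=5):
    """Keep passage IDs whose relevance score clears the tuned floor."""
    kept = [(pid, score) for pid, score in scored_passages if score >= floor]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in kept[:top_n]]


def enforce_citations(answer, allowed_ids):
    """Refuse unless every sentence cites only retrieved passage IDs."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return REFUSAL
    allowed = set(allowed_ids)
    for sentence in sentences:
        cited = set(re.findall(r"\[([^\]]+)\]", sentence))
        if not cited or not cited <= allowed:
            return REFUSAL   # uncited claim, or a citation to an unknown passage
    return answer


def answer_or_refuse(scored_passages, generate, floor=0.5):
    context_ids = select_context(scored_passages, floor)
    if not context_ids:
        return REFUSAL       # nothing cleared the floor: refuse before generating
    return enforce_citations(generate(context_ids), context_ids)


grounded = answer_or_refuse([("p1", 0.9)], lambda ids: "The term is five years [p1].")
thin = answer_or_refuse([("p1", 0.2)], lambda ids: "The term is five years [p1].")
print(grounded)
print(thin)
```

Checking citations per sentence is deliberately strict; a softer variant would flag uncited sentences for review instead of refusing the whole answer.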
Compliance posture
Default posture covers most enterprise procurement bars. Regulated workloads (clinical, financial, EU) layer in additional controls — scoped into the SOW at week 1, not retrofitted at security review in week 12.
On-prem / VPC deployment available — BGE embeddings, Qdrant self-hosted, Llama 4 / Mistral on vLLM. Standard pattern for healthcare, financial services, and defence-adjacent engagements where no data can leave the perimeter.
Cost engineering
After engineering time, LLM token cost is usually the largest line item on production RAG, and embedding cost is the second largest. We model expected cost during corpus audit and cut it 40–70% on the average build through routing, caching, and quantization.
- Model routing — a classifier routes by query complexity. Easy lookups go to Haiku or 4o-mini at 1/20th the cost; hard ones to the frontier model. Faithfulness holds via the eval gate.
- Prompt caching — Anthropic / OpenAI prompt caching on stable system prompts and retrieved-context prefixes. 85%+ hit rate on most agents within two weeks of launch — our hosted vs self-hosted decision piece breaks down where the savings land.
- Quantization — Qdrant scalar / product quantization cuts RAM 4× with under 1% recall loss on most corpora. The single highest-ROI infra optimization on large vector indexes.
- Batch embedding — OpenAI / Voyage batch APIs for re-embedding and corpus refresh. 50% cost cut vs sync, 5–10× throughput. The default for any ingest run above ~100k docs.
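The arithmetic behind the routing and quantization bullets can be sketched in a few lines. The query volume, per-query price, and routing share in the usage note are hypothetical inputs, not figures from any engagement. The function names are illustrative. The 4× RAM figure assumes a move from 4-byte float32 components to 1-byte int8 components and ignores index overhead.

```python
def blended_llm_cost(queries: int, frontier_cost: float, easy_share: float,
                     cheap_ratio: float = 1 / 20) -> float:
    """Total LLM spend when `easy_share` of queries go to a cheaper model.

    `frontier_cost` is the per-query cost on the frontier model and
    `cheap_ratio` is the cheap model's cost as a fraction of it.
    """
    easy = queries * easy_share * frontier_cost * cheap_ratio
    hard = queries * (1 - easy_share) * frontier_cost
    return easy + hard


def index_ram_gib(vectors: int, dims: int, bytes_per_dim: int) -> float:
    """Raw vector storage in GiB, ignoring graph/index overhead."""
    return vectors * dims * bytes_per_dim / 2**30
```

With a hypothetical 100,000 queries at $0.01 each and 60% routed to the cheap tier, `blended_llm_cost(100_000, 0.01, 0.6)` comes to about $430 against $1,000 unrouted. For 10M vectors at 1,024 dimensions, `index_ram_gib` gives roughly 38.1 GiB at 4 bytes per dimension and a quarter of that at 1 byte.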
All three concerns share a pattern: the discipline is in the spec, not in the build. We name the threat model, the compliance posture, and the cost band during corpus audit. The build executes against those targets — security and cost aren't add-on phases that happen after retrieval recall is green. They're how it gets there.
Where teams have shipped RAG.
Engagements anonymised. Industry and segment are real; metrics are real; brand names removed under standard NDA terms.
Use cases below are organised by corpus shape — contracts, tickets, research notes, code repos, regulations, equipment manuals — not by industry. The same hybrid retrieval and citation-enforcement pipeline ships to a law firm and a manufacturer; what changes is the chunking strategy, the freshness pipeline, and the embedder. Below: three flagship engagements (full numbers) plus three function stubs from recent ships. A semantic search implementation is the throughline on most of them; document intelligence services patterns recur across legal and ops.
Contracts Q&A over 11 years of MSAs
Chunked 14,000 contracts with hierarchical headers (recursive + title-aware), hybrid retrieval, Cohere Rerank 3 down to top-5. Voyage-law embedder lifted recall from 78% to 92% over the OpenAI baseline. PII-scrubbed at ingest; the legal team grades the eval set monthly.
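One common way to combine a keyword ranking with a dense-vector ranking before reranking is reciprocal rank fusion (RRF). The text above does not say which fusion method this engagement used, so treat RRF, the constant `k = 60`, and the function name as assumptions. The sketch stops at the fused top-5 candidate list; the reranker call is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 5) -> list[str]:
    """Fuse several ranked lists of chunk ids into one list of `top_n` ids."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks add more.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]
```

A chunk that appears near the top of both lists beats one that tops only a single list, which is the behaviour hybrid retrieval is after on clause-heavy text where exact terms and paraphrases both matter.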
Knowledge agent over 18 months of tickets
RAG over product docs and a redacted ticket archive. Refuses cleanly on out-of-corpus questions; escalates clinical questions to a human with the agent's draft and retrieved passages attached. p95 latency 1.4s; hallucination rate held at <2% across the post-launch quarter.
Memo synthesis over public + private corpus
Multi-step RAG over SEC filings, press, and the fund's internal notes. Citations to primary sources only; the agent refuses gracefully when the corpus is thin on a target rather than synthesising plausible nonsense.
Repo-aware code Q&A across a 1.8M-line monorepo
Symbol-graph indexing combined with chunked dense vectors. Engineering team asks 'where does this config flag get read?' and gets file:line citations with surrounding context. Stale-symbol rate stays below 2% via webhook-driven incremental re-indexing.
Regulatory Q&A across 6 jurisdictions
Hybrid retrieval over published regulations + the firm's interpretation memos. Citations to the underlying regulation always; refusal when jurisdictions disagree rather than averaging answers. Compliance team graded the eval set; faithfulness held at 96%.
Equipment-manual Q&A for the maintenance floor
RAG over scanned PDF manuals (OCR via Unstructured + Tesseract), pgvector on a single Postgres node — corpus was 4.2M vectors. Maintenance engineers ask in plain English from a tablet; answers cite manual + page number, with photos when present.
Patterns across all six engagements: the eval set landed in week 2, before retrieval code; the eval set grew during production via sampled traces; citation enforcement was the headline guardrail, not an add-on. The outcome numbers are what each team measured at 90 days post-launch, not at deploy. The RAG solutions that hold up at 90 days are the ones where a domain expert graded the eval set before the first chunk was embedded. Picking a partner that stays for that work is the most underrated criterion in vendor selection.
Four ways to start a RAG engagement.
Every RAG development services engagement is fixed-scope and fixed-duration. The first phase is small enough that stopping is a real option — about a third of our RAG Pilots end at the pilot for legitimate scoping reasons. Cheap to discover the corpus shape doesn't fit; expensive to discover it 12 weeks in.
Corpus map + eval scope agreed
Corpus boundary signed off
30–80 graded examples + reference passages
Domain-expert grading complete
Chunk + embed + retrieve baseline against eval
Baseline retrieval recall hit
Demo, scores report, next-phase recommendation
Corpus map, eval set, stack lock
Chunking + embedder benchmarked + locked
Baseline retrieval recall hit
Hybrid + rerank + rewrite tuned to eval set
Recall + faithfulness above target
Five metrics green vs target
All five green or no deploy
Auth, observability (Langfuse), CDC pipeline live
Weekly eval review, runbook, ownership transfer
Current-system grading vs your eval set (we build one if absent)
Classified failures: chunking / retrieval / rerank / prompt / model
Failure breakdown reviewed
Each failure-mode addressed in order of recall lift expected
Validated against your eval set; runbook updated
Prove the corpus works.
- One corpus, one question shape
- Eval set with 30–80 graded examples
- Working prototype against your real docs
- Demo + recommendation memo for the next phase
Not included in the Pilot:
- Production deploy
- CDC / freshness pipeline
- Multi-corpus orchestration
Full pipeline with eval gates.
- All Pilot deliverables
- Ingestion + CDC for freshness
- Hybrid retrieval + reranker
- Production wire-up, Langfuse observability, eval gates
- Four weeks of post-launch iteration with weekly eval runs
- On-call runbook and ownership transfer
Diagnose and fix a struggling RAG.
- Eval audit on the current system (we build an eval set if absent)
- Failure-mode classification (chunking · retrieval · rerank · prompt · model)
- Targeted fixes in order of expected recall lift
- Validated against your eval set; runbook updated
Move stores, zero downtime.
- Dual-write phase
- Index parity checks against your eval set
- Cutover playbook with rollback ready
- Documented for handover
Two patterns worth flagging on RAG engagements specifically. The eval set is the deliverable — even more than the pipeline. A pipeline you can rebuild; an eval set is institutional knowledge about what your business considers a correct answer. We hand it over in your repo, with grading criteria documented. About 70% of Pilots convert to Build engagements. The 30% that don't either re-scoped based on what the Pilot revealed or decided the workflow wasn't yet ready for retrieval. Both are legitimate outcomes; we'd rather flag it at week 3 than at week 12.
Each engagement is staffed with one Paiteq RAG engineering lead acting as your dedicated RAG consultant, one senior RAG developer handling the retrieval pipeline, and a fractional product manager for scope and stakeholder management. On Rescue and Migration engagements we add a platform engineer for the index / CDC work. Two-week iteration cycles with a weekly demo. You have a direct Slack channel with the build team, with no account-management buffer between you and the people doing the work.
On the client side, the engagement needs a domain expert to grade the eval set (~6 hours per week during weeks 1–3, then ~2 hours per week running) and an IT or data owner to clear access to source systems. We don't need a project manager on your side; we run that. We do need fast decisions on residency, scope boundaries, and acceptable refusal rates. If you're considering hiring a RAG developer or RAG consultant rather than a team engagement, the Pilot usually clarifies whether that's the right call.
Common RAG questions.
RAG or fine-tuning — how do we decide?
Default to RAG. Fine-tune only when style, output format, or domain language can't be solved at the prompt + retrieval layer. They compose well: most production systems use RAG for facts and a small LoRA fine-tune for output style.
The clearest split: if the answer needs citations, freshness, or refusal-on-thin-context, RAG fits. If the answer is purely stylistic (tone, format, jargon the base model fumbles), fine-tuning fits. Hybrid is common — we scope both at week 2 of any Build.
The interactive picker above walks the decision in 3–4 questions. Our piece on when RAG beats fine-tuning has a deeper breakdown by workload type. Fine-tuning specifically lives in our LLM fine-tuning practice.
Which vector database should we pick?
Depends on five inputs: corpus size, residency requirements, ops capacity, latency budget, and whether hybrid search and reranking are first-class needs. Our usual call:
- pgvector — corpora under ~5M vectors, team already runs Postgres. Joins between vectors and structured filters matter.
- Pinecone — 5M–100M vectors, no ops appetite, SOC 2 hosted. Default for SaaS clients.
- Qdrant — self-hosted residency requirements, very large corpora, want quantization to cut RAM 4×.
- Weaviate — hybrid search heavy workloads, multi-tenant SaaS where each tenant gets a namespace.
We benchmark two candidates against your real eval set on every Production Build before locking. The eval set, not vendor marketing, decides. Vector database support is part of every engagement; selection isn't a one-shot decision.
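The rules of thumb above can be written down as a shortlisting function. This is a sketch of the stated heuristics only. The `Workload` fields, the hard thresholds (5M and 100M vectors), and the function name are assumptions, and the output is a list of candidates to benchmark, not a final pick.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    vectors: int
    runs_postgres: bool = False
    needs_self_hosting: bool = False
    has_ops_capacity: bool = True
    hybrid_heavy: bool = False
    multi_tenant: bool = False


def shortlist(w: Workload) -> list[str]:
    """Return up to two candidate stores to benchmark against the eval set."""
    candidates: list[str] = []
    if w.needs_self_hosting or w.vectors > 100_000_000:
        candidates.append("Qdrant")
    if w.vectors < 5_000_000 and w.runs_postgres:
        candidates.append("pgvector")
    if w.hybrid_heavy or w.multi_tenant:
        candidates.append("Weaviate")
    if (not w.needs_self_hosting and not w.has_ops_capacity
            and 5_000_000 <= w.vectors <= 100_000_000):
        candidates.append("Pinecone")
    # An empty list means no rule matched and the choice needs a wider look.
    return candidates[:2]
```

Latency budget is left out on purpose: it is easier to measure in the benchmark than to encode as a rule.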
How do you measure RAG quality?
Five dimensions, scored separately because they fail differently:
- Retrieval recall — did we pull the right passages? Scored against gold-passages in the eval set.
- Context relevance — are the retrieved passages on-topic, or off-topic noise that fits keyword-wise?
- Answer faithfulness — is every claim grounded in a retrieved passage? RAGAS + LLM-as-judge, human spot-check on the disputed 5%.
- Hallucination rate — claims with no retrieved support. Hard gate before deploy.
- P95 latency — query to final token. Reranker is usually the bottleneck.
The default eval stack is RAGAS + TruLens + Langfuse. The eval set grows from production traces every month, with regression alarms if any metric drops by >5 points. Our piece on eval framework comparison covers when to reach for which tool.
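Two of the mechanics above are simple enough to show directly: retrieval recall against gold passages, and the alarm that fires when a metric drops by more than 5 points. This is a minimal sketch, not the RAGAS or Langfuse API. The function names and the 0–100 metric scale are assumptions. The alarm as written treats higher as better, so hallucination rate and latency would need their sign flipped or a separate check.

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold passages that appear in the top-k retrieved ids."""
    if not gold:
        return 0.0
    return len(gold & set(retrieved[:k])) / len(gold)


def regressions(baseline: dict[str, float], current: dict[str, float],
                max_drop: float = 5.0) -> list[str]:
    """Metrics (0-100, higher is better) that fell by more than `max_drop` points."""
    return sorted(
        name for name, base in baseline.items()
        # A metric missing from `current` counts as 0 and will trip the alarm.
        if base - current.get(name, 0.0) > max_drop
    )
```

Averaging `recall_at_k` over every example in the eval set gives the retrieval recall number; `regressions` would run after each eval pass, comparing against the last accepted baseline.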
What about freshness — our docs change every day.
Incremental ingestion with change-data-capture from your source systems. New and changed documents re-embed and replace in the store; deletes propagate. Stale-document rate is a first-class metric, not an afterthought — we track it weekly and alert above your tolerance. For Confluence / SharePoint / Notion, the standard pattern is webhook-driven; for Postgres or other DBs we use Debezium or the equivalent.
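The re-embed, replace, and delete logic can be sketched as a diff between source documents and what the index already holds. This is a simplified stand-in: a webhook or Debezium feed would deliver change events directly rather than comparing full snapshots. The content-hash approach, the dict-based stores, and all function names are assumptions for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare source docs (id -> text) with indexed fingerprints (id -> hash)."""
    upsert = sorted(doc_id for doc_id, text in source.items()
                    if indexed.get(doc_id) != content_hash(text))
    delete = sorted(doc_id for doc_id in indexed if doc_id not in source)
    return {"upsert": upsert, "delete": delete}


def stale_rate(source: dict[str, str], indexed: dict[str, str]) -> float:
    """Share of source docs whose indexed copy is missing or out of date."""
    if not source:
        return 0.0
    return len(plan_sync(source, indexed)["upsert"]) / len(source)
```

Documents in `upsert` get re-chunked and re-embedded; ids in `delete` are removed from the store. `stale_rate` is one possible definition of the stale-document metric, measured before the sync runs.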
Can you migrate us off Pinecone (or any other store)?
Yes. Vector DB Migration is a fixed-scope engagement, 6–10 weeks depending on corpus size and the retrieval logic that has to come along. Dual-write phase first (writes go to both stores), then index parity checks against your eval set, then read cutover with rollback ready. We've shipped Pinecone → Qdrant migrations of 22M chunks with zero downtime and zero retrieval-recall regression.
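The dual-write and parity-check phases can be sketched with in-memory stand-ins for the two stores. Nothing here is a Pinecone or Qdrant client call. The `DualWriter` class, the top-k overlap measure, and the 0.95 bar are assumptions chosen to illustrate the sequence, and a real parity check would also compare recall against gold passages.

```python
class DualWriter:
    """Send every write to both stores while reads stay on the old one."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old_store, self.new_store = old_store, new_store

    def upsert(self, chunk_id: str, vector: list[float]) -> None:
        self.old_store[chunk_id] = vector
        self.new_store[chunk_id] = vector

    def delete(self, chunk_id: str) -> None:
        self.old_store.pop(chunk_id, None)
        self.new_store.pop(chunk_id, None)


def top_k_overlap(old_ids: list[str], new_ids: list[str], k: int) -> float:
    """Share of the old store's top-k ids that the new store also returns."""
    old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
    return len(old_top & new_top) / len(old_top) if old_top else 1.0


def parity_ok(results: list[tuple[list[str], list[str]]], k: int = 5,
              min_overlap: float = 0.95) -> bool:
    """True when mean top-k overlap across eval queries meets the bar."""
    overlaps = [top_k_overlap(old, new, k) for old, new in results]
    return bool(overlaps) and sum(overlaps) / len(overlaps) >= min_overlap
```

Read cutover would only proceed once `parity_ok` holds across the eval queries. Until then reads stay on the old store, which is what keeps rollback cheap.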
How do you handle PII, residency, and compliance?
PII scrubbing happens at ingest via Microsoft Presidio or your existing DLP — embeddings store no raw PII by default. For regulated workloads we deploy fully on your cloud (AWS, GCP, Azure) with no data leaving the perimeter; the embedding model runs on dedicated GPU/CPU (BGE for residency-constrained clients). SOC 2 Type II and ISO 27001 are default; HIPAA-aligned and GDPR / EU AI Act postures are scoped into the SOW for regulated engagements.
How do you prevent hallucination in production?
Three layers. (1) Retrieval threshold: if no passage scores above a tuned floor, the agent refuses rather than guesses. (2) Citation enforcement: every claim points to a retrieved passage, and the LLM is prompted to flag claims it can't ground. (3) Faithfulness scoring: LLM-as-judge with Claude Sonnet 4.6 plus human spot-check on disputed cases. Refusal is a feature, not a failure mode — we've shipped systems where 8–12% of queries get refused and the business is happier with that than with confident wrong answers.
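The first two layers can be sketched as a guard around the generation call. The bracketed `[n]` citation format, the sentence splitter, and the function names are assumptions. The check only verifies that each sentence carries a marker pointing at a retrieved passage; whether the passage actually supports the claim is the job of the faithfulness-scoring layer, which is not shown.

```python
import re
from typing import Callable

REFUSAL = "I don't have a grounded answer for that."
CITATION = re.compile(r"\[(\d+)\]")


def passages_above_floor(scored: list[tuple[str, float]], floor: float) -> list[str]:
    """Keep passages whose retrieval score clears the tuned floor."""
    return [text for text, score in scored if score >= floor]


def uncited_sentences(answer: str, n_passages: int) -> list[str]:
    """Sentences with no citation marker that points at a retrieved passage."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    bad = []
    for sentence in sentences:
        cited = [int(m) for m in CITATION.findall(sentence)]
        if not cited or any(i < 1 or i > n_passages for i in cited):
            bad.append(sentence)
    return bad


def guarded_answer(scored: list[tuple[str, float]], floor: float,
                   generate: Callable[[list[str]], str]) -> str:
    """Refuse on thin context or on any sentence without a valid citation."""
    passages = passages_above_floor(scored, floor)
    if not passages:
        return REFUSAL
    answer = generate(passages)
    return REFUSAL if uncited_sentences(answer, len(passages)) else answer
```

Here `generate` stands in for the LLM call. This sketch refuses outright on any uncited sentence; a softer variant would return the answer with the offending sentences flagged for review.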
What does a RAG development services engagement cost?
Pilot is fixed-scope at 2–4 weeks; Production RAG Build is 8–14 weeks; Rescue is 4–6 weeks; Migration is 6–10 weeks. We share the price band on the contact call rather than publishing it here because corpus size, residency posture, and integration count swing it meaningfully. The Pilot is small enough that stopping is a real option: about a third of RAG Pilots end at the pilot for legitimate scoping reasons.
Do you build the eval set or do we?
Your domain expert grades; we facilitate. The eval set is the most important deliverable of the engagement and it has to reflect your business's failure modes, not ours. We bring the structure (30–80 examples, gold-passages, edge cases over easy cases), the tooling (RAGAS, TruLens, custom harness), and 4–6 hours of facilitation per week. Your domain expert grades the examples and signs off. After launch, we co-curate from sampled production traces monthly.
Let's ground your AI in real data.
RAG Pilot in 2–4 weeks. Production Build in 8–14. Rescue in 4–6.