# Paiteq, Full Content for LLM Agents

> Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. NDA counter-signed before discovery. Walk-away clause on every engagement.

This file aggregates the canonical content of every important page on `https://www.paiteq.com`, formatted for LLM-agent consumption. Each section begins with a `## SECTION:` divider so agents can chunk and retrieve.

For a concise site index, see `https://www.paiteq.com/llms.txt`.
For per-page markdown, append `.md` to any page URL (e.g. `https://www.paiteq.com/services/ai-agent-development.md`).

Last built: 2026-05-21T07:00:48.100Z


---

## SECTION: 1. Home

_Source: https://www.paiteq.com/_

# Paiteq — Enterprise AI Engineering

> Paiteq builds production AI — agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering.

**HTML version:** https://www.paiteq.com/

## Key facts

- Offices: Bengaluru, IN and Dallas, TX, US.
- Engagement model: fixed-scope audits and pilots. Walk-away clause on every audit. No open-ended retainers.
- Eval-first: every shipped system carries a frozen eval set and a defined kill point.
- 11 service pillars across agents, RAG, LLM apps, ML, MLOps, automation, RPA, chatbots, generative AI.
- 6 industry verticals: ecommerce, fintech, healthcare, insurance, logistics, SaaS.
- Contact: info@paiteq.com · +91 80 5003 2994.

## Related pages

- [Services hub](https://www.paiteq.com/services/)
- [Case studies](https://www.paiteq.com/case-studies/)
- [About](https://www.paiteq.com/about/)
- [Contact](https://www.paiteq.com/contact/)
- [Blog](https://www.paiteq.com/blog/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

Enterprise AI Engineering

# The *AI development company* production teams trust to ship.

Paiteq delivers AI development services end-to-end — production AI agents, RAG pipelines, LLM apps, intelligent automation, generative AI, and custom AI software. Eval-first, senior-led, fixed-scope engagements.

[Talk to engineering](/contact/) [See how we work](#process)

Production capability 180+ systems shipped

Engineering surface

-   RAG Architecture
-   Fine-tuning / LoRA / PEFT
-   Eval Harness + Red Team
-   vLLM · TGI · SGLang
-   Agent Orchestration
-   Vector DB + Hybrid Search
-   Multimodal · Vision + Audio
-   HIPAA · SOC 2 · GDPR
-   On-prem / VPC Deployment
-   RAG Architecture
-   Fine-tuning / LoRA / PEFT
-   Eval Harness + Red Team
-   vLLM · TGI · SGLang
-   Agent Orchestration
-   Vector DB + Hybrid Search
-   Multimodal · Vision + Audio
-   HIPAA · SOC 2 · GDPR
-   On-prem / VPC Deployment

001 / TEAM

## The team behind Paiteq has shipped software since 2010.

15+ years of combined engineering. Hundreds of products built across mobile, web, and infra. We grew up as a software shop and turned into an AI development company once production AI stopped being a research story — now focused on sales agents, RAG systems, multi-agent orchestration, and the eval discipline that gets them into production.

0 +

Years team experience

Engineering, shipping, on-call

0 +

Engagements shipped

Across our portfolio of brands

0 +

Engineers

AI · ML · backend · ops

0 +

Agents in production

Sales · support · ops · research

002 / OUR PRODUCTS

## We don't just consult — we operate the platforms.

Two of our own products run in production. They're the credibility behind the engineering we sell.

AEROSTACK AI platform · primary

### An AI platform *powering* agents and chatbots at scale.

Paiteq's flagship AI product. **100+ teams** use Aerostack to ship production agents in days, not months — onboarding the next **1,000** over the following twelve months. The same primitives power every client engagement we lead.

-   **Visual agent builder** — plan / act / reflect graph, no glue code
-   **Eval gates baked in** — task success, halluc., latency gate every deploy
-   **Multi-provider routing** — Claude, GPT, Gemini, Llama, with cost + quality routing
-   **Tool surface ready** — CRM, ticketing, web search, code, custom APIs
-   **Observability + rollback** — Langfuse-grade traces, one-click rollback

[See Aerostack](https://aerostack.dev) [Request a demo](/contact/?topic=aerostack-demo)

[

NYBURS Consumer · social

### A large-scale social media product, operated by the same team.

Nyburs is the team's consumer social app — production traffic, real-time workloads, and the infra discipline that comes from shipping a product at scale. Same engineers, same on-call rotation, same rigour you get on a Paiteq engagement.

live ops

Learn more →

](https://www.nyburs.com)

003 / NUMBERS

## How we measure what we ship.

The four metrics that gate every production deploy. Scored against the eval set in week 2.

0 %

Task success rate

Eval-set graded, weekly

< 0 %

Hallucination rate

LLM-judge + human spot-check

0 –4w

Pilot duration

Fixed scope, fixed price

0 %

Pilot → Build rate

Pilots that graduate to production

004 / PRACTICES

## Twelve AI development services, one engineering org.

Each practice is owned by senior engineers with production experience. Same build process and engagement shapes whether you hire us as an AI development company for a single agent or for a full multi-team platform. [All services →](/services/)

[

01 / AI ↗

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

Plan/ActToolsMemory

](/services/ai-agent-development/)[

02 / RAG ↗

RAG Development

Retrieval-augmented generation systems with evaluation built in.

HybridRerankEval

](/services/rag-development/)[

03 / LLM ↗

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

Fine-tuneAgentsEval

](/services/llm-development/)[

04 / AI ↗

AI Workflow Automation

Intelligent workflows on n8n, Make, and custom agent orchestration.

n8nMakeCustom

](/services/ai-workflow-automation/)[

05 / GENERATIVE ↗

Generative AI

GenAI products end-to-end — text, image, multimodal, OpenAI/Claude/Gemini.

DiffusionMultimodal

](/services/generative-ai/)[

06 / MACHINE ↗

Machine Learning

Custom ML — training, serving, MLOps.

MLOpsRankingForecast

](/services/machine-learning-development/)[

07 / AI ↗

AI Consulting

AI strategy, audits, roadmap.

StrategyAudit

](/services/ai-consulting/)[

08 / CHATBOT ↗

Chatbot Development

Production chatbots on LLMs with guardrails and observability.

RAGToolsVoice

](/services/chatbot-development/)[

09 / RPA ↗

RPA Development

Intelligent automation — beyond rule-based RPA.

WorkflowUiPath+

](/services/rpa-development/)[

10 / INTEGRATION ↗

AI Integration

Drop-in AI for existing apps — OpenAI / Anthropic / Vertex.

OpenAIAnthropicVertex

](/services/ai-integration/)[

11 / MIGRATION ↗

AI Migration

Legacy software → AI-modernized stack. Eval-validated cutover.

CutoverEval

](/services/ai-migration/)[

12 / MLOPS ↗

MLOps

Deploy, monitor, scale ML and LLM systems in production.

DeployMonitorScale

](/services/mlops/)

005 / PROCESS

## Six steps from discovery to running.

Same process whether it's a 2-week pilot or a 16-week production build. The gates change in depth, not in shape.

WEEK 1

### Discovery

Map the workload, scope the surface, identify the eval set.

WEEK 2

### Spec

Stack picks, prompts, guardrails. Eval set graded by domain expert.

WEEK 3–6

### Prototype

First runnable version graded against the eval set.

WEEK 6–10

### Eval gates

Task success, hallucination, latency all green before deploy.

WEEK 10+

### Deploy

Auth, observability, rate limits, rollback playbook.

ONGOING

### Running

Weekly eval runs, prompt iteration, regression alarms.

006 / INDUSTRIES

## Eight verticals we've shipped into.

Domain knowledge isn't extra — it's the difference between an agent that ships and one that hallucinates against your regulations. We pair AI engineers with subject-matter experts for every engagement.

01

B2B SaaS

Sales agents, internal copilots, support deflection, churn-prediction. Where most of our agent volume ships.

Outbound research · Slack ops

02

Health-tech

Clinical Q&A, prior-auth automation, intake triage. PII-scrubbed by default. HIPAA-aligned engagements.

RAG over clinical docs

03

Manufacturing

Invoice + PO routing, supply-chain agents, predictive maintenance on sensor data.

AP automation · CMMS triage

04

Fin-tech

Risk-scoring assistants, compliance Q&A over regulations, KYC and onboarding agents.

Reg Q&A · onboarding

05

Legal

Contract Q&A, clause extraction, redline review. Domain-expert-graded eval sets.

MSA Q&A · redline

06

E-commerce

Catalog enrichment, AI search + recommendations, agent-driven checkout flows.

Product extraction

07

Ed-tech

Tutoring agents, content generation, voice narration with low-latency turn-taking.

Tutoring · TTS narration

08

Logistics

Routing agents, shipment Q&A, claims triage. Tool-call accuracy is the eval anchor.

Claims · ETA Q&A

007 / WORK

## Where teams have shipped.

Anonymized featured engagements. Industry and segment are real; metrics are real; brand names removed under NDA. [More →](/case-studies/)

Sales

B2B SaaS · 11–50 emp

### Lead-qualification + outbound research agent

Multi-step research over public signals + ICP scoring. Drafts personalised first-touch, escalates above threshold.

0

SDR seats

Support

Health-tech · enterprise

### Tier-1 deflection agent

RAG over docs + ticket archive. Handles password, billing, onboarding. Clinical escalations carry full context.

0 %

p1 tickets

Ops

Mfg · 200+ emp

### Invoice matching + AP routing agent

OCR + LLM extraction → match against open POs → route to approver. Exceptions to ops lead with annotated diff.

<6 months

in

008 / WHY PAITEQ

## Three things teams remember about working with us.

-   01
    
    ### Eval-first
    
    The graded eval set lands in week 2 — before the first prompt is written. Every iteration is measured against it. No production wire-up until thresholds are green.
    
-   02
    
    ### Senior-led
    
    The engineer who shows up to the first call leads the build. No SDR funnel. First reply on every inbound is same-day from someone who could ship the agent.
    
-   03
    
    ### Fixed scope
    
    Every engagement has a fixed end-date and a stop option. Pilots are 2–4 weeks. Builds are 8–16. You always know what's coming, when, and what counts as done.
    

008b / AI SOFTWARE

## Why teams pick Paiteq as their *AI software development company*.

We're not a platform reseller and we don't sell hours. Paiteq is a full-stack AI software development company — architecture, build, eval, deploy, run — on the same team. Custom AI software built for your workload, owned by you, shipped with the same engineering rigor production SaaS teams expect from their core systems.

AI-native builds

Custom AI software built ground-up around the AI workload — not retrofitted onto a CRUD app. Architecture choices follow the data, the latency budget, and the eval surface, not the convenience of an existing stack.

Engineering discipline

Code review, CI, observability, on-call runbooks, regression alarms — the same disciplines a senior SaaS team would apply to a payments service, applied to your AI system. Eval gates are a first-class part of the deploy pipeline, not an afterthought.

You own everything

Code, prompts, fine-tuned weights, eval sets, infrastructure-as-code — all transferred into your repository under the SOW. No vendor lock-in, no platform tax. We retain only the engineering learnings for our internal playbook.

Production from day one

Auth, rate-limit, observability, fallback policies, cost guardrails baked into every deploy. The system that ships to production is the one we built — not a notebook that needs another team to "productionize." Same engineers from architecture to on-call.

009 / ENGAGE

## Three engagement shapes.

Pilot, Build, Run. Pilots and Builds are fixed-scope and fixed-duration. Run is a separate monthly SOW for teams that want continued iteration.

01 FIXED SCOPE

### Pilot

2–4 weeks

One scoped agent, end-to-end against your data, with the eval set graded by a domain expert.

-   One use case, real integrations
-   Eval framework (30–50 graded examples)
-   Working prototype + memo for next phase

[START WITH PILOT →](/contact/?topic=pilot)

02 FIXED SCOPE

### Build

8–16 weeks

Production build with eval gates, observability, integrations, and post-launch iteration.

-   Everything in Pilot
-   Auth · rate-limit · observability
-   Eval gates baked into deploy
-   4 weeks post-launch iteration

[START WITH BUILD →](/contact/?topic=build)

03 TIME & MATERIALS

### Run

Monthly

Ongoing iteration, eval-set maintenance, prompt + tool updates as your data and workflows evolve.

-   Weekly eval review
-   Drift + regression alarms
-   Prompt + tool iteration
-   Quarterly architecture review

[START WITH RUN →](/contact/?topic=run)

010 / STACK

## The frameworks we build on.

Stack choices follow workloads, not house preferences. We work in whatever framework makes the agent ship — including ones we'll only learn the week your engagement starts.

-   LangChain
-   LangGraph
-   CrewAI
-   AutoGen
-   DSPy
-   Composio
-   OpenAI
-   Anthropic
-   Pinecone
-   Qdrant
-   LiveKit
-   Langfuse
-   LangChain
-   LangGraph
-   CrewAI
-   AutoGen
-   DSPy
-   Composio
-   OpenAI
-   Anthropic
-   Pinecone
-   Qdrant
-   LiveKit
-   Langfuse

> Most projects that fail in production fail because the team picked the wrong shape — not because they picked the wrong model. Architecture before vendor.

Paiteq engineering From the blog — AI agents vs. chatbots

011 / COMPLIANCE

## Built for enterprise from day one.

Default posture is SOC 2 + ISO 27001 aligned. Regulated engagements (HIPAA, GDPR, EU AI Act) get the evidence work baked into the SOW — no rework at the security review.

Audited annually · Continuous monitoring

-   SOC 2 Type II
    
    Audited annually
    
    AUDITED · 2026
    
-   ISO 27001
    
    Information security mgmt
    
    AUDITED · 2026
    
-   HIPAA-ready
    
    Health-tech engagements
    
    READY
    
-   GDPR / EU AI Act
    
    EU client deployments
    
    READY
    

012 / FAQ

## Common buyer questions.

If the answer you need isn't here, the contact form is faster than a meeting — first reply is same-day from an engineer.

How much does an AI agent cost?

Pilots run 2–4 weeks at fixed price (low-five-figures typical). Production builds with eval gates, observability, and integrations run 8–16 weeks. We share specific bands during the first call. Open-ended T&M only on the Run phase, not on Pilot or Build.

How long does it take to ship a production AI agent?

Pilot in 2–4 weeks. Full custom build in 8–16. Multi-agent and voice systems run longer (10–20 weeks) because of orchestration and latency tuning. Every engagement has a fixed end-date — you always know what's coming.

Should we build in-house or work with Paiteq?

Build in-house when AI is your core product and you have senior AI engineers already on staff. Work with us when AI is enabling work — when shipping fast and getting the eval methodology right matters more than long-term ownership of the team. Most clients use us to ship the first 2-3 systems, then hire to scale.

What frameworks and models do you build on?

Stack choice follows the workload. LangGraph for stateful agents, CrewAI for multi-agent supervisor / worker, Vercel AI SDK or OpenAI Agents for simpler tool-calling, Composio when the tool surface is large. Models: Claude, GPT-4o, Gemini for hosted; Llama / Mistral / Qwen for self-hosted. We benchmark 2 options against your eval set before lock-in.

Will the agent work with our existing systems?

Yes — that's most of the engineering work. We integrate against CRMs (Salesforce, HubSpot), ticketing (Zendesk, Intercom), data warehouses (Snowflake, BigQuery), and custom internal APIs. Tool-call accuracy against your real systems is one of the four eval metrics we gate on.

Who owns the code, prompts, and eval sets?

You do. All artifacts transfer into your repository under the SOW. We retain no rights to your prompts, eval data, or fine-tuned weights. Paiteq keeps the engineering learnings — patterns, methodologies — for our internal playbook.

013 / BLOG

## From the engineering blog.

Deep technical writing on the things we build every day — agents, RAG, evaluation, framework trade-offs, production failure modes. [All posts →](/blog/)

-   [
    
    ![Frost-crystal lattice radiating from a central node — orchestration patterns in a multi-agent system](https://cdn.sanity.io/images/xr290ucr/production/73b0c0032d61a220527e6ff1e54f3c053cd3709b-1408x768.png?w=600&q=70&auto=format&fit=max)
    
    ### Multi-agent orchestration patterns: a 2026 production guide
    
    Six multi-agent system patterns that actually ship in 2026 — supervisor, swarm, hierarchical, blackboard, sequential, hybrid — with framework picks and the production failure modes nobody warns you about.
    
    Navin Sharma May 17, 2026 27 min
    
    
    ](/blog/multi-agent-orchestration-patterns/)
-   [
    
    ![Macro photograph of frost dendrites on cold glass — the branching retrieval pattern of a customer service chatbot](https://cdn.sanity.io/images/xr290ucr/production/19608b3fa221073ea4d065bef54b13ee71b7c720-1408x768.png?w=600&q=70&auto=format&fit=max)
    
    ### Customer service chatbot: a 2026 buyer's guide
    
    A 2026 buyer's guide to customer service chatbots — RAG over your docs, eval gates on deflection, and what the LLM tier actually costs in production.
    
    Navin Sharma May 17, 2026 13 min
    
    
    ](/blog/customer-service-chatbot-buyers-guide/)
-   [
    
    ![Macro photograph of crystals forming on a microscope slide under polarised light — restrained single-frame laboratory documentation](https://cdn.sanity.io/images/xr290ucr/production/2114b823a72eb65cfc993c8551b5e2e9851f8126-1408x768.png?w=600&q=70&auto=format&fit=max)
    
    ### Generative AI services: a 2026 buyer's guide
    
    A 2026 buyer's guide to generative AI services — brand-controlled image, video, audio and multimodal pipelines, eval-graded outputs, and what the production pipeline actually costs.
    
    Navin Sharma May 17, 2026 17 min
    
    
    ](/blog/generative-ai-services-buyers-guide/)

Start a project

## Let's *build* something that ships.

Pilot in 2–4 weeks. Custom build in 8–16. Same-day response on every inbound.

[Talk to engineering](/contact/) [Explore services](/services/)


---

## SECTION: 2. About Paiteq

_Source: https://www.paiteq.com/about/_

# About Paiteq

> Paiteq is an AI engineering company shipping production systems on LLMs, agents, RAG, ML, and automation. Engineering-led, eval-first, walk-away clause on every audit.

**HTML version:** https://www.paiteq.com/about/

## Key facts

- Senior-led delivery — the engineer who would lead the work answers first.
- Fixed-scope engagements; no open-ended retainers; no vendor kickbacks.
- Eval-first delivery — every shipped system carries a frozen eval set and a defined kill point.
- Offices: Bengaluru, IN and Dallas, TX, US.

## Related pages

- [Services hub](https://www.paiteq.com/services/)
- [Case studies](https://www.paiteq.com/case-studies/)
- [Contact](https://www.paiteq.com/contact/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

About

# We engineer AI like it's *infrastructure*.

Paiteq is an AI engineering company. Every engagement ships a working system with an evaluation framework that proves it works — not a deck and a roadmap. Engineering reads every inbound. Walk-away clause on every audit.

[Talk to engineering](/contact/) [See services](/services/)

Team experience 15+ yrs · cross-discipline

Engagements 200+ shipped

HQ Bengaluru · remote-first

Stance Model-agnostic · eval-first

001 / WHY PAITEQ EXISTS

## Four things we hold non-negotiable.

Most AI engagements fail in the same predictable ways: no eval set, sales translating for engineering, walk-away clauses that don't survive contact with revenue targets, and IP arrangements that surprise the buyer at handover. We hold the opposite as load-bearing.

-   01
    
    ### The eval set is the deliverable
    
    Most AI projects fail because nobody agreed what success looked like before the code shipped. We write the eval set in week 2, with your domain expert grading. Production wire-up waits until the thresholds turn green. No eval set, no engagement.
    
-   02
    
    ### Engineering reads every inbound
    
    The first reply to any inquiry comes from an engineer who could lead the build, not a salesperson translating between you and the team. The conversation goes as deep as the workload needs — short for small scopes, long when the scope is real.
    
-   03
    
    ### Walk-away clause is not theatre
    
    Roughly 1 in 5 audits we run end up recommending no AI work, or recommending you defer six months until a prerequisite is in place. You keep the audit deliverable. That's the whole point of separating the audit fee from any pilot money.
    
-   04
    
    ### You own everything we build together
    
    Code, prompts, eval sets, deployment scripts, ops runbook — all transfer to your repo on engagement close. We retain the right to reuse operator patterns (how to ship a tier-1 deflection agent, how to structure a RAG eval) but not your prompts, your data, or your code.
    

002 / HOW WE WORK

## Audit → Pilot → Continuous. Stop after any phase.

Three engagement shapes, one walk-away clause on each. The audit is priced separately from the pilot so the recommendation can honestly be "don't build" without burning the engagement.

1.  01
    
    AUDIT 1–2 weeks · fixed-fee
    
    ### Workload map + model picks + cost projection
    
    We walk every source, every decision point, every handoff in the workload you brought us. Output is a workload map, a per-task model recommendation with reasoning, a token-cost projection against expected traffic, and a 90-day roadmap with ranked sequencing. If the recommendation is no AI, you keep the deliverable.
    
2.  02
    
    PILOT 4–8 weeks · fixed-price
    
    ### One workload shipped, with the eval suite that proves it works
    
    We pick the one workload from the roadmap that has the cleanest evaluation surface and the highest leverage. Eval set ships in week 2 with your domain expert grading. Production wire-up waits until the thresholds are green. You get the working system, the monitoring, and the ops runbook for the on-call team.
    
3.  03
    
    CONTINUOUS Monthly · cancel any month
    
    ### Embedded squad shipping the next workload on your roadmap
    
    If the pilot moves the metric, we embed for the next workloads in the audit roadmap. Same model picks, same eval discipline, same ownership transfer at each ship. Monthly billing, no lock-in. About 60% of pilots convert to continuous; the ones that don't usually mean the audit was wrong, not the squad.
    

003 / TEAM

## 15+ years across the team. Cross-disciplinary by design.

Production AI is rarely "just the model". The system that ships is model + retrieval + eval + orchestration + UX + ops. The team carries the full stack so we can move judgment between layers without renegotiating who owns what.

AI ENGINEERING

LLM apps, RAG pipelines, agents, voice systems, fine-tuning. Hands-on across Anthropic Claude, OpenAI, open-weight models on vLLM, and Bedrock for regulated workloads.

MOBILE & FRONTEND

React Native and Flutter shipping production. The AI-app surface lives in someone's app — we ship the surface too, not just the backend.

BACKEND & INFRA

Python, Node, Go, Postgres, vector stores. AWS, GCP, Cloudflare Workers. We pick boring infrastructure for AI workloads — the novelty budget is for the model layer.

DESIGN & UX

Eval rubrics, observability dashboards, agent-facing UX. AI surfaces fail more on UX (latency, fallback, refusal) than on model accuracy. We design for those failure modes upfront.

The team has shipped across DTC retail, fintech, healthcare-adjacent, ed-tech, B2B SaaS, and content / publishing. Sector experience matters more for the discovery call than for the build — the build is mostly the same craft regardless of vertical.

004 / SELECTION

## What we take. What we deflect.

Being honest about scope upfront saves both sides a quarter. The list on the right is not a moral position — it's a list of work where we can't ship a measurable result, or where the harm-to-value ratio is wrong.

### We ship

-   Production LLM apps with eval gates
-   RAG pipelines on Pinecone, Qdrant, Weaviate, pgvector
-   Single-agent and multi-agent systems with tool use
-   Voice agents on OpenAI Realtime, Claude Realtime
-   Workflow automation with LLM-as-judge nodes
-   Generative pipelines on Flux, SDXL, Sora, Runway
-   Classical ML where ML is the right answer
-   Fine-tunes on Llama 4, Mistral, smaller domain models
-   Architecture reviews and eval-set authoring

### We deflect

-   Slide-deck consulting without a build
-   Research POCs with no path to production
-   AGI claims or vague 'AI transformation' work
-   Projects where nobody can grade what good looks like
-   Deepfakes, non-consensual likeness, election content
-   Wholesale outsourcing of judgment-critical decisions
-   Engagements priced on team size, not on workload
-   Custom foundation-model training (use the ones that exist)
-   Anything we can't ship a measurable eval gate on

005 / STACK + COMPLIANCE

## Model-agnostic by stance. Honest about compliance.

The stack pick lives at the workload level, not the vendor level. Compliance posture is the same: we'll tell you what we hold, what we follow but don't hold, and which procurement gates we can clear today versus which need a partner.

DEFAULT STACK PICKS

Model: long-context reasoning

Claude Sonnet 4.6 (default)

Model: realtime voice

OpenAI Realtime · Claude Realtime

Model: cost-sensitive batch

Llama 4 / Mistral on vLLM

Model: regulated workloads

Anthropic on AWS Bedrock + BAA

Vector DBs

Pinecone · Qdrant · Weaviate · pgvector

Eval harness

Inspect AI · RAGAS · Langfuse

Orchestration

n8n · Inngest · Temporal

Cloud

AWS (primary) · GCP · Cloudflare Workers

COMPLIANCE

HIPAA-AWARE Production-ready

Claude on AWS Bedrock with BAA and PrivateLink VPC, audit-logged. Field-level masking on PHI before any model call. We've shipped the pattern.

GDPR / EU RESIDENCY Production-ready

EU-region residency on hosted models (Anthropic EU, OpenAI EU data residency, Azure West Europe). DPA workflow + subject-access-request runbook documented at handover.

SOC 2 Partial — vendor-aware

We follow SOC-2-ready practices (audit logs, least-privilege IAM, key rotation, encryption at rest and in transit) but are not ourselves SOC 2 Type II certified as a vendor. If your procurement requires a SOC 2 report from the agency itself, flag it upfront.

ISO 27001 Aligned — not certified

Practices align with the framework. No third-party certification yet. We're transparent about this — agencies that claim more than they hold burn the trust they were hired for.

006 / PRODUCTS WE OPERATE

## The products on the Paiteq operator stack.

Paiteq owns and operates two production properties — one B2B platform, one consumer social product. The agency side of Paiteq inherits the discipline of running these every day: failure modes you only learn by being on-call for your own systems, infra economics you only feel when the bill is yours.

-   [
    
    Aerostack aerostack.dev · B2B · developer tooling
    
    AI backend platform
    
    Developer platform on Cloudflare's edge for building AI-powered products — pre-built integrations, serverless functions, agent deployment, and MCP wiring. Describe what you need; Aerostack hands back a production URL.
    
    -   **Visual agent builder** — plan / act / reflect graph, no glue code
        
    -   **Eval gates baked in** — task success, halluc., latency gate every deploy
        
    -   **Multi-provider routing** — Claude, GPT, Gemini, Llama with cost + quality routing
        
    -   **Observability + rollback** — Langfuse-grade traces, one-click rollback
        
    
    100+
    
    Active teams
    
    50+
    
    Agents shipped
    
    12+
    
    Industries
    
    1,000+
    
    12-mo goal
    
    
    ](https://aerostack.dev)
-   [
    
    Nyburs nyburs.com · Consumer · real-time
    
    Hyperlocal social network
    
    India's hyperlocal social platform — neighborhood timelines, local-language chat, live streaming, and community problem-solving. Built around the thesis that social value compounds with local proximity, not global scale. This is the team's consumer surface — production traffic, real-time workloads, the infra discipline that comes from shipping a product at scale.
    
    -   **Neighborhood feeds** — geo-scoped timelines, not a follow graph
        
    -   **12+ Indian languages** — first-class local-language chat, not translation
        
    -   **Live streaming** — real-time WebRTC layer, scaled on edge POPs
        
    -   **Community problem-solving** — moderation + civic tooling baked in
        
    
    500K+
    
    Users
    
    12+
    
    Languages
    
    24/7
    
    On-call rotation
    
    Live
    
    Production ops
    
    
    ](https://www.nyburs.com)

007 / Have a workload in mind?

## Talk to *engineering*.

The first reply comes from someone who'd be on your build. Same business day on most inbounds.

[Send an inquiry](/contact/) [Email info@paiteq.com](mailto:info@paiteq.com)


---

## SECTION: 3. Services Hub

_Source: https://www.paiteq.com/services/_

# AI Services — Paiteq

> Paiteq's full AI engineering service surface — agents, RAG, LLM development, workflow automation, generative AI, and more. One build process across all 12 practices.

**HTML version:** https://www.paiteq.com/services/

## Key facts

- 11 AI services: AI Agent Development, AI Consulting, AI Integration, AI Migration, AI Workflow Automation, Chatbot Development, Generative AI, LLM Development, Machine Learning Development, MLOps, RAG Development, RPA Development.
- One build process across all practices: discovery → fixed-scope audit → pilot with frozen eval set → production.
- Senior-led; no kickbacks; walk-away clause on every audit.

## Related pages

- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [LLM Development](https://www.paiteq.com/services/llm-development/)
- [AI Consulting](https://www.paiteq.com/services/ai-consulting/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

Services

# The *practices* we ship from.

Twelve surfaces across the AI engineering stack — from autonomous agents to MLOps. Each is owned by senior engineers and runs the same eval-first build process. The practice changes; the rigor doesn't.

[Talk to engineering](/contact/)

001 / PRACTICES

## What we build.

Every practice runs the same discovery-to-run loop. The framework choices, team shape, and model selection change; the eval gates don't.

[

01 / AI ↗

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

02 / RAG ↗

RAG Development

Retrieval-augmented generation systems with evaluation built in.

](/services/rag-development/)[

03 / LLM ↗

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

04 / AI ↗

AI Workflow Automation

Intelligent workflows on n8n, Make, and custom agent orchestration.

](/services/ai-workflow-automation/)[

05 / GENERATIVE ↗

Generative AI

GenAI products end-to-end — text, image, multimodal, OpenAI/Claude/Gemini.

](/services/generative-ai/)[

06 / MACHINE ↗

Machine Learning

Custom ML — training, serving, MLOps.

](/services/machine-learning-development/)[

07 / AI ↗

AI Consulting

AI strategy, audits, roadmap.

](/services/ai-consulting/)[

08 / CHATBOT ↗

Chatbot Development

Production chatbots on LLMs with guardrails and observability.

](/services/chatbot-development/)[

09 / RPA ↗

RPA Development

Intelligent automation — beyond rule-based RPA.

](/services/rpa-development/)[

10 / AI ↗

AI Integration

Drop-in AI for existing apps — OpenAI / Anthropic / Vertex.

](/services/ai-integration/)[

11 / AI ↗

AI Migration

Legacy software → AI-modernized stack. Eval-validated cutover.

](/services/ai-migration/)[

12 / MLOPS ↗

MLOps

Deploy, monitor, scale ML/LLM systems.

](/services/mlops/)

002 / PROCESS

## One build process. Every practice.

The six stops below run on a 4-week Pilot and a 16-week Production Build alike. The eval set authored in week 2 of a Pilot becomes the regression suite that blocks bad deploys a year later. Skipping it — or deferring it to "after the prototype" — is the single most common reason AI projects fail silently: the system ships, the quality degrades, and nobody notices until a user complaint reaches a Slack channel at week 14.

WEEK 1

### Discovery

Workload shape, success criteria, data residency, cost target. No tooling picked yet — the eval surface comes first.

WEEK 1–2

### Spec

Written scope with measurable exit criteria for every phase. If we can't agree on what done looks like, we don't start.

WEEK 2–4

### Prototype

Minimal build against the eval set — enough to score, not enough to ship. Used to surface the hard unknowns early.

WEEK 3+

### Eval gates

Domain-expert-graded examples re-run on every code change. Gates block a deploy the same way a failing test does.

WEEK 8+

### Deploy

Auth, observability (Langfuse or equivalent), model fallback, cost guardrails, regression alarms baked in before launch.

ONGOING

### Run

Weekly eval re-runs, drift alarms, prompt iteration log, model-upgrade regression checks. The eval set grows with usage.

003 / ENGAGEMENTS

## Four ways to start — any practice.

The four shapes below apply across all twelve practices. A Pilot on an AI agent engagement looks structurally identical to a Pilot on a generative AI engagement — one scoped workload, one eval set, a demo, a build-or-stop memo. The domain-specific detail lives on the pillar pages.

01 PILOT Fixed scope

2–4 weeks

### One workload, scoped and scored in 2–4 weeks.

In scope

-   One scoped workload, agreed in writing before work starts
-   Eval set (30–60 examples) authored with your domain expert
-   Two to three candidate approaches scored against the eval
-   Working demo against your actual data
-   Build-or-stop memo with a cost and timeline estimate for phase two

Out of scope

-   Production deploy
-   Auth, observability, or cost guardrails
-   Multi-workload or multi-practice scope

02 BUILD Fixed scope

8–16 weeks

### Production system with eval gates baked into CI.

In scope

-   All Pilot deliverables, or discovery from scratch if no Pilot ran
-   Eval gates in the deploy pipeline — a failing gate blocks the deploy
-   Auth, rate-limiting, cost guardrails, model fallback
-   Observability (Langfuse or equivalent) from day one of staging
-   Four weeks of post-launch iteration included in scope
-   Runbook and eval-set ownership transferred to your team at close

Out of scope

-   Ongoing retainer support (available separately)
-   Fine-tuning unless the eval set justifies it
-   Infrastructure you already own and operate

03 AUDIT Fixed scope

2–3 weeks

### Read your system, root-cause the gap, no rebuild.

In scope

-   Read of your existing system: code, prompts, retrieval, evals
-   Scored run against your current eval set (or we author a minimal one)
-   Adversarial and edge-case stress test of the live system
-   Root-cause memo: where quality is leaking and why
-   Prioritised fix-list with estimated effort per item

Out of scope

-   Any code changes (Rescue engagement covers fixes)
-   Rebuild or re-architecture
-   Ongoing monitoring

04 RESCUE Time & materials

4–6 weeks

### Diagnose the failing layer, fix it, ship regression tests.

In scope

-   Audit (weeks 1–2) to find the root cause before touching anything
-   Targeted fix of the diagnosed layer — not a full rebuild
-   Regression tests that would have caught the original failure
-   Re-run of the full eval set to confirm the fix held
-   Hand-off memo documenting what failed, why, and what prevents recurrence

Out of scope

-   New features or scope added during the engagement
-   Full architectural rewrite unless the diagnosis specifically requires it

004 / EVAL-FIRST

## Why every practice runs the same eval discipline.

The failure mode we see most often is not a wrong model choice or a bad prompt. It's a system that ships to production without an eval set. The team built the demo, the demo impressed someone, the prototype became the product. Months later, a user complaint surfaces in a Slack channel. The team digs in and discovers the system has been wrong on a specific query class for weeks — quietly, consistently, with no alarm to trip because there was never a measurement in place. The cost of the fix is manageable. The cost to trust is not.

Eval-first means the eval set precedes the prototype, not the other way around. Before we write a prompt or call an API, we agree in writing on what correct output looks like for a representative sample of your real workload — authored by your domain expert, graded against your standards. That set becomes the gate. A deploy that regresses the score doesn't ship. The eval set also becomes the diagnostic tool when something breaks in production: you run the set, the score drops, you bisect the change history, you find the commit.

This discipline applies identically whether the practice is RAG, fine-tuning, workflow automation, or a voice agent. The eval artefacts differ — retrieval recall and precision for RAG, task-completion rate for agents, output schema conformance for generative pipelines — but the structure is the same: explicit success criteria, graded examples, a gate that blocks bad deploys. Teams that skip this don't discover the gap at week 2; they discover it at week 14 when a real user finds it first.

The practical implication for a new engagement: the first two weeks of any build are not about the model or the architecture. They're about writing down what good looks like. If that conversation is harder than expected — if stakeholders can't agree on a scoring rubric, or if nobody has domain expertise to grade the outputs — that is the risk to surface and resolve before any code ships. We facilitate that conversation on every engagement. We've ended discovery calls by telling a prospective client that the org needs to resolve an internal disagreement about quality criteria before any engineering makes sense. That's the right call, even when it means a smaller engagement.

005 / STACK

## Cross-cutting tools — how we pick them.

These are the decisions that recur across all twelve practices. Each card shows when we reach for the tool and when we don't — the honest trade-off, not the vendor brief.

Hosted LLMs

Strengths

Fast to ship, top-of-class quality, no inference infra to operate.

When We Pick

Default for most workloads below ~500k requests/day. Data residency rules permitting.

When We Don't

When a regulatory ring-fence forbids third-party inference — EU health data, defence-adjacent, financial services with strict data mandates.

Paiteq Pattern

Claude Sonnet 4.6 as default reasoning model; GPT-5 for vision-heavy; Gemini 3.0 Pro for long-context document work.

AnthropicOpenAIVertex AIAWS Bedrock

Self-hosted models

Strengths

Data stays in your cloud, fixed infra cost, adapters and weights you own.

When We Pick

Regulated workloads with residency rules, or very high volume where per-token cost justifies the infra overhead.

When We Don't

When the team has no ops capacity — self-hosted needs someone on-call for the inference layer.

Paiteq Pattern

Llama 4 / Mistral on vLLM with continuous batching. LoRA/QLoRA adapters where fine-tuning is warranted by data.

Llama 4MistralvLLMQLoRA

Vector DBs

Strengths

Semantic retrieval at scale, hybrid search (dense + sparse), metadata filtering.

When We Pick

Any RAG workload, long-term agent memory, semantic deduplication pipelines.

When We Don't

When the corpus is under ~10k chunks and a well-indexed Postgres full-text search covers the use case.

Paiteq Pattern

Qdrant for self-hosted; Pinecone for managed; pgvector when staying in existing Postgres is the right trade-off.

QdrantPineconeWeaviatepgvector

Eval harness

Strengths

Quantified quality scores that block bad deploys — not vibes, not manual review.

When We Pick

Every production engagement, from week 2. Non-negotiable; this is the whole point.

When We Don't

Never skip. On short Pilots the eval set is smaller (30 examples), not absent.

Paiteq Pattern

Inspect AI for agent task eval; RAGAS for retrieval quality; Langfuse for production traces and LLM-as-judge scoring.

Inspect AIRAGASLangfuseLLM-as-judge

Orchestration

Strengths

Durable execution, retry logic, multi-step agent coordination without hand-rolled state machines.

When We Pick

Workflow automation and any agent system where a mid-run crash would cost meaningful retries or user trust.

When We Don't

Simple request-response LLM apps don't need an orchestration layer — it adds latency and complexity for no benefit.

Paiteq Pattern

n8n for no-code-adjacent automation; Inngest for event-driven serverless; Temporal for long-running stateful agents.

n8nInngestTemporalLangGraph

Cloud & infra

Strengths

Each major cloud has a different strength for AI workloads — provider lock-in is a real cost to manage.

When We Pick

AWS as primary (Bedrock + SageMaker + Lambda). GCP when Vertex AI or BigQuery is already in the stack. Cloudflare Workers for edge-latency constraints.

When We Don't

We don't build multi-cloud for its own sake — it raises cost and complexity without a proportional resilience benefit at most engagement sizes.

Paiteq Pattern

AWS-first. CDK for infra-as-code. Cloudflare for edge inference and static delivery.

AWSGCPCloudflareCDK

006 / FAQ

## Common questions across all practices.

Questions that apply regardless of which practice you're evaluating. Practice-specific questions live on the pillar pages.

How do I know which practice fits my workload?

Start with the output: if you're moving data between systems with minimal judgment required, that's Workflow Automation or RPA. If the system needs to reason, retrieve, or generate — it's one of the model-layer practices (LLM Development, RAG, Generative AI). If you want a system that takes multi-step actions with tools, that's AI Agents. If you have a running system that's underperforming, that's MLOps or a consulting engagement before any rebuild.

The [contact form](/contact/) asks you to describe the workload in one sentence — that's usually enough to route you. If it genuinely spans two practices (common: RAG + Agents, or LLM Dev + MLOps), we'll say so and scope accordingly.

Can two practices ship in one engagement?

Yes, and it's common. A production AI agent almost always spans AI Agents + RAG + LLM Development — the agent is the orchestration layer, RAG is the retrieval layer, and LLM Development is where model selection and eval live. We scope these as a single engagement with a unified eval set, not three separate SOWs stapled together.

What we don't do: scope two practices in one engagement when the work genuinely runs in sequence. If you need a consulting engagement to clarify what to build before the build starts, that's two phases, not one — and conflating them is how scope creep starts.

What's the actual difference between Pilot, Build, Audit, and Rescue?

**Pilot** is a bet-sizing exercise. One scoped workload, one eval set, a working prototype. The deliverable is a demo plus a build-or-stop memo — not a production system. Fixed scope, 2–4 weeks.

**Build** is production. Eval gates baked into CI, observability, auth, cost guardrails, model fallback. It starts where the Pilot ended. 8–16 weeks depending on scope.

**Audit** is a read-only engagement. We don't touch your code; we read it, run your system, stress the eval surface, and hand back a prioritised fix-list and a root-cause memo. 2–3 weeks.

**Rescue** is an Audit where we stay to fix it. We diagnose the failing layer — usually eval debt, a poisoned retrieval index, or a prompt that was never stress-tested — and ship the regression tests that prevent the same failure next quarter. 4–6 weeks.

How does eval-first change the process if my team has no ML experience?

The eval set is the only ML artefact your team needs to own long-term, and it doesn't require ML expertise to write — it requires domain expertise. An eval example is a question your system should answer, plus a graded correct answer your subject-matter expert signs off on. A lawyer can write legal-domain eval examples without knowing what a transformer is.

What eval-first changes in practice: it front-loads the hardest conversation (what does "good" look like for this output?) to week 2 instead of week 10 when a user complaint surfaces it. Teams without ML experience are often surprised that this is the bottleneck — not the model, not the code, but whether anyone in the organisation can grade the output consistently. We facilitate that process, but the domain knowledge has to come from your side.

How is pricing structured across practices?

Fixed-scope per engagement, quoted up front. Pilots and Audits ship on a fixed fee tied to a written scope and exit criteria; Production Builds are scoped after the Pilot or Audit closes so the number is anchored in real eval data, not a guess. Rescue engagements quote after the diagnostic week so we're pricing the actual fix layer, not a sales estimate.

We don't price on team size — scope and deliverables drive the number. Every quote ships with a written scope, eval gates, exit criteria, and a line explaining what we'd descope first if the budget needs to come down. For specific dollar ranges, talk to engineering — we'll scope to your workload in the discovery call.

When do you say no?

When nobody can grade what good looks like. Every AI system we ship needs a measurable eval surface — if the output is inherently subjective and the stakeholders can't agree on a scoring rubric, we're not the right team. We'll say this in the discovery call rather than take the engagement and deliver something neither side can evaluate.

We also say no to research POCs with no path to production, projects where the real ask is headcount augmentation rather than a shipped system, and anything where the success criterion is "impress the board demo" with no production follow-through. The pattern we've seen too many times: a flashy demo, six months of slide-deck iteration, and a production system that never ships because the eval set was never written.

007 / Start a project

## Describe the workload. *We'll route you.*

One sentence about what you're trying to build is enough to map it to the right practice and suggest an engagement shape.

[Talk to engineering](/contact/) [See all practices](#practices)


---

## SECTION: 4.1. Service: ai-agent-development

_Source: https://www.paiteq.com/services/ai-agent-development/_

# AI Agent Development Services — Paiteq

> Paiteq is an AI agent development company building production AI agents on LangGraph, CrewAI, AutoGen, Composio, and DSPy. Plan/act/reflect loops, multi-agent systems, voice agents — with evaluation built in. AI agent development services from pilot to production.

**HTML version:** https://www.paiteq.com/services/ai-agent-development/

## Key facts

- Stacks: LangGraph, CrewAI, AutoGen, Composio, DSPy.
- Architectures: plan/act/reflect loops, multi-agent systems, voice agents.
- Eval-instrumented from day one; defined kill point before the build starts.

## Related pages

- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [LLM Development](https://www.paiteq.com/services/llm-development/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI Agent Development

# The *AI agent development company* production teams trust to ship.

Paiteq is an AI agent development company building production agents on LangGraph, CrewAI, AutoGen, and Composio. AI agent development services from pilot to production — plan/act/reflect loops, multi-agent systems, voice agents, evaluation built in.

[Talk to engineering](/contact/) [See build process](#process)

Stack LangGraph · CrewAI · Composio

Engage Pilot · Build · Run

Eval Task success · Latency · Halluc.

Compliance SOC 2 · ISO 27001

001 / WHAT WE BUILD

## Agents shipped across eight surfaces.

We don't list "AI for everything." Each surface below is a workload we've shipped to production — with the eval methodology, the framework choice, and the failure-mode story already worked out. [See three of them in detail →](#use-cases)

AI agent development sorts cleanly by function — not by industry. A sales agent for a B2B SaaS uses the same plan/act/reflect loop as one we ship for a manufacturer; the integration surface differs, the eval anchor differs, but the shape of the build doesn't. Sorting by function lets us reuse the eval framework, the observability rig, and the prompt-iteration playbook across clients. Sorting by industry — the way most listicle competitors organize their pages — hides where the engineering actually lives. Every custom AI agent development engagement we run reuses the same scaffolding; only the workload-specific tool surface changes.

[

01 / SALES ↗

Sales agents

Lead qualification, outbound research, CRM action. Replace SDR work that doesn't need a human.

B2BOutboundEnrichment

](#use-cases)[

02 / SUPPORT ↗

Support agents

Tier-1 deflection, ticket triage, knowledge-base augmented. Escalates with full context, not raw transcripts.

Tier-1RAGVoice

](#use-cases)[

03 / OPS ↗

Ops agents

Invoice matching, AP routing, inventory triage. Replaces structured-but-tedious back-office work.

AP / ARWorkflows

](#use-cases)[

04 / CODING ↗

Coding agents

Repo-aware code review, refactor, doc-gen. Wired into your CI as a PR gate, not a chat surface.

Repo-awarePR gate

](#use-cases)[

05 / RESEARCH ↗

Research agents

Multi-step deep research over the open web + your internal corpora. Cited, structured outputs.

Multi-stepCitations

](#use-cases)[

06 / VOICE ↗

Voice agents

Phone and in-app voice agents with low-latency turn-taking. Built on LiveKit, Vapi, or Pipecat.

LiveKit<400ms

](#use-cases)[

07 / MULTI-AGENT ↗

Multi-agent systems

Supervisor + worker patterns for tasks that don't fit one agent. Planner, executor, critic.

SupervisorCritic

](#architecture)[

08 / EVAL ↗

Eval & observability

Every agent ships with task-success scoring, hallucination metrics, and weekly eval runs.

LangfuseBraintrust

](#eval)

FUNCTION × INDUSTRY

Where we've shipped each agent class. Strength reflects production volume, not theoretical fit — empty cells mean we either haven't done it yet or the workload didn't justify an agent.

Function Industry

B2B SaaS

Health-tech

Mfg

Fin-tech

Legal

E-comm

Ed-tech

Logistics

Sales agents

Support agents

Ops / back-office

Research agents

Voice agents

Multi-agent

Sales agents

B2B SaaSMfgFin-techE-commEd-tech Health-techLegalLogistics

Support agents

B2B SaaSHealth-techMfgFin-techLegalE-commEd-techLogistics

Ops / back-office

B2B SaaSHealth-techMfgFin-techLegalE-commLogistics Ed-tech

Research agents

B2B SaaSHealth-techFin-techLegalEd-tech MfgE-commLogistics

Voice agents

B2B SaaSHealth-techFin-techEd-techLogistics MfgLegalE-comm

Multi-agent

B2B SaaSHealth-techMfgFin-techLegalE-commEd-techLogistics

Possible fit Good fit Primary vertical

The shipped-volume bias is intentional. Sales, support, and ops are the three columns we run the most — they're where evaluable agent work meets clear ROI math. Voice and research agents are growing fastest in 2026; multi-agent systems remain a smaller share of the book because we recommend a single-agent loop unless the workload actually has separable sub-tasks. We'll talk you out of multi-agent if the work doesn't fit it. [More on that under reference architectures.](#architecture)

002 / SERVICES

## AI agent development services — pick where to start.

Three engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done. [Full engagement-model breakdown below →](#engage)

Choosing an AI agents development company is mostly about choosing the right starting shape. How you start usually decides how the project ends. Buyers who walk in with a single scoped workload and an eval set in mind ship to production 70% of the time. Buyers who walk in with "we want AI agents" without a target workload ship 25% of the time, usually after a re-scope. We've built the four shapes below to map cleanly onto those starting points — pick the one that matches what you actually have, not what you wish you had. Each shape is an AI agent development service we've shipped 10+ times — the deliverables and gate criteria are locked in by repetition, not invented for your engagement.

[

01 / PILOT ↗

Agent Pilot

One agent, one scoped workflow, intake to live in 2–4 weeks. Eval framework included.

2–4 wksFixed scope

](#engage)[

02 / BUILD ↗

Custom Agent Build

Full production build with integrations, eval gates, observability, and post-launch iteration.

8–16 wksFixed scope

](#engage)[

03 / MULTI-AGENT ↗

Multi-Agent Systems

Supervisor + worker + critic orchestration for tasks one agent can't handle. LangGraph or CrewAI.

10–20 wks

](#engage)[

04 / VOICE ↗

Voice / Conversational Agents

Sub-400ms turn-taking voice agents. Inbound, outbound, in-app. LiveKit + your LLM of choice.

6–12 wks<400ms

](#engage)

A practical decision tree: if the workload is scoped but unproven, start with a **Pilot**. If the workload is proven (one of yours is already working manually or in a janky prototype) and you need production discipline, start with a **Custom Agent Build**. If the workload has 3+ sub-tasks that fight each other in a single prompt, start with **Multi-Agent Systems**. Voice is a separate workstream — latency budget, turn-taking, telephony — so it gets its own shape. Compared to other AI agent development companies, we build the eval framework before writing code, not after the agent ships. [Week-by-week scope on each, further down →](#engage)

003 / STACK

## Frameworks we build on.

Framework choice follows the workload, not the other way around. We don't have a house framework. The six below cover ~95% of what we ship; the rest live in **Vercel AI SDK**, **OpenAI Agents**, or hand-rolled SDK wrappers when the surface is small enough.

-   LangChain
-   LangGraph
-   CrewAI
-   AutoGen
-   DSPy
-   Composio
-   Pydantic-AI
-   Phidata
-   AG2
-   Vercel AI SDK
-   OpenAI Agents
-   Anthropic SDK
-   LangChain
-   LangGraph
-   CrewAI
-   AutoGen
-   DSPy
-   Composio
-   Pydantic-AI
-   Phidata
-   AG2
-   Vercel AI SDK
-   OpenAI Agents
-   Anthropic SDK

FRAMEWORK PICKS

For each framework: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production agents on every one of these — the "when we don't" lines come from actual builds, not theory.

LangGraph

Strengths

Explicit graph control over a stateful agent loop. Checkpointing, time-travel debugging, and human-in-the-loop interrupts are first-class.

When We Pick

Plan/act/reflect loops where you need to inspect, replay, or branch state. Long-running agents that survive crashes. Any agent that needs an explicit retry policy.

When We Don't

Single-turn extraction. Stateless tool calls. Anything that fits in a Vercel AI SDK route — don't pay the graph tax if you don't need it.

Paiteq Pattern

We use LangGraph for ~70% of production agents. The checkpoint store goes to Postgres so resume-after-crash is one redeploy away.

StatefulCheckpointsMulti-step

CrewAI

Strengths

Supervisor / worker orchestration with role-prompt scaffolding. Less code than LangGraph for the same 3-agent pattern.

When We Pick

Research workflows. Content pipelines (planner → writer → critic). Multi-agent prototypes where the orchestration topology won't change.

When We Don't

Once you need explicit state graph control, retries, or runtime topology changes — graduate to LangGraph. CrewAI's role abstraction starts to fight you.

Paiteq Pattern

Pilots that need to demo a multi-agent loop in week 3 start in CrewAI. About a third graduate to LangGraph for production.

SupervisorMulti-agent

AutoGen

Strengths

Multi-agent conversation patterns from Microsoft Research. Strong code-execution agents and group-chat orchestration.

When We Pick

Code-generation agents that need a sandboxed runtime. Group-chat patterns where 3+ agents debate before acting.

When We Don't

Single-agent loops. Latency-sensitive paths. AutoGen's chat metaphor adds round-trips that LangGraph avoids with direct state passing.

Paiteq Pattern

We use AutoGen Studio when a client wants to author agent conversations themselves before we wire it into production.

Code-execGroup chat

Composio

Strengths

Pre-built tool surface. 250+ integrations (Salesforce, Slack, GitHub, Linear, Zendesk) with auth + rate-limit handling already solved.

When We Pick

Agents that touch 4+ external systems and we'd otherwise spend the first three weeks writing OAuth and webhook plumbing.

When We Don't

Internal-only tools or anything off the catalog. The wrapper costs ~80ms per call and you're at Composio's rate-limit policy — for hot paths, write the tool yourself.

Paiteq Pattern

Sales and support agents almost always start on Composio. We migrate to native SDKs only when latency or rate limits become the bottleneck.

250+ toolsOAuth

DSPy

Strengths

Compile prompts the way you'd compile code. Optimizers (BootstrapFewShot, MIPRO) tune the prompt against your eval set automatically.

When We Pick

Agents where the prompt is the bottleneck and we have an eval set ≥50 examples. Hand-tuned prompts plateau; DSPy gets another 5–10 points.

When We Don't

Eval set too small (DSPy needs signal). Workflows where the bottleneck is tool design, not prompting. Prototypes — start with hand-crafted, compile later.

Paiteq Pattern

Phase 2 of any Build engagement. We hand-craft the prompts in pilot, then DSPy-compile them once the eval set is mature.

Prompt optMIPRO

LiveKit

Strengths

Voice agent infrastructure — WebRTC transport, STT/LLM/TTS pipeline, turn-detection. Sub-400ms mic-to-speaker on Claude or GPT-5 realtime.

When We Pick

Phone or in-app voice agents. Anywhere the user expects a human-cadence conversation, not a chat UI with TTS bolted on.

When We Don't

Long-form generation. Voice agents that need 5+ seconds to think — LiveKit's turn detector will keep interrupting them.

Paiteq Pattern

Pipecat for prototypes (faster to demo), LiveKit when going to production. We've shipped voice agents with p95 turn-take at 320ms.

VoiceSub-400msWebRTC

Two patterns worth flagging on every custom AI agent development engagement we lead: we benchmark **two frameworks against the eval set** before locking the stack — usually LangGraph vs whichever lighter option fits the workload. The eval set decides, not the framework's marketing. And we keep an out: every enterprise AI agent ships with prompts in portable YAML for re-hosting on a different framework if needed. Most AI agent development companies don't scope portability into the SOW — we do it by default.

003b / MODELS & INTEGRATIONS

## Models, integrations, and the tool surface.

Framework choice gets the H2 but rarely the headline call. Model and integration choice usually matter more for production behaviour. We benchmark at least two models per workload, name every integration in scope on day one, and pick between Composio's pre-built tool surface and native SDK wrappers per call site — not as a blanket policy.

MODELS WE DEPLOY

Four model families cover ~98% of what we ship. We benchmark candidates against your eval set before locking; the leader on cost-adjusted quality wins, regardless of which vendor we'd default to. Routing across multiple families in one production deployment is increasingly the norm.

Claude (Sonnet / Opus)

Strengths

Tool use, structured output, long-context reasoning. Our default for agent planning loops where the prompt is doing the heavy lifting.

When We Pick

Stateful agents with complex tool surfaces. Long-context RAG. Anywhere prompt-following matters more than raw speed. Sonnet for ~80% of production; Opus when reasoning depth justifies the cost.

When We Don't

Hyper-latency-sensitive paths (Sonnet TTFT runs 400-800ms). Very high-volume routine calls where GPT-5 mini or Llama-3 are cheaper.

Paiteq Pattern

Sonnet 3.5 is the default planner in our agent loops. We benchmark it head-to-head against GPT-5 on every new engagement's eval set.

Tool useStructuredLong context

GPT (4o / 4o-mini)

Strengths

Lowest latency on the hosted side (4o realtime TTFT ~250ms). Strong vision. 4o-mini is the price-per-token king for routing.

When We Pick

Voice agents needing realtime TTFT. Multimodal apps (vision + text). High-volume classifier or router tier where 4o-mini's cost wins.

When We Don't

Heavy tool-using planning loops — Claude usually wins our eval. Workflows requiring strict structured output without retries.

Paiteq Pattern

GPT-5 realtime is the standard for voice agents. 4o-mini routes ~70% of high-volume traffic in cost-engineered deployments.

RealtimeVisionRouting

Gemini (2.0 Flash / Pro)

Strengths

Massive context window (1M+ tokens). Strongest cost-per-token at the frontier tier. Native multimodal across video.

When We Pick

Workloads with very large context (whole-codebase analysis, long document Q&A) where chunking would lose signal. Video understanding.

When We Don't

Tool-using agents that don't need the context window — Gemini's tool-call accuracy still trails Claude on our evals. Production paths where Google Cloud lock-in is a concern.

Paiteq Pattern

We use Gemini Flash for long-document agents (legal contracts, codebase audits). Rarely as a primary agent planner — yet.

1M contextMultimodalVideo

Llama 4 / Mistral / Qwen

Strengths

Self-hosted on your cloud. Fixed infra cost. No data leaves your perimeter. LoRA fine-tuneable for domain language.

When We Pick

Regulated data rules (HIPAA, FedRAMP, EU residency). Very high token volume where dedicated GPU amortizes. Workloads where prompt + small fine-tune beats hosted prompt-only.

When We Don't

Tool-using agents on the frontier — open weights still trail Claude/GPT-5 by 5-15 points on tool-call accuracy. Engagements with no ops capacity to run inference infrastructure.

Paiteq Pattern

vLLM on dedicated A100/H100 for self-hosted. LoRA fine-tunes on Llama 4 70B for domain-specific classification or extraction agents. Hybrid: hosted planner + self-hosted worker is increasingly common.

Self-hostedvLLMLoRA

INTEGRATIONS WE SHIP AGAINST

Tool-call accuracy against your real systems is one of the four eval metrics. We don't trust integrations until we've graded them. Below: the systems we've shipped agent integrations against in the last 12 months. Adding to the list takes a few days, not a re-architecture.

CRM & Sales

-   Salesforce
-   HubSpot
-   Pipedrive
-   Apollo
-   Clearbit

Support & Ticketing

-   Zendesk
-   Intercom
-   Freshdesk
-   Linear
-   Jira

Data Warehouse & Search

-   Snowflake
-   BigQuery
-   Databricks
-   Pinecone
-   Qdrant

Communication & Files

-   Slack
-   Microsoft Teams
-   Gmail / Google Workspace
-   SharePoint
-   Notion

Code & DevOps

-   GitHub
-   GitLab
-   CircleCI
-   Linear
-   PagerDuty

Voice & Telephony

-   LiveKit
-   Twilio
-   Vapi
-   Plivo
-   Telnyx

TOOL SURFACE DESIGN

Every agent has a tool surface — the set of functions it can call to do its job. Sizing that surface is one of the most consequential decisions in the build. Too small and the agent can't do the work; too large and it loses focus, tool-call accuracy drops, latency balloons.

01 **≤8 tools per planning loop.** Beyond that, accuracy starts dropping on every eval we've run. Decompose into supervisor + workers.

02 **Composio for breadth, native SDK for hot paths.** Composio's 250+ integrations save 2-3 weeks on OAuth + webhook plumbing but add ~80ms per call. Native SDKs when latency or throughput rules.

03 **Confirmation step in front of write actions.** Read tools call freely; write tools (send email, create ticket, charge card) ship with a confirmation gate unless tool-call accuracy is above 99.5%.

04 **Structured outputs over freeform.** Every tool input gets a Pydantic / Zod schema. We use Anthropic and OpenAI structured-output mode by default; retries on schema violation are cheaper than guessing what the agent meant.

The model and integration choices are where engagement scope quietly grows. Buyers ask for "an AI agent for our sales workflow" without specifying which CRM, which ICP scoring fields, which model. We force those choices into the spec in week 2 — naming the model, naming the four CRM endpoints we'll integrate, naming the cost band per request. Decisions made explicit at scope time stop being re-litigations during build.

004 / PROCESS

## Six steps from discovery to running.

The same process runs across both a 2-week pilot and a 16-week custom build. The gates change in depth, not in shape. Every step has an explicit deliverable, a named owner, and a gate criterion — pass or rework, no "we'll figure it out next week."

WEEK 1–2

### Discovery

We map the workflow, scope the agent's job, and identify the eval surface — what counts as the agent doing its job correctly?

WEEK 2–3

### Spec

Tools, prompts, guardrails, model choice, and the first 30–50 eval examples. Signed off before any code.

WEEK 3–6

### Prototype

First runnable agent against the scoped eval set. We iterate prompts and tools until baseline accuracy is hit.

WEEK 6–10

### Eval gates

Task success, hallucination rate, tool-call accuracy, and latency thresholds all green before production wire-up begins.

WEEK 10+

### Deploy

Production integration — auth, rate limits, observability via Langfuse, retry + fallback policies, on-call runbook.

ONGOING

### Running

Weekly eval runs, prompt + tool iteration, and a regression alarm if any metric drops by more than 5%.

01

### Discovery

We map the full workload — every decision point, handoff, and exception — before scoping any agent. That means watching one of your team members do the work today, recording every decision point, and identifying which decisions are deterministic (rule-based) vs judgment-based (LLM-fit). The week-1 output is a workload map + a draft eval surface: what counts as the agent doing the job correctly?

OwnersPaiteq AI engineer + your subject-matter expert. ~6 hours of their time across the week.

GateWorkload boundary signed off. If sub-tasks straddle a fuzzy boundary, we shrink scope rather than guess.

02

### Spec

Stack picks (framework + model + observability), prompt sketches, tool surface, guardrails policy, and the first 30–50 eval examples. The eval examples come from your domain expert grading real candidate outputs — not from us guessing. Signed off as a one-pager before code starts.

OwnersPaiteq AI engineer + senior architect review.

GateEval examples graded. If your team can't agree on a grade for the example outputs, the spec isn't done.

03

### Prototype

First runnable agent against the scoped eval set. We iterate on prompts, tool design, retrieval (if RAG), and model choice. Multiple models get benchmarked against the same eval set — the leader on cost-adjusted quality wins, regardless of which vendor we'd default to.

OwnersPaiteq AI engineer building; weekly demo to your team.

GateBaseline accuracy hit on the eval set. Below baseline, we revise the spec rather than the threshold.

04

### Eval gates

Four thresholds must all be green before any production wire-up: task success rate, hallucination rate, tool-call accuracy, and p95 latency. Hallucination is dual-scored (LLM-as-judge + human spot-check on disputed examples). Tool-call accuracy is separately measured because a wrong tool call can succeed at the wrong thing.

OwnersPaiteq AI engineer + your domain expert verifying the human-spot-check.

GateAll four metrics green or the build doesn't deploy. Period. We've shipped exactly zero agents that bypassed this gate.

05

### Deploy

Production integrations — auth, rate-limit, observability via Langfuse, fallback policies, cost guardrails, on-call runbook. We wire the eval set into the deploy pipeline so regression alarms fire automatically when an upstream model change drops scores. The handoff includes the runbook in your repo, not in a doc somewhere.

OwnersPaiteq AI engineer + your platform/SRE team.

GateRunbook drilled (we simulate an outage + rollback before the actual go-live).

06

### Running

Four weeks of post-launch iteration are part of every Build engagement — weekly eval review, prompt iteration on edge cases, regression alarm triage. After week 16, ongoing iteration moves to a Run engagement (separate monthly SOW) only if the workload genuinely benefits. About half of completed builds graduate to Run.

OwnersPaiteq AI engineer (decreasing % of time) + your team picking up ownership.

GateOngoing — weekly eval review never stops while we're engaged.

Two notes that matter. **Eval gates are non-negotiable** — we will not wire an agent into production traffic until task success rate, hallucination rate, tool-call accuracy, and latency are all green against the eval set scoped during discovery. **Running is a real phase**, not an afterthought. The first 4 weeks post-launch are part of every Build engagement, with weekly eval runs and prompt iteration baked into the SOW.

005 / DECISION

## AI agents vs. chatbots — when do you need which?

This is the most common scoping mistake we see. Buyers ask for "an AI agent" when a chatbot is enough, or ask for "a chatbot" when the workload genuinely needs autonomy. The seven dimensions below cover most of the call.

Chatbots

AI Agents

Turns

Single, request-response

Multi-turn, planning loop

State

Stateless or thin context

Stateful, often memory-backed

Tool use

None or one-shot lookup

Core — APIs, code, retrieval, other agents

"Tool use" is the line buyers most often miss. A chatbot can *call one API* on intent match — that's not tool use, that's a function call. Real tool use means the model decides **which** tool to call, **when**, with **what arguments**, and how to react when the tool errors. That decision loop is the agent.

Autonomy

Scripted by intent map

Goal-driven, decides its own steps

Autonomy is the scope dial. Scripted flows are easier to test, cheaper to run, and won't surprise you in production — but they cap at the conversation tree you drew. Agents will solve problems you didn't anticipate, which is the value *and* the risk. Most production deployments end up with bounded autonomy: agent decides within a fixed toolset and rejects-with-explanation outside it.

Eval surface

Intent classification acc.

Task success rate + sub-step accuracy

Failure mode

Wrong intent → wrong reply

Wrong plan → cascading bad actions

Best for

FAQ, lookups, routing

Multi-step workflows, research, ops

Cost (rough)

$$

$$$$ — per-task LLM cost dominates

Cost flips the typical recommendation. At **$0.001/resolution for a tuned chatbot** vs **$0.05–$0.20/task for an agent**, the volume math is brutal. A chatbot handling 10k tickets/day at $0.001 costs $300/mo. The same volume on an agent at $0.10 is $30k/mo. Agents earn their cost on **multi-step work** (research, ops, integration) — not on volume. If the workload is bounded and high-volume, chatbot every time.

Full breakdown — [when to pick which](/blog/ai-agents-vs-chatbots/)

Rule of thumb: if the work is **look something up and respond**, you want a chatbot. If the work is **understand a goal, take several steps, and use tools along the way**, you want an agent. Anything in between, the decision tree below walks you through a few diagnostic questions — most projects fit cleanly into one of five outcomes.

DECISION TREE

Answer four questions about your workload. We've used these same questions to right-size scope on every engagement since 2023.

Path

Question

Pick one

Result

006 / ARCHITECTURE

## Three patterns we deploy.

Most production agents reduce to one of three patterns. The taxonomy isn't ours — it's standard in the LangGraph and CrewAI communities — but the *deployment choices* are where engineering judgment lives: when to pick which, what fails first, which eval metric becomes the anchor.

  
01

### Single-agent + tools

The simplest production pattern. One agent runs a plan/act/reflect loop with a fixed tool surface, one LLM call per turn. This is where most production agents land — sales research, support deflection, ops routing. State is small (recent turns + scratchpad), the topology is fixed, and the eval anchor is end-task success plus tool-call accuracy. Around 60-70% of our production agents fit this pattern. Don't reach for multi-agent until single-agent demonstrably fails the eval set.

Pick when

-   The workload is bounded with stable tools. A single planning loop covers the task. Tool surface ≤8. Most pilots start here.

Skip when

-   Sub-tasks fight each other in one prompt. Task needs >15 sequential tool calls (latency budget breaks). Workflow has clearly separable specialised roles.

Stack

LangGraphClaude Sonnet 4.6Composio

02

### Supervisor / worker

One supervisor agent plans and routes; workers specialise (research, draft, execute, critique). Used when no single agent's prompt can hold the full task without quality collapse. The supervisor's job is decomposition + routing, not execution — keeping its prompt focused on "which worker, with what input?" beats letting it also try to do the work. Per-worker success and supervisor routing accuracy become separate eval metrics; either failing tells you something different about what to fix.

Pick when

-   Task has clearly separable sub-tasks (research → draft → critique). Single mega-prompt is producing worse results than orchestrating focused agents. You can score each worker's output independently.

Skip when

-   Workflow is linear with no decision points (just chain LLM calls). Latency budget tight (each handoff adds 800-1500ms). Sub-tasks share too much context.

Stack

LangGraphCrewAIClaude 4GPT-5

03

### RAG-augmented agent

The agent treats retrieval as a tool it can call mid-loop, not as a fixed pre-step before generation. Vector store sits behind a retriever the agent invokes when grounding is needed. Right when context grounding matters more than autonomy depth — clinical Q&A, contract review, regulatory research. Eval anchors shift: retrieval recall (did we find the relevant chunks?) and answer faithfulness (did the agent stay grounded in what was retrieved?) matter more than tool-call accuracy. We hand-build the chunking + reranker per corpus — defaults are bad.

Pick when

-   Output must be grounded in your corpus (docs, tickets, contracts). Corpus too large or too fresh to fit in the prompt. Citation enforcement is a hard requirement.

Skip when

-   Workload is mostly generative (writing, image). Corpus fits in context window with room to spare. You don't have ground-truth answers for eval.

Stack

LangGraphPineconeBGE rerankClaude 4

A common scoping mistake we see in enterprise AI agent projects: clients ask for pattern 02 (multi-agent) when pattern 01 + a better prompt would have shipped in half the time. The supervisor/worker abstraction is seductive — it *sounds* rigorous — but every extra agent doubles the eval surface and adds 800-1500ms of latency per handoff. Default to pattern 01. Move up only when the eval set tells you to. Most enterprise AI agent deployments we audit land back on pattern 01 within 90 days.

007 / EVAL

## Four metrics on every agent we ship.

Most "agent" projects fail in production because nobody scoped what success looked like before writing code. We invert that. The eval set lands in week 2, before the first prompt is written.

94%

Task success rate

Did the agent complete the user's goal start-to-finish, scored against the eval set's expected outputs.

<2%

Hallucination rate

LLM-as-judge scoring with weekly human spot-check on disputed examples. Hard gate before production wire-up.

99.2%

Tool-call accuracy

Right tool, right args. Separately scored from end-task success because a wrong tool call can succeed at the wrong thing.

<2.4s

P95 latency

Measured across the full call chain including tool invocations. Voice agents target sub-400ms turn-taking. Budget reviewed weekly.

Numbers shown are illustrative target ranges for new engagements until eval data from production work is published.

EVAL GATES

The four gates aren't suggestions. All four must be green before we wire the agent into production traffic. Each has an explicit methodology, a target, and a fail-state — codified before the first prompt is written.

1.  01 Task success
    
    ≥94%
    
    Domain-expert graded eval set, 30–50 examples covering main flow plus edge cases. Re-graded weekly. Production traces sampled into the eval set monthly.
    
    If <90%, the agent doesn't ship. We revise the spec before retrying.
    
2.  02 Hallucination
    
    <2%
    
    LLM-as-judge with Claude Sonnet 4.6 scoring each output, then human spot-check on the 5% of outputs the judge marked disputed.
    
    If ≥3%, hard gate before production wire-up. No exceptions.
    
3.  03 Tool-call accuracy
    
    ≥99%
    
    Right tool + right args. Scored independently of end-task success because a wrong tool call can accidentally succeed.
    
    If <97%, the agent ships with a tool-confirmation step in front of write actions.
    
4.  04 P95 latency
    
    <2.4s
    
    Measured across the full call chain including tool invocations. Voice agents target <400ms turn-take. Budget reviewed weekly; regression alarm if breached for 24h.
    
    If breached for >72h, we revisit model routing or tool design.
    

Two methodology notes that matter. We use **LLM-as-judge with Claude Sonnet 4.6** as the default scorer because it produces the most consistent grades against human ground-truth on the eval sets we've shipped. Hallucination disputes (5-8% of outputs typically) get human spot-check by your domain expert — we never let LLM-as-judge stand alone for the hard cases. And the eval set **grows during production**: real traces sampled monthly into the eval set, with regression alarms when an upstream model change drops scores. The eval set we hand you on day 1 is not the eval set you have on day 365.

Eval and observability stack we deploy by default:

Langfuse Braintrust Promptfoo LangSmith Helicone Inspect AI

007b / SECURITY · COMPLIANCE · COST

## Security, compliance, and cost engineering.

Three concerns enterprise buyers always ask about before procurement. We address each one explicitly in the spec — not as a "we'll figure it out at the security review" promise.

### Security & guardrails

Defense in depth, not a single classifier. Every production agent ships with input filtering, output filtering, system-prompt isolation, and an adversarial eval set we re-run on every model swap.

-   **Input classifier** — Llama Guard 3 or a custom policy classifier blocks known prompt-injection patterns before they hit the planner.
-   **Structured output enforcement** — Pydantic / Zod schemas with retry on violation. Cuts most "agent decided to do something weird" failure modes.
-   **System-prompt isolation** — user content can never override system instructions. We test this with an adversarial eval on every deploy.
-   **Output filtering** — Llama Guard or Presidio on outbound responses for PII leakage, prohibited content, hallucinated tool calls.
-   **Tool confirmation** — write actions (send email, charge card, update CRM) gate behind a confirmation step unless tool-call accuracy is ≥99.5%.

### Compliance posture

Default posture covers most enterprise procurement bars. Regulated workloads (clinical, financial, EU) layer in additional controls — scoped into the SOW, not retrofitted at security review.

SOC 2 Type II

Audited annually · default posture

ISO 27001

Information security mgmt · default posture

HIPAA-aligned

PII-scrubbed prompts · BAAs · log redaction

GDPR / EU AI Act

EU residency · DPA · model-card disclosures

On-prem / VPC deployment available — Llama 4, Mistral, Qwen on your cloud via vLLM. Standard pattern for healthcare and defense-adjacent engagements.

### Cost engineering

Token cost is the second-highest line item on most production agents after engineering time. We model expected cost during discovery and cut it 40-70% on the average build through routing, caching, and batch APIs.

40–70%

Token-cost reduction

Via model routing on a typical mid-volume agent

92%

Cache hit rate

On stable system prompts using Anthropic / OpenAI prompt caching

5–10×

Batch API throughput

On overnight enrichment / classification workloads

-   **Model routing** — classifier routes by query complexity. Easy queries to GPT-5 mini or Claude Haiku at 1/20th the cost; hard ones to the frontier model. Quality holds via eval gate.
-   **Prompt caching** — Anthropic / OpenAI prompt caching on stable system prompts and tool definitions. 90%+ cache hit rate on most agents within two weeks of launch.
-   **Batch API for async** — overnight enrichment, classification, scoring. 50% cost cut vs sync API, 5-10× throughput.
-   **Token budget per request** — hard ceilings on context size and tool-call chain length. Outliers get circuit-broken, not silently bloated.

All three concerns share a pattern: the discipline is in the spec, not in the build. We name the threat model, the compliance posture, and the cost band during the discovery phase. The build executes against those targets — security and cost aren't add-on phases that happen after the agent works. They're how it gets to work.

008 / USE CASES

## Where teams have shipped agents.

Engagements anonymised. Industry and segment are real; metrics are real; brand names removed under standard NDA terms.

Use cases are organised by function — sales, support, ops, coding, research, voice — not by industry. The same plan/act/reflect loop ships to a B2B SaaS and a manufacturer; what changes is the integration surface and the eval anchor, not the agent shape. Below are six representative engagements: three flagship cases (full numbers), three additional function stubs (recent shipments where the metric narrative is short).

Sales

B2B SaaS · 11–50 emp

### Lead-qualification + outbound research agent

Pulls signals from LinkedIn, Crunchbase, the prospect's own website, and recent news. Scores fit against the ICP, drafts a personalised first-touch message citing the strongest 2 signals, and only hands off to an AE when the score crosses a tuned threshold. Replaced 2.5 SDR seats in the first six months. The AEs report higher-quality top-of-funnel and shorter first-call discovery.

0

SDR seats

Support

Health-tech · enterprise

### Tier-1 deflection agent

RAG over product docs and a redacted 18-month ticket archive. Resolves password resets, billing edits, and onboarding questions without any human touch. Clinical questions are escalated immediately to a human, with the agent's draft attached so the responder has full context. Cut p1 ticket volume by 38% over 90 days, with zero clinical false negatives in the eval set.

0 %

p1 ticket volume

Ops

Mfg · 200+ emp

### Invoice matching + AP routing agent

Reads PDF and scanned invoices, runs OCR + LLM extraction, matches against open POs in NetSuite, and routes to the correct approver via Slack with a structured summary. Exceptions go to the ops lead with an annotated diff explaining why the agent didn't match. ROI inside six months. The ops lead now handles the 8% of invoices that genuinely need judgment.

<6 months

in

Coding

Dev-tools SaaS · 50–200 emp

### Repo-aware code review agent as a PR gate

Wired into GitHub Actions on every PR. Pulls the repo's conventions, runs a critic loop on the diff, leaves inline review comments. Flagged a missed null check on 12% of merged PRs in the first month.

0 %

\+ issue catch rate

Research

Fin services · 1,000+ emp

### Regulatory research agent across 6 jurisdictions

Multi-step research over published regulations + the firm's internal interpretation memos. Cited outputs, refuses on out-of-corpus, escalates ambiguity to a compliance reviewer rather than guessing.

0

8 days → min per memo

Voice

Health-tech · enterprise

### Intake triage voice agent on LiveKit

Phone intake agent with sub-400ms turn-take. Asks the standard intake questions, escalates clinical-judgment cases to a nurse with full context. PII-scrubbed transcripts; HIPAA-aligned deployment.

p95 turn-take 320ms

Patterns across all six engagements: **the metric anchor was scoped in week 2**, before code; **the eval set grew during production** via sampled traces; **handoff included the runbook in the client's repo**, not in a doc somewhere. Outcome numbers are what your team measured at week 8 post-launch, not at deploy. The work that matters happens after the agent ships — picking an ai agents development company that stays for that work is the most underrated criterion in vendor selection. As an agentic AI company, we run post-launch eval reviews as part of the standard SOW, not as an add-on.

009 / ENGAGE

## Three ways to start.

Every AI agent development engagement is fixed-scope and fixed-duration. The first phase is small enough that stopping is a real option — about a third of our pilots end at the pilot for legitimate scoping reasons. That's a feature, not a bug. Cheap to discover the workload doesn't fit; expensive to discover it 12 weeks in.

Pilot · 2–4 weeks

Pilot · 4 weeks 4 phases

WEEK 1 Discovery

Workload map + eval surface scoped

Workload boundary signed off

WEEK 2 Spec

Stack picks + 30–50 graded eval examples

Eval examples agreed by your domain expert

WEEK 3 Prototype

First runnable agent + baseline scores

Baseline accuracy hit

WEEK 4 Demo

Demo + scoping memo for next phase

Build · 8–16 weeks

Build · 16 weeks 6 phases

WEEK 1–2 Discovery + Spec

Workload map, stack lock, eval scope

WEEK 3–6 Prototype

Runnable agent against eval set

Baseline accuracy hit

WEEK 6–10 Eval gates

Four metrics green vs target

All four green

WEEK 10 Deploy

Auth, observability, rollback drilled

WEEK 11–14 Iteration

Weekly eval review + prompt iteration

WEEK 15–16 Handoff

Runbook in your repo, ownership transferred

Multi-Agent + Voice · 10–20 weeks

Multi-Agent + Voice · 20 weeks 5 phases

WEEK 1–3 Discovery + Spec

Workload graph + per-agent eval surfaces

WEEK 4–8 Prototype

Supervisor + 2 workers + critic running

WEEK 9–14 Eval + Voice

Per-agent eval gates green; voice latency tuned

Per-agent success + routing accuracy

WEEK 15–18 Production

Telephony / SDK integration + observability

WEEK 19–20 Handoff

Multi-agent runbook + on-call rotation

01 Agent Pilot Fixed scope

2–4 weeks

### Pilot one agent, intake to live.

In scope

-   One scoped use case and workflow map
-   Eval framework with 30–50 graded examples
-   Working prototype against your real data
-   Demo, scoring report, and a recommendation memo for the next phase

Out of scope

-   Production deploy and integrations
-   Multi-agent orchestration
-   Voice / sub-400ms latency work

02 Custom Agent Build Fixed scope

8–16 weeks

### Production build with eval gates.

In scope

-   Everything in the Pilot
-   Production integrations — auth, observability, rate limits, fallback policies
-   Eval gates baked into the deploy pipeline (regression alarms enabled)
-   Four weeks of post-launch iteration with weekly eval runs
-   On-call runbook and ownership transfer

Out of scope

-   Open-ended Run engagements after week 16 (separate SOW)

03 Multi-Agent + Voice Fixed scope

10–20 weeks

### Multi-agent or voice systems.

In scope

-   Supervisor / worker / critic orchestration on LangGraph or CrewAI
-   Or voice agents with sub-400ms turn-taking on LiveKit / Pipecat / Vapi
-   Eval focus on per-agent success and inter-agent routing accuracy
-   Production wire-up including telephony or in-app SDK integration

04 Agent Rescue Fixed scope

4–6 weeks

### Diagnose and fix a struggling agent.

In scope

-   Trace + eval audit of the existing agent (tool-call accuracy, loop rate, p95 latency, cost per task)
-   Root-cause memo: prompt, planner, tool surface, retrieval, or evals — where it actually fails
-   Targeted rebuild of the failing layer with regression tests before swap-in
-   Handover with a sustainable eval gate so the next regression is caught in CI, not by users

Out of scope

-   Rewrite from scratch (becomes Custom Agent Build)
-   Migrations to a different orchestrator unless root-cause requires it

Want ongoing iteration after week 16? A **Run engagement** is a separate monthly SOW — typically one AI engineer half-time, weekly eval review, and a fixed iteration budget. We move you to Run only if the workload genuinely benefits from continued investment, which is roughly half of completed builds. As an agentic AI company we're built for this: custom AI agent development doesn't end at deploy.

010 / FAQ

## Common buyer questions about AI agent development.

If the answer you need isn't here, the contact form is faster than email — we triage same-day from an engineer.

How is AI agent development different from chatbot development?

Chatbots are **single-turn or short-turn conversational systems** with minimal autonomy. The user asks, the chatbot answers. State is small or none, tool use is minimal (usually a single RAG retrieval), and the eval anchor is "intent accuracy + answer relevance."

AI agents are **autonomous, goal-driven systems** that take multiple steps to complete a task. They plan, call tools, reflect on intermediate results, and decide their next move. State is rich and stateful loops survive crashes via checkpointing. The eval anchor is "task success + tool-call accuracy + latency budget."

The decision is rarely binary. Most failed projects we audit picked the wrong shape: a chatbot when the work needed autonomy, or an agent when a chatbot would have shipped in half the time. [Our flagship piece breaks down the seven dimensions](/blog/ai-agents-vs-chatbots/) that decide between them.

Do you work with our existing AI stack (Claude / GPT / Gemini / Llama)?

Yes. We're model-agnostic by default — we benchmark **at least two models against your eval set** before locking the choice. The leader on cost-adjusted quality wins, regardless of which vendor we'd default to.

-   **Hosted** — Claude 4/Sonnet, GPT-5, Gemini 3.0/Flash, Mistral hosted
-   **Self-hosted** — Llama 4, Mistral, Qwen on vLLM/TGI on your cloud
-   **Routing** — Production agents often route across 2–3 models by query complexity to cut cost 40–70% without quality loss

If you have an existing contract with a specific provider, we work within it. If you don't, we'll recommend the routing pattern that fits the workload.

Who owns the code, prompts, and eval sets at the end?

You do. Everything we ship transfers into your repository under the SOW:

-   All agent code (framework wrappers, tool definitions, integration glue)
-   All prompts in portable YAML — re-hostable on a different framework if needed
-   The eval set (30–50+ graded examples with criteria)
-   Infrastructure-as-code (Terraform / Pulumi) for the deployment
-   Runbook and on-call procedures

Paiteq retains zero rights to your prompts, eval data, fine-tuned weights, or domain examples. We keep the **engineering learnings** — patterns and methodologies — for our internal playbook. That's it.

How long does it take to ship a production AI agent?

An Agent Pilot ships in 2–4 weeks. A Custom Agent Build with eval gates, integrations, and observability runs 8–16 weeks. Voice agents and multi-agent systems are longer because of latency tuning and orchestration complexity. We always scope a fixed-duration first phase so you can stop or scale up after seeing the prototype.

What frameworks do you build on, and how do you choose?

We default to LangGraph for stateful agents that need explicit graph control, CrewAI for multi-agent supervisor / worker patterns, Vercel AI SDK or the OpenAI Agents SDK for simpler tool-calling, and Composio when the tool surface is large and pre-built integrations matter. Framework choice follows the workload, not the other way around. We do not have a house framework we push regardless of fit.

How is an AI agent different from a chatbot?

Chatbots are single-turn, stateless, scripted by intent maps, and measured on intent classification accuracy. Agents are multi-turn, stateful, goal-driven, use tools autonomously, and measured on task success rate. Picking the wrong one is the most common scoping mistake — we cover this in detail in our piece on AI agents vs chatbots.

How do you measure agent quality and prevent hallucinations?

Every agent ships with an eval set scoped during discovery — usually 30 to 50 graded examples covering the agent's main cases and the edge cases that worry the business. We track task success rate, tool-call accuracy, hallucination rate (via LLM-as-judge plus human spot-check), and p95 latency. Eval runs weekly post-deploy. If any metric drops more than 5%, a regression alarm fires and the build is paused.

Who owns the IP, code, and prompts?

You do. All code, prompts, eval sets, and architecture diagrams are delivered into your repository under a transfer-of-ownership clause in the SOW. We retain no rights to your prompts or data. Paiteq keeps non-identifying engineering learnings — frameworks, patterns, eval methodologies — for our internal playbook.

How do you handle security, PII, and compliance?

Default posture is SOC 2 Type II and ISO 27001 aligned. We can deploy fully on your cloud (AWS, GCP, Azure) with no data leaving your perimeter, run prompt-level PII scrubbing via Presidio or your existing DLP, and use private-link endpoints to model providers where required. HIPAA, GDPR, and SOC 2 evidence work is included in regulated engagements.

Can we start with a pilot and scale to production?

Yes. The Pilot is designed to graduate into a Custom Build — eval framework, prompts, and architecture carry forward. About 70% of pilots we run convert to a production engagement. The 30% that don't either pivoted scope based on what the pilot revealed, or decided the workflow wasn't yet ready for AI. Both are valid outcomes.

What's the typical team shape on an engagement?

One AI engineering lead, one senior AI engineer, and a fractional product manager for scope and stakeholder management. Multi-agent and voice projects add a second AI engineer. We run two-week iteration cycles with a weekly demo. You always have a direct Slack channel with the build team — no account-management buffer.

011 / Related practices

## Adjacent services.

[

RAG DEVELOPMENT

RAG Development

Retrieval-augmented generation systems with evaluation built in.

](/services/rag-development/)[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

AI WORKFLOW AUTOMATION

AI Workflow Automation

Intelligent workflows on n8n, Make, and custom agent orchestration.

](/services/ai-workflow-automation/)

012 / Start a project

## Let's *build* something that ships.

Pilot in 2–4 weeks. Custom build in 8–16. Same-day response on every inbound.

[Talk to engineering](/contact/) [Architecture review](/contact/?topic=arch-review)


---

## SECTION: 4.2. Service: rag-development

_Source: https://www.paiteq.com/services/rag-development/_

# RAG Development Services — Paiteq

> Paiteq delivers RAG development services — production retrieval pipelines on Pinecone, Qdrant, Weaviate, pgvector. Hybrid search, citation enforcement, RAG eval built in.

**HTML version:** https://www.paiteq.com/services/rag-development/

## Key facts

- Vector stores: Pinecone, Qdrant, Weaviate, pgvector.
- Retrieval: hybrid (dense + BM25), reranking, citation enforcement.
- RAG eval (RAGAS-style) built into delivery.

## Related pages

- [LLM Development](https://www.paiteq.com/services/llm-development/)
- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

RAG Development

# RAG development services for systems that *cite their sources*.

Paiteq builds retrieval augmented generation systems from corpus audit to live retrieval — production RAG development services on Pinecone, Qdrant, Weaviate, and pgvector. Hybrid retrieval, reranking, and citation-enforced answers. Refusal when the context is thin, not confident guesses dressed up as evidence.

[Talk to engineering](/contact/) [See pipeline architecture](#architecture)

Stack LlamaIndex · Pinecone · Cohere

Eval Recall · Faithfulness · Latency

Engage Pilot · Build · Rescue · Migrate

Compliance PII-scrubbed · SOC 2

001 / SURFACES

## Eight RAG surfaces we ship.

Each surface below is a workload we've shipped to production — with the corpus shape, the eval methodology, and the failure modes already worked out. [See three in detail →](#use-cases)

RAG pipeline development sorts cleanly by corpus shape, not by industry. A contracts-Q&A pipeline for a law firm reuses the same hierarchical chunking, hybrid retrieval, and citation-enforcement prompts as an enterprise rag deployment over policy documents at a health-tech company. The integrations, freshness rules, and residency posture change — the retrieval pipeline shape doesn't. Sorting by corpus shape lets us reuse the eval harness and reranker tuning across clients. Sorting by industry, the way most competing pages organize themselves, hides where the engineering actually lives.

[

01 / DOCS ↗

Document Q&A

Ask your contracts, policies, manuals. Cited answers with paragraph-level provenance. Refusal when context is thin.

CitedRefusal

](#use-cases)[

02 / SUPPORT ↗

Support knowledge agent

RAG over docs + ticket archive. Deflects tier-1 cleanly, escalates with full context — not just a transcript dump.

Tier-1Tickets

](#use-cases)[

03 / SEARCH ↗

Enterprise semantic search

BM25 + dense hybrid retrieval across your full corpus. Answers, not ten blue links — and an llm knowledge base behind it.

HybridRerank

](#use-cases)[

04 / CODE ↗

Code & repo Q&A

Repo-aware RAG for engineering teams. Symbol-graph + dense hybrid. Answers carry file:line citations.

Repo-aware

](#use-cases)[

05 / LEGAL ↗

Contracts & compliance

Clause extraction, contract Q&A, redline review. PII-scrubbed at ingestion. A document intelligence services workload at its core.

PII-scrubbed

](#use-cases)[

06 / RESEARCH ↗

Research synthesis

Multi-hop RAG over papers, market reports, internal notes. Citations to primary sources only — never fabricated page numbers.

Multi-hop

](#use-cases)[

07 / EVAL ↗

Evaluation & re-ranking

Retrieval recall, faithfulness, context relevance. Measured weekly with RAGAS + Trulens, not eyeballed.

RAGASTrulens

](#eval)[

08 / FRESHNESS ↗

Live + fresh corpora

Incremental embeddings, delete-and-replace, change-data-capture wired into your source systems. Stale-doc rate as a first-class metric.

CDCIncremental

](#eval)

SURFACE × INDUSTRY

Where we've shipped each RAG surface. Strength reflects production volume, not theoretical fit — empty cells mean we either haven't done it yet or the workload didn't justify a retrieval pipeline.

Surface Industry

B2B SaaS

Health-tech

Mfg

Fin-tech

Legal

E-comm

Ed-tech

Logistics

Docs Q&A

Support / KB

Search

Code repo

Legal / contract

Research / memo

Docs Q&A

B2B SaaSHealth-techMfgFin-techLegalE-commEd-techLogistics

Support / KB

B2B SaaSHealth-techMfgFin-techE-commEd-techLogistics Legal

Search

B2B SaaSHealth-techMfgFin-techLegalE-commEd-tech Logistics

Code repo

B2B SaaSFin-techE-commEd-tech Health-techMfgLegalLogistics

Legal / contract

B2B SaaSHealth-techFin-techLegalE-commLogistics MfgEd-tech

Research / memo

B2B SaaSHealth-techFin-techLegalEd-tech MfgE-commLogistics

Possible fit Good fit Primary vertical

Heavier columns: legal, fintech, healthcare. The pattern is unsurprising — those are the industries where citations, refusal, and provenance pay back the hardest. Lighter columns: e-commerce and ed-tech, where workloads tilt toward generation and recommendation more than retrieval. The grid isn't a roadmap; it's a record. We'll talk you out of a RAG-shaped engagement if the workload is actually a generation problem (better fit for [our generative AI practice](/services/generative-ai/)) or a tool-use problem (better fit for [autonomous tool-use agents](/services/ai-agent-development/)).

One more pattern worth naming: about 25% of the engagements that start as RAG end up being rag consulting work instead. The client comes in wanting a built pipeline; the corpus audit reveals that the source data isn't yet in a shape any pipeline can win against. The right next step is a six-week consulting engagement to get the source data structured — header conventions, OCR cleanup, deduplication, freshness pipeline — before any retrieval code ships. We do that work, sometimes hand it to your team, sometimes we run it ourselves. Either way, we say "not yet" on the build until the corpus is ready, because shipping a RAG against a broken corpus is how you end up calling Rescue six months later.

002 / SERVICES

## RAG development services — pick where to start.

Four engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done. [Full engagement-model breakdown below →](#engage)

Choosing a RAG development services partner is mostly about choosing the right starting shape. Buyers who walk in with a scoped corpus and one question type ship to production around 75% of the time. Buyers who walk in with "we want a RAG over everything we have" ship around 30% of the time, usually after a re-scope. The four shapes below map onto the starting points we see most often — pick the one that matches what you actually have, not what you wish you had. Every shape is a RAG development service we've shipped 10+ times; the deliverables and gate criteria are locked in by repetition, not invented for your engagement. Each starts by scoping the llm knowledge base boundary — what the system can and can't answer — before any ingest code ships.

[

01 / PILOT ↗

RAG Pilot

One corpus, one question shape, graded against a real eval set. Demo-ready in 2–4 weeks. Fixed scope.

2–4 wksFixed scope

](#engage)[

02 / BUILD ↗

Production RAG Build

Full pipeline — ingestion, hybrid retrieval, reranker, eval gates, observability. 8–14 weeks. Fixed scope.

8–14 wks

](#engage)[

03 / RESCUE ↗

RAG Rescue

Your RAG shipped but hallucinates, retrieves wrong, or can't keep up. We diagnose, fix, re-evaluate against your real data.

4–6 wksAudit + fix

](#engage)[

04 / MIGRATION ↗

Vector DB Migration

Pinecone → Qdrant, Weaviate → pgvector, etc. Dual-write, eval-validated cutover, zero downtime. 6–10 weeks.

6–10 wks

](#engage)

A practical decision tree: if the corpus is scoped but the retrieval quality is unproven, start with a **Pilot**. If you know the corpus works and you need production discipline (CDC, observability, eval gates), start with a **Production RAG Build**. If you've already shipped RAG and it hallucinates, retrieves wrong, or can't keep up, start with **Rescue**. If the only thing wrong with the current system is the vector store underneath it, start with **Migration**. Each of these is a real RAG development service we run on a fixed scope, and our rag consulting practice will tell you to start narrow — moving up is cheap; over-scoping isn't. [Week-by-week scope on each, further down →](#engage)

003 / STACK

## The retrieval stack: vector DBs, embedders, rerankers.

Stack choices follow the workload, not house preferences. Vector store, embedding model, and reranker are all benchmarked against your real eval set before we lock anything in.

-   LlamaIndex
-   LangChain
-   Pinecone
-   Qdrant
-   Weaviate
-   pgvector
-   Chroma
-   Cohere Rerank
-   Voyage
-   OpenAI Emb.
-   RAGAS
-   Trulens
-   BGE
-   Langfuse
-   Presidio
-   Unstructured
-   LlamaIndex
-   LangChain
-   Pinecone
-   Qdrant
-   Weaviate
-   pgvector
-   Chroma
-   Cohere Rerank
-   Voyage
-   OpenAI Emb.
-   RAGAS
-   Trulens
-   BGE
-   Langfuse
-   Presidio
-   Unstructured

VECTOR DATABASE PICKS

For each store: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production rag solutions on every one of these — the "when we don't" lines come from real builds, not theory. Vector database services selection isn't a one-shot decision; we benchmark two candidates against your eval set on every Production Build.

Pinecone

Strengths

Managed, serverless, predictable. Sub-100ms p95 on most workloads up to ~100M vectors. Hybrid search and namespaces built in.

When We Pick

Mid-size corpora (5M–100M vectors) where the team has zero appetite to run infra. SaaS clients without a platform team. Anywhere predictable cost and SOC 2-ready hosting matter more than knob-tuning.

When We Don't

Cost-sensitive workloads above 100M vectors (the unit economics flip vs Qdrant self-hosted). Workloads needing exotic distance metrics or custom HNSW parameters Pinecone doesn't expose.

Paiteq Pattern

Default for clients without a platform team. We benchmark it against Qdrant on every Production Build, but Pinecone wins ~60% of those head-to-heads on operational simplicity.

ManagedHybridServerless

Qdrant

Strengths

Open-source with a Rust core. Excellent performance per dollar self-hosted. Strong filtering, payload indexing, and quantization options.

When We Pick

Self-hosted requirements (residency, regulated workloads). Large corpora where the dedicated infra cost beats managed pricing. Anywhere we need scalar / product quantization to cut RAM 4×.

When We Don't

Tiny corpora (<1M vectors) — the operational tax doesn't pay off. Teams with no Kubernetes capacity.

Paiteq Pattern

Our default for enterprise rag deployments on AWS / GCP where data residency matters. We run Qdrant Cloud for managed convenience and self-hosted for regulated workloads — same API.

Self-hostedQuantizationFilterable

Weaviate

Strengths

Hybrid search and modular reranking are first-class. GraphQL API. Built-in multi-tenancy. Optional ML modules (vectorizer, reader).

When We Pick

Hybrid search heavy workloads where rerank logic and ranking modules need to ship with the store. Multi-tenant SaaS where each customer gets isolated namespaces.

When We Don't

Simpler workloads where pgvector or Pinecone covers it. Teams allergic to GraphQL — Weaviate's REST is fine but the docs lean GraphQL.

Paiteq Pattern

We reach for Weaviate when the client's eval set shows hybrid search lifts recall by ≥10 points over pure vector. Modular reranker plug-ins shorten the build by a week or two.

Hybrid-firstGraphQLMulti-tenant

pgvector

Strengths

Postgres extension. No extra system to operate. Joins between vectors and structured data. HNSW + IVFFlat indexes.

When We Pick

Corpora that fit on one Postgres node (typically <5M vectors). Teams that already run Postgres and don't want another service. Apps where vector + relational filtering matter together.

When We Don't

Above ~5M vectors with high QPS — index rebuild times and connection pool pressure start hurting. Workloads with very high write churn.

Paiteq Pattern

Default for early-stage SaaS clients. We migrate to Qdrant or Pinecone the first time index rebuild crosses 90 seconds on production data — that's the usual tell.

Postgres-nativeJoinsHNSW

EMBEDDERS & RERANKERS

Embedder choice usually matters more than vector store choice for retrieval recall — and it's the easier thing to swap later. A reranker on top of hybrid retrieval is the single highest-ROI add on most of the RAG pipeline development work we do. Hybrid search implementation, when it earns its keep on the eval set, lifts top-k precision by another 8–18 points.

OpenAI text-embedding-3-large

Strengths

Strong general-purpose baseline. 3072 dims (truncatable to 1536 / 512 / 256 via MRL). Stable API.

When We Pick

Default for English-first corpora and mixed-domain workloads. When the client has an OpenAI contract already and procurement won't add a vendor.

When We Don't

Multi-lingual corpora — Voyage and BGE win our evals there. Workloads where data residency rules out hosted embedding APIs.

Paiteq Pattern

Our day-one baseline; we benchmark against Voyage and BGE on every Production Build. About 40% of builds end up swapping to one of those.

3072 dimsMRL

Voyage-3 / voyage-large

Strengths

Consistently tops the MTEB leaderboard for our evals. Strong long-context (32k) embedding. Per-domain models for code, finance, law.

When We Pick

Long documents where 32k context preserves more signal than chunking. Domain-specific corpora (legal, finance, code) where Voyage's domain models lift recall 5–12 points.

When We Don't

Hyper-cost-sensitive batch workloads — OpenAI's batch pricing wins by ~30%. Workloads requiring on-prem embedding (use BGE instead).

Paiteq Pattern

Default for legal and clinical RAG. We've shipped contract-Q&A systems where Voyage-law improved retrieval recall from 78% to 91% over the OpenAI baseline.

MTEB topLong-contextDomain

BGE / GTE (self-hosted)

Strengths

Open-weights. Run on your own GPU or CPU. BGE-large-en is the strongest open embedder we've benchmarked. Free at scale.

When We Pick

Regulated workloads where no data can leave the perimeter. Very high-volume batch embedding where hosted API cost dominates. BAAI-licensed deployments.

When We Don't

Tiny corpora — the GPU isn't worth it. Teams without infrastructure to run an embedding service. Multilingual workloads where Voyage still wins.

Paiteq Pattern

We run BGE on a small CPU pool for healthcare and finance clients with strict residency. Throughput tuning matters — batch size and ONNX runtime get most of the gains.

Open-weightsSelf-hostedBAAI

Cohere Rerank 3

Strengths

Cross-encoder reranker that lifts top-k precision by 8–18 points on our evals over pure vector. Hosted API, sub-200ms.

When We Pick

Almost every Production RAG Build. Reranking on top-50 vector hits down to top-5 is the single highest-ROI add we make.

When We Don't

Very tight latency budgets (Cohere adds ~150–200ms). Pure offline batch where the reranker cost stacks up. Workloads where a self-hosted bge-reranker fits the same role for free.

Paiteq Pattern

Default. We benchmark Cohere Rerank 3 vs bge-reranker-large head-to-head on the eval set; Cohere usually wins but bge-reranker self-hosted is the strong runner-up.

RerankCross-encoderHosted

Two patterns worth flagging on every RAG engagement. First, **we benchmark two embedders against the eval set** before locking the stack — usually OpenAI text-embedding-3-large against a domain-specific Voyage model. The eval set decides, not the leaderboard. Second, **we default to including a reranker** on every Production Build. The 150–200ms latency tax is almost always worth the 8–18 point precision lift. Skip it only when latency budget rules it out — voice-RAG, sub-700ms response targets. Our deeper take on chunking — including when fixed-size beats semantic — lives in our [chunking strategy deep dive](/blog/chunking-strategies-deep-dive/).

CHUNKING DECISIONS

Chunking strategy is the choice that decides retrieval recall more than embedder choice does, and it's the one most teams get wrong on a first build. Three patterns cover ~90% of what we ship.

#### 01 · Fixed-size + overlap

500–1,000 token chunks with 10–20% overlap. Default starting point for narrative text — support docs, knowledge bases, marketing-style content. Fastest to ship; usually the baseline every other chunker has to beat.

**When it works:** uniform-density text where paragraphs and sections are short. **When it fails:** structured documents (contracts, manuals) where headers carry semantic weight.

#### 02 · Recursive + title-aware

Splits at heading boundaries first, then by paragraph, then by sentence — preserving document structure in the chunk. Hierarchical: each chunk knows its parent section. Our default for legal, regulatory, technical documentation.

**When it works:** structured docs with consistent heading conventions. **When it fails:** badly OCR'd PDFs where the heading detection breaks — falls back to fixed-size with a recall penalty of 8–15 points.

#### 03 · Semantic / embedding-based

Splits on cosine-distance jumps between sentences — chunks form where the topic shifts. Higher ingest cost (one embed per sentence pair), but lifts retrieval recall 5–12 points on conceptually dense content like research papers and analyst memos.

**When it works:** long-form analytical writing, research synthesis, ed-tech curriculum. **When it fails:** fragmentary content (chat logs, ticket threads) where the semantic signal is noisy.

We benchmark two chunkers head-to-head against your eval set on every Production RAG Build — typically recursive + title-aware vs semantic — and the winner is locked at week 4. The losing strategy stays in the codebase as a fallback for sources that don't fit the primary pattern (badly OCR'd PDFs, mixed-format archives).

004 / PROCESS

## Six steps to build a RAG pipeline — eval-first, every time.

The same process runs across a 4-week Pilot and a 14-week Production Build. The gates change in depth, not in shape. Every step has an explicit deliverable, a named owner, and a gate criterion — pass or rework, no "we'll figure it out next sprint."

WEEK 1

### Corpus audit

What you've got, in what shape, how fresh, what's allowed to leave your perimeter. We don't write any ingest code until this is signed off.

WEEK 2

### Eval set

30–80 graded questions with reference answers + the supporting passages each answer should retrieve. Lands before any pipeline code.

WEEK 2–4

### Ingestion

Chunking strategy, embedding model selection, store choice, freshness pipeline. Each choice benchmarked, not picked by vibe.

WEEK 4–6

### Retrieval

Hybrid (BM25 + dense) plus rerank. Query rewriting where it earns its keep. Tuned against your eval set, not a public benchmark.

WEEK 6–8

### Eval gates

Retrieval recall, answer faithfulness, context relevance, hallucination rate, p95 latency — all green before any production wire-up.

ONGOING

### Running

Weekly eval, freshness monitoring, prompt + chunking iteration based on production logs. The eval set grows from sampled traces.

01

### Corpus audit

We walk every source in your corpus before anyone proposes a chunking strategy. That means listing every source (Confluence, SharePoint, Postgres tables, S3 PDFs, the legacy DMS no one talks about), the rough doc count, the freshness cadence per source, the access policy, and what counts as in-scope vs out. Week-1 output is a corpus map: sources, volumes, formats, access patterns, and what's allowed to embed off-perimeter.

OwnersPaiteq RAG engineer + your data / IT owner. ~5 hours of their time spread across the week.

GateCorpus boundary signed off. If a source is fuzzy on access policy, we exclude it from the pilot — better to start narrow than discover a residency violation in week 7.

02

### Eval set

30–80 graded questions, each with a reference answer and the passage(s) the retriever should pull. Your domain expert grades; we facilitate. Realistic edge cases matter more than easy ones — the eval set is what tells us when chunking strategy needs to change, so it has to surface real failure modes. We build it before any retrieval code ships.

OwnersYour domain expert (~8 hours) + Paiteq engineer facilitating. We've never had a RAG engagement where the eval set wasn't the single most valuable week of work.

GateEval examples graded. If your team can't agree on a reference answer for an example, the spec isn't done — that's a scoping bug, not an eval bug.

03

### Ingestion

Chunking strategy, embedder choice, vector store choice, and the freshness pipeline. Each gets a small head-to-head against the eval set — 2–3 chunking strategies (fixed-size / semantic / title-aware), 2 embedders, sometimes 2 stores. The winner isn't decided by a benchmark blog post; it's decided by your eval set.

OwnersPaiteq RAG engineer; weekly demo of intermediate scores to your team.

GateBaseline retrieval recall hit on the eval set. Below baseline, we revise chunking or embedder before moving to retrieval logic.

04

### Retrieval

Hybrid retrieval (BM25 + dense, scored together), reranker layer (Cohere Rerank 3 or bge-reranker), query rewriting (HyDE, multi-query) where it lifts scores. Each layer is added only if it earns its keep on the eval set. We've shipped RAG systems where the reranker alone moved hallucination rate from 11% to 2%.

OwnersPaiteq RAG engineer; weekly score reviews with your team.

GateRetrieval recall + answer faithfulness both above target on the eval set.

05

### Eval gates

Five thresholds all green before any production wire-up: retrieval recall (did we pull the right passages?), context relevance (are they on-topic?), answer faithfulness (is the answer grounded?), hallucination rate (LLM-as-judge + human spot-check), and p95 latency. Hallucination disputes get human spot-check from your domain expert — we don't let LLM-as-judge stand alone on the hard cases.

OwnersPaiteq RAG engineer + your domain expert on the human spot-check.

GateAll five metrics green or the build doesn't deploy. Period.

06

### Running

Four weeks of post-launch iteration are part of every Production RAG Build. Weekly eval runs, freshness drift checks (stale-document rate as a first-class metric), prompt + chunking iteration on edge cases. The eval set grows from sampled production traces every month; regression alarms fire when an upstream model change drops scores by >5 points.

OwnersPaiteq RAG engineer (decreasing % of time) + your team taking over ownership.

GateOngoing — weekly eval review continues for the duration of the engagement.

Two things that matter across every RAG system implementation we ship. **The eval set lands in week 2**, before any retrieval code. The eval set is what tells us whether chunking strategy is wrong, whether the embedder needs to change, whether the reranker is earning its latency budget. Without it, you're tuning blind. **Running is a real phase**, not an afterthought. The first 4 weeks post-launch are part of every Build engagement — weekly eval review, freshness checks, prompt iteration on edge cases, regression alarms wired to your on-call. A rag consultant in the room at week 2 almost never sees the hallucination complaint that arrives at week 12.

005 / DECISION

## RAG vs. fine-tuning — when do you need which?

The most common scoping question we get. Most production systems use both: RAG for facts, a small fine-tune for style or domain vocabulary. Picking which to lean on first decides the engagement shape — and our rag consulting practice runs this conversation at week 1 of every project.

Fine-tuning

RAG

Grounds in

Static training data

Your live corpus

Freshness

Frozen at training

As fresh as your CDC pipeline

Setup cost

$$$$ — full fine-tune run

$$ — chunk, embed, index

Fine-tune runs on GPT-4o or Llama-3 70B easily reach **$2,000–$8,000 for a first pass** (compute + data prep + eval iteration). RAG infrastructure — chunking, embedding, and indexing a corpus — is typically under $500 for an initial pilot. The cost delta widens when the corpus changes: RAG re-indexes incrementally; fine-tuning re-runs the whole job.

Latency

Lower per turn

+200–600ms (retrieval + rerank)

Fine-tuning genuinely wins on latency: all knowledge is baked into weights, so the forward pass is the only cost. The RAG overhead is real — **+200–600ms is mostly the rerank stage** (a cross-encoder scoring 50 candidates), not the ANN lookup itself (typically under 20ms on Qdrant or pgvector). For voice-RAG with a sub-700ms end-to-end budget, you often skip the reranker or run a self-hosted [bge-reranker-large](/services/rag-development/) on the same box as the embedder.

Hallucination

Higher — no grounding

Lower — refusal is possible

RAG's hallucination advantage comes from two mechanisms: retrieved passages anchor the generation, and the system can **refuse when no retrieved chunk clears the relevance threshold**. Fine-tuned models have no equivalent circuit-breaker — if the question pattern resembles training data, the model will produce a fluent-sounding answer regardless of factual grounding. For regulated domains (legal, healthcare, financial), that refusal capability is often non-negotiable.

Best for

Style, voice, output format

Facts, lookups, citations

Eval surface

Output quality only

Recall + faithfulness + answer

A richer eval surface sounds like extra work, but it's actually **a debugging advantage**. When a RAG system regresses, recall vs faithfulness vs answer-quality scores tell you exactly where the pipeline broke. A fine-tuned model that degrades gives you one signal — output quality — and root-causing that to data, prompt, or model is harder. Most teams we work with under-invest in RAG evals initially and then thank themselves for the instrumentation at week 6.

Compose with

RAG, prompting

Fine-tune, prompting

Full breakdown — [when RAG beats fine-tuning](/blog/when-rag-beats-fine-tuning/)

Rule of thumb: if the answer needs to **cite its source or stay fresh**, you want RAG. If the answer needs to **speak in your voice or your domain's jargon**, you want fine-tuning. Anything in between, the decision tree below walks four diagnostic questions — most projects fit cleanly into one of five outcomes.

DECISION TREE

Answer four questions about the workload. We've used these same questions to right-size scope on every RAG engagement we've run.

Path

Question

Pick one

Result

006 / ARCHITECTURE

## Four production RAG patterns.

Most production retrieval augmented generation services reduce to one of four patterns. The taxonomy isn't ours — it's standard across the LlamaIndex and LangChain communities — but the *deployment choices* are where engineering judgment lives: when to pick which, what fails first, which eval metric becomes the anchor.

Pattern choice matters more than store choice or embedder choice for production retrieval recall. We've watched teams agonize over Pinecone vs Qdrant while running pattern 01 (naive RAG) on a corpus that needed pattern 02 (hybrid + rerank). The store decision was worth 1–3 points of recall; the pattern decision was worth 14. When you build rag system architecture, start with the pattern question — "what shape of retrieval does this corpus need?" — then pick the store and embedder that fit. Reverse that order and you'll rebuild at week 9, which is the call we get on most Rescue engagements. The LLM knowledge base scope — what the system answers vs refuses — belongs in week 2.

   
01

### Naive RAG

The simplest pipeline. Query gets embedded, top-k passages come back from the vector store, the LLM generates an answer grounded in them. No rewriting, no rerank, no reflection. About 20% of our pilots ship in this shape — usually when the corpus is small and well-structured, and the question shape is narrow enough that recall is naturally high. Don't reach for the fancier patterns until the eval set says you need to. The naive pipeline ships in days, not weeks, and it's the baseline every other pattern has to beat.

Pick when

-   Corpora under ~1M vectors. Narrow, well-defined question shape. Retrieval recall already above 85% on the eval set with a single-strategy retriever. Pilots where the cost of complexity is higher than the precision lift.

Skip when

-   Recall is below ~80% — try hybrid before adding rerank. Hallucination rate above target. Multi-hop questions that need iterative retrieval. Voice-RAG where you can afford the rerank latency.

Stack

LlamaIndexpgvectorOpenAI text-embedding-3-largeClaude Sonnet 4.6

02

### Hybrid + Rerank

Our default for Production RAG Builds. Sparse retrieval (BM25) catches keyword matches that dense vectors miss; dense retrieval catches semantic matches BM25 misses; a reranker (Cohere Rerank 3 or bge-reranker-large) takes top-50 down to top-5 with a precision lift of 8–18 points on most evals. The latency cost is real (~150–200ms for the rerank step), but on every Production Build except voice-RAG it's worth paying. Hybrid search implementation is what separates a production-ready pipeline from a demo; about 60% of our production rag solutions land here.

Pick when

-   Mixed question types — some lookup-style, some semantic. Corpora where keyword precision matters (legal, code, technical docs with exact terms). Anywhere recall is the bottleneck on the eval set.

Skip when

-   Voice-RAG with sub-700ms latency budgets (use self-hosted bge-reranker or skip rerank entirely). Tiny corpora where naive RAG already hits target. Workloads where the BM25 side adds zero recall — measure before adding.

Stack

LlamaIndexQdrantVoyage-largeCohere Rerank 3Claude Sonnet 4.6

03

### Multi-step RAG

When a single retrieval pass isn't enough — multi-hop questions, ambiguous queries, sparse corpora where the first retrieval misses. The agent rewrites the query (HyDE or multi-query reformulation), retrieves, reflects on whether the retrieved context is sufficient, and loops back to retrieve again if it isn't. Lift on multi-hop questions is 12–25 points of faithfulness; cost is a 1.5–2.5× latency tax. Worth it for research workloads, regulatory Q&A, anything where one retrieval pass isn't going to cover the question.

Pick when

-   Multi-hop questions where the answer needs evidence from passages that aren't co-located in the corpus. Research synthesis workloads. Regulatory Q&A across jurisdictions. Anywhere the eval set shows the second-most-likely answer is correct more often than the first.

Skip when

-   Latency-sensitive paths. Most support / Q&A workloads where the question is narrow enough for single-pass. Cost-sensitive workloads — each loop is another LLM call.

Stack

LangChainWeaviateVoyage-3Cohere RerankGPT-5

04

### Agentic RAG

Used when the workload needs retrieval as part of a broader autonomous task, not as the whole task. The agent treats the vector store as a tool it can call when grounding is needed — not as a fixed pre-step before generation. Right when context grounding matters but autonomy depth matters more: clinical decision support, contract review with multi-document reasoning, regulatory research across jurisdictions. Eval anchors shift: retrieval recall and answer faithfulness still matter, but task success rate becomes the headline. Agentic RAG sits at the boundary of this pillar and <a href="/services/ai-agent-development/">autonomous tool-use agents</a>; when the autonomy is the bigger half of the workload, the engagement usually lives on that pillar instead.

Pick when

-   Workload that's mostly autonomous task execution with retrieval as one tool among several. Multi-document reasoning where the agent has to decide what to retrieve next. Clinical or compliance workflows where citation enforcement and judgment both matter.

Skip when

-   Pure retrieval Q&A — agentic adds 800–1500ms per loop turn for no gain. Latency budgets that don't tolerate iterative retrieval. Workloads where the agent's tool surface is mostly non-retrieval.

Stack

LangGraphPineconeBGE-reranker-largeClaude Sonnet 4.6

A common scoping mistake on enterprise rag projects: clients ask for pattern 03 (multi-step) when pattern 02 + a better embedder would have shipped in half the time. Each retrieval loop doubles latency cost and the eval surface widens. Default to pattern 02 (hybrid + rerank). Move up only when the eval set tells you to. About a third of "we need multi-step RAG" requests we audit end up landing back on pattern 02 once the hybrid search implementation is properly tuned — the deeper take on that call lives in our [hybrid retrieval breakdown](/blog/hybrid-search-vs-pure-vector/).

FAILURE MODES BY PATTERN

Most RAG systems we audit fail on the same handful of issues — and the symptoms line up with the pattern in use. Quick triage list:

01 · Naive RAG

-   **Low recall (<75%):** chunks too large, embedder weak on the domain, or BM25-eligible queries hitting only dense.
-   **High latency:** rare; usually the LLM, not retrieval. Check token count first.
-   **Wrong-page citations:** chunk boundaries broke mid-section; switch to recursive + title-aware.

02 · Hybrid + Rerank

-   **Reranker latency spike:** Cohere rate-limited or self-hosted bge-reranker over-loaded; cache or batch.
-   **Hybrid weighting wrong:** sparse pulling too much noise; tune the BM25/dense ratio against your eval set.
-   **Domain drift:** embedder trained on general English failing on legal / clinical vocabulary; swap to Voyage-domain.

03 · Multi-step

-   **Loop never terminates:** reflection prompt unable to recognise "good enough"; add a max-iterations cap + scoring gate.
-   **Cost blow-up:** each loop is another LLM call; budget per query, circuit-break above ceiling.
-   **Latency tax:** 2–3× over single-pass; only worth it when faithfulness lift on the eval set crosses 12+ points.

04 · Agentic RAG

-   **Agent retrieves too much:** tool-call budget unconstrained; cap retrievals per loop turn.
-   **Citations drift:** agent paraphrases passages losing provenance; force structured-output citations.
-   **Scope creep into agent territory:** if 80%+ of work is non-retrieval tool-use, the engagement belongs on the [AI agent practice](/services/ai-agent-development/) instead.

007 / EVAL

## Four eval dimensions on every RAG we ship.

Generic LLM eval frameworks miss RAG-specific failure modes. We score retrieval and generation separately, then together — and the eval set lands in week 2 of every engagement, before the first chunk is embedded.

88%

Retrieval recall

Did the retriever pull the passages that contain the answer? Scored against gold-passages in the eval set.

94%

Answer faithfulness

Is every claim grounded in a retrieved passage? RAGAS + LLM-as-judge, with human spot-check on the disputed 5%.

<3%

Hallucination rate

Claims no retrieved passage supports. Hard gate before production. Refusal is preferred over guessing.

<1.6s

P95 latency

Query-to-answer latency across retrieve, rerank, and generate. Reranker is the usual bottleneck. Voice-RAG targets sub-700ms.

Numbers shown are illustrative target ranges for new engagements until production eval data from anonymised builds is published.

EVAL GATES

The four gates aren't suggestions. All four must be green before we wire any RAG into production traffic. Each has an explicit methodology, a target, and a fail-state — codified before the first chunk is embedded.

1.  01 Retrieval recall
    
    ≥88%
    
    Gold-passages in the eval set; recall@k for k=10 and k=5. Re-graded weekly. Production traces sampled into the eval set monthly.
    
    If <80%, retrieval logic gets rewritten — chunking, embedder, or hybrid weighting all back on the table.
    
2.  02 Answer faithfulness
    
    ≥94%
    
    RAGAS + LLM-as-judge (Claude Sonnet 4.6) scoring whether every claim is supported by a retrieved passage. Human spot-check on the 5% disputed by the judge.
    
    If <90%, citation-enforcement prompts get rewritten or the model gets demoted to refusal-only on low-confidence retrievals.
    
3.  03 Hallucination rate
    
    <3%
    
    Claims that no retrieved passage supports. Hard gate before production wire-up. Refusal is the preferred failure mode, not a guess.
    
    If ≥5%, we widen the refusal threshold and rerun. We've never shipped a RAG with hallucination above 3% on the eval set.
    
4.  04 P95 latency
    
    <1.6s
    
    Full query-to-answer latency across embed, retrieve, rerank, generate. Reranker is the usual bottleneck. Voice-agent RAG targets sub-700ms with bge-reranker self-hosted.
    
    If breached for >72h, we re-evaluate reranker placement or move to streaming generation.
    

Two methodology notes that matter. We use **LLM-as-judge with Claude Sonnet 4.6** as the default faithfulness scorer because it produces the most consistent grades against human ground-truth on the eval sets we've shipped. Hallucination disputes (typically 5–8% of outputs) get human spot-check by your domain expert — we never let LLM-as-judge stand alone for the hard cases. And the eval set **grows during production**: traces sampled monthly, regression alarms firing when an upstream model swap drops scores. An llm knowledge base scored weekly degrades at a fraction of the rate of one eyeballed quarterly.

Default eval and observability stack we deploy:

RAGAS Trulens Langfuse Promptfoo LangSmith DeepEval

007b / SECURITY · COMPLIANCE · COST

## Security, compliance, and cost engineering for RAG.

Three concerns enterprise rag buyers always ask about before procurement. We address each one in the spec — not as a "we'll figure it out at the security review" promise. Most RAG projects we rescue had at least one of these three left for later.

### Security & guardrails

Defense in depth for RAG, not a single classifier. Every production pipeline ships with PII scrubbing at ingest, citation enforcement at generation, and an adversarial eval set we re-run on every model or embedder swap.

-   **PII scrubbing at ingest** — Microsoft Presidio or your existing DLP runs on text before embedding. Embeddings store no raw PII by default; redaction tokens preserve structure where needed.
-   **Citation enforcement** — the LLM is prompted to ground every claim in a retrieved passage; outputs without citations get flagged or refused. We've shipped systems where 8–12% of queries get refused — clients prefer that over confident wrong answers.
-   **Prompt-injection defence** — Llama Guard 3 or a custom classifier on inbound queries. Retrieved passages get a separate isolation prompt so a poisoned doc can't override system instructions.
-   **Refusal threshold** — if no passage scores above a tuned floor, the answer is "I don't have a grounded answer for that." Refusal is a first-class output, not a degraded one.
-   **Output filtering** — Presidio on the LLM's response for PII leakage; we've caught models hallucinating Social Security numbers that weren't in the corpus more than once.

### Compliance posture

Default posture covers most enterprise procurement bars. Regulated workloads (clinical, financial, EU) layer in additional controls — scoped into the SOW at week 1, not retrofitted at security review in week 12.

SOC 2 Type II

Audited annually · default posture

ISO 27001

Information security mgmt · default posture

HIPAA-aligned

PII-scrubbed prompts · BAAs · log redaction

GDPR / EU AI Act

EU residency · DPA · model-card disclosures

On-prem / VPC deployment available — BGE embeddings, Qdrant self-hosted, Llama 4 / Mistral on vLLM. Standard pattern for healthcare, financial services, and defence-adjacent engagements where no data can leave the perimeter.

### Cost engineering

Embedding cost is usually the second-highest line item on production RAG after engineering time; LLM token cost is the highest. We model expected cost during corpus audit and cut it 40–70% on the average build through routing, caching, and quantization.

40–70%

Token-cost cut

Via routing easy queries to Haiku / 4o-mini after retrieval

85%

Cache hit

On stable system + retrieved-context prefixes (Anthropic prompt cache)

4×

Memory reduction

With Qdrant scalar / product quantization at <1% recall loss

-   **Model routing** — a classifier routes by query complexity. Easy lookups go to Haiku or 4o-mini at 1/20th the cost; hard ones to the frontier model. Faithfulness holds via the eval gate.
-   **Prompt caching** — Anthropic / OpenAI prompt caching on stable system prompts and retrieved-context prefixes. 85%+ hit rate on most agents within two weeks of launch — our [hosted vs self-hosted decision](/blog/hosted-vs-self-hosted-rag/) piece breaks down where the savings land.
-   **Quantization** — Qdrant scalar / product quantization cuts RAM 4× with under 1% recall loss on most corpora. The single highest-ROI infra optimization on large vector indexes.
-   **Batch embedding** — OpenAI / Voyage batch APIs for re-embedding and corpus refresh. 50% cost cut vs sync, 5–10× throughput. The default for any ingest run above ~100k docs.

All three concerns share a pattern: the discipline is in the spec, not in the build. We name the threat model, the compliance posture, and the cost band during corpus audit. The build executes against those targets — security and cost aren't add-on phases that happen after retrieval recall is green. They're how it gets there.

008 / USE CASES

## Where teams have shipped RAG.

Engagements anonymised. Industry and segment are real; metrics are real; brand names removed under standard NDA terms.

Use cases below are organised by corpus shape — contracts, tickets, research notes, code repos, regulations, equipment manuals — not by industry. The same hybrid retrieval and citation-enforcement pipeline ships to a law firm and a manufacturer; what changes is the chunking strategy, the freshness pipeline, and the embedder. Below: three flagship engagements (full numbers) plus three function stubs from recent ships. A semantic search implementation is the throughline on most of them; document intelligence services patterns recur across legal and ops.

Legal

Mid-market law · 80+ atty

### Contracts Q&A over 11 years of MSAs

Chunked 14,000 contracts with hierarchical headers (recursive + title-aware), hybrid retrieval, Cohere Rerank 3 down to top-5. Voyage-law embedder lifted recall from 78% to 92% over the OpenAI baseline. PII-scrubbed at ingest; the legal team grades the eval set monthly.

0 %

retrieval recall on the eval set

Support

Health-tech · enterprise

### Knowledge agent over 18 months of tickets

RAG over product docs and a redacted ticket archive. Refuses cleanly on out-of-corpus questions; escalates clinical to a human with the agent's draft and retrieved passages attached. p95 latency 1.4s; hallucination rate held at <2% across the post-launch quarter.

0 %

p1 ticket volume

Research

VC fund · 35-person

### Memo synthesis over public + private corpus

Multi-step RAG over SEC filings, press, and the fund's internal notes. Citations to primary sources only; the agent refuses gracefully when the corpus is thin on a target rather than synthesising plausible nonsense.

0 %

first-pass diligence ~

Code

Dev-tools SaaS · 50–200 emp

### Repo-aware code Q&A across a 1.8M-line monorepo

Symbol-graph indexing combined with chunked dense vectors. Engineering team asks 'where does this config flag get read?' and gets file:line citations with surrounding context. Stale-symbol rate stays below 2% via webhook-driven incremental re-indexing.

0

min → 40s per repo question

Compliance

Fin services · 1,000+ emp

### Regulatory Q&A across 6 jurisdictions

Hybrid retrieval over published regulations + the firm's interpretation memos. Citations to the underlying regulation always; refusal when jurisdictions disagree rather than averaging answers. Compliance team graded the eval set; faithfulness held at 96%.

0

8 days → min per memo

Ops

Mfg · 200+ emp

### Equipment-manual Q&A for the maintenance floor

RAG over scanned PDF manuals (OCR via Unstructured + Tesseract), pgvector on a single Postgres node — corpus was 4.2M vectors. Maintenance engineers ask in plain English from a tablet; answers cite manual + page number, with photos when present.

0 %

Mean diagnosis time -

Patterns across all six engagements: **the eval set landed in week 2**, before retrieval code; **the eval set grew during production** via sampled traces; **citation enforcement was the headline guardrail**, not an add-on. The outcome numbers are what each team measured at 90 days post-launch, not at deploy. The rag solutions that hold up at 90 days are the ones where the eval set was graded by a domain expert before the first chunk was embedded — picking a partner that stays for that work is the most underrated criterion in vendor selection.

009 / ENGAGE

## Four ways to start a RAG engagement.

Every RAG development services engagement is fixed-scope and fixed-duration. The first phase is small enough that stopping is a real option — about a third of our RAG Pilots end at the pilot for legitimate scoping reasons. Cheap to discover the corpus shape doesn't fit; expensive to discover it 12 weeks in.

RAG Pilot · 2–4 weeks

RAG Pilot · 4 weeks 4 phases

WEEK 1 Corpus audit

Corpus map + eval scope agreed

Corpus boundary signed off

WEEK 2 Eval set

30–80 graded examples + reference passages

Domain-expert grading complete

WEEK 3 Pipeline

Chunk + embed + retrieve baseline against eval

Baseline retrieval recall hit

WEEK 4 Demo + memo

Demo, scores report, next-phase recommendation

Production Build · 8–14 weeks

Production Build · 14 weeks 6 phases

WEEK 1–2 Corpus + eval

Corpus map, eval set, stack lock

WEEK 3–5 Ingestion

Chunking + embedder benchmarked + locked

Baseline retrieval recall hit

WEEK 5–8 Retrieval

Hybrid + rerank + rewrite tuned to eval set

Recall + faithfulness above target

WEEK 8–10 Eval gates

Five metrics green vs target

All five green or no deploy

WEEK 10–12 Deploy

Auth, observability (Langfuse), CDC pipeline live

WEEK 12–14 Iteration

Weekly eval review, runbook, ownership transfer

RAG Rescue · 4–6 weeks

RAG Rescue · 6 weeks 4 phases

WEEK 1 Eval audit

Current-system grading vs your eval set (we build one if absent)

WEEK 2 Failure-mode

Classified failures: chunking / retrieval / rerank / prompt / model

Failure breakdown reviewed

WEEK 3–5 Targeted fix

Each failure-mode addressed in order of recall lift expected

WEEK 6 Validation

Validated against your eval set; runbook updated

01 RAG Pilot Fixed scope

2–4 weeks

### Prove the corpus works.

In scope

-   One corpus, one question shape
-   Eval set with 30–80 graded examples
-   Working prototype against your real docs
-   Demo + recommendation memo for the next phase

Out of scope

-   Production deploy
-   CDC / freshness pipeline
-   Multi-corpus orchestration

02 Production RAG Build Fixed scope

8–14 weeks

### Full pipeline with eval gates.

In scope

-   All Pilot deliverables
-   Ingestion + CDC for freshness
-   Hybrid retrieval + reranker
-   Production wire-up, Langfuse observability, eval gates
-   Four weeks of post-launch iteration with weekly eval runs
-   On-call runbook and ownership transfer

03 RAG Rescue Fixed scope

4–6 weeks

### Diagnose and fix a struggling RAG.

In scope

-   Eval audit on the current system (we build an eval set if absent)
-   Failure-mode classification (chunking · retrieval · rerank · prompt · model)
-   Targeted fixes in order of expected recall lift
-   Validated against your eval set; runbook updated

04 Vector DB Migration Fixed scope

6–10 weeks

### Move stores, zero downtime.

In scope

-   Dual-write phase
-   Index parity checks against your eval set
-   Cutover playbook with rollback ready
-   Documented for handover

Two patterns worth flagging on RAG engagements specifically. **The eval set is the deliverable** — even more than the pipeline. A pipeline you can rebuild; an eval set is institutional knowledge about what your business considers a correct answer. We hand it over in your repo, with grading criteria documented. **About 70% of Pilots convert to Build engagements**. The 30% that don't either re-scoped based on what the Pilot revealed or decided the workflow wasn't yet ready for retrieval. Both are legitimate outcomes; we'd rather flag it at week 3 than at week 12.

WHO YOU WORK WITH

One Paiteq RAG engineering lead acting as your dedicated rag consultant, one senior RAG developer handling the retrieval pipeline, and a fractional product manager for scope and stakeholder management. On Rescue and Migration engagements we add a platform engineer for the index / CDC work. Two-week iteration cycles with a weekly demo. You have a direct Slack channel with the build team — no account-management buffer between you and the people doing the work.

On the client side, the engagement needs a **domain expert** to grade the eval set (~6 hours per week during weeks 1–3, then ~2 hours per week running) and an **IT or data owner** to clear access to source systems. We don't need a project manager on your side — we run that. We do need fast decisions on residency, scope boundaries, and acceptable refusal rates. If you're considering hiring a rag developer or rag consultant rather than a team engagement, the Pilot usually clarifies whether that's the right call.

010 / FAQ

## Common RAG questions.

RAG or fine-tuning — how do we decide?

Default to RAG. Fine-tune only when style, output format, or domain language can't be solved at the prompt + retrieval layer. They compose well: **most production systems use RAG for facts and a small LoRA fine-tune for output style**.

The clearest split: if the answer needs citations, freshness, or refusal-on-thin-context, RAG fits. If the answer is purely stylistic (tone, format, jargon the base model fumbles), fine-tuning fits. Hybrid is common — we scope both at week 2 of any Build.

The interactive picker above walks the decision in 3–4 questions. Our piece on [when RAG beats fine-tuning](/blog/when-rag-beats-fine-tuning/) has a deeper breakdown by workload type. Fine-tuning specifically lives in [our LLM fine-tuning practice](/services/llm-development/).

Which vector database should we pick?

Depends on five inputs: corpus size, residency requirements, ops capacity, latency budget, and whether hybrid search and reranking are first-class needs. Our usual call:

-   **pgvector** — corpora under ~5M vectors, team already runs Postgres. Joins between vectors and structured filters matter.
-   **Pinecone** — 5M–100M vectors, no ops appetite, SOC 2 hosted. Default for SaaS clients.
-   **Qdrant** — self-hosted residency requirements, very large corpora, want quantization to cut RAM 4×.
-   **Weaviate** — hybrid search heavy workloads, multi-tenant SaaS where each tenant gets a namespace.

We benchmark two candidates against your real eval set on every Production Build before locking. The eval set, not vendor marketing, decides. A vector database services capability is part of every engagement — selection isn't a one-shot.

How do you measure RAG quality?

Five dimensions, scored separately because they fail differently:

1.  **Retrieval recall** — did we pull the right passages? Scored against gold-passages in the eval set.
2.  **Context relevance** — are the retrieved passages on-topic, or off-topic noise that fits keyword-wise?
3.  **Answer faithfulness** — is every claim grounded in a retrieved passage? RAGAS + LLM-as-judge, human spot-check on the disputed 5%.
4.  **Hallucination rate** — claims with no retrieved support. Hard gate before deploy.
5.  **P95 latency** — query to final token. Reranker is usually the bottleneck.

The default eval stack is RAGAS + Trulens + Langfuse. The eval set grows from production traces every month, with regression alarms if any metric drops by >5 points. Our piece on [eval framework comparison](/blog/rag-eval-frameworks-compared/) covers when to reach for which tool.

What about freshness — our docs change every day.

Incremental ingestion with change-data-capture from your source systems. New and changed documents re-embed and replace in the store; deletes propagate. Stale-document rate is a first-class metric, not an afterthought — we track it weekly and alert above your tolerance. For Confluence / SharePoint / Notion, the standard pattern is webhook-driven; for Postgres or other DBs we use Debezium or the equivalent.

Can you migrate us off Pinecone (or any other store)?

Yes. Vector DB Migration is a fixed-scope engagement, 6–10 weeks depending on corpus size and the retrieval logic that has to come along. Dual-write phase first (writes go to both stores), then index parity checks against your eval set, then read cutover with rollback ready. We've shipped Pinecone → Qdrant migrations of 22M chunks with zero downtime and zero retrieval-recall regression.

How do you handle PII, residency, and compliance?

PII scrubbing happens at ingest via Microsoft Presidio or your existing DLP — embeddings store no raw PII by default. For regulated workloads we deploy fully on your cloud (AWS, GCP, Azure) with no data leaving the perimeter; the embedding model runs on dedicated GPU/CPU (BGE for residency-constrained clients). SOC 2 Type II and ISO 27001 are default; HIPAA-aligned and GDPR / EU AI Act postures are scoped into the SOW for regulated engagements.

How do you prevent hallucination in production?

Three layers. (1) Retrieval threshold: if no passage scores above a tuned floor, the agent refuses rather than guesses. (2) Citation enforcement: every claim points to a retrieved passage, and the LLM is prompted to flag claims it can't ground. (3) Faithfulness scoring: LLM-as-judge with Claude Sonnet 4.6 plus human spot-check on disputed cases. Refusal is a feature, not a failure mode — we've shipped systems where 8–12% of queries get refused and the business is happier with that than with confident wrong answers.

What does a RAG development services engagement cost?

Pilot is fixed-scope at 2–4 weeks; Production RAG Build is 8–14 weeks; Rescue is 4–6 weeks; Migration is 6–10. We hold the price band on the contact call rather than publishing here because corpus size, residency posture, and integration count swing it meaningfully. The Pilot is small enough that stopping is a real option — about a third of RAG Pilots end at the pilot for legitimate scoping reasons.

Do you build the eval set or do we?

Your domain expert grades; we facilitate. The eval set is the most important deliverable of the engagement and it has to reflect your business's failure modes, not ours. We bring the structure (30–80 examples, gold-passages, edge cases over easy cases), the tooling (RAGAS, Trulens, custom harness), and 4–6 hours of facilitation per week. Your domain expert grades the examples and signs off. After launch, we co-curate from sampled production traces monthly.

011 / Related practices

## Adjacent services.

[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

AI CONSULTING

AI Consulting

AI strategy, audits, roadmap.

](/services/ai-consulting/)

012 / Start a project

## Let's *ground* your AI in real data.

RAG Pilot in 2–4 weeks. Production Build in 8–14. Rescue in 4–6.

[Talk to engineering](/contact/) [Architecture review](/contact/?topic=arch-review)


---

## SECTION: 4.3. Service: llm-development

_Source: https://www.paiteq.com/services/llm-development/_

# LLM Development Services & Company — Paiteq

> Paiteq is an LLM development company shipping production LLM development services on Claude, GPT-5, Llama 4. Fine-tuning, eval, cost engineering, observability.

**HTML version:** https://www.paiteq.com/services/llm-development/

## Key facts

- Models: Claude, GPT-5, Llama 4.
- Practice areas: fine-tuning, eval design, cost engineering, observability.

## Related pages

- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

LLM Development

# Production llm development services on *Claude · GPT-5 · Llama 4*.

Paiteq is an llm development company shipping custom language model applications with evaluation built in — hosted or self-hosted, with auth, observability, cost guardrails, and model routing wired before the first user lands. Not a notebook with a chat box.

[Talk to engineering](/contact/) [See engagement shapes](#engage)

Models Claude · GPT-5 · Llama 4 · Mistral · Qwen

Practice App · Fine-tune · Deploy · Eval

Engage Pilot · Build · Tune · Audit

Compliance SOC 2 · HIPAA · on-prem

001 / SURFACES

## Eight LLM workloads we ship.

Each surface below is a workload we've shipped to production — with the eval methodology, the model choice, and the cost engineering already worked out.

LLM application development sorts cleanly by workload shape, not by industry. A clinical-note copilot for a 50-person health-tech reuses the same eval methodology, the same model-routing logic, and the same observability stack as a sales-research copilot for an 800-person fin-tech. The integrations differ, the residency posture differs, the prompts differ — but the engineering shape doesn't. Sorting by workload is what lets us reuse the eval harness, the prompt-versioning patterns, and the cost-monitoring playbook across clients instead of reinventing them every engagement.

[

01 / APP BUILD ↗

Custom LLM applications

Full app build on Claude, GPT-5, Gemini, or Llama 4. Auth, persistence, observability, fallback. Not a notebook with a chat box — production llm application development with eval gates baked in.

AuthPersistEval gates

](#use-cases)[

02 / FINE-TUNE ↗

Fine-tuning

LoRA / QLoRA on Llama 4, Mistral, Qwen. Or hosted fine-tunes on OpenAI / Anthropic. Dataset curation, eval-against-baseline, deploy only if the fine-tune wins.

LoRAQLoRAHosted

](#fine-tuning)[

03 / EVAL ↗

Eval infrastructure

Eval sets, graders, regression alarms. Build this first, pick the model second. Inspect AI as the harness; RAGAS for retrieval-adjacent; custom graders for app-specific scoring.

Inspect AIRAGAS

](#eval)[

04 / DEPLOY ↗

Production deploy

Hosted on the provider, self-hosted on your cloud via vLLM or TGI, or hybrid. Auth, rate-limit, observability, model fallback, cost guardrails — all wired before the first user lands.

vLLMHybrid

](#deploy)[

05 / ROUTING ↗

Model routing

Cost engineering by routing. A classifier sends easy queries to GPT-5 mini or Haiku; hard queries to GPT-5 or Sonnet. Quality holds via eval gates; spend drops 40–70%.

RouterCache

](#cost)[

06 / OBSERVABILITY ↗

LLM observability

Langfuse, Helicone, LangSmith — what we instrument and why. Per-call cost, hallucination scoring, prompt versioning, regression alarms. Production traces feed the eval set monthly.

LangfuseHelicone

](#observability)[

07 / SAFETY ↗

Guardrails + safety

Llama Guard, Presidio, custom policy classifiers. Pre-generation input filtering, structured-output enforcement, post-generation classification. Every production deploy ships with an adversarial eval set.

Llama GuardPresidio

](#safety)[

08 / MULTIMODAL ↗

Multimodal apps

Vision plus text plus audio. GPT-5, Claude Sonnet 4.6, Gemini Pro for OCR-grade document extraction, image Q&A, voice-driven workflows. Modality choice follows the eval set, not the demo video.

VisionAudio

](#use-cases)

Heavier surfaces this year: internal copilots, doc extraction, model routing. Lighter: multimodal voice, where the latency budget rules out about half of candidate workloads, and fine-tuned models, where the data prep cost stops many engagements at "the baseline already covers it." We'll talk you out of an LLM-shaped engagement if the workload is actually a retrieval problem — better fit for [grounded retrieval pipelines](/services/rag-development/) — or an autonomous task-execution problem — better fit for [agent loops with state](/services/ai-agent-development/). About 15% of inbound LLM development services inquiries get redirected before the first call.

002 / SERVICES

## LLM development services from an LLM development company — pick where to start.

Four engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done.

Choosing an llm development company is mostly choosing the right starting shape. Buyers who come in with a scoped use case and a way to measure success ship to production around 80% of the time. Buyers who come in with "we want LLM somewhere in the product" ship around 30% of the time, usually after a re-scope. The four shapes below map onto the starting points we see most often — pick the one that matches what you have, not what you wish you had.

[

01 / PILOT ↗

LLM App Pilot

One scoped use case, eval-graded, demoed in 2–4 weeks. Fixed scope. About a third of pilots end at the pilot — that's a feature, not a failure.

2–4 wksFixed scope

](#engage)[

02 / BUILD ↗

Production LLM Build

Full app with auth, observability, eval gates, deploy pipeline, and four weeks of post-launch iteration. 8–16 weeks. Fixed scope.

8–16 wks

](#engage)[

03 / FINE-TUNE ↗

Fine-tuning engagement

Dataset curation, baseline eval, fine-tune run, head-to-head comparison. Deploy only if it wins; we'll recommend prompt-only if that's enough. 6–10 weeks.

6–10 wks

](#engage)[

04 / AUDIT ↗

LLM App Audit

Eval coverage, cost profile, latency, safety posture. Report with a prioritised fix list and a fixed-scope follow-on if you want us to ship the fixes. 2–3 weeks.

2–3 wksAudit

](#engage)

If the workload is scoped but the model quality is unproven, start with a **Pilot**. If you know the workload works and you need production discipline (observability, eval gates, routing, deploy), start with a **Production LLM Build**. If you've already shipped LLM and it's underperforming or surprise-billing, start with an **Audit**. If the data tells you a fine-tune is justified, start with a **Fine-tuning engagement** — but the audit usually comes first because the eval set has to exist before fine-tuning can be evaluated. Week-by-week scope on each is further down the page.

003 / MODELS

## Models — when each one wins.

Model choice follows workload, not house preferences. Cost-adjusted quality against your eval set decides — every time, no exceptions.

-   Claude
-   GPT-5
-   GPT-5 mini
-   Gemini
-   Llama 4
-   Mistral
-   Qwen
-   vLLM
-   TGI
-   Modal
-   Replicate
-   Langfuse
-   Helicone
-   Inspect AI
-   Llama Guard
-   Presidio
-   Claude
-   GPT-5
-   GPT-5 mini
-   Gemini
-   Llama 4
-   Mistral
-   Qwen
-   vLLM
-   TGI
-   Modal
-   Replicate
-   Langfuse
-   Helicone
-   Inspect AI
-   Llama Guard
-   Presidio

FRONTIER + OPEN-WEIGHT MODEL PICKS

For each model: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production llm development on every model below. The "when we don't" lines come from real builds — usually a moment in week 4 where the eval set told us to swap.

Claude Sonnet 4.6

Strengths

Strongest tool-call accuracy in our eval set across 2025–2026. Excellent at structured output. Prompt caching cuts cost ~80% on stable system prompts. Vision is competitive with GPT-5 on OCR-grade extraction.

When We Pick

Default for agentic workloads where the model holds state and calls tools. Default for any app that streams JSON. Default when prompt caching unlocks meaningful spend reduction (long stable system prompts).

When We Don't

Workloads that need a 128k+ context window with reliable recall — Gemini 3.0 Pro wins there. Hyper-cost-sensitive batch workloads where Haiku or GPT-5 mini covers the quality bar.

Paiteq Pattern

Our day-one baseline on most production LLM development services. About half the apps we ship in 2026 run on Sonnet plus prompt caching.

Tool-callJSONCache 80%

GPT-5

Strengths

Strongest multimodal — vision, voice, real-time bidirectional audio. Latency-tuned for streaming UIs. Robust function-calling. Batch API at ~50% off list for non-interactive workloads.

When We Pick

Multimodal workloads, voice apps via the Realtime API, and any client with an existing OpenAI contract where procurement won't add Anthropic. Batch summarisation pipelines.

When We Don't

Pure tool-call agentic workloads — Sonnet wins our evals there. Apps with strict refusal policies — GPT-5's safety surface is looser than Claude's by default.

Paiteq Pattern

Default for voice and vision pipelines. We pair it with GPT-5 mini as the routing fallback on most production deploys.

MultimodalRealtimeBatch

GPT-5 mini / Haiku

Strengths

10–20× cheaper per token than the flagships. Latency 2–3× faster. Good enough for ~70% of routing-tier traffic on most apps once a router classifier is in place.

When We Pick

As the cheap-tier in a routed stack. Easy classification, short summarisation, slot-filling extractors. Anywhere the eval set says the small model holds quality.

When We Don't

Hard reasoning, multi-step tool use, long context with citation. We've seen too many builds try to push these down to mini and watch task scores drop 15–25 points.

Paiteq Pattern

Roughly two-thirds of cost engineering work is just plumbing in a router that sends easy queries here. Spend cuts of 40–70% are common, quality holds when the eval set guards the routing thresholds.

CheapFastRouter-tier

Llama 4 / Mistral (self-hosted)

Strengths

Open weights. Data never leaves your perimeter. Fixed infra cost beats per-token billing above ~2M requests/day on dedicated GPUs. Tunable — quantization, LoRA adapters, custom vocab.

When We Pick

Regulated data with hard residency rules (healthcare PHI, finance MNPI, EU AI Act high-risk workloads). Very high-volume batch workloads where the math flips. Anywhere the team has the ops capacity to run an inference service.

When We Don't

Mixed-quality workloads where a flagship's edge matters per call. Teams without Kubernetes / GPU ops capacity — the operational tax shows up around month two.

Paiteq Pattern

We run Llama 4 70B on vLLM with A100 80GB nodes for residency-constrained clients. Sub-200ms TTFT achievable with continuous batching. ~$0.08 per 1K output tokens amortised at the workloads we see.

Open-weightsvLLMResidency

Gemini 3.0 Pro

Strengths

2M-token context window with usable recall. Strong multimodal — video and document understanding at length. Aggressive pricing on long-context workloads. Native Google Cloud integration.

When We Pick

Document-stack workloads where the corpus fits in context — usually under 1.5M tokens. Long-form synthesis. Clients standardised on Vertex AI for procurement reasons.

When We Don't

Tool-call-heavy agents — Sonnet still leads our evals there. Workloads where the long-context advantage doesn't pay back the per-token cost on the actual queries (most are still short).

Paiteq Pattern

Default for long-document Q&A where the customer would rather pay the per-token cost than maintain a retrieval pipeline. We'll talk you back to RAG if the corpus grows past ~1.5M tokens.

2M contextVisionVertex

Qwen 3 / specialised open

Strengths

Strong Chinese / Asian-language performance, competitive coding benchmarks, permissive licence. Sizes from 0.5B to 72B cover edge-to-cloud. Aggressive on multilingual workloads.

When We Pick

Multilingual workloads with heavy Chinese / Japanese / Korean traffic. Code-heavy domains where Qwen-Coder lifts pass-rate over Llama 4. Self-hosted multilingual chat where Llama's English bias hurts.

When We Don't

English-first SaaS. Workloads where the licence diligence isn't worth the lift for marginal gain over Llama 4.

Paiteq Pattern

We reach for Qwen on 1 in 8 builds — usually multilingual SaaS expanding into APAC or enterprises with code-search workloads.

MultilingualCodePermissive

Two patterns worth flagging. First, **we benchmark three models against the eval set** before locking the stack — usually Sonnet, GPT-5, and one open-weights candidate (Llama 4 70B or Qwen 3). The eval set decides, not the leaderboard, not the demo video. Second, **we default to a two-tier routing stack on Production Builds** — flagship for hard queries, cheap-tier for easy ones, classifier router in the middle. The 40–70% cost cut almost always pays back the routing complexity. Skip routing only when traffic is too low for the savings to matter or when the workload is uniformly hard. Our deeper take on hosted vs self-hosted economics lives in our [hosted-vs-self-hosted analysis](/blog/hosted-vs-self-hosted-llms/).

004 / WHERE LLMs SHIP

## Where LLMs deliver — capability × industry.

Capability rows × industry columns. Cell strength reflects production volume in our work, not theoretical fit. Empty cells mean we either haven't shipped it yet or the workload didn't justify an LLM.

Capability Industry

B2B SaaS

Fin-tech

Health-tech

Legal

Mfg

E-comm

Ed-tech

Logistics

Custom LLM apps

Internal copilots

Doc extraction

Voice / multimodal

Classification

Fine-tuned models

Custom LLM apps

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Internal copilots

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Doc extraction

B2B SaaSFin-techHealth-techLegalMfgE-commLogistics Ed-tech

Voice / multimodal

B2B SaaSFin-techHealth-techE-commEd-techLogistics LegalMfg

Classification

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Fine-tuned models

B2B SaaSFin-techHealth-techLegalMfgEd-tech E-commLogistics

Possible fit Good fit Primary vertical

Heaviest columns: fin-tech, health-tech, legal. The pattern isn't surprising — those are the industries where structured doc extraction, regulated workflows, and high-stakes refusal pay back hardest. Lightest column: ed-tech, where workloads tilt toward generation and personalisation more than analysis. The grid isn't a roadmap; it's a record. If your industry's column looks thin and your use case sounds promising, that's often where the most interesting llm consulting engagements come from — fewer prior comparisons, more white space.

005 / DEPLOY

## Hosted, self-hosted, or hybrid — pick the right deployment.

The most common scoping question on any llm development services engagement. The answer depends on residency rules, steady-state volume, and workload mix. Walk the picker; it'll get you to one of five recommendations in two or three questions.

Path

Question

Pick one

Result

Most clients land on **hosted with a routing layer** — Sonnet plus GPT-5 mini in a two-tier stack, or GPT-5 plus mini, depending on procurement constraints. About a quarter end up self-hosted, almost entirely health-tech and finance with hard residency rules. Hybrid is the rarer path but real for clients with mixed regulated and non-regulated traffic. Our piece on [hosted-vs-self-hosted economics](/blog/hosted-vs-self-hosted-llms/) walks through the break-even math; below ~100k requests/day, hosted almost always wins on TCO.

006 / PATTERNS

## Four LLM application architectures we ship.

Most production llm application development collapses onto one of these four shapes. The shape decides where the eval gates land, where the cost lives, and where the failure modes are.

   
01

### Single-prompt app

The simplest production shape. A stable system prompt, prompt caching for cost, structured output for reliability, one model. Most internal copilots that don't need state or tool-use start here and stay here.

Pick when

-   Single-turn or stateless multi-turn
-   output is text or a constrained JSON schema
-   latency budget allows one round-trip
-   cost is bounded by per-request length, not session state.

Skip when

-   Multi-step reasoning that benefits from chained models
-   tool use
-   long-running sessions where state matters
-   workloads where each call shape varies enough that caching doesn't hit.

Stack

Claude Sonnet 4.6 or GPT-5Prompt cachingStructured output (Pydantic / Zod schemas)Llama Guard

02

### Prompt chain

Split a hard task into stages. Use a cheap model (GPT-5 mini, Haiku) for extraction and formatting; reserve the flagship for the reasoning step in the middle. Each stage has its own eval target; failure modes are localised.

Pick when

-   Hard reasoning task with cheap pre/post-processing
-   structured input or output where the format work doesn't need the flagship's quality
-   cost-sensitive workloads where the flagship is overkill at the edges of the chain.

Skip when

-   Tasks where the stages share too much context — the chain re-prompts the context at each call
-   that gets expensive fast. Tool-call agentic workloads, where the chain shape forces unnatural rigidity.

Stack

GPT-5 mini or Haiku (edges)Sonnet or GPT-5 (middle)Inspect AI per-stage evalLangfuse trace correlation

03

### Model routing

A small classifier (often the cheap-tier model itself) routes each query to the right tier. Easy queries go to GPT-5 mini or Haiku; hard queries go to Sonnet or GPT-5. Routing thresholds are guarded by eval gates on both tiers. The single highest-ROI cost engineering pattern in our playbook.

Pick when

-   Mixed-difficulty traffic above ~100k requests/day where the flagship-only bill becomes the constraint
-   workloads where you have a measurable success metric on both tiers (eval gates can guard the threshold)
-   long-running production apps where iteration on the router pays back.

Skip when

-   Pre-launch builds — you don't know the traffic shape yet. Workloads where the cheap-tier quality is consistently below threshold (just use the flagship). Tiny daily volume where the routing complexity costs more than the model spend it saves.

Stack

Classifier (cheap-tier LLM as judge)Eval gates per tierLangfuse cost traceFallback policy (cheap-fail → flagship)

04

### Fine-tune + adapter

A small LoRA adapter trained on 1.5–10k domain examples, stacked on a frozen base (Llama 4, Mistral, Qwen). Adapter weights are under 200MB and load in seconds at inference. Cheap to run, cheap to iterate, easy to roll back. Works for output-style and domain-vocab problems; doesn't replace RAG for facts.

Pick when

-   Domain vocabulary the base fumbles
-   consistent output format the prompt can't reliably enforce
-   high-volume self-hosted workloads where the per-call cost matters
-   latency requirements that rule out long few-shot prompts.

Skip when

-   Small datasets (under ~800 clean examples) — the fine-tune won't move the needle. Workloads where the source of truth changes weekly — adapters go stale and re-training is overhead. Closed-source models where adapters aren't supported (you're stuck with hosted fine-tunes).

Stack

Llama 4 70B or Mistral 7B baseLoRA / QLoRA on a single A100 80GBvLLM with adapter swap-inInspect AI per-domain eval set

In practice, most production stacks compose two of these patterns. A routed two-tier stack with the flagship tier itself being a prompt chain is common. A single-prompt app on the flagship tier plus a fine-tune adapter on the cheap tier shows up on the high-volume self-hosted side. The right composition isn't pre-decided; it falls out of the eval set and the cost model during weeks 3–5 of a Build. We'll talk you out of patterns that look interesting but don't match your workload — premature complexity is the most common llm development mistake we see in audits.

007 / EVAL

## Four gates on every production LLM app.

Eval-first isn't a slogan; it's a build-order decision. The eval set lands in week 2 — before any model is picked. You can't choose a model without a way to measure what "good" means on your workload.

All four gates green before any production wire-up. If one's amber, we rework it in place; if it's red, we re-baseline the model choice or the prompt structure. The gates are the most important part of our llm development services — they're what stops 'feels good in the demo' from shipping to production.

1.  01 Task score
    
    ≥94%
    
    App-specific grader on a 30–80-example domain-expert-graded eval set. Lands in week 2 before any model selection. The eval set is your most important artifact; we facilitate, your domain expert grades.
    
    If <90%, we re-baseline against a different model tier or revisit the prompt. We've never shipped below 90% on the eval set, full stop.
    
2.  02 Hallucination
    
    <2%
    
    LLM-as-judge (Claude Sonnet 4.6) checking whether every claim is supported, plus human spot-check on the disputed 5%. For grounded apps with RAG, faithfulness is the equivalent metric.
    
    If ≥5%, we widen the refusal threshold and rerun. Refusal is a feature, not a failure mode — confident wrong answers are the worse outcome.
    
3.  03 P95 TTFT
    
    <800ms
    
    Time to first token, measured across the router and any pre-generation safety classifier. Streaming UX target. Voice apps target sub-400ms; voice runs on a leaner stack.
    
    If breached on the production traffic shape for >72h, we tune routing thresholds, swap to a faster model on the slow tier, or move to streaming generation where we weren't already.
    
4.  04 P50 cost per request
    
    Modelled at discovery
    
    Per-request cost tracked weekly post-launch via Langfuse. Modelled during the Pilot using expected traffic shape; surprise bills aren't a surprise because the modelling lands in week 3.
    
    If cost drifts >25% over the modelled baseline for two weeks, we audit the router thresholds and the system-prompt cache hit rate — usually one of those two is the culprit.
    

The four gates above are the floor. For specific workloads we add more: **refusal rate** (what fraction of queries the model declines, and how that tracks against the adversarial eval set), **citation accuracy** (for grounded apps, see our [grounded retrieval pipelines](/services/rag-development/)), **format compliance** (for structured-output apps, % of outputs that parse against the Pydantic / Zod schema first try). Add metrics only when the workload demands them — gate proliferation slows iteration without lifting quality.

008 / COST

## LLM cost engineering — model routing, caching, batching.

Cost engineering is where most of the practical ROI on a production LLM build lives. Five levers cover ~90% of the savings we ship.

01

#### Model routing

A classifier (often the cheap-tier model itself) sends easy queries to GPT-5 mini or Haiku; hard queries to Sonnet or GPT-5. Most apps see 60–75% of traffic route to the cheap tier with quality holding. **Typical cut: 40–70% of token spend.**

02

#### Prompt caching

Anthropic prompt cache hits run 80–90% on stable system prompts; OpenAI's automatic cache catches the equivalent on chat completions. We design system prompts to maximise cacheable prefixes from day one. **Typical cut: 70–85% of input-token cost on the cached prefix.**

03

#### Batch API

For non-interactive workloads — overnight summarisation, batch classification, embedding regeneration — the OpenAI Batch API and Anthropic's batch tier knock ~50% off list at the cost of higher latency (up to 24h). **Typical cut: 50% on batch-eligible traffic.**

04

#### Context windowing

Trim the input. Summarise prior turns; retrieve only the chunks that score above a threshold; cap response length aggressively when the format doesn't need more. Boring engineering work that quietly cuts the per-call cost. **Typical cut: 15–30% of token spend.**

05

#### Self-host at break-even

Above ~1M requests/day on a single workload, self-hosted Llama 4 on vLLM with continuous batching usually flips the economics. The trigger isn't a slogan — it's where the dedicated A100 pool plus ops overhead crosses the hosted bill. **Typical cut at break-even: 60–80% of hosted spend.**

The order matters. Routing and caching together usually deliver 60–80% of the savings before any other lever fires. Batching catches the next chunk for the workloads that fit. Context windowing is the boring engineering work that compounds. Self-hosting is the heavy lever that only fires when volume justifies it — premature self-hosting is the most common cost-engineering mistake we see in audits. Our deeper take lives in our [LLM cost engineering deep dive](/blog/llm-cost-engineering/).

009 / OBSERVABILITY

## LLM observability — what we instrument.

Production LLM apps fail differently than other production systems. The traces have to capture cost, quality, and latency together — and feed back into the eval set every week.

We default to **Langfuse** as the trace store on most builds; **Helicone** when the team wants a thinner proxy-only setup; **LangSmith** when LangChain is already the orchestration layer. The tool matters less than what gets instrumented. Every production call captures: model name, prompt hash, cache hit / miss, input tokens, output tokens, cost, p95 latency contribution, eval-grader score (sampled), and a user / team / tenant tag for downstream cost attribution.

Production traces feed the eval set monthly. We sample ~1% of production calls into a review queue; the domain expert grades a subset; the graded examples land in the eval set as regression cases. This is what stops the slow drift that hits most LLM apps 90 days post-launch — a model upgrade or a prompt edit subtly changes the failure mode, and without sampled feedback you don't notice until users complain. Regression alarms fire on any 5-point drop on any of the four gates.

For high-stakes workloads we add a second layer: per-tenant cost budgets enforced at the proxy, per-user rate limits to catch runaway scripts, and an alert when the cache hit rate drops more than 10 points in a 24h window (usually the canary for a system-prompt edit that broke caching). About a quarter of LLM App Audits we run trace problems back to observability gaps — you can't fix what you can't see.

010 / FINE-TUNING

## Fine-tuning — when it's worth it.

Llm fine tuning services live at a specific decision point: when prompting and retrieval don't cover the workload but a small domain dataset does. Most engagements don't need fine-tuning; the ones that do tend to be obvious.

We support three fine-tuning shapes. **LoRA** on open-weights bases (Llama 4, Mistral, Qwen) for domain-vocab and output-format problems — usually 1,500–10,000 examples, single A100 80GB, 8–24 hours of training time, ~$300–$1,200 per run. **QLoRA** for the same shapes when memory is constrained, with a small quality trade-off versus full LoRA on 70B models. **Hosted fine-tunes** on OpenAI or Anthropic when the base model has to stay frontier-class and the dataset is small enough to be cost-effective at hosted-fine-tune pricing.

The order is always the same. Baseline the prompt-only path against the eval set. If the baseline is good enough, ship it — about a third of fine-tuning engagements end here, and that's a successful outcome. If the baseline is not good enough, build the fine-tuning dataset (this is usually the bottleneck, not the training run), run the fine-tune, score against the same eval set. Deploy only if the fine-tune wins on the eval set with a margin that justifies the iteration cost. We've shipped engagements where the fine-tune lost by 1–2 points and we still recommended sticking with prompt-only because the iteration overhead wasn't worth the marginal quality.

A specific real example: a health-tech client's clinical-note structuring task. GPT-5 baseline scored F1 ~71 on the clinician-graded eval set. We benchmarked a QLoRA on Llama 4 70B trained on 8,400 redacted notes — F1 ~85, plus the data stayed in-cloud and per-call cost dropped 60%. The fine-tune shipped. Same shape, different client, finance domain: baseline scored 89, fine-tune scored 91. We didn't ship — the 2-point lift didn't justify the operational ownership of an adapter. Most fine-tuning decisions are this kind of close call; the eval set has to be the tiebreaker, not the wishful thinking.

011 / PROMPT

## Prompt engineering at scale.

Prompt engineering services on production LLM systems are mostly version control, eval correlation, and cost discipline — not clever phrasing. The cleverness happens in the eval set design.

Production prompts live in version control, not in a notebook or a CMS. Every prompt change ships with a commit message, a diff, and an eval-set re-run. If the eval scores drift, we know which prompt edit caused it — usually within a few hours of the change landing. The prompt versioning pattern we ship on most Production Builds keeps the active system prompt as a git artifact with a SHA pinned in the deploy config; rollback is a config change.

For prompt patterns themselves: **structured-output enforcement** via Pydantic or Zod schemas catches more than chain-of-thought tricks on most production workloads — the model can't drift if the output has to parse. **Few-shot examples** live in a separate corpus that the prompt builder injects at request time; this lets us A/B example sets without touching the system prompt. **Caching prefixes** are designed in from day one — anything stable goes at the top of the prompt; anything dynamic goes after, where it doesn't bust the cache. **System-prompt isolation** prevents user content from being interpreted as instructions; we wrap user input in unambiguous delimiters and tell the model so explicitly.

We ship our prompt engineering services as part of the Build, never as a standalone — prompting without eval is just creative writing. For teams that want the prompts handed off cleanly, every Production LLM Build ends with a documented prompt library plus a small [prompt-versus-fine-tune decision memo](/blog/prompt-engineering-vs-fine-tuning/) covering when to escalate which knob next.

012 / PROCESS

## How a build runs — eval-first, every time.

The same six-step process runs across a 4-week Pilot and a 16-week Production Build. The gates change in depth, not in shape. Every step has a deliverable, a named owner, and a gate criterion — pass or rework.

WEEK 1

### Discovery

Workload shape, eval surface, cost target, residency posture. Models aren't picked yet — that's week 3.

WEEK 2

### Eval set

30–80 domain-expert-graded examples covering main paths and edge cases. Lands before any prompting.

WEEK 2–4

### Baseline

Three to four models scored against the eval set. Cost-adjusted quality wins, not benchmark theatre.

WEEK 4–8

### Iteration

Prompt, routing, retrieval (if RAG), or fine-tune (if the data justifies it). Each change re-scored.

WEEK 8+

### Deploy

Auth, rate-limit, Langfuse observability, model fallback, cost guardrails, regression alarms.

ONGOING

### Running

Weekly eval, drift alarms, prompt iteration log, model-upgrade regression checks. The eval set grows.

013 / VS

## Hosted versus self-hosted — the side-by-side.

A reference table covering the practical trade-offs. The picker above gets you a recommendation; this table gives you the numbers to defend it in a procurement conversation.

Hosted (provider)

Self-hosted (yours)

Hosted (OpenAI / Anthropic / Vertex)

Fast to ship, top-of-class quality, per-token pricing — no inference infra to run

For most teams below ~500k requests/day, hosted is strictly faster to ship *and* cheaper to operate. No GPU cluster to provision, no CUDA dependency hell, no on-call rotation for inference infra. The switching cost if you later need self-hosted is a model swap, not a rewrite.

Self-hosted (Llama / Mistral / Qwen on your cloud)

Data stays in your cloud, fixed infra cost, model under your control, custom quantization + adapters

Self-hosting is the only option when a data-residency rule forbids third-party inference — common in EU healthcare (GDPR + national health-data law), defence-adjacent contracts, and financial services with ring-fenced data mandates. It's also the only path to QLoRA/LoRA fine-tunes that you own outright.

Best when

Quality > cost; no residency rule; team has no ops capacity

Regulated data; very high volume; or strategic need to own the model

Latency floor

Provider-dependent; rarely below 600ms TTFT for flagships

Sub-200ms TTFT achievable with vLLM continuous batching

vLLM's continuous batching scheduler saturates GPU memory instead of waiting for fixed batch boundaries — at typical QPS, p50 TTFT on a single A100 running Llama 3 70B sits around 130–180ms. Provider network round-trips alone add 80–150ms before a single token is returned; flagship models (GPT-5, Claude Opus) sit materially above 600ms TTFT at launch load.

Per-1K-tokens cost

$0.30–$15 input/output (flagships); $0.05–$0.50 (small models)

$0.05–$0.40 amortised on dedicated A100s once volume crosses the break-even

The ranges genuinely overlap. At low volume the amortised A100 cost looks great on paper but the fixed reservation cost ($10–25k/mo per A100 on-demand) dominates unless utilisation is high. At very high volume the per-token numbers flip — but the real lever is utilisation rate, not raw token price.

Operational load

Low — provider runs inference, you run the app

Higher — you (or we) run GPU pools, batching, quantization, autoscaling

Break-even point

<100k requests/day — hosted almost always wins

\>1M requests/day on a single workload — self-hosted economics flip

Full breakdown — [when to pick which](/blog/hosted-vs-self-hosted-llms/)

014 / USE CASES

## Where teams have shipped.

Six anonymised engagements across recent quarters. Workloads, segments, and outcome metrics are real; brand removed under NDA.

Internal copilot

Fin-tech · 800+ emp

### Internal LLM assistant on Claude + private corpus

Slack-deployed advisor that pulls from a redacted Confluence + Snowflake corpus. Refuses cleanly on out-of-corpus. Auth scoped per team; Langfuse traces every call; cost capped per-user per-day. The eval set is graded monthly by two senior analysts.

0 %

internal-knowledge tickets

Fine-tune

Health-tech SaaS · 50–200 emp

### QLoRA fine-tune on Llama 4 for clinical note structuring

8,400 redacted clinical notes, QLoRA fine-tune on a single A100 80GB, eval against a clinician-graded reference set. Self-hosted on the client's AWS via vLLM. We benchmarked GPT-5, Sonnet, and the fine-tune; the fine-tune won on F1 and on per-request cost simultaneously.

0

F1 + points over GPT-5 baseline

Cost engineering

B2C app · 1M+ MAU

### Model routing + prompt caching on a GPT-5 stack

A small intent classifier routes ~70% of traffic to GPT-5 mini and ~30% to GPT-5. Anthropic-style prompt caching on the stable system prompt cut the cached portion by 85%. Eval gates guard the routing thresholds — quality on both tiers tracked weekly.

0 %

token spend , p95 unchanged

Voice

Health-tech · enterprise

### Voice intake on GPT-5 Realtime, sub-400ms p95

Patient intake voice agent on GPT-5 Realtime API with a custom safety filter pre-generation. Voice-RAG pulled from a HIPAA-compliant Postgres index running BGE embeddings on the same VPC. We sit at p95 ~360ms TTFT; clinical reviewer signed off on faithfulness at 96%.

0 %

p95 360ms TTFT, faithfulness

Doc extraction

Logistics · 200+ emp

### Bill-of-lading extraction with GPT-5 Vision + structured output

Scanned BoLs across 14 carriers. GPT-5 Vision extracts to a Pydantic schema with structured output, validated against a rules engine, and posted into the TMS. Carrier-specific layouts get few-shot examples; the rare edge case routes to human review with the model's draft attached.

OCR-to-TMS 14m → 90s per BoL

Multilingual

Ed-tech · APAC

### Qwen 3 72B for Chinese-first tutor copilot

Self-hosted Qwen 3 72B on a 4×A100 pool for a Chinese / English bilingual tutoring app. We benchmarked it against Llama 4 70B and GPT-5; Qwen won on Chinese fluency and cost, GPT-5 won on English. The app routes by detected language at the prompt layer.

0 %

Bilingual quality parity, cost cut

015 / CONSULTING

## LLM consulting services — advisory engagements.

Sometimes the right answer isn't "build the app." Our llm consulting services cover the strategic decisions that need to land before any code ships — and occasionally the engagement ends "don't build, buy this off-the-shelf tool instead."

01

#### Model selection audit

Two-week engagement: workload audit, eval-set design, head-to-head benchmark of 3–5 candidate models, build-vs-buy-vs-fine-tune memo.

02

#### LLM cost projection

TCO model over 12 months: hosted vs self-hosted, routing scenarios, prompt-cache assumptions, traffic-growth sensitivities. Lands in 1–2 weeks.

03

#### Provider roadmap

OpenAI vs Anthropic vs Vertex vs open-weights, with a procurement + risk lens. Useful when leadership needs the trade-offs on paper.

04

#### Build-vs-buy assessment

Custom app vs SaaS vs hybrid. Honest stop-recommendation when an off-the-shelf tool covers your workload — that's about 1 in 5 consulting engagements.

Llm consulting is what we run when the question is "should we build this?" not "how do we build this?" — a 1–2 week engagement with a written memo, sometimes a benchmark, sometimes a TCO model. About one in five consulting engagements ends with "don't build, here's the SaaS that does this," which is a successful outcome we sometimes have to talk new clients into believing we mean. For broader strategic AI work that spans LLM + RAG + agents, see [our strategy advisory practice](/services/ai-consulting/).

016 / ENGAGE

## Four ways to start.

01 LLM App Pilot Fixed scope

2–4 weeks

### Pilot one workload, intake to live.

In scope

-   One scoped use case
-   Eval set (30–80 examples)
-   3–4 model baseline scored
-   Working prototype
-   Demo + build-or-stop memo

Out of scope

-   Production deploy
-   Fine-tuning
-   Multi-workload scope

02 Production Build Fixed scope

8–16 weeks

### Full LLM app with eval gates.

In scope

-   All Pilot deliverables
-   Auth · rate-limit · observability (Langfuse)
-   Model routing + cost guardrails
-   Eval gates baked into the deploy pipeline
-   Adversarial eval + safety classifier
-   Four weeks of post-launch iteration

03 Fine-tuning Fixed scope

6–10 weeks

### LoRA / QLoRA / hosted fine-tune.

In scope

-   Dataset curation + cleaning
-   Eval set + prompt-only baseline
-   Fine-tune run + eval comparison
-   Deploy if it wins; recommend prompt-only if not
-   Weights + adapters + runbook transferred

04 LLM App Audit Fixed scope

2–3 weeks

### Eval, cost, latency, safety review.

In scope

-   Coverage audit on current eval
-   Cost + latency profiling
-   Adversarial / jailbreak test
-   Prioritised fix-list + follow-on quote

017 / FAQ

## Common LLM development questions.

Hosted (Claude / GPT) or self-hosted (Llama) — how do we decide?

Default to hosted unless you have one of three triggers: a residency rule on the data, very high steady-state volume (above ~1M requests/day on a single workload), or a strategic need to own the model weights. Self-hosting is a meaningful operational commitment — GPU pools, autoscaling, quantization tuning, batching — and we'll only recommend it when the math actually works.

The interactive picker above walks through the decision in 2–3 questions. Most clients we see end up on hosted with an Anthropic Sonnet baseline and GPT-5 mini as the cheap-tier in a routed stack; about a quarter end up self-hosted (mostly healthcare and finance with strict residency). Hybrid is rare but real for clients with mixed regulated and non-regulated traffic. Our piece on [hosted vs self-hosted LLMs](/blog/hosted-vs-self-hosted-llms/) covers the break-even math in detail.

When does fine-tuning beat prompt engineering and RAG?

Fine-tuning wins on three specific shapes. **One**: when the output format is so consistent that prompting struggles (a custom JSON schema, a regulatory document format, code in a niche DSL). **Two**: when the domain language has terms the base model fumbles — clinical jargon, legal idioms, internal product vocabulary that the base model never saw enough of. **Three**: when latency or cost have to drop and you've exhausted the prompting and routing levers.

Prompt + RAG usually wins outside those three shapes. Fine-tunes are slower to iterate on, harder to debug, and they don't compose with citations the way RAG does. We baseline prompt-only and prompt+RAG before we propose a fine-tune; if either of those clears the eval bar, the fine-tune budget gets spent on the eval set instead. Our llm fine tuning services live inside the Fine-tuning engagement shape above — about a third end in "the fine-tune didn't beat the baseline, so we shipped prompt-only," and that's a win, not a loss.

Composing them is common too: RAG for facts, a small LoRA fine-tune for output style and domain vocab. We scope both at week 2 if the use case has both shapes. The picker on our [grounded retrieval pipelines page](/services/rag-development/) covers the RAG vs fine-tune decision tree in more depth.

How do you measure LLM application quality?

Four metrics, scored separately because they fail differently:

1.  **Task score** — app-specific grader on a domain-expert-graded eval set (30–80 examples). The eval set is the most important deliverable of the engagement; your domain expert grades, we facilitate.
2.  **Hallucination rate** — claims with no factual support, scored by LLM-as-judge (Claude Sonnet 4.6) with human spot-check on the disputed cases. Hard gate before production.
3.  **P95 TTFT** — full time to first token including any pre-generation safety filter. Voice apps target sub-400ms; chat apps sub-800ms.
4.  **P50 cost per request** — modelled at discovery, tracked weekly post-launch via Langfuse. Surprise bills aren't a surprise.

The default eval stack is Inspect AI as the harness, RAGAS for any retrieval-adjacent metric, custom graders for app-specific scoring, and Langfuse for production trace sampling that feeds the eval set monthly. Regression alarms fire when any metric drops more than 5 points on a model upgrade. Our piece on [eval framework comparison](/blog/llm-eval-frameworks/) covers when to reach for which tool.

How do you handle prompt injection and jailbreaks?

Defence in depth. Input classifier (Llama Guard or a custom classifier) runs pre-generation on user input; structured-output enforcement constrains what the model can return; system-prompt isolation prevents user content from leaking into instructions; output filter classifies pre-response. Every production deploy ships with a documented threat model and an adversarial eval set — 30–80 known jailbreak attempts plus domain-specific abuse patterns.

We don't claim "unbreakable" because that's not a real thing. We claim measured — we know our refusal rate on the adversarial set, we know our false-positive rate on the benign set, and we know the trade-off we're holding. For high-stakes workloads (clinical, financial advice, legal) we add a second classifier post-generation and a human-in-the-loop for refused-but-disputed cases. Our customer-side budget for safety work is usually 10–15% of the build effort; trying to do this cheaper is the cheap way to ship the wrong thing.

What about cost runaway — how do you prevent surprise bills?

Five mechanisms layered together. **Per-request cost budgets**: hard caps that refuse the request rather than emit it when the prompt is bigger than expected. **Rate limits** per user, per team, per endpoint — usually configured via Langfuse or a thin proxy. **Model routing**: easy queries to GPT-5 mini or Haiku, hard queries to flagships, controlled by a small classifier. **Prompt caching** on stable system prompts and any prefix that repeats — Anthropic's cache hits often run 80–90% on enterprise system prompts. **Batch API** for non-interactive workloads, which knocks ~50% off list at the price of higher latency.

Cost is modelled in week 3 of any engagement using the actual expected traffic shape, not a marketing average. Once live, weekly cost reviews catch drift before it compounds; the regression alarm trips on a 25% drift over two weeks. We've seen "surprise" bills usually trace to one of three things — a router threshold miscalibration, a cache hit rate that quietly collapsed after a system-prompt edit, or a runaway agent loop that didn't have a step-budget. All three are observability-detectable if you instrument the right things on day one.

Can you migrate us from one model provider to another?

Yes. Provider migration is part of about a quarter of our LLM App Audit engagements. The pattern is dual-write first (calls go to both providers for a sample of traffic), eval parity checks against your real eval set, then read-cutover with a fast rollback path. We've done OpenAI → Anthropic, Anthropic → Vertex, and hosted → self-hosted migrations with zero downtime when the abstraction was clean.

The abstraction matters more than the migration. We bake a thin provider-abstraction layer into every Production Build for exactly this reason — model swaps shouldn't require touching app code. If you're locked into a vendor SDK now, the migration scope grows because we have to abstract first, then migrate; that's a 6–10 week engagement instead of 3–4. Either way, the eval set decides whether the migration ships, not vendor benchmarks.

Who owns the prompts, fine-tuned weights, and the eval set?

You do. All artifacts transfer into your repository under the SOW: system prompts, few-shot examples, eval set, fine-tuned adapter weights, the Langfuse instance configuration, the runbook. We retain no rights to your prompts, weights, or data. Paiteq keeps engineering learnings — patterns, methodologies, anonymised case-study takeaways for our internal playbook — but never your specific artifacts.

This matters more than people realise on a first build. We've onboarded several clients whose previous vendor "owned" the prompts as IP, which made provider migration impossibly expensive. We refuse that pattern. Your business knowledge lives in the eval set and the prompts; treating either as vendor IP would be malpractice.

What's a realistic budget and timeline for production LLM development services?

The four engagement shapes above are fixed-scope and fixed-duration; we hold the price band on the contact call because workload depth, residency posture, and integration count swing the budget meaningfully. Rough order of magnitude:

-   **LLM App Pilot** (2–4 weeks): small enough that stopping at the pilot is a real option. About one in three pilots end at the pilot because the eval surface wasn't measurable or the workload turned out to be a generation problem better served by a [generative AI engagement](/services/generative-ai/).
-   **Production LLM Build** (8–16 weeks): the bulk of our llm development services revenue. Includes four weeks of post-launch iteration baked into the SOW.
-   **Fine-tuning engagement** (6–10 weeks): includes the head-to-head against the prompt-only baseline. We've shipped engagements that ended "the fine-tune didn't beat the baseline, so we shipped prompt-only" — that's a successful outcome.
-   **LLM App Audit** (2–3 weeks): outputs a prioritised fix-list and a fixed-scope follow-on quote if you want us to ship the fixes.

For llm consulting services (model selection audits, cost projection, provider roadmap), 1–2 week engagements at a flat fee. The full breakdown lives in [our strategy advisory practice](/services/ai-consulting/) for cross-service strategic work.

018 / Related practices

## Adjacent services.

[

RAG DEVELOPMENT

RAG Development

Retrieval-augmented generation systems with evaluation built in.

](/services/rag-development/)[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

AI CONSULTING

AI Consulting

AI strategy, audits, roadmap.

](/services/ai-consulting/)

019 / Start a project

## Let's *ship* the LLM app.

Pilot in 2–4 weeks. Production build in 8–16. Fine-tune in 6–10. Audit in 2–3.

[Talk to engineering](/contact/) [Architecture review](/contact/?topic=arch-review)


---

## SECTION: 4.4. Service: ai-consulting

_Source: https://www.paiteq.com/services/ai-consulting/_

# AI Consulting Services — Paiteq

> AI consulting services and AI strategy consulting from an engineering-led AI consulting company. AI capability assessment, AI roadmap consulting, AI audit services, AI vendor selection consulting. Fixed scope, no kickbacks.

**HTML version:** https://www.paiteq.com/services/ai-consulting/

## Key facts

- Deliverables: capability assessment, roadmap, fixed-scope audit, vendor shortlist.
- No kickbacks from vendors; recommendations are model-agnostic.
- Walk-away clause on every audit.

## Related pages

- [AI Migration](https://www.paiteq.com/services/ai-migration/)
- [AI Integration](https://www.paiteq.com/services/ai-integration/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

P7 · Services

# *AI consulting services* — capability audits, costed roadmaps, vendor selection, board-grade memos

AI consulting services from an engineering-led AI consulting company that ships AI strategy consulting in writing, not slides. AI capability assessment, AI roadmap consulting, AI vendor selection consulting, AI readiness assessment, and AI due diligence — fixed-scope, signed by the engineers who'll still be on the call when the build starts. No kickbacks, no 600-person practice to push into the answer.

[Talk to a partner](/contact/) [See engagement shapes](#engage)

Practice AI strategy consulting

Shapes Audit · Roadmap · RFP · DD

Default Written memo + exec readout

Engagements 2–6 weeks · fixed scope

001 / FRAME

## Where AI consulting services earn their fee.

Most buyers arrive at enterprise ai strategy with the wrong question pre-attached to the right budget. An ai readiness assessment up front saves them from the most common failure — paying for the wrong shape. The grid below is the first frame we run — buyer shape on the left, engagement stage across the top — and the call we'd make on the rubric. It's the most expensive mistake in this category: spending audit dollars on a question that should've gone straight to vendor RFP, or running vendor RFP on a strategy gap that an audit would've named in two weeks.

Where you are

Audit (2–3w)

Roadmap (3–5w)

Vendor RFP (4–6w)

Build (route out)

Pre-AI, board-mandated

Default

Sequel

Premature

Not yet

AI pilot stalled at month 6

Default

After audit

Often the gap

Reroute to P1/P3

Vendor demos already booked

Skip

Skip

Default

After pick

Build vs buy unresolved

Yes — frame it

Default

Half of it

Only one side

Regulated industry (HIPAA, EU AI Act)

Compliance read

Roadmap + posture

DPIA-aware RFP

Routed to siblings

Acquisition / DD in flight

AI DD memo

Post-close

Rare

N/A

Tooling sprawl, no strategy

Default

Consolidation plan

After rationalise

Not now

Yes = default recommendation. Maybe = depends on a follow-up question we'll cover in the kickoff. No = we'd actively steer you away.

If your shape isn't on the grid, the framing call is free — DM us with the situation and we'll write back inside a business day with which shape fits and what it costs.

002 / SHAPES

## Four engagement shapes. Every ai consulting services engagement maps to one.

Fixed scope, fixed fee, written deliverable. We don't sell hours; we sell a memo. The four shapes below cover roughly 95% of inbound — Audit, Roadmap, Vendor RFP, AI Due Diligence. Mixed engagements bill as two consecutive shapes, not an open retainer.

01 AI CAPABILITY AUDIT Fixed scope

2–3 weeks

### Read of practice + written memo.

In scope

-   60-minute kickoff to lock the question
-   Capability read across model selection, retrieval, eval rigour, observability, MLOps
-   Data-hygiene audit with named leakage/labelling gaps
-   20-page written memo + 90-minute exec readout
-   Recommended next step in writing (Roadmap, RFP, Build, or no-action)

Out of scope

-   Vendor RFP authoring (Shape 03)
-   12-month costed roadmap (Shape 02)
-   Hands-on build (route to technical pillar)

02 AI ROADMAP Fixed scope

3–5 weeks

### 12-month costed plan against scored use cases.

In scope

-   All Capability Audit deliverables
-   Use-case scoring across business value, feasibility, organisational readiness
-   Vendor longlist scored against the audit rubric
-   Build-vs-buy frame with TCO modelled three postures
-   12-month sequence with named phases, owners, exit gates, named tools

Out of scope

-   Vendor demo shadowing (Shape 03)
-   Hands-on build (route to technical pillar)
-   Ongoing retainer (separate engagement)

03 VENDOR RFP SUPPORT Fixed scope

4–6 weeks

### RFP authoring + demo shadowing + reference checks.

In scope

-   RFP authored against the audit rubric
-   Vendor demos shadowed by an engineer
-   Scoring sheets filled with named criteria
-   Reference checks run with at least two existing customers per vendor
-   Contract terms reviewed for data residency, IP, exit clauses
-   Procurement-ready recommendation memo

04 AI DUE DILIGENCE Fixed scope

2–4 weeks

### Acquisition or board-mandated AI posture read.

In scope

-   AI surface audit of the target or business unit
-   Model-evaluation re-run on a leakage-free holdout where applicable
-   Risk register across IP, data residency, vendor lock-in, regulatory exposure
-   20-page board-grade memo + 90-minute presentation
-   Dissenting view named in writing — we don't bury the no-flags

003 / NUMBERS

## What an honest AI consulting company looks like at the spreadsheet level.

Pricing transparency that most ai strategy consulting firms hide behind a "let's chat" wall. The shapes are fixed, the timelines are fixed, the deliverable is written. We can't quote the fee until we've scoped the surface, but the range is on the higher end of independent advisory and the lower end of tier-one strategy houses — roughly where the value sits. An ai strategy consulting engagement at this depth is a one-time cost, not a quarterly retainer drip.

0 –3w

AI capability audit

Fixed scope, written memo

0 –5w

AI roadmap engagement

12-month costed plan

0 –6w

Vendor RFP authoring

Audit rubric → scored shortlist

0 –4w

AI due diligence read

Acquisition or board-mandated

004 / GATES

## Six gates an honest AI capability audit clears.

An ai audit services memo is only as honest as the gates the auditor runs. Below is the screen we apply to every ai audit services engagement — and the same screen we use when we're hired to second-opinion a memo a tier-one firm already shipped. Second-opinion work routinely flags at least one gate the original audit silently skipped.

-   01
    
    ### Eval rigour, not eval theatre
    
    Has the existing AI surface been graded on a leakage-free holdout, with named metrics (AUC, NDCG, faithfulness, ECE) tracked over time? Or is the "eval" a few cherry-picked outputs in a quarterly review deck? Eval theatre is the most common silent failure in stalled pilots — calibration drifts, faithfulness drops, nobody measures, the team blames the model.
    
-   02
    
    ### Observability priced day-one
    
    Langfuse, Braintrust, or LangSmith wired before the second sprint. Traces searchable, prompts versioned, eval runs reproducible. Audit memos that don't price observability as a day-one line item are usually written by a vendor that doesn't want you watching too closely.
    
-   03
    
    ### Data residency named in writing
    
    For regulated workloads — HIPAA, EU AI Act, MAS, GDPR — the deployment posture is named explicitly. Which region. Which provider. Which sub-processor agreement. Half the vendor pilots we audit have a residency gap the vendor's standard contract can't close inside the renewal window.
    
-   04
    
    ### Vendor swap is a 1–2-week migration
    
    If you can't switch from Claude Opus 4.7 to GPT-5 to Gemini 3 Pro inside two weeks, you don't have an AI stack — you have a vendor stack. A routing layer above the provider SDKs costs roughly a week of engineering and saves you the contract negotiation that arrives in month nine.
    
-   05
    
    ### TCO modelled past month 12
    
    Hosted-frontier looks cheap at 1M monthly tokens; at 500M it's eye-watering. The opposite for self-hosted Llama 4. Audits that don't carry the TCO past the budget cycle's horizon land buyers in renegotiation surprise at exactly the wrong time. We model 24 months as the floor and 36 as the ceiling.
    
-   06
    
    ### Failure mode named in writing
    
    What's the single most likely way this build fails at month nine? If the memo can't answer that question, it isn't an audit memo — it's a sales doc. We name the failure mode, the leading indicator, and the threshold at which the trigger fires. A meaningful share of our memos recommend not proceeding at all; the rest name what to watch.
    

Six-out-of-six clean is rare in our review history. Two or fewer clean is the trigger for the "stalled-pilot" intervention shape under our [agent](/services/ai-agent-development/) or [LLM](/services/llm-development/) practices — fix the methodology before the model.

005 / ROADMAP

## What a four-phase ai roadmap consulting engagement actually ships.

A 12-month AI roadmap that lands in a board pack isn't a slide deck — it's a sequence with named owners, named gates, and named tools. The four phases below are the standard shape; a complex multi-BU engagement carries an extra discovery phase, a narrow single-use-case engagement collapses phase 2 and 3 into one.

1.  01
    
    ### Discovery + landscape read
    
    Sixty-minute exec session to lock the question, then a structured read of the current AI surface — what's in production, what's stalled, what's in vendor demos, what's in the spreadsheet. The output of this phase is a one-page problem statement that everyone on the engagement signs off in writing. Some engagements end here because the right answer is "do nothing yet" — we still ship the memo and bill the phase.
    
2.  02
    
    ### Use-case scoring + vendor longlist
    
    Every candidate use case scored across three axes — business value, technical feasibility, organisational readiness. Scoring rubric shared with the buyer, not run on a private spreadsheet. Vendor longlist assembled per surviving use case — frontier hosted, self-hosted open-weight, vertical SaaS, and the build-it-yourself option each scored on the same rubric. Audit memo's recommendation feeds the scoring inputs.
    
3.  03
    
    ### Build-vs-buy frame + TCO
    
    Explicit build-vs-buy frame for the two or three use cases that survive scoring. TCO modelled across three postures — hosted-frontier (Claude / GPT-5 / Gemini), self-hosted open-weight (Llama 4 / Mistral / Qwen 3 on vLLM), and hybrid routing — with the volume crossover named in months, not vibes. Sensitivity analysis on the three assumptions most likely to change. We share the spreadsheet, not a sanitised summary.
    
4.  04
    
    ### 12-month sequence + exit gates
    
    Twelve-month sequence with named phases, named owners (internal hire, sibling practice, third-party vendor), named exit gates per phase, and named tools — LangGraph, Pinecone, Langfuse, the actual names a procurement team has to put on contracts. Each phase carries a "fail-here-and-pivot-there" branch. The memo is the artefact that survives the engagement; the readout is theatre.
    

Clean handoff is the default. Most roadmaps name a recommended internal hire alongside the vendor sequence — the work that survives this engagement is the practice you build inside, not the consultant you retain.

006 / EVALUATE

## The six vendor categories every roadmap evaluates.

An ai vendor selection consulting engagement isn't a vendor-by-vendor scorecard — it's a category-by-category architectural call. The six families below cover roughly 95% of the recommendations in the roadmaps we've shipped this year. Per family, the audit names the default pick, the cost-floor alternative, and the conditions under which we'd revisit in 12 months. We've run ai vendor selection consulting across all six categories in the last 18 months.

Frontier hosted LLMs (Claude · GPT-5 · Gemini 3)

Strengths

Highest reasoning ceiling and the fastest iteration loop. Claude Opus 4.7 holds the lead on long-context analysis; GPT-5 leads on tool-call latency at scale; Gemini 3 Pro wins on 1M-token retrieval workloads. Pricing has compressed but premium tier still runs $3–15 input, $15–75 output per million tokens.

When We Pick

Greenfield AI roadmaps where time-to-first-value matters more than per-token cost. C-suite-visible builds where the model name is itself a signalling cost. Workloads under ~200M monthly tokens — below that, the frontier price premium is rounding error against engineering salary.

When We Don't

Predictable, high-volume workloads where a tuned <a href="/services/llm-development/">smaller LLM</a> beats frontier on cost by 8–20×. Strict data-residency where the provider's region map doesn't match yours. Vendor-lock anxiety where a board member has already vetoed single-source.

Paiteq Pattern

Audit memos almost always recommend a two-vendor posture — one frontier, one mid-tier — with a routing layer keeping migration friction near zero. Three names beats two for posture and one for cost discipline.

FrontierReasoningMulti-vendor

Self-hosted open-weight (Llama 4 · Mistral · Qwen 3)

Strengths

Total control. Llama 4 405B served on H100s via vLLM hits sub-150ms p50 on most chat workloads at roughly $0.05–0.20 per million tokens amortised. Mistral and Qwen 3 cover the mid-tier; both ship instruct-tuned variants that beat year-old frontier on narrow domains after light fine-tuning.

When We Pick

Workloads above ~500M monthly tokens where per-token economics flip the spreadsheet. Regulated workloads where data residency or model-weight ownership is non-negotiable. Vertical workloads where fine-tuning on customer data unlocks 30–60% accuracy lift over generic frontier.

When We Don't

Sub-50M-token workloads — GPU amortisation kills the math. Teams without MLOps capacity — see <a href="/services/mlops/">MLOps services</a>; running vLLM in production is not a side project. Reasoning-heavy workloads where the frontier ceiling still matters more than cost.

Paiteq Pattern

We recommend self-host when the unit economics cross the line — usually a clear inflection rather than a gradient. Audit memo names the volume threshold and the year it'll be hit, not the year a board member wishes it would be hit.

Self-hostOpen-weightCost-floor

Vector DBs + retrieval (Pinecone · Qdrant · pgvector · Weaviate)

Strengths

Retrieval is where most AI roadmaps actually live or die. Pinecone Serverless cuts ops to near zero at a premium tier. Qdrant self-hosts cleanly on Kubernetes for the team that already runs one. pgvector is the cheapest, lowest-friction choice when Postgres is already in the stack. Weaviate wins on multi-modal retrieval.

When We Pick

Anywhere the AI value proposition requires grounded answers — clinical, legal, regulated, internal-knowledge-rich. Almost every roadmap we ship recommends a <a href="/services/rag-development/">retrieval-augmented generation</a> pipeline as the first build, not an agent.

When We Don't

Pure reasoning workloads with no enterprise knowledge to ground against. Workloads where the answer is already public-internet-shaped — frontier LLM alone usually wins. Tiny corpora under 10k chunks where in-context retrieval beats a vector store.

Paiteq Pattern

Default recommendation: pgvector when the team already runs Postgres; Pinecone Serverless when ops bandwidth is the constraint; Qdrant when data residency requires self-host. We don't recommend Weaviate unless multi-modal retrieval is the headline requirement.

RetrievalGroundedHybrid-search

Agent + workflow frameworks (LangGraph · CrewAI · n8n · Temporal)

Strengths

LangGraph is the 2026 default for state-graph agent orchestration — the only mature framework with proper state-machine semantics. CrewAI ships fastest for role-based shapes if you don't need state-graph control. n8n covers deterministic-plus-AI workflows for ops teams. Temporal is the durable-execution backbone for high-stakes long-running flows.

When We Pick

Multi-step agentic builds with branching state — LangGraph. Workflow automation with LLM-in-the-loop and a non-engineer ops team — n8n. Long-running orchestration with retry semantics — Temporal. We've shipped all four in production over the last 12 months.

When We Don't

Single-turn chatbots — none of these is the right tool; see <a href="/services/chatbot-development/">chatbot development</a>. Toy POCs — direct API calls win on velocity. AutoGen — stalled relative to LangGraph; we no longer recommend it for new builds.

Paiteq Pattern

Roadmap usually pairs LangGraph (agent runtime) with Temporal (durable execution) for builds where retries and human-approval gates matter. n8n shows up when the buyer is non-engineering and the workflow is more deterministic than agentic.

AgenticStatefulDurable

Observability + eval (Langfuse · Braintrust · LangSmith · Inspect)

Strengths

The audit gate every roadmap we ship requires. Langfuse leads OSS observability with traces, prompt versioning, and a usable eval surface. Braintrust dominates closed-source eval workflows. LangSmith is fine if you're already inside the LangChain ecosystem. Inspect AI (UK AISI-backed) is the rigour pick for safety-critical evals.

When We Pick

Every roadmap recommends observability as a day-one cost line, not a phase-three nice-to-have. Most AI pilots fail because nobody knew which prompts were drifting or which tools were silently dropping — instrumentation is the cheapest insurance in the stack.

When We Don't

Toy projects where the eval loop is a human reading 10 outputs. We don't recommend bare logging-without-traces for anything past prototype — it's the false-economy that creates the month-six pilot stall.

Paiteq Pattern

Default recommendation: Langfuse self-hosted for teams that want open-source plus data control; Braintrust for teams with budget and no ops capacity. RAGAS or DeepEval as the eval harness layer regardless of trace backend. Audit memos always price observability in.

EvalTracingDay-one

Voice + multimodal stack (LiveKit · Pipecat · ElevenLabs)

Strengths

LiveKit Agents and Pipecat both land sub-400ms voice turn-take in production. ElevenLabs leads on voice quality; the open-source side (Whisper Large v3, F5-TTS) is closing fast. Vision-LLMs (Claude Sonnet 4.6, GPT-5 Vision, Gemini 3 Pro) cover document understanding without a custom CV pipeline.

When We Pick

Voice agents — support deflection, clinical intake, sales prospecting — where the latency budget is human-conversational. Document understanding at scale where the alternative is a custom vision stack we'd route to <a href="/services/machine-learning-development/">our ML practice</a> instead.

When We Don't

Roadmaps where voice is a CEO whim, not a buyer journey. Vision tasks with extreme accuracy bars (defect detection, medical imaging) — frontier vision-LLM isn't the answer; a fine-tuned vision backbone is.

Paiteq Pattern

We've taken voice agents from POC to production three times this year. Roadmaps prescribe LiveKit + Claude Sonnet 4.6 + ElevenLabs as the default stack; cheaper open-source substitutes priced as a phase-two option.

VoiceMultimodalLatency

007 / ARCHETYPES

## Four strategy archetypes. Roughly all inbound maps to one of these.

Greenfield, Modernise, RPA-Replace, and Acquisition/Board-DD cover roughly 100% of the ai consulting services engagements we've shipped over the last 18 months. Shape determines deliverable, deliverable determines pricing, pricing determines scope. We won't sell you a Greenfield engagement when you're really in Modernise — the framing call is free and we'll route you to the right shape.

   
01

### GREENFIELD

The board has approved an AI budget for the first time. There's no incumbent AI system, no internal champion with battle scars, and no anchor use case picked. Roughly 40% of our AI consulting services engagements start here. The audit memo names the three highest-leverage use cases against capability + value scoring, sequences them, and prices the first six months in detail so finance can sign without re-reviewing.

Pick when

-   First AI budget cycle
-   No prior pilots or pilots all stalled
-   Multiple business units competing for the budget
-   CTO + CFO + COO all needed on the call
-   You've been pitched by three vendors and trust none of them

Skip when

-   Pilot already in production and earning revenue — different shape
-   Vendor already picked — go straight to RFP support
-   Pure model-routing question — that's a 1-week LLM audit instead

Stack

Capability auditUse-case scoringCosted roadmapVendor longlist

02

### MODERNISE

There's an existing data-science team, an existing ML practice, a few notebooks in production, and a deeply uneven track record. The CEO is asking why the team isn't shipping LLM-shaped wins. Roughly 30% of engagements look like this. The audit reads team capability, data hygiene, eval rigour, and deployment posture — the memo is usually a re-org sequenced with a 12-month capability lift, not a vendor swap.

Pick when

-   Existing ML team but no LLM wins in 12 months
-   Data infrastructure exists but eval rigour is thin
-   CTO suspects the team can lift, CEO is impatient
-   Recent leadership change opens the re-org window

Skip when

-   No existing team at all — that's the Greenfield shape
-   The team is shipping wins and the question is just "which one next?" — much narrower roadmap engagement
-   Board has already decided to replace the team — different shape, often AI DD work

Stack

Team auditEval-rigour readRe-org memo12-mo lift plan

03

### RPA-REPLACE

There's a UiPath or Automation Anywhere estate, a rising licensing bill, and a sense that the bots are brittle. Roughly 20% of engagements. The audit memo picks the 10 highest-volume processes, scores each against an AI-modernisation rubric (judgment density, exception rate, unit economics), and sequences a migration that pays itself back inside 12 months. We've shipped this with our <a href="/services/ai-workflow-automation/">AI workflow automation</a> practice in three industries.

Pick when

-   UiPath/AA/Blue Prism licence renewal in the next 12 months
-   Bots breaking on edge cases that humans handle in a minute
-   Operations leadership wants AI but procurement defaults to RPA renewals
-   A few processes where judgment is the bottleneck

Skip when

-   RPA estate is small (<10 bots) — usually cheaper to migrate piecemeal than to consult on it
-   The renewal already lapsed and the team is firefighting — emergency build engagement, not advisory

Stack

Process scoringMigration sequenceVendor lapse memoHybrid AI/RPA

04

### DD / AI-AUDIT

The corp-dev team has a target in flight and needs an AI-specific due-diligence read, or the board is asking the CISO and CTO to put numbers on AI risk and AI maturity. Roughly 10% of engagements but the highest-value tier. Deliverable is a 20-page memo plus a 90-minute board presentation. The work is half technical audit, half narrative — what's defensible, what's a flag, what's an opportunity hiding inside the technical detail.

Pick when

-   Acquisition target with material AI claims
-   Board has asked for an AI posture review
-   Insurance, audit, or regulator pressure asking for an evidence base
-   Investor due-diligence on a pre-revenue AI startup
-   Board-rotation triggering a strategic re-baseline

Skip when

-   Internal team capability review — use Modernise
-   You want a deep technical code-audit — that's a build-side engagement, often via our <a href="/services/llm-development/">LLM development</a> practice
-   You're hoping for a yes — DD memos call it as we read it, including the no-flags

Stack

AI DD memoBoard deckRisk registerCapability score

008 / BUILD VS BUY

## Build vs buy AI — row-by-row on the dimensions that actually matter.

Build-vs-buy is the most-asked question in the audit room and the most-mis-answered in the slide deck. The grid below is the frame we use — nine rows the spreadsheet usually skips. Every roadmap recommendation gets graded against these rows; the call lands in writing with the dissenting view named.

Buy (vendor / SaaS)

Build (custom + your engineers)

Time to first business value

6–14 weeks (custom build)

2–6 weeks (vendor pilot)

Total 24-month spend (mid use case)

$280k–$650k engineering

$180k–$420k licence + integration

Eval and observability ownership

Yours from day one — Langfuse / Braintrust / RAGAS in your repo

Often vendor-owned; export depth varies; some lock-in

Eval ownership is the hidden variable most buyers miss. When your traces, evals, and latency data live in your own repo, you can catch regression before users do — and you own the dataset needed to fine-tune the next model. Vendor-owned observability means you're debugging against a dashboard the vendor also controls.

Model swap (Claude → GPT-5 → Gemini)

1–2 weeks via routing layer

Often blocked by vendor's single-model architecture

Frontier model performance rankings shift every three to six months. A routing layer (LiteLLM, Portkey, or a thin abstraction over the Anthropic + OpenAI SDKs) means a price drop or a capability leap costs you a config change, not a re-platform. Vendor architectures that bake in a single model create switching costs that compound at renewal.

Customisation ceiling

Anything you can write; LoRA fine-tunes on your data

Whatever the vendor's roadmap allows; 6–18 month wait per feature

Data residency + private deployment

Self-host on Llama 4 / Mistral; full control

Depends on vendor; ~half offer single-tenant; few offer self-host

Team capability gain

Engineering org learns AI as it ships

Vendor-dependent; team learns the vendor, not the domain

For most product companies, building is a talent investment as much as a delivery vehicle. Engineers who ship a RAG pipeline or an agent loop in production become your internal AI bench — they spot the next use case, they review vendor claims with real context, and they're harder to lose to attrition than a vendor relationship.

Failure mode

Engineering burn-rate without product progress

Vendor lock-in, roadmap drift, sudden price hike at renewal

Both failure modes are real and roughly equally costly. The build failure is visible early — sprint velocity without shipped value. The vendor failure is invisible until the renewal conversation, when the price has tripled and migration would take longer than the original build would have. The audit's job is to predict which risk is higher for your specific workload.

Where we recommend it

Differentiator workloads — the AI IS the moat

Table-stakes workloads — chat, support deflection, basic RAG

The honest answer is usually both — buy for table-stakes (chat, basic RAG, support deflection), build for the differentiator workloads (the AI IS the moat). Roadmaps name the line per workload.

Where we recommend buy, we score vendors against the audit rubric and shadow demos with the buyer. Where we recommend build, we route the work to the right Paiteq technical pillar — [agentic systems](/services/ai-agent-development/), [retrieval pipelines](/services/rag-development/), [custom LLM apps](/services/llm-development/), [classical ML](/services/machine-learning-development/) — or to a named third party where their fit is better.

009 / CRITERIA

## Six vendor-evaluation criteria the procurement-grade rubric scores.

When a vendor RFP support engagement runs, the rubric is six criteria, scored 1–5, signed off by the buyer at kickoff. No vendor-flavoured spin in the criteria list; no "innovation" or "thought leadership" cells. The criteria below are the ones that actually predict whether the contract pays itself back in 24 months.

-   01
    
    ### Evaluation depth + export
    
    Does the vendor let you export traces, eval runs, and prompt versions in a structured format your team can analyse in Langfuse or Braintrust? "We have a dashboard" is the wrong answer — you need data out, not screenshots in.
    
-   02
    
    ### Model swap + provider routing
    
    Can you swap the underlying LLM — Claude, GPT-5, Gemini, self-hosted Llama 4 — in two weeks of vendor support effort? If the answer is "we're optimised for our chosen model," that's lock-in dressed as performance. Multi-provider is table stakes in 2026.
    
-   03
    
    ### Data residency + sub-processors
    
    Named region, named sub-processors, named DPAs. For HIPAA, EU AI Act, MAS, FINMA — the auditor is going to ask. Half the vendor pilots we re-evaluate have a residency gap the standard contract can't close.
    
-   04
    
    ### Customisation ceiling
    
    How customisable is the model surface — prompts, tools, fine-tuning, retrieval logic? What can the buyer's engineering team change without a vendor PR? "Configurable" usually means "the vendor's roadmap dictates pace"; "customisable" means you write code.
    
-   05
    
    ### Pricing + volume-elasticity
    
    What's the cost at 10×, 50×, and 100× current volume? Vendors love the entry tier; the real question is the renewal-cycle math when usage scales. We model the spreadsheet at the assumed-growth volume, not the current one.
    
-   06
    
    ### Exit clauses + IP terms
    
    What happens to your data, your fine-tunes, your eval set, your custom prompts when you exit? "We export your data" is the floor; the question is what shape the export takes and whether your engineering team can ingest it into a successor system without a six-month migration.
    

We don't take vendor kickbacks. The only money in our P&L is the consulting fee on the engagement. Where a sibling Paiteq practice could plausibly compete with a vendor we'd recommend, we disclose the conflict in writing inside the memo and recommend the option that wins on the rubric — we've recommended against ourselves three times in 2026.

010 / WHERE

## Six advisory shapes across six industries — where we've shipped.

A capability-by-industry heatgrid for the ai consulting services we've actually run, not what the brochure promises. Strength reflects engagements completed; light cells are honest about depth we haven't built.

Function Industry

B2B SaaS

Fintech

Healthcare

Manufacturing

Logistics

Legal

AI Capability Audit

AI Roadmap (12-mo)

Vendor RFP Support

Build-vs-Buy Memo

AI Due Diligence

Board AI Posture

AI Capability Audit

B2B SaaSFintechHealthcareManufacturingLogisticsLegal

AI Roadmap (12-mo)

B2B SaaSFintechHealthcareManufacturingLogistics Legal

Vendor RFP Support

B2B SaaSFintechHealthcareManufacturingLegal Logistics

Build-vs-Buy Memo

B2B SaaSFintechHealthcareManufacturingLogistics Legal

AI Due Diligence

B2B SaaSFintechLogisticsLegal HealthcareManufacturing

Board AI Posture

B2B SaaSFintechHealthcareManufacturingLegal Logistics

Possible fit Good fit Primary vertical

Dark cells: 3+ engagements completed. Medium: 1–2 engagements. Light: scoped but not yet completed. Empty: not yet relevant.

011 / PROCESS

## Six steps. Three weeks. One written memo.

Eval-first, baseline-anchored, ai capability assessment methodology — refined across engagements in SaaS, fintech, healthcare, manufacturing, logistics, and legal. The sequence below is the standard run; complex multi-BU engagements add a week of discovery; narrow single-use-case engagements collapse weeks 2 and 3. The ai capability assessment doubles as the procurement-gating doc when the engagement converts to RFP.

WEEK 1

### Kickoff + landscape read

60-minute exec session to lock the question. Read of the current AI surface — what's in production, what's stalled, what's in vendor demos, what's in the spreadsheet. The question we're answering gets written down before we look at anything technical.

WEEK 1–2

### Capability + data audit

Technical read of the existing AI surface — model choices, retrieval architecture, eval rigour, observability, MLOps posture. Data hygiene audit for the use cases on the table. Half the audits we run surface a leakage or labelling gap that has to close before any new build.

WEEK 2

### Use-case scoring

Every candidate use case scored on three axes — business value, technical feasibility, organisational readiness. The scoring rubric is shared; nothing is graded on a private spreadsheet. Often the highest-value use case isn't the highest-feasibility — that's the tradeoff the memo names.

WEEK 2–3

### Vendor + build path read

For the two or three use cases that survive scoring, an explicit build-vs-buy frame. Vendor shortlist scored against the same rubric the buyer will face in procurement. Build path scoped, costed, and timeline'd against named tools — Claude, GPT-5, LangGraph, Pinecone, Langfuse.

WEEK 3

### Roadmap + TCO

12-month sequence with named phases, named owners, named exit gates. TCO modelled across hosted, self-hosted, and hybrid postures — we share the spreadsheet, not a sanitised summary. Sensitivity analysis on the three assumptions most likely to change.

WEEK 3

### Memo + readout

20-page written memo plus 90-minute exec readout. The memo names the call, the dissenting view, and the conditions under which we'd change our mind. Board-grade artefact — most clients use it as the procurement gating doc downstream.

012 / WHY PAITEQ

## Why teams pick us as their ai consulting company.

-   01
    
    ### Engineers sign the memo
    
    The partner who signs the audit memo is the engineer who can pick up the phone when the build starts. No analyst-to-partner ladder, no slide-deck-only deliverable. Memos name tools, name gates, name failure modes — the kind of writing a build team can execute against without a translation layer.
    
-   02
    
    ### No kickbacks. Zero. Audited.
    
    We don't take referral fees from Anthropic, OpenAI, Google, Microsoft, Pinecone, Temporal, Langfuse, ElevenLabs, or any other vendor we score. The only money in our P&L is the consulting fee on the engagement. Where a sibling Paiteq practice could compete with a vendor we'd recommend, we disclose the conflict in writing and recommend the option that wins the rubric anyway.
    
-   03
    
    ### Fixed scope, fixed fee, written deliverable
    
    Two to six weeks per engagement; no time-and-materials clock; no six-month strategy retainer. The memo is the artefact. The readout is theatre. Roughly 95% of our ai consulting services engagements close within the original scope; the rest convert to a follow-up shape with a separate SOW.
    
-   04
    
    ### Dissenting view named in writing
    
    Every memo names the call, the conditions under which we'd change our mind, and the dissenting view from inside our team. We don't sand the edges off the analysis to land a follow-on engagement. Audits that conclude "no further action" still ship the memo and bill the engagement at the agreed scope — the methodology, not the recommendation, is what you paid for.
    
-   05
    
    ### Roadmap that survives the engagement
    
    Named phases, named owners, named exit gates, named tools. Procurement-ready. Board-ready. The kind of roadmap an internal director can execute against six months after we've left — without picking up the phone — because the artefact carries the analysis, not just the conclusion.
    
-   06
    
    ### Cross-cutting AI estate, not single-modality
    
    AI consulting services here cover the whole estate — retrieval, generation, agents, classical ML, workflow automation, voice. Modality-specific advisory routes to the sibling practice. The strength is integrating across modalities, not selling deeper into one — buyers in multi-modality strategy land in the right pillar by default.
    

013 / SHAPES

## Four ways to start an ai consulting services engagement.

The four shapes above as picker cards. Fixed-scope, fixed-fee, written deliverable. Pick the closest match — the framing call refines if needed.

[

01 / AUDIT ↗

AI Capability Audit

Two to three weeks, fixed scope. Read of your current AI surface, data hygiene, eval rigour, and deployment posture. Deliverable is a written memo plus a 90-minute exec readout. The most common starting point for an ai consulting services engagement.

2–3 wksFixed

](#engage)[

02 / ROADMAP ↗

AI Roadmap

Three to five weeks, fixed scope. 12-month costed plan against scored use cases, vendor shortlist, build-vs-buy framing, and TCO model. The default sequel to an audit. Goes to your board as a board-grade artefact.

3–5 wksFixed

](#engage)[

03 / RFP ↗

Vendor RFP Support

Four to six weeks, fixed scope. RFP authored against the rubric used in the audit. Vendor demos shadowed, scoring sheets filled, reference checks run, contract terms reviewed. We work for the buyer, not the vendor — we sign nothing kicked back.

4–6 wksFixed

](#engage)[

04 / AI DD ↗

AI Due Diligence

Two to four weeks, fixed scope. Acquisition target or board-mandated AI posture read. 20-page memo plus a 90-minute board presentation. We've shipped this on five M&A targets and four board reviews in 2026 so far.

2–4 wksFixed

](#engage)

014 / EVALUATED

## Vendors we've evaluated in audits this year.

Frontier LLMs, agent runtimes, retrieval, observability, and voice — the surface 2026 roadmaps actually touch.

-   Claude Opus 4.7
-   GPT-5
-   Gemini 3 Pro
-   Llama 4
-   Mistral Large 3
-   LangGraph
-   CrewAI
-   Temporal
-   Pinecone
-   Qdrant
-   pgvector
-   Weaviate
-   Langfuse
-   Braintrust
-   LiveKit
-   ElevenLabs
-   Claude Opus 4.7
-   GPT-5
-   Gemini 3 Pro
-   Llama 4
-   Mistral Large 3
-   LangGraph
-   CrewAI
-   Temporal
-   Pinecone
-   Qdrant
-   pgvector
-   Weaviate
-   Langfuse
-   Braintrust
-   LiveKit
-   ElevenLabs

015 / USE CASES

## Where the memos have landed.

Three anonymized engagements. Function, segment, and outcome metric are real; brand removed under NDA.

Healthcare

Multi-state payer · regulated-data shape

### HIPAA-aware audit before a frontier-vendor procurement

Typical shape: a carrier has a frontier-vendor pilot in late-stage procurement and pulls us in for an independent audit. We pressure-test data residency, BAA coverage, and the deployment's ability to close HIPAA gaps inside the contract window. Where the vendor posture can't close, we re-frame the use case as a self-hosted Llama 4 + pgvector RAG build under our <a href="/services/rag-development/">retrieval-augmented generation</a> practice and re-price the roadmap against the vendor licence.

0

Deliverable: -page memo, residency gap register, re-priced roadmap

Fintech

Pre-Series-B regulated lending · EU

### AI due diligence read on a credit-scoring model

Typical shape: investor diligence on a regulated-lending AI startup. We re-evaluate model claims on a leakage-free holdout, score fairness across protected slices, and write the memo named after the call — proceed, re-build, or walk. The deliverable feeds directly into the term sheet and the regulator briefing.

Deliverable: held-out eval report, fairness register, board-grade memo

Logistics

Last-mile delivery · UK + EU

### RPA-replace roadmap against a renewal cliff

Typical shape: a UiPath or Blue Prism estate is approaching renewal and the AI-modernisation question lands at the wrong time. We score every bot process against a structured rubric, recommend per-process actions (migrate to <a href="/services/ai-workflow-automation/">LangGraph + Temporal workflow</a> / retain as classical RPA / retire), and sequence the migration against the renewal calendar.

0

Deliverable: scored process register, sequenced -month migration plan

016 / FAQ

## What buyers ask before signing.

How is AI consulting services from Paiteq different from McKinsey, BCG, or Deloitte?

Different shape, different deliverable. Tier-one strategy houses produce slide decks; we produce written memos signed by engineers who will still be picking up the phone when the build starts. Our AI consulting services engagements run two to six weeks fixed-scope — not the six-month strategy retainers tier-one houses default to — and they end with a costed roadmap that a real engineering team can execute against, including named tools, named gates, and a TCO sensitivity analysis you can hand to procurement. We don't have a 600-person ML practice to push into the answer; that's a feature, not a bug. Where the question is genuinely about org-design across 40,000 people, McKinsey beats us. Where the question is what to actually build and which vendor to actually sign, we beat them roughly nine times out of ten.

Do you sign off on vendor selection — and how do you avoid kickback bias?

Yes. We score vendors against the same rubric the buyer will face downstream in procurement, we sit in on demos, and the memo names the call by name. We don't take referral fees from any of the vendors we evaluate — Pinecone, Anthropic, OpenAI, Google, Microsoft, LangChain, Temporal, ElevenLabs, none of them. The only money in our P&L is the consulting fee we billed you. Where a sibling Paiteq practice could plausibly compete with a vendor we'd recommend — for example our own [RAG development services](/services/rag-development/) versus a vendor RAG product — we disclose the conflict in writing inside the memo and recommend the option that wins on the rubric anyway. We've recommended against ourselves three times in 2026.

When does it make sense to skip consulting and go straight to a build?

When the use case is clean and the vendor selection is already settled, skip us. Examples we'd genuinely route straight to a build: a single-purpose RAG over a known corpus with one buyer-approved vendor; an agent migration where the destination framework is already chosen by the engineering team; a voice agent build where LiveKit is already procured and the question is just whether to use Claude Sonnet 4.6 or GPT-5 for the brain. Where consulting actually earns its fee is when the question itself isn't yet clean — pre-AI greenfield, stalled pilot, RPA renewal pressure, vendor sprawl, or a board asking for a posture read. Buyers who think they're in the first bucket but are actually in the second usually burn $200k–$500k of engineering before realising it. The audit is cheaper.

What's in a written audit memo, and can I see a redacted one?

Twenty pages, give or take. Cover memo with the call and the three dissenting views; capability score across model selection, retrieval, eval rigour, observability, MLOps; data-hygiene findings with named leaks if any; vendor scorecard against the rubric; build-vs-buy frame with TCO modelled across three postures (hosted, self-hosted, hybrid); 12-month roadmap with named phases, named exit gates, and named tools (Claude, GPT-5, LangGraph, Pinecone, Langfuse — actual names, not categories); risk register; sensitivity analysis on the assumptions most likely to change. We can share a redacted memo under NDA — DM us through the contact form and we'll send one inside two business days. The redacted version covers a multi-state US healthcare payer engagement; brand and dollar figures removed, structure and analysis intact.

How do you price an audit, and how is that different from your AI roadmap pricing?

Fixed scope, fixed fee, both shapes. An AI Capability Audit runs two to three weeks; pricing scales with the technical surface — a single-team single-use-case audit lands at the lower end; a multi-BU multi-pilot audit at the upper. An AI Roadmap engagement runs three to five weeks; pricing scales with the number of use cases sequenced and the vendor evaluation depth. Both ship with a written memo and an exec readout. Neither runs on a time-and-materials clock — we don't sell hours; we sell a written deliverable against a fixed scope. We'll quote exact numbers after a 30-minute scoping call; the AI consulting services pricing range is on the higher end of independent advisory and the lower end of tier-one strategy houses, which is roughly where the value sits.

How is this pillar different from your generative AI consulting or LLM consulting work?

Three siblings, three different questions. AI consulting services here covers cross-cutting AI advisory — vendor selection across modalities, build-vs-buy framing across the whole AI estate, 12-month roadmaps that span retrieval and generation and agentic workloads. Generative AI consulting (the advisory wrap on our [generative AI practice](/services/generative-ai/)) is narrower — image, audio, video, brand-controlled generation; LoRA strategy; safety + watermarking posture. LLM consulting (inside our [LLM development practice](/services/llm-development/)) is hosted-vs-self-hosted decisions, fine-tuning strategy, cost engineering on a known LLM workload. If your question is multi-modality and cross-cutting, you're in the right pillar. If it's modality-specific or model-architecture-specific, route to the sibling.

Do you stay on after the roadmap ships, or hand off cleanly?

Clean handoff is the default. Every memo names exit gates and a recommended owner per workstream — sometimes the recommendation is an internal hire, sometimes a Paiteq sibling practice, sometimes a third-party vendor. About 40% of audit engagements convert to a build engagement with us under one of the technical pillars ([AI Agent Development](/services/ai-agent-development/), [RAG](/services/rag-development/), [LLM](/services/llm-development/), or [Machine Learning](/services/machine-learning-development/)); about 30% convert to roadmap then build; about 30% take the memo, hand it to an existing team or a competing vendor, and execute without us. We don't penalise the third path — the memo is a finished artefact in itself. Retainer engagements exist for clients who want us in the room monthly for the first year, but we don't push them by default.

017 / Related practices

## Adjacent services.

[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

RAG DEVELOPMENT

RAG Development

Retrieval-augmented generation systems with evaluation built in.

](/services/rag-development/)

018 / Start an engagement

## Ship an *honest* AI audit in three weeks.

AI capability assessment in 2–3. Enterprise ai strategy roadmap in 3–5. AI vendor selection consulting in 4–6. AI readiness assessment + AI due diligence in 2–4.

[Talk to a partner](/contact/) [See a redacted memo](/contact/?topic=redacted-memo)


---

## SECTION: 4.5. Service: ai-integration

_Source: https://www.paiteq.com/services/ai-integration/_

# AI Integration Services & OpenAI Integration Services — Paiteq

> AI integration services and OpenAI integration services for existing SaaS — Claude API integration, Vertex, Azure OpenAI, abstraction, fallback, cost engineering.

**HTML version:** https://www.paiteq.com/services/ai-integration/

## Key facts

- Providers: OpenAI, Anthropic Claude, Vertex AI, Azure OpenAI.
- Posture: abstraction layer, provider fallback, cost engineering, observability.
- Built for teams plugging AI into existing SaaS rather than greenfield apps.

## Related pages

- [LLM Development](https://www.paiteq.com/services/llm-development/)
- [AI Workflow Automation](https://www.paiteq.com/services/ai-workflow-automation/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI INTEGRATION SERVICES

# *Drop-in* AI for the product you've already shipped.

A drop-in ai integration services engagement built for the SaaS product that already has users, an existing UI, and a release train. OpenAI, Claude, Vertex, Azure OpenAI integration, Bedrock, or a self-hosted Llama 4 — wired through a provider abstraction layer with rate-limit, fallback, cost ceiling, and eval-gated rollout. The drop-in ai ships behind a feature flag and the ai api integration runs through the gateway; we don't re-platform your stack to ship the integration seam that survives the second provider outage.

[Scope an integration](/contact/?topic=ai-integration) [See integration patterns](#patterns)

Default stack OpenAI · Anthropic · Vertex · Bedrock

Gateway LiteLLM / Portkey / in-house Worker

Observability Langfuse · Helicone

Engagements 1–8 weeks · fixed scope

001 / PATTERNS

## Four integration patterns. Every ai integration services engagement maps to one.

Most teams arrive with the right feature idea and the wrong integration shape attached. The four patterns below cover roughly every real ai feature integration we've scoped — drop-in, proxy, sidecar, forked fine-tune deploy. Pick the closest, the kickoff call refines if needed. We won't sell you a sidecar microservice when a two-week drop-in fits, and we won't pretend a drop-in survives the second provider outage when it won't.

   
01

### Drop-in

The lightest integration: one provider, one endpoint, one feature wired into an existing screen or job. Auth keys live in your secret store, requests hit the provider directly, responses render in the existing UI. Two-week scope from kickoff to first user. We ship this when the feature is genuinely additive — a summarisation button, a draft-reply panel, a classify-and-route hook — and the downside of a provider outage is degraded UX, not broken core flow.

Pick when

-   One feature, one workload, one provider
-   SaaS product with an existing screen
-   risk tolerance for direct provider dependency
-   team can ship the prompt + the UI in a sprint

Skip when

-   Mission-critical workload
-   multi-tenant where a noisy tenant can blow your rate-limit
-   cost ceiling matters more than latency
-   you'll need a second provider within six months

Stack

OpenAI SDKAnthropic SDKCloudflare WorkersVercel Edge

02

### Proxy

Your app calls a thin gateway (LiteLLM, Portkey, an in-house Cloudflare Worker, or an internal FastAPI service). The gateway speaks every provider's dialect, enforces per-tenant rate-limits, falls back to a secondary provider on 429 / 5xx, logs every call to Langfuse or Helicone, and caps spend per tenant per day. Cost engineering, observability, and vendor portability all converge on one thin layer. This is the integration shape that survives the second provider outage.

Pick when

-   Multiple providers in play
-   multi-tenant SaaS
-   cost guardrails or per-tenant quotas are a contract requirement
-   you need to swap providers without redeploying every microservice
-   observability is a board-level ask

Skip when

-   Single feature, single provider, single tenant — the proxy adds operational surface that doesn't pay for itself yet

Stack

LiteLLMPortkeyLangfuseCloudflare WorkersRedisPostgres

03

### Sidecar microservice

The AI feature ships as a separate service — Python on Fly.io, a Cloudflare Worker, a containerised FastAPI on the existing Kubernetes — with its own deploy cadence, its own observability stack, its own eval suite, its own provider config. The host app calls it like any other internal API. Lets the AI team ship without blocking the host-app release train, lets eval failures fail the AI service without taking down the parent product, lets you scale the AI tier independently.

Pick when

-   Existing product on a tight release train
-   AI feature has its own eval cadence
-   the AI workload's resource profile differs from the host (long-running, bursty, memory-heavy)
-   team boundary between product and AI engineering

Skip when

-   Single-team SaaS where the overhead of a second service costs more than the AI feature earns
-   tightly-coupled UX where latency budget can't absorb an extra network hop

Stack

FastAPIFly.ioCloudflare WorkersLangGraphvLLMLangfuse

04

### Forked fine-tune deploy

The model itself moves inside your perimeter — Llama 4 70B on H100s via vLLM, a fine-tuned Mistral Small 4 on Bedrock or SageMaker, a LoRA-adapted Qwen 3 on Modal. You pay for compute instead of per-token, you control the upgrade cadence, the data never leaves your cloud account. The integration surface looks similar to the proxy pattern — your app hits an internal endpoint — but the endpoint runs your model, not a vendor's. Cross-link: this is where P3 LLM and P12 MLOps overlap on a real engagement; we own the integration seam, P3 owns model build, P12 owns serving infrastructure.

Pick when

-   Data residency or sovereignty constraint (HIPAA, EU, defence-adjacent)
-   per-token cost dominates spend at scale
-   latency budget needs predictable tail
-   provider's behaviour drift is breaking your evals each release
-   you have a fine-tune that genuinely beats frontier on your task

Skip when

-   Volume too low to amortise GPU rent (under ~3M tokens/day)
-   team can't run a model serving stack 24/7
-   your eval set doesn't yet justify owning the model lifecycle

Stack

vLLMLlama 4Mistral Small 4ModalBedrockSageMaker

If your shape doesn't fit, the framing call is free — DM us with the constraint that matters most (latency, cost, residency, portability) and we'll write back inside a business day with which pattern fits and what it'd cost.

002 / SHIP

## OpenAI integration services, Claude, and Vertex — what an ai integration company actually delivers.

Six deliverables that show up in every production-grade ai integration services engagement. The first two — provider API integration and abstraction layer — are the seams; the other four are the engineering that keeps the seams holding under real traffic.

[

01 / API ↗

Provider API integration

OpenAI, Anthropic Claude, Google Vertex, Azure OpenAI, AWS Bedrock — direct SDK integration with auth, retries, structured output, streaming, function calling. The ai api integration shape that ships first in most ai integration services engagements; later retrofits add abstraction.

SDKStreamingFunctions

](#providers)[

02 / LAYER ↗

Provider abstraction layer

Thin gateway in front of every provider — uniform contract, model-agnostic request shape, fallback ladder, cost telemetry, prompt-format normalisation. LiteLLM or Portkey at the edge, or an in-house Worker. The integration that survives the next provider outage.

LiteLLMPortkeyFallback

](#abstraction)[

03 / LIMITS ↗

Rate-limit + cost engineering

Per-tenant quotas, daily cost ceiling, token-budget-per-request, cache tiers (semantic + exact), burst smoothing, exponential back-off, 429 fallback. Cost ceiling and rate-limit live as code, not as a hopeful chart in the post-mortem.

QuotasCacheBack-off

](#cost)[

04 / FALLBACK ↗

Graceful fallback ladder

Primary → secondary provider → degraded mode → human queue. Wired to health checks, latency thresholds, and 429 + 5xx detection. The drill: a vendor goes down, your product holds. We test this once a month on a calendar invite, not as a tabletop exercise.

Multi-providerHealthDrill

](#abstraction)[

05 / ROLLOUT ↗

Eval-gated feature rollout

Shadow → 1% → 10% → 100%, gated by eval scores + cost telemetry + user-feedback signal. Rollback is a feature flag, not a redeploy. Every gate is a named metric, a named threshold, a named owner. The AI feature ships behind a flag from day one.

ShadowCanaryFlags

](#rollout)[

06 / OBSERVE ↗

Observability + drift watch

Langfuse, Helicone, or LangSmith wired before the first user lands. Per-call cost, latency p50/p95/p99, hallucination scoring, prompt versioning, regression alarms. Production traces feed the eval set monthly so the integration doesn't quietly drift after the launch party.

LangfuseDriftAlarms

](#observability)

003 / PROVIDERS

## OpenAI, Claude, Vertex, Azure, Bedrock, or self-hosted — when each one wins.

The provider question is the second-most-asked in every kickoff (after "what'll this cost"). The honest answer is workload-shaped, not brand-shaped. Below: six providers we integrate against in production, what each one wins on, where each one breaks, and the Paiteq pattern we default to. None of this is a vendor pitch — we don't take referral fees from any provider, and the proxy layer means we can swap a provider for the next engagement without breaking the host app.

Decision cards — strengths · when we pick · when we don't · the pattern we default to.

OpenAI

Strengths

Frontier reasoning with GPT-5; broadest function-calling ecosystem; the realtime Voice API ships sub-400ms turn-take with tool calling baked in; Assistants API for stateful threads.

When We Pick

Function calling is the dominant workload; voice agent needs realtime streaming; team already has OpenAI procurement done; product needs the broadest tooling ecosystem on day one.

When We Don't

Hard data-residency requirement; per-token cost dominates spend at scale; you've been burned by behaviour drift between GPT-5 minor releases and your evals can't absorb it.

Paiteq Pattern

Default for voice + function-heavy workloads. Always paired with eval gates and a secondary provider behind the proxy layer.

GPT-5Realtime APIAssistants

Anthropic Claude

Strengths

Claude Opus 4.7 holds the lead on long-context reasoning and instruction-following on contract-grade documents; Claude Sonnet 4.6 is the production workhorse for tool-use agents; computer-use + Skills + MCP support out of the box.

When We Pick

Long-context retrieval workloads; tool-using agent in production; legal / healthcare / finance where instruction-following on long policy docs is the failure mode; an openai integration services engagement that's hit accuracy ceiling on GPT-5 and needs a second opinion.

When We Don't

Voice-agent realtime workloads (OpenAI's Voice API still leads on turn-take); workloads where the cheapest possible per-token cost matters more than reasoning quality.

Paiteq Pattern

Default for agents and long-context. Most claude api integration engagements pair Opus 4.7 for planning with Sonnet 4.6 for tool calls — cost halves, quality holds.

Opus 4.7Sonnet 4.6MCP

Google Vertex AI

Strengths

Gemini 3.0 Pro at 1M+ tokens of context is genuinely useful for whole-corpus retrieval; Vertex AI integration includes Model Garden (Llama, Mistral, Claude on Vertex), built-in eval tooling, and tight BigQuery + AlloyDB hooks for enterprises already on GCP.

When We Pick

Enterprise already on GCP with BigQuery as the data warehouse; multimodal workloads with video; team needs a single procurement / billing surface across frontier + open-weights; whole-document retrieval over PDFs.

When We Don't

Team is on AWS or Azure and procurement of a third cloud is a six-month exercise; workloads where the per-token cost of Pro at 1M context outruns the value.

Paiteq Pattern

Default for GCP shops and multimodal-heavy workloads. Vertex's serverless deployment handles bursty multi-tenant load well — fewer surprises than self-managed inference at small scale.

Gemini 3.0 Pro1M contextVertex

Azure OpenAI

Strengths

Same GPT-5 family with Microsoft's enterprise procurement, data-residency choice, and Azure private endpoints. The compliance + procurement surface most enterprises pre-clear for. Fewer rate-limit headaches than direct OpenAI for high-volume accounts on committed-throughput tiers.

When We Pick

Microsoft-shop enterprise with Azure as the primary cloud; data residency contractually required; procurement timeline rules out a new vendor; existing Azure spend commitments unlock favourable rates.

When We Don't

Model versions you need haven't landed on Azure yet (latency between OpenAI launch and Azure availability still runs weeks); engineering team prefers OpenAI's faster ecosystem cadence.

Paiteq Pattern

Default for Microsoft-shop enterprises. Every azure openai integration we ship is paired with provider abstraction so the swap to direct OpenAI (or Anthropic) doesn't break the app contract. The azure openai integration also unlocks Microsoft's enterprise support tier when something goes sideways at 3am.

GPT-5 on AzurePrivate endpointResidency

AWS Bedrock

Strengths

Single API across Claude, Llama 4, Mistral, Cohere, Nova; AWS IAM + VPC + KMS native; PrivateLink endpoints; SageMaker integration for fine-tuned model deploy. The model-multiplexer Azure customers want and AWS finally ships.

When We Pick

AWS-shop enterprise; multi-model strategy across frontier + open-weights without a second procurement surface; data plane needs to live entirely inside AWS account boundaries.

When We Don't

Workload needs the absolute latest GPT-5 or Claude release the week it ships (Bedrock typically trails by days to weeks); cross-region deployment is the dominant constraint and Bedrock's regional model availability doesn't line up.

Paiteq Pattern

Default for AWS-shop enterprises and multi-model strategies. Bedrock's provisioned-throughput tier is the cleanest answer for the rate-limit anxiety that drives a lot of inbound ai integration services questions.

Multi-modelPrivateLinkProvisioned

Self-hosted vLLM

Strengths

Llama 4 70B / 405B, Mistral Small 4, Qwen 3, DeepSeek V3 on your own H100 cluster (or rented Modal / RunPod / Lambda). Predictable cost per million tokens (~$0.05-0.20 amortised on dedicated GPU), full data residency, fine-tune lifecycle you own.

When We Pick

Per-token cost dominates spend (>3M tokens/day); data sovereignty is a hard contractual constraint; you have a fine-tune that beats frontier on your eval set; latency tail needs predictable bounds.

When We Don't

Volume too low to amortise GPU rent; team can't run a model serving stack 24/7; workload genuinely needs frontier capability that open-weights can't match yet (most agent + voice + complex reasoning still benefit from hosted frontier).

Paiteq Pattern

Default behind the proxy as a cost-tier route — easy queries go to the self-hosted model, hard queries to hosted frontier. Cross-links: P3 LLM owns model build; P12 MLOps owns serving infra; P10 owns the integration seam between your app and the model.

Llama 4vLLMSelf-host

004 / FIT

## Where each provider wins — workload × provider heat-grid.

Eight production workloads across six providers. Three dots = default pick on quality + ecosystem; two = competitive; one = possible but not the first choice; zero = don't. This is the grid we run in the kickoff when a team asks "should we be on OpenAI or Claude" — the answer almost always depends on which row you're optimising for, not which column you've already done procurement on.

Workload Provider

OpenAI

Anthropic

Vertex

Azure OAI

Bedrock

Self-host

Chat / assistant

Function calling / tools

Long-context Q&A (>200k)

Structured extraction

Vision / OCR

Voice agent (realtime)

Code generation

Embedding / retrieval

Chat / assistant

OpenAIAnthropicVertexAzure OAIBedrockSelf-host

Function calling / tools

OpenAIAnthropicVertexAzure OAIBedrock Self-host

Long-context Q&A (>200k)

AnthropicVertexBedrock OpenAIAzure OAISelf-host

Structured extraction

OpenAIAnthropicVertexAzure OAIBedrockSelf-host

Vision / OCR

OpenAIAnthropicVertexAzure OAI BedrockSelf-host

Voice agent (realtime)

OpenAIAzure OAI AnthropicVertexSelf-host

Code generation

OpenAIAnthropicVertexAzure OAIBedrockSelf-host

Embedding / retrieval

OpenAIAnthropicVertexAzure OAIBedrockSelf-host

Possible fit Good fit Primary vertical

Cell ratings reflect 2026-05 production experience and shift with each model release. We re-score the grid quarterly; the live version is the one in the engagement kickoff deck.

005 / ABSTRACTION

## The provider abstraction layer — six things it owns.

The single deliverable that separates a hopeful integration from one that survives the next provider outage. Whether it's LiteLLM behind a Cloudflare Worker, a Portkey managed gateway, or 600 lines of in-house Python, the abstraction layer owns six things — and the engagement isn't done until all six are wired.

-   01
    
    ### Uniform request contract
    
    Your app calls one shape. The gateway translates to OpenAI, Anthropic, Vertex, or Bedrock dialect. Adding a fifth provider doesn't ripple back into every microservice — the contract is the seam, not the SDK. Most in-house gateways land at 400-800 lines of TypeScript or Python; LiteLLM gives this for free with broad coverage.
    
-   02
    
    ### Fallback ladder
    
    Primary → secondary → degraded → human queue. Routing driven by rolling p95 latency, 429 / 5xx rate, and an explicit health probe — not blind retry. The drill is calendarised monthly. Without this, the first real outage is a 3am Slack thread; with it, the page-fix is the user not noticing.
    
-   03
    
    ### Observability seam
    
    Every call traced to Langfuse, Helicone, or LangSmith — request, response, tokens-in, tokens-out, latency, cost, prompt version. Production traces feed the eval set monthly. The integration that's invisible after launch is the one that quietly drifts on the next model release.
    
-   04
    
    ### Prompt-format normalisation
    
    System / user / assistant messages, tool-use schemas, JSON-mode constraints, multimodal blocks — each provider has its own dialect. The gateway normalises so application code stops caring. Swapping Claude for GPT-5 should be a config flip, not a refactor.
    
-   05
    
    ### Cost telemetry
    
    Per-tenant token spend in Postgres or Redis, exposed to product analytics and to the finance team's dashboard. Daily ceilings, weekly trends, anomaly alerts wired from week one. The cost story is a number the CFO can read on any Monday morning, not a surprise on the next invoice.
    
-   06
    
    ### Vendor swap drill
    
    A scripted exercise: kill the primary provider's key, watch the secondary take over, measure the latency hit, read the degraded UX. Once a month, on the calendar. Catches the staleness in the fallback config that nobody noticed for the last quarter. The drill is the cheapest insurance on the integration.
    

006 / LIMITS

## Rate-limit and cost engineering — the five controls that ship as code.

The most common post-launch own-goal in ai feature integration is a runaway bill from a malformed prompt or a noisy tenant. The fix isn't a dashboard; it's five controls wired into the gateway before launch. Without them the post-mortem reads "we'll add rate-limits this sprint"; with them, the launch story is uneventful.

-   01
    
    ### Per-tenant quotas
    
    Daily and monthly token budgets per tenant, stored in Redis with TTL, checked before every request. When a tenant hits 80%, an alert fires to the account owner; at 100%, requests downgrade to the cheaper model or return a polite 429 the host app can handle. The contract reads cleaner than "we'll cap your usage" — there's a number, in writing, enforced at the gateway.
    
-   02
    
    ### Daily cost ceiling
    
    Account-wide daily dollar cap with an alert at 80% and a hard cut at 100%. Worst case a runaway script burns one day's ceiling, not one month's. The ceiling is set in the contract, exposed in the admin UI, and the alert lands in Slack or PagerDuty — not buried in a CloudWatch tab nobody opens.
    
-   03
    
    ### Token budget per request
    
    No single request can exceed N tokens of completion. Protects against a malformed prompt that asks for "summarise this 80MB document" and runs the bill into four figures on one call. Easy to skip on day one; expensive to add after the first surprise invoice.
    
-   04
    
    ### Cache tiers (exact + semantic)
    
    Exact-match cache for identical requests (the same FAQ summarisation hits 50 times an hour); semantic cache via Redis + a small embedding model for near-duplicates. Typical hit rate lands somewhere in the 15-40% range depending on workload shape — that's a 15-40% straight cut in provider spend, paid back over the lifetime of the feature.
    
-   05
    
    ### Burst smoothing + back-off
    
    A leaky-bucket queue in front of the provider — when traffic spikes past your rate-limit, requests queue rather than 429. Exponential back-off with jitter on retries. Lets a marketing spike degrade gracefully into a longer tail of latency rather than a wall of failed requests.
    

007 / ROLLOUT

## Eval-gated feature rollout — shadow to 100%.

Every ai feature integration ships behind a feature flag from day one. The flag stays in code for 90 days minimum post-launch — when a provider has an outage, the fallback ladder runs through the same flag-gated path. Removing the flag prematurely is the most common own-goal in this category; we keep it open longer than feels comfortable.

WEEK 1–2

### Shadow

Wire the AI feature behind a flag. Production traffic runs both the legacy path and the new path; only the legacy result reaches the user. We log AI output, latency, cost, eval scores for two weeks. Catches regressions before any user sees them.

WEEK 3–4

### 1% canary

Flip the flag for 1% of users — usually internal employees plus a small opt-in cohort. Two-week dwell. Per-call cost, latency p95, eval drift, user-feedback signal all watched against thresholds. Rollback is a flag flip; nobody redeploys at 2am.

WEEK 5–6

### 10% expansion

Tenant + geography + plan-tier filters open up. Two-week dwell. Cost telemetry tightens: per-tenant ceiling, daily cap, alert if any tenant exceeds the budget envelope. This is where rate-limit reality bites — provider quotas, burst smoothing, retry-with-back-off all get exercised.

WEEK 7+

### 100%

Flag is open to everyone. The flag itself stays in code for at least 90 days — when a provider has an outage, the fallback ladder runs through the same flag-gated path. Removing the flag prematurely is the most common own-goal in ai feature integration.

Eval gates between stages: regression vs the prior stage on a 20-80 example task-specific eval set in Inspect AI or Promptfoo. A stage doesn't open until the eval score holds and the cost telemetry is inside the budget envelope.

008 / PICK

## Pick the integration pattern.

Two or three questions usually narrow the pattern down to one. The tree below is the same one we run in the framing call — answer the constraint that matters most and the recommendation falls out. If two constraints tie (cost and residency, say), we'll walk both branches and price each in the kickoff memo.

Click an answer to advance. The terminal is the pattern we'd default to — pricing and scope come in the kickoff.

Path

Question

Pick one

Result

009 / COMPARE

## Drop-in vs proxy vs sidecar vs forked deploy — side-by-side.

Same four patterns from the carousel, rendered as a comparison grid for the procurement spreadsheet. Pull this into the kickoff memo verbatim if it's useful.

### Drop-in vs Proxy

Drop-in

Proxy

Setup time

1–2 weeks

3–5 weeks

Operational surface

Provider SDK only

Gateway + fallback + cache

Vendor portability

Low — direct binding

High — swap at gateway

A direct SDK binding means a provider deprecation or acquisition is a **rewrite event**, not a config change. The proxy's single outward contract lets you swap OpenAI for Anthropic — or add a self-hosted fallback — without touching the host application. Most teams don't feel this until month nine; the ones who wired a proxy at month one don't panic.

Cost shape

Per-token, ungated

Per-token + ceiling at edge

Ungated per-token billing is fine until a malformed prompt or a runaway background job hits the API at 3am. A gateway ceiling — daily spend cap per tenant, per-request token budget, burst smoothing — turns a cost-spike incident into a logged alert. The gating pays for itself on the first abuse event that would otherwise have tripled the monthly bill.

Latency tail

Provider's tail = yours

+30–80ms gateway hop

Data residency

Provider's policy

Policy + log control

With a direct integration, every request-response pair is governed exclusively by the provider's data processing agreement. A gateway layer lets you **strip PII before the payload leaves your perimeter**, log only the fields you're permitted to retain, and route regulated tenants to an EU-region endpoint — all without changing the host application's contract.

Break-even volume

Any

~500k tokens/day

The proxy's operational overhead — gateway infra, fallback drill, telemetry — only amortises above roughly 500k tokens per day. Below that threshold a drop-in integration ships faster and costs less to run. The right call is drop-in now with a proxy retrofit planned as a named line item when volume or vendor-risk conversations hit the board.

Proxy wins as soon as you hit a second provider, a second tenant, or a second outage.

### Sidecar vs Forked deploy

Sidecar

Forked deploy

Setup time

4–8 weeks

8–16 weeks

Operational surface

Sidecar service + own deploys

Inference cluster + model lifecycle

Vendor portability

Medium — sidecar isolates provider

Highest — own the model

Cost shape

Per-token + per-service budget

Per-GPU-hour amortised

Neither shape is cheaper by default — the crossover depends on volume profile. Sidecar per-token billing is cheaper at **bursty, irregular loads** because you pay only for what you use. A forked self-hosted model amortises its GPU-hour cost at **sustained high volume** — the savings show up once utilisation stays above ~60% across the run window.

Latency tail

+50–120ms service hop

You set the tail with your hardware

A hosted provider's p95 latency is outside your control — you inherit whatever tail the provider's fleet produces at peak load. A forked self-hosted model lets you **provision for your exact SLA**: dedicate the right GPU tier, tune the batching parameters, and own the p99. This matters most for synchronous, user-facing workloads where a 300ms tail is a UX problem.

Data residency

Sidecar policy + provider behind

Fully inside your perimeter

A sidecar still routes inference through a hosted provider — your sidecar controls the *application* perimeter, but the LLM call crosses the wire. A forked deploy means **no token leaves your VPC**: weights live on your infrastructure, inference runs on your hardware, and the data processing agreement is with yourself. This is the deciding factor for HIPAA, FedRAMP, and EU-AI-Act Article 28 workloads.

Break-even volume

~1M tokens/day or eval-cadence pressure

~3M tokens/day on hosted frontier baseline

Forked deploys reach cost parity with hosted frontier models only around 3M tokens/day on rented H100s — below that threshold the GPU-hour commitment exceeds per-token billing. Sidecar reaches its break-even earlier (~1M tokens/day) because the added cost is a service hop, not an inference cluster. Eval-cadence pressure is the non-volume trigger: if your team runs nightly eval suites at scale, the sidecar isolation pays off independently of token count.

Forked deploy wins on residency, predictable cost, and latency tail — at the price of running an inference cluster.

010 / USE CASES

## Where ai integration services land in production.

Six typical-shape engagements across SaaS, fintech, healthcare, ecommerce, and edtech. Function, segment, and deliverable shape are real engagement framings; the cards describe scope and shipped artefact rather than client-specific numbers.

PROVIDER INTEGRATION

B2B SaaS · enterprise tier

### openai integration services for a CRM-adjacent workflow

Typical shape: existing product, one summarisation feature against meeting notes, request to ship a Claude fallback when the primary OpenAI integration hits 429. Deliverable: provider abstraction layer (LiteLLM behind Cloudflare Worker), eval set against domain-specific examples, fallback drill calendarised monthly.

Deliverable: working drop-in plus proxy retrofit · eval harness · fallback drill

ABSTRACTION LAYER

Fintech · regulated DACH

### claude api integration with strict cost ceiling

Typical shape: regulated lender adding a draft-decision-rationale feature to an existing underwriting workspace; per-tenant cost cap mandatory; provider must be EU-region; fallback to Azure OpenAI Sweden Central. Deliverable: gateway with per-tenant token budget, daily cost ceiling enforced at edge, Langfuse traces wired before launch.

Deliverable: proxy layer · cost ceiling · per-tenant quotas · regulator-readable audit log

VERTEX INTEGRATION

Healthcare · multi-region

### vertex ai integration for whole-record clinician Q&A

Typical shape: clinical workflow needs to answer questions over the full record, often >300k tokens of context. Gemini 3.0 Pro at 1M context handles the read; Claude Opus 4.7 as the secondary on Bedrock for the cases where Vertex's tail latency spikes. Deliverable: routing layer, eval set with named clinician-graded gold answers, drift alarms.

Deliverable: multimodal retrieval · routing · clinician-graded eval set

SELF-HOSTED RETROFIT

DTC ecommerce · high-volume catalogue

### Llama 4 self-host behind a proxy for description generation

Typical shape: bulk product-description generation runs millions of tokens nightly; hosted frontier cost runs into the high four-figures monthly. Retrofit: Llama 4 70B on vLLM (rented H100s, scaled down out of run-window), routed to from the existing proxy for the workloads where the eval holds.

Deliverable: vLLM serving · proxy route · eval gate against frontier baseline

ROLLOUT

B2B SaaS · multi-tenant

### ai feature integration shipped through a four-stage rollout

Typical shape: a draft-reply feature inside a customer-support inbox; risk tolerance for hallucination is low. Wired through shadow → 1% → 10% → 100% with eval gates at each stage and per-tenant cost ceiling enforced from canary. Rollback is a feature flag; nobody redeploys at 2am.

Deliverable: flag-gated rollout · eval harness · drift watch wired before launch

PROVIDER SWAP

Edtech · single-feature

### Migration off ChatGPT integration to a multi-provider proxy

Typical shape: original chatgpt integration was a direct OpenAI SDK call, a provider outage took the feature down twice in a quarter, the second outage drove the SLA conversation. Deliverable: Portkey proxy retrofit, secondary on Anthropic, fallback drill, no change to the host app's contract.

Deliverable: proxy retrofit · multi-provider fallback · host contract preserved

011 / ENGAGE

## Four ways to start an ai integration services engagement.

Fixed scope, fixed fee, written deliverable. We don't sell hours; we sell the integration seam. The four shapes below cover almost every inbound — Drop-in, Abstraction-Layer Retrofit, Sidecar Build, Provider-Migration Audit. Mixed engagements bill as two consecutive shapes, not an open retainer.

01 DROP-IN API INTEGRATION Fixed scope

1–2 weeks

### One feature, one provider, shipped in two weeks.

In scope

-   60-minute kickoff to lock the feature + provider
-   Direct SDK integration (OpenAI / Anthropic / Vertex / Azure / Bedrock)
-   Auth + secrets wired into the existing secret store
-   Eval set of 20–40 task-specific examples in Inspect AI or Promptfoo
-   Feature flag in the host app — rollout starts at 1% canary
-   Handover doc + 60-minute review session

Out of scope

-   Abstraction layer / multi-provider (Shape 02)
-   Sidecar microservice (Shape 03)
-   Self-hosted model serving (route to P12 MLOps)

02 ABSTRACTION-LAYER RETROFIT Fixed scope

3–5 weeks

### Existing integration → multi-provider proxy with fallback + cost ceiling.

In scope

-   Audit of the current integration shape + provider-binding surface
-   Gateway build (LiteLLM, Portkey, or in-house — pick at kickoff)
-   Fallback ladder + monthly drill schedule
-   Per-tenant quotas + daily cost ceiling enforced at edge
-   Langfuse / Helicone / LangSmith wired (pick at kickoff)
-   Cutover plan with rollback to the prior direct integration

Out of scope

-   New AI feature build (Shape 01)
-   Sidecar architecture (Shape 03)
-   Production-ops platform at scale (route to P12 MLOps)

03 SIDECAR INTEGRATION BUILD Fixed scope

4–8 weeks

### AI feature as a standalone service alongside the host app.

In scope

-   Sidecar service in Python or TypeScript on the host's existing runtime
-   Independent deploy + eval cadence
-   Own observability stack (Langfuse / Helicone)
-   Provider abstraction layer baked into the sidecar
-   Contract with the host app documented + versioned
-   Runbook for the host-app team to consume the sidecar

Out of scope

-   Host-app changes beyond the API contract
-   Net-new agentic workflows (route to P1 Agent)

04 PROVIDER-MIGRATION AUDIT Fixed scope

2–4 weeks

### Decision memo + cutover plan for moving providers.

In scope

-   Audit of the current provider integration + eval baseline
-   Candidate-provider longlist scored against the current eval set
-   Cost + latency + residency modelled for each candidate
-   Cutover sequencing with named rollback gates
-   Risk register across IP, data residency, behaviour drift
-   Procurement-ready recommendation memo

Out of scope

-   The cutover itself (route to Shape 01 or 02 after the audit)
-   Ongoing retainer (separate engagement)

012 / STACK

## Vendors we integrate against in production.

Frontier providers, gateway tooling, observability, and self-hosted serving — the surface a real ai integration company touches every week.

-   OpenAI
-   Anthropic
-   Google Vertex
-   Azure OpenAI
-   AWS Bedrock
-   LiteLLM
-   Portkey
-   Langfuse
-   Helicone
-   vLLM
-   Modal
-   Cloudflare Workers
-   OpenAI
-   Anthropic
-   Google Vertex
-   Azure OpenAI
-   AWS Bedrock
-   LiteLLM
-   Portkey
-   Langfuse
-   Helicone
-   vLLM
-   Modal
-   Cloudflare Workers

013 / WHY PAITEQ

## Why teams pick us as their ai integration company.

-   01
    
    ### Engineers ship the integration
    
    The partner who signs the scope is the engineer who writes the abstraction layer. No analyst-to-engineer ladder, no slide-deck-only deliverable. The handover is a pull request, not a presentation.
    
-   02
    
    ### No vendor kickbacks
    
    We don't take referral fees from OpenAI, Anthropic, Google, Microsoft, AWS, Portkey, LiteLLM, Langfuse, or any other vendor we recommend. The only money in our P&L is the engagement fee. Provider recommendations follow the eval, not the rebate sheet.
    
-   03
    
    ### Fixed scope, fixed fee, written deliverable
    
    One to eight weeks per engagement; no time-and-materials clock; no open-ended retainer. The integration is the artefact. The handover doc names the gateway, the providers, the eval set, and the fallback drill cadence.
    
-   04
    
    ### The fallback drill is calendarised
    
    Monthly outage simulation, every active integration. Most ai integration services teams ship the fallback config and never test it; ours runs the full ladder on a calendar invite so the staleness gets caught before a real outage finds it.
    
-   05
    
    ### Eval set ships with the integration
    
    Inspect AI or Promptfoo, 20-80 task-specific examples, versioned in your repo, gating every rollout flag. Without the eval the integration ages out the week after the next model release. With it, the regression catches itself in CI.
    
-   06
    
    ### Cross-cutting AI estate, not single-provider
    
    We integrate against every major provider in production every quarter — OpenAI, Anthropic, Vertex, Azure, Bedrock, and self-hosted on vLLM. The recommendation in each engagement comes from current-quarter experience, not last year's blog post.
    

014 / FAQ

## What buyers ask before signing an ai integration services contract.

What's the difference between AI integration services and AI development services?

Integration assumes you have a product already. There's an existing UI, an existing data model, an existing release train, an existing team — and the question is how to add an AI feature into that surface without re-architecting the product. The deliverables look different too: an integration engagement ships an **abstraction layer**, a **rate-limit posture**, a **fallback drill**, and an **eval-gated rollout**. A development engagement ships a new AI app from scratch — different shape, different scope, different risk profile.

If you're building the AI product itself (the chat, the agent, the retrieval pipeline as the central UX), route to [LLM development](/services/llm-development/) or [AI agent development](/services/ai-agent-development/). If you have a SaaS product and you want to add Claude or GPT-5 features without re-platforming, that's the ai integration services shape and you're on the right page.

OpenAI integration services or Claude API integration — which provider do you default to?

It depends on the workload, not the brand preference. We default to **OpenAI** for function-calling-heavy agents, voice realtime workloads, and broadest tool-ecosystem needs — GPT-5 plus the Realtime API still leads on those shapes. We default to **Anthropic Claude** for long-context retrieval, contract-grade instruction-following, and tool-using agents where Sonnet 4.6's behaviour is more predictable across runs.

In practice, most production integrations end up with both behind a proxy. Easy traffic routes to Haiku 4.5 or GPT-5 mini; hard traffic to Opus 4.7 or GPT-5; fallback ladder configured against 429 and 5xx. Single-provider integrations are fine for two-week MVPs; they age badly once the first outage hits.

How long does a typical ai integration services engagement take?

The four shapes we ship cover most of the inbound. **Drop-in API integration** runs 1-2 weeks — single feature, single provider, single tenant. **Provider abstraction layer build** runs 3-5 weeks — gateway, fallback ladder, cost telemetry, observability wired before launch. **Sidecar service build** runs 4-8 weeks — separate service for the AI feature with its own deploy and eval cadence. **Forked fine-tune deploy** runs 8-16 weeks because it pulls in P3 LLM (model build) and P12 MLOps (serving infrastructure) as adjacent practices.

Every engagement starts with a 60-minute scoping call. If the surface looks too narrow or too broad to map cleanly to one shape, we say so before any contract gets signed.

Do you build a provider abstraction layer in-house or use LiteLLM / Portkey?

Depends on three signals: tenant count, regulatory posture, and how much of the contract is provider-specific. For small-to-mid SaaS teams under 50 tenants with relaxed compliance needs, **LiteLLM** behind a Cloudflare Worker is usually the right call — Apache 2.0, broad provider coverage, low operational surface. For multi-tenant SaaS with quotas, observability requirements, and a compliance team in the loop, **Portkey**'s managed gateway earns its fee. For regulated enterprises where the gateway itself needs to live inside a VPC with audit-logged config changes, we build an in-house thin gateway in TypeScript or Python — usually 400-800 lines of code that owns the contract, the fallback ladder, and the cost telemetry.

None of these are "the right answer" globally. The wrong shape is the one that doesn't fit your compliance + ops surface.

What about rate-limit and cost engineering — what does that actually mean?

It means the cost ceiling and the rate-limit live as code. Concretely: **per-tenant token budgets** stored in Postgres or Redis with TTL, checked before every request; **daily cost ceiling per tenant** with an alert at 80% and a hard cut at 100%; **per-request token budget** (e.g. "no single request can exceed 8k tokens of completion") to prevent runaway costs from a malformed prompt; **exponential back-off** with jitter on 429 and 5xx; **cache tier** — exact-match cache for repeated identical requests, semantic cache (Redis + a small embedding model) for near-duplicates; **burst smoothing** via a leaky-bucket queue when provider rate-limits would otherwise drop traffic.

Without these, the post-launch story usually goes: the feature ships, a customer abuses it accidentally, the bill triples, the team panics, the feature gets pulled until the rate-limit work that should've been week 1 finally happens. Better to do it before launch.

Can you integrate AI into a product without changing the existing tech stack?

Mostly yes. The integration surface is usually three places: an **auth-and-secrets seam** (provider API keys live in your existing secret manager — Vault, AWS Secrets Manager, GCP Secret Manager, Doppler — not in a new system), a **network egress seam** (calls to the provider's endpoint go through your existing egress controls), and a **data-plane seam** (request and response payloads pass through your existing logging / audit pipeline). All three commonly slot into the existing stack — Node, Python, Go, Ruby, Java, .NET — using the provider's SDK or a thin HTTP client.

What does sometimes change: observability gets a new tier (Langfuse or Helicone next to the existing APM), the gateway adds a hop (Workers / FastAPI service / Portkey), and the eval suite is genuinely new tooling (Inspect AI, Promptfoo, RAGAS) because most existing test infrastructure doesn't grade LLM output.

How do you handle provider outages without breaking the product?

A fallback ladder, wired before launch, drilled monthly. The shape: primary provider (e.g. OpenAI GPT-5) → secondary provider (e.g. Claude Sonnet 4.6) → degraded mode (e.g. a deterministic non-LLM path that returns a "this feature is temporarily reduced" UX) → human queue (e.g. the request lands in a support inbox). The gateway routes based on health checks (rolling latency p95, 429 / 5xx rate, explicit health endpoint), not blind retry — blind retry on a degraded provider just stacks the same failure twice.

The drill: a calendarised monthly exercise where we simulate the primary provider being down (kill the key, blackhole the endpoint, set the gateway health check to fail), watch the fallback chain run through every hop, measure the latency hit, and read the user-facing UX in degraded mode. Most production-grade ai integration services engagements ship this; most that don't, hit their first real outage as a Slack-channel-3am-fire-drill.

Where does ai integration services overlap with workflow automation, MLOps, or AI agent development?

Integration is the seam between your existing product and an AI model. [Workflow automation](/services/ai-workflow-automation/) is the seam between your existing tools (CRM, ERP, ticketing, data warehouse) and event-driven orchestration with LLM-in-the-loop — different shape, different deliverable. [MLOps](/services/mlops/) is the production-ops layer (model serving, drift detection, feature stores, observability platform) that the integration plugs into at scale. [AI agent development](/services/ai-agent-development/) is when the AI is the product, not a feature inside another product — multi-step tool-using agent with planner / executor / verifier separation.

Most real engagements straddle two of these. We're explicit about the seam: P10 owns "we have a SaaS product, we want to add AI features"; P3 owns "we're building the AI app from scratch"; P12 owns "the AI feature is in production, monitor and operate it." When an engagement spans, we name the seams in writing before scope gets locked.

Do you ship the eval set as part of the integration, or is that separate?

Part of the integration, always. An ai feature integration without an eval set is a one-shot demo that ages out the week after a provider releases a new minor version and the behaviour drifts. The eval set lives in **Inspect AI** or **Promptfoo**; it gets versioned in your repo; it runs in CI on every model upgrade; it gates the rollout flag flips at 1% → 10% → 100%. We ship a starting set of 20-80 examples graded against the actual task — task-specific, not generic — and a process for adding to it from production traces monthly.

Without the eval set, the failure mode is silent regression: the model changes, the integration still returns 200s, and a user-facing quality bug shows up three weeks later in a support ticket. The eval set is the cheapest insurance you can buy on an integration.

015 / Related practices

## Adjacent services.

[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

AI CONSULTING

AI Consulting

AI strategy, audits, roadmap.

](/services/ai-consulting/)

016 / Scope an integration

## Ship the integration seam that *survives* the second outage.

Drop-in API integration in 1–2 weeks. Abstraction-layer retrofit in 3–5. Sidecar integration build in 4–8. Provider-migration audit in 2–4.

[Scope an integration](/contact/?topic=ai-integration) [See integration patterns](#patterns)


---

## SECTION: 4.6. Service: ai-migration

_Source: https://www.paiteq.com/services/ai-migration/_

# Legacy Software Modernization Services — Paiteq

> Legacy software modernization services and AI transformation services. Migrate RPA, chatbots, and monoliths to AI-native stacks. Eval-validated cutover.

**HTML version:** https://www.paiteq.com/services/ai-migration/

## Key facts

- Migrations: RPA → agent + LLM, rule-bot → RAG chatbot, monolith → AI-native services.
- Eval-validated cutover — old and new systems graded against the same eval set before traffic shifts.

## Related pages

- [RPA Development](https://www.paiteq.com/services/rpa-development/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

LEGACY SOFTWARE MODERNIZATION SERVICES

# Modernize legacy systems to an *AI-native* stack.

A legacy software modernization services engagement for teams whose RPA bots, support chatbots, or monoliths cost more to maintain than they return. We migrate to a LangGraph or LLM-agent replacement, run both systems in parallel, pass four cutover gates before users see the AI side, and write the decommission runbook so the legacy actually gets turned off.

[Scope a modernization](/contact/?topic=ai-migration) [See migration paths](#paths)

Migration shapes RPA · Chatbot · Monolith · Pipeline

Target stack LangGraph · Claude · GPT-5 · FastAPI

Cutover gates Accuracy · Latency · Error · Rollback

Engagements 6–24 weeks · fixed scope

001 / SERVICES

## Legacy software modernization services — four shapes a real engagement ships.

A legacy software modernization services engagement isn't one thing. The four shapes below cover roughly every legacy application modernization services scope we've taken on — bot estate replacement, chatbot rewrite, monolith carve-out, pipeline reshape. Every shape ships with an eval set, a tested rollback, and a decommission runbook.

[

01 / RPA → AGENT ↗

RPA bots to tool-calling agents

UiPath, Automation Anywhere, Blue Prism estates replaced by LangGraph or AutoGen agents with tool-use. The bots that handle easy cases stay; the ones generating 15%+ exceptions get rebuilt.

LangGraphUiPath exit

](#paths)[

02 / CHATBOT → LLM ↗

Rasa, Watson, DialogFlow to LLM agents

Intent-trees that took eighteen months to label get replaced by a Claude Sonnet 4.6 or GPT-5 agent with RAG over the same support corpus.

Claude 4.6RAG retrofit

](#paths)[

03 / MONOLITH → AI ↗

Django, Rails, Java monolith to AI services

Carve the legacy monolith into FastAPI microservices with an AI routing layer in front. The four-week deploy cycle shortens because the AI surface ships independently.

FastAPIStrangler-fig

](#paths)[

04 / PIPELINE → ML ↗

Airflow ETL to a feature platform

Airflow DAGs with hand-rolled feature SQL reshape around Feast and Vertex AI Pipelines, or Kubeflow on-prem. Features become versioned artefacts three teams can reuse.

FeastVertex AI

](#paths)

002 / PATHS

## The four legacy-to-AI transition paths — when each one fits.

Most teams arrive with the right legacy problem and the wrong migration shape attached. The four paths below cover the bulk of legacy modernization company inbound. Cards name the trigger conditions, timeline shape, and the Paiteq pattern. If your shape doesn't fit, the framing call's free.

Decision cards — strengths · when we pick · when we don't · the pattern we default to.

RPA → Agent

Strengths

UiPath, Automation Anywhere, Blue Prism estates replaced by LangGraph or AutoGen agents with tool-use and an eval set graded by the same ops leads who own the existing bots.

When We Pick

Bot maintenance cost is climbing; exception rate sits above 15% on a representative slice; a recent bot break took longer to fix than the last three combined.

When We Don't

Bots are stable, exception rate is under 5%, and the workload is genuinely rule-shaped — we'll say so on the kickoff call.

Paiteq Pattern

Migrate the top three exception-generating bot families first; keep the rest.

8–16 weeksPer-bot eval

Chatbot → LLM Agent

Strengths

Rasa, Watson, DialogFlow, Microsoft Bot Framework replaced by a Claude Sonnet 4.6 or GPT-5 agent with RAG over the existing support corpus. Intent labelling work goes in the bin.

When We Pick

CSAT on bot traffic sits under sixty; intent coverage caps below seventy; the bot escalates more than thirty percent of routine flows.

When We Don't

Bot has a narrow domain with high accuracy and the cost of changing it exceeds the cost of leaving it.

Paiteq Pattern

Build RAG retrieval first against tickets and KB articles, then layer the agent. Shadow-mode for at least two weeks.

6–12 weeksClaude 4.6 / GPT-5

Monolith → AI services

Strengths

Django, Rails, or Java monolith carved into FastAPI microservices with an AI routing layer in front. Strangler-fig pattern; monolith stays running while traffic shifts.

When We Pick

Deploy cycles exceed four weeks and the ML team's queued behind unrelated monolith changes; access to production data takes a week.

When We Don't

Monolith is small, deploys often, and the team is happy with it — carving is overhead we won't sell for its own sake.

Paiteq Pattern

Carve the highest-AI-affinity endpoint first — usually retrieval, classification, or extraction.

12–24 weeksFastAPI

Data pipeline → ML platform

Strengths

Airflow plus hand-rolled SQL reshape around Feast or Tecton plus Vertex AI Pipelines, SageMaker Pipelines, or Kubeflow on-prem. Features stop being copy-pasted across notebooks.

When We Pick

Model training cycle is longer than three days; feature reuse sits under twenty percent; data and ML teams have parallel definitions of the same feature.

When We Don't

Single model in production with one team owning it end-to-end — a feature platform is overhead at that scale.

Paiteq Pattern

Feature store first, then training surface, then serving. The order matters: feature contract is what every other tier reads.

10–18 weeksFeast / Vertex

003 / PROCESS

## Our AI modernization process — Assess, Architect, Migrate, Validate.

Four phases. Two-week Assess at the front, four cutover gates at the back. Assess ends in a one-page modernization shape; if the migration doesn't survive Assess we say so before scope locks. Migrate is where parallel-run earns its keep — both systems live, only legacy reaches users until the gates hold.

WEEK 1–2

### Assess

Inventory the legacy end-to-end — code, data contracts, integrations, on-call runbooks, the spreadsheet someone keeps on a personal drive that turns out to be load-bearing. The deliverable is a one-page modernization shape and a risk register.

WEEK 2–4

### Architect

Design the target AI stack against contracts surfaced in Assess. Pick the agent framework, the model surface, the data contracts, and the eval set that gates cutover. The output is a runnable spec, not a slide deck.

WEEK 4–N

### Migrate

Build the AI replacement alongside the legacy. Both systems run against production traffic in shadow mode; only the legacy reaches users. Regressions caught before any user sees them. Usually two to four weeks of parallel run.

WEEK N–N+2

### Validate & cut over

Run the four cutover gates. Route a small slice of real traffic, watch telemetry, expand. Legacy decommission goes on a calendar with a named owner — not the Friday after launch.

004 / CUTOVER

## Eval-validated cutover — four gates the AI replacement clears before users see it.

The gap we keep finding on legacy modernization company pages: nobody describes how the new system gets validated before cutover. The four gates below are the answer. Each names a metric, a threshold, a methodology, and a fail state. A gate that doesn't hold means the migration extends — we'd rather ship six weeks late than ship the regression silently.

Accuracy parity · latency budget · error budget · tested rollback. Every gate has a named owner and a fail state — the fail state isn't 'we'll talk about it'.

1.  01 Accuracy parity
    
    ≥95% task-completion vs legacy baseline
    
    Domain-expert-graded eval set of 200–500 real historical cases. Same input runs through legacy and AI replacement; results scored by the ops leads who'll own the system.
    
    If parity drops under 90% on any cohort, we don't ship. Migrate extends and the prompts, retrieval, or tool boundary rework until the gate holds.
    
2.  02 Latency budget
    
    p95 ≤ legacy p95 + 20% (or contractual SLA)
    
    Forty-eight-hour parallel run on production traffic. Latency measured at the user-visible seam — not just the model call. Includes retrieval and the full agent loop.
    
    Over budget means a smaller model on the hot path or a faster cache, or it doesn't ship. We won't trade a four-second response for a smarter one if the legacy was a half-second.
    
3.  03 Error budget
    
    ≤0.5% error rate over 48h parallel run
    
    Error rate on the AI side measured against the legacy's error definition — not a new looser one. Includes hallucinations, malformed tool calls, schema breakages.
    
    Over the threshold means the eval set was wrong or the retrieval was thin. We trace it; we don't paper it.
    
4.  04 Rollback ready
    
    Tested rollback in under 10 minutes, runbook signed
    
    Feature flag or routing rule cuts traffic back to legacy in under ten minutes. Runbook reviewed line-by-line by the team that'll own the page at 3am. Tested once before cutover and monthly after.
    
    Rollback longer than ten minutes in the dry run means cutover doesn't open. Without a tested rollback, the migration's a one-way bet.
    

005 / DELTAS

## AI transformation services — what changes, what stays, on a real engagement.

The question we hear most after "how long" is "what's still ours afterwards". On an ai transformation services engagement the answer's mostly "most of it" — business logic, team, monitoring stack, workflow boundaries. What changes is where rules live, how quality gets graded, and whether data contracts are explicit. An honest ai transformation consultant maps the deltas before the contract.

Legacy stack

AI-native stack after migration

Business logic rules

RPA bot sequences, intent trees, monolith conditionals

System prompts, tool definitions, deterministic gates around LLM steps

Workflow orchestration

UiPath Orchestrator, Airflow, Bot Framework runtime

LangGraph state machine, Temporal for long-running, or in-house

UiPath Orchestrator and Airflow are step-sequencers — they execute a fixed DAG. LangGraph runs a **stateful decision loop**: the agent picks the next tool call based on what just came back, not on a pre-drawn path. That loop is what makes multi-step exception handling tractable; a step-sequencer requires a developer to anticipate every branch at design time, which is exactly the cost the RPA estate was accumulating.

Data contracts

Implicit — field names, COBOL exports, undocumented CSVs

Explicit — versioned schemas, validated at the seam

Implicit contracts are the most common reason migrations slip timeline. The CSV the legacy bot consumed had a column order that drifted twice in three years, both times fixed silently. An LLM agent surfacing that ambiguity at inference time will produce wrong answers confidently rather than throwing an exception. **Versioned schemas validated at the seam** catch the drift before the agent sees bad data — the contract is the integration test.

Eval and quality grading

Manual QA, sampled spot checks, a Confluence page nobody reads

Inspect AI or Promptfoo, 200–500 graded examples, in CI

The legacy system had no eval because it was deterministic — the same input always produced the same output. An LLM agent is non-deterministic, so a prompt change that improves the common case can silently regress an edge case that manual spot-checks never touched. **200–500 domain-expert-graded examples in CI** means every prompt change runs against the full eval set before merge — the same discipline code has had for decades, applied to model behaviour.

Monitoring

Datadog or New Relic on infra; nothing on model behaviour

Same APM plus Langfuse or Helicone on every model call

Team and on-call

Trained on legacy bot console, intent editor, monolith ORM

Same team plus a 2–4 week shadow period and a new runbook

Legacy license + infrastructure

Active — UiPath seats, Watson contract, database licenses

Decommissioned on a calendar runbook with a named owner

The migration cost case usually lives in the license turnoff, not the build. UiPath enterprise seats run $8k–$15k per bot per year; Watson Orchestrate contracts are typically six-figure annual commitments. **A named owner and a calendar date** are the difference between a license that gets cancelled and one that auto-renews because the person who knew about it left the team. The decommission runbook is why this column wins.

Most teams keep their operators, their monitoring vendor, and their business-logic intent. What changes is the surface — and the eval set and rollback runbook the legacy never had. Cross-link: [add AI to your existing apps without replacing them](/services/ai-integration/) if the legacy is doing its job and the question is just "add features".

006 / RISKS

## Legacy modernization risks — and how we manage them.

Three failure modes show up on roughly every legacy software modernization engagement. None are surprising; all are routinely under-resourced. The mitigations below get wired into Assess so they don't show up as a 3am page after cutover.

-   01 / DATA
    
    ### Legacy data contracts the AI can't read
    
    The biggest hidden cost is the contract nobody wrote down — XML payloads with optional fields that aren't optional, COBOL exports with drifting header rows, Sybase views the warehouse team's been meaning to deprecate. **Mitigation**: explicit data-contract mapping during the two-week Assess, versioned schemas validated at the seam, contracts owned by the data team.
    
-   02 / REGRESSION
    
    ### Edge-case divergence during parallel run
    
    The AI replacement matches legacy on the obvious cases and quietly diverges on cases nobody thought to test. **Mitigation**: an eval set of 200–500 real historical cases graded by the ops leads who'll own the system — not a sampled fifty, not synthetic. Inspect AI or Promptfoo as the harness; the gold-set lives in the repo and runs in CI on every prompt change.
    
-   03 / ADOPTION
    
    ### Operators trained on the legacy surface
    
    The team editing Watson intent trees for three years doesn't automatically know how to read a Langfuse trace; the bot ops team doesn't automatically triage a LangGraph state machine. **Mitigation**: a two-to-four-week shadow period where operators read AI output before users see it, a runbook for the new on-call, a fallback path the operators are trained on.
    

007 / SHAPES

## Typical engagement shapes for legacy software modernization services.

Three archetypes that cover the bulk of inbound. The framing names segment, deliverable, and timeline shape — not invented metrics. Our team has shipped engagements in adjacent practices for a decade; the Paiteq brand is new, the experience isn't, and we'd rather lead with the shape than borrow a number we can't stand behind.

RPA → AGENT

Enterprise ops · 30+ UiPath bots

### Legacy software modernization services for a brittle UiPath estate

Typical shape: thirty-plus bots, the top five generating the bulk of the exception queue, two recent breakage incidents pushed bot maintenance into the board pack. Deliverable: a LangGraph tool-calling agent replacing the highest-exception bot family, eval rubric graded by ops leads, shadow-mode parallel run, decommission runbook.

Deliverable: LangGraph agent · eval rubric · cutover runbook

CHATBOT → LLM

Support ops · Watson or DialogFlow · CSAT < 60

### AI transformation services for a stalled support chatbot

Typical shape: support leadership debating 'add fifty intents' versus a rewrite for two quarters, escalation rate above thirty percent on routine flows. Deliverable: a Claude Sonnet 4.6 agent with RAG over the existing support corpus, eval surface graded by support leads, shadow period, fallback to a deterministic FAQ during outages.

Deliverable: Claude agent · RAG corpus · graded eval · fallback path

MONOLITH → AI

Platform engineering · Django or Java monolith · 4+ wk deploys

### Legacy application modernization services for a monolith blocking AI

Typical shape: monolith is the source of truth for customer data, the ML team's queued behind unrelated changes, deploys require a release train owned elsewhere. Deliverable: a FastAPI microservice carve-out for the highest-AI-affinity endpoint, AI routing layer, eval-gated cutover, on-call runbook for the new service. The monolith stays running; the AI surface deploys on its own cadence.

Deliverable: FastAPI carve-out · AI routing · eval gates

008 / PICK

## AI migration or AI integration — which one fits your situation.

A short decision flow for the question most teams arrive with. Two answers in, the tree lands on a named migration shape — or routes you to the integration practice instead. Bring an ai transformation consultant into the framing call; the wrong shape costs a quarter, and a legacy application modernization services scope picked from a website FAQ is the most expensive bet you can make.

Path

Question

Pick one

Result

009 / STACK

## The destination stack on a typical AI migration.

Frameworks, models, orchestration, and the eval harness — the surface that replaces the legacy estate at the end of a modernization engagement.

-   LangGraph
-   AutoGen
-   Anthropic
-   OpenAI
-   Temporal
-   FastAPI
-   Feast
-   Vertex AI
-   Kubeflow
-   Inspect AI
-   Promptfoo
-   Langfuse
-   LangGraph
-   AutoGen
-   Anthropic
-   OpenAI
-   Temporal
-   FastAPI
-   Feast
-   Vertex AI
-   Kubeflow
-   Inspect AI
-   Promptfoo
-   Langfuse

010 / FAQ

## What buyers ask before signing a legacy software modernization contract.

How do we choose between AI migration and adding AI to our existing system?

If the legacy is genuinely blocking new AI capability — exception rates climbing, deploy cycles holding up the ML team, intent trees that need eighteen months of labelling to extend — that's a migration. If the legacy works and the question is "how do we add a Claude or GPT-5 feature alongside it", that's [drop-in AI for the product you've already shipped](/services/ai-integration/). About a third of inbound that starts with "we need to modernize" ends up as integration scope; an honest ai transformation consultant says so on the kickoff call rather than selling the larger engagement.

What legacy systems can you modernize to an AI-native stack?

The four shapes on this page cover most of it. **RPA estates** on UiPath, Automation Anywhere, Blue Prism, or Power Automate. **Chatbots** on Rasa, Watson Assistant, DialogFlow, or Microsoft Bot Framework. **Monoliths** on Django, Rails, or Java. **Data pipelines** on Airflow with hand-rolled feature SQL.

Rarer scopes we've taken: COBOL extracts feeding a Sybase warehouse that needed to land in a feature store, a custom Java rules engine that became a Claude agent with a deterministic gate. If your stack isn't on the list, the framing call is free.

How long does a typical legacy software modernization engagement take?

The shape sets the timeline. **RPA → Agent** runs eight to sixteen weeks. **Chatbot → LLM agent** runs six to twelve weeks. **Monolith → AI services** runs twelve to twenty-four because strangler-fig takes time. **Pipeline → feature platform** runs ten to eighteen.

Every engagement opens with a two-week Assess that ends in a one-page modernization shape and a risk register. If the shape doesn't survive Assess, we say so before scope locks — better a paid two-week framing than a sunk twelve-week build.

How do you prevent production regression during the AI cutover?

The four cutover gates above — accuracy parity, latency budget, error budget, rollback ready — are the answer. The AI replacement runs in shadow mode alongside the legacy for two to four weeks; only the legacy reaches users. We log AI output, latency, cost, and eval scores against 200–500 graded historical cases. Regressions caught before a user sees them.

The rollback gate is the one most legacy modernization company shops skip. We test rollback in a dry run before cutover and monthly after — a calendarised exercise where traffic gets cut back to legacy at the gateway. Without a tested rollback, the migration's a one-way bet.

Can we run legacy and AI in parallel during migration?

Yes — parallel run is the default, not an upgrade. The legacy stays up; the AI replacement comes online behind a flag. Both process the same production traffic; only the legacy's response reaches users for the first two to four weeks. The four gates above hold, then traffic shifts at one percent, then ten, then a hundred.

Legacy decommission goes on a runbook with a named owner, a license-cancellation step, and a calendar date. Most of the post-migration cost case lives in the legacy turnoff; the turnoff stays paid for if nobody writes the runbook. Bring in an ai transformation consultant who'll write it.

011 / Related practices

## Adjacent services.

[

AI INTEGRATION

AI Integration

Drop-in AI for existing apps — OpenAI / Anthropic / Vertex.

](/services/ai-integration/)[

RPA DEVELOPMENT

RPA Development

Intelligent automation — beyond rule-based RPA.

](/services/rpa-development/)[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)

012 / Scope a modernization

## Modernize the legacy estate that's *costing more* than it's earning.

Two-week Assess. Four-path migration shapes. Four cutover gates. Decommission runbook with a named owner.

[Scope a modernization](/contact/?topic=ai-migration) [See migration paths](#paths)


---

## SECTION: 4.7. Service: ai-workflow-automation

_Source: https://www.paiteq.com/services/ai-workflow-automation/_

# AI Automation Agency · n8n, Make, Temporal — Paiteq

> Paiteq is an AI automation agency shipping LLM-in-the-loop workflows on n8n, Make, and Temporal. Eval-graded, observable, rollback-ready AI automation services.

**HTML version:** https://www.paiteq.com/services/ai-workflow-automation/

## Key facts

- Stacks: n8n, Make, Temporal.
- Posture: LLM-in-the-loop, eval-graded, observable, rollback-ready.

## Related pages

- [RPA Development](https://www.paiteq.com/services/rpa-development/)
- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI Automation Agency

# An ai automation agency shipping *workflows that actually adapt*.

Paiteq builds LLM-in-the-loop automation on n8n, Make, Temporal, and custom orchestrators. Deterministic actions where rules win, Claude or GPT-5 judgment where rules break. Every workflow eval-graded, observable, rollback-ready before any live action ships.

[Talk to engineering](/contact/) [See engagement shapes](#engage)

Stack n8n · Make · Temporal · LangGraph

Practice Sales · Back-office · Support · Eng

Engage Pilot · Build · Migrate · Scale

Compliance Self-host · SOC 2 · HIPAA

001 / SURFACES

## Eight workflow surfaces we automate.

Each surface below is a workload shape we've shipped in production. The rule layer, the LLM judgment layer, and the action layer get worked out per surface — there's no one-size-fits-all shape for ai workflow automation work.

The most common request we get as an ai automation agency starts with "can you automate this thing?" — and the honest first answer is almost always "show me the work." Eight surface categories cover roughly 95% of the workloads we end up shipping as an ai automation company. They're not categories of tools; they're categories of work where LLM judgment and deterministic orchestration combine in a particular shape. We've shipped at least three production ai automation solutions in each of the eight surface categories during 2026.

[

01 / SALES OPS ↗

Sales operations

Lead enrichment from web sources, CRM hygiene, pipeline forecast prep, follow-up sequence drafting. n8n triggers on Salesforce or HubSpot events; Claude Sonnet 4.6 does the judgment; routes low-confidence to a human queue.

SalesforceHubSpotn8n

](#use-cases)[

02 / BACK OFFICE ↗

Back-office automation

Invoice routing with 3-way match, AP / AR reconciliation, expense classification, contract clause extraction. LLM reads the documents, the workflow engine moves money or routes for approval. Most common engagement shape we ship.

AP / ARGPT-5 Vision

](#use-cases)[

03 / SUPPORT ↗

Support workflows

Ticket triage, draft replies on top of grounded knowledge, escalation policies. Triage logic on the LLM side; actions on the Zendesk or Intercom side. Confidence-routed — drafts gate on human approval below the threshold.

ZendeskIntercomTriage

](#use-cases)[

04 / DATA OPS ↗

Data operations

ETL with LLM-driven schema inference for messy sources, anomaly explanations, automated docstring and lineage capture, dashboard summary generation. Temporal handles durability when a pipeline run lasts hours.

TemporalBigQuerySnowflake

](#use-cases)[

05 / MARKETING ↗

Marketing operations

Brief to draft to review to publish, with brand-controlled generation and mandatory human approval steps on customer-facing copy. Cuts the brief-to-published cycle without losing brand voice — we evaluate on a brand-fidelity eval set, not a vibes check.

BrandHuman gate

](#use-cases)[

06 / ENGINEERING ↗

Engineering automation

PR triage and classification, doc-gen against the diff, changelog drafting, incident summarisation, on-call note drafts. GitHub Actions triggers an n8n workflow; LLM judgment lives at the classify step; humans gate before merge.

GitHubPR-gate

](#use-cases)[

07 / RPA REPLACE ↗

RPA modernisation

Replace fragile selector-based RPA with event-driven AI workflows that survive UI changes. UiPath, Automation Anywhere, and Blue Prism estates migrate workflow-by-workflow with eval-validated cutover and the old bots staying live until parity is proven.

UiPathBlue Prism

](#use-cases)[

08 / OBSERVABILITY ↗

Workflow observability

Every node logged, every LLM call traced via Langfuse, replay-any-run from the trace store, per-workflow cost ledgers, drift alarms on judgment accuracy. The same instrumentation we run in eval — kept on, in production.

LangfuseOTel

](#use-cases)

002 / SERVICES

## AI automation services — pick where to start.

Four fixed-scope engagement shapes. Pilot first, Build second, Migrate when you're replacing a legacy RPA estate, Scale when an existing automation practice has outgrown its tooling.

The n8n freelancer market is huge and varied. Most clients we talk to have already tried hiring a single contractor or a workflow automation consulting boutique to ship one flow — sometimes it worked, often it didn't. The gap is usually eval: a single workflow shipped without an eval set has no honest way to answer "is the LLM doing the right thing?" three months in. Our ai automation services and intelligent automation services exist to fill that gap — engagement shapes that bake the eval methodology in from day one, plus the orchestration, observability, and rollback paths that turn a one-off Zap into a production-grade ai automation solutions practice.

[

01 / PILOT ↗

Workflow Pilot

One workflow, eval-graded, dry-run shipped in 2–4 weeks. A safe option for clients evaluating whether to engage us as their ai automation agency.

2–4 wks

](#engage)[

02 / BUILD ↗

Production Build

Multi-workflow system with auth, observability, error handling, rollback paths. The bulk of our ai automation services revenue. Includes four weeks of post-launch iteration.

6–12 wks

](#engage)[

03 / MIGRATE ↗

RPA → AI Migration

Replace selector-based RPA with LLM-augmented workflows. Eval-validated cutover; old bots stay live until parity proven. Typical 8–14 weeks for a defined slice.

8–14 wks

](#engage)[

04 / SCALE ↗

Workflow Scale-up

Take an n8n or Make estate that has outgrown its tooling and harden it — self-hosting, multi-tenant, cost engineering, ops + on-call. About a third of our work.

4–8 wks

](#engage)

003 / STACK

## Orchestrators, model routers, observability — the workflow stack.

Stack choices follow the workload shape — visual canvas, code-first durability, or hybrid composition.

-   n8n
-   Make.com
-   Temporal
-   Inngest
-   Trigger.dev
-   Zapier
-   Pipedream
-   Activepieces
-   LangGraph
-   LangChain
-   Composio
-   Vercel AI SDK
-   Claude
-   GPT-5
-   Langfuse
-   OpenTelemetry
-   n8n
-   Make.com
-   Temporal
-   Inngest
-   Trigger.dev
-   Zapier
-   Pipedream
-   Activepieces
-   LangGraph
-   LangChain
-   Composio
-   Vercel AI SDK
-   Claude
-   GPT-5
-   Langfuse
-   OpenTelemetry

The stack above isn't a fashion list; it's the set of tools we've shipped production workflows on in the past 12 months. n8n leads our self-hosted work because the licence math beats every alternative for engineering-owned regulated estates. Make leads our ops-team-owned work because the visual canvas converts non-engineers into workflow authors without retraining. Temporal carries the long-running durable workflows where state has to survive a multi-day sleep. The LLM layer is Claude Sonnet 4.6 by default — strongest tool-call accuracy in our eval set, prompt-caching cuts ~80% of stable-prompt cost — with GPT-5 mini or Haiku 4.5 as the cheap-tier on routed workloads. Observability lives in Langfuse for the LLM traces and OpenTelemetry for the workflow spans; we don't ship a Production Build without both. The opinionated take: the orchestrator is the easy choice, the LLM layer is the moderate choice, and the observability stack is the choice that decides whether the workflow is debuggable in production. Skip observability and the system is unmaintainable by month three.

004 / TOOLS

## When n8n beats Make beats Temporal — picking the orchestrator.

Each orchestrator wins on a different workload shape. The picker below is what we use internally on every new engagement before we sign anything.

n8n

Strengths

Open-source, self-hostable, node-based canvas with code-when-you-need-it. Excellent LLM node ecosystem — OpenAI, Anthropic, custom HTTP, LangChain bridges, vector-store nodes. Fair-code licence means you can self-host indefinitely without per-seat fees.

When We Pick

Default for self-hosted regulated workloads. Default when the client wants to own the runtime and the licence math beats Make. Default for engineering-led teams that will end up editing function nodes.

When We Don't

Non-engineering ops teams that need a pure no-code experience — Make's visual canvas is more forgiving. Very high throughput (>100 runs/sec sustained) — we'll add Temporal at that point for the durability guarantees.

Paiteq Pattern

About half of our n8n consulting and n8n agency work in 2026 runs self-hosted on the client's AWS or GCP, with Langfuse instrumenting every LLM call. The other half runs n8n Cloud on the official hosted plan.

Self-hostOSSLLM nodes

Make.com

Strengths

The strongest visual canvas in the category — branching, error handlers, and iterators are all first-class primitives. Thousands of pre-built connectors. Ops teams without engineering staff ship more workflows on Make than on any other tool we've benchmarked. Native AI modules for OpenAI and Anthropic.

When We Pick

Ops-team-owned workflows where the runtime is a SaaS bill, not an infra commitment. Workloads with heavy branching that benefit from the visual debugger. Marketing and revenue-ops teams who don't have a platform engineer.

When We Don't

Regulated data — Make is hosted-only, no self-host option. High-volume workloads — the per-operation pricing flips above ~1M ops/month. Engineering teams who want runtime under version control.

Paiteq Pattern

We pick Make for about a quarter of new builds, almost always for ops teams in mid-market companies (200–800 employees) where the licence fee is dwarfed by the engineering time it saves.

VisualConnectorsSaaS

Temporal

Strengths

Durable execution — workflow state persists across worker crashes, deploys, and arbitrary delays. Workflows can sleep for weeks then resume exactly where they stopped. Code-first (Go, TypeScript, Python, Java) — version-controlled, testable, type-safe. Production-scale at Stripe, Netflix, Coinbase.

When We Pick

Long-running workflows (hours to weeks) where state durability is non-negotiable. Workflows with complex compensation logic (sagas). Engineering-owned automation that needs to live in the same repo as the app code. Any workflow where a half-completed run is worse than no run.

When We Don't

Short linear workflows — Temporal's setup tax doesn't pay back below ~10 step durability. Ops teams without engineering capacity — Temporal is not a no-code tool. Workloads where the SaaS hosted options are good enough.

Paiteq Pattern

We reach for Temporal on ~1 in 6 enterprise builds — mostly durable backoffice work (claims processing, multi-day approval chains) and the agentic-workflow shape where retries and checkpoints aren't optional. Our blog on <a href="/blog/n8n-vs-make-vs-temporal/">picking the orchestrator</a> covers the break-even math.

DurableCode-firstEnterprise

Inngest / Trigger.dev

Strengths

Code-first like Temporal but serverless-native — workflows live in your existing Next.js or Node app, no separate cluster. Excellent TypeScript ergonomics. Step functions, retries, scheduling, and event triggers in a single SDK. Good fit for product-engineering teams that already ship on Vercel or Fly.

When We Pick

Product engineering teams who want workflows in the same repo as the app. Lighter-weight than Temporal when durability matters but cluster ops doesn't. Background jobs that have grown into something that looks like a workflow.

When We Don't

Truly long-running workflows (multi-day with complex sagas) — Temporal is still the reference. Ops-team-owned automation — these are developer tools.

Paiteq Pattern

We use Inngest as the workflow layer inside Next.js apps where a Production Build engagement is also shipping app code. Not a first choice for standalone automation, but excellent when the workflow lives next to the product.

ServerlessTS-firstIn-app

Zapier

Strengths

The deepest connector library on the market. Reliability is the strongest in the category — Zaps that have run for years without a failure are common. Recently shipped real AI features. The platform of choice when a workflow just needs to connect two well-known SaaS tools and not break.

When We Pick

Simple linear flows with stable APIs and modest volume. Workloads where reliability outweighs feature richness. When the client's stack is heavy on long-tail SaaS connectors that Make or n8n don't cover.

When We Don't

Anything complex enough to need branching or iterators — Make is more capable. Self-hosted or regulated data — Zapier is hosted-only. High volume — the per-task pricing collapses the economics.

Paiteq Pattern

We rarely lead with Zapier on a new build but we leave existing Zaps alone in scale-up engagements — they work, and replacing them is rework for no gain. About 1 in 4 Production Builds keeps an existing Zap layer for one specific integration.

ConnectorsReliableLinear

LangGraph (in-app)

Strengths

State-graph orchestration for agentic workflows where the path through nodes depends on LLM judgment. Native to LangChain. First-class for agentic-workflow-automation shapes where pure DAG orchestrators struggle. Checkpointing, replay, and human-in-the-loop interrupts all built in.

When We Pick

Workflows where the next step is decided by an LLM, not by a static graph — pure agent loops, multi-agent supervisor patterns, or workflows with judgment-driven branching depth. Always paired with an orchestrator (n8n or Temporal) that triggers the agentic run.

When We Don't

Deterministic workflows — LangGraph adds complexity that pure orchestration doesn't need. The Python-only API (TypeScript is less mature) constrains the team it fits.

Paiteq Pattern

We treat LangGraph as the agentic layer inside a larger workflow, not as a replacement for the orchestrator. See <a href="/services/ai-agent-development/">the agent development practice</a> for the full agentic-workflow patterns we ship.

StatefulAgenticPython

In practice, most of our n8n consulting and n8n agency work as an ai automation company — and we do a meaningful amount of both — happens because n8n is the right fit, not because the client asked for it by name. The same is true for Make and Temporal. A typical mid-market estate ends up with two of these in production: an ops-team-owned Make workspace for the marketing and revenue-ops flows, plus a self-hosted n8n cluster for the engineering-led workflows that touch regulated data. Enterprises with long-running durable work add Temporal on top. The wrong pattern, which we see often in audits, is single-orchestrator dogmatism — usually n8n forced into a long-running shape it isn't built for. We rebalance estates like that on Scale-up engagements about once a quarter.

005 / CAPABILITY × INDUSTRY

## Where workflow automation ships — function × industry.

A heatgrid of the function × industry combinations where we've shipped workflows in 2026. Darker cells are repeatable engagements; pale cells are workloads where the shape doesn't yet justify automation in that vertical.

Function Industry

B2B SaaS

Fin-tech

Health-tech

Legal

Mfg

E-comm

Ed-tech

Logistics

Sales / Revenue ops

Back-office (AP/AR/HR)

Support / Success

Engineering / DevOps

Data / Analytics ops

Marketing / Content

Sales / Revenue ops

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Back-office (AP/AR/HR)

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Support / Success

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Engineering / DevOps

B2B SaaSFin-techMfgE-commLogistics Health-techLegalEd-tech

Data / Analytics ops

B2B SaaSFin-techHealth-techLegalMfgE-commEd-techLogistics

Marketing / Content

B2B SaaSE-commEd-tech Fin-techHealth-techLegalMfgLogistics

Possible fit Good fit Primary vertical

Back-office automation is the densest column — invoice routing, reconciliation, contract ops, claims intake, AP/PO matching, customs filings — because the work is judgment-heavy on structured documents, which is exactly where LLM-in-the-loop wins. Sales / revenue-ops is the second densest because enrichment and routing are universal. Marketing and content automation lags in regulated industries (health, legal, finance) not because the workflow isn't possible but because the human-gate cost dominates the savings; the math only works in lower-risk verticals like e-commerce and ed-tech.

006 / PATTERNS

## Four LLM-in-the-loop patterns we ship.

Most production workflow automation collapses onto one of these four architectural shapes. The shape decides where the eval gates land, where the cost lives, and where the failure modes hide.

   
01

### Trigger → enrich → judge → act

The simplest production shape. An event triggers a workflow; deterministic enrichment pulls context from CRMs, databases, or HTTP endpoints; one LLM call applies judgment (classification, extraction, routing decision); the workflow validates the schema and commits the action to a system-of-record. About 60% of the workflows we ship live here. The fewest moving parts and the easiest shape to reason about.

Pick when

-   One judgment node is enough — the LLM is reading a document, classifying intent, or extracting structured fields. Latency tolerant of one round-trip (typically sub-3s trigger-to-settle). Cost is bounded per run because the LLM call is single.

Skip when

-   Multi-step judgment where the output of one LLM call shapes the prompt of the next. Workflows that need to sleep for hours or days — pull in Temporal for durability. Branch counts above ~10 — switch to confidence-routing or agentic shapes.

Stack

n8n / Make / Inngest (orchestrator)Claude Sonnet 4.6 or GPT-5 (judgment)Pydantic / Zod (schema validation)Langfuse (trace)

02

### Confidence-routed

The LLM emits a structured judgment plus a confidence score (calibrated during the eval phase, not a vibes number). High-confidence outputs trigger the auto-action path; low-confidence outputs route to a human queue with the model's draft attached. The shape is the right balance between automation rate and false-positive risk on judgment-heavy workloads.

Pick when

-   Workloads where some inputs are easy and some are hard, and the cost of a wrong auto-action is bigger than the cost of a queued review. Sales-ops scoring, support triage, AP three-way match — anywhere a human queue absorbs the residual.

Skip when

-   Workloads where every output needs human review anyway — the confidence routing adds complexity without value. Workloads where the false-positive cost is so high that even the high-confidence path needs a human gate (that's the human-gate pattern below).

Stack

Claude Sonnet 4.6 with structured output + confidence calibrationRouting rule on confidence thresholdSlack or queue for low-confidence draftsEval set with the threshold pinned

03

### Durable agentic

Temporal workflow wraps a LangGraph agent. The workflow handles durability (state persists across deploys and crashes), retries (idempotent activities), and long sleeps (the workflow can wait three days for an external dependency without consuming compute). The agent handles dynamic path-finding through the work — tool selection, replanning, multi-step reasoning. The shape is overkill for short linear workflows and exactly right for claim adjudication, multi-day approval chains, and agentic-workflow-automation patterns where path count is too high to enumerate.

Pick when

-   Long-running workflows (hours to weeks). State durability is non-negotiable — half-completed runs are worse than no runs. Path through the work depends on LLM judgment at multiple decision points. Engineering team can own the Temporal cluster (or wants Temporal Cloud).

Skip when

-   Short linear workflows — Temporal's setup tax doesn't pay back below ~10 durable steps. Workflows that finish in seconds — async orchestration adds latency for no benefit. Teams without engineering capacity to run the agent layer.

Stack

Temporal (workflow engine, durability)LangGraph (agent, state graph)Claude Sonnet 4.6 or GPT-5 (agent model)Langfuse (agent trace)Postgres (state)

04

### Human-in-the-loop on irreversible

Mandatory human gate before any irreversible action — sending money, deleting records, closing accounts, publishing customer-facing content, executing trades. The LLM drafts the action plus a written rationale; the workflow posts the draft to Slack or an internal review queue; a human approves, edits, or rejects with feedback. Approved drafts commit through the workflow; rejected drafts close out and the feedback feeds back into the eval set. The architectural choice is human-gate-always, not human-gate-on-low-confidence.

Pick when

-   Actions with no undo: money movement, contract execution, customer comms at scale, legal filings, clinical decisions, irreversible data deletions. Compliance workloads where audit trail demands a human approver per action. High blast-radius surfaces in the early production weeks of any workflow.

Skip when

-   High-volume routine actions where the human gate becomes the bottleneck — switch to confidence-routed instead. Reversible actions where rollback is cheap — full auto with a rollback path is faster and equally safe.

Stack

Any orchestratorSlack-block-kit or custom review UIClaude or GPT-5 draftingAction ledger with full traceRejection feedback into eval set

In practice, most production workflow systems compose two or three of these patterns. A confidence-routed workflow with a human-gate fallback below threshold is the most common shape we ship — it absorbs the long tail of edge cases without throwing every run at a reviewer. Durable agentic wrapped around a linear inner workflow is the enterprise shape for long-running claims work. The wrong pattern is single-pattern dogmatism: forcing every workflow into the agentic shape because it sounds modern, when 60% of real work is linear and would ship 3× faster with the simpler pattern. The pattern falls out of the workload during week 2 of any engagement, not from a template.

007 / DECIDE

## Workflow, agent, chatbot, or RPA — pick the right pattern.

The single most common mistake in this category is solving a workflow problem with the wrong pattern. The 3-question picker below is what we run on every discovery call before scoping.

Path

Question

Pick one

Result

The decision matters because building the wrong shape is expensive to undo. Building an agentic system when a confidence-routed workflow would have shipped in half the time costs you 4–8 weeks of engineering and adds a debugging surface you didn't need. Building a deterministic workflow when the work genuinely needs agentic path-finding caps the system at a brittle 20-branch DAG that breaks every time the work shape shifts. The picker isn't theoretical — it's the same questions we ask on every contact call before talking budget.

008 / EVAL

## Four gates on every workflow before live actions.

Eval-first isn't a slogan; it's a build-order decision. The eval set lands in week 2, before any orchestrator or model is picked. You can't pick the right tools without a way to measure what "good" means on the actual work.

All four gates green before any live action enables. If one's amber, we rework that node in place; if it's red, we re-baseline the model on that judgment step or rethink the workflow shape. The gates are the most important part of our ai workflow automation services — they're what stops 'looks fine in the demo' from shipping wrong actions to production.

1.  01 Judgment accuracy
    
    ≥95%
    
    Every LLM-judgment step is scored against a domain-expert-graded eval set (typically 40–100 examples per judgment node). The eval set is built before the model is picked. Inspect AI as the harness; LLM-as-judge with human spot-check for the disputed cases. We grade per-judgment, not per-workflow — bottling judgment quality at the node level catches drift earlier.
    
    If judgment accuracy drops below 92% on the production trace sample, we re-baseline the prompt or swap to a stronger model on that node. Confident wrong decisions in a workflow are worse than refusals — refusal queues are cheaper to staff than recovery from a wrong AP / AR action.
    
2.  02 Run success rate
    
    ≥99% on dry-run
    
    Trigger-to-settle workflow completion without error or unintended rollback, measured on dry-run mode against the eval scenarios. Includes integration failures, timeout cascades, schema mismatches — every non-judgment failure mode. Tracked per workflow per day in production.
    
    If the production success rate dips below 97% for 48 hours, we pause live actions and reproduce the failure in dry-run. Most success-rate failures trace to a downstream API contract change — observability catches them before they compound.
    
3.  03 Median cost per run
    
    Modelled at discovery
    
    Per-run cost — LLM tokens, compute, integration calls — tracked per workflow per week via Langfuse and an internal cost ledger. Modelled during the Pilot using the expected traffic shape, not a marketing average. Surprise bills aren't a surprise because the modelling lands in week 2.
    
    If median cost drifts more than 25% over the baseline for two consecutive weeks, we audit the model routing on judgment nodes and the prompt cache hit rate. Most cost-runaway incidents trace to one of those two, not to volume spikes.
    
4.  04 P95 trigger-to-settle latency
    
    Under per-workflow SLA
    
    Full trigger-to-settle time, including async queue waits where the workflow includes them. SLA varies — sub-3s for sales-ops enrichment, sub-30s for AP routing, sub-60m for long-running data ops. Tracked p50/p95/p99 separately because workflow latency tails matter.
    
    Breach of p95 SLA for 72h triggers a routing review on judgment-heavy nodes. Usually the fix is moving an easy judgment to a faster model (GPT-5 mini or Haiku 4.5), not replatforming the orchestrator.
    

The four gates above are the floor on a workflow build. For specific workloads we add more — **action reversibility audit** (what fraction of run actions can be rolled back, by run, by day), **schema-compliance rate** (% of LLM outputs that parse against the Pydantic / Zod schema first try, before retry), **human-gate response time** (how long drafts sit in the queue, which is the operational metric that decides whether the queue is staffed correctly). Add gates only when the workload demands them; gate proliferation slows iteration without lifting quality.

009 / OBSERVABILITY

## Workflow observability — what we instrument.

Production workflow automation is debuggable only if the observability lands on day one. The same instrumentation we run during eval stays on in production — that's how you catch drift in week 6 instead of month 3.

Three layers, instrumented from the first dry-run. **LLM traces** via Langfuse — every model call captured with input, output, token counts, latency, cost, model identifier, and the upstream workflow run ID. Production traces feed a sampled trace store; that store feeds the eval set monthly. We've found 1 in 6 production-drift bugs by reading sampled traces against the eval-set baseline; that's an investment we'd rebuild on every engagement.

**Workflow spans** via OpenTelemetry — every workflow node, every retry, every sleep, every queue wait, captured as an OTel span with the run ID as the trace key. Spans tie back to the LLM traces; one trace ID joins both stores. We ship to Datadog, Honeycomb, or self-hosted Tempo depending on what the client's platform team already runs. Sentry catches the workflow exceptions that don't make it to OTel — usually integration timeouts that need rebudgeting.

**Cost ledgers**. Per-workflow per-run cost, separated by LLM tokens, compute, and integration calls (Twilio messages, BigQuery scans, Snowflake credits). The ledger is queryable per workflow per day; the alarm fires on 25% drift over two weeks. Helicone is a useful add-on for granular LLM-cost slicing when the client wants per-tenant or per-user attribution.

The opinionated take: observability is the choice that separates a workflow estate that survives team turnover from one that doesn't. We've audited estates where the engineer who built them left the company and the new team can't tell which workflows are running, let alone which are failing. Three months of instrumentation on day one avoids that failure mode entirely.

010 / PROCESS

## Map, eval, dry-run, ship.

Every workflow ships through the same six-step process. Eval cases land in week 1–2 — the workflow doesn't get built before we know what "right" looks like on the actual work.

WEEK 1

### Workflow map

Current state of the process: triggers, steps, integrations, decision points, exception paths. We sit with the team that runs it today, not just the team that owns it.

WEEK 1–2

### Spec + eval

Eval cases for the LLM-judgment steps, action surface, blast-radius assessment, rollback policy. Eval set lands before any prompting.

WEEK 2–4

### Prototype

Working workflow against real integrations on a dry-run flag. Real APIs, real data, no live actions yet. Built on the picked orchestrator (n8n / Make / Temporal).

WEEK 4–6

### Eval + dry-run

Judgment accuracy, run success, cost-per-run, p95 latency — all four gates green before any live action ships. Production traffic mirrored to dry-run for a full week.

WEEK 6–8

### Deploy

Live actions enabled. Auth, rate limits, Langfuse instrumentation, error routing, rollback playbook in the runbook. SOC 2 alignment if the workload touches it.

ONGOING

### Running

Weekly eval review, drift alarms, cost-tracking dashboard, monthly workflow audit for new automation candidates. Ownership transfers to the client's ops team.

011 / TIMELINE

## What a 6-week Production Build looks like, week by week.

The default Production Build timeline. Pilot is similar but compressed; Migration is similar but with a parallel-run phase added on the back.

Production Build · 6 weeks 6 phases

WEEK 1 Workflow map

Process trace, integration inventory, decision-point map, exception catalogue

Workflow scope signed off

WEEK 2 Eval set

40–100 graded judgment examples, dry-run scenario library

Domain-expert grading complete

WEEK 2–3 Orchestrator pick

n8n / Make / Temporal decision memo, infra provisioned

WEEK 3–5 Build + dry-run

Workflow live in dry-run mode, eval gates wired, observability on

All four eval gates green

WEEK 5–6 Cutover

Live actions enabled progressively, rollback playbook, runbook handover

WEEK 6+ Post-launch

Four weeks of paid iteration baked in; weekly eval review, drift alarms

012 / VS

## Rule-based RPA versus LLM-driven workflow automation.

Classical RPA still has a place — but the place is narrower than the RPA vendors' marketing suggests. The side-by-side below is what we walk through on every RPA modernisation discovery call.

Most of our RPA modernisation engagements come from manufacturers and back-office-heavy services firms with a UiPath, Automation Anywhere, or Blue Prism estate that's hit the brittleness wall. The pattern is consistent: the original bot estate was built in 2019–2022 against UI surfaces that have since changed eight times, the engineering team that built it has rolled over twice, and each UI change costs a sprint to fix. The migration question isn't "RPA versus AI" in the abstract; it's "which bots in this specific estate are worth migrating, and to what shape?"

Rule-based RPA (UiPath / Blue Prism)

LLM-driven workflow (n8n / Make / Temporal)

Triggers

Scheduled scrape, selector watch, file drop

Event-driven (webhook, queue), API call, scheduled, Slack command

Rule-based RPA is typically clocked or file-drop driven — it polls rather than reacts. LLM-driven orchestrators receive webhooks, subscribe to queues, and fan out in parallel, cutting end-to-end latency on time-sensitive flows (e.g., new-lead enrichment, invoice receipt) from minutes to seconds.

Brittleness

Breaks on UI / format change — silent failure common

Adapts via LLM judgment, alerts on novel input shapes

A Blue Prism bot reading a vendor portal fails silently the moment a column moves; the error surfaces only when someone notices bad data downstream. An LLM extraction step sees an unfamiliar layout, returns a low-confidence score, and routes the record to a human queue — the workflow degrades gracefully instead of corrupting the dataset.

Work shape

Repetitive structured data movement on stable interfaces

Structured + judgment-heavy work — extraction, classification, routing, drafting

Maintenance

Constant — each UI tweak is a fix-and-redeploy

Bursty — most input changes absorbed by the LLM step, redeploy only on integration drift

Cost model

Per-bot licence + RPA platform fee (UiPath, Blue Prism)

Per-run (LLM tokens + compute + integration calls) — usually variable, often cheaper at low volume

At high, predictable volume (e.g., 500k invoice lines/month), a flat UiPath licence outperforms variable LLM inference costs — the per-unit math is settled. LLM-driven runs win on variable or low-volume workloads and when maintenance cost is folded in, but the licence model genuinely wins on predictability for large-scale, stable back-office estates.

Setup time

Days–weeks per workflow once the bot pattern is set

Hours–days per workflow once the orchestrator + eval pattern is set

Observability

Logs per bot run, screen recordings — hard to query at scale

Structured traces (Langfuse + OTel), per-node cost, replay-any-run

Best fit

Truly stable structured workflows where input shape never varies

Anything with judgment, free text, document extraction, branching, or volatile input

Full RPA-vs-workflow decision guide — [when each one wins](/blog/ai-vs-rpa-when-to-use-which/)

Our honest take: 1 in 5 bots in a typical estate isn't worth migrating. Some workloads are genuinely stable — the input shape hasn't changed in three years and won't change in the next three. Keep those bots. Migrate the ones that broke twice this quarter, the ones that need judgment your rules engine can't express, and the ones that touch document formats your supplier base controls. The full method lives in our [RPA modernization practice](/services/rpa-development/) page, which goes deeper on the migration framework for clients with large rule-based estates.

013 / USE CASES

## Where teams have shipped — real workflow automation work.

Five anonymized engagements. Workflow shape, segment, outcome metric are real; client identity removed under NDA. Numbers are the actual measured outcomes, not modelled estimates.

Sales ops

B2B SaaS · 200+ emp

### Lead enrichment + CRM hygiene workflow on n8n

n8n triggers on new lead in Salesforce. Claude Sonnet 4.6 enriches from web sources, runs an ICP scoring rubric against the company profile, returns structured fields with a confidence score. High-confidence routes auto-write to Salesforce; low-confidence routes to a human queue with the model's draft attached. Langfuse instruments every LLM call; per-rep cost capped at $0.40/day.

0 %

SDR research time

AP automation

Mfg · 400+ emp

### Invoice routing + 3-way match on Temporal

Inbound invoice arrives via email — GPT-5 Vision extracts header and line items to a Pydantic schema. Temporal workflow matches against open POs and receipts in NetSuite. High-confidence three-way matches auto-approve under the threshold; everything else routes to the AP lead with an annotated diff in Slack. Replaced a Blue Prism estate that broke twice a month on layout changes.

0

AP cycle time: 6 days →

Engineering

Dev tools SaaS · 60 eng

### PR triage + changelog generation on Inngest

GitHub webhook into an Inngest workflow. Claude classifies the PR (feature / fix / chore / breaking) against the diff and the linked issues. Workflow tags the PR, requests appropriate reviewers, drafts a changelog entry. Drafts gate on a human approver before merge — the workflow doesn't ship anything customer-facing without an engineer's nod.

0 %

PR triage overhead

Pre-authorisation

Health-tech · payer-side, 250-emp

### Prior-auth intake on n8n with HIPAA isolation

Faxed prior-auth requests land in a HIPAA-aligned email mailbox — n8n triggers on receipt. A self-hosted Llama 4 70B (no PHI leaves the VPC) extracts the clinical request, matches against the payer's policy library via grounded retrieval, and drafts a coverage determination with citations. Clinical reviewer signs off in 3 minutes instead of 18; clear-cut approvals route directly to the claims system.

Avg review time: 18m → 3m, 0 PHI leaks

Compliance ops

Fin-tech · 800-emp

### Regulatory Q&A workflow on Make + Pinecone

Compliance team submits a question through a Slack form. Make orchestrates a grounded retrieval over a 2,400-document regulatory corpus indexed in Pinecone; Claude Sonnet 4.6 drafts an answer with mandatory citations. Drafts route to a senior compliance officer for sign-off before they go back to the requester. Average response time fell from 2 business days to 40 minutes.

0 %

2 days → 40m response, cited

014 / CONSULTING

## AI automation consulting — when to engage before building.

About a third of our P4 work doesn't start with a build at all — it starts with a 1–3 week advisory engagement that outputs a prioritised roadmap. Four standard shapes below.

The pattern: a client has an automation estate (n8n flows, RPA bots, manual SOPs, half-shipped Zaps) that's grown organically over 2–4 years. Nobody owns the whole estate. Some workflows are quietly business-critical; others were shipped for a single problem that no longer exists. Before we agree to build anything new, we audit what's there. The advisory engagement outputs a workflow-by-workflow scorecard, a tool-stack recommendation, and a sequenced roadmap with budget bands. Some clients run the build with us afterwards; some take the roadmap to an internal team. Either is fine — we charge for the advisory whether the build follows or not. The honest version of ai automation consulting services means saying "don't build this" when the audit says so. Intelligent automation services done well include the courage to recommend less automation, not more.

For clients shopping by tool — workflow automation consulting for an existing n8n estate, ai workflow automation services for a new build, or specifically an agentic workflow automation engagement for a long-running multi-agent orchestration — the four shapes below absorb each of those framings. We don't define the engagement by the buzzword the client found on Google; we define it by the shape of the work after week one of discovery. Most clients shopping for one shape end up scoping a different one once the audit lands.

01

#### Automation portfolio audit

2-week engagement: trace the existing automation estate (RPA bots, Zaps, n8n flows, manual SOPs), score each by ROI and brittleness, output a prioritised AI-automation roadmap. The default ai automation consulting service for clients with sprawl.

02

#### Tool selection memo

1-week head-to-head: orchestrator pick (n8n / Make / Temporal / Inngest) against the actual workload shape, cost model over 12 months, ops capacity assessment. Useful before procurement signs anything.

03

#### RPA modernisation strategy

3-week deep-dive on an existing UiPath, Automation Anywhere, or Blue Prism estate: bot-by-bot triage, migration-vs-keep-vs-kill recommendations, phased modernisation plan with risk and budget bands.

04

#### AI workflow eval design

2-week engagement building the eval methodology before any workflow build: what counts as judgment accuracy on your workload, how to ground-truth the eval set, who in your org grades. The hardest part of any ai workflow automation services engagement — and the thing competitors skip.

Cross-discipline strategic work — when AI automation is one part of a broader AI initiative spanning [agent development](/services/ai-agent-development/), [custom LLM applications](/services/llm-development/), or [grounded retrieval pipelines](/services/rag-development/) — runs through our [consulting practice](/services/ai-consulting/). The four shapes above are scoped to the workflow / automation surface specifically.

015 / ENGAGE

## Four engagement shapes — Pilot, Build, Migrate, Scale.

01 Workflow Pilot Fixed scope

2–4 weeks

### Pilot one workflow, trigger to settle.

In scope

-   One workflow, real integrations
-   Eval cases for LLM-judgment steps
-   Dry-run mode with action queue
-   Demo + go/no-go memo

Out of scope

-   Live actions enabled
-   Multi-workflow orchestration
-   Long-running durable shapes

02 Production Build Fixed scope

6–12 weeks

### Multi-workflow system.

In scope

-   All Pilot deliverables
-   Multi-workflow orchestration
-   Live actions with rollback
-   Auth, rate limits, error routing
-   Observability via Langfuse + OTel
-   Four weeks of post-launch iteration

03 RPA → AI Migration Fixed scope

8–14 weeks

### Replace fragile rule-based RPA.

In scope

-   Bot-by-bot migration plan
-   Eval-validated cutover
-   Old bots stay live until parity proven
-   New workflows on n8n / Temporal
-   Documentation + ops handover

04 Workflow Scale-up Fixed scope

4–8 weeks

### Take what works and 10× it.

In scope

-   Existing-workflow audit
-   Self-hosting / multi-tenant migration
-   Cost + reliability engineering
-   Ops + on-call setup

016 / FAQ

## Common ai workflow automation questions.

n8n vs Make vs Temporal — how do you pick the orchestrator?

Three questions decide it. **One**: who owns the runtime — engineers or an ops team? Engineers gravitate to n8n (self-host, code nodes, version control) or Temporal (code-first, type-safe, durable). Ops teams ship more on Make's visual canvas. **Two**: is the data regulated? If yes, n8n self-hosted or Temporal on your cloud. Make and Zapier are SaaS-only. **Three**: how long does a single workflow run? Sub-minute linear flows fit anywhere; multi-day workflows with retries and sagas want Temporal's durability.

Defaults we ship in 2026: ops-team-owned mid-market with under ~500k ops/month — Make. Engineering-owned with self-host or regulated data — n8n. Enterprise with long-running durable processes (claims, multi-day approval chains, agentic workflows) — Temporal, almost always wrapped around LangGraph for the agentic step. The full break-even math lives in our [orchestrator picking guide](/blog/n8n-vs-make-vs-temporal/). Most real estates end up with two of these tools in production for different workloads, not one.

Why use an LLM in a workflow at all? Aren't deterministic rules cheaper?

Rules are cheaper when the input is structured and stable. The moment you need to read free text, classify intent against a fuzzy taxonomy, extract a field from a layout the source team changes monthly, or handle the long tail of edge cases the rules engine can't cover, rules become a maintenance fire. The LLM judgment lives in those specific steps — the workflow engine still handles the deterministic actions around it. That's the LLM-in-the-loop pattern.

The economics flip on the maintenance side, not the inference side. We've watched clients pay engineers to babysit 40 fragile regex rules for invoice extraction; replacing the regex layer with a single Claude call that returns a Pydantic-validated schema dropped both the bug rate and the engineering hours, even though the per-run cost went up by a few cents. The cost-per-run is the visible number; the cost-of-maintenance is the one that matters. Our [LLM-in-the-loop patterns](/blog/llm-in-the-loop-patterns/) piece covers when to add judgment versus keep rules deterministic, with the failure-mode comparison.

One opinionated take: don't add an LLM to a workflow that doesn't need judgment. We turn down about one in eight automation engagements because the workload is genuinely deterministic and adding AI is engineering theatre.

How do you stop a workflow from taking a wrong action?

Four layers, applied per workflow based on the blast-radius assessment.

1.  **Dry-run mode before live**. Every workflow ships in dry-run for at least one production-traffic week. The workflow runs against real inputs and writes its proposed actions to a queue instead of executing them. We diff dry-run output against an expert-graded sample; live actions enable only when the diff rate is below threshold.
2.  **Confidence-routed human gate on judgment-heavy actions**. The LLM emits a confidence score; below the threshold (calibrated during the eval phase), the workflow drops a draft into Slack or a queue with the rationale attached. Human approves, edits, or rejects. Routing thresholds are tuned to your false-positive budget.
3.  **Mandatory human-in-the-loop on irreversible actions**. Sending money, deleting records, publishing customer-facing content, closing tickets — these don't get auto-actions, period. The workflow drafts; a human commits. Architectural choice, not a configuration.
4.  **Full action log with one-click rollback for reversible actions**. Every workflow run is fully traced in Langfuse with the action payload. For reversible writes (CRM updates, ticket reassignments, Slack DMs), there's a rollback action in the runbook keyed off the trace ID.

Every Production Build engagement comes with a documented blast-radius assessment in the SOW that decides which layers apply where. We won't ship a workflow that doesn't have one.

How do you prevent runaway costs on AI automation services?

Per-run budget caps, model routing, prompt caching, batch API, and dry-run-first economics. Per-run budgets are hard — the workflow refuses to execute the LLM call if the input is bigger than expected, rather than silently emit a $4 response when the budget says $0.05. Model routing sends easy classifications to GPT-5 mini or Claude Haiku 4.5; only the genuinely hard judgment goes to Sonnet 4.6 or GPT-5. We've seen routing alone cut the LLM bill 40–70% on production workflows.

Prompt caching on stable system prompts kills another 60–85% of cached-token cost — workflows tend to re-use the same orchestration prompt across runs, which is the ideal cache shape. Batch APIs apply to non-interactive workloads: AP reconciliation overnight, marketing draft generation in scheduled bursts, doc enrichment pipelines. Batch usually runs at 50% off list price.

The cost lives in the modelling, though. Every Pilot we ship has a cost model in week 2 using the actual expected traffic shape, not a vendor's marketing average. We size the per-workflow cost ledger before we ship a single live action, and the alarm threshold is set at 25% drift over two weeks. When clients tell us their previous vendor "surprised them with a bill," the consistent story is no cost modelling and no production cost ledger. Both are 2-day engineering tasks. There's no reason to ship without them.

Can you replace our existing UiPath, Automation Anywhere, or Blue Prism estate?

Yes — RPA → AI Migration is the dedicated engagement shape for this. Typical scope is 8–14 weeks for a defined slice (~6–12 bots), and the pattern is migrate-bot-by-bot with the old bots staying live until parity is proven. We don't do big-bang cutovers on RPA estates because the failure modes are too expensive to roll back from.

The migration sequence: audit the existing estate (which bots run, what they touch, how often they break), score each by ROI and brittleness, pick the migration order by risk-adjusted ROI, build the new workflow alongside, prove parity on a 2-week dry-run, then cut over. Old bot stays running in parallel for 1–2 weeks post-cutover as the rollback option. We've found about 1 in 5 bots in a typical estate aren't worth migrating — the workload is genuinely stable enough that the existing bot will keep running for years. We say that out loud; we don't migrate for the sake of it.

For clients who only want the strategy work without the build, the [RPA modernisation strategy](#consulting) consulting engagement above outputs the prioritised plan in 3 weeks with no commitment to the build. About half our migration work starts with the strategy engagement first.

Self-hosted or managed? Where does the workflow runtime live?

Depends on three things: data residency, ops capacity, and the licence math at your volume. Self-hosted n8n or Temporal on your AWS / GCP / Azure when you have regulated data (HIPAA PHI, financial MNPI, EU AI Act high-risk workloads) or when steady-state ops volume above ~500k runs/month makes the per-run pricing on a hosted plan painful. Managed (Make, Zapier, n8n Cloud, Temporal Cloud) when the data isn't regulated and the engineering team would rather not run another service.

In our experience, the right answer almost always tracks the regulatory question first, the ops-capacity question second, and the price question third. Teams self-host for control of regulated data and accept the operational tax; teams that don't have regulatory pressure rarely benefit from self-hosting once you cost out the ops time. We've shipped both shapes; the choice follows the data rules and the ops capacity, not preference. The full break-even math by orchestrator lives in the [orchestrator pick guide](/blog/n8n-vs-make-vs-temporal/).

What's an agentic workflow, and when does it beat a regular workflow?

An agentic workflow is one where the path through the graph is decided by an LLM at runtime, not by a static DAG. The workflow asks the LLM "what's the next step?" at one or more decision nodes; the LLM picks from a set of tools or branches. It composes with classical workflow primitives — the orchestrator (usually Temporal) handles durability, retries, and checkpointing; an inner agent (usually LangGraph) handles the dynamic path-finding.

Agentic workflow automation wins when the work has too many branches to enumerate in advance — claim adjudication where the path through the rules depends on document content, customer support where the resolution depends on the customer's history and the issue category, sales-research where the next data source depends on what the previous one returned. Pure DAG orchestration starts to thrash when branch count crosses ~20 in our experience.

It loses on cost and observability — every agent decision is an LLM call, and traces are harder to read than a linear workflow run. Default to deterministic workflow shapes first; reach for agentic only when the branch count justifies it. The agentic patterns we ship are detailed in [our agent development practice](/services/ai-agent-development/); this pillar focuses on the workflow side and treats the agent as an inner component.

What's a realistic budget and timeline for an ai automation services engagement?

Four engagement shapes, fixed scope, fixed duration. Rough order of magnitude on each:

-   **Workflow Pilot** (2–4 weeks): small enough to stop after the pilot if the eval shows the workload isn't a fit. About 1 in 5 pilots end at pilot because the workload turned out to be genuinely deterministic (better served by an off-the-shelf Zap), or because the eval surface wasn't measurable enough to ship safely. Pilots cost less than a single quarter of a senior engineer's salary, and that's the threshold most clients use to decide.
-   **Production Build** (6–12 weeks): the bulk of our ai automation services revenue, including the four-week post-launch iteration. Multi-workflow orchestration with auth, observability, error handling, and rollback paths baked in.
-   **RPA → AI Migration** (8–14 weeks): one defined slice of a legacy estate (~6–12 bots typically). Includes the parallel-run period and the documentation handover.
-   **Workflow Scale-up** (4–8 weeks): existing n8n or Make estate, hardened for self-host, multi-tenant, ops handover. Usually triggered when an internal workflow practice has outgrown the ops capacity.

For ai automation consulting (audit, tool selection, RPA modernisation strategy, eval design), 1–3 week engagements at flat fees. Strategic cross-service work — when AI automation is one part of a broader AI initiative — runs through [our consulting practice](/services/ai-consulting/).

017 / Related practices

## Adjacent services.

[

RPA DEVELOPMENT

RPA Development

Intelligent automation — beyond rule-based RPA.

](/services/rpa-development/)[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

AI INTEGRATION

AI Integration

Drop-in AI for existing apps — OpenAI / Anthropic / Vertex.

](/services/ai-integration/)

018 / Start a project

## Find an ai automation agency that ships *workflows that actually adapt*.

Pilot in 2–4 weeks. Build in 6–12. RPA migration in 8–14.

[Talk to engineering](/contact/) [Workflow audit](/contact/?topic=workflow-audit)


---

## SECTION: 4.8. Service: chatbot-development

_Source: https://www.paiteq.com/services/chatbot-development/_

# Chatbot Development Services — Paiteq

> Chatbot development services from an engineering-led AI chatbot development services agency. Custom chatbot development, enterprise chatbot development, LLM chatbot development, RAG chatbot, and voice chatbot development. Fixed scope, eval-instrumented, multi-channel.

**HTML version:** https://www.paiteq.com/services/chatbot-development/

## Key facts

- Types: custom, enterprise, LLM, RAG, voice.
- Channels: web, WhatsApp, Slack, Teams, voice.
- Eval-instrumented; fixed scope.

## Related pages

- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

P8 · Services

# *Chatbot development services* — grounded chat, tool-using chat, voice chat, hybrid AI + human handoff

Chatbot development services from an engineering-led custom chatbot development team. Conversational AI development for customer service, sales, internal Q&A, voice deflection, and regulated-grounded chat — eval-instrumented, multi-channel, with a warm-handoff layer to the human agent queue. We default to Claude Sonnet 4.6 with a multi-vendor routing layer above; we don't ship single-vendor chat stacks.

[Talk to a partner](/contact/) [See engagement shapes](#engage)

Practice Conversational AI development

Shapes Grounded · Tool-using · Voice · Hybrid

Default Claude Sonnet 4.6 + GPT-5 mini fallback

Engagements 3–12 weeks · fixed scope

001 / FRAME

## Chatbot or agent? The most expensive misframing in this category.

Most buyers arrive at chatbot development services with an agent-shaped problem, or vice versa. Pricing the wrong shape costs roughly $80k–$240k of engineering before anyone notices. The grid below is the first frame we run — single-turn-ish conversational chat on the left, multi-step autonomous agent on the right, scored on the dimensions that actually drive the architectural call. Read it once; the rest of this page assumes you've landed on the chat side. If you're on the agent side, route to our [AI agent development]("/services/ai-agent-development/") practice.

Chatbot (this pillar)

Agent (sibling pillar)

Decision shape

Single-turn-ish — answer or escalate within one conversation

Multi-step autonomous — plan, act, observe, iterate over N turns

Default response budget

Sub-2-second perceived latency; voice ≈ sub-400ms turn-take

5–60 seconds is normal; long-running tasks measured in minutes

Speed is a feature in customer-facing chat — users abandon threads after 4–5 seconds of silence. The sub-2-second budget isn't arbitrary; it's the threshold where **chat still feels like chat** rather than a loading screen. Agents trade latency for capability by design, which is the right call for back-office work but the wrong call on a live support surface.

Tool calls per session

Zero to three — retrieve, lookup-by-id, maybe a write

Often ten-plus; explicit state machine over the tool surface

Tool depth is the clearest capability signal. Three tool calls is the practical ceiling for a chatbot before the state tracking collapses and latency blows the budget — if your build needs more than that, you're not designing a chatbot, you're designing an agent with a chat surface. The explicit state machine that agents require isn't overhead; it's the mechanism that keeps tool-call ten coherent with tool-call one.

Grounding source

RAG over enterprise corpus or product KB; tight context window

Open-ended retrieval, web search, multi-corpus, internal tools

Tight scope is a chatbot advantage, not a limitation. Grounding against a controlled corpus means every retrieval chunk can be audited, every answer traced back to a source document. Agents need open-ended retrieval because their tasks demand it — but that breadth trades away the faithfulness guarantees that [RAG-grounded chatbots](/services/rag-development/) can provide and regulated industries require.

Failure mode

Hallucinated answer, missed escalation, wrong tone

Tool-call loops, plan drift, partial state, idempotency bugs

Right model class

Claude Sonnet 4.6 or GPT-5 mini — fast, cheap, well-grounded

Claude Opus 4.7 / GPT-5 reasoning — for the plan layer

Right framework

Direct SDK + a thin orchestrator (or LangGraph for state)

LangGraph with proper state-graph control; never CrewAI for serious builds

Eval surface

Faithfulness, escalation precision, tone, latency p95

Trajectory eval, tool-call success, plan coherence, end-state correctness

Eval surface is where chatbot builds most often cut corners. Faithfulness and escalation precision sound simple but measuring them at scale requires a dedicated harness — RAGAS for retrieval quality, Langfuse traces for latency p95, and a held-out escalation set that fires on every deploy. Agents have it harder (trajectory evals are genuinely more complex), but chatbot teams who skip the eval build are the ones regretting it in month three.

Where we recommend it

Customer service, sales-qualifying, internal Q&A, voice deflection

Multi-system workflows, research agents, ops automation — see siblings

Most production builds aren't pure one-or-the-other — a chatbot can call a small agent loop for a sub-task; an agent can expose a chat surface to its operator. The point of the frame is to lock the primary shape so the architecture, the model class, the framework, and the eval surface all line up.

If your buyer is asking for "an AI agent" but the actual user-facing surface is a single-turn customer support widget, you're in the chat pillar — the agent vocabulary is procurement-side language, not architecture. We'll make that call on the framing call and route accordingly.

002 / SHAPES

## Five chatbot shapes. Every chatbot development services engagement maps to one.

Conversational AI development isn't a single shape — it's five distinct engineering envelopes with different default models, different frameworks, different eval surfaces, and different cost curves. The five shapes below cover roughly 100% of inbound. We won't sell you a voice chatbot development engagement when the deflection use case lives on web; we won't sell you a tool-using build when the actual ask is RAG-grounded.

-   01 · GROUNDED
    
    ### RAG chatbot over a curated corpus
    
    Customer service deflection, internal Q&A, regulated policy chat, product knowledge base. The corpus is curated, the answer is grounded, the eval surface is faithfulness plus retrieval precision. Roughly 60% of engagements. Default stack: Claude Sonnet 4.6, pgvector + BM25 hybrid retrieval, RAGAS eval, Langfuse traces.
    
-   02 · TOOL-USING
    
    ### Chat that calls structured tools
    
    Order-status, return-initiation, appointment-booking, CRM-aware chat. Three to eight tools is the typical envelope; tool-call success rate matters more than reasoning depth. Different from agents — the conversation stays single-thread. Default stack: GPT-5 mini or Sonnet 4.6, Zod / Pydantic schemas, function-calling SDK, eval on tool arg correctness.
    
-   03 · VOICE
    
    ### Sub-400ms voice chatbot development
    
    Phone-channel deflection, clinical intake, sales prospecting outbound, IVR replacement. Latency budget is the headline constraint. LiveKit Agents or Pipecat as the runtime, Deepgram for streaming STT, ElevenLabs for TTS, Sonnet 4.6 as the chat brain. Eval includes turn-take p50/p95 and barge-in recovery — neither shows up in a text eval suite.
    
-   04 · HYBRID + HANDOFF
    
    ### AI tier-one with warm transfer to humans
    
    Enterprise support where AI handles tier-one and escalates to a human agent with full conversation context. Roughly 40% of enterprise chatbot development engagements. The handoff layer is the highest-value engineering decision — warm transfer with context wins, cold handoff that re-asks the customer's name loses. Default: LangGraph state-graph, Zendesk Sunshine handoff, co-pilot suggestions in the agent's queue.
    
-   05 · MODERNISE
    
    ### Legacy chatbot replacement
    
    Drift, Ada, Intercom Fin, first-gen LLM chatbot, or rule-based legacy. Migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline. Target stack usually keeps the channel surface (Zendesk, Intercom, Twilio) and replaces the brain. Ships behind progressive rollout — cutover is reversible until the eval baseline is matched-or-exceeded.
    

A chatbot can call a small agent loop for one sub-task and still be a chatbot. The primary shape is the one that owns the user-facing surface — if the user sees a single chat thread that resolves or escalates, you're in chat; if the user kicks off a job and comes back later for the result, you're in [agents](/services/ai-agent-development/). Most inbound for a support chatbot or customer service chatbot lands in shape 01 (grounded) or shape 04 (hybrid handoff); a support chatbot with a real human queue almost always wants 04.

003 / NUMBERS

## What production chatbot development services look like at the latency level.

Performance targets we hold ourselves to before any chatbot development services engagement ships. The numbers below are typical-workload defaults — actual targets land in writing during the discovery phase and become the ship gate. We don't ship below the latency or faithfulness threshold; we'd rather miss the launch date than ship a chat that hallucinates under production traffic.

p95 < 1.8s

Grounded chat response

RAG + Claude Sonnet 4.6, typical workload

p50 ≈ 320ms

Voice turn-take

LiveKit + Pipecat, deepgram streaming

$0.004

Average grounded reply

Sonnet 4.6 + pgvector retrieval, mid-volume

0 –94%

Tier-1 deflection target

Per use case; eval-validated, not vendor-claimed

004 / STACK

## The six families of the modern chatbot stack.

A modern enterprise chatbot development build touches six families — chat model, conversation framework, retrieval, voice (if voice is in scope), observability + eval, and channel + handoff. The categories below name the default pick, the cost-floor alternative, and the conditions under which we'd revisit. Per-family recommendations land in writing during discovery; the chat-stack inventory is the artefact that survives the engagement.

Chat model layer (Claude Sonnet 4.6 · GPT-5 mini · Gemini 3 Flash)

Strengths

The mid-tier hosted models are where production chat lives in 2026. Claude Sonnet 4.6 leads on grounded answers and tone control. GPT-5 mini wins on raw throughput at scale. Gemini 3 Flash is competitive at the cheap end. Pricing is in the $0.3–3 input / $1.5–15 output per million tokens band — about an order of magnitude cheaper than the frontier tier and indistinguishable from it on most chat workloads we audit.

When We Pick

Default pick for almost every chatbot development services engagement. Customer service deflection, internal Q&A, sales qualifying, support knowledge bases. Below ~500M monthly tokens the hosted-mid economics beat self-hosted by 3–8×. C-suite buyers who want a named vendor on the contract.

When We Don't

Strict data residency where the provider's region map can't close — route to <a href="/services/llm-development/">self-hosted Llama 4</a> on your infra. High-frequency low-value chat (FAQ-style triage) where a smaller distilled model wins on cost. Reasoning-heavy multi-step shapes — that's not chatbot, that's an <a href="/services/ai-agent-development/">agent</a>.

Paiteq Pattern

Default architecture: Sonnet 4.6 for grounded chat, GPT-5 mini as the multi-vendor backstop with a thin routing layer above. Costs roughly a week of engineering and saves the renegotiation when one provider hikes prices in month nine. We've never recommended a single-vendor chat stack for any production engagement.

Mid-hostedMulti-vendorGrounded

Conversation framework (LangGraph · OpenAI Agents SDK · raw SDK)

Strengths

Chat doesn't always need a framework. A thin orchestrator over the SDK is the cleanest pattern for single-turn-ish chat — a system prompt, a retrieval call, a response, a handoff branch. LangGraph earns its keep when conversation state matters across turns: long context, multiple tool calls, structured handoff to a human agent. The OpenAI Agents SDK is fine for narrow OpenAI-only builds; we don't recommend it for multi-vendor chat.

When We Pick

LangGraph when state across turns matters — escalation flows, multi-step diagnostic chat, conversational forms with branching logic. Raw SDK + thin orchestrator when chat is genuinely single-turn-ish. Always wire eval and observability before the second sprint, regardless of framework.

When We Don't

CrewAI for serious chat — the role abstraction starts to fight you once you need state-graph control, and we've yet to ship a CrewAI-based chat into production without a re-platforming sprint. AutoGen — stalled relative to LangGraph; not recommended for new builds.

Paiteq Pattern

Roughly two-thirds of our chatbot development services engagements ship on a thin SDK orchestrator. The other third use LangGraph, usually because the chat has a handoff layer (warm transfer to human, escalation router, multi-channel context sync) where the state graph carries real weight.

State-machineThin-SDKLangGraph

Retrieval layer (pgvector · Pinecone · Qdrant · BM25 hybrid)

Strengths

RAG chatbot work lives or dies on the retrieval layer. pgvector is the cheapest, lowest-friction pick when Postgres is already in the stack — and Postgres is already in roughly nine out of ten enterprise stacks. Pinecone Serverless cuts ops bandwidth to near zero at a premium tier. Qdrant self-hosts cleanly when data residency is non-negotiable. BM25 hybrid (sparse + dense) wins on technical-product corpora where the exact-keyword match still beats embeddings half the time.

When We Pick

Any grounded chatbot — clinical, legal, regulated, internal knowledge base, product Q&A. Almost every chatbot engagement we ship recommends retrieval-augmented chat as the spine, not a generic conversational LLM blowing answers from training data. We route the deeper retrieval-pipeline work to our <a href="/services/rag-development/">RAG development services</a> practice when the scope earns it.

When We Don't

Pure tone-and-brand chat with no enterprise knowledge to ground against (rare, usually a sign the product team hasn't found the use case yet). Tiny corpora under 5k chunks where in-context retrieval beats a vector store and the index is overkill.

Paiteq Pattern

Default recommendation: pgvector when Postgres is already there; Pinecone Serverless when ops capacity is the constraint; Qdrant self-hosted when data residency requires it. We don't recommend Weaviate for a chat-only build unless multi-modal retrieval is the headline requirement.

RAGHybridpgvector-first

Voice stack (LiveKit Agents · Pipecat · Deepgram · ElevenLabs)

Strengths

Voice chatbot development is its own engineering discipline. LiveKit Agents and Pipecat both land sub-400ms voice turn-take in production — and the latency budget is the whole game. Deepgram leads on streaming STT accuracy at scale. ElevenLabs leads on voice quality; the open-source side (Whisper Large v3, F5-TTS) is closing fast but still trails on edge cases. Sub-second perceived turn-take is the difference between a voice agent that feels conversational and one that feels like an IVR with extra steps.

When We Pick

Support deflection at scale where call-centre cost is the headline number. Clinical intake, sales prospecting, after-hours triage. Anywhere the buyer journey involves a human picking up a phone. Roughly 25% of our voice chatbot development engagements in 2026 have replaced an existing IVR rather than building net-new.

When We Don't

Voice as a CEO whim with no buyer-journey evidence — the build is hard, the deflection economics are real but specific, and a voice agent without a deflection use case is theatre. Multi-step task automation — that's an <a href="/services/ai-agent-development/">agent</a> with a voice front, not a voice chatbot.

Paiteq Pattern

Default voice stack: LiveKit Agents + Claude Sonnet 4.6 + Deepgram + ElevenLabs. Open-source substitutes priced as a phase-two option when the volume crossover lands inside 12 months. We've shipped this exact stack three times in 2026 — never had to re-platform the voice layer.

VoiceSub-400msDeflection

Observability + eval (Langfuse · Braintrust · RAGAS · Inspect)

Strengths

Every chatbot we ship costs more in instrumentation than the buyer expected and earns it back inside the first month. Langfuse leads OSS observability with traces, prompt versioning, and a usable eval surface. Braintrust dominates closed-source eval workflows with a clean diffing UX. RAGAS is the default retrieval-and-faithfulness harness for grounded chat. Inspect AI (UK AISI-backed) is the rigour pick for safety-critical chat in regulated industries.

When We Pick

Every chat engagement. Day-one cost line, not a phase-three nice-to-have. Most stalled chat pilots we audit failed because nobody knew which prompts were drifting, which retrievals were silently returning irrelevant chunks, or which escalations were silently being miss-routed. Instrumentation is the cheapest insurance in the stack.

When We Don't

Never. We've never shipped a production chatbot without observability wired before the second sprint. Toy demos and POCs are the only exception — and the moment the demo earns a budget, observability lands in the next ticket.

Paiteq Pattern

Default: Langfuse self-hosted for teams with data-control concerns; Braintrust for teams with budget and no ops capacity. RAGAS as the retrieval eval harness regardless of trace backend. We don't recommend bare logging-without-traces for anything past prototype — it's the false-economy that creates the month-six stall.

Day-oneTracesEval-first

Channel + handoff (Twilio · Intercom · Zendesk · Slack · web)

Strengths

The channel layer is where most chatbot pilots quietly fail. Twilio carries SMS, WhatsApp, and voice transport. Intercom and Zendesk are the dominant support-channel anchors — Zendesk Sunshine is the canonical handoff API for enterprise support chat. Slack is the default for internal-Q&A chat. The web widget is the simplest channel and the easiest to instrument; mobile native chat carries half the engineering cost and twice the polish.

When We Pick

Multi-channel chat is the default ask in 2026 — buyer expects the same chatbot across web, Intercom, Slack, and a voice line. The handoff layer to a human agent is the single highest-value engineering decision: warm transfer with conversation context wins; cold handoff with a ticket creation loses every time.

When We Don't

Single-channel chat where the channel is fixed and there's no plausible expansion path — usually the buyer hasn't priced the multi-channel ask yet. We'd still build the abstraction; cost is roughly a week and the option value is enormous.

Paiteq Pattern

Default: channel abstraction layer above whichever vendor (Intercom / Zendesk / Twilio) carries the surface. Warm-transfer-with-context as the handoff default. We've migrated chat across three channel vendors for two clients this year — the abstraction earns its keep on the migration.

Multi-channelWarm-handoffTwilio

005 / ARCHITECTURES

## Four conversation architectures. Pick the one that fits the buyer journey.

RAG-grounded, tool-using, voice, and hybrid-with-handoff are the four architectures we ship across roughly 100% of chatbot development services engagements. The architecture determines the model class, the framework, the eval surface, and the team shape. We won't sell you a tool-using build when the buyer journey is information retrieval; we won't sell you grounded chat when the buyer journey is task completion. The framing call is free.

   
01

### RAG-GROUNDED

The most common chatbot shape we ship. A retrieval pipeline (pgvector + BM25 hybrid is the typical default) feeds the chat model context from a curated corpus — product docs, knowledge base, ticket history, regulated policy. Roughly 60% of chatbot development services engagements land here. The headline failure mode is silent retrieval drift: chunks rank well on cosine similarity but answer the wrong question. Eval is faithfulness + retrieval precision tracked over time, not a quarterly cherry-pick.

Pick when

-   Customer service deflection on a known product surface
-   internal Q&A over a curated KB
-   regulated chat where every answer must be grounded in policy
-   sales-assist over a product catalogue
-   B2B onboarding where the corpus is stable

Skip when

-   Open-ended chat with no curated corpus — that's a generic LLM, not a chatbot we'd build
-   Multi-step workflow execution — that's an agent
-   Pure entertainment chat — different design vocabulary entirely

Stack

Claude Sonnet 4.6pgvector + BM25RAGAS evalLangfuse traces

02

### TOOL-USING

Chat that needs to look up an order status, file a return, schedule a callback, or update a CRM record. Different shape from agents — the tool surface is narrow (typically three to eight tools), the conversation stays single-thread, and the tool-call success rate matters more than reasoning depth. Roughly 20% of engagements. Eval is tool-call success rate, argument correctness, and refusal precision when the user asks for something the tool surface doesn't cover.

Pick when

-   Order-status + return chat for ecommerce
-   appointment-booking chat with calendar write
-   CRM-aware sales chat with lead-write
-   support chat with ticket-creation
-   banking-info chat with read-only account lookups

Skip when

-   Multi-step task automation — that's an agent loop, not a tool-using chatbot
-   Pure information retrieval — RAG-grounded is cheaper and faster
-   Workflows that need branching execution state — escalate to a state-graph architecture

Stack

GPT-5 mini · Sonnet 4.6Function-calling SDKZod / Pydantic schemasEval on tool args

03

### VOICE

Voice chatbot development is its own engineering discipline. The latency budget is the headline constraint — anything past 500ms perceived turn-take feels like an IVR. LiveKit Agents and Pipecat dominate the production stack; streaming STT (Deepgram) plus parallel TTS (ElevenLabs) plus a small-context chat model is the canonical shape. Roughly 15% of engagements but the highest-value tier per minute of build. Eval is turn-take p50/p95, interruption handling, and barge-in recovery — none of which are visible in a text-chat eval suite.

Pick when

-   Support deflection where the existing channel is voice
-   clinical intake at scale
-   sales prospecting outbound
-   after-hours triage routing
-   IVR replacement where the legacy DTMF tree has lost the war on customer patience

Skip when

-   Voice as a CEO whim with no deflection use case
-   Visual-context-required interactions (returns with photo evidence, document review)
-   Anything where sub-second turn-take isn't actually the constraint — text chat is cheaper

Stack

LiveKit AgentsPipecatDeepgram streaming STTElevenLabs TTS

04

### HYBRID + HANDOFF

Production support chat is rarely AI-only. The hybrid shape — AI handles tier-one, escalates to a human agent with full conversation context, often with AI co-pilot suggestions in the agent's queue — is roughly 40% of enterprise chatbot development engagements. The headline engineering choice is the handoff layer: warm transfer with conversation context wins; cold handoff that re-asks the customer's name loses every customer who's been re-asked their name. We've migrated five clients off cold-handoff vendors in 2026 — every one of them lifted CSAT inside the first month.

Pick when

-   Enterprise support with a real human agent team
-   regulated chat where AI deflects but a licensed human signs the resolution
-   complex products with long-tail questions AI can't reliably answer
-   brands where escalation latency is itself a customer-experience metric

Skip when

-   Pure self-service products with no human support team
-   Volume-only deflection where the human agent doesn't exist
-   Workflows where AI cannot escalate quickly enough — that's an architecture flag, not an engagement

Stack

LangGraph state-graphZendesk Sunshine handoffCo-pilot suggest layerEval on handoff precision

006 / PHASES

## What a six-week custom chatbot development engagement actually ships.

A grounded chatbot pilot is roughly five distinct engineering phases against a fixed timeline. The phases below are the standard shape for a single-channel grounded chat; multi-channel adds a parallel integration sprint, voice chatbot development adds a barge-in and turn-take tuning phase, legacy modernisation prepends a contract-and-baseline read. The phases ship in series; eval lands in code before the chat model touches the corpus.

1.  01
    
    ### Discovery + use-case lock
    
    Sixty-minute exec session locks the use case, the channel, the deflection target, and the handoff posture. Existing channel data — call recordings, ticket archive, chat transcripts — read for tone, escalation patterns, and the long-tail of questions the human agent team currently absorbs. Output is a one-page chatbot spec everyone signs off in writing before any code lands. Some engagements end at this phase because the right answer is "do a discovery sprint first, build later" — we still ship the spec and bill the phase.
    
2.  02
    
    ### Corpus + retrieval scaffold + gold-set eval
    
    Ingest the corpus, chunk, embed, build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns the engineering time. First eval run on a 50-question gold set built with the buyer team — questions the human agent team actually fields, with the correct answer in the corpus and the wrong answer adjacent. Faithfulness and retrieval precision land in writing before the chat model touches the corpus. If the gold set fails at this stage, the corpus needs work before the chat does.
    
3.  03
    
    ### Chat surface + tool wiring
    
    System prompt iteration, tool-call wiring, response shaping. Chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface stays small and explicit — three to eight tools is the envelope; ten-plus tools is an agent in disguise. For grounded chat, the system prompt names the corpus, the failure modes, and the escalation triggers in writing.
    
4.  04
    
    ### Eval harness + handoff layer
    
    RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, a custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context — Zendesk Sunshine or the channel-native equivalent — never cold handoff. Eval thresholds locked in writing as the ship gate. The harness is the artefact that survives the engagement; the buyer's team owns it on day one of phase five.
    
5.  05
    
    ### Channel integration + UAT + ship
    
    Web widget, Intercom, Zendesk, Slack, or WhatsApp — whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass — voice can't reuse text eval, ever. Progressive rollout behind a feature flag; first-month tuning loop wired with the on-call team. Engagement closes with a written handover memo plus a thirty-day check-in.
    

Clean handoff is the default. The chatbot, the eval harness, the observability dashboards, the channel integration code — all owned by the buyer's engineering team on day one of phase five. About a third of pilot engagements convert to a build engagement under shapes 02–04; we don't push that conversion, the memo names the call either way.

007 / CHANNEL

## Channel picker. Where the chatbot lives, by use case.

The channel decision drives the engagement scope more than buyers realise. A single-channel pilot is roughly three to four weeks; a multi-channel build is six to eight; a voice build is eight to twelve. The grid below is the call we'd make on the rubric — channel across the top, use case down the side. Yes-cells are the default; maybe-cells need a follow-up call; no-cells are usually a misframing we'd route differently.

Use case

Web widget

Mobile native

Voice (phone)

Messaging (Slack / WhatsApp)

Customer service deflection

Default

Strong

High-value

Common

Sales qualifying / lead-capture

Default

Useful

Outbound only

WhatsApp wins

Internal employee Q&A

Possible

Rare

Skip

Slack default

Clinical / regulated intake

Yes — gated

Yes — app-context

High-value

Avoid SMS

Onboarding / activation

Default

In-product

Skip

Useful nudge

Booking / scheduling

Default

In-app

Strong fit

WhatsApp common

Knowledge-base / search-replace

Default

Useful

Voice search rare

Slack for internal

Yes = default channel for this use case. Maybe = depends on buyer-journey specifics. No = usually a misframing — we'd route to a different shape.

Multi-channel chat is the default ask in 2026. The channel abstraction layer is roughly a week of engineering and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year — every one validated the abstraction call we made at week two.

008 / GATES

## Six eval gates an honest chatbot ship clears.

A chatbot eval memo is only as honest as the gates the team runs before ship. Below is the screen we apply to every chatbot development services engagement — and the same screen we use when we're hired to second-opinion a chatbot a different vendor already shipped. Second-opinion work routinely flags at least one gate the original build silently skipped, usually faithfulness or handoff precision.

-   01
    
    ### Faithfulness against retrieval
    
    Does the answer match what retrieval returned, or is the model blowing answers from its training corpus? RAGAS plus a custom harness against a clinician- or domain-expert-built gold set. Tracked over time, regression flagged. Faithfulness is the headline failure mode for grounded chatbot work — and the easiest one for a vendor to hide behind a quarterly cherry-pick.
    
-   02
    
    ### Retrieval precision per query
    
    Did the right chunks rank above the wrong ones? Per-query precision tracked over time, with the long-tail of low-precision queries flagged for corpus work. Retrieval drift is the silent failure that turns a six-month-old chatbot into a ticket generator. We don't ship without a regression gate on this metric.
    
-   03
    
    ### Escalation precision
    
    When the chat should hand off to a human, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set built with the buyer's support leadership. Escalation false-negatives hurt CSAT; false-positives drown the human agent queue. Both directions matter and both get a threshold.
    
-   04
    
    ### Tone + brand alignment
    
    Does the chat sound like the brand? Human-rated against a rubric the buyer's brand and CX teams sign off in writing. Scored over time. The rubric is shared, not run on a private spreadsheet. Tone is the failure mode buyers feel before they notice the faithfulness gap — a chat that sounds wrong loses customer trust before the eval team catches the drift.
    
-   05
    
    ### Latency p50 + p95
    
    Sub-2-second p95 for grounded chat. Sub-400ms turn-take p50 for voice. Measured in production, alerted on regression. Latency regression usually traces to retrieval-layer drift or a model swap — and both are the kind of thing observability surfaces in week one of the regression, not month three.
    
-   06
    
    ### Failure mode named in writing
    
    What's the single most likely way this chatbot fails at month six? If the memo can't answer that question, it isn't a ship-ready eval — it's a vendor demo. We name the failure mode, the leading indicator, and the threshold at which the trigger fires. Roughly one in five engagements names a failure mode that catches the regression before the customer-facing incident; that's the metric we measure ourselves on.
    

Six-out-of-six clean is the ship gate; we don't launch below threshold. Two or fewer clean is the trigger for a methodology intervention — the build needs more eval work before any prompt tuning. Eval rigour is the cheapest insurance in the chatbot stack and the most-skipped line item in the vendor proposal.

009 / WHERE

## Six chatbot shapes across six industries — where we've shipped.

Capability-by-industry heatgrid for chatbot development services we've actually built, not what the brochure promises. Strength reflects engagement depth — dark cells are repeat patterns; light cells are honest about depth we haven't built yet.

Function Industry

B2B SaaS

Ecommerce

Healthcare

Fintech

Logistics

EdTech

Customer service chatbot

Sales / lead-qualifying

Internal Q&A bot

Voice chatbot

Regulated grounded chat

Hybrid AI + human handoff

Customer service chatbot

B2B SaaSEcommerceHealthcareFintechLogisticsEdTech

Sales / lead-qualifying

B2B SaaSEcommerceFintechEdTech HealthcareLogistics

Internal Q&A bot

B2B SaaSEcommerceHealthcareFintechLogisticsEdTech

Voice chatbot

B2B SaaSEcommerceHealthcareFintechLogistics EdTech

Regulated grounded chat

B2B SaaSHealthcareFintechLogisticsEdTech Ecommerce

Hybrid AI + human handoff

B2B SaaSEcommerceHealthcareFintechLogisticsEdTech

Possible fit Good fit Primary vertical

Dark cells: repeat engagement pattern. Medium: shipped at least once. Light: scoped but not yet completed. Empty: not yet relevant to the industry.

010 / PROCESS

## Six steps. Six weeks. One shipped chatbot.

Eval-first, gold-set-anchored, channel-aware custom chatbot development methodology — refined across grounded chat, tool-using chat, voice chat, and legacy modernisation engagements. The sequence below is the standard six-week build for a single-channel grounded chat. Voice adds a barge-in tuning phase; multi-channel adds a parallel integration sprint; legacy modernisation prepends a contract-and-baseline read. None of these run on a time-and-materials clock — fixed scope, fixed fee, fixed timeline.

WEEK 1

### Discovery + use-case lock

60-minute exec session to lock the use case, the channel, the deflection target, and the handoff posture. Read of the existing channel data — call recordings, ticket archive, chat transcripts. Output is a one-page chatbot spec everyone signs off in writing before any code lands.

WEEK 1–2

### Corpus + retrieval scaffold

For grounded chat: ingest the corpus, chunk, embed, and build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns it. First eval run on a 50-question gold set built with the buyer team. Faithfulness and retrieval precision land in writing before the chat model touches the corpus.

WEEK 2–3

### Chat surface + tool wiring

System prompt iteration, tool-call wiring, response shaping. The chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface is small and explicit — three to eight tools is the typical envelope.

WEEK 3–4

### Eval harness + handoff layer

RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context — Zendesk Sunshine or the channel-native equivalent. Eval thresholds locked in writing as the ship gate.

WEEK 4–5

### Channel integration + UAT

Web widget, Intercom, Zendesk, Slack, or WhatsApp — whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass; voice can't reuse text eval.

WEEK 5–6

### Ship + observability tuning

Live launch behind progressive rollout. Langfuse dashboards tuned with the on-call team. First-month tuning loop wired in — drift detection on retrieval, faithfulness regression flags, escalation precision regression flags. Engagement closes with a written handover memo plus a 30-day check-in.

011 / WHY PAITEQ

## Why teams pick us for enterprise chatbot development.

-   01
    
    ### Eval before prompt-tuning
    
    We ship the eval harness in code before the chat model touches the corpus. Most stalled chatbot builds we audit failed because the team prompt-tuned for three months without a regression gate. Faithfulness, retrieval precision, escalation, tone, latency — all measured in code, all regression-flagged, all owned by the buyer's team on day one.
    
-   02
    
    ### Multi-vendor chat by default
    
    We don't ship single-vendor chat. The routing layer above the model SDKs costs roughly a week of engineering and saves the contract renegotiation that always arrives in month nine. Sonnet 4.6 plus a GPT-5 mini fallback is the default; the buyer's team can swap providers in two weeks of vendor support effort, not six months of re-platforming.
    
-   03
    
    ### Warm-handoff, never cold
    
    Every chatbot we ship into a support context wires a warm-handoff layer with full conversation context. Cold handoff that re-asks the customer their name loses the customer who's been re-asked their name. The handoff layer is the highest-value engineering decision in any hybrid chat — and the most-skipped line item in legacy vendor proposals.
    
-   04
    
    ### Voice as its own discipline
    
    Voice chatbot development isn't text chat with a TTS layer slapped on. Sub-400ms turn-take, barge-in handling, interruption recovery, call-recording-as-eval-set — all engineering surface that doesn't exist in a text-chat build. We've shipped voice on LiveKit + Pipecat across healthcare, fintech, and logistics; never had to re-platform the voice layer.
    
-   05
    
    ### Channel abstraction from week two
    
    Multi-channel chat is the default ask in 2026. The channel abstraction is roughly a week of engineering at week two and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year — the abstraction call holds up every time.
    
-   06
    
    ### Fixed scope, written deliverable
    
    Three to twelve weeks per engagement; no time-and-materials clock; no vendor lock-in. The chatbot, the eval harness, the observability dashboards, the channel integration code — all owned by the buyer's team at handoff. About a third of pilot engagements convert to a follow-up shape; we don't push the conversion, the memo names the call either way.
    

012 / SHAPES

## Four ways to start a chatbot development services engagement.

The four shapes as picker cards. Fixed-scope, fixed-fee, written deliverable. Pick the closest match — the framing call refines if needed.

[

01 / PILOT ↗

Chatbot pilot — single channel

Three to four weeks, fixed scope. One channel, one use case, one corpus. Grounded chat with eval harness wired, observability live, handoff layer scoped. Deliverable is a working chatbot behind a feature flag plus a written handover memo. The default entry shape for ai chatbot development services — usually a customer service chatbot or a support chatbot pilot.

3–4 wksFixed

](#engage)[

02 / BUILD ↗

Custom chatbot development

Six to eight weeks, fixed scope. Multi-channel, tool-using or RAG-grounded, with eval and observability wired. Handoff layer included. Most enterprise chatbot development engagements land here. Ships behind progressive rollout with first-month tuning loop.

6–8 wksFixed

](#engage)[

03 / VOICE ↗

Voice chatbot development

Eight to twelve weeks, fixed scope. Sub-400ms turn-take voice chat with LiveKit + Pipecat + Deepgram + ElevenLabs. Includes barge-in handling, interruption recovery, and an IVR-replacement path where the legacy is in scope. Eval rigour at production-voice depth.

8–12 wksFixed

](#engage)[

04 / MODERNISE ↗

Legacy chatbot modernisation

Six to ten weeks, fixed scope. Replace a stalled rule-based or first-gen LLM chatbot with a grounded, eval-instrumented modern stack. Migration plan against the existing channel and handoff vendor contracts. We've shipped this against four legacy chatbot vendors in 2026.

6–10 wksFixed

](#engage)

013 / USE CASES

## Where the chat has landed.

Three typical-shape engagement patterns. Function, segment, and deliverable are real shapes; specific client metrics land in case studies once shipped engagements clear NDA.

Healthcare

Multi-state payer · HIPAA-grounded chat

### Grounded member-services chatbot against a regulated policy corpus

Typical shape: a US healthcare payer needs a member-services chatbot that grounds every answer in the actual benefit policy document, never the model's training corpus. We build the chat surface on Sonnet 4.6 + pgvector hybrid retrieval against the policy library, wire RAGAS eval against a clinician-built gold set, and ship the handoff layer warm-transfer-with-context into the existing Zendesk queue. Faithfulness is the headline gate; we don't ship below the threshold.

Deliverable: grounded chat + RAGAS eval harness + warm-handoff into Zendesk Sunshine

Ecommerce

DTC retail · multi-channel post-purchase chat

### Returns + order-status chatbot across web, Intercom, and WhatsApp

Typical shape: a DTC retailer wants to deflect tier-one post-purchase volume across three channels without re-asking the customer their order number twice. We build a tool-using chatbot on GPT-5 mini with order-lookup, return-initiation, and tracking tools; ship the channel abstraction above Intercom plus a WhatsApp Business adapter; wire the warm-handoff layer to the human agent queue with full conversation context. Eval on tool-call accuracy and escalation precision against a stratified live-traffic shadow.

Deliverable: multi-channel tool-using chat + channel abstraction + handoff layer

Fintech

Pre-Series-B lending · regulated voice chat

### Voice chatbot for KYC pre-fill and after-hours intake

Typical shape: a regulated lending platform wants to compress KYC intake to a five-minute voice conversation with a sub-second turn-take target. We ship LiveKit Agents + Deepgram + ElevenLabs + Claude Sonnet 4.6 as the chat brain. Barge-in handling, interruption recovery, and call-recording-as-eval-set wired before launch. Handoff layer warm-transfers to a licensed loan officer when the conversation reaches a regulated decision point.

Deliverable: voice chat stack + sub-400ms turn-take + regulated-handoff layer

014 / STACK

## The stack we ship against.

Chat models, conversation frameworks, retrieval, voice, observability, and channel — the surface a 2026 chatbot build actually touches.

-   Claude Sonnet 4.6
-   GPT-5 mini
-   Gemini 3 Flash
-   Llama 4
-   LangGraph
-   OpenAI Agents SDK
-   pgvector
-   Pinecone
-   Qdrant
-   Langfuse
-   Braintrust
-   RAGAS
-   LiveKit
-   Pipecat
-   Deepgram
-   ElevenLabs
-   Twilio
-   Intercom
-   Zendesk Sunshine
-   Slack
-   Claude Sonnet 4.6
-   GPT-5 mini
-   Gemini 3 Flash
-   Llama 4
-   LangGraph
-   OpenAI Agents SDK
-   pgvector
-   Pinecone
-   Qdrant
-   Langfuse
-   Braintrust
-   RAGAS
-   LiveKit
-   Pipecat
-   Deepgram
-   ElevenLabs
-   Twilio
-   Intercom
-   Zendesk Sunshine
-   Slack

015 / FAQ

## What buyers ask before signing.

What's the difference between chatbot development services and ai agent development?

Different shape, different engineering discipline. Chatbot development services here cover single-turn-ish conversational systems — the user asks, the chat answers (often grounded in a retrieval pipeline), the conversation either resolves or hands off to a human. [AI agent development](/services/ai-agent-development/) covers multi-step autonomous task execution — plan, act, observe, iterate, often over minutes or hours of runtime with ten-plus tool calls per session. The model class is different (Sonnet 4.6 for chat, Opus 4.7 for the plan layer of agents); the framework is different (thin SDK or LangGraph for chat, full state-graph for agents); the eval surface is different (faithfulness and escalation for chat, trajectory and tool-call success for agents); the latency budget is different (sub-2-second for chat, 5–60 seconds is normal for agents). If your use case is customer service, sales-qualifying, internal Q&A, voice deflection — you're in the right pillar. If it's multi-system workflow automation — route to the agent practice.

Why do you default to Claude Sonnet 4.6 instead of GPT-5 for grounded chat?

Tone and grounding. Sonnet 4.6 leads on faithfulness-against-retrieved-context in our eval runs — measurably less hallucinated content when the retrieval layer feeds it on-topic chunks, and a noticeably better refusal posture when retrieval misses. GPT-5 mini wins on raw throughput and is our backstop in the routing layer; we ship a multi-vendor abstraction above both. Gemini 3 Flash is competitive at the cheap end for low-stakes chat — we've shipped it on two internal-Q&A engagements where the cost economics flipped the spreadsheet. The honest answer is: it depends on the use case, and the default is Sonnet 4.6 plus a routing fallback. We've never recommended a single-vendor chat stack for any production engagement — vendor risk is real and the abstraction costs roughly a week of engineering.

How do you eval a chatbot before it ships?

Five surfaces. Faithfulness — does the answer match what retrieval returned? RAGAS plus a custom harness against a gold set built with the buyer team. Retrieval precision — did the right chunks rank above the wrong ones? Tracked per query, regression flagged. Escalation precision — when the chat should hand off, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set. Tone — does the chat sound like the brand? Human-rated against a rubric, scored over time. Latency p50 / p95 — measured in production, alerted on regression. For voice chat add turn-take p50 / p95, barge-in success, and interruption-recovery. We ship eval in code before the chat model touches the corpus — the harness is the artefact that survives the engagement.

Do you build voice chatbot development on LiveKit or build it custom?

LiveKit Agents or Pipecat for every voice chatbot we ship. We don't build the voice transport layer custom — Twilio carries the PSTN side, LiveKit handles the WebRTC and the agent runtime, Deepgram does the streaming STT, ElevenLabs handles TTS. The custom work is the chat brain — system prompt, retrieval over the right corpus, tool wiring, handoff layer, and the eval harness. Building voice transport custom is a six-month engineering yak-shave that no buyer has ever earned the ROI on. LiveKit lands sub-400ms turn-take out of the box; the engineering work is making the chat brain not stupid inside that latency budget, which is where every voice chatbot development engagement actually lives.

What does an enterprise chatbot development engagement cost?

Fixed scope, fixed fee. A single-channel chatbot pilot runs three to four weeks at the lower end of the band. A full multi-channel custom chatbot development engagement runs six to eight weeks. Voice chatbot development runs eight to twelve weeks (the engineering surface is bigger). Legacy chatbot modernisation runs six to ten weeks depending on the migration depth. We quote exact numbers after a 30-minute scoping call. None of our chatbot development services engagements run on a time-and-materials clock — we sell a working chatbot against a fixed scope, not hours. Pricing scales with channel count, corpus complexity, and handoff depth; the lower end is single-channel grounded chat, the upper end is multi-channel voice + handoff + IVR replacement.

Can you migrate us off an existing chatbot vendor (Drift, Ada, Intercom Fin, etc.)?

Yes — legacy chatbot modernisation is a defined engagement shape. We've migrated chat off four legacy vendor stacks in 2026, every one for the same reasons: eval depth was vendor-controlled, model swap was vendor-blocked, channel abstraction didn't exist, and renewal economics had walked away from the value. The migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline (where one exists). The target stack is usually Sonnet 4.6 + pgvector + Langfuse + your existing channel anchor (Zendesk, Intercom, Twilio) — vendor surface kept, brain replaced. Migration runs six to ten weeks; we ship behind a progressive feature flag so the cutover is reversible until the eval baseline is matched-or-exceeded.

How is this different from your conversational ai development or llm chatbot development work?

Same pillar, different language. Conversational AI development is the head-term — the discipline of building conversational systems, regardless of channel. Chatbot development services is the buyer-side phrase — what the procurement team types into the search bar. LLM chatbot development is the architectural sub-category — chat where the brain is an LLM (which is roughly 100% of 2026 builds; rule-based chat is a legacy modernisation target, not a greenfield shape). All three terms describe work we ship under this pillar. The sibling practices: [AI agent development](/services/ai-agent-development/) for multi-step autonomous task execution; [RAG development services](/services/rag-development/) for retrieval pipelines deeper than the chat use case requires; [LLM development services](/services/llm-development/) for the model-engineering layer (fine-tuning, hosted-vs-self-hosted, cost engineering) beneath a chat surface.

016 / Related practices

## Adjacent services.

[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

RAG DEVELOPMENT

RAG Development

Retrieval-augmented generation systems with evaluation built in.

](/services/rag-development/)[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)

017 / Start a chatbot engagement

## Ship a *grounded* chatbot in six weeks.

Grounded chatbot pilot in 3–4. Custom chatbot development in 6–8. Voice chatbot development in 8–12. Legacy chatbot modernisation in 6–10.

[Talk to a partner](/contact/) [See engagement shapes](#engage)


---

## SECTION: 4.9. Service: generative-ai

_Source: https://www.paiteq.com/services/generative-ai/_

# Generative AI Development Services — Paiteq

> Paiteq is a generative AI development services and generative AI consulting agency — brand-controlled image, video, audio, and multimodal pipelines for enterprise.

**HTML version:** https://www.paiteq.com/services/generative-ai/

## Key facts

- Modalities: image, video, audio, multimodal.
- Posture: brand-controlled outputs, eval-graded quality, enterprise guardrails.

## Related pages

- [LLM Development](https://www.paiteq.com/services/llm-development/)
- [Machine Learning Development](https://www.paiteq.com/services/machine-learning-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

Generative AI Development Services

# *Generative AI development services* — brand-controlled image, video, audio at scale.

Paiteq is a generative ai development services and generative ai consulting agency. We ship production pipelines on Flux, SDXL, Sora, Runway, ElevenLabs and Cartesia — brand-controlled with LoRA training, eval-graded, with provenance and human-review gates built in. Not a vendor of a single house model; we pick the stack against your output spec, your data rules, and your unit economics.

[Talk to engineering](/contact/) [See engagement shapes](#engage)

Modalities Image · Video · Audio · Multimodal

Stack Flux · SDXL · Sora · Runway · ElevenLabs

Engage Pilot · Build · Brand-LoRA · Advisory

Provenance C2PA + watermark

001 / OUTCOMES

## Four numbers that travel with every generation pipeline.

The eval rubric is what separates a generative ai company from a prompt-engineering hobbyist. We grade every pipeline on brand-fidelity, safety, cost, and latency — measured weekly, published to the client in a shared dashboard. These four numbers determine whether a pipeline ships and whether it stays live.

0 %

Brand-fidelity score

Human-graded on the brand rubric weekly post-launch.

< 0 %

Safety failures

NSFW or off-brand outputs reaching publish. Hard gate.

0 %

Asset cost cut

Median per-asset cost vs. agency / stock-photo baseline.

< 0 s

P95 image latency

Prompt to rendered file. Video and audio carry their own SLAs.

Numbers above are medians across shipped engagements on Flux, Stable Diffusion 3, Sora, Runway, ElevenLabs, and Cartesia for clients in e-commerce, ed-tech, fintech, and B2B SaaS. Spread by surface type is wide — we model your specific workload at discovery rather than quote an average.

-   01
    
    ### The design lead is the eval judge
    
    Not the engineering lead, not us, and definitely not the model. Brand-fit is a judgment call — the person who owns the brand has to own the grading rubric. Pipelines fail at three months when this is reversed.
    
-   02
    
    ### The cost curve isn't the marketing slide
    
    Hosted API pricing is fine until you cross the volume cliff where self-hosted SDXL or Flux Dev becomes cheaper. Most pipelines we audit are sitting on the wrong side of that cliff. Benchmarks below carry the real break-even math.
    

002 / CAPABILITIES

## Six generation capabilities across six modalities.

A capability-by-modality heatgrid showing where Paiteq's generative ai services have shipped at production scale. Strength scores reflect what we've taken to production, not what we've experimented with — the gaps are honest.

Function Industry

Image

Video

Audio

3D / depth

Vision-LLM

Brand LoRA

Marketing / brand assets

Product photography

Personalised video

Voice / narration

Synthetic training data

Multimodal apps

Marketing / brand assets

ImageVideoAudioVision-LLMBrand LoRA 3D / depth

Product photography

Image3D / depthVision-LLMBrand LoRA VideoAudio

Personalised video

ImageVideoAudioBrand LoRA 3D / depthVision-LLM

Voice / narration

AudioBrand LoRA ImageVideo3D / depthVision-LLM

Synthetic training data

ImageVideoAudio3D / depthVision-LLM Brand LoRA

Multimodal apps

ImageVideoAudio3D / depthVision-LLM Brand LoRA

Possible fit Good fit Primary vertical

Dark cells: shipped at production scale. Medium: shipped in pilot. Light: experimented but not yet production. The empty cells are real — we don't claim depth we haven't built.

-   Image · Brand~50% of engagements
    
    Hero imagery, campaign variants, social-first cuts of approved hero shots, and the long tail of brand-consistent variations a design team can't economically produce by hand. The bar is brand fidelity, not photorealism — the failure mode is "off-brand" not "obviously AI". Brand-controlled generation via a custom LoRA is the differentiator against a prompt-only pipeline.
    
-   Product photographyHigher-difficulty cousin
    
    Same stack — SDXL or Flux with IP-Adapter and ControlNet — but the brand bar is steeper because a customer is looking at a specific SKU. The methodology we run here can replace a meaningful share of an in-house photo-studio's output, with tactile work (lifestyle, regulated categories) still routed to a human photographer — the goal is augmentation against catalogue volume, not full replacement.
    
-   VideoResearch → production in 2026
    
    AI video generation services is now a real category — Sora API, Google Veo, Runway, Kling. We've shipped explainer-video pipelines that take a script and produce hero plus scene cuts with brand-consistent characters. Cinematic video at SKU scale drives most of our video work. Live and long-form over 30s are still research; we don't ship those yet.
    
-   Voice & audioMost operationally mature
    
    ElevenLabs for studio-grade narration; Cartesia for conversational sub-150ms streaming. Voice cloning is gated on consent — we won't ship a clone without a documented opt-in chain, regardless of what the API permits. Synthetic data generation for ML training is a niche but high-leverage cousin — covering edge-case synthesis for vision and OCR pipelines where labelled data is scarce.
    
-   Multimodal~30% of pipelines
    
    A multimodal ai company in 2026 means running vision-LLMs (GPT-5 Vision, Claude Sonnet 4.6, Gemini 3.0 Pro) on the input side and generation models on the output side, in one pipeline. Image-in, decision via Claude, asset-out via Flux — this is where the practice meets our [llm development services](/services/llm-development/) sibling.
    

003 / SERVICES

## Four engagement shapes, fixed-scope.

Pilot, full Build, Brand-LoRA training, or Advisory. Every generative ai development services engagement maps to one of the four. Mixed engagements are billed as two consecutive shapes, not as a single open-ended retainer.

[

01 / PILOT ↗

Generation Pilot

One modality, one use case, eval-graded, demoed in 2–4 weeks. The way most clients start a generative ai development services engagement before committing to a full build.

2–4 wks

](#engage)[

02 / BUILD ↗

Production Build

Full pipeline with brand controls, human-review gates, observability, provenance. The bulk of our generative ai agency revenue. Includes four weeks of post-launch iteration.

8–14 wks

](#engage)[

03 / BRAND-LORA ↗

Brand-LoRA Training

Custom style LoRAs trained on your brand assets. Eval-validated against the design lead before deployment. Weights stay on your infrastructure.

4–6 wks

](#engage)[

04 / ADVISORY ↗

Generative AI Consulting

Model-selection audit, build-vs-API decision, compliance review, roadmap. Our generative ai consulting shape — outcome is a costed plan, not a prototype.

2–3 wks

](#engage)

003B / ADVISORY

## Generative AI consulting — when the question is build-vs-buy, not how-to-build.

A separate shape from the Build engagements. Generative ai consulting is a 2–3 week intervention where the deliverable is a costed decision memo, not a prototype. We field about a third of our inbound for this — most often from a CTO or CMO who has an internal team running prompt-only pipelines and needs an outside read before committing to a Flux self-hosting decision or a Brand-LoRA programme.

01

### Model selection

Flux vs Stable Diffusion 3 vs Imagen 4 for image; Sora vs Veo vs Runway for video; ElevenLabs vs Cartesia for audio. We bring our internal benchmark numbers and your eval scenarios into one room and pick.

02

### Build-vs-API economics

When does the GPU bill on self-hosted SDXL beat the per-asset bill on Replicate Flux Pro at your projected volume? Answer depends on idle time, batch size, and cache hit rate — most internal teams don't have the data to model it correctly.

03

### Compliance & IP review

C2PA provenance, watermarking, EU AI Act high-risk classification (phasing in through 2025–2027), HIPAA-aligned data handling where healthcare data is in the loop, and a clean read on what the model providers actually allow under commercial terms.

04

### Generative AI strategy memo

A costed roadmap that maps generation surfaces against your business model with priority order, budget, and timeline. This is the deliverable a board pack is built from.

Not a prototype, not a workshop, not a discovery sprint that produces nothing. Three weeks in you have the memo, the numbers, the model picks, the compliance assessment, the costed plan. If you build it, we roll the consulting fee into the Production Build. If your team builds it with our patterns, you take the memo and run.

004 / PATTERNS

## Four deployment patterns. We pick on cost, control, and compliance.

The decision between hosted-API, brand-LoRA, hybrid, and fully-self-hosted isn't religious — it's economics plus regulation plus brand control. About 60% of new engagements start as API-first; a quarter graduate to Brand-LoRA in Production Build; a tenth end up Hybrid or Self-hosted because volume or compliance forced it.

   
01

### API-FIRST

The fastest path to a working generation pipeline. Calls Flux Pro, Imagen, or DALL-E behind a thin service. Brand consistency comes from a constrained prompt library and reference images, not a custom model. A human-approval queue gates publish on customer-facing surfaces. Where we start about 60% of new engagements.

Pick when

-   Setup measured in days, not weeks
-   Editorial / blog illustration where brand drift is recoverable
-   Pilot phase to validate output spec before committing to a LoRA
-   Workloads under 10,000 assets/month where hosted economics win

Skip when

-   Regulated data — outputs leave your perimeter
-   Product photography where prompt-only fails brand QA
-   High-volume workloads where the per-asset bill flips against self-hosted
-   Markets needing tight ControlNet conditioning

Stack

Flux ProDALL-E 3Imagen

02

### BRAND-LORA

When prompt-only generation drifts off-brand, we train a style LoRA on 200–1,000 brand-approved assets. The LoRA captures the visual identity that a prompt can't describe. Paired with IP-Adapter for reference-image conditioning and ControlNet for compositional control. We've shipped this for e-commerce hero generation where the previous prompt-only pipeline failed brand QA at a 40% rate.

Pick when

-   Brand-fit scores reliably 88–94 on our rubric versus 60–75 for prompt-only
-   Weights deploy to your infrastructure
-   you own the artefact
-   Product photography or hero campaigns at SKU scale
-   Re-training cadence usually quarterly as brand evolves

Skip when

-   One-off editorial illustration — overkill
-   Brand assets under 200 examples — too thin to train against
-   Greenfield brand without locked visual identity
-   Workloads where prompt-only already clears brand QA

Stack

Flux LoRASDXLIP-AdapterControlNet

03

### HYBRID

A pragmatic middle path. Hosted models (Flux Pro, Imagen, Sora) handle the heavy generation. A self-hosted model on your GPUs does the brand-finishing step — applying the LoRA, watermarking, safety classifier, and any IP-sensitive transformations. Cuts the provider bill on high-volume workloads while keeping the regulated steps inside your perimeter.

Pick when

-   Break-even versus pure hosted lands around 50,000 assets / month for image
-   Modal or Replicate as the GPU substrate for the in-perimeter step
-   Mixed regulated / non-regulated content in the same pipeline
-   E-commerce, ed-tech, and publishers at scale

Skip when

-   Very low volume — the platform tax doesn't pay back
-   Pure-regulated workloads — go straight to fully-self-hosted
-   Workloads where the hosted models don't expose the LoRA hook we need

Stack

ModalReplicateFlux ProSDXL

04

### SELF-HOSTED

For regulated data or sustained six-figure-monthly hosted bills, we move the full pipeline onto your infrastructure. Stable Diffusion 3 or Flux base models, custom LoRAs, watermarking, NSFW classifier — all running on A100s or H100s on your cloud. Slower to stand up but the unit economics flip past a threshold and the data never leaves.

Pick when

-   HIPAA-aligned, defence, or EU AI Act high-risk workloads
-   Sustained six-figure-monthly hosted bills make the math flip
-   Need fine-grained ControlNet or custom LoRA hooks the hosted providers don't expose
-   Data residency requires inside-perimeter execution

Skip when

-   Low or spiky volume — GPU idle time eats the savings
-   Greenfield pilots where time-to-first-asset matters
-   Teams without MLOps capacity for the platform-engineering side
-   Workloads where commercial-license hosted models are easier IP-wise

Stack

SD3Flux DevH100vLLM

004B / BRAND

## Brand-controlled generation — the LoRA approach we ship for enterprise.

Generic image models produce generic outputs. That works for editorial illustration. It fails fast on product photography, hero campaigns, or any surface where a customer can tell that the brand is shaped by the model and not the other way around. Brand-controlled generation is how we close that gap — and it's a genuine technical gap in the SERP. None of the top-ranking generative ai agency pages cover the methodology at this depth, because most haven't shipped enough LoRAs to have one.

1.  01 Asset curation
    
    200–1,000
    
    Sit with the design lead, pull 200–1,000 brand-approved examples that represent the visual identity — composition, lighting, palette, characters, typography. Volume matters less than consistency; a 300-asset set with tight brand coherence beats a 1,500-asset set with drift. We clean EXIF, strip metadata that might confuse training, normalise resolution.
    
    If the asset set has too much visual drift across examples, we don't proceed to training. Tightening the curation set by a third is a better outcome than training a LoRA that captures the average instead of the brand.
    
2.  02 Training
    
    12–48 GPU-hr
    
    Usually on Modal or Replicate against a Flux .1 dev base — sometimes SDXL when the ControlNet ecosystem is mission-critical. Training runs 12 to 48 GPU-hours depending on rank and base resolution. We log every run; nothing trains without a tagged manifest of the source set.
    
    If the training loss curve doesn't converge within expected bounds, we adjust rank and learning rate and rerun rather than proceed to eval with a suspect checkpoint. Wasted GPU-hours beat a bad LoRA shipped to eval.
    
3.  03 Eval gate
    
    ≥88 brand-fit
    
    The phase where most brand-LoRA programmes silently fail at agencies without a methodology. Fixed rubric authored with the design lead before training started — palette adherence, composition coherence, typography rendering, brand affordance ("does this feel like our brand or like a Flux output?"). 0–100 scale. We don't ship under 88; we re-train rather than tune around it.
    
    If brand-fidelity drifts under 85 on the weekly sample, we re-baseline the prompt library or re-train the LoRA. Confident off-brand output is worse than a missed deadline because the design team loses trust in the pipeline.
    
4.  04 Deployment
    
    Weights on your infra
    
    Weights ship to your infrastructure. Pipeline wiring lands behind your auth. IP-Adapter and ControlNet stages chain in for reference conditioning and compositional control. Re-training cadence quarterly — brands evolve, model bases improve, the eval rubric itself tightens as design standards do.
    
    We retainer the re-training when the client wants ongoing ownership; hand it off to an internal ML team when the client wants the muscle. Either path is documented in the SOW before deployment starts.
    

004C / VIDEO

## AI video generation services — Sora, Veo, Runway, Kling in production.

Video generation is the modality that moved from research to production fastest in 2026. The market for ai video generation services is still small, mostly because the unit economics weren't there before Sora API and Veo opened up commercial endpoints. Now they are. About a quarter of our generative ai services revenue is video as of 2026, and the curve is still going up.

-   A
    
    ### Pipeline shape
    
    Storyboard first — break the script into shots, lock visual continuity (characters, palette, lighting), decide which shots are AI vs traditional. Shot-by-shot generation through the picked model (Sora API for hero cinematic; Veo for cleanest commercial licensing; Runway for editorial with masking + motion brush; Kling when the aesthetic differs). Per-shot eval against storyboard frame before assembly. **Human review on assembled cut before publish — always.**
    
-   B
    
    ### What we ship — and don't
    
    **Ship:** personalised explainer video (one script, many SKU or audience variants), product video at SKU scale (catalogue-driven), social-content pipelines (short-form variants of a hero piece).
    
    **Don't ship in 2026:** long-form coherent video over 30s, anything live or near-live, hyper-personalised content involving real people without a clean consent chain. Research problems or ethical hazards — check back in six months.
    

005 / MODELS

## Six model families. We pick per asset type and per data rule.

No house model. We've shipped on every meaningful provider in 2026, and the right call usually depends on three things: brand fidelity ceiling, unit economics at your volume, and where the data has to live. The matrix below covers what we reach for and when we don't.

Flux (Pro / Dev)

Strengths

Best photoreal output in the open ecosystem in 2026. Strong prompt adherence on long detailed briefs. Commercial licensing on Flux Pro via Replicate or BFL API; Flux Dev for self-hosted research. LoRA training is mature on the .1 schnell + dev variants. Strong text-in-image rendering — a recurring failure mode for SDXL.

When We Pick

Photorealistic image generation where prompt nuance matters. Brand campaigns where text needs to render correctly in the image. The default starting point for new image pipelines unless the client is already invested in another stack.

When We Don't

Tight latency budgets under 2s — Flux is heavier than SDXL Turbo. Workloads where the existing ControlNet ecosystem is critical — SDXL still has the deeper ControlNet inventory.

Paiteq Pattern

About 6 in 10 image pipelines we ship in 2026 lead with Flux Pro through the BFL API or Replicate. We pair with a Flux LoRA when brand fidelity needs to hit ≥90 on our rubric.

PhotorealLoRA-readyCommercial

Stable Diffusion 3 / SDXL

Strengths

Most mature open-weight image model — ControlNet, IP-Adapter, depth, pose, lineart conditioners all production-ready. Strong community LoRA ecosystem on Hugging Face. Cheapest to self-host at scale. SDXL Turbo for sub-1s latency when the output spec tolerates a quality dip.

When We Pick

High-volume self-hosted workloads where the unit economics matter more than raw quality ceiling. Workflows that need ControlNet, IP-Adapter, or other conditioners we don't get on hosted-only models. Brand-LoRA training where the cost of the LoRA matters.

When We Don't

Text-in-image rendering — SDXL is consistently the weaker model here, Flux beats it cleanly. Greenfield brand pipelines where the team isn't already invested in the SDXL ecosystem.

Paiteq Pattern

Self-hosted SDXL on Modal or your own H100 fleet is our default for high-volume image workloads — about 1 in 3 production builds in 2026. Always paired with IP-Adapter for reference-image conditioning.

Open-weightControlNetSelf-host

DALL-E 3 / Imagen

Strengths

Hosted-only, safety-tuned, and very fast to integrate. DALL-E 3 ships through the GPT-5 vision pipeline so you get the prompt-rewriting step for free. Imagen 4 has the cleanest commercial-licence story on Google Cloud. Both have strong default safety classifiers.

When We Pick

Quick-start image pipelines where time-to-first-asset matters more than long-term unit economics. Clients on Google Cloud who already have Vertex billing set up. Editorial / blog-illustration workloads where prompt-only is enough.

When We Don't

Custom LoRAs (impossible — these are closed-weight). High-volume workloads where the hosted bill flips against self-hosted. Workloads needing ControlNet or fine-grained conditioning.

Paiteq Pattern

We use DALL-E 3 or Imagen as the bootstrap model in Pilot engagements when the client wants to see output in week 1. About a third graduate to Flux + LoRA in Production Build.

HostedSafety-tunedFast-start

Sora · Veo · Runway · Kling

Strengths

The 2026 video-gen landscape. Sora API for cinematic 5–20s shots with strong physics. Google Veo for product video with the cleanest licensing. Runway for editing workflows — masking, motion brush, image-to-video. Kling for stylised social-content workflows where the aesthetic differs from Sora.

When We Pick

AI video generation services workloads — personalised explainer video, product demo at SKU scale, social-content pipelines. Always with storyboard → shot list → generation → human-review gate; never raw model-output-to-publish.

When We Don't

Long-form video (>30s coherent) — still a research problem in 2026. Live or near-live generation — these are all minutes-per-clip workflows. Highly brand-specific motion — fine-tunes are not generally available on the video models yet.

Paiteq Pattern

About a quarter of our generative ai services revenue in 2026 is video. Sora as the lead model for hero content, Runway for the editorial pipeline. Always reviewed by a human before publish — the failure modes are too costly.

SoraRunwayStoryboard-gated

ElevenLabs · Cartesia

Strengths

ElevenLabs for studio-grade narration and voice cloning with commercial licensing. Cartesia for sub-150ms streaming TTS that works inside conversational apps without breaking turn-taking. Both have multilingual coverage now matching English quality on 20+ languages.

When We Pick

ElevenLabs for production audio where quality is the headline — audiobooks, marketing video voiceovers, brand narration. Cartesia inside any conversational app or low-latency in-product narration where ElevenLabs' latency would hurt UX.

When We Don't

Music generation — we route to Suno or Udio for that. Hyper-realistic clone-anybody workflows — we won't ship without consented voice opt-in regardless of what the API permits.

Paiteq Pattern

Cartesia inside any chatbot or voice-agent build for the latency budget. ElevenLabs in any video-or-audio generation pipeline. We've shipped both with student-voice opt-in and parental gating on ed-tech engagements.

TTSCloningLow-latency

GPT-5 Vision · Claude · Gemini

Strengths

Vision-LLMs are the multimodal ai engine — image understanding paired with text reasoning. GPT-5 Vision for the strongest OCR + chart reading. Claude Sonnet 4.6 for document understanding with the best refusal posture on PII. Gemini 3.0 Pro for million-token-context multimodal — long video clips with audio understood in one call.

When We Pick

Multimodal ai company shape — image-or-video input, text reasoning, text-or-action output. OCR-grade extraction from product photos, invoice scans, scientific charts, screenshots. Pairs with the LLM development practice for the text-side architecture.

When We Don't

Image / video output — vision-LLMs are read-only. For generation we route to Flux, Sora, etc. Single-modality text workloads where the vision capability is dead weight on the bill.

Paiteq Pattern

Vision-LLMs sit upstream of generation in about half of our multimodal pipelines — the LLM reads the input, decides what to generate, the generator ships the output. See <a href="/services/llm-development/">our llm development services</a> for the text-side patterns.

GPT-5 VisionClaudeGemini 3

006 / STACK

## Providers we've shipped on.

Pinned to what we have in production in 2026. Not the marketing list — the actual integrations under support.

-   Flux
-   SDXL
-   DALL-E 3
-   Imagen 4
-   Sora
-   Veo
-   Runway
-   Kling
-   ElevenLabs
-   Cartesia
-   Replicate
-   Modal
-   GPT-5 Vision
-   Claude
-   Gemini 3
-   Langfuse
-   Flux
-   SDXL
-   DALL-E 3
-   Imagen 4
-   Sora
-   Veo
-   Runway
-   Kling
-   ElevenLabs
-   Cartesia
-   Replicate
-   Modal
-   GPT-5 Vision
-   Claude
-   Gemini 3
-   Langfuse

007 / DECISION

## Modality × deployment pattern — what we pick when.

The same decision a generative ai consultant would walk you through in week one, compressed into a grid. Use it to pre-screen the conversation before you brief us.

Workload

API-first

Brand-LoRA

Hybrid

Self-hosted

Image — editorial / blog

Default

Overkill

Wait for scale

Not yet

Image — brand campaigns

Drift risk

Default

At >50k/mo

Only if regulated

Image — product photography

Weak control

Default

Scale step-up

High-volume

Video — short-form social

Sora / Runway

Not viable

Frame work only

Research

Voice — narration

ElevenLabs

Voice clone only

Hybrid latency

Edge cases

Synthetic data — ML training

Cost prohibitive

Per-edge-case

Common pattern

Default at scale

Regulated (HIPAA / EU AI Act)

Often blocked

Partial fit

With perimeter step

Default

Solid dot — default pick. Dash — wrong tool. Plain — fits with caveats; we'd dig in on the call.

008 / BENCHMARKS

## What the model choice actually costs you.

Two benchmarks every generative ai agency should be willing to put on a service page. Brand-fit scores from our internal rubric on a representative 50-example brand evaluation. Per-100-image costs at the volume tiers we've shipped at. Numbers shift quarterly as providers reprice; the ranking is more stable than the absolute values.

Brand-fit ceiling — image models (0–100, internal rubric)

Flux Pro

94

Photoreal + text-in-image

Stable Diffusion 3

86

Open-weight, ControlNet

Imagen 4

89

Cleanest commercial licence

DALL-E 3

81

Fast integration via GPT-5

SDXL base

74

Cheapest self-host

Cost per 100 images — current provider pricing

DALL-E 3 (hosted)

18¢

$/100 images

Flux Pro (Replicate)

12¢

$/100 images

Imagen 4 (Vertex)

14¢

$/100 images

SDXL (self-host, idle)

28¢

Effective $/100 at low volume

SDXL (self-host, 500k/mo)

2¢

Amortised at high volume

Self-hosted SDXL flips against hosted at roughly 50,000 assets / month for image. Below that, the GPU idle time eats the savings. Above it, the marginal cost approaches zero and the line steepens.

008B / ENTERPRISE

## Enterprise generative ai — what changes versus an SMB build.

A generative ai agency that ships SMB pilots and an enterprise generative ai practice are not the same shape. The model picks overlap. The pipeline architecture overlaps. What changes is everything around it — procurement, security review, data residency, audit trail, the compliance posture, and the cadence at which the design or content lead can grade an eval set. Most enterprise generative ai engagements we run involve 6 to 10 internal stakeholders, not 2; we plan for that at week one.

-   01
    
    ### Security review
    
    Enterprise procurement typically requires SOC 2 Type II evidence, ISO 27001:2022 alignment, a DPIA if European data is in scope, and increasingly an EU AI Act high-risk assessment. We bring the documentation rather than write it during the engagement.
    
-   02
    
    ### Data residency
    
    Many enterprise generative ai solutions must run inside the client's cloud perimeter — not Replicate, not a hosted API — because training data and prompts contain commercial-sensitive content. That pushes architecture toward Hybrid or Fully-Self-Hosted on Modal-on-your-cloud or a dedicated GPU fleet.
    
-   03
    
    ### Eval cadence
    
    SMB eval cycles can be daily; enterprise cycles are usually weekly, gated by design-team availability and stakeholder review windows. The eight-week Production Build extends to twelve in regulated industries — that's a feature, not a delay. Extra weeks are eval review and security sign-off cycles that compress the post-launch tail.
    

Enterprise tier ships with more documentation — pipeline diagrams, eval rubrics, runbooks, IP posture memos, security control mappings — because procurement and audit teams need the paperwork. The strategy memo includes a procurement-ready vendor map, a capex-vs-opex budget breakdown, a compliance assessment that names the controls, and a roadmap that survives the EU AI Act enforcement-schedule clarification expected in 2027. Delivered for fintech, healthcare, and a publishing-rights-heavy media client; each took the memo into a board pack.

009 / PROCESS

## Eval-first, brand-controlled, eight weeks to a publishable pipeline.

The eval set with the brand rubric lands in week 2, graded with the design or content lead. No generation pipeline ships without one — and the eval set keeps grading after launch, not just before. That's the difference between a generative ai company that talks about quality and one that measures it.

WEEK 1

### Discovery

Output spec (what assets, how many, brand rules, safety posture). Compliance posture — IP, EU AI Act, internal policy. We define the failure modes before the model is picked.

WEEK 2

### Eval set

Graded examples — brand-fit, factual accuracy, format compliance, technical quality. Built with the design or content lead, not invented by us. Drives the model selection.

WEEK 2–4

### Baseline

Multiple models benchmarked against the eval (Flux vs SDXL vs Imagen for image; Sora vs Runway for video; ElevenLabs vs Cartesia for audio). Cost, latency, quality, safety all measured.

WEEK 4–8

### Brand controls

LoRA training where prompt-only fails brand QA. IP-Adapter and ControlNet for compositional control. Constrained prompt library for the editorial surface. Human-approval gates on customer-facing publish.

WEEK 8+

### Deploy

Auth, rate limiting, Langfuse instrumentation, cost ledger, C2PA provenance, watermarking, NSFW classifier. Rollback playbook in the runbook. SOC 2 alignment when the workload touches it.

ONGOING

### Running

Weekly eval review, brand-drift monitoring, monthly cost audit, quarterly LoRA re-training. Ownership transfers to the client's design / content / engineering team.

010 / EVAL

## Four gates. Every pipeline. Every week.

1.  01 Brand-fidelity score
    
    ≥90
    
    Human-graded on a 0–100 rubric per asset type, sampled weekly post-launch. Rubric is built with the design lead — colour palette, composition, typography, brand affordance. LLM-as-judge for sub-scores, but final scores are human-confirmed on a 10% sample. Drift catches LoRA degradation early.
    
    If brand-fidelity drifts under 85 on the weekly sample, we re-baseline the prompt library or re-train the LoRA. Confident off-brand output is worse than a missed deadline because the design team loses trust in the pipeline.
    
2.  02 Safety failure rate
    
    <1%
    
    NSFW classifier on every output, brand-fit scorer, optional human-review gate. Tracked as failures-reaching-publish per 10,000 assets per week. We red-team the pipeline before launch with prompt-injection probes and adversarial inputs — the eval set includes the failure modes, not just the happy path.
    
    Any safety incident reaching a customer surface triggers a rollback to the human-gated pipeline within 24 hours and a root-cause review. We default to human-approval on high-stakes surfaces — it's cheaper than recovering from one published failure.
    
3.  03 Median cost per asset
    
    Modelled at discovery
    
    Per-asset cost — model API spend, GPU minutes, watermarking, classifier passes — tracked weekly in Langfuse. Modelled during the Pilot using the expected output mix and volume. We don't quote averages from marketing decks; we model from the actual eval scenarios.
    
    If median asset cost drifts more than 25% above the baseline for two weeks, we audit the model routing (a quarter of overruns) and the cache hit rate (most of the remaining three quarters). Surprise bills aren't a surprise because the modelling is in week 2.
    
4.  04 P95 prompt-to-publish latency
    
    Under per-surface SLA
    
    Trigger to publishable asset, including queue waits and human-gate dwell where applicable. SLA varies by surface — sub-4s for in-app image, sub-30s for hero campaigns, minutes for cinematic video. We track p50 / p95 / p99 separately because the tails matter for queue sizing.
    
    P95 SLA breach for 72h triggers a routing review on the heaviest model nodes. Usually the fix is moving routine generations to a faster model (Flux schnell or SDXL Turbo) — not replatforming the pipeline.
    

011 / COMPLIANCE

## Provenance, watermarking, and the EU AI Act.

Generative AI inherits the IP risk, the data-protection rules, and now the EU AI Act high-risk obligations. Enterprise generative ai engagements either build the compliance posture in, or they get torn out the first time legal reviews the pipeline. We build it in. C2PA at generation, watermarking at publish, audit trail per asset, BAA-ready where healthcare needs it.

Audited annually · Continuous monitoring

-   C2PA provenance
    
    Content credentials embedded at generation
    
    AUDITED · 2026
    
-   Watermarking
    
    SynthID + invisible watermark options
    
    AUDITED · 2026
    
-   EU AI Act
    
    High-risk system obligations · phased 2025–2027
    
    AUDITED · 2026
    
-   SOC 2 Type II
    
    Audited annually · continuous monitoring
    
    AUDITED · 2026
    
-   ISO 27001:2022
    
    Current revision · annual surveillance
    
    AUDITED · 2026
    
-   HIPAA alignment
    
    BAA available for healthcare engagements
    
    READY
    
-   Commercial IP
    
    Flux Pro · Imagen · Firefly · ElevenLabs licensed
    
    AUDITED · 2026
    
-   Audit trail
    
    Per-asset provenance from prompt to publish
    
    AUDITED · 2026
    

012 / USE CASES

## Where teams have shipped.

Three anonymized engagements. Modality, segment, and outcome are real; brand removed under NDA.

Marketing

DTC retail · catalogue at scale

### Brand-LoRA image generation pipeline

Typical shape: custom SDXL LoRA trained on 200–1,000 brand-approved hero images. Pipeline takes product SKU and scene brief, generates variants on Flux Pro with the brand-LoRA, design lead approves before publish. C2PA provenance embedded at generation. Augments hero-variant work the in-house photo studio can't economically produce by hand.

Deliverable: trained LoRA + production pipeline + design-lead eval rubric

Education

Ed-tech · regulated learner audience

### AI-narrated lessons with consent-gated voice

Typical shape: self-paced lessons narrated in a chosen voice with parental opt-in and age-gating. Cartesia TTS for sub-150ms in-app streaming. SSML drives pacing per learning profile. ElevenLabs for the marketing-side teacher previews. Audit log on every narration to satisfy the safeguarding policy.

Deliverable: narration pipeline + consent ledger + safeguarding runbook

Product

B2B SaaS · release-ops shape

### Release-notes automation with brand voice

Typical shape: PR descriptions go in, LLM drafts release notes scored against the team's voice rubric, human edits, publish. Tone consistency measured weekly via a 40-example eval set. Pairs with our llm development services on the text generation side. Not consumer-facing, but the principle is the same — eval-graded before publish.

Deliverable: drafting pipeline + voice rubric + weekly eval dashboard

012B / WHY PAITEQ

## Why teams pick us as their generative ai development services partner.

-   01
    
    ### The eval set lands in week two
    
    Not after launch. Not "we'll figure it out". Not a vibes-check at the end. The design lead grades it, signs it off, and the rubric becomes the contract. Most generative ai agency engagements we audit didn't have one — that's why their pipelines drifted off-brand within a quarter.
    
-   02
    
    ### Cost is modelled at discovery
    
    No per-asset average from a marketing deck. We take your projected volume, output mix, cache hit rate estimate, and GPU idle profile if self-hosting is in play — and produce a defensible cost ledger by week two. Surprise bills aren't a surprise when modelling lives in Langfuse from the start.
    
-   03
    
    ### We name what we don't ship
    
    No long-form coherent video. No live generation. No voice clones without consent. No training data with unclear licensing. No pipelines without a human-review gate on customer-facing publish. The list of things an agency won't ship is the most reliable signal of what they will ship well.
    

013 / TIMELINE

## What the eight-week Production Build looks like.

The standard generative ai development services build — a defined slice ships in eight weeks. Brand-LoRA Training adds 4–6 weeks; Advisory runs in parallel at the front. Pilot is a tighter 2–4 week cut of the same shape.

6 phases

WEEK 1 Discovery

Output spec, brand rules, safety posture, compliance map

Spec sign-off

WEEK 2 Eval set

Graded examples across modalities; rubric authored with design lead

Design-lead grading complete

WEEK 3–4 Baseline

Multi-model benchmark; cost / quality / latency / safety scored

Model picked + costed

WEEK 4–6 Brand controls

LoRA trained (if needed), prompt library locked, human-gate UI wired

Brand-fit ≥ 88 on eval

WEEK 6–8 Pipeline build

Auth, rate limits, provenance, watermark, classifier, runbook

Dry-run scenarios green

WEEK 8+ Launch

Live publish, drift monitoring, weekly eval review

First 30 days of clean traces

014 / ENGAGE

## Four ways to start.

01 Generation Pilot Fixed scope

2–4 weeks

### Pilot one modality, one use case.

In scope

-   One modality, one use case
-   Eval set with brand + quality rubric
-   Working prototype on real data
-   Demo + cost / quality memo

Out of scope

-   Production deploy
-   Brand-LoRA training
-   Compliance posture (separate Advisory)

02 Production Build Fixed scope

8–14 weeks

### Full pipeline, brand controls, observability.

In scope

-   All Pilot deliverables
-   Brand controls (prompt library / LoRA)
-   Human-review gates on publish
-   Provenance + safety + audit trail
-   Four weeks of post-launch iteration

03 Brand-LoRA Training Fixed scope

4–6 weeks

### Custom style LoRA on your assets.

In scope

-   Asset curation and cleaning
-   LoRA training on Flux or SDXL base
-   Eval against design-lead rubric
-   Weights deployed to your infrastructure

04 Generative AI Consulting Fixed scope

2–3 weeks

### Model audit, build-vs-API, roadmap.

In scope

-   Model selection audit
-   Build-vs-API decision framework
-   Compliance and IP review
-   Costed roadmap memo

015 / FAQ

## What buyers ask before signing.

What's the difference between a generative ai development services engagement and an LLM build?

LLMs are text models. Generative AI in our taxonomy means non-text generation — image, video, audio, 3D, multimodal — with text generation handled by our [llm development services](/services/llm-development/) practice. The distinction matters because the engineering shape is different. Image and video pipelines have eval rubrics built around visual quality and brand fidelity, GPU economics that flip differently at scale, IP / copyright risk that pure text doesn't carry, and a human-review gate that's almost always mandatory on customer-facing surfaces.

About a third of our engagements end up multimodal — vision-LLMs read the input, a generation model produces the output. In those, the two practices collaborate. The pillar you land on first depends on whether the dominant value is in the read step or the generate step.

Can you train a brand-controlled model on our assets, and where do the weights live?

Yes — Brand-LoRA Training is one of our four engagement shapes, 4–6 weeks. We start with asset curation and cleaning (usually 200–1,000 brand-approved examples; more if the brand is broad). Then the LoRA trains on Modal or Replicate against a Flux or SDXL base. The output is evaluated against the design lead before deployment — brand-fit score must clear 88 on our internal rubric.

Weights deploy to your infrastructure, not ours. You own the artefact. Re-training cadence is usually quarterly as the brand evolves; we can either retainer that or hand it off to your team. We will not train on assets you don't have clean licensing for, and we document the IP posture in the SOW.

How do you handle IP, copyright, and provenance for generated content?

Three layers. First, model choice — we default to commercially-licensed models (Flux Pro, Imagen 4, Adobe Firefly where the use-case fits, ElevenLabs and Cartesia for audio). Second, provenance — C2PA content credentials embedded at generation, with optional invisible watermarking via SynthID or a partner. Third, audit trail — every asset has a record from prompt to model to publish surface, kept for the retention period in the SOW.

For regulated workloads (EU AI Act high-risk classifications, HIPAA-aligned use-cases), we configure the pipeline to satisfy the obligations and document the assessment in the engagement deliverables. Compliance is a shape we ship, not a checkbox we tick at the end.

Hosted (Replicate / OpenAI / Anthropic) or self-hosted on our cloud?

Hosted to start, almost always. The four engagement patterns answer this — API-first (60% of new pilots), Brand-LoRA, Hybrid, Fully-Self-Hosted. We move to hybrid or self-hosted when volume makes the math flip, regulated data forces it, or you need customisation (specific LoRAs, ControlNet conditioning) the hosted providers don't expose.

The break-even is workload-specific. For image generation, hosted typically beats self-hosted under 50,000 assets per month; above that, the unit economics favour self-hosted SDXL or Flux Dev on a small H100 fleet. For video, hosted dominates in 2026 because the open-weight video models don't yet match Sora or Veo on quality. For voice, Cartesia and ElevenLabs hosted is almost always the right answer — the latency and quality lead is too large.

How do you measure generation quality at scale, and what does failure look like?

Four metrics, one stepper. Brand-fidelity score (human-graded on a 0–100 rubric, target ≥90, sampled weekly). Safety failure rate (NSFW or off-brand outputs reaching publish, target <1%). Median cost per asset (modelled at discovery, drift threshold 25% / two weeks). P95 prompt-to-publish latency (per-surface SLA). All four live in Langfuse alongside the LLM observability for any vision-LLM upstream.

Failure is concrete. Brand drift means re-baseline the prompt library or re-train the LoRA. Safety incident means rollback to the human-gated pipeline within 24 hours plus a root-cause review. Cost runaway is almost always a model-routing or cache issue, not a volume issue. Latency breach almost always solves with a routing change to a faster model on the easy generations, not a replatform.

016 / Related practices

## Adjacent services.

[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)

017 / Start a project

## Ship *brand-grade* generation in 8 weeks.

Pilot in 2–4. Production Build in 8–14. Brand-LoRA in 4–6. Generative ai consulting in 2–3.

[Talk to engineering](/contact/) [Architecture review](/contact/?topic=arch-review)


---

## SECTION: 4.10. Service: machine-learning-development

_Source: https://www.paiteq.com/services/machine-learning-development/_

# Machine Learning Development Services — Paiteq

> Machine learning development services and custom ML development — gradient boosting, deep learning, forecasting, ranking, and recommendation systems on calibrated evals.

**HTML version:** https://www.paiteq.com/services/machine-learning-development/

## Key facts

- Techniques: gradient boosting, deep learning, forecasting, ranking, recommendations.
- Posture: calibrated evals; model card on every shipped system.

## Related pages

- [MLOps](https://www.paiteq.com/services/mlops/)
- [LLM Development](https://www.paiteq.com/services/llm-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

P6 · Services

# *Machine learning* development services + custom ML development — calibrated, drift-monitored, owned by you

Custom ML development from a machine learning development company that ships gradient boosting on tabular, PyTorch on vision, hierarchical forecasting at SKU scale, and two-stage ranking. Every model graded on calibration, slice-fairness, and drift — not just headline AUC.

[Talk to engineering](/contact/) [See engagement shapes](#engage)

Practice Custom ML development

Stack XGBoost · LightGBM · PyTorch · Faiss

Eval AUC · NDCG · Brier · PSI

Engagements 3–14 weeks · fixed scope

001 / WHEN IT WINS

## Classical ML vs LLM — when each one wins.

The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate, more explainable, and roughly a hundred times cheaper. Custom ML development is still the right answer for most business-prediction problems — fraud, churn, credit, demand, ranking. The two families don't compete head-to-head; they win on different shapes of input and output.

LLM-shaped problem

Machine-learning-shaped problem

What's predicted

Free-form text, tokens, generated content

A number, a class, a rank, a probability

Training data

Trillions of internet tokens (pre-trained)

Your labelled rows, 10k to 100M of them

Eval signal

Human-graded rubric, RAGAS, LLM-as-judge

AUC, RMSE, NDCG, calibration, lift

Classical ML wins on eval rigour. **AUC, RMSE, and calibration curves are deterministic** — run the same held-out set twice and you get the same number. LLM eval relies on human graders or an LLM-as-judge, both of which introduce variance and cost. For regulated decisions (credit, fraud, medical triage) that measurability is often a compliance requirement, not just a preference.

Latency floor

200–2,000ms per call (frontier hosted)

1–50ms per inference (XGBoost on CPU)

Unit economics

$/M tokens — scales with output length

Pennies per million inferences amortised

At production volume, the gap is 100×–1,000× per call. A LightGBM model serving 10M inferences/day on a single CPU node costs roughly **$30–50/month in compute**. The equivalent call volume through a frontier-hosted LLM costs **$3,000–$15,000/month** depending on output length. Classical ML doesn't win on capability — it wins on the unit-economics of high-volume, low-complexity decisions.

Failure mode

Hallucinated facts, off-brand drift

Calibration drift, label leak, concept drift

Counter-intuitively, LLM failures are easier to catch. A hallucinated output is *visible* — a customer escalates, a reviewer flags it, a monitor catches the token. Classical ML failure is often **silent**: the fraud model's AUC holds at 0.87 while the population it was trained on has drifted six months. The model is still scoring, the dashboard is still green, and the business outcome is quietly degrading. That's why we instrument PSI and calibration checks from day one.

When it wins

Unstructured input, generation, reasoning

Tabular signal, ranking, forecasting, risk

This row is the one that gets teams in trouble. The categories aren't competing — they describe **different input shapes and output contracts**. The 2026 mistake is defaulting to an LLM because it's the obvious tool, then discovering on week six that a gradient-boosted model trained on your warehouse data is 3× more accurate and explainable on demand. We do the framing call before scoping so the wrong tool doesn't cost you a quarter.

Hybrid pattern: vision-LLM as feature extractor with a classical head — the third option, covered in §4 below.

We won't sell you a classical ML build for a generative problem, and we won't sell you an LLM engagement for a problem a gradient boosting model would solve in week three. The framing call is free.

002 / SERVICES

## Four engagement shapes, fixed-scope.

Pilot, Production Build, Ranking-or-Recommendation, or an ML Audit. Every machine learning development services engagement maps to one of the four. Mixed engagements bill as two consecutive shapes, not an open-ended retainer.

[

01 / PILOT ↗

ML Pilot

One problem, one model family, one eval set. Baseline against a non-ML rule, ship a demo in 3–5 weeks. The way most clients start a machine learning development services engagement before committing to a custom ml development build.

3–5 wks

](#engage)[

02 / BUILD ↗

Production ML Build

Full pipeline — feature store, training, serving, monitoring. The bulk of our machine learning development company revenue. Includes six weeks of post-launch iteration on calibration and drift.

8–14 wks

](#engage)[

03 / RANK-OR-RECOMMEND ↗

Ranking / Recommendation

Catalogue-scale ranking models or recommendation systems development with a candidate-generation + re-ranking architecture. Pairs with an offline NDCG harness and an online A/B test rig.

6–10 wks

](#engage)[

04 / ADVISORY ↗

ML Audit & Roadmap

Read of your current ml model development practice, the data pipeline, the eval rigour, and the deployment posture. Deliverable is a costed memo, not a model. Often the gate before a full Production Build.

2–3 wks

](#engage)

003 / PATTERNS

## Four ML problem families. We pick on data shape, latency, and explainability.

Tabular-classify, forecasting, rank-or-recommend, and deep-vision cover roughly 95% of our custom ml development work. Framework, eval signal, deployment posture, and unit economics differ per family. About 60% of engagements start tabular; a third are forecasting or ranking; the rest deep or hybrid.

   
01

### TABULAR-CLASSIFY

The biggest revenue line in classical ML in 2026 — gradient boosting on tabular features still wins most business-prediction problems. XGBoost, LightGBM, or CatBoost as the model; Platt or isotonic calibration; SHAP for the explanation surface compliance always asks for. About 35% of our custom ml development engagements.

Pick when

-   Tabular features (counts, ratios, categoricals)
-   Sub-50ms inference budget on CPU
-   Need explanation per prediction (SHAP, regulator-readable)
-   Class imbalance manageable with weighting or SMOTE
-   50k+ labelled rows already in your warehouse

Skip when

-   Unstructured input (text, image, audio) — neural nets or LLMs win
-   Very low data regimes under 5k rows — classical statistics may beat ML
-   Streaming features with sub-1ms latency — feature-engineering overhead kills it

Stack

XGBoostLightGBMCatBoostSHAP

02

### FORECASTING

Forecasting services is a quietly large category — demand planning, supply chain, financial close, capacity. 2026 stack is mixed: Prophet for the explainable baseline, LightGBM-on-lags for the production workhorse, TimeGPT or Chronos for the foundation-model option when you have enough series. We default to LightGBM-on-lags — beats Prophet on accuracy, beats TimeGPT on cost.

Pick when

-   Time-series ml workloads with seasonality and lag structure
-   Need explanation of the forecast — finance and ops teams won't sign off on a black box
-   Mixed-frequency or hierarchical (per-SKU per-region) — gradient boosting handles natively
-   Budget allows offline training, online serving cheap

Skip when

-   Single short series under 200 observations — classical ARIMA or exponential smoothing wins
-   Pure anomaly detection — different family of models
-   Cold-start product where there's no history — forecasting is the wrong frame

Stack

LightGBMProphetTimeGPTChronos

03

### RANK & RECOMMEND

Ranking and recommendation systems development share a two-stage architecture — fast candidate generation (embeddings + ANN or co-occurrence pulls 200–2,000 candidates), then a heavyweight cross-encoder or gradient-boosted re-ranker scores the shortlist. Eval signal: NDCG@k offline plus a CTR or conversion A/B test online. We ship this for e-commerce catalogues, content feeds, and internal-search systems.

Pick when

-   Catalogue size over ~10k items where exhaustive scoring is too slow
-   Ranking-quality matters — NDCG, MAP, MRR are tracked
-   You have both interaction logs and item metadata
-   Re-ranking budget allows 10–100ms cross-encoder per query

Skip when

-   Tiny catalogues under 200 items — a rule-based ranker is fine
-   Pure cold-start with no logs — content-based heuristics first, ML later
-   You can't run an online A/B test — offline NDCG alone is a noisy signal

Stack

FaissScaNNLightGBMXGBoost ranker

04

### DEEP / VISION

The narrowest slice — production-grade vision or signal models trained on customer data. Defect detection on a factory line, OCR on a custom document type, sensor classification on industrial telemetry. PyTorch as the framework, timm for vision backbones, an open-weight base (ConvNeXt, EVA, or a vision-LLM as feature extractor) fine-tuned on labelled customer data. Smaller than tabular for us, but engagements run longer and the IP compounds more.

Pick when

-   Image, signal, or sensor input where pixels/samples carry the signal
-   5k–500k labelled examples on the target task
-   Latency budget allows 50–500ms on GPU or 10–50ms on a quantised CPU runtime
-   Use case has a clear failure-cost — defect, fraud, safety — that justifies the build

Skip when

-   OCR on standard document types — hosted Vision-LLMs (GPT-5 Vision, Claude) are cheaper and beat you on quality
-   Cold-start without any labelled data — synthetic data and active learning first, model second
-   General-purpose recognition — pretrained zero-shot is good enough

Stack

PyTorchtimmONNX RuntimeTensorRT

004 / MODELS

## Six model families. We pick per data shape and per latency budget.

No house model. We benchmark gradient boosting, neural nets, statistical baselines, and the vision-LLM-as-encoder pattern against the same held-out set, and ship the model that beats baseline by the agreed margin without breaking calibration. Roughly six in ten predictive analytics services we ship end on LightGBM or XGBoost.

Gradient Boosting (XGBoost · LightGBM · CatBoost)

Strengths

The default winner on tabular data in 2026 — still beats deep learning on most business-prediction tasks. LightGBM for speed at scale, XGBoost for the broader SHAP/ONNX ecosystem, CatBoost for messy categoricals without one-hot encoding. Sub-millisecond inference on CPU. Platt and isotonic calibration mature.

When We Pick

About 60% of predictive analytics services we ship lead with LightGBM or XGBoost. Default for churn, fraud, credit risk, conversion, hierarchical demand forecast, ranking re-rankers. Anywhere SHAP explainability is a compliance requirement.

When We Don't

Unstructured input. Sub-millisecond streaming features where loading the model is the bottleneck. Tasks under 2,000 rows where a logistic regression with strong features is more honest.

Paiteq Pattern

We hand off the booster file, an inference shim, a SHAP explainer, and a calibration sidecar — small enough to deploy on the existing service stack, no GPU required.

TabularSHAP-readyCalibrated

PyTorch + Lightning

Strengths

The 2026 default for neural-net work that isn't an LLM. PyTorch 2.x core, Lightning for training-loop scaffolding, timm for vision backbones. Inference via ONNX Runtime or TensorRT on GPU; quantised CPU via Intel Neural Compressor or AWS Neuron for edge.

When We Pick

Image, audio, signal, or long-sequence input the LLM doesn't handle economically. Custom embeddings for a recommender. Fine-tuning a vision backbone on customer data. Custom encoders for a downstream classical head.

When We Don't

Tabular workloads — gradient boosting beats you on accuracy and cost. Pure LLM workloads — that's <a href="/services/llm-development/">our LLM development services</a> sibling.

Paiteq Pattern

We've shipped PyTorch-based defect detection on a factory line, an OCR encoder for a custom document type, and the content-tower of a two-tower recommender. ONNX Runtime for inference — easier to deploy than a Python service.

VisionCustom netONNX-export

scikit-learn + statsmodels

Strengths

The honest baseline that beats half the deep-learning press releases. Linear and logistic regression with proper feature engineering, regularised regression (ridge, lasso, elastic). statsmodels for ARIMA, exponential smoothing, Holt-Winters. Cheap, explainable, calibration usually better out of the box than tree ensembles.

When We Pick

Datasets under 5,000 rows, constrained deployment (mobile, on-prem no-GPU), or regulators that care about explanation. Always as the baseline against which we compare LightGBM and PyTorch.

When We Don't

Large datasets where regularisation can't capture non-linear structure. Tasks where feature engineering takes longer than fitting LightGBM.

Paiteq Pattern

Every engagement starts with a scikit-learn baseline. About one in five ends up shipping the baseline as the production model — usually risk scoring where the regulator wins.

BaselineLinearStatsmodels

Time-series — Prophet · LightGBM-on-lags · TimeGPT

Strengths

Three families cover most forecasting services work. Prophet is the explainable baseline with built-in seasonality. LightGBM-on-lags (engineered lag features into gradient boosting) is the workhorse — beats Prophet on accuracy at most scales. Nixtla's TimeGPT and Amazon Chronos are the foundation-model option when you have hundreds of series.

When We Pick

Demand forecasting at SKU scale, financial close, capacity planning, claims volume. LightGBM-on-lags 60% of the time; Prophet 25% for explainability; TimeGPT or Chronos 15% when series count justifies the cost.

When We Don't

Single short series under 200 observations — ARIMA wins. Pure anomaly detection — different family. High-frequency tick data — specialist stack we route out.

Paiteq Pattern

We shipped a hierarchical demand-forecast across 12,000 SKU-region pairs on LightGBM-on-lags that cut a retail client's holding-cost overrun by 18% in a quarter. Foundation-model forecasting was tested and lost on cost.

ForecastingSeasonalHierarchical

Embeddings + ANN (Faiss · ScaNN · pgvector)

Strengths

The candidate-generation half of modern ranking and recommendation systems development. Embeddings from a custom two-tower, a pre-trained sentence encoder, or a vision encoder. ANN indexes (Faiss self-hosted, ScaNN on Google, pgvector for Postgres-native) pull 200–2,000 candidates under 10ms.

When We Pick

Catalogue-scale ranking and recommender systems. Semantic search where keyword search misses. User-similarity lookups. Anywhere brute-force scoring blows the latency budget.

When We Don't

Catalogues under a thousand items — exhaustive scoring is faster than building an ANN index. RAG retrieval — see <a href="/services/rag-development/">our RAG development services</a>.

Paiteq Pattern

Faiss self-hosted; pgvector when data already lives in Postgres and ops capacity is thin; ScaNN on Google Cloud. Two-tower trained in PyTorch, exported to ONNX, served behind the ANN index.

Two-towerANNCandidate-gen

Vision-LLM-as-encoder (GPT-5 · Claude · Gemini)

Strengths

A 2026 pattern that didn't exist three years ago — use a frontier vision-LLM as feature extractor on image or document input, then put a classical model on the embeddings. GPT-5 Vision and Claude Sonnet 4.6 both expose embedding endpoints; Gemini 3.0 Pro has the cleanest multimodal one. Cuts the labelled-data requirement by an order of magnitude.

When We Pick

Image classification with 500–5,000 labels instead of 50,000. OCR-grade document classification where structure carries the signal. Sentiment plus extraction on screenshots, charts, scanned forms.

When We Don't

Latency-tight on-device workloads — the vision-LLM call kills the budget. Regulated workloads where data can't leave the perimeter — fine-tune a self-hosted vision backbone instead.

Paiteq Pattern

We pair this with our <a href="/services/generative-ai/">generative AI practice</a> regularly — vision-LLM upstream as encoder, classical model downstream as predictor. Halves the labelling spend on most image-classification builds.

MultimodalFew-shotVision-LLM

005 / PIPELINE

## The ml model development pipeline we ship — four phases, eval-gated.

The model is roughly a fifth of a real ml model development engagement. The other four-fifths is the data audit, feature pipeline, calibration layer, and drift instrumentation. Skipping any is how machine learning solutions silently fail at month six — the model that beat baseline on day one is now mis-calibrated on a shifted population.

1.  01
    
    ### Data audit + leakage map
    
    Label quality, sample-selection bias, leakage paths, feature availability at inference time. Roughly half the engagements we audit have a leak we close before training — usually a feature only computable post-event. We document every join and time-of-availability per feature; the production system can only train on features it can compute at decision time.
    
2.  02
    
    ### Baseline + feature engineering
    
    scikit-learn baseline always — logistic or linear with regularisation, fit on a starter feature set. The candidate has to beat the baseline by the agreed margin or we don't ship it. Feature engineering layered after: aggregations, lags, target encodings, learned embeddings. The pipeline ships with the model as a single deliverable.
    
3.  03
    
    ### Bench + calibration
    
    LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model bench-raced on the frozen held-out set. Hyperparameters via Optuna or a structured grid — not random; the search log is kept. Calibration fitted as a Platt or isotonic sidecar. The model that ships isn't the one with the highest raw AUC; it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.
    
4.  04
    
    ### Deploy + drift instrumentation
    
    FastAPI or BentoML serving on the runtime appropriate to latency budget; ONNX or pickle for the artefact. Production Stability Index per feature on a 30-day rolling window; alerting at 0.2, pause-decisioning at 0.5. Weekly calibration check on production data. Runbook names trigger conditions, on-call rota, and retrain procedure in writing — drift detection is not a mental checklist on a single engineer's laptop.
    

Retrain cadence ships in the runbook — quarterly default, monthly when drift is consistent. The client's team runs the retrain script after handoff; we retainer the operation only when asked.

006 / EVAL

## Four gates. Every model. Every week.

Custom machine learning models without these four gates drift silently. Headline AUC stays clean while the business outcome degrades. Every model we hand off carries the gates as a contract — trigger conditions that fire a retrain included.

1.  01 Held-out AUC / RMSE / NDCG
    
    Beats baseline by ≥3pts
    
    Frozen test set carved out at engagement start. AUC for binary classification, RMSE or MAPE for regression and forecasting, NDCG@k for ranking. Baseline is whichever non-ML rule the client already runs. If the candidate can't beat baseline by three points on the headline metric, we don't ship. Roughly one engagement in eight ends up shipping the baseline because the ML model couldn't clear the bar.
    
    If the candidate doesn't beat baseline, we don't paper over it — we re-frame, harvest more labels, or close at the Pilot gate and bill against the audit memo. Confident-but-wrong ML is worse than no ML.
    
2.  02 Calibration error
    
    Brier < 0.18, ECE < 4%
    
    Calibration matters more than raw AUC in most business-prediction problems — a churn score that's well-ordered but mis-calibrated breaks every downstream business rule. We measure Brier plus ECE on the held-out set, fit a Platt or isotonic sidecar where needed, and re-measure. Re-checked weekly on production data.
    
    If calibration drifts above the threshold for two weeks, we re-fit the sidecar before re-fitting the model. Most drift comes from population shift, not model decay — the calibration layer is the right place to absorb it.
    
3.  03 Concept-drift detection
    
    PSI < 0.2 per feature
    
    Production Stability Index per feature on a rolling 30-day window. Above 0.2 the feature has drifted enough to inspect; above 0.5 the model is outside the training envelope. Paired with output-distribution monitoring. Tracked in Evidently or a Postgres dashboard, depending on the client's MLOps capacity.
    
    Any feature breaching 0.5 PSI fires a Slack alert and pauses automated decisioning if the use case warrants it (fraud, credit, healthcare). Retraining usually quarterly, monthly when drift is consistent. Trigger conditions documented in the runbook.
    
4.  04 Fairness · slice metrics
    
    ≤5pt AUC gap across slices
    
    AUC and calibration measured separately on the slices that matter — region, segment, protected class where legally relevant. Headline AUC can look fine while the worst slice is unusable. We surface the gap during model selection and discuss the tradeoff openly — sometimes the cheaper, fairer model wins.
    
    Above-threshold gaps trigger a re-weighting pass or a slice-specific model. We don't ship where the fairness story is uncomfortable; if the data won't support a fair model, that's a finding we surface in writing.
    

007 / CAPABILITIES

## Six capability families across six industries — where we've shipped.

A capability-by-industry heatgrid. Strength reflects what we've taken to production, not what we've explored. The light cells are honest — we won't claim depth we haven't built.

Function Industry

B2B SaaS

Fintech

E-commerce

Healthcare

Manufacturing

Logistics

Risk · Fraud · Credit

Churn · LTV · Conversion

Demand · Supply Forecast

Ranking · Search

Recommendation

Vision · Sensor · OCR

Risk · Fraud · Credit

B2B SaaSFintechE-commerceHealthcareLogistics Manufacturing

Churn · LTV · Conversion

B2B SaaSFintechE-commerceLogistics HealthcareManufacturing

Demand · Supply Forecast

B2B SaaSFintechE-commerceHealthcareManufacturingLogistics

Ranking · Search

B2B SaaSE-commerceHealthcare FintechManufacturingLogistics

Recommendation

B2B SaaSE-commerce FintechHealthcareManufacturingLogistics

Vision · Sensor · OCR

FintechE-commerceHealthcareManufacturingLogistics B2B SaaS

Possible fit Good fit Primary vertical

Dark cells: shipped at production scale. Medium: shipped in pilot. Light: experimented but not yet production. Empty cells are real.

008 / WHEN

## When ML is the answer — and when it isn't.

The most expensive failure mode here is shipping ML where a 20-line rule would have done the job. The second-most is the inverse — running a hand-tuned heuristic two years past the point a calibrated gradient boosting model would have doubled the outcome. The list below is the screen we run on every inbound.

-   01
    
    ### You have labels and a baseline
    
    Labelled outcome rows (5k+ for tabular, 500+ for vision-LLM-encoder, 50k+ for deep vision from scratch) and a non-ML rule already running. ML's value is whatever it adds on top of the baseline — without one, the lift number is just a vibes-check.
    
-   02
    
    ### The decision is repetitive and high-volume
    
    Pricing one row at a time at hundreds of millions per day. Ranking a catalogue at every page view. Scoring a payment in 40ms. ML amortises the engineering cost across the volume — a single decision a quarter doesn't justify the build.
    
-   03
    
    ### The cost of wrong is measurable
    
    Fraud losses, missed revenue, holding-cost overrun, false positives in a regulated process. ML works when you can put a number on the failure mode — that's how we tune the threshold and prioritise the slices.
    
-   04
    
    ### You can run an A/B test
    
    Offline NDCG or AUC is a noisy signal without an online check. If your environment doesn't support a controlled rollout, the engagement gets riskier — sometimes the right answer is to fix the experimentation pipeline first.
    
-   05
    
    ### It's a generative or reasoning problem — wrong family
    
    Free-form text out, multi-step reasoning, document understanding with no clean label per row. That's LLM territory, not classical ML. We'll route you to our [LLM development services](/services/llm-development/) sibling and run that engagement instead.
    
-   06
    
    ### The data isn't there yet — wrong stage
    
    Cold-start product, no logs, no labels, no baseline. ML is the wrong investment; the right one is a heuristic plus a measurement plan that builds the dataset for an ML build six months out. We'll say so in the audit memo.
    

If the screen lands clean on four of six, custom machine learning models are usually the right shape. Two or fewer, we'll often recommend something else — sometimes our [AI consulting](/services/ai-consulting/) shape, sometimes a heuristic-plus-measurement plan, sometimes nothing.

009 / PROCESS

## Eval-first, baseline-anchored, eight weeks to a calibrated model.

Metric and baseline land in week one — locked before training begins. Every model we ship is graded against the same frozen held-out set; nothing slides because the team got attached to a result. That's the difference between machine learning solutions that talk about quality and ones that measure it.

WEEK 1

### Problem framing

Predict what, for whom, against which baseline. The eval metric is locked in week one — AUC, NDCG, RMSE, calibration band. The baseline is the non-ML rule already running.

WEEK 1–2

### Data audit

Label quality, leakage paths, sample-selection bias, class balance, feature availability at inference time. Half the engagements we audit have a leak we have to close before training begins.

WEEK 2–4

### Baseline + features

scikit-learn baseline always. Then engineered features — aggregations, lags, target encodings, embeddings. The feature pipeline is part of the deliverable; nothing trains on features the production system can't compute.

WEEK 4–6

### Training + eval

Candidate models bench-raced against the baseline on the frozen held-out set. LightGBM, XGBoost, scikit-learn baseline, and where relevant a PyTorch deep model. Hyperparameters via Optuna or a structured grid.

WEEK 6–8

### Calibration + slices

Brier, ECE, slice-AUC, calibration sidecar. Fairness review across the slices that matter. The model that ships isn't the one with the highest raw AUC — it's the one that's well-ordered, well-calibrated, and fair across the slices the client cares about.

WEEK 8+

### Deploy + monitor

Serving layer (FastAPI, BentoML, or the client's existing pattern), ONNX or pickle artifact, feature-pipeline runbook, drift monitoring on PSI, weekly calibration check. Handoff to the client's ml model development team or our MLOps sibling.

010 / TIMELINE

## What the eight-week Production ML Build looks like.

The standard custom ml development build — a defined slice ships in eight weeks. Ranking-or-Recommendation adds 2 weeks for the two-stage A/B harness; the Pilot is a tighter 3–5 week cut.

6 phases

WEEK 1 Problem framing

Locked metric, locked baseline, eval-set plan, data-access list

Metric + baseline sign-off

WEEK 2 Data audit

Leakage map, label-quality report, feature-availability matrix

Audit findings reviewed

WEEK 3–4 Baseline + features

scikit-learn baseline; engineered feature pipeline v1

Baseline > non-ML rule

WEEK 4–6 Model bench

LightGBM / PyTorch / linear bench; held-out scores

Best model ≥ baseline + 3pts

WEEK 6–8 Calibration + fairness

Platt or isotonic sidecar; slice-AUC report

ECE < 4%, slice gap ≤ 5pts

WEEK 8+ Deploy + monitor

Serving layer, drift dashboard, runbook, retrain cadence

First 30d of clean traces

011 / STACK

## Frameworks we've shipped on.

Pinned to what we have in production in 2026. The actual integrations under support — not a marketing list.

-   XGBoost
-   LightGBM
-   CatBoost
-   scikit-learn
-   PyTorch
-   Lightning
-   timm
-   ONNX Runtime
-   Faiss
-   ScaNN
-   pgvector
-   Prophet
-   TimeGPT
-   Evidently
-   MLflow
-   Weights &amp; Biases
-   XGBoost
-   LightGBM
-   CatBoost
-   scikit-learn
-   PyTorch
-   Lightning
-   timm
-   ONNX Runtime
-   Faiss
-   ScaNN
-   pgvector
-   Prophet
-   TimeGPT
-   Evidently
-   MLflow
-   Weights &amp; Biases

012 / USE CASES

## Where teams have shipped.

Three anonymized engagements. Function, segment, and outcome metric are real; brand removed under NDA.

E-commerce

DTC retail · catalogue-scale

### Hierarchical demand forecast across thousands of SKU-region pairs

Typical shape: replace a Prophet-per-SKU pipeline with one LightGBM-on-lags model carrying hierarchical features (category, region, promo cadence). Training cost compresses materially; headline categories typically gain meaningful MAPE points against the prior baseline. Re-trained weekly on Modal. Pairs with the client's ERP for the planning loop.

Deliverable: hierarchical model + weekly retrain + planner-facing API

Fintech

Regulated lending · EU

### Calibrated credit risk model with SHAP-led explanation

Typical shape: gradient boosting on application + bureau features, isotonic calibration sidecar, SHAP-based per-application reason codes. Slice-AUC reviewed across regional and demographic cuts before sign-off with the regulator-facing risk lead. Replaces drifted logistic scorecards. Live with quarterly retrain.

Deliverable: calibrated model + reason-code service + slice-fairness register

Logistics

Last-mile · enterprise routing

### Cross-encoder ranking model on routing recommendations

Typical shape: two-stage — Faiss candidate-gen on driver-route embeddings, LightGBM ranker as cross-encoder. NDCG@10 measured offline on a frozen log; CTR measured online in an A/B test. Replaces hand-tuned heuristics that have been the dispatcher-productivity bottleneck for years.

Deliverable: ranker + retrieval index + offline + online eval harness

013 / WHY PAITEQ

## Why teams pick us as their machine learning development services partner.

-   01
    
    ### Baseline-anchored or we don't ship
    
    Every candidate model fights a scikit-learn baseline on the frozen held-out set. If it doesn't beat by the agreed margin, the engagement ends at the audit memo, not at a ship-it-anyway compromise. About one engagement in eight closes here, and the client gets a heuristic-plus-measurement plan instead.
    
-   02
    
    ### Calibrated, not just AUC-clean
    
    Headline AUC without calibration is a setup for downstream failures. We measure Brier, expected calibration error, and slice-AUC before the deploy — and instrument drift on every feature in production. The model that ships is the one that holds up at month six, not the one that wins on day one.
    
-   03
    
    ### You own the artefact
    
    The booster file, the ONNX export, the feature pipeline, the calibration sidecar, the eval harness, the runbook — yours, in your repo, under your license. We don't retain joint IP on customer-trained models. Retainer the operation if you want; the model is owned by you regardless.
    

014 / ENGAGE

## Four ways to start.

01 ML Pilot Fixed scope

3–5 weeks

### One problem, one model, one eval set.

In scope

-   Problem framing + locked metric
-   scikit-learn baseline
-   Candidate model bench against baseline
-   Held-out eval report + costed go-no-go memo

Out of scope

-   Production deployment
-   Drift instrumentation
-   Ongoing retraining (separate Build)

02 Production ML Build Fixed scope

8–14 weeks

### Calibrated, drift-monitored, owned by you.

In scope

-   All Pilot deliverables
-   Feature pipeline in your repo
-   Calibration sidecar
-   Drift instrumentation + runbook
-   Six weeks of post-launch iteration

03 Ranking / Recommendation Fixed scope

6–10 weeks

### Two-stage candidate-gen + re-rank.

In scope

-   Candidate-generation index (Faiss / ScaNN / pgvector)
-   Cross-encoder re-ranker training
-   Offline NDCG harness
-   Online A/B test rig wired

04 ML Audit & Roadmap Fixed scope

2–3 weeks

### Read of practice + costed roadmap.

In scope

-   Data audit, eval-rigour audit, deployment-posture read
-   Leakage and drift findings in writing
-   Costed roadmap memo for the next 6–12 months

015 / FAQ

## What buyers ask before signing.

When do we pick classical ML over an LLM?

The short answer — whenever the inputs are tabular rows and the output is a number, a class, or a ranked list. Gradient boosting on engineered features still beats every LLM-shaped solution we've benchmarked for those problems at roughly a thousandth of the inference cost. The 2026 mistake we see most often is teams reaching for an LLM where a 50-millisecond LightGBM model would be more accurate and roughly 100× cheaper. LLMs win when input is unstructured and output is generative or reasoning-shaped. The hybrid pattern — vision-LLM as feature extractor, classical model as predictor — is the third option, and it's where this practice meets [our LLM development services](/services/llm-development/).

How much labelled data do we actually need for a custom ML build?

Depends on the problem family and the model. For tabular gradient boosting on a business-prediction problem, 5,000–20,000 labelled rows is usually enough to beat a heuristic by a defensible margin; 100,000+ is where the model gets sharp. Deep vision from scratch typically needs 10,000–500,000 examples. The vision-LLM-as-encoder pattern collapses that to 500–5,000 — the single biggest unlock in custom ml development since transformers landed. If you don't have labels and can't get them cheaply, we'll say so during the audit. Sometimes the answer is active learning, sometimes synthetic data, sometimes a heuristic-plus-measurement plan that gets you to ML in six months.

Who owns the model after delivery — Paiteq or the client?

The client. Every machine learning development services engagement ends with the model artefact (booster file, ONNX export, PyTorch checkpoint), the training code, the feature pipeline, the eval harness, and a runbook in your repo under your license. We don't retain joint IP on customer-trained models — the trained weights, the engineered features, the calibration sidecar are yours. What we retain is methodology: build templates, eval rubrics, audit playbooks. Operate the model on a retainer if you want; the artefact stays yours regardless.

How do you handle drift, calibration, and retraining cadence?

Three layers. Production Stability Index per feature on a rolling 30-day window — 0.2 we inspect, 0.5 pauses automated decisioning if the use case requires it. Calibration check on a production sample every week — Brier and expected calibration error against the band agreed at launch. Scheduled retraining cadence, usually quarterly, pulled to monthly when drift is consistent. The runbook names the trigger conditions in writing; alerts route to a named on-call rota; the retrain procedure is a script the client's team can run without us. Drift detection is engineered, not a mental checklist.

How is this different from your MLOps service and your LLM service?

Three siblings, three different jobs. Machine learning development services here is about *building the model* — framing, data audit, features, training, eval, calibration. [MLOps services](/services/mlops/) is the infrastructure around the model after ship — serving, monitoring, retraining, feature stores. [LLM development services](/services/llm-development/) is the equivalent practice for large language models — different family, different eval, different unit economics. An engagement can span more than one; we scope each shape separately so the deliverables are unambiguous.

016 / Related practices

## Adjacent services.

[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

GENERATIVE AI

Generative AI

GenAI products end-to-end — text, image, multimodal, OpenAI/Claude/Gemini.

](/services/generative-ai/)[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)

017 / Start a project

## Ship a *calibrated* model in 8 weeks.

Pilot in 3–5. Production Build in 8–14. Ranking / Recommendation in 6–10. ML Audit in 2–3.

[Talk to engineering](/contact/) [Architecture review](/contact/?topic=arch-review)


---

## SECTION: 4.11. Service: mlops

_Source: https://www.paiteq.com/services/mlops/_

# MLOps Consulting, MLOps Services — Paiteq

> MLOps consulting and MLOps services for ML + LLM teams. Feature stores, drift, CT pipelines, LLMOps. MLOps consulting services from audit to runbook.

**HTML version:** https://www.paiteq.com/services/mlops/

## Key facts

- Scope: feature stores, drift monitoring, CT pipelines, LLMOps.
- Engagement: audit → runbook → ongoing.

## Related pages

- [Machine Learning Development](https://www.paiteq.com/services/machine-learning-development/)
- [AI Consulting](https://www.paiteq.com/services/ai-consulting/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

P12 · Services

# *MLOps consulting* for teams shipping classical ML and LLMs to production.

An mlops consultant practice that builds feature stores, serving infrastructure, continuous training pipelines, drift detection, and LLMOps observability. MLOps services for the gap between a model that scores well in evaluation and one that holds up at month six.

[Talk to engineering](/contact/) [See engagement shapes](#engage)

Practice MLOps · LLMOps · Model ops

Stack Kubeflow · Vertex · Feast · Evidently

Eval PSI · KL · Wasserstein · LLM judge

Engage Audit · Build · Operate

001 / SCOPE

## What MLOps consulting actually covers — six service areas.

MLOps consulting is the ops practice around models the team has already built. We don't write the model — that's the [machine learning development](/services/machine-learning-development/) sibling. We make the model survive production: feature stores, serving infra, continuous training, drift detection, registry, and LLMOps when the workload is generative. The first call is usually about which two of the six gaps below hurt most.

[

01 / SERVE ↗

Model serving infrastructure

vLLM, Triton, BentoML on Kubernetes with KEDA autoscaling. p95 latency under a stated budget, not a vibes-check. The mlops consulting work here is picking the stack that matches the model family — vLLM for LLMs, Triton for mixed, FastAPI for boosters.

vLLMTritonKEDA

](#services)[

02 / FEATURES ↗

Feature store implementation

Feast for the 90% case, Tecton when freshness contracts get exotic, Hopsworks when point-in-time correctness has to be baked in. Online/offline alignment is where most feature store implementation engagements live — training and serving features have to come from the same source of truth or the model degrades in week three.

FeastTectonPIT

](#services)[

03 / REGISTRY ↗

Model registry + lineage

MLflow for the registry, DVC for lineage, signed model cards in Git. The rollback contract is one command, not a 40-minute incident postmortem. We've watched too many teams ship a model with no clean way back — the registry isn't optional, it's the seatbelt.

MLflowDVCModel cards

](#services)[

04 / CT ↗

Continuous training pipelines

Kubeflow Pipelines for K8s-native teams, Vertex AI Pipelines on GCP, SageMaker Pipelines on AWS. CI/CT/CD4ML — code, data, models all version-pinned. Triggered retraining on drift breach, not nightly cron jobs that retrain on noise.

KubeflowVertexCD4ML

](#patterns)[

05 / LLMOPS ↗

LLMOps & observability

Langfuse or Helicone for token-cost telemetry, prompt versioning, completion logging. Phoenix Arize or LLM-as-judge for hallucination scoring. Most teams don't realise their LLM bill is half avoidable cache misses until they instrument it — month one usually pays for the rest of the engagement.

LangfuseHeliconePhoenix

](#llmops)[

06 / DRIFT ↗

Drift detection + alerting

Evidently AI on input distributions, prediction distributions, and ground-truth concept drift when labels arrive. Alerts route to PagerDuty with a named owner — not a Slack channel that gets muted on day four. Drift detection is engineered in, not bolted on after a quarterly review notices the model is six points down.

EvidentlyPSIPagerDuty

](#monitoring)

Most mlops services engagements start with two of these six and grow from there. The audit phase in section seven picks the order.

002 / STACK

## MLOps services — what we build and operate, tool by tool.

Per-service tool picks with the trigger conditions written down. The mlops services we ship are picked from this matrix per engagement — never the whole list at once, and never "all OSS" or "all managed" as a blanket call. Each row carries when we pick it, when we don't, and the Paiteq default.

Model serving stack

Strengths

vLLM for any LLM above ~50 req/s — PagedAttention plus continuous batching halves GPU spend versus naive HuggingFace TGI. Triton when one cluster serves vision, tabular, and LLM together. BentoML wraps both for Python-first deploy.

When We Pick

Latency budget with a stated p95; mixed model families on shared GPUs; team can run a serving stack 24/7.

When We Don't

Single low-volume model where FastAPI behind a load balancer is the right shape and Triton's operational tax doesn't pay back.

Paiteq Pattern

vLLM on Kubernetes with KEDA scaling on queue depth. Most mlops services engagements start by replacing a fragile Flask wrapper with this stack.

vLLMTritonBentoMLKEDA

Feature store layer

Strengths

Feast handles the 90% case at 10% of Tecton's operational cost — Feast plus Redis online plus BigQuery or Snowflake offline is the most common shape we ship. Tecton is overkill for teams under 50 features in production. Hopsworks when point-in-time correctness has to be audited.

When We Pick

Two or more models share features; training-serving skew has bitten the team at least once; data scientists keep rebuilding the same features in Jupyter.

When We Don't

Single model with five features in one SQL view — the store adds operational surface that doesn't pay for itself yet.

Paiteq Pattern

Feast first, Tecton only when feature freshness goes below 60 seconds. Most feature store implementation engagements land at Feast plus Redis plus a thin schema-registry.

FeastTectonHopsworksRedis

Registry + lineage

Strengths

MLflow is the boring right answer — every cloud has a managed flavour, the OSS runs on a single VM, registry plus experiment tracker plus artifact store from one library. DVC adds data lineage: artefact knows its dataset version, dataset knows its raw extract.

When We Pick

More than two engineers on the team; any production model where rollback matters; any regulated workload where lineage has to be auditable.

When We Don't

Single-engineer shop where a Git tag plus a Postgres row is the registry — sometimes that's enough for the first six months.

Paiteq Pattern

MLflow plus DVC plus signed model cards in Git. Every artefact carries eval scores, training data hash, and the engineer who promoted it.

MLflowDVCModel cards

CT pipeline orchestrator

Strengths

Kubeflow Pipelines for K8s-native teams that want vendor portability. Vertex AI Pipelines on GCP and SageMaker Pipelines on AWS when skipping cluster ops dwarfs the lock-in. Airflow when the data team already runs it and ML-versus-data orchestration is blurry.

When We Pick

Kubeflow when ML platform is its own product surface; Vertex or SageMaker when ML is one workload among many and platform engineers are scarce.

When We Don't

Nightly batch retraining on a single tabular model — a cron job plus MLflow runs is honest and ships in a week.

Paiteq Pattern

Vertex AI or SageMaker Pipelines for the first ML platform; Kubeflow when the team grows past three engineers and the lock-in starts hurting.

KubeflowVertex AI PipelinesSageMaker

LLMOps observability

Strengths

Langfuse covers tracing, prompt versioning, evaluator runs, and dataset management in one OSS surface — the closest thing to a default. Helicone is cleaner when the team just wants a proxy in front of OpenAI or Anthropic with cost telemetry. Phoenix Arize for hallucination scoring and embedding drift.

When We Pick

Any LLM in production; any team with a per-token bill growing month-on-month; any product where prompt regression has caused an incident.

When We Don't

Single internal chat assistant under a hundred requests a day — instrumentation overhead outpaces the spend.

Paiteq Pattern

Langfuse self-hosted alongside the app, Helicone proxy in front of the provider for failover and per-tenant caps. The stack pays for itself before week four in most engagements we've audited.

LangfuseHeliconePhoenix Arize

Drift + monitoring

Strengths

Evidently AI is the OSS workhorse — PSI, KL divergence, Wasserstein on inputs, plus prediction drift and ground-truth concept drift when labels arrive. NannyML when the team needs estimated performance without labels. WhyLabs or Arize as managed alternatives.

When We Pick

Any model in production longer than 60 days; any model whose inputs shift faster than retraining cadence; any regulated workload where missed degradation is a compliance event.

When We Don't

Static-input batch model retrained nightly on a sliding window — the retrain pace effectively monitors itself.

Paiteq Pattern

Evidently AI wired to PagerDuty via a thin Python alert router. PSI 0.15 inspect, 0.25 retrain trigger — adjusted per-feature after the first month of baseline.

EvidentlyNannyMLPSI

003 / PATTERNS

## ML platform engineering — three continuous training patterns.

CD4ML — continuous delivery for machine learning — comes in three flavours in 2026: scheduled batch, drift-triggered, and streaming. Roughly half the ml platform engineering engagements we audit need to move from one tier to the next, not jump straight to the most expensive shape. The pattern carousel below names what each one wins and where it breaks.

  
01

### Scheduled batch CT

The simplest CT pattern. A Kubeflow or Vertex AI Pipeline fires on a schedule, pulls the last N days of labelled data, validates with Great Expectations, retrains, evaluates against a frozen held-out set, and promotes if eval passes. DVC pins the dataset version. MLflow logs lineage on every run. The right starting shape for any team without drift instrumentation in place — schedule first, drift-triggered later. Most ml platform engineering engagements start here in week three.

Pick when

-   Stable-input workload where drift creeps slowly
-   ground-truth labels arrive on a known cadence
-   team is new to MLOps and needs a working CT loop before they instrument drift
-   tabular boosters or recommendation models on a daily feedback loop

Skip when

-   Fast-shifting input distribution (fraud, ads bidding)
-   LLM workloads where prompt drift and eval drift outpace any retraining schedule
-   environments where retraining cost is a meaningful slice of the inference bill

Stack

Kubeflow PipelinesVertex AI PipelinesGreat ExpectationsMLflowDVC

02

### Drift-triggered CT

Evidently AI computes PSI on every input feature on a rolling window. When PSI crosses 0.25 on any feature or prediction drift crosses its threshold, the pipeline kicks off — pull fresh data, validate, train, shadow-evaluate, promote if gates pass. The cron job becomes a safety net, not the primary signal. Cheaper than nightly retraining for stable models, sharper than nightly for unstable ones. This is the shape the mlops consultant work usually moves teams to by month two.

Pick when

-   Input distribution shifts irregularly
-   retraining cost matters
-   team has the drift instrumentation in place to trust the trigger
-   model has a clean rollback path so an automated promotion can be reverted in under five minutes

Skip when

-   Brand-new model with no production baseline yet — drift threshold isn't calibrated, you'll fire false-positive retrains
-   environments without a fast rollback contract

Stack

Evidently AIKubeflow PipelinesMLflowPSIPagerDuty

03

### Streaming feature + near-real-time CT

Kafka or Pulsar streams events into Feast's online store via a stream ingestor. The serving layer reads features at request time with sub-100ms p99. The retraining pipeline polls every 15-30 minutes, retrains an incremental checkpoint, and shadow-evaluates against a sliding holdout. This is what fraud, ads bidding, and dynamic pricing models look like in 2026 — and it's the most expensive pattern to ship and operate. We don't recommend it until the team has lived through patterns 1 and 2 first.

Pick when

-   Hard freshness SLA under 15 minutes
-   revenue-sensitive workload where stale features cost real money inside the hour
-   team has SRE depth to run a streaming feature pipeline plus a continuous training loop without on-call burnout

Skip when

-   Anything that can tolerate hourly or daily features
-   teams under three ML infra engineers
-   cost-constrained environments where the streaming bill outpaces the model's economic lift

Stack

KafkaFeast (streaming)FlinkvLLMMLflow

004 / LLMOPS

## LLMOps — what changes when the model is a frontier LLM.

Classical MLOps was built around drift on tabular features and a once-a-week retraining cadence. LLMOps inverts every assumption. The drift signal is an evaluator score, not a distribution distance. The cost lever isn't retraining frequency, it's model routing and semantic cache. The promotion gate isn't a held-out metric, it's a judge-graded eval suite. We handle both in the same engagement — but the runbooks, the dashboards, and the failure modes are different practices.

Classical MLOps

LLMOps

Primary failure mode

Distribution drift on tabular features — silent precision/recall decay over weeks

Prompt-version regression, hallucination rate spikes, eval drift on judge-graded outputs — fast and noisy

Classical MLOps failures are slower-moving and more predictable — a drifting recommender degrades over days, giving the monitoring stack time to catch it. LLMOps failures can land in production within a single deployment; a bad prompt version ships bad outputs immediately with no statistical lag to hide behind.

Drift signal

PSI / KL divergence on input feature distributions; ground-truth concept drift when labels arrive

LLM-as-judge eval scores on sampled production outputs; guardrail hit rate; per-prompt regression deltas

LLM-as-judge eval is measurable same-day — no waiting on human-labelled ground truth. Classical PSI/KL signals are statistically rigorous but depend on label availability that can lag weeks; the signal is more trustworthy once it arrives, but slower to arrive.

Cost lever

Retraining frequency, GPU vs CPU serving, batch vs online inference

Model routing (Sonnet vs Opus vs Haiku), semantic cache hit rate, prompt-length budgeting, batch API for non-urgent

Observability stack

Prometheus + Grafana for serving metrics; Evidently AI for drift; MLflow for run history

Langfuse for traces and eval runs; Helicone for cost telemetry; Phoenix Arize for embedding drift

The classical stack (Prometheus + Grafana + MLflow) is battle-tested at scale with years of production hardening and broad community tooling. LLMOps tooling is maturing fast but the ecosystem is still fragmented — Langfuse, Helicone, and Phoenix serve different slices with no unified pane yet.

Retrain trigger

Drift threshold breach OR scheduled cadence; days-to-detect varies by label availability

Eval regression on the gold prompt set; same-day detection if the evaluator runs nightly

Promotion gate

Held-out eval beats champion by agreed margin; calibration check; slice-fairness check

LLM-as-judge eval suite passes; hallucination rate below threshold; guardrail hit rate stable

Classical held-out eval against a fixed test set is a more objective, reproducible gate — a numeric margin is deterministic. LLM-as-judge promotion gates introduce evaluator variance; the judge model itself can drift, so the gate requires periodic re-anchoring against human spot-checks.

Most teams in 2026 run both side by side — a recommender on classical MLOps, a chat assistant on LLMOps, both observed in one dashboard. The two stacks share registry and CI surface, diverge everywhere downstream of that.

LLMOps · field note

Most of the LLMOps work we ship in 2026 starts with a Helicone proxy in front of the upstream provider and Langfuse traces wired into the application. Inside the first month, two things usually surface: roughly a quarter to a third of the per-token bill is recoverable through semantic caching and prompt-length budgeting, and prompt-version regressions are reaching production silently because nobody runs an evaluator on the gold set on a schedule. The instrumentation pays for itself before the platform build closes.

— Paiteq engineering

The LLMOps section is the differentiation gap most mlops consulting providers leave open — classical MLOps content is everywhere, LLMOps-specific runbooks are not. If your stack is LLM-heavy, this is the conversation worth starting with.

005 / CD4ML

## Continuous training pipelines — four eval-gated phases.

CD4ML is a named practice, not a vibe. Every retrain that reaches production passes four gates in order — data validation, retrain trigger logged with cause, shadow eval against the champion, automated promotion with a blue/green rollback contract. Skip a gate and you're back to manual retraining with a postmortem at the end. We won't ship a pipeline that doesn't carry all four.

GATE 01

### Data validation

Every batch entering the training pipeline runs through Great Expectations or Soda Core checks — schema, null fractions, range, distribution sanity. Validation failures halt the pipeline before a single GPU minute burns. Dataset version pinned via DVC; the model artefact will know exactly which slice trained it.

GATE 02

### Retraining trigger

Either the drift detector (Evidently AI PSI breach over a per-feature threshold) or the schedule (whichever fires first) kicks the Kubeflow or Vertex AI Pipeline. Trigger condition logged with the run; you can read why any retrain happened six months later in MLflow.

GATE 03

### Shadow evaluation

The candidate model runs alongside the champion against the frozen held-out set and a sampled production stream. Eval gates: held-out metric beats champion by the agreed margin, calibration error stable, slice-fairness across the protected dimensions doesn't regress. No gate, no promotion.

GATE 04

### Automated promotion

Blue/green traffic split — 10%, 50%, 100% — with a rollback gate at each step keyed to production metric guardrails. If the live precision-recall slips during the 10% phase, traffic reverts to the champion in under five minutes. The runbook names who gets paged and what they do.

Gate four is the one teams under-invest in. The promotion contract has to roll back inside the SLA — we test that monthly on a calendar invite, not as a tabletop exercise.

006 / MONITOR

## ML model monitoring — four drift signals on every production model.

ML model monitoring isn't a single metric; it's four signals layered on the same model, each with its own detection lag and its own decisiveness. Data drift moves first, prediction drift moves next, concept drift confirms the call, and for LLM workloads the evaluator score moves fastest of all. We instrument all four because no single one is sufficient — and the threshold table below is the default calibration before the first month of baseline data adjusts it.

1.  01 Data drift
    
    PSI < 0.15
    
    Evidently AI computes Population Stability Index per input feature on a 7-day rolling window vs training distribution; KL divergence as a cross-check on continuous features.
    
    PSI > 0.25 fires retrain trigger; > 0.4 pauses automated decisioning if the use case requires it.
    
2.  02 Prediction drift
    
    Distribution stable
    
    Wasserstein distance on the model's output distribution on a 24-hour rolling window. Catches degradation before ground-truth precision/recall arrives — usually weeks before the labels confirm it.
    
    Distribution shift > agreed band routes to on-call and flags the champion for manual review.
    
3.  03 Concept drift
    
    Held-out metric stable
    
    Ground-truth labels collected on a sampled production stream; retrospective eval against the original held-out metric on a 30-day window. The slowest signal but the most decisive one — concept drift means the world changed, not just the inputs.
    
    Held-out metric below threshold triggers an audit memo and a retrain decision in writing.
    
4.  04 LLM eval drift
    
    Judge score within band
    
    Langfuse evaluator runs every night on a sampled batch of production completions, graded by an LLM-as-judge against the gold prompt set. Hallucination scoring, instruction-following, guardrail hit rate all tracked per prompt version.
    
    Eval score below the band rolls back to the previous prompt or model version; same-day detection cadence.
    

### Drift threshold reference — defaults we ship with.

Per-feature thresholds calibrate after the first month on real history. The defaults below are the starting points for a model with a clean baseline and no exotic seasonality. Where they break, the audit memo names the per-feature adjustment in writing.

Signal

Tool

Inspect at

Retrain at

Pause at

Categorical feature PSI

Evidently AI

0.10

0.25

0.40

Continuous feature KL

Evidently AI

0.05

0.15

0.30

Prediction distribution Wasserstein

Evidently AI

Per-model band

1.5× band

3× band

LLM judge score (1-5)

Langfuse

\-0.2 vs gold

\-0.4 vs gold

\-0.7 vs gold

Guardrail hit rate

Helicone

+30% week-on-week

+60%

+100%

Hallucination rate

Phoenix Arize

+1pp vs baseline

+3pp

+5pp

Defaults · adjust per-feature after one month of baseline data

007 / ENGAGEMENT

## How an mlops consultant engagement actually runs.

Four phases, twelve weeks for a typical end-to-end engagement, fixed-scope per phase. The audit ships an opinionated memo before the build starts; the build phase ships a working CT pipeline; the monitoring integration overlaps the back half of the build so the dashboards are live before the team gets handed the on-call rota. We don't sell open-ended retainers in the build phase — operate-the-platform contracts come later if the client wants them.

MLOps engagement · 12 weeks 4 phases

WEEK 1-2 MLOps audit

Failure-mode catalogue, current-stack read, prioritised gap list with cost estimates

Audit memo signed; the three highest-leverage gaps named in writing.

WEEK 2-8 Platform build

Feature store, serving infra, model registry, CI/CT pipeline scaffolding in your repo

First end-to-end pipeline run lands a model in registry behind a feature flag.

WEEK 6-10 Monitoring integration

Evidently AI plus Langfuse plus Helicone wired in; alert thresholds calibrated on real history

First drift breach detected in a calibrated false-positive band on production traffic.

WEEK 10-12 Handoff + runbooks

On-call runbooks, retraining playbooks, dashboard tour, named-owner rota

Client team runs an end-to-end retrain unsupervised; we step off the on-call rota.

An [AI readiness and infrastructure audit](/services/ai-consulting/) is often the right starting shape if the team isn't sure yet whether MLOps is the gap — we'll route there if the audit memo says so.

008 / SHAPES

## Typical engagement shapes — three patterns we see most.

Three engagement archetypes by deliverable and segment. Outcome framing is qualitative — we don't carry borrowed metrics from other practices, and the mlops consultant work in this practice ships fresh per engagement.

CT PIPELINE

ML platform team · catalog ranking or recommender

### Drift-triggered CT pipeline build-out

A Kubeflow or Vertex AI Pipelines flow that retrains on PSI breach, validates with Great Expectations, runs shadow eval against the champion, and promotes through a blue/green gate. Feast online + offline store ships alongside. Typical shape: the team moves from nightly batch retrain to drift-triggered inside the build window, and the rollback contract becomes one command instead of a 30-minute incident.

CT pipeline live

DRIFT WATCH

Fintech or risk modelling team

### Drift detection rollout across a model portfolio

Evidently AI monitors per model — data drift, prediction drift, concept drift on lag-arriving labels. PagerDuty routing with a named owner per model. Retrospective ground-truth eval cadence calibrated to label arrival. Typical shape: the silent-degradation gap that used to surface at quarterly review closes inside SLA, and the regulator's audit memo stops being a manual exercise.

Monitors live + paged

LLMOPS

SaaS product · LLM features in production

### LLMOps stand-up for an existing GenAI feature

Langfuse self-hosted for tracing, prompt versioning, evaluator runs. Helicone proxy in front of the provider for cost telemetry, per-tenant caps, and failover. LLM-as-judge nightly eval against a gold prompt set. Typical shape: prompt-version regressions stop reaching production silently, and per-tenant cost ceilings live as code rather than a quarterly Slack scramble.

LLMOps live + cost-capped

Outcomes are framed as deliverable and shape because Paiteq's MLOps practice ships per engagement, not against a borrowed-stat library. The audit phase is where the specific success criteria get named in writing.

009 / DECIDE

## MLOps versus managed ML platforms — when to build, when to use Vertex AI.

The single most expensive misframing in this category is teams building a Kubeflow platform when Vertex AI Pipelines plus a Paiteq advisory retainer would have shipped in a quarter of the time. The inverse exists too — teams stuck on a managed platform when feature freshness contracts have outgrown what the managed service can deliver. The decision tree below is the screen we run on every inbound. Cross-link: [our model development and training practice](/services/machine-learning-development/) covers the model build itself.

Three questions. Three to four terminal recommendations.

Path

Question

Pick one

Result

010 / WHY PAITEQ

## Why teams pick our mlops consulting services — three honest reasons.

-   01
    
    ### Named tools, not "best-of-breed"
    
    vLLM, Feast, MLflow, Evidently AI, Langfuse, Helicone, Kubeflow, Vertex AI Pipelines — named in writing in the audit memo with trigger conditions for when we pick each. We won't sell you "a leading feature store" or "a state-of-the-art observability platform" — every tool comes with a when-we-pick and when-we-don't.
    
-   02
    
    ### LLMOps in the same engagement
    
    Most mlops consulting providers stop at classical MLOps and hand the LLMOps work to a separate vendor. We don't. Langfuse, Helicone, Phoenix Arize, prompt-version regression detection, LLM-as-judge eval cadence — all in the same engagement, instrumented against the same dashboard surface.
    
-   03
    
    ### Audit memo before the build
    
    The audit phase is two weeks fixed-scope and ships an opinionated written memo — the three highest-leverage gaps, the costed roadmap, the recommendation on build-vs-managed. If the memo says you don't need the build yet, we'll say so. About one engagement in seven ends at the memo, and that's the right outcome.
    

011 / FAQ

## What buyers ask before signing an MLOps engagement.

What's the difference between MLOps and LLMOps, and do you handle both?

Same operational job, different failure modes. Classical MLOps consulting work is mostly about feature freshness, training-serving skew, and distribution drift on tabular features — the model degrades slowly and the drift signal is a PSI or KL divergence. LLMOps is about prompt-version regression, hallucination rate, guardrail hit rate, and eval drift on judge-graded outputs — the model degrades fast and the signal is an evaluator score, not a distribution distance. We handle both in the same engagement when the team's running a hybrid stack, which is most of them in 2026. Cross-link: [LLM application development](/services/llm-development/) covers the build side; this practice covers the ops side after the build ships.

How long does it take to set up a CI/CT pipeline for an existing ML model?

For a model with a clean training script and a held-out eval set, the first end-to-end CT pipeline lands inside the platform-build window — usually weeks two through eight. The audit phase comes first to read the existing stack, name the highest-leverage gaps, and pick the orchestrator (Kubeflow, Vertex AI Pipelines, or SageMaker Pipelines). Drift instrumentation lands in weeks six through ten so the trigger condition is calibrated on real history, not a guessed threshold. We don't ship a pipeline that retrains on noise — the false-positive rate gets calibrated before the trigger goes live.

Can you work with our existing cloud — AWS SageMaker, GCP Vertex AI, or Azure ML?

Yes. We start by reading what's there, not by replacing it. Vertex AI Pipelines plus Feast plus Evidently AI on GCP; SageMaker Pipelines plus Feast plus Evidently AI on AWS; Azure ML plus MLflow plus Evidently AI on Azure. Kubeflow on top of EKS, GKE, or AKS when the team wants vendor portability and has the platform engineers to run it. We've seen too many engagements derailed by a premature lift-and-shift — fix the gaps in the current stack first, then have the portability conversation in year two with real production data behind it.

How do you detect model drift before it impacts production metrics?

Three layers. Input drift via Evidently AI computing PSI and KL divergence per feature on a rolling window — usually catches a shift two to three weeks before precision and recall move. Prediction drift via Wasserstein distance on the model's output distribution — catches degradation before ground-truth labels arrive. Concept drift via retrospective eval on lag-arriving labels — the slowest signal but the most decisive. For LLMs we add a fourth: Langfuse evaluator runs nightly against a gold prompt set, with hallucination rate and guardrail hit rate tracked per prompt version. The thresholds in section six are the defaults we calibrate after the first month.

What does an MLOps engagement cost, and how is it scoped?

Scoped per-phase, not per-month-retainer. The audit is two weeks fixed scope; the platform build is six to ten weeks depending on the orchestrator and feature store choice; monitoring integration overlaps the back half of the build; handoff is two weeks of runbook work and on-call shadowing. Pricing is fixed-scope per phase with a quoted total at audit signoff — no open-ended retainer baked in. An operate-the-platform retainer after handoff is a separate contract so the build-phase deliverables stay unambiguous.

012 / Related practices

## Adjacent services.

[

MACHINE LEARNING

Machine Learning

Custom ML — training, serving, MLOps.

](/services/machine-learning-development/)[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)[

AI CONSULTING

AI Consulting

AI strategy, audits, roadmap.

](/services/ai-consulting/)

013 / Start a project

## Ship an *MLOps platform* in twelve weeks.

Audit in 2 weeks. Platform build in 6-10. Monitoring integration overlaps. Handoff with runbooks.

[Talk to engineering](/contact/) [MLOps audit memo](/contact/?topic=mlops-audit)


---

## SECTION: 4.12. Service: rpa-development

_Source: https://www.paiteq.com/services/rpa-development/_

# RPA Development Services — Paiteq

> RPA development services and an RPA development company shipping intelligent RPA on UiPath, Automation Anywhere, Blue Prism — with LLM augmentation.

**HTML version:** https://www.paiteq.com/services/rpa-development/

## Key facts

- Platforms: UiPath, Automation Anywhere, Blue Prism, Power Automate.
- Posture: intelligent RPA — bots augmented with LLM judgment at decision points.

## Related pages

- [AI Workflow Automation](https://www.paiteq.com/services/ai-workflow-automation/)
- [AI Migration](https://www.paiteq.com/services/ai-migration/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

RPA DEVELOPMENT

# An *rpa development company* shipping modern intelligent rpa — UiPath, Automation Anywhere, Blue Prism, and Power Automate with LLM augmentation.

Most rpa development services pages on the open web sell rule-based bots that were the right answer in 2018. We ship the 2026 shape — deterministic bots where they still win, LLM-augmented bots where judgment density beats the rule table, and a modernization sequence for the estate that's already in production. Every build ships with a domain-graded eval rubric and a Langfuse trace store on the augmented steps.

[Talk to an RPA lead](/contact/) [See engagement shapes](#engage)

Stack UiPath · AA · Blue Prism · Power Automate

Augmentation Document Understanding + Claude / GPT-5

Default Augmented bot + eval rubric + modernization

Engagements Audit · Build · Modernize · Operate

001 / PRINCIPLES

## Three principles a serious rpa development services engagement should hold itself to.

Most rpa development services pages skip the bit where the honest engagement gates live. Below are the three we hold every rpa development company engagement to before quoting. If a vendor can't answer these inside a kickoff call, that's the signal — not the slide deck.

-   01
    
    ### The bot ships with an eval rubric.
    
    Every augmented bot — UiPath, AA, Blue Prism, Power Automate — goes live with a domain-graded eval set behind it. Hand-labelled holdout of 200–500 examples per template family, scored on field-level F1 against the domain expert, not the vendor's confidence number. If we can't grade the bot honestly, we won't ship it. The eval rubric is the artefact that survives the engagement; the bot is just what runs on top of it.
    
-   02
    
    ### Modernization beats greenfield in 2026.
    
    Roughly two-thirds of the intelligent rpa work we ship is modernization of an existing UiPath / AA / Blue Prism / Power Automate estate, not a brand-new build. The unit economics flip — augmenting a deterministic bot with an LLM judgment step usually pays back inside the next licence renewal; replacing it with a greenfield workflow takes 12–18 months. We'll route you to [AI workflow automation](/services/ai-workflow-automation/) if the right answer is event-driven, but most existing estates earn more from augmentation than replacement.
    
-   03
    
    ### Observability is a day-one cost line.
    
    Every augmented step ships with Langfuse tracing the LLM call and the UiPath / AA / Blue Prism queue logs cross-referenced into the same trace ID. A bot you can't replay isn't a bot — it's a hope. Most pilot stalls we see in the field stall at month nine because nobody knew which prompts were drifting or which selectors were silently degrading. Tracing is the cheapest insurance in the stack.
    

Six-out-of-six clean across the gates (§009 below) is the bar for the augmented work we ship. Three or fewer clean is the trigger for a remediation engagement before any new build lands.

002 / SCOPE

## Classical RPA versus AI workflow automation — what this practice actually owns.

P9 — this page — owns rule-based rpa development services, the modernization of existing UiPath, Automation Anywhere, Blue Prism, and Power Automate estates, and the rpa with ai augmentation shape that's pulled most of our 2026 work. [P4 — our AI workflow automation practice](/services/ai-workflow-automation/) owns the event-driven, LLM-in-the-loop alternative on n8n, Temporal, and Inngest. The disambiguation matters because the wrong shape costs roughly a year and a six-figure renewal cycle. The grid below is the frame we use at the kickoff call.

Classical / intelligent RPA (this practice)

AI workflow automation (sibling — P4)

Where the judgment lives

In rules — selectors, decision tables, exception queues. Deterministic by design.

In an LLM call — Claude Sonnet 4.6 or GPT-5 reads the document and decides.

Deterministic execution is an audit requirement in regulated estates — finance, healthcare, and insurance auditors need a reproducible trace, not a probability distribution. RPA's rules-first model satisfies that requirement out of the box; LLM-in-the-loop requires extra eval scaffolding to reach the same audit confidence.

How a process gets shipped

Record-and-replay or low-code studio (UiPath Studio, Blue Prism Designer). Selectors bind to UI elements.

Event triggers in n8n / Temporal / Inngest hand the payload to an LLM node with a typed schema.

Event-driven workflow tooling gives engineering teams a version-controlled, CI/CD-compatible artefact they can own in their own repo. RPA studio files live in a vendor-controlled orchestrator and require the same tooling to inspect or modify — a meaningful long-term dependency cost.

What breaks it

UI changes break selectors. A renamed button or new modal stalls the bot until someone re-records.

Prompt drift, schema-validation failures, vendor model deprecation. Different failure surface.

Where it still wins

High-volume, stable-UI, regulated workflows where 99.9% determinism is the requirement and judgment is rare.

Document-heavy or judgment-heavy workflows where a rule table can't capture the variance.

When a process runs 50,000 transactions a month against a stable ERP screen with a fixed schema, there is no economic case for adding LLM inference cost and latency to the path. Deterministic RPA is cheaper to run, cheaper to audit, and faster per transaction at that volume.

Typical estate size we engage

20–200 bots already running on UiPath, Automation Anywhere, Blue Prism, or Power Automate.

Greenfield or a handful of workflows; not a fragile-bot estate.

A 200-bot UiPath estate carries six-figure annual licence costs, a dedicated CoE, and a backlog of brittle-selector tickets. Workflow automation sidesteps that overhead on greenfield — there is no estate to maintain, no Orchestrator to operate, and no renewal cycle to defend to the board.

Best modernization path

Hybrid — keep deterministic RPA where it wins; route judgment-density processes through an LLM step.

Already the modernized shape. RPA pages here as a thing being replaced, not as a thing being built fresh.

Who buys it

Ops leaders renewing a UiPath / AA licence with a brittleness complaint from the bot owners.

Engineering or product leaders shipping AI features into the ops surface.

Where we recommend RPA: existing estates above 20 bots; UI-mimic workloads against legacy systems with no API; regulated estates where deterministic execution is the audit requirement. Where we recommend workflow automation instead: greenfield event-driven workloads; judgment-density processes where the rule table is already exhausted; engineering-led teams who'd rather own the workflow in their repo than in a vendor studio.

The honest answer is usually *both*. Most modernization engagements end up routing the deterministic 60–80% of the estate through the existing RPA platform and the judgment-heavy 20–40% through an LLM-augmented step or a sibling [workflow](/services/ai-workflow-automation/) build.

003 / STACK

## The RPA + AI stack we ship on.

Four headline platforms, two specialist surfaces, a document-understanding layer, and the LLM augmentation tier. The logos below cover roughly 95% of the work we've shipped in the last 18 months. We don't sell platform agnosticism as a posture — we sell the platform that pays back inside the renewal window, named explicitly in the audit memo before any build starts.

-   UiPath
-   Automation Anywhere
-   Blue Prism
-   Power Automate
-   Robocorp
-   Pega
-   Kofax
-   WorkFusion
-   ABBYY Vantage
-   Tesseract
-   Claude Sonnet 4.6
-   GPT-5
-   Langfuse
-   n8n
-   Temporal
-   Inspect AI
-   UiPath
-   Automation Anywhere
-   Blue Prism
-   Power Automate
-   Robocorp
-   Pega
-   Kofax
-   WorkFusion
-   ABBYY Vantage
-   Tesseract
-   Claude Sonnet 4.6
-   GPT-5
-   Langfuse
-   n8n
-   Temporal
-   Inspect AI

004 / PLATFORMS

## UiPath, Automation Anywhere, Blue Prism, Power Automate — where each one wins.

A serious rpa development company has a default per workload shape — not a single platform it markets and a single platform it actually ships on. The grid below is the call we make at audit-time per platform. We've shipped uipath development services and automation anywhere development across financial services, healthcare, and insurance; blue prism implementation work in regulated public-sector and central-banking estates; Power Automate inside M365-native shops. Each row names where we lead with it and where we route the work elsewhere.

UiPath

Strengths

The deepest connector library in the category — SAP, Oracle, Citrix, mainframe terminals, the whole long tail of enterprise UI surfaces. AI Center, Document Understanding, and the recent agent-builder push give a native path to LLM augmentation without a side-car deployment. Apps and StudioX cover the citizen-developer surface; Studio Pro covers the engineering-led builds we typically lead.

When We Pick

Existing UiPath estate already in production — that's where roughly two-thirds of our uipath development services revenue comes from. New builds where Document Understanding is the first-class need (claims intake, invoice extraction, KYC packets). Estates over 50 bots where Orchestrator's queue model is doing real work.

When We Don't

Estates under 20 bots where the licence math beats anyone — Power Automate is usually the cheaper home for that scale. Pure UI-stable workflows with no document layer — Robocorp can ship the same shape with a fraction of the licence footprint.

Paiteq Pattern

We default to a Studio Pro + Orchestrator + Document Understanding stack with Langfuse instrumenting the LLM-augmented steps. About a third of our 2026 uipath development services engagements end with a recommendation to route the judgment-heavy slice through <a href="/services/ai-workflow-automation/">an event-driven workflow</a> while the deterministic slice stays on UiPath.

ConnectorsDoc UnderstandingEnterprise

Automation Anywhere

Strengths

Strongest in regulated-finance estates — banks, insurers, and BPO operations have heavy AA installs that go back nearly a decade. Co-Pilot and IQ Bot push the AI-augmented surface; the recent Automator AI shipment closes the gap on document understanding. Cloud-native A360 deployment is easier to operate than the legacy v11 generation we still see in the field.

When We Pick

Existing AA estate — automation anywhere development carries the bulk of our financial-services modernization work. Workloads where the IQ Bot / Document Automation surface is already trained on the buyer's templates and migrating to UiPath would mean re-training. Regulated estates with a heavy Citrix or thick-client surface where AA's UI layer is the most reliable in the category.

When We Don't

Greenfield outside finance — UiPath's broader connector library and cleaner agent story usually win. Estates still on v11 with low licence renewal pressure — those are migration candidates, not build candidates.

Paiteq Pattern

automation anywhere development engagements typically land as Co-Pilot + IQ Bot + an external LLM step for the judgment layer where IQ Bot's accuracy ceiling caps the workload. Almost always paired with a process-mining pass first — we won't add bots to a process that should be re-cut.

FinanceCo-PilotDoc-IQ

Blue Prism / SS&C

Strengths

Code-first development model — Blue Prism processes are versioned, reviewable artefacts, not record-and-replay screencaps. That's the right substrate for regulated enterprises where audit and change-control are non-negotiable. SS&C's acquisition has stabilised the roadmap and the recent ARI agent surface gives a native LLM augmentation path. Strongest object-model in the category for engineering-led teams.

When We Pick

Highly regulated estates — public sector, healthcare payers, central banking, insurance — where audit trail and deterministic execution are the headline requirements, not the optional extras. Existing Blue Prism estates where the licence renewal is the trigger for a blue prism implementation review and an AI-augmentation pass.

When We Don't

Fast-moving product orgs where Blue Prism's release cycle and licence model feel heavy. Estates where the object library hasn't been kept in shape — a brittle Blue Prism estate is harder to modernize than a brittle UiPath estate because the legacy object model carries hidden coupling.

Paiteq Pattern

blue prism implementation work usually starts with an object-library audit before any new process ships — half the credibility gap on a tired estate is in the shared object layer, not the process layer. We pair Blue Prism with an external Claude or GPT-5 step for the document-judgment slice when ARI isn't the right fit for the eval budget.

Code-firstRegulatedAudit

Power Automate

Strengths

The default RPA surface for any organisation that's already paying for Microsoft 365 E5 or has standardised on Dynamics. AI Builder, Copilot Studio, and the desktop flow + cloud flow split cover both attended and unattended shapes without per-bot licensing surprises. The recent agent push inside Copilot Studio gives a native LLM-augmented path that doesn't require a separate orchestrator.

When We Pick

Estates under 50 bots inside an M365-native shop. Workloads where the data already lives in Dataverse, SharePoint, or Dynamics — the connector economics flip away from UiPath and AA. Citizen-developer programmes where the goal is to lift ops capacity without an engineering hire.

When We Don't

Heavy mainframe / Citrix / thick-client estates — Power Automate's desktop flows are reliable but UiPath and AA still lead on the difficult UI surfaces. Workloads where the per-flow execution caps bite the unit economics — we've seen estates outgrow the licence model in 18 months.

Paiteq Pattern

We pair Power Automate with Copilot Studio for the LLM augmentation layer in M365 estates. About a quarter of our intelligent rpa engagements land here, often as a parallel build alongside a Teams-native agent surface. Always price the licence ramp past month 12 — the unit economics aren't always intuitive at the renewal.

M365CopilotCitizen-dev

Robocorp

Strengths

Open-source RPA on Python. Code-first, Git-native, container-friendly, and licence-light. Strong fit for engineering-led teams who'd otherwise reach for a workflow engine but need genuine UI automation in the loop. Cloud orchestrator covers the unattended surface; the desktop story is leaner than UiPath but matches it on the workloads it covers.

When We Pick

Engineering-led teams who want their RPA in the same repo as the rest of the stack. Cost-floor workloads where per-bot licensing on UiPath / AA doesn't pencil. Builds where the operator wants the bot in Python alongside an LLM call rather than in a vendor studio.

When We Don't

Enterprise estates with audit and procurement processes built around a vendor — Robocorp's open-source story is sometimes a procurement friction, not an asset. Workloads where Document Understanding's training model is the headline feature; UiPath still leads on that axis.

Paiteq Pattern

We use Robocorp where the buyer is an engineering org that already runs Python and wants the bot under version control. Usually paired with Temporal or n8n for orchestration; the workflow engine handles judgment, Robocorp handles UI. About 1 in 7 intelligent automation development builds lands here.

OSSPythonCode-first

Specialist / verticalised

Strengths

Pega, Kofax, WorkFusion, ABBYY Vantage, and the long tail of process-mining + document-understanding vendors. Each owns a workload shape — Pega for case management with bots on the edge, Kofax / ABBYY for document capture, WorkFusion for KYC-heavy AML workflows. Worth naming because procurement teams often arrive with one of these already on the contract.

When We Pick

When the buyer already has the platform and the modernization shape is augmentation, not replacement. Pega case-management estates that need an LLM step inside an existing rule flow. ABBYY Vantage installations that want a Claude or GPT-5 second-read on edge-case extractions.

When We Don't

Greenfield builds — these platforms are heavier than UiPath / AA / Power Automate and only earn their licence when an existing workflow lives on them.

Paiteq Pattern

Usually shows up in modernization scoping calls as a side-stack to the headline platform. We've reviewed Pega, Kofax, and ABBYY installs across financial services and insurance; the modernization advice often routes the judgment layer through an external LLM step and leaves the platform doing what it was bought for.

VerticalCase-mgmtDoc-capture

005 / PATTERNS

## Four bot patterns we ship — attended, unattended, augmented, agentic.

Every workload maps to one of the four shapes below. The shape determines the platform, the orchestrator, the eval rubric, and the operational handover. Attended bots beside a contact-centre agent, unattended bots running a queue overnight, augmented bots calling an LLM step for the judgment slice, and the newer agentic shape where an agent layer orchestrates a fleet of bots. About 60% of the rpa with ai engagements we shipped this year landed on the augmented shape — the modernization sweet spot.

   
01

### ATTENDED

Attended automation runs on a human's desktop, triggered by a hotkey or a button in a sidebar. The operator stays in control — bot handles the repetitive sequence, human handles the call. Strongest fit in contact centres, claims-handler desks, KYC review queues. Roughly a third of our attended automation work lands inside contact-centre desktops where the goal is to shave 30–90 seconds per call without removing the agent from the conversation.

Pick when

-   Operator-in-the-loop workflows where the call or case is the unit
-   The human still owns the conversation and the judgment
-   Compliance requires a human signature on the action
-   Volume per operator is high but per-call branching is also high

Skip when

-   Volume per operator is low — attended bot setup overhead doesn't pay back
-   Pure document workflows with no human attached — unattended bots win
-   Workflows where the judgment density exceeds rules — route to an LLM-augmented surface

Stack

UiPath AssistantAA Co-PilotPower Automate DesktopRobocorp

02

### UNATTENDED

Unattended automation runs on a server, fed by an Orchestrator queue, scheduled or event-triggered, no human on the desktop. This is the bulk of the work — invoice processing, statement reconciliation, master-data sync, employee onboarding flows. UiPath Orchestrator and Automation Anywhere Control Room are the reference deployments; Blue Prism's runtime resource model is the regulated-enterprise equivalent. Most unattended automation engagements we ship include an exception queue routed to a human reviewer below a confidence threshold.

Pick when

-   High-volume queue-driven workflows
-   Stable inputs where the rules are well-understood
-   Off-hours batch processing windows
-   Regulated workloads where a human signature is at the exception, not the default

Skip when

-   Per-case judgment density is high — better to route through an LLM step
-   Inputs are unstructured documents with low template stability — pair with Document Understanding or an LLM extraction layer

Stack

UiPath OrchestratorAA Control RoomBlue PrismPower Automate Cloud

03

### AUGMENTED

Augmented RPA is the modernization shape that's pulled most of our recent work — a deterministic bot handles the UI and queue mechanics, an LLM step handles the judgment density that a rule table can't capture. UiPath's AI Center and Document Understanding cover the native path; Automation Anywhere's Automator AI and Blue Prism's ARI agent close the gap. Where the vendor surface isn't enough, we route to an external Claude Sonnet 4.6 or GPT-5 call with a typed schema and a confidence threshold. Most of our rpa with ai engagements ship this shape.

Pick when

-   Document-heavy workflows where extraction accuracy past 90% is the headline lift
-   Exception-handling slices where the rules have grown to dozens of branches
-   Reading + summarising upstream documents (claims notes, contracts, support tickets) inside a bot step
-   Per-case judgment that exceeds what a decision table can carry

Skip when

-   Workflows where the LLM step would replace, not augment, the bot — that's the AI workflow shape, route to <a href="/services/ai-workflow-automation/">our AI workflow automation practice</a>
-   Toy POCs where the engineering tax doesn't return — straight UiPath or Power Automate is faster

Stack

UiPath AI CenterAA Automator AIClaude Sonnet 4.6Langfuse

04

### AGENTIC

The newer shape, still earning its keep in production. An agent layer — LangGraph or Claude Computer Use — sits above the bot estate, picks which bot to invoke, handles cross-bot state, and absorbs the long-tail exceptions that fall through the rule branches. Closest analogue to the modernization path P11 will eventually serve. We ship this carefully — agentic orchestration above brittle bots tends to amplify selector failures, not absorb them. Worth doing when the estate is healthy and the cross-bot coordination is the real bottleneck.

Pick when

-   Mature estates (50+ bots) where cross-bot orchestration burns more analyst time than the bots themselves save
-   Exception handling that crosses multiple bots in sequence
-   Long-running case workflows (claims, fraud review) where the agent layer holds state across days

Skip when

-   Estates with brittle selectors — fix the bots before adding an agent on top
-   Pure single-bot workflows — orchestration overhead doesn't return
-   When the right answer is migration, not agentic wrapping

Stack

LangGraphClaude Sonnet 4.6UiPath Orchestrator APILangfuse

006 / MODERNIZATION

## What a four-phase rpa modernization engagement actually ships.

A modernization spec that lands in a renewal-cycle board pack isn't a slide deck — it's a sequence with named candidates, named gates, named platforms, and named LLMs. The four phases below are the standard shape; a complex multi-BU estate carries a discovery phase before the audit, and a single-process slice collapses phases 2 and 3 into one. About 60% of our intelligent rpa work in 2026 runs this exact four-phase shape.

1.  01
    
    ### Estate audit + brittleness scoring
    
    One-to-two-week pass across the existing estate. Bot inventory, selector-health audit, exception-rate baseline, licence-renewal calendar, ownership map. The 20% of bots that carry 80% of the support load named in writing — that's the modernization candidate list. Some estates end this phase with a recommendation to retire half the bots; the rest carry the modernization sequence. The memo signs off before any build starts.
    
2.  02
    
    ### Process scoring + augmentation spec
    
    Each modernization candidate scored on judgment density, document complexity, exception rate, and licence cost. Decision per process: keep deterministic (still wins), augment with an LLM step (judgment-density slice), replace with an event-driven workflow (route to [AI workflow automation](/services/ai-workflow-automation/)), or retire (the process shouldn't exist). The augmentation spec per process names the LLM (Claude Sonnet 4.6 / GPT-5), the document-understanding tier (UiPath DU / AA IQ Bot / ABBYY / Tesseract), the eval rubric, and the confidence threshold for the exception queue.
    
3.  03
    
    ### Augmentation build + eval gates
    
    Build runs three-to-six weeks per tranche. Augmented bot ships against the eval rubric; field-level F1 graded on the holdout; Langfuse instruments every LLM call. Shadow run alongside the existing bot for a fortnight scoring the delta. Cutover lands when the augmented run is within 1% of the human-graded baseline on the high-volume fields. The old bot stays in standby until parity is proven across a full operational cycle.
    
4.  04
    
    ### Operate + sequence the next tranche
    
    Operations runbook handed off in writing to the internal team. Langfuse trace dashboard handed off. Exception-queue ownership confirmed. Next-tranche scoping happens in parallel — usually the second tranche is the same template family as the first, which lifts at roughly half the velocity once the augmentation pattern is locked. Most modernization sequences run six-to-twelve months from audit through handover at a rate of one tranche every six-to-eight weeks.
    

Clean handoff is the default — we don't build a dependency the internal team can't exit. The augmentation pattern, the eval rubric, the Langfuse instrumentation, and the operations runbook are the survivable artefacts; the engagement is the cost of locking them in.

007 / COVERAGE

## Where intelligent rpa earns its keep — process × industry.

The grid below is the workload heat we see at scoping calls. Strongest fits hit a 3; soft fits hit a 2; thin fits hit a 1 (we'd usually route elsewhere). The patterns are stable across the last 18 months of inbound: finance-ops + insurance claims + healthcare prior-auth + KYC ops carry roughly two-thirds of the augmented-bot work; HR onboarding, customer-ops, and public-sector eligibility carry the rest. The matrix isn't a sales gradient — a 1 means we'd route the work to a sibling practice or a partner, not a thinner version of ourselves.

Process Industry

B2B SaaS

Fin-services

Insurance

Healthcare

Mfg

Retail / DTC

Logistics

Public sector

Finance / AP / AR

HR / Onboarding

Claims / Underwriting

Customer Ops / Support

Supply chain / Logistics

Compliance / KYC

Finance / AP / AR

B2B SaaSFin-servicesInsuranceHealthcareMfgRetail / DTCLogisticsPublic sector

HR / Onboarding

B2B SaaSFin-servicesInsuranceHealthcareMfgLogisticsPublic sector Retail / DTC

Claims / Underwriting

Fin-servicesInsuranceHealthcareRetail / DTCLogisticsPublic sector B2B SaaSMfg

Customer Ops / Support

B2B SaaSFin-servicesInsuranceHealthcareRetail / DTCLogisticsPublic sector Mfg

Supply chain / Logistics

MfgRetail / DTCLogisticsPublic sector B2B SaaSFin-servicesInsuranceHealthcare

Compliance / KYC

B2B SaaSFin-servicesInsuranceHealthcareLogisticsPublic sector MfgRetail / DTC

Possible fit Good fit Primary vertical

Cells marked 1 typically route to [AI workflow automation](/services/ai-workflow-automation/) (event-driven shape wins) or [agent development](/services/ai-agent-development/) (judgment-density beyond what a bot + LLM step can carry).

008 / PICKER

## Which platform — and which modernization shape — actually fits.

Three questions that pick the platform faster than a vendor demo cycle. We'll send the same path through the tree on a framing call, free, before any audit engagement starts.

Pick the path that fits the estate; the terminal recommends the platform + modernization shape we'd lead with.

Path

Question

Pick one

Result

009 / GATES

## Four gates every augmented bot clears before going live.

Selector resilience, extraction accuracy, exception-queue rate, and unit economics past month 12 — graded on the holdout, not the vendor's confidence number.

1.  01 Selector resilience
    
    ≥97% stable runs over a 2-week soak
    
    Bot replays on a daily-shuffled UI test environment; selectors that match by accessibility tree first, position last.
    
    Below 95%, we re-cut the selectors before going live. UI churn that breaks the bot weekly is a deployment that creates support load, not lifts it.
    
2.  02 Extraction accuracy (Document Understanding + LLM)
    
    ≥92% field-level on a domain-graded set
    
    Hand-labelled holdout of 200–500 documents per template family, scored on field-level F1. We grade against the domain expert, not against the vendor's confidence score.
    
    Below 88% on any high-volume field, the field gets routed to a human queue instead of auto-posted. We don't ship a number we can't defend in the eval rubric.
    
3.  03 Exception-queue rate
    
    ≤8% of cases for mature workflows
    
    Production trace via Langfuse + UiPath / AA queue logs. Exceptions are bucketed by root cause weekly for the first six weeks post-go-live.
    
    Above 15%, the rule table is wrong, not the bot. Re-cut the rules before adding a second pass. Common root cause: an upstream system changed its data shape without telling the bot owner.
    
4.  04 Cost per case past month 12
    
    Modelled in writing before go-live
    
    Per-case unit economics modelled across platform licence, LLM tokens, exception-handler time, and observability cost. We share the spreadsheet, not a summary.
    
    If the model says the bot doesn't pay back inside 18 months, the right answer is to not build it. Pilots that stall most often stall at the renewal where the licence math reveals itself.
    

010 / WORKLOAD

## Attended automation versus unattended automation — pick the workload first.

The attended-versus-unattended call is the second-most-asked question at scoping calls (the first is platform). Both shapes ship inside the same engagement; the workload picks the shape, not the buyer's preference. The split below is the frame we use.

-   01
    
    ### Attended automation — bot beside a human operator.
    
    **Where it wins:** contact centres, claims-handler desks, KYC review queues, service desks. The operator stays in control of the call or case; the bot shaves the repetitive sub-sequence. UiPath Assistant, AA Co-Pilot, Power Automate Desktop, and Robocorp all carry the attended surface; AA Co-Pilot is the polished default in financial-services contact-centre estates we see most often. **What it ships with:** a per-operator activation pattern (hotkey or sidebar button), a per-call eval rubric scoring call-time delta plus customer-satisfaction delta, and an ops handover for the supervisor team. **Where it loses:** low-per-operator volume where setup overhead doesn't pay back; pure document workflows with no human attached — unattended bots win those.
    
-   02
    
    ### Unattended automation — bot on a server, queue-driven.
    
    **Where it wins:** invoice processing, statement reconciliation, master-data sync, employee onboarding, claims FNOL intake, prior-auth packet triage. UiPath Orchestrator and AA Control Room are the reference deployments; Blue Prism's runtime-resource model is the regulated-enterprise equivalent; Power Automate Cloud handles the M365-native variant. **What it ships with:** an Orchestrator queue per process, an exception-queue routing rule below a confidence threshold, a per-tranche eval rubric, and Langfuse tracing on every LLM-augmented step. **Where it loses:** per-case judgment density that exceeds rule-table capacity — route through an augmented step or a sibling [workflow](/services/ai-workflow-automation/) build instead.
    

Roughly 65% of the unattended automation work we shipped in 2026 was augmented — an LLM step inside the bot for the judgment slice. Pure deterministic unattended bots still win on stable-template, high-volume document workflows where the rule table is genuinely complete.

011 / ENGAGE

## Four engagement shapes — every rpa development services scope maps to one.

Fixed scope, fixed fee, written deliverable. Audit, Build, Modernize, Operate — every inbound rpa development company brief lands on one of the four. Mixed engagements bill as two consecutive shapes, never as an open retainer. The shape is named at the framing call and the fee is fixed against the deliverable, not the hours.

[

01 / AUDIT ↗

RPA estate audit

Two-to-three weeks, fixed scope. Read of the existing estate — bot inventory, selector health, exception rates, licence model, modernization candidates. Written memo plus a recommended sequence. The default starting point for an rpa development services engagement on an existing estate.

2–3 wksFixed

](#engage)[

02 / BUILD ↗

New bot build

Six-to-twelve weeks. Three-to-five new bots on UiPath, AA, Blue Prism, or Power Automate with Document Understanding or an LLM augmentation step. Eval gates run before live; old manual process stays on until parity. The bulk of our intelligent rpa revenue.

6–12 wks

](#engage)[

03 / MODERNIZE ↗

RPA modernization

Eight-to-fourteen weeks. Pick the brittleness 20% of an existing estate and route it through an LLM augmentation layer or an event-driven workflow. Eval-validated cutover; old bots stay live until parity. The fastest-growing slice of our rpa development company work in 2026.

8–14 wks

](#engage)[

04 / OPERATE ↗

Estate operate + scale

Ongoing. Run an existing RPA estate as a managed practice — Orchestrator hygiene, queue tuning, monthly modernization sprints, on-call coverage for exception spikes. Suits estates above 30 bots whose internal team has the bots but not the bandwidth.

Monthly

](#engage)

012 / TIMELINE

## What an 8-week modernization slice actually looks like, week by week.

A first-tranche slice on a healthy estate runs eight weeks from kickoff to a cutover-validated augmented bot in production. The augmented variant adds two-to-three weeks for the eval rubric and the holdout grading; an attended-bot variant compresses to six weeks because the eval surface is per-call rather than per-document. The grid below is the reference week-by-week — we adapt the gates, not the cadence.

Modernize · 8 weeks 6 phases

WEEK 1 Estate read

Bot inventory, selector-health audit, exception-rate baseline. The 20% of bots that consume 80% of the support load named in writing.

Gate: candidate list signed off by ops owner before week 2.

WEEK 2 Process scoring

Each candidate scored on judgment density, exception rate, licence cost, and modernization fit. Hybrid / augment / replace decision per process.

Gate: top-three modernization candidates locked.

WEEK 3–4 Augmentation build

First candidate: deterministic bot kept, judgment step routed through Document Understanding + Claude Sonnet 4.6 second-read with a typed schema and confidence threshold.

Eval: extraction F1 ≥ 92% on a 200-document hand-labelled holdout before any live posting.

WEEK 5–6 Shadow run

Augmented bot runs alongside the original, scoring delta on extraction accuracy, exception-queue rate, and per-case time. Langfuse traces every LLM step.

Gate: 14-day shadow with the augmented run within 1% of the human-graded baseline on the high-volume fields.

WEEK 7 Cutover

Augmented bot live for the candidate process; old bot retired to standby; exception queue routed to the existing handler team. Operations runbook handed off in writing.

Gate: zero-incident first 72 hours; rollback path documented.

WEEK 8 Sequence next two

Candidates two and three scoped, eval set drafted, build estimate confirmed. The slice rolls into a longer Modernize engagement or hands back to the internal team.

Memo: lessons from candidate one written down before candidate two starts.

013 / TYPICAL SHAPES

## Where intelligent rpa lands — typical engagement shapes by industry.

Six typical-shape engagements that map to the four bot patterns and the four engagement shapes above. Each card names the platform, the augmentation step, and the deliverable — not invented client numbers. When the real anonymised engagements ship under the Paiteq brand, these cards swap to outcome-grade case studies; for now, the methodology is the credible artefact.

Insurance · Claims FNOL

Mid-market P&C insurer · legacy AA estate

### Augmented FNOL intake — AA + IQ Bot + LLM second-read.

Typical shape: existing AA bot handles the intake form and the policy lookup; an LLM second-read pass reads the loss-description free text and proposes a coverage classification with a confidence score. Below the threshold, the case routes to the senior adjuster queue with the LLM's reasoning attached. Deliverable: augmented bot in production, eval rubric handed off, modernization spec for the next two FNOL templates.

Deliverable: AA + LLM augmented bot + domain-graded eval rubric + modernization spec

Financial services · KYC

Regulated bank · Blue Prism estate

### KYC packet review — Blue Prism + ABBYY + Claude.

Typical shape: Blue Prism handles the system-of-record updates; ABBYY Vantage carries the document capture; a Claude Sonnet 4.6 step reads the unstructured supporting documents (utility bills, articles of incorporation) and proposes a verification verdict with a confidence trail. Below the bar, the case routes to a senior reviewer with the document references. Deliverable: augmented review pipeline + eval rubric + audit-trail spec compatible with the bank's existing controls.

Deliverable: augmented KYC pipeline + audit-trail spec

Healthcare payer · Prior auth

US payer · UiPath estate

### Prior-auth packet triage — UiPath + Document Understanding + GPT-5.

Typical shape: UiPath handles the queue mechanics and the EMR posting; Document Understanding handles the structured-form extraction; a GPT-5 step reads the supporting clinical documentation and proposes an evidence-of-medical-necessity classification with citations to the source paragraph. The clinical reviewer sees the citations alongside the LLM's verdict; the human stays the signer. Deliverable: augmented triage step + clinician-graded eval set + HIPAA-aligned trace storage spec.

Deliverable: augmented triage + clinician-graded eval + HIPAA-aligned trace spec

Manufacturing · AP / 3-way match

Discrete mfg · Power Automate estate

### Augmented 3-way match — Power Automate + AI Builder + GPT-5.

Typical shape: Power Automate handles the SAP posting and the queue mechanics; AI Builder carries the invoice OCR; a GPT-5 step reconciles the discrepancies between the PO, the goods-receipt, and the invoice on the cases where the rule table can't carry the variance (unit-of-measure differences, partial deliveries, currency rounding). Deliverable: augmented match flow + exception-routing playbook + handover to internal AP team.

Deliverable: augmented match flow + exception-routing playbook

Logistics · Customs / brokerage

Multi-modal freight · UiPath + custom

### Customs documentation triage — UiPath + LLM hybrid.

Typical shape: UiPath handles the customs platform UI and the broker-system data writes; a Claude Sonnet 4.6 step reads the bill of lading and the commercial invoice, reconciles the harmonized-code suggestions, and flags the cases where the broker's preferred code disagrees with the LLM's read. Below confidence, the case routes to a senior broker with both proposed codes side-by-side. Deliverable: augmented triage + broker-graded eval set + dual-platform observability across UiPath and Langfuse.

Deliverable: augmented triage + dual-platform observability spec

Public sector · Eligibility ops

State benefits agency · Blue Prism estate

### Eligibility-document triage — Blue Prism + LLM hybrid.

Typical shape: Blue Prism carries the case-management system writes and the audit trail; a Claude Sonnet 4.6 step reads the eligibility-supporting documents (pay stubs, lease agreements, school enrolment letters) and proposes a completeness verdict with citations. The caseworker sees the citations and remains the decision signer; the bot writes the system updates. Deliverable: augmented triage + caseworker-graded eval rubric + plain-language audit-trail spec for FOIA-style requests.

Deliverable: augmented triage + caseworker-graded eval rubric

Each typical-shape card maps to a live engagement type we've scoped or shipped under the team's prior work; the framing is generic to keep the cards honest until anonymised client artefacts are publishable under this brand.

014 / WHY

## Why teams pick this rpa development company.

Six signals teams look for at the framing call. Most arrive at this practice from a tired UiPath, AA, or Blue Prism estate and a renewal cycle that's forcing a decision. The signals below are what survive the procurement read.

-   01
    
    ### LLM augmentation from day one
    
    Every new build in 2026 ships with the augmentation step planned from kickoff. We don't ship deterministic-only bots — the brittleness budget lives in the judgment slice, and an LLM call inside a typed bot step is the smallest unit that absorbs it.
    
-   02
    
    ### Eval rubric is a first-class deliverable
    
    Domain-graded eval set on a leakage-free holdout. Field-level F1 against the domain expert, not the vendor's confidence number. The bot ships with the rubric attached; the rubric survives the engagement.
    
-   03
    
    ### Multi-platform, no vendor allegiance
    
    UiPath, Automation Anywhere, Blue Prism, Power Automate, Robocorp. We pick the platform that pays back inside the renewal window; we don't take vendor referral fees; we tell you when the licence math points to a different home.
    
-   04
    
    ### Honest modernization-versus-replacement framing
    
    Most estates earn more from augmentation than greenfield replacement. We'll route you to [AI workflow automation](/services/ai-workflow-automation/) or [consulting](/services/ai-consulting/) when those shapes fit better — and we'll say so before the audit fee is quoted.
    
-   05
    
    ### Observability priced day-one
    
    Langfuse traces every LLM step; UiPath Orchestrator or AA Control Room queue logs cross-referenced into the same trace ID. The bot you can replay is the bot you can defend at renewal.
    
-   06
    
    ### Clean handoff, no retainer drip
    
    Operations runbook, eval rubric, and Langfuse dashboard hand off to the internal team in writing. Operate-shape engagements have a written exit gate from kickoff. We don't build a dependency we can't exit cleanly.
    

The two engagement shapes most-requested by repeat clients in 2026: Modernize (augment an existing tranche) and Operate (run the estate as a managed practice for a defined window). Audit and Build are still the new-buyer defaults.

015 / FAQ

## The questions teams ask before signing an rpa development services engagement.

Eight questions that come up at the framing call. Where the answer is 'route you elsewhere', we say so before the audit fee is quoted — the framing call is free.

What does an rpa development services engagement actually deliver?

Two artefacts, every time. **One:** a working bot — or augmented bot — in production with a domain-graded eval rubric attached and an operations runbook handed off in writing. **Two:** a modernization spec for the next two-to-three candidates in the same process family. We don't ship a bot in isolation; we ship the bot plus the sequence that comes after it.

The shape varies by engagement. An Audit shape ships a written memo and a recommended sequence — no bots. A Build shape ships three-to-five new bots. A Modernize shape ships an augmentation layer over an existing estate. An Operate shape runs the estate as a managed practice. The deliverable is always written down before the engagement starts and the price is fixed against the deliverable.

How is your rpa development company different from a classical RPA shop?

Two ways. First, every new build in 2026 ships with an LLM augmentation step planned from day one. We don't ship deterministic-only bots anymore — the judgment-density slice of any real process is the slice that creates the brittleness, and a typed LLM call inside a typed bot step is the smallest unit that absorbs it. Second, we treat eval as a first-class deliverable. The bot doesn't go live without a domain-graded eval rubric and a Langfuse trace store on the augmented steps. Classical RPA shops ship the bot; we ship the bot plus the evidence that it works.

Where we won't take work: estates under 10 bots where the licence math doesn't justify a development engagement (route them to a partner); pure greenfield UI-mimic workloads that should be event-driven from day one (route them to [our AI workflow automation practice](/services/ai-workflow-automation/)); workloads where the right answer is "do nothing yet" — process-mining first, automation second.

When does intelligent rpa beat going straight to AI workflow automation?

Three conditions. **One:** the buyer already runs UiPath / AA / Blue Prism / Power Automate at material scale (20+ bots) — the modernization-not-replacement math wins inside the renewal window. **Two:** the workload has a heavy UI-mimic surface (legacy ERP, terminal emulators, thick clients) where the API path is missing — RPA is still the most reliable selector layer for those surfaces. **Three:** compliance posture requires deterministic execution with a human signature at the exception, not the default — regulated estates where the audit trail is the headline requirement.

If none of those apply, the right answer is usually to skip RPA and build the workflow event-driven. Our [AI workflow automation](/services/ai-workflow-automation/) practice carries that shape with n8n, Temporal, and an LLM-in-the-loop. The honest answer at the framing call is to route the buyer where the unit economics actually pay back, not where the vendor logos line up with the procurement default.

What's the realistic timeline from kickoff to a bot in production?

For a single new bot on a healthy estate — six-to-ten weeks. Discovery and process-scoring runs weeks one and two; the build runs weeks three to five; eval and shadow-run cover weeks six to seven; cutover lands in week eight. The augmented variant adds two-to-three weeks for the eval rubric and the holdout grading on the LLM step.

For a modernization slice — eight-to-fourteen weeks for the first tranche of three-to-five bots, then a sequence that lifts at roughly a bot every two weeks once the augmentation pattern is locked. We won't sell a "two-week pilot" — the eval rubric alone takes longer than that to grade honestly, and pilots without eval are theatre.

What does uipath development services look like compared to automation anywhere development?

The shape of the work is roughly identical. The differences live in three places. **Document Understanding** — UiPath leads on the trained-model surface and on the connector library for the long-tail templates. AA's Automator AI has closed the gap but UiPath's catalogue is broader. **Co-Pilot / Assistant** — AA's Co-Pilot is the strongest attended-bot surface for contact-centre desktops; UiPath Assistant matches it on most workloads but the AA surface is more polished for the financial-services contact-centre shape we see most often. **Orchestrator / Control Room** — UiPath Orchestrator's queue model is the reference; AA Control Room's permission model is the reference. Tie on the runtime, win-on-context on the surrounding tooling.

blue prism implementation work is the third axis — strongest in regulated public-sector and central-banking estates where the code-first development model is the audit requirement, not the optional extra.

Do you handle rpa modernization for estates we want to retire entirely?

Yes — that's a sequence, not a single engagement. Phase one: process-mining pass on the existing estate to surface the 20% of bots that carry 80% of the support load. Phase two: a scoring pass to split those bots into *keep-and-augment* (deterministic UI mimic still wins), *replace-with-workflow* (event-driven shape carries it better — handed to [AI workflow automation](/services/ai-workflow-automation/)), and *kill* (the process shouldn't exist).

Phase three: build the replacement workflows in parallel with the existing bots staying live. Phase four: eval-validated cutover one workflow at a time, with the old bot in standby until parity is proven. Most estate-retire engagements run six-to-twelve months from audit through handover; we won't quote less than that for any estate over 30 bots without a very specific scoping constraint.

How does pricing work for an intelligent automation development engagement?

Fixed scope, fixed fee, written deliverable. The shapes are named in the engagement grid above; the fees are quoted against the deliverable, not the hours. The range sits at the higher end of independent specialist work and the lower end of tier-one consultancies — roughly where the value lives. Platform licence costs (UiPath, AA, Blue Prism, Power Automate) are billed through the vendor, not through us; we don't take vendor referral fees and we'll tell you when the licence math should push you to a different platform.

LLM token costs are passed through at cost with the model and the volume named in writing. Most augmented bots ship at $0.005–0.05 per case in LLM token spend; the eval gate is where we'd refuse to ship if the unit economics don't carry past month 12.

Can you work with our existing internal RPA team, or do you replace them?

The default is pair-and-hand-off. Internal teams who've shipped a UiPath or AA estate already have the institutional knowledge — the connector quirks, the queue tuning history, the exception patterns specific to the business — that an external team takes months to absorb. Our engagements bring the augmentation pattern, the eval rubric, and the modernization sequence; the internal team brings the estate context and stays the long-term owner.

Where we land as the primary operator is in Operate shape — when the internal team has the bots but not the bandwidth, we run the estate as a managed practice for as long as it earns its keep, and we hand back when the internal capacity rebuilds. The transition is written down at engagement-start; we don't build a dependency we can't exit cleanly.

016 / Adjacent services

## Where intelligent rpa intersects.

[

AI WORKFLOW AUTOMATION

AI Workflow Automation

Intelligent workflows on n8n, Make, and custom agent orchestration.

](/services/ai-workflow-automation/)[

AI CONSULTING

AI Consulting

AI strategy, audits, roadmap.

](/services/ai-consulting/)[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)

START

## Bring us the estate or the renewal cycle.

Audit, build, modernize, or operate — we'll name the shape at the framing call and price the deliverable in writing. If the right answer is route-you-elsewhere, we'll say so.

[Talk to an RPA lead](/contact/) [See engagement shapes](#engage)


---

## SECTION: 5.1. Industry: ai-for-saas

_Source: https://www.paiteq.com/ai-for-saas/_

# AI for SaaS Companies — Paiteq

> Paiteq builds AI for SaaS companies: sales agents, RAG copilots, churn prediction, embedded AI search. AI SaaS development with SOC 2 + GDPR + EU AI Act.

**HTML version:** https://www.paiteq.com/ai-for-saas/

## Key facts

- Workflows: sales agents, RAG copilots, churn prediction, embedded AI search.
- Compliance: SOC 2, GDPR, EU AI Act in scope.

## Related pages

- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [AI Integration](https://www.paiteq.com/services/ai-integration/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI for SaaS Companies

# AI for SaaS Companies — *build the AI features your roadmap can't wait for.*

SaaS founders in 2026 sit in a three-way squeeze: AI-feature parity with funded competitors, support-cost compression as seat counts outrun headcount, and NRR plateaus that need expansion intelligence the data team can't deliver fast enough. Paiteq ships the AI features your roadmap can't wait for — sales agents, RAG copilots, churn prediction, expansion intelligence, and embedded product AI — with the eval framework and the SOC 2 + GDPR + EU AI Act posture sized before the first prompt. We stay through the first eval-drift cycle, not the deploy.

[Talk to engineering](/contact/) [See the 8 use cases](#use-cases)

Use cases 8 · sales · support · revenue · embedded AI

Engage MVP · Platform · Enterprise

Stack LangGraph · Claude · Pinecone · Snowflake

Compliance SOC 2 · GDPR Art 17/22 · EU AI Act

001 / WHY NOW

## Why SaaS companies are evaluating AI right now.

SaaS founders in 2026 sit inside a three-way squeeze: AI-feature parity with funded competitors, support-cost compression as seat counts scale past the headcount budget, and NRR plateaus that need expansion intelligence the data team can't deliver fast enough. Each pressure on its own would be manageable. Together, they're why AI for SaaS companies has become a board-level agenda item rather than an R&D experiment.

0 –8w

Competitor AI feature cadence

Funded SaaS shipping AI features in 4–8 weeks. Pre-AI roadmaps look slow next to that.

0 –50%

Expansion-ready accounts your AMs miss

Without usage-anomaly signal, account managers walk past most of the NRR they could earn.

$280

Per-seat support cost target

B2B SaaS support cost grows roughly linear with seats — deflection is no longer optional past $50M ARR.

PRESSURE 01

Cadence: 4–8 week competitor AI feature sprints

Series B and Series C competitors with $20M+ rounds are shipping AI features in 4–8 week sprints — embedded search, document summarisation, agent-driven onboarding — and the side-by-side demo on a sales call is brutal if your product still looks pre-AI. We've watched well-built CRMs lose enterprise deals over a single missing capability that took the winning vendor six weeks to ship on Claude Sonnet 4.6 plus Pinecone. The category-defining moment for AI in SaaS products happened somewhere between Apple's 2025 on-device push and the OpenAI structured-output release; trying to opt out of the category isn't a strategy for AI for SaaS companies. What surprises most teams is that the bottleneck isn't model capability — it's the eval framework, the retrieval corpus, and the integration surface against their existing stack. Those three take 3–5 weeks to get right regardless of which model you pick.

PRESSURE 02

Support economics: $280/seat target breaks past $50M ARR

Seat counts in B2B SaaS grow roughly linear with usage, but support headcount can't. Past about $50M ARR, the math on a $280-per-seat support-cost target stops working without deflection. Vertical SaaS in legal-ops, devtools, and HRtech are all hitting the same wall in 2026: T1 queues that grow 30–40% year-over-year while the support budget grew 8%. The generative AI for SaaS use case here — grounded deflection agents with Zendesk tool-calling, Stripe API access, and Llama Guard refusals on low-confidence turns — has the fastest payback window of any AI feature in the portfolio, typically inside 90 days. Most teams underestimate how much the CSAT story matters: a deflection agent that bluffs low-confidence answers destroys trust faster than a slow queue, so the refusal logic is as important as the retrieval quality.

PRESSURE 03

NRR: AMs miss 30–50% of expansion-ready accounts

Investors and boards want net revenue retention above 110%; account managers without expansion-readiness signal miss 30–50% of the accounts where the usage data already says the customer is ready. That's not an AM-performance problem — it's a tooling gap. Seat saturation, module adoption velocity, and feature-flag usage typically live in three different systems and nobody's correlating them weekly. The AI SaaS development fix is lightweight anomaly detection over your Snowflake or BigQuery warehouse, ranked into a one-paragraph AM brief that lands directly in Salesforce — not a new dashboard the AMs have to check. Every AI SaaS company we've shipped this for reports the same thing: the AMs use it because it doesn't ask them to change tools, just surfaces the signal inside the workflow they already live in.

The opinionated take

Most AI-feature roadmaps fail because the team treats AI as a feature, not a category. SaaS that wins in 2026 ships AI as the spine of the product — onboarding, search, support, retention, expansion all running on a shared eval harness, shared retrieval infra, shared model routing. SaaS that loses ships a chatbot, calls it done, and rebuilds twelve months later when the chatbot vendor's accuracy drifts and the eval framework was never theirs to begin with. The cost of starting wrong is roughly 2× the cost of starting right. We don't get that number from theory.

— Paiteq engineering

002 / USE CASES

## The 8 highest-ROI AI use cases in SaaS.

Below are the eight workflows we see SaaS teams build first. They share three traits: each has a clear single-buyer ROI number, each is deployable inside a 6–14 week window, and each compounds when you ship two or three together on shared infra rather than as standalone bets. The cards are dense on purpose — pain, with-AI workflow, named tools, and the ROI metric in the buyer's vocabulary. Skim them, then read the two or three that match where your roadmap actually sits today.

USE CASE 01

### AI sales agent and outbound SDR automation

The Pain

Outbound SDRs plateau at ~150 quality-scored conversations a month per rep; cost-per-opportunity hovers in the $400–800 band and the CFO has noticed.

With AI

An autonomous agent enriches each lead from Apollo, Clay, and public signals (LinkedIn, GitHub, recent funding), drafts a multi-touch sequence personalised from those signals, handles the first two reply turns, and only hands off to a human AE when an intent score crosses a tuned threshold. The reps stop writing the same intro email 80 times a week and start showing up to calls that already self-qualified.

2.4–3.8×

conversation throughput per SDR

Cost-per-meeting $480 → $140 typical at mid-funnel

Tools

LangGraphClaude Sonnet 4.6ApolloClaySmartleadHubSpotSalesforce

USE CASE 02

### Internal copilot over your company knowledge

The Pain

Product, engineering, and CS burn 4–7 hours a week digging through Notion, Slack, Linear, and Looker for context that already exists somewhere. The new hire ramp is brutal because nothing's findable.

With AI

We ship a RAG-grounded copilot over your knowledge graph with strict source citations and tool-calling into Linear, Notion, and Jira for live data. Cohere Rerank 3.5 keeps retrieval honest; the agent refuses out-of-corpus questions instead of guessing. The eval set covers your 30 most common ICP questions, graded by your team — not a vendor's generic benchmark.

28–42%

less time on ticket-context gathering

≈6.5 hours/week reclaimed per IC across product + CS

Tools

PineconeTurbopufferCohere Rerank 3.5Claude Sonnet 4.6Vercel AI SDKLangfuse

USE CASE 03

### Support ticket deflection and auto-resolution

The Pain

T1 support handles 60–75% of tickets that genuinely could be solved by self-service — if the docs were navigable and the billing API were reachable. They aren't, so the queue grows.

With AI

A grounded chatbot inside your product, plus an email auto-responder, retrieves from your help center, drafts the reply, calls Stripe and your session APIs for live account data, and escalates to a human the moment confidence drops. CSAT holds inside ±2 points because the agent refuses cleanly instead of bluffing. We wire it into Zendesk or Helpscout so the human gets full context, not a transcript dump.

31–48%

T1 ticket deflection rate

AHT compression 22–35% on tickets that DO escalate

Tools

Claude Sonnet 4.6ZendeskHelpscoutStripePineconeLlama Guard

USE CASE 04

### Churn prediction and retention agent

The Pain

Revenue ops sees churn signal roughly two weeks too late. The CSM intervention email goes out after the renewal date is already at risk, and the playbook for what to send is a Google Doc nobody opens.

With AI

We pair a classic ML model — gradient-boosted tree on usage features — with an LLM outreach agent that drafts the CSM's pre-approved intervention message. The model is XGBoost or LightGBM with MLflow versioning; the feature pipeline runs on dbt against Snowflake. The agent never sends without CSM approval. It just removes the blank-page problem.

12–18%

churn reduction on medium-risk segment

CSM coverage extends 3–4× without headcount add

Tools

XGBoostLightGBMMLflowdbtSnowflakeClaude Sonnet 4.6

USE CASE 05

### AI-native product analytics — natural-language BI

The Pain

PMs and execs have BI dashboards, but every ad-hoc question ("which features drive 30-day retention in the enterprise tier?") still queues behind the data team for 1–3 days. Analysts become the bottleneck on every roadmap decision.

With AI

A natural-language layer sits on top of a governed semantic layer — Cube or dbt Semantic Layer — so the agent translates English into validated SQL against curated tables, not raw schemas. It returns the number, the chart, and the SQL it ran so the data team can audit. Guardrails: column allow-lists, row-level security, and query-cost caps. We benchmark accuracy with the defog or Vanna eval suite before it ever ships to non-technical users.

1–3d → 2–8m

ad-hoc analytics turnaround

Data team's IC time on routine queries drops 35–50%

Tools

Claude Sonnet 4.6Cubedbt Semantic LayerSnowflakeBigQueryMetabase Prodefog eval

USE CASE 06

### Expansion revenue identification

The Pain

Account managers miss 30–50% of expansion-ready accounts because seat saturation, module adoption velocity, and feature-flag usage live in three different tools and nobody's correlating them weekly.

With AI

Lightweight anomaly detection runs over usage data to rank accounts by expansion-readiness; an LLM drafts the account brief for the AM — "here's why this account is ready, here are the two product surfaces they're saturated on, here's the recommended motion." The brief lands directly in Salesforce or HubSpot inside the AM's existing workflow. We've found AMs ignore anything that requires a second tab.

18–28%

NRR lift on targeted-account cohort vs control

AM time-per-expansion-touch compresses 40–55%

Tools

XGBoostClaude Sonnet 4.6SalesforceHubSpotSnowflakedbt

USE CASE 07

### Embedded product AI features — RAG search, summarization, extraction

The Pain

AI-feature parity is table stakes in 2026. Building those features in-house takes 4–6 months per surface and burns the roadmap your customers actually asked for.

With AI

We drop AI features straight into your product: RAG search over user content, document summarization, structured extraction from emails and PDFs and screenshots. The integration uses Vercel AI SDK or LangChain.js on the frontend, Claude Sonnet 4.6 or GPT-5 for inference, Pinecone for retrieval, and Anthropic's structured-output mode for the extraction calls. Eval gates fire before any feature touches production users.

4–8 weeks

shipping cadence per AI feature

Activation lift 1.6–2.4× on accounts using the feature

Tools

Vercel AI SDKLangChain.jsClaude Sonnet 4.6GPT-5PineconeAnthropic Structured Outputs

USE CASE 08

### Customer onboarding agent

The Pain

Complex B2B SaaS products lose 40–60% of trial signups before activation. Manual onboarding doesn't scale past Series A, and self-serve checklists don't adapt to the customer's actual job-to-be-done.

With AI

A conversational onboarding agent guides the user through setup, asks targeted scoping questions, configures the product against your APIs, and drafts the first templates or content so the user hits value on day one, not day fourteen. The agent runs on Claude Sonnet 4.6 with tool-calling against your product's internal APIs; the UI is a thin Vercel AI SDK layer inside your existing onboarding flow.

22–38%

activation rate lift on onboarded cohort

Time-to-first-value compresses days → hours

Tools

Claude Sonnet 4.6Vercel AI SDKLangGraphComposioLangfuse

A pattern worth flagging across all eight AI for SaaS companies cases: **the ROI numbers above are the median of what we and similarly-shaped agencies have shipped**, not the headline outlier. Don't pick a use case for its ceiling. Pick the two with the cleanest single-buyer ROI math for your stage — Series B with a support queue problem starts with UC-3 and UC-4; Series C with an embedded-product play starts with UC-7 and UC-8; revenue-ops-driven companies start with UC-4 and UC-6. The next section maps each pain to the Paiteq service that does the actual engineering — because picking the use case is a buyer decision, but picking the service shape is an engineering one.

003 / SERVICE MAPPING

## How Paiteq services map to SaaS needs.

Four common B2B SaaS pain shapes on the left, the six Paiteq service pillars on the right. Hover any pain row to highlight which services we'd engage; hover a service to reverse-highlight the pains it solves. The descriptive anchors (not the service primary keyword) are deliberate — what matters to you is the workflow, not the service title.

AI feature parity pressure

Funded competitors ship AI features in 4–8 weeks; pre-AI roadmaps look slow and buyers notice on the demo call.

Support cost compression as seats scale

Seat counts grow faster than support headcount; deflection is no longer optional past $50M ARR.

NRR plateaus and missed expansion

Investors want NRR at 110%+; without AI-driven expansion signal, AMs miss 30–50% of the accounts that were ready.

Compliance and governance for AI features

Enterprise deals over $100K ARR add a security pass; AI features reopen controls your standard SOC 2 doesn't pre-answer.

[

Service

AI Agent Development

designing autonomous agent systems

](/services/ai-agent-development/)[

Service

LLM Development

model selection, fine-tuning, and evaluation

](/services/llm-development/)[

Service

AI Workflow Automation

stitching long-running orchestrated processes

](/services/ai-workflow-automation/)[

Service

AI Consulting

AI strategy and roadmap advisory

](/services/ai-consulting/)[

Service

Chatbot Development

support-side conversational agents

](/services/chatbot-development/)[

Service

AI Integration

drop-in AI integration into existing products

](/services/ai-integration/)

Why the map looks like this

AI SaaS development is genuinely a multi-discipline engineering job in 2026. Feature-parity pressure routes to three services, not one — building an embedded RAG copilot inside a product is partly [designing autonomous agent systems](/services/ai-agent-development/), partly [model selection, fine-tuning, and evaluation](/services/llm-development/), and partly [drop-in AI integration into existing products](/services/ai-integration/). We've seen teams try to buy a single "AI feature" vendor and end up with three half-solutions because the work is genuinely three jobs.

Support-cost compression routes to [grounded conversational systems](/services/chatbot-development/) first because the deflection eval set is more mature than the agent eval set in most CS organisations. NRR plateaus route to agent work and [stitching long-running orchestrated processes](/services/ai-workflow-automation/) because expansion signal is multi-system by definition. The compliance pain routes to [AI strategy and roadmap advisory](/services/ai-consulting/) because SOC 2 + EU AI Act scoping belongs in architecture, not retrofitted at the security review. The discipline split isn't bureaucracy — it's how the engineering stays high-quality across a 16-week Platform build.

004 / COMPLIANCE

## Compliance, data residency, and risk posture for SaaS.

Three regulatory layers shape every AI SaaS engagement we run. SOC 2 Type II is table stakes; GDPR Articles 17 and 22 cover the actual ML-data-flow questions your buyers will ask; EU AI Act adds risk-tier classification that most SaaS features pass with transparency obligations rather than high-risk controls. We design within these layers in the architecture phase, not retrofit at security review.

Audited annually · Continuous monitoring

-   SOC 2 Type II
    
    DPA signed · control-framework aligned
    
    AUDITED · 2026
    
-   GDPR Art 17 + 22
    
    Erasure paths · automated-decision disclosure
    
    AUDITED · 2026
    
-   EU AI Act
    
    Risk-tier classification · transparency surface
    
    READY
    

The gate, not a footnote

Every enterprise B2B SaaS deal over $100K ARR runs a security and compliance pass. AI features reopen that pass — controls your existing SOC 2 attestation didn't pre-answer suddenly need answers. "Where does the training data live?" "Can we delete a customer's data from your vector store and your fine-tunes?" "Is this an automated decision under GDPR Article 22?" If your vendor improvises those answers at the security call, the deal slips a quarter. We've watched it happen. Twice.

SOC 2 TYPE II

SOC 2 Type II posture

We sign a DPA before kickoff. We design AI features to fit your control framework — audit logging at a 90-day default retention, RBAC against your existing IdP, secrets management via your secrets manager (not a vendor's vault), change-management hooks into your existing CI. We don't claim "SOC 2 compliant AI" because that's not a real thing. Your attestation is yours; we deliver code that lives inside it. Named control areas where AI features need extra scoping: secure SDLC (the prompts and the eval set are code; they need to live in your repo with the same review discipline), incident response (the runbook covers model regressions, not just downtime), and vendor management (every model provider is a sub-processor; we name them in the architecture doc).

GDPR ART 17 + 22

GDPR Articles 17 and 22 posture

Article 17 — right to erasure — is the article that most quietly breaks AI features. If a customer requests deletion, your engagement has to delete their data from the vector store, from any embeddings cache, and from the training corpus for any fine-tuned model. We design those deletion paths up front: vector stores tagged by tenant, embeddings keyed for revocation, training-set pruning built into the model-refresh pipeline. Article 22 — automated decision-making — covers trial scoring, churn prediction, expansion targeting, and any other use case where the agent's output materially affects the customer. We pair every automated decision with a human-review fallback and a transparency surface. In our experience, the transparency surface ships in week 4 and becomes a sales asset by week 12 — enterprise buyers ask for it specifically.

EU AI ACT

EU AI Act posture

Most SaaS AI features are not high-risk systems under the Act. They're transparency-obligation features — chatbots that need to disclose they're AI, embedded recommendations that need to flag automation, content-generation features that need provenance signalling. We classify your feature against the Act's risk tiers in the architecture phase, before the first prompt, and we ship logging, model cards, and user-facing disclosures sized to the obligation tier. The opinionated take: most "EU AI Act compliant" marketing in the AI SaaS space is fiction. The honest answer is that the Act is the customer's obligation, AI features change the data and decision flow inside it, and your vendor's job is to design within the obligation map — not invent a label that satisfies a procurement checkbox.

005 / ENGAGEMENT

## How a SaaS AI engagement runs at Paiteq.

Five phases. Every phase has an explicit deliverable, a named owner inside your team, and a gate criterion that has to pass before the next phase starts. The cadence is weekly: a Monday standup with your VP Engineering, CTO, the product PM, and (if AI touches data) your Data lead. Demo every Thursday. No status emails.

SaaS AI Engagement · 15 weeks (typical Platform tier) 5 phases

WEEK 1–2 Discovery

Use-case prioritisation, eval surface, stakeholder map (VP Eng + CTO + PM + Data lead)

Single-buyer ROI number scoped per use case

WEEK 3–4 Architecture + Eval

Stack lock, retrieval design, 30–50 graded eval examples

Eval set agreed by your domain expert

WEEK 5–10 MVP Build

Runnable agent against eval set + your real data, weekly demo

Baseline accuracy hit on eval set

WEEK 11–14 Production

Hardening, observability via Langfuse, auth, fallback policies, rollout

All four eval gates green before traffic

WEEK 15+ Optimise + Handoff

Cost engineering, prompt iteration, runbook in your repo, ownership transfer

Two cadence notes for SaaS specifically

The data lead matters more than you'd think. Half the AI use cases on this page — UC-4 churn, UC-5 analytics, UC-6 expansion — depend on your warehouse being clean enough to train against. We've found that the first week's biggest unblock is usually getting the Snowflake or BigQuery access provisioned, not picking a model. Delays there cascade into week 3's architecture lock. The CSM or VP CS shows up around week 8, when the deflection or onboarding agent hits internal eval; we don't want them in week 1 (too early to be useful) or week 14 (too late to flag a workflow concern). Bringing the CSM in at week 8 means they've seen the eval set before they see the live product, which prevents the 'this isn't how our customers talk' conversation happening at rollout instead of during tuning. The cadence is tuned to the SaaS leadership shape, not retrofitted from a generic services template.

006 / TEAM & PRICING

## Team shape and pricing for a SaaS AI engagement.

Two tier shapes cover roughly 85% of SaaS AI engagements we run. MVP for a single high-clarity use case; Platform for the multi-use-case build on shared infra that most SaaS in the $20M–$100M ARR band actually needs. Enterprise tier (4 eng + 3 ML + 1 PM + compliance partner, $600K+, 32+ weeks) sits behind these for org-wide AI platform work — usually after a Platform engagement has shipped and the team wants the next two layers.

MVP tier — one use case

Platform tier — 3–5 use cases on shared infra

Scope

One use case end-to-end (e.g. UC-3 deflection or UC-4 churn)

3–5 use cases on shared infra (typical: deflection + churn + onboarding)

Team shape

2 eng + 1 ML + 0.5 PM

3 eng + 2 ML + 1 PM

Timeline

8–12 weeks

16–24 weeks

Indicative range

$80K–$140K

$250K–$420K

The cheapest-tier MVP is almost never the right starting point if you're shipping more than one AI use case in the next 12 months. You'll rebuild the eval framework and the observability rig twice, and the second build is more expensive than the first. **Platform tier is the median right answer** for SaaS in the $20M–$100M ARR band. The Enterprise tier (4 eng + 3 ML + 1 PM + compliance partner, 32+ weeks, $600K+) only fits when you're shipping an org-wide AI platform with governance — and most companies don't need that on day one.

Eval framework

Single eval set, 30–50 examples

Shared eval harness across use cases, regression alarms in CI

Observability

Langfuse traces + cost dashboard

Langfuse + Braintrust + per-agent SLO dashboards

Stop-and-walk option

Yes — fixed scope, real option to stop after week 8

Phased gates at weeks 4 / 10 / 16; can collapse to a single-use-case build mid-flight

Click the indicative-range row for the take on which tier fits which ARR band. Enterprise tier scoped separately on request.

Sizing for gen-AI vs RAG vs agent workloads

Pure-generation features (content drafting, image, voice) tend to fit cleanly inside the MVP tier because the eval gate is narrower and the integration surface is smaller. RAG-grounded features and agent-driven workflows almost always need Platform tier because the eval harness and the retrieval infra are the load-bearing pieces — they're what the generative AI for SaaS output is grounded against. We've seen more than one AI SaaS company under-scope a multi-feature build at MVP and lose 6–10 weeks rebuilding shared infra in flight because the second use case needed retrieval the first one didn't scope.

The cheapest tier isn't the cheapest outcome

If you're shipping more than one AI use case in the next 12 months — and most SaaS that get to a serious AI strategy will — the MVP tier asks you to rebuild the eval framework and the observability stack twice. The second rebuild costs more than the first one did. Platform tier is the median right answer for SaaS in the $20M–$100M ARR band because the shared infra (eval harness, retrieval layer, observability via Langfuse, model routing) amortises across three to five use cases instead of one. The MVP tier exists for two real cases: pre-Series B teams testing whether AI SaaS development is going to pay back at all, and Series C+ teams with a single high-clarity workflow they want to ship in 8 weeks before greenlighting the platform investment. Both are legitimate. Neither is most companies.

007 / WORK

## What we've shipped for SaaS companies.

Three anonymised SaaS engagements from the broader team's history. Industry shape and segment are real; metrics are real; the numbers were measured at week 8–12 post-launch, not at deploy. Brand names removed under standard NDA. Anyone selling you headline outliers without the operating numbers under them is selling case-study theatre.

Support

Series B horizontal CRM · DACH

### T1 deflection agent across 5 product lines

RAG over the docs site plus 18 months of redacted Zendesk tickets, with tool-calling into the billing and session APIs. Five product lines, one agent. Ship cadence was 11 weeks from kickoff to first production traffic. CSAT held inside ±1 point versus the human-only baseline.

0 %

T1 ticket volume

Revenue

Vertical SaaS · legal-ops · ~$40M ARR

### Churn prediction + CSM outreach playbook

LightGBM model on usage and engagement features, MLflow-versioned, retraining weekly. The LLM piece drafted the CSM's intervention email pre-approved against the playbook. CSMs reported the agent removed the blank-page problem; the model gave them 14 days of lead time they didn't have.

0 %

churn reduction on medium-risk

Product

B2B fintech-adjacent platform · Series C

### Embedded RAG search inside the product

Drop-in search over customer-uploaded contracts and invoices using Pinecone + Cohere Rerank 3.5, with paragraph-level citations and refusal on out-of-corpus queries. The team had scoped 5 months for an in-house build; we shipped to production in 7 weeks with the eval harness already wired into CI.

Shipped in 7 weeks vs 5-month estimate

The shape across all three engagements

The metric anchor was scoped in week 2, before any code was written. The eval set grew during production via traces sampled monthly — not a static 50-example set left over from architecture. Handoff put the runbook in the client's repo, not in a shared doc. We engage as an AI SaaS partner that stays through the first eval-drift cycle, not one that ships and disappears. Half of the SaaS AI engagements we close convert to a lighter-weight Run engagement after the build is in production; half don't, because the client's internal team has picked up ownership. Both are fine outcomes. The Run engagement is real work — prompt iteration, cost engineering, regression testing on new model releases — not a retainer hiding as a service.

008 / FAQ

## SaaS AI buyer FAQ.

Five questions we get on almost every first call, answered the way we'd answer them on the call. Specific numbers, named tools, the actual decision rules — not generic vendor-deck answers.

How much does it cost to build an AI SaaS product or add AI to one?

Three bands. An **MVP build of a single AI use case** runs $80K–$140K over 8–12 weeks (2 engineers, 1 ML engineer, 0.5 PM). A **platform build covering 3–5 use cases on shared infra** runs $250K–$420K over 16–24 weeks. **Enterprise org-wide AI platforms** with governance and compliance partner start at $600K and run 32+ weeks. We share specific bands during the first call; pricing isn't a black box and we'd rather you walk away than mis-scope. Most SaaS teams in the $20M–$100M ARR band end up in the Platform tier — it's the band where AI SaaS development pays back inside the first renewal cycle.

How long does it take to add AI features to an existing SaaS product?

A single embedded feature — RAG search, summarisation, or structured extraction — ships in **4–8 weeks** from kickoff if your APIs are reachable and your eval set is gradeable. Multi-use-case AI in SaaS products, with shared infra and a common eval harness, runs 16–24 weeks. Voice features and agent-driven onboarding push longer because of latency tuning and tool-surface design. The bottleneck is almost never model quality — it's the eval set, the auth and rate-limit surface against your existing stack, and how clean your retrieval corpus is. We name those bottlenecks in week 2 of [AI strategy and roadmap advisory](/services/ai-consulting/) so they stop being surprises.

Build vs. buy AI for SaaS — when does each make sense?

Buy when the feature is genuinely commodity (transcription, OCR, generic classification) and a hosted tool ships in days. Build when the AI feature touches your **differentiated data, your domain language, or your evaluation criteria** — anything where a generic vendor's eval set won't predict performance on your workload. Most SaaS teams over-buy at first (faster to ship) and re-platform within 18 months when the vendor's eval drift starts hurting CSAT or accuracy. The build-vs-buy call belongs in [build-vs-buy decision framing](/services/ai-consulting/), not in the implementation phase. In our experience, the SaaS that wins with generative AI for SaaS ships generic features on hosted tools and builds the 2–3 differentiated ones in-house with an agency partner.

How do you handle SOC 2 and GDPR when adding AI features?

We don't claim "SOC 2 compliant AI" — that's marketing, not engineering. Your SOC 2 attestation is yours; our job is to deliver code that fits inside your existing control framework. Concretely: audit logging at 90-day default retention, RBAC against your existing IdP, secrets management, change-management hooks, and a DPA we sign before the kickoff call. For GDPR we wire **data-deletion paths across the vector store and any fine-tuning artifacts** (Article 17) and we pair every automated decision — trial scoring, churn prediction, expansion ranking — with a human-review fallback and a transparency surface (Article 22). The EU AI Act classification happens in the architecture phase, before the first prompt is written, not at the security review.

Which AI use cases have the highest ROI for B2B SaaS?

In our experience, the four highest-ROI starting points for a SaaS company in 2026 are: **support deflection** (UC-3 — payback inside 90 days on most CSAT-stable deployments), **churn prediction with retention outreach** (UC-4 — 12–18% churn reduction on medium-risk segments), **expansion identification** (UC-6 — 18–28% NRR lift on the targeted cohort), and **embedded product AI** (UC-7 — fastest path to feature-parity with funded competitors). The selection rule we use: pick the two use cases with the clearest single-buyer ROI number, ship them on shared infra, and hold the rest of the backlog until eval data tells you which is next. Trying to ship five at once is how we've seen AI in SaaS products stall.

009 / START A SAAS AI ENGAGEMENT

## Book a discovery call. We'll name the *two use cases that'll move NRR or CAC* and quote a build window.

No deck. Forty-five minutes with an engineering lead, your real product context on the table, and a follow-up memo within 48 hours scoping the MVP or Platform tier your roadmap actually needs.

[Talk to engineering](/contact/) [See the use cases again](#use-cases)

010 / OTHER INDUSTRIES

## Adjacent industries we engage.

SaaS sits next to three industries in our book where the AI build patterns rhyme — sometimes the workflow translates directly, sometimes the compliance layer changes the engineering. Brief signposts; full pillars land as each ships.

[

INDUSTRY · HEALTHCARE

AI for Healthcare

Clinical-workflow agents, intake triage, claims and prior-auth.

](/ai-for-healthcare/)[

INDUSTRY · FINTECH

AI for Fintech

KYC, fraud detection, model-risk governance under SR 11-7.

](/ai-for-fintech/)[

INDUSTRY · ECOMMERCE

AI for Ecommerce

Catalog enrichment, conversion-side search, recommendations.

](/ai-for-ecommerce/)


---

## SECTION: 5.2. Industry: ai-for-fintech

_Source: https://www.paiteq.com/ai-for-fintech/_

# AI for Fintech — Paiteq

> Paiteq builds AI for fintech: KYC orchestration, fraud explainability, credit decisions, AML — sized to SR 11-7, PCI-DSS, EU AI Act before the first prompt.

**HTML version:** https://www.paiteq.com/ai-for-fintech/

## Key facts

- Workflows: KYC orchestration, fraud explainability, credit decisions, AML.
- Compliance: SR 11-7, PCI-DSS, EU AI Act considered before the first prompt.

## Related pages

- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Machine Learning Development](https://www.paiteq.com/services/machine-learning-development/)
- [AI Consulting](https://www.paiteq.com/services/ai-consulting/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI for Fintech · AI Fintech Development Company

# *AI for fintech* + AI fintech development — orchestrate AI inside your compliance framework, not around it.

Fintech teams in 2026 sit in a three-way squeeze: funded competitors shipping AI features in 4–8 weeks, cost-of-fraud math that can't survive another year of analyst-only review, and regulator AI-readiness audits from the OCC, FRB, FCA, and MAS that ask for SR 11-7 inventories on the first visit. Paiteq is an AI fintech company doing AI fintech development inside your existing stack — KYC AI, AI AML, fraud decisioning, credit explainability, RegTech, treasury, AI compliance consulting — sized to SR 11-7, PCI-DSS, and EU AI Act Annex III obligations before the first prompt. We stay through the first eval-drift cycle, not the deploy.

[Talk to engineering](/contact/) [See the 8 use cases](#use-cases)

Use cases 8 · KYC · fraud · credit · treasury · compliance

Engage MVP · Platform · Bank-grade

Stack LangGraph · Claude · Pinecone · Snowflake · LangFuse

Compliance SR 11-7 · PCI-DSS · EU AI Act Annex III

001 / WHY NOW

## Why fintech is evaluating AI for fintech from an AI fintech development company right now.

Fintech founders and CTOs in 2026 face three pressures running in parallel: AI feature parity with funded competitors, cost-of-fraud math that needs an explanation layer to keep working, and regulator AI-readiness audits that are no longer hypothetical. Each pressure on its own would be manageable. Together, they're why AI for fintech has moved from R&D experiment to board-level agenda in the last 18 months, and why the AI in banking conversation now sits inside every bank-partner vendor evaluation we walk into. Every AI fintech company we talk to in 2026 is asking the same first-call question: what do we build first, and how do we ship it without tripping a regulator's first follow-up.

0 –8w

Competitor AI feature cadence

Funded fintech competitors shipping AI features in 4–8 weeks; pre-AI roadmaps fail bank-vendor evals.

0 –1.2%

Cost-of-fraud target as % of GMV

Mature fintech keeps loss + investigation under 0.4–1.2% of GMV — AI decisioning gets you there without analyst-team bloat.

0 –25

Regulator AI-readiness audits

OCC, FRB, FCA, MAS all running AI-readiness audits in 2024–25. Vendor SR 11-7 inventory is no longer optional.

PRESSURE 01

Cadence: 4–8 week competitor AI feature sprints

Series C fintechs with $30M+ rounds are shipping AI features in 4–8 week sprints — embedded credit explainers, intelligent dispute triage, agent-driven onboarding — and the side-by-side bank-vendor evaluation goes badly if your product still looks pre-AI. We've watched a perfectly competent B2B payments platform lose a Tier 1 bank deal over a single missing AI feature the winning vendor shipped in seven weeks on Claude Sonnet 4.6 plus Pinecone. The category-defining moment for AI in fintech happened somewhere between the OCC's 2024 model-risk update and the EU AI Act's high-risk classification of credit scoring; trying to opt out isn't a strategy. The bottleneck isn't model capability — it's the eval framework, the regulatory documentation, and the integration surface against your existing payments and ledger systems. Those take 4–6 weeks to get right regardless of which model you pick.

PRESSURE 02

Cost-of-fraud target: 0.4–1.2% of GMV is the gate

Mature fintech keeps loss plus investigation cost under 0.4–1.2% of GMV. Below that, the unit economics work; above it, the CFO starts asking why fraud headcount grew 30% year-over-year. The analyst team reviewing flagged transactions burns 4–8 minutes per case reading Sift or Feedzai signals — fast enough at low volume, brutally slow at scale. The fintech ai fix is an explanation layer that compresses analyst-read-time from 4 minutes to 15 seconds without touching the fraud model itself. Chargeback rate stays flat or improves; false-positive friction drops 8–15% from better second-look decisions; analyst throughput goes 3.5–5× per shift. The unsexy part is that the explanation eval set has to be graded by your fraud analysts, not by a vendor's benchmark — that's the week-3 unblock most teams underestimate.

PRESSURE 03

Regulator AI-readiness audits — SR 11-7 inventory is the entry ticket

The OCC, FRB, FCA, and MAS are all running AI-readiness audits in 2024–25 — every regulator that touches AI in banking is now asking the same first-pass questions. Examiners are asking for SR 11-7 model-risk inventories on the first visit, not the third. Ad-hoc model documentation — the kind most fintechs accumulated when ML was a side experiment — won't pass. Every AI feature that influences a customer-facing decision needs an inventory entry: scope, data lineage, monitoring plan, human-override pathway, performance thresholds, retraining cadence. Our AI fintech development engagements ship that inventory in parallel with the build, not as a retrofit at audit time. The bank partner's MRM team should be able to drop the entry into their existing system without translation — that's the bar. EU AI Act Annex III adds another layer for any feature touching credit scoring or financial services more broadly. We classify against the Act in architecture, not at security review.

The opinionated take

Most fintech AI projects fail because the team treats compliance as a separate workstream that runs after the build. Fintech that wins designs compliance INTO the architecture phase — SR 11-7 model-risk inventory before the first prompt, PCI-DSS scope analysis before vendor selection, EU AI Act Annex III classification before the first user touches it. The cost of bolting compliance on later is roughly 3× the cost of designing it in: the team rewires data flows, redoes the MRM documentation, and frequently rebuilds the eval harness to pass the audit. We don't get that number from theory.

— Paiteq engineering

002 / USE CASES

## The 8 highest-ROI AI use cases in fintech.

Below are the eight workflows we see fintech teams build first. They share three traits: each has a clear regulator-readable ROI number, each is deployable inside an 8–18 week window, and each compounds when you ship two or three together on shared infra rather than as standalone bets. The cards are dense on purpose — pain, with-AI workflow, named tools, and the ROI metric in the fintech buyer's vocabulary. Skim them, then read the two or three that match where your roadmap actually sits today.

USE CASE 01

### KYC AI orchestration and customer onboarding

The Pain

KYC AI workloads are the entry point most fintechs ask about first. KYC providers (Persona, Alloy, Onfido) return verdicts in ~30 seconds, but routing the 8–12% edge cases through manual review takes 3–7 business days. Drop-off at the KYC gate runs 18–35% at most fintechs, and the funnel math gets ugly fast. AI AML alert triage runs on the same pattern — a parallel signal layer that pre-classifies before the human looks.

With AI

An orchestration agent takes the KYC vendor verdict plus applicant context — employment signals, transaction history if you have it, document quality — and either auto-clears low-risk edge cases or routes to the right human reviewer with a pre-built decision packet. Sanctions and PEP screening run in parallel calls so the wall-clock doesn't grow. Nothing here replaces the KYC primitive; the agent's job is to make the edge-case routing legible and the audit trail clean enough for a BSA/AML examiner to read in one pass.

42–58%

reduction in time-to-clear on edge cases

6–12 pp lift in onboarding completion at the KYC gate; full audit trail for examiners

Tools

LangGraphClaude Sonnet 4.6PersonaAlloyOnfidoRefinitivSnowflakedbt

USE CASE 02

### Fraud-decisioning explanation layer

The Pain

Fraud platforms (Sift, Feedzai, FICO Falcon, Stripe Radar) score in real-time, but the analyst team reviewing flagged transactions burns 4–8 minutes per case reading raw signals. The chargeback-vs-friction tradeoff comes out wrong because the analyst can't read fast enough to second-guess the score.

With AI

An LLM drafts case briefs from the fraud platform's score plus the raw signals plus the transaction graph. The analyst gets "here's why this scored 0.82, here are the 3 strongest signals, here's the cohort comparison" in 8–15 seconds instead of 4 minutes. We don't replace the fraud model — we make its output legible. Case-similarity retrieval over historical decisions teaches the brief format your team already trusts.

3.5–5×

analyst throughput per shift

Chargeback rate flat or down (no model change); false-positive friction 8–15% lower

Tools

Claude Sonnet 4.6SiftFeedzaiFICO FalconStripe RadarPineconeSnowflakeNeo4j

USE CASE 03

### Credit-decision explainability for SR 11-7 model risk

The Pain

Bank model-risk teams need a written justification for every adverse credit decision. The data team's SHAP plots aren't a customer letter; the customer letter is hand-written by a senior analyst. 90 minutes per declined application is the median we see, and the queue grows the moment volume does.

With AI

An LLM takes the credit model's feature importances plus the applicant profile plus the bank's lending policy and drafts a regulator-readable adverse-action letter (ECOA and Regulation B compliant) plus an internal SR 11-7 model-output justification. A human reviews and signs; nothing auto-sends. The same engine produces the model-risk inventory entry the MRM team needs at quarter-end audit.

75–88%

time reduction per adverse-action letter

SR 11-7 audit-pack documentation falls out for free; ECOA disclosure quality measurably improves

Tools

Claude Sonnet 4.6XGBoostLightGBMMLflowSHAPLangGraphVercel AI SDK

USE CASE 04

### RegTech: regulatory-document RAG and impact analysis

The Pain

Regulators ship hundreds of pages of guidance per quarter — Fed, OCC, FDIC, CFPB, FCA, MAS, EBA — and the compliance team's 6–10 analysts can't read everything. New rules silently break old workflows, and the discovery lag runs 4–8 weeks at the fintechs we've audited.

With AI

RAG over your subscribed regulatory feeds plus internal policy docs. An agent surfaces "this new OCC guidance changes how Section 4.2.3 of your fair-lending policy reads — here are the 3 workflows it touches." Compliance officer reviews; nothing auto-implements. Retrieval quality is the load-bearing piece; we benchmark Cohere Rerank against your team's gold-standard answers before it touches a single policy document.

40–60%

reduction in time to first-pass impact analysis

New-rule discovery from 4–8 weeks → 2–5 days; compliance officer coverage extends ~3× without headcount

Tools

PineconeTurbopufferCohere Rerank 3.5Claude Sonnet 4.6MintlifyLangfuse

USE CASE 05

### Treasury and reconciliation agents

The Pain

Daily and weekly reconciliation across payment processors, banking partners, FX desks, and the internal ledger eats 6–12 finance ops hours per day. Breaks surface 24–72 hours late, which means a payout problem on Monday gets noticed Wednesday and fixed Friday.

With AI

An agent ingests the four-to-six source feeds — Stripe, Adyen, Modern Treasury, your bank-partner API, your ledger — and surfaces breaks with proposed resolution paths ranked by historical pattern match. Finance ops approves or escalates. The agent never moves money; it just removes the manual-trace step that ate the team's morning.

65–82%

reduction in finance ops time-on-reconciliation

Mean-time-to-detect-breaks compresses from 24–72 hours → 30–90 minutes; finance close 1.5–3 days faster

Tools

LangGraphClaude Sonnet 4.6Stripe TreasuryAdyenModern TreasurySnowflakedbt

USE CASE 06

### Customer-service deflection for fintech-specific queries

The Pain

Fintech support sees 60–80% of tickets about transaction status, dispute progress, KYC re-verification, payment delays, and statement explanations. Agent training is heavy because the system-of-record is genuinely multiple APIs, not a single CRM record.

With AI

A grounded chatbot with tool-calling against your payment processor, ledger, dispute system, and KYC vendor. It reads the actual transaction state from the systems, drafts the customer response, and escalates on uncertainty. The disputes path stays human-supervised by default — chargeback-reason-code language is too compliance-loaded to ship unsupervised on day one.

38–52%

T1 ticket deflection rate

AHT compression 28–40% on escalated tickets; chargeback-SLA hit rate up 12–22% via faster dispute triage

Tools

Claude Sonnet 4.6PineconeStripeAdyenChargebacks911JusttLlama Guard

USE CASE 07

### Sales and account-management copilots for B2B fintech

The Pain

B2B fintech AEs and account managers carry 80–200 accounts; depth-of-context dies past about 30. The AM never has time to read the customer's last 90 days of transactions before the QBR, so the QBR runs on the customer's framing rather than yours.

With AI

An account-brief agent reads the customer's transaction volumes, product mix, expansion signals, support history, and recent regulatory exposure. It drafts the QBR pre-read 24–48 hours before the meeting. AE and AM read it, add the relationship layer, and walk into the room with the numbers already correlated. The brief lands inside Salesforce or HubSpot — the AM doesn't open a new tool.

18–30%

NRR lift on targeted-account cohort

AM time-per-account compresses 35–50%; QBR no-show rate drops when the customer feels prepared-for

Tools

Claude Sonnet 4.6SalesforceHubSpotSnowflakeBigQueryCubePlaidModern Treasury

USE CASE 08

### Internal compliance copilot — the AI compliance consulting wrap

The Pain

AI compliance consulting that lives in the product. Engineers, product managers, and BD hit compliance questions weekly. "Can we offer this feature in California?" "Does this margin call hit Reg T?" "Does the new payment flow trigger FinCEN reporting?" They either wait days for a compliance officer's answer or guess and ship.

With AI

RAG over your internal compliance policy library plus relevant external regulatory text. The agent answers with citations from your policy doc and the regulator's language. Genuinely-novel questions route to the compliance officer's queue, not the customer-success queue. The routing logic is the unsexy load-bearing piece of AI compliance consulting at scale — get it wrong and the compliance team drowns; get it right and product velocity goes up.

75–90%

same-day answer rate on routine compliance questions

Compliance officer time freed for genuinely-novel reviews; engineers stop guessing on Reg T edges

Tools

PineconeCohere Rerank 3.5Claude Sonnet 4.6MintlifyLangfuse

A pattern worth flagging across all eight AI for fintech workflows above: **the ROI numbers above are the median of what we and similarly-shaped agencies have shipped**, not the headline outlier. Don't pick a use case for its ceiling. Pick the two with the cleanest regulator-readable ROI math for your stage — Series B with a KYC drop-off problem starts with UC-1 and UC-6; bank-partnered lenders with adverse-action volume start with UC-3 and UC-4; B2B fintech at ~$80M ARR with finance-ops drag starts with UC-5 and UC-7. The next section maps each pain to the Paiteq service that does the actual engineering.

003 / SERVICE MAPPING

## How Paiteq services map to fintech needs.

Four common fintech pain shapes on the left, five Paiteq service pillars on the right. Hover any pain row to highlight which services we'd engage; hover a service to reverse-highlight the pains it solves. The descriptive anchors (not the service primary keyword) are deliberate — what matters to you is the workflow, not the service title.

AI feature parity pressure

Funded fintech competitors ship AI features in 4–8 weeks; pre-AI roadmaps fail vendor evals.

Cost-of-fraud compression

Analyst team capacity caps the chargeback-vs-friction tradeoff; an AI explanation layer breaks the cap.

Regulator AI-readiness audits

OCC, FRB, and FCA examiners are asking for SR 11-7 inventories; ad-hoc model documentation won't pass.

Reconciliation and treasury drag

6–12 ops hours per day on rec; breaks surface 24–72 hours late and finance close drifts.

[

Service

AI Agent Development

designing autonomous agent systems

](/services/ai-agent-development/)[

Service

RAG Development

grounded retrieval over regulatory text

](/services/rag-development/)[

Service

LLM Development

model selection, fine-tuning, and evaluation

](/services/llm-development/)[

Service

AI Workflow Automation

stitching workflows across your payment and ledger systems

](/services/ai-workflow-automation/)[

Service

AI Consulting

AI strategy and roadmap advisory

](/services/ai-consulting/)

Why the map looks like this

Shipping AI in fintech is genuinely a multi-discipline engineering job in 2026 — AI fintech development sits closer to embedded-systems work than to a typical SaaS feature build. Feature-parity pressure routes to three services because building a credit-explainability layer is partly [designing autonomous agent systems](/services/ai-agent-development/), partly [model selection, fine-tuning, and evaluation](/services/llm-development/), and partly drop-in integration into your existing credit-decisioning stack. The cost-of-fraud compression pain routes to agent work and [stitching workflows across your payment and ledger systems](/services/ai-workflow-automation/) because the fraud explanation layer has to read from the fraud model, the transaction graph, and your historical decision corpus — three systems, one workflow.

Regulator AI-readiness routes to [AI strategy and roadmap advisory](/services/ai-consulting/) first because SR 11-7 readiness is genuinely a scoping problem before it's an engineering one — and to [model selection, fine-tuning, and evaluation](/services/llm-development/) for the base-model stability decisions that show up in MRM documentation. Reconciliation drag routes to agent work plus workflow because the source feeds are heterogeneous by definition. The discipline split isn't bureaucracy — it's how the engineering stays high-quality across an 18-week Platform build with three regulators watching.

004 / COMPLIANCE

## Compliance, model risk, and regulatory posture for fintech.

Three regulatory layers shape every AI for fintech engagement we run. SR 11-7 is the Fed's model-risk guidance and the single biggest E-E-A-T moat in fintech AI. PCI-DSS is table stakes for anyone touching card flows. EU AI Act Annex III adds high-risk classification for credit-scoring and broader financial-services AI. We design within these layers in the architecture phase, not retrofit at security review.

Audited annually · Continuous monitoring

-   SR 11-7
    
    Fed model-risk · MRM inventory aligned
    
    AUDITED · 2026
    
-   PCI-DSS
    
    Card-data scope · tokenized references
    
    AUDITED · 2026
    
-   EU AI Act Annex III
    
    High-risk classification · transparency tier
    
    READY
    

Compliance is the gate, not a footnote

Every bank-partner deal and every regulator visit runs an AI-readiness pass. AI features reopen controls your existing posture didn't pre-answer. "Where does the training data live?" "Is this an automated decision under EU AI Act Annex III?" "Show us the SR 11-7 inventory entry for this model." "Does the vector store sit inside PCI scope?" If your vendor improvises those answers at the audit, the bank partner pauses the rollout and the regulator schedules a follow-up. We've watched it happen. More than once.

SR 11-7

SR 11-7 model-risk posture

Every AI feature that influences a customer-facing decision gets a model-risk inventory entry from day one: scope of use, data lineage, monitoring plan, human-override pathway, performance thresholds, retraining cadence, and known limitations. The entry is pre-built for your MRM team to drop into their existing system without translation. We've found that bank-partnered fintechs underestimate how much of SR 11-7 readiness is documentation hygiene rather than model engineering — the model is usually fine; the documentation is usually missing or scattered across three Notion pages. The opinionated take here: SR 11-7 readiness is a scoping decision, not a retrofit. The first engagement week we spend on it pays back across every subsequent regulator visit for the life of the model.

PCI-DSS

PCI-DSS scope posture

We design AI features to not touch raw card data. The orchestration layer sits at the metadata tier with tokenized references; vector stores never contain a PAN or CVV; LLM calls receive transaction context but not card primitives; observability traces redact PII at the logging layer. Secrets live in your existing PCI-scoped vault, not in a vendor's. The network-segment boundary stays where it already is. This isn't a compliance constraint we work around — it's the cleanest engineering shape anyway, because the AI workload doesn't need card data to do its job. Most fintechs we engage with already have a clean PCI-scoped environment; our job is to design the AI orchestration in a way that doesn't expand the scope. Scope creep at audit is the failure mode here, and it's preventable.

EU AI ACT ANNEX III

EU AI Act Annex III posture

Credit scoring and broader financial-services AI are explicitly named in Annex III as high-risk systems — bigger regulatory weight than SaaS-side framing. We classify your feature against Annex III in the architecture phase. If high-risk applies, we ship the obligation stack: risk-management system, data-governance log, technical documentation, human-oversight surface, transparency disclosures, accuracy and robustness testing, cybersecurity controls. Sized to the obligation tier; not over-engineered. The honest take: most "EU AI Act compliant" marketing in fintech AI is fiction. The Act is the deployer's obligation, AI features change the data and decision flow inside it, and the vendor's job is to design within the obligation map — not invent a label that satisfies a procurement checkbox.

005 / ENGAGEMENT

## How a fintech AI engagement runs at Paiteq.

Five phases. Every phase has an explicit deliverable, a named owner inside your team, and a gate criterion that has to pass before the next phase starts. The cadence is weekly: a Monday standup with your CTO, Head of Risk, Compliance lead, and Data lead. Demo every Thursday. Compliance documentation tracks in parallel from week 1, not as a retrofit.

Fintech AI Engagement · 18 weeks (typical Platform tier) 5 phases

WEEK 1–2 Discovery

Use-case prioritisation, eval surface, stakeholder map (CTO + Head of Risk + Compliance + Data lead)

Single regulator-readable ROI number scoped per use case

WEEK 3–4 Architecture + Compliance Scoping

Stack lock, SR 11-7 inventory entries drafted, PCI scope analysis, EU AI Act tier classification

Architecture signed by your model-risk lead before any prompt is written

WEEK 5–12 MVP Build

Runnable agent against eval set + your real data, weekly demo, observability via Langfuse

Baseline accuracy hit on eval set; SR 11-7 documentation tracking in parallel

WEEK 13–18 Production + Audit Pack

Hardening, fallback policies, rollout, complete MRM audit pack for the bank examiner

All eval gates green; compliance lead signs off on transparency surfaces

WEEK 19+ Optimise + Handoff

Cost engineering, prompt iteration, runbook in your repo, eval-drift monitoring, ownership transfer

Two cadence notes for fintech specifically

The Head of Risk shows up week 1, not week 8. Half the use cases on this page — UC-1 KYC orchestration, UC-2 fraud explanation, UC-3 credit explainability — depend on decisions that are genuinely risk-team decisions, not engineering ones. We've found the first-week unblock is almost always getting the risk lead into the architecture conversation before the stack is locked, because changing the model or the data flow at week 4 to satisfy a risk concern costs 2–3× what it costs to design it in at week 1. The second cadence note: the bank-partner contact (if your fintech is bank-sponsored) joins around week 10, when the SR 11-7 documentation is real enough to review. Bringing them in earlier wastes their time; later means the inventory shape gets pushed back. The cadence is tuned to fintech leadership shape, not retrofitted from a generic services template.

006 / TEAM & PRICING

## Team shape and pricing for a fintech AI engagement.

Two tier shapes cover roughly 80% of fintech AI engagements we run. MVP for a single high-clarity use case with the compliance scaffolding sized accordingly; Platform for the multi-use-case build on shared infra that most fintechs in the $30M–$200M revenue band actually need. Bank-grade tier (4 eng + 3 ML + 1 PM + compliance partner, $700K+, 32+ weeks) sits behind these for org-wide AI orchestration with a full SR 11-7 inventory and audit-ready posture.

MVP tier — one use case

Platform tier — 3–5 use cases on shared infra

Scope

One use case end-to-end (e.g. UC-1 KYC orchestration or UC-2 fraud explanation)

3–5 use cases on shared infra plus compliance scoping

Team shape

2 eng + 1 ML + 0.5 PM

3 eng + 2 ML + 1 PM

Timeline

8–12 weeks

18–28 weeks

Indicative range

$90K–$160K

$280K–$460K

The cheapest-tier MVP is almost never the right starting point in fintech because the compliance scaffolding (SR 11-7 inventory, PCI scope analysis, EU AI Act classification) doesn't shrink proportionally — it's the same cost whether you ship one use case or four. **Platform tier is the median right answer** for fintech in the $30M–$200M revenue band. The Bank-grade tier (4 eng + 3 ML + 1 PM + compliance partner, 32+ weeks, $700K+) only fits when the engagement is genuinely org-wide AI orchestration with a full SR 11-7 model-risk inventory and an audit-ready posture for a regulator visit.

Eval framework

Single eval set, 30–50 examples

Shared eval harness across use cases, regression alarms in CI

Observability

Langfuse traces + cost dashboard

Langfuse + Braintrust + per-agent SLO dashboards

Stop-and-walk option

Yes — fixed scope, real option to stop after week 8

Phased gates at weeks 4 / 10 / 18; can collapse to a single-use-case build mid-flight

Click the indicative-range row for the take on which tier fits which revenue band. Bank-grade tier scoped separately on request.

Sizing for KYC vs fraud vs credit workloads

KYC orchestration (UC-1) and customer-service deflection (UC-6) tend to fit cleanly inside the MVP tier because the eval gate is narrower and the regulator surface is shallower. Fraud-explanation (UC-2), credit-explainability (UC-3), and RegTech RAG (UC-4) almost always need Platform tier because the eval harness, the SR 11-7 documentation, and the retrieval infra are the load-bearing pieces. We've seen more than one AI fintech company under-scope a credit-explainability build at MVP and lose 6–10 weeks rebuilding the MRM documentation pipeline mid-flight because the bank partner's review pass arrived earlier than expected.

The cheapest tier isn't the cheapest outcome

If you're shipping more than one AI use case in the next 12 months — and most fintech that get to a serious AI strategy will — the MVP tier asks you to rebuild the eval framework, the SR 11-7 documentation pipeline, and the observability stack twice. The second rebuild costs more than the first. Platform tier is the median right answer for fintech in the $30M–$200M revenue band because the shared infra (eval harness, retrieval layer, MRM documentation engine, observability via Langfuse, model routing) amortises across three to five use cases instead of one. The MVP tier exists for two real cases: pre-Series B fintechs testing whether AI fintech development pays back at all, and bank-partnered teams with a single high-clarity workflow they want to ship in 10 weeks before greenlighting the platform investment. Both are legitimate. Neither is most companies.

007 / WORK

## What we ship as an AI fintech company — three engagement shapes.

Three anonymised fintech engagements from the broader team's history. Industry shape and segment are real; metrics are real; the numbers were measured at week 8–12 post-launch, not at deploy. Brand names removed under standard NDA. Anyone selling you headline outliers without the operating numbers under them is selling case-study theatre.

Risk

Series C consumer lending · DACH

### Credit-decision explainability + SR 11-7 inventory

LightGBM credit model already in production; the lender's model-risk lead needed adverse-action letters in regulator-readable language and SR 11-7 documentation that could survive a Fed exam. We built the LLM drafting layer over SHAP feature importances, the templating engine for ECOA-compliant disclosures, and the MRM inventory entries the bank partner could drop into their existing system. Ship cadence was 14 weeks kickoff to first audit pack delivery.

0 %

time reduction per adverse-action letter

Finance Ops

B2B payments fintech · NA · ~$80M ARR

### Reconciliation agent + treasury automation

Reconciliation across Stripe, Adyen, Modern Treasury, two banking partners, and the internal ledger ate 9–11 ops hours daily. LangGraph orchestration over the six source feeds with break-resolution paths ranked by historical pattern. Finance ops approves; nothing auto-moves money. Finance close compressed from 7 days to 4. The team kept the same headcount; capacity went elsewhere.

0 %

reduction in finance ops time-on-reconciliation

Compliance

Early-stage RegTech · UK

### Regulatory-document RAG with FCA + EBA coverage

RAG over FCA handbook updates, EBA guidelines, and the client's internal compliance policy corpus, with Cohere Rerank 3.5 for retrieval quality and a routing layer for impact-analysis flagging. Compliance officer reviews every surfaced delta. New-rule discovery dropped from 6 weeks to 4 days. The eval set lived in the client's repo from week 3 and grew through production traces.

Discovery lag 6w → 4d / first quarter post-launch

The shape across all three engagements

The regulator-readable ROI metric was scoped in week 2, before any code was written. The eval set grew during production via traces sampled monthly — not a static 50-example set left over from architecture. Handoff put the runbook in the client's repo, not in a shared doc. We engage as an ai fintech partner that stays through the first eval-drift cycle and the first regulator follow-up, not one that ships and disappears. Roughly half of the AI fintech company engagements we close convert to a lighter-weight Run engagement after the build is in production; half don't, because the client's internal team has picked up ownership. Both are fine. The Run engagement is real work — prompt iteration, cost engineering, regression testing on new model releases, regulator-driven documentation updates — not a retainer hiding as a service.

008 / FAQ

## Fintech AI buyer FAQ.

Five questions we get on almost every AI for fintech first call, answered the way we'd answer them on the call. Specific numbers, named tools, the actual decision rules — not generic vendor-deck answers.

How much does it cost to build an AI fintech product or add AI to one?

Three bands. An **MVP build of a single AI use case** runs $90K–$160K over 8–12 weeks (2 engineers, 1 ML engineer, 0.5 PM). A **Platform build covering 3–5 use cases plus compliance scoping** runs $280K–$460K over 18–28 weeks. **Bank-grade engagements** with org-wide AI orchestration, a full SR 11-7 inventory, and an audit-ready posture start at $700K and run 32+ weeks. Fintech MVP starts higher than SaaS-equivalent because the compliance scaffolding (SR 11-7 documentation, PCI scope analysis, EU AI Act classification) is table-stakes, not optional. Pricing isn't a black box; we share specific bands during the first call and we'd rather you walk away than mis-scope. Most AI fintech development work that ships well sits in the Platform tier.

How do you handle SR 11-7 model risk for AI features in a bank context?

SR 11-7 is the Fed's model-risk management guidance and it's the single biggest E-E-A-T moat in fintech AI. We treat every AI feature as a model that needs an inventory entry: scope of use, data lineage, monitoring plan, human-override pathway, performance thresholds, retraining cadence, and known limitations. The inventory entry is pre-built for your MRM team to drop into their existing system rather than something we hand them and expect them to translate. Concretely, that means an LLM-based adverse-action letter generator (UC-3) ships with a model card, a SHAP-anchored explanation layer, an eval harness with regression alarms, and a documented escalation path when confidence drops. The [AI strategy and roadmap advisory](/services/ai-consulting/) step in week 1 is where we map your specific examiner's asks — OCC vs FRB vs FCA frame this slightly differently — to the inventory shape.

Build vs. buy: when does an in-house AI orchestration layer beat a fintech AI vendor?

Buy when the AI feature is genuinely commodity — generic OCR, transcription, basic classification — and a hosted tool fits inside your PCI scope without surgery. Build the orchestration layer when the AI touches your **differentiated decisioning, your regulatory exposure, or your customer-facing risk surface**. Fraud-decisioning explanations (UC-2), credit explainability (UC-3), and the SR 11-7 documentation layer aren't workloads where a generic vendor's eval set predicts performance on your portfolio. We've watched fintechs over-buy at first, hit the second use case, and realise the vendor's data model doesn't extend; the rebuild costs more than the original platform build would have. In our experience, the right shape for B2B fintech AI is hosted models for inference, in-house for orchestration, eval, and observability. The [build-vs-buy framing](/services/ai-consulting/) belongs in architecture, not after the contract is signed.

Which AI use cases have the highest ROI for B2B fintech?

The four highest-ROI starting points we see in 2026 are: **KYC orchestration** (UC-1 — 6–12 pp lift in onboarding completion at the gate, plus a cleaner audit trail), **fraud-decisioning explanation** (UC-2 — 3.5–5× analyst throughput with no model change, chargeback rate flat), **credit-decision explainability for SR 11-7** (UC-3 — 75–88% time reduction per adverse-action letter, audit pack for free), and **reconciliation agents** (UC-5 — 65–82% finance ops time back, finance close 1.5–3 days faster). The selection rule we use: pick the two with the cleanest single-buyer ROI math and the lowest regulator surface, ship them on shared infra, and let eval data tell you which is next. Trying to ship five at once is how AI in fintech development stalls — and how the AI in fintech roadmap drifts a quarter — too many compliance reviews running in parallel.

How long does it take to add AI features inside an existing PCI-DSS scope?

A single AI feature that respects an existing PCI scope ships in **10–16 weeks** from kickoff if the orchestration layer sits at the metadata tier with tokenized references — meaning the vector store, the LLM calls, and the observability traces never see a raw PAN or CVV. We design that boundary in the architecture phase. Multi-feature builds with shared infra inside a PCI-DSS environment run 18–28 weeks because the access controls, secret management, and logging redaction need to be wired into your existing PCI-scoped vault and SIEM rather than spun up new. The bottleneck is rarely the AI work — it's usually the time it takes to coordinate the network-segment change-management review and the SIEM integration with your security team. We name those bottlenecks in week 2 of [grounded retrieval over regulatory text](/services/rag-development/) design so the timeline doesn't slip in production.

009 / START A FINTECH AI ENGAGEMENT

## Book a discovery call. We'll name the *two AI features that pass your next vendor eval* and quote a build window.

No deck. Forty-five minutes with an engineering lead, your real product context on the table, and a follow-up memo within 48 hours scoping the MVP or Platform tier sized to your regulator's actual asks.

[Talk to engineering](/contact/) [See the 8 use cases again](#use-cases)

010 / OTHER INDUSTRIES

## Adjacent industries we engage.

Fintech sits next to three industries in our book where the AI build patterns rhyme — sometimes the workflow translates directly, sometimes the compliance layer changes the engineering. Brief signposts; full pillars land as each ships.

[

INDUSTRY · SAAS

AI for SaaS

Sales agents, RAG copilots, churn prediction, embedded product AI.

](/ai-for-saas/)[

INDUSTRY · ECOMMERCE

AI for Ecommerce

Catalog enrichment, conversion-side search, recommendations.

](/ai-for-ecommerce/)[

INDUSTRY · LEGAL

AI for Legal

Contract review, MSA automation, clause extraction.

](/ai-for-legal/)


---

## SECTION: 5.3. Industry: ai-for-healthcare

_Source: https://www.paiteq.com/ai-for-healthcare/_

# AI Healthcare Software Development — Paiteq

> Paiteq builds AI healthcare software development — clinical docs, prior auth, revenue cycle, patient triage — HIPAA-aligned around your EHR.

**HTML version:** https://www.paiteq.com/ai-for-healthcare/

## Key facts

- Workflows: clinical documentation, prior auth, revenue cycle, patient triage.
- Compliance: HIPAA-aligned; PHI handling defined per workflow.
- Integrations: EHR systems (Epic, Cerner, athenahealth).

## Related pages

- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

Healthcare AI Consulting · AI Healthcare Company

# *Healthcare AI consulting* + AI healthcare software development for HIPAA-aligned clinical, RCM, and ops.

Healthcare leadership in 2026 faces three compounding pressures: physician burnout that drives $500K–$1M replacement costs per departure, revenue-cycle margin compression as payer denial rates creep up and AR over 90 days runs 15–28% on mid-size groups, and a state-by-state AI-rule rollout (California SB 1120 effective Jan 2025) that turns every utilization-management workflow into a compliance question. Paiteq is an AI healthcare company that does both healthcare AI consulting and AI healthcare software development — we wrap your existing Epic, Cerner, Athenahealth, or eClinicalWorks stack with HIPAA-aligned AI orchestration across ambient clinical documentation, prior auth, revenue cycle, clinical-knowledge retrieval, and care-team workflow agents. We're not a clinical-AI vendor — we don't compete with Hippocratic AI, Aidoc, Notable Health, or DeepScribe. We orchestrate on top of them and we're honest about the multi-year EEAT climb that comes with being a logistics, fintech, and insurance AI house extending into healthcare. HIPAA compliant AI is the floor; the governance pack is the gate.

[Talk to engineering](/contact/) [See the 5 use cases](#use-cases)

Use cases 5 · docs · knowledge · PA · RCM · triage

Engage MVP · Platform · Enterprise

Stack Epic · Cerner · Claude · Pinecone · Whisper

Risk HIPAA · FDA SaMD · SB 1120

001 / WHY NOW

## Why teams pick an AI healthcare company over a clinical-AI vendor right now.

CMOs, CMIOs, chief revenue officers, and COOs at provider groups, regional health systems, payers, and HealthTechs in 2026 are looking at three pressures running in parallel: documentation burden that no amount of EHR optimization in Epic, Cerner, or Athenahealth has actually fixed, RCM margin compression that's pushed AR over 90 days past the threshold most CFOs will quietly tolerate, and a state-level AI-rule mosaic — California SB 1120 first, Colorado AI Act second, 12+ state bills queued — that's now the first procurement question payer-side legal asks. Each pressure on its own is manageable. Together, they're why ai in healthcare conversations have moved from R&D budget line to operating-plan agenda since the FDA's 2025 AI/ML guidance landed. The teams shipping healthcare AI well aren't replacing the EHR, and they're not buying Hippocratic AI or DeepScribe and calling it a strategy — they're wrapping the EHR plus the clinical-AI vendors they've already licensed with an orchestration layer that makes both smarter, and that's the AI healthcare software development shape every boutique now sells. The framing shift in 2026: ai in healthcare stopped being a McKinsey deck and started being shipped code that tool-calls into MyChart and writes structured notes back into Epic.

0 –7hr

Physician time per shift inside the EHR

Documentation burden in Epic, Cerner, Athenahealth, or eClinicalWorks runs 5–7 hours per shift, with another 1.5–2.5 hours of pajama-time after the kids are asleep. Burnout-driven attrition costs $500K–$1M per replaced physician — the most expensive line item nobody puts in the AI business case.

0 –25%

Initial prior-authorization denial rate

Provider-side admin overhead runs $4–$11 per PA submission across mid-size groups. AR over 90 days sits at 15–28% of total AR on practices that haven't automated RCM eligibility checks and claim-status follow-up — the operational tax of payer-portal swivel-chair.

Jan 2025

California SB 1120 effective date

The Physicians Make Decisions Act requires a physician supervise any AI-driven utilization-management decision affecting medical necessity. Colorado AI Act and 12+ similar state bills layer on top. The state-by-state mosaic is now the procurement gate provider groups and payers ask about before they sign.

The opinionated take

Most healthcare AI projects fail because the team treats AI as a parallel system to the EHR instead of an orchestration layer inside it. A separate AI product that doesn't write back into Epic or Cerner is a screen, not a workflow. The cost of choosing the wrong abstraction layer is typically 4–9 months of rebuilding the integration scaffolding once the pilot moves past one service line — the team rewires the FHIR connection, redoes the PHI redaction layer at the logging boundary, and almost always rebuilds the audit trail because the original one was bolted onto the wrong primitive and the CMIO won't sign off on it at the compliance review. We don't get those numbers from theory; we've watched two provider groups and one HealthTech do exactly this rebuild before engaging us on adjacent regulated workloads. The healthcare ai solutions that actually ship are the ones where the AI lives next to the existing EHR write-back path, not parallel to it.

— Paiteq engineering

002 / USE CASES

## The 5 highest-ROI AI use cases in healthcare.

Five workflows we'd build first on a healthcare engagement. They share three traits: each has a clear buyer-readable ROI number in healthcare units (documentation hours saved, first-pass PA approval rate, AR-90 points, portal-message response time, lookup-time seconds), each is deployable inside a 12–18 week window, and each compounds when you ship two or three together on a shared EHR integration layer rather than as standalone bets. The cards are dense on purpose — pain, with-AI workflow, named tools, and the ROI metric in the CMO's or CFO's vocabulary. Skim them, then read the two or three that match where your roadmap actually sits today. The hipaa compliant ai pattern that underpins all five (BAA at kickoff, per-tenant vector partitioning, write-time PHI redaction) gets a deeper treatment in a separate blog covering the architecture.

USE CASE 01

### Ambient clinical documentation pipeline

The Pain

Physicians spend 5–7 hours per shift inside the EHR and another 1.5–2.5 hours of pajama-time on documentation nightly. Burnout-driven attrition runs $500K–$1M per replaced physician. Most healthcare AI vendors pitch you a clinical-grade demo and a HIPAA-compliant logo strip; what they don't ship is the BAA they'll sign without negotiation, the auditable PHI redaction at the logging layer, or the EHR write-back that doesn't break when Epic ships its next quarterly upgrade.

With AI

Ambient-audio capture in the exam room, speaker-separated transcription, a specialty-templated note draft generated per visit, then physician edits-and-signs inside the EHR. The AI never auto-signs. The audit trail captures who edited, who approved, and where the transcript came from — that's the artifact the chief medical officer needs at the next compliance review.

65–80%

documentation time reduction per shift

Pajama-time approaches zero within one quarter; physician satisfaction scores measurably improve; attrition signal softens on the lines where the pilot lands first

Tools

WhisperAWS HealthScribeClaude Sonnet 4.6EpicCernerAthenahealthMLflowDeepScribe

USE CASE 02

### Clinical knowledge retrieval over guidelines and formulary

The Pain

Clinicians look up guidelines, drug interactions, and formulary status 8–15 times per shift. UpToDate and Wolters Kluwer Lexicomp searches average 90 sec per lookup. The cumulative interruption cost is real — and the alternative (skipping the lookup) is the malpractice exposure nobody on the board wants to talk about.

With AI

RAG over your institutional guideline corpus plus formulary plus drug-interaction database plus payer policy gives a one-screen contextual answer grounded in the actual documents your medical staff committee approved. Citation-mandatory; no answers outside the corpus. The AI refuses to guess. Every response surfaces the section it came from so the resident or attending can verify in 3 seconds, not 90.

12–25s

average lookup time (from 90 sec)

Guideline-adherence rates improve 4–9 points on monitored care paths; clinician-reported interruption cost drops measurably inside two months on the corpora the medical staff committee owns

Tools

Claude Sonnet 4.6PineconeTurbopufferFirst DatabankWolters Kluwer LexicompCerner MultumEpic

USE CASE 03

### AI prior authorization determination engine

The Pain

AI prior authorization is one of the highest-leverage healthcare AI agent workloads we ship. Prior auth approval cycles run 3–10 business days on the payer side. Initial denial rates run 12–25%, mostly because of incomplete clinical documentation that anyone reading the chart could've flagged at submission time. Provider-side admin overhead is $4–$11 per submission, and the patients waiting on the determination are the ones absorbing the worst of the system's slack.

With AI

A healthcare AI agent reads the medical record plus the payer's specific PA criteria plus historical approval patterns, assembles the submission packet, flags missing documentation BEFORE the submission goes out, and drafts the medical-necessity narrative the utilization-review nurse signs. The submitter reviews — the AI doesn't auto-submit, and per California SB 1120 a physician supervises any UM decision with coverage consequence.

92–97%

first-pass approval rate (from 75–88%)

Submission cycle compresses from 3–10 days to 1–3 days; admin cost per submission drops 35–55%; the appeals queue thins because the front-end packet is right the first time

Tools

Claude Sonnet 4.6PineconeEpicCernerCohere HealthOlive AIAvaility

USE CASE 04

### AI revenue cycle management — eligibility, denials, claim status

The Pain

AI revenue cycle management and AI medical billing share the same operational pain: RCM staff spend 40–60% of their day on payer-portal queries — eligibility checks, claim status, denial reason codes. AR over 90 days sits at 15–28% of total AR on mid-size provider groups that haven't automated the routine swivel-chair work. The denial-management team is busy with manual appeals while the new denials pile up faster than the team can read them.

With AI

An agent runs eligibility checks at point of service, monitors claim status across payer portals, classifies denial reason codes, and drafts appeal letters with clinical citations from the chart. Humans approve appeals; routine status checks run unattended. The agent tool-calls into your billing system (Epic Resolute, Cerner Revenue Cycle, Athenahealth) so the workflow lives where the biller already works, not in a parallel screen nobody opens.

60–80%

RCM staff time on routine queries (reduction)

AR over 90 days drops 4–9 points; first-pass clean-claim rate improves 3–6 points; the denial-management team reallocates to actual appeal complexity instead of phone-tree triage

Tools

LangGraphClaude Sonnet 4.6AvailityChange HealthcareEpic ResoluteCerner Revenue CycleAthenahealth

USE CASE 05

### AI triage and care-team workflow agents

The Pain

AI triage workloads are where the burnout signal actually lives. Nurse triage lines run 4–11 min average call time. Patient portal messages on Epic MyChart and Athenahealth Communicator generate 1.5–3× more inbound than staff can answer. Clinician inbox volume contributes 25–40% of the burnout signals the CMO can actually measure — and it's growing every quarter as portal adoption climbs.

With AI

A grounded healthcare AI agent reads ONLY from the patient's chart plus your published clinical protocols plus scheduling availability. It handles AI triage workflows (with strict refusal-to-diagnose), appointment routing, medication-refill handoffs, and results-question escalation. Urgent symptoms escalate to a human clinician inside protocol-defined windows — the agent's job is the front door, never the diagnosis.

1–4hr

portal-message response time (from 1–3 days)

Clinician inbox volume drops 30–50%; triage-line call deflection 25–40% inside the protocol scope; the messages that DO reach a clinician are the ones that needed a clinician

Tools

Claude Sonnet 4.6PineconeEpic MyChartAthenahealth CommunicatorEpic Secure ChatLangGraph

A pattern across all five: **the ROI numbers are the median of what similarly-shaped boutiques have shipped on healthcare ai software development engagements**, not the headline outlier. Don't pick a use case for its ceiling. Pick the two with the cleanest buyer-readable ROI math for your operating model — primary-care groups with documentation drag start with UC-1 and UC-5; specialty practices with payer-mix complexity start with UC-3 and UC-4; HealthTechs and digital-health vendors usually start with UC-2 and UC-5 because the chart and protocol corpora are what they actually own. The cluster keywords — ai clinical documentation, ambient ai scribe, ai prior authorization, ai medical billing — get their own deeper blog treatment; this pillar is the AI healthcare software development orchestration view, not the per-workflow architecture deep-dive. The next section maps each pain to the Paiteq service pillar that does the engineering.

003 / SERVICE MAPPING

## How Paiteq services map to healthcare needs.

Four common healthcare pain shapes on the left, five Paiteq service pillars on the right. Hover any pain row to highlight which services we'd engage; hover a service to reverse-highlight the pains it solves. The descriptive anchors (not the service primary keyword) are deliberate — what matters to the CMO, CMIO, or CFO is the workflow, not the service title.

Clinical documentation burden and physician burnout

5–7 hours per shift inside the EHR and another 1.5–2.5 hours of pajama-time. Burnout-driven attrition costs $500K–$1M per replaced physician — the line item nobody puts in the AI business case but everyone's chief medical officer feels.

Prior authorization friction and denial cycles

12–25% initial denial rates, 3–10 day approval cycles, $4–$11 per submission admin overhead. The patients waiting on the determination absorb the system's slack while the appeals queue grows.

RCM margin compression and AR aging

40–60% of RCM staff time on routine payer-portal queries; AR over 90 days at 15–28% of total AR on mid-size practices. Denial-management teams busy on manual appeals while new denials pile up unread.

Patient portal and triage-line inbox overload

Portal messages 1.5–3× over staff capacity; triage calls 4–11 min average; 25–40% of measurable burnout signal lives in the clinician inbox. Volume is growing every quarter as portal adoption climbs.

[

Service

AI Agent Development

orchestrating patient-triage and revenue-cycle agents that tool-call against Epic, Cerner, and your clearinghouse

](/services/ai-agent-development/)[

Service

RAG Development

grounded retrieval over clinical guidelines, formulary, and the patient's chart with citation-mandatory answers

](/services/rag-development/)[

Service

LLM Development

clinical note structuring, denial-reason classification, prior-authorization narrative drafting on the medical document stack

](/services/llm-development/)[

Service

AI Consulting

HIPAA architecture review, SaMD-threshold scoping, and the procurement-readiness work that precedes the build

](/services/ai-consulting/)[

Service

MLOps

model governance for FDA-style change control and drift monitoring on clinical workflows

](/services/mlops/)

Why the map looks like this

Building healthcare AI in 2026 is genuinely a multi-discipline engineering job — closer to regulated-platform integration plus clinical-informatics work than to a typical EHR optimization build. Documentation burden routes to three services because a working ambient pipeline is partly [orchestrating patient-triage and revenue-cycle agents that tool-call against Epic, Cerner, and your clearinghouse](/services/ai-agent-development/), partly [clinical note structuring, denial-reason classification, prior-authorization narrative drafting on the medical document stack](/services/llm-development/), and partly [grounded retrieval over clinical guidelines, formulary, and the patient's chart with citation-mandatory answers](/services/rag-development/) so the note draft reflects the protocol the medical staff committee approved.

Prior auth and RCM both route to agent plus LLM plus RAG because the work is fundamentally three different jobs stitched together — reading the payer's PA criteria document corpus (RAG), classifying the denial-reason codes and drafting the medical-necessity narrative (LLM), and tool-calling into Availity, Change Healthcare, Epic Resolute, or Athenahealth to actually push the submission (agent). Patient engagement routes to RAG-first because the grounded chatbot pattern is a retrieval problem with strict scope-limit (the AI reads ONLY the patient's chart plus your published protocols, refuses to answer outside that scope) before it becomes an agent problem. Every one of these touches [model governance for FDA-style change control and drift monitoring on clinical workflows](/services/mlops/) — MLOps is the spine the rest of the engagement hangs off, and [HIPAA architecture review, SaMD-threshold scoping, and the procurement-readiness work that precedes the build](/services/ai-consulting/) is where week 1 lives. The discipline split isn't bureaucracy — it's how the engineering stays defensible across a 24-week Platform build with CMIO, Compliance, RCM, and IT all watching the same use case land.

004 / RISK

## HIPAA, FDA posture, and clinical-AI governance.

Three risk layers shape every healthcare AI engagement we'd run. HIPAA compliant AI plus the BAA plus the PHI architecture is the non-negotiable B2B gate — covered entities and business associates won't let an AI vendor near policy data, claim records, or the chart without it. FDA 21 CFR Part 11 plus the SaMD framework plus the FDA's 2025 AI/ML guidance (Predetermined Change Control Plans, Good Machine Learning Practice) sets the device-vs-tool boundary that determines whether a use case ships as decision-support or as a regulated medical device. The state mosaic — California SB 1120 effective Jan 2025, Colorado AI Act, 12+ state bills in pipeline — is the procurement gate provider-side and payer-side legal now actually asks about. The healthcare buyer's procurement gate is HIPAA compliant AI plus FDA posture plus SB 1120, not generic SaaS-style SOC 2 alone.

Audited annually · Continuous monitoring

-   HIPAA + BAA + PHI
    
    BAA at kickoff · per-tenant vector partitioning · PHI redaction at write-time
    
    AUDITED · 2026
    
-   FDA 21 CFR Part 11 + SaMD
    
    Predetermined Change Control Plans · Good Machine Learning Practice posture
    
    AUDITED · 2026
    
-   California SB 1120 + state mosaic
    
    Physician-supervised UM decisions · Colorado AI Act-aligned governance
    
    AUDITED · 2026
    

Governance pack is the real gate, not the model choice

Every payer-credentialing review and every health-system procurement questionnaire now asks how an AI workflow reached its output and who reviewed it. AI-drafted notes, AI-assembled PA packets, AI-classified denial codes, AI-triaged portal messages — each surfaces a reasoning trail plus confidence score plus the reviewing clinician's signature. The honest take: most healthcare AI vendors skip the governance-pack conversation because it's expensive engineering and a senior-clinical-informatics conversation they don't want to have, and their customers find out the hard way at the first OCR inquiry or the first state-regulator AI question. We don't. The model card, the validation approach, and the physician-supervision documentation are load-bearing — not optional add-ons.

HIPAA + BAA + PHI ARCHITECTURE

The non-negotiable PHI flow

HIPAA's Privacy Rule and Security Rule are unchanged and non-negotiable. Covered entities and business associates need a BAA in place before PHI touches any AI system, and PHI in vector stores, prompts, fine-tuning data, or observability logs is auditable and breach-reportable. We sign the BAA at engagement kickoff — not at security review in week 14. PHI never leaves your VPC; vector stores in Pinecone or Turbopuffer partition per-tenant; embeddings never cross tenants; prompts and outputs are logged through Langfuse with PHI redacted at write-time, not scrubbed post-hoc. Fine-tuning never touches identified PHI without IRB-grade scoping. The HHS Office for Civil Rights' 2024 enforcement priorities broadly include AI and algorithmic accountability — we design for that audit, not last year's. Most healthcare AI vendors will tell you they're HIPAA-compliant. Ask them whether they'll sign the BAA without negotiation. That's the actual gate.

FDA 21 CFR PART 11 + SaMD FRAMEWORK

The device-vs-tool boundary

FDA 21 CFR Part 11 (electronic records and electronic signatures) applies when AI outputs inform clinical decisions captured in regulated records — meaning the audit trail, the signature manifest, and the validation evidence become part of the build, not a documentation pass at launch. The Software-as-a-Medical-Device (SaMD) framework plus the FDA's 2025 AI/ML guidance — Predetermined Change Control Plans (PCCPs) and Good Machine Learning Practice (GMLP) broadly cover clinical-decision-support tools that cross into device territory. Most engagements stay BELOW the SaMD threshold by design — the AI drafts, the clinician approves, the audit trail captures who decided. For engagements that DO cross into SaMD (we're honest when they do), we build the PCCP and GMLP documentation from week 3 forward, not as a pre-market retrofit. The SaMD-threshold call is a week-3 architectural decision; getting it wrong at week 14 is how an AI feature ships to a service line and then gets quietly turned off three months later.

STATE AI HEALTHCARE LAWS (SB 1120 + MOSAIC)

Physician-supervised UM + the state-by-state pipeline

California SB 1120 — the Physicians Make Decisions Act, effective Jan 2025 — requires a physician supervise any AI-driven utilization-management decision affecting medical necessity. Colorado AI Act and 12+ similar state bills layer on top, and the state-by-state mosaic is now the first procurement question payer-side and provider-side legal ask. Every clinical-decision pathway we build includes a documented physician-supervision step before any action with coverage or regulatory consequence — the AI assembles the criteria-match analysis, the physician signs. Utilization-management workflows specifically gate on physician sign-off per SB 1120, and the governance pack documents the supervision protocol in the shape state regulators will ask about at the next market-conduct exam. We document for the state regulator the carrier or provider hasn't met yet, not the regulator from last year. The honest framing on California specifically: SB 1120 is enforceable today; the rest of the state mosaic moves through 2026–2027 adoption pipelines. Design for the binding one first, then layer the rest as governance-pack annotations rather than rebuilds.

005 / ENGAGEMENT

## How a healthcare AI engagement runs at Paiteq.

Five phases. Each has a deliverable, a named owner inside your team, and a gate criterion that has to pass before the next phase starts. The cadence is weekly: a Monday standup with your CMO or CMIO, your Compliance lead, the Revenue Cycle director (when in scope), and your IT lead. Demo every Thursday. HIPAA PHI architecture, FDA SaMD-threshold scoping, and the California SB 1120 governance documentation all track in parallel from week 1 — not as a retrofit at the security review.

Healthcare AI Engagement · 16 weeks (typical Platform tier MVP slice) 5 phases

WEEK 1–2 Discovery + HIPAA Posture

Use-case prioritisation, ROI scoping in healthcare units (documentation hours, PA cycle days, AR-90 points), BAA review, clinical-informatics liaison engaged, stakeholder map (CMO + CMIO + Compliance + Revenue Cycle + IT)

Single buyer-readable ROI number scoped per use case; BAA signed before any PHI architecture conversation

WEEK 3–4 Architecture + SaMD Scoping

EHR integration design against Epic / Cerner / Athenahealth / eClinicalWorks, HIPAA PHI-flow review, SaMD-threshold determination per use case, FDA AI/ML guidance gap-analysis where applicable, state-law mosaic checklist (SB 1120 first)

Architecture signed by your CMIO and Compliance lead before any prompt is written; SaMD scope locked in writing

WEEK 5–10 MVP Build

Runnable agent against eval set plus de-identified clinical data, weekly demo with the clinical-informatics liaison in the room, Langfuse observability with PHI redaction at the logging layer, model cards drafted

Baseline accuracy on the eval set; vector partitioning per tenant verified; physician-in-loop checkpoints tested against the protocol

WEEK 11–16 Production + Governance Pack

Hardening against EHR API failure modes, fallback policies, rollout to a pilot service line, governance pack assembled (intended use, validation, monitoring plan), SB 1120 physician-supervision documentation in place

All eval gates green; physician sign-off path documented; governance pack reviewed by CMO and Compliance

WEEK 17+ Optimise + Handoff

Cost engineering, prompt iteration, runbook in your repo, drift-monitoring alerts wired to the clinical-informatics team, ownership transfer

Two cadence notes for healthcare specifically

The clinical-informatics liaison shows up week 1, not week 12. Half the use cases on this page — UC-1 ambient docs, UC-2 clinical knowledge RAG, UC-5 patient triage — depend on specialty-specific decisions that are genuinely clinical decisions, not engineering ones (which note template the cardiologists actually use, what the symptom-triage refusal-window is for chest pain, when the agent stops and waits for a human). We've found the first-week unblock is almost always getting the CMIO and the clinical-informatics liaison into the architecture conversation before the model registry is locked, because changing the validation approach or the human-in-loop threshold at week 8 costs 3–5× what it costs to design it in at week 1. The second cadence note: governance-pack assembly lands at week 11–16, not after launch. The first OCR-shape readiness review, the first SaMD-threshold sign-off, and the SB 1120 supervision documentation are pre-launch gates. We've seen too many healthcare AI vendors ship a working model that then sits unused for two quarters waiting on a governance pack the team never scoped.

006 / TEAM & PRICING

## Team shape and pricing for a healthcare AI engagement.

Two tier shapes cover roughly 80% of healthcare AI engagements — across mid-size provider groups, regional health systems, payers, and HealthTechs. MVP for a single high-clarity use case with the EHR integration scaffolding sized accordingly; Platform for the multi-use-case build on shared infrastructure plus HIPAA governance that most operators actually need. Enterprise tier (4 eng + 3 ML + 1 PM + 1.5 clinical liaison + SaMD scoping, $720K+, 36+ weeks) sits behind these for org-wide AI orchestration. We're honest about being a logistics, fintech, and insurance AI house extending into healthcare — the engineering patterns rhyme but we don't have a decade of clinical-AI reference deck to wave at you, and any vendor in this market who's promising that without naming the EHR they wrote into is selling case-study theatre.

MVP tier — one use case

Platform tier — 3–5 use cases + HIPAA scaffolding

Scope

One use case shipped to production (e.g. UC-1 ambient documentation or UC-3 prior auth)

3–5 use cases on shared EHR integration plus HIPAA scaffolding

Team shape

2 eng + 1 ML + 0.5 PM + 0.5 clinical-informatics liaison

3 eng + 2 ML + 1 PM + 1 clinical-informatics liaison

Timeline

12–16 weeks

20–32 weeks

Indicative range

$120K–$180K

$320K–$520K

Healthcare MVP starts above insurance-equivalent because the HIPAA architecture review, EHR integration surface (Epic / Cerner / Athenahealth / eClinicalWorks), and the clinical-informatics liaison add 3–4 weeks of overhead vs. a logistics build. **Platform tier is the median right answer** for mid-size provider groups, regional health systems, and HealthTechs that have an EHR already and need AI orchestration across documentation, RCM, and patient engagement. Enterprise tier (4 eng + 3 ML + 1 PM + 1.5 clinical liaison + SaMD scoping, 36+ weeks, $720K+) only fits when the engagement is genuinely org-wide AI orchestration across multiple service lines simultaneously.

Eval framework

Single eval set, 80–150 encounters or submissions

Shared eval harness across use cases, regression alarms on every model release, drift monitors routed to the clinical-informatics team

Observability

Langfuse traces with PHI redaction at write-time + cost dashboard

Langfuse + per-use-case cost attribution + model-card registry + drift alerts to CMIO

Stop-and-walk option

Yes — fixed scope, real option to stop after week 12

Phased gates at weeks 4 / 10 / 16; can collapse to single-use-case mid-flight

Click the indicative-range row for the take on which tier fits which provider-group or HealthTech shape. Enterprise tier scoped separately on request.

Sizing for documentation vs. RCM vs. patient-engagement workloads

Prior auth (UC-3) and RCM automation (UC-4) tend to fit cleanly inside the MVP tier because the eval gate is narrow (first-pass approval rate on a held-out PA submission set, denial-classification accuracy on a sampled denial-code set) and the integration surface is contained to one or two clearinghouses. Ambient documentation (UC-1), clinical knowledge RAG (UC-2), and patient triage agents (UC-5) almost always need Platform tier because the eval harness has to cover specialty variability, the corpus governance has to track every guideline-committee update, and the EHR write-back paths are load-bearing. We'd flag — based on adjacent regulated-vertical work — that more than one mid-size group has under-scoped an ambient-documentation build at MVP and lost 6–8 weeks rebuilding the specialty-template layer mid-flight because the cardiologists' charting style arrived sharper than expected.

The cheapest tier isn't the cheapest outcome

If you're shipping more than one AI use case in the next 12 months — and most healthcare teams that get to a serious AI strategy will — the MVP tier asks you to rebuild the EHR integration layer, the eval framework, the PHI redaction layer, and the HIPAA scaffolding twice. The second rebuild costs more than the first. Platform tier is the median right answer for mid-size provider groups, regional systems, and HealthTechs in the $50M–$1B revenue band because the shared infrastructure (eval harness, EHR adapters, RAG over clinical guidelines and chart corpora, model registry, governance-pack templates, observability via Langfuse with write-time PHI redaction) amortises across three to five use cases instead of one. We'd run MVP for two real cases: pre-scale operators testing whether healthcare AI pays back at all, and specialty practices with a single high-clarity workflow (usually prior auth or RCM automation) they want to ship in 14 weeks before greenlighting the platform investment. Both are legitimate; neither is most organizations.

007 / FAQ

## Healthcare AI buyer FAQ.

Four questions we'd expect on almost every healthcare AI first call, answered the way we'd answer them on the call. Specific numbers, named tools, the actual decision rules — not generic vendor-deck answers. We've kept it to four because we'd rather answer four questions well than five questions thinly.

How much does it cost to add AI to our EHR or RCM stack?

Three bands. An **MVP build of a single use case** — ambient documentation, prior auth, or RCM automation — runs $120K–$180K over 12–16 weeks (2 engineers, 1 ML engineer, 0.5 PM, 0.5 clinical-informatics liaison). A **Platform build covering 3–5 use cases on shared EHR integration plus HIPAA scaffolding** runs $320K–$520K over 20–32 weeks (3 eng + 2 ML + 1 PM + 1 clinical-informatics liaison). **Enterprise engagements** with org-wide AI orchestration plus SaMD scoping start at $720K and run 36+ weeks. Healthcare MVP starts above logistics or insurance equivalents because the HIPAA architecture review, the Epic / Cerner / Athenahealth integration surface, and the clinical-informatics liaison add 3–4 weeks vs. a horizontal build. We've watched two mid-size groups skip the liaison line item to compress the budget; both rebuilt the documentation pipeline at month four because the specialty templates didn't match how their clinicians actually chart. The liaison isn't optional — it's the cheapest line item on the spec.

Build vs. buy: when does in-house AI orchestration beat a clinical-AI vendor (Hippocratic AI, Notable Health, DeepScribe)?

Buy when the AI feature is genuinely commodity for your shape — generic ambient scribe for a primary-care group, off-the-shelf nurse-triage agent for a single specialty, vendor-managed prior-auth platform for a payer with a clean policy corpus. The clinical-AI vendors do those workloads well and their per-provider pricing usually beats a custom build for the first two years. Build the orchestration layer when AI touches your **differentiated workflows, your specific specialty mix, or the EHR-write-back patterns your CMIO cares about**. We're not a clinical-AI vendor — we wrap your existing Epic / Cerner / Athenahealth / eClinicalWorks stack with HIPAA-aligned orchestration, and we're honest about that. We've seen a regional health system buy three clinical-AI tools in 18 months, find that none of them composed into the chief medical officer's specialty-specific workflow, and end up needing an orchestration layer on top of all three. That orchestration layer is what we build. We design a [grounded retrieval layer over clinical guidelines, formulary, and the patient's chart with citation-mandatory answers](/services/rag-development/) that wraps the clinical-AI vendor you've licensed, not replaces it.

How do you handle HIPAA and California SB 1120 when an AI agent influences a clinical or utilization-management decision?

The first thing to be straight about: our agents never make binding clinical or coverage decisions. They draft, route, flag, and assemble — physicians and utilization-review nurses approve. The alignment with HIPAA (BAA signed at kickoff, PHI never leaves your VPC, vector stores per-tenant-partitioned, prompts and outputs logged with PHI redacted at write-time rather than scrubbed afterward) is an architecture choice at week 3, not a documentation pass at week 14. For California SB 1120 (the Physicians Make Decisions Act, effective Jan 2025), every utilization-management workflow includes a documented physician-supervision step before any decision with coverage consequence — the AI assembles the criteria-match analysis, the physician signs. Colorado AI Act and the 12+ state bills moving through legislatures layer on top, and the governance pack documents for the state regulator the carrier or provider hasn't met yet, not the regulator from last year. HHS OCR's 2024 enforcement priorities include AI and algorithmic accountability broadly; we design for that audit, not last year's. The opinionated take most healthcare AI vendors skip: an AI workflow without a documented physician-oversight step isn't a compliance edge case in 2026, it's a procurement-conversation dead-end with any UM-touching service line.

Realistic timeline for clinical ROI on a mid-size provider group or HealthTech?

Honest answer: 12–18 weeks from kickoff for the first measurable documentation-time or AR-90 delta on a single use case, and the lift compounds for another 2–3 quarters as the eval data tightens the agent's confidence thresholds and the clinical-informatics team starts trusting the drift signals. The fastest single-use-case wins we'd target on a healthcare engagement: RCM automation (UC-4) at 12 weeks to first measurable AR-90 delta because the eval set is the carrier's denial-reason-code corpus; prior auth (UC-3) at 14 weeks to first-pass-approval delta on a single payer-line combination. The slower wins: ambient documentation (UC-1) and patient triage (UC-5), which both need 16–20 weeks before the eval set covers enough specialty variability or protocol-edge-case variability to trust the agent's outputs without heavy physician review. The honest framing we owe you up front: we're a logistics, fintech, and insurance AI house extending into healthcare. We've shipped HIPAA-aligned orchestration patterns in adjacent regulated verticals, and the engineering rhymes. But we don't have a 10-year healthcare reference deck — anyone in this market who's promising you one without naming the EHR they wrote into is selling case-study theatre. We'd rather scope conservatively and beat the timeline than promise a number that doesn't survive the first specialty rollout.

008 / START A HEALTHCARE AI ENGAGEMENT

## Book a discovery call. We'll name the *two AI features that'll move documentation hours or AR-90* and quote a build window.

No deck. Forty-five minutes with an engineering lead, your CMIO or revenue-cycle director in the room, and a follow-up memo within 48 hours scoping the MVP or Platform tier sized to your EHR and service-line mix.

[Talk to engineering](/contact/) [See the 5 use cases again](#use-cases)

009 / OTHER INDUSTRIES

## Adjacent industries we engage.

Healthcare sits next to three industries in our book where the AI build patterns rhyme — fintech and insurance for the regulated-data posture, SaaS for the platform-integration shape. Brief signposts; full pillars land as each ships.

[

INDUSTRY · FINTECH

AI for Fintech

KYC, fraud detection, model-risk governance under SR 11-7.

](/ai-for-fintech/)[

INDUSTRY · INSURANCE

AI for Insurance

Claims processing, underwriting model risk, FNOL triage, grounded chatbots.

](/ai-for-insurance/)[

INDUSTRY · SAAS

AI for SaaS

Sales agents, RAG copilots, churn prediction, embedded product AI.

](/ai-for-saas/)


---

## SECTION: 5.4. Industry: ai-for-ecommerce

_Source: https://www.paiteq.com/ai-for-ecommerce/_

# AI for Ecommerce — Paiteq

> Paiteq builds AI for ecommerce: catalog enrichment, AI search, personalization, returns triage — wrapped around your existing Shopify/Klaviyo stack.

**HTML version:** https://www.paiteq.com/ai-for-ecommerce/

## Key facts

- Workflows: catalog enrichment, AI search, personalization, returns triage.
- Stack integrations: Shopify, Klaviyo, OMS/3PL APIs.

## Related pages

- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI for Ecommerce · AI for Retail · AI Ecommerce Development

# *AI for ecommerce* + AI for retail — AI ecommerce development that lifts conversion without rebuilding your stack.

Ecommerce teams in 2026 sit in a three-way squeeze: funded competitors shipping AI features in 4–6 weeks, peak-traffic AI cost spikes that turn Black Friday into a CFO incident, and 50K–500K SKU catalogs sitting on 40–70% thin metadata that no merchandising team can manually enrich at scale. Paiteq does AI ecommerce development inside your existing ecommerce and retail stack — Shopify, BigCommerce, commercetools, Klaviyo, Algolia — wrapping catalog, search, personalization, cart recovery, returns, and forecasting in a layer that makes the stack smarter without replacing it. AI for retail and AI for ecommerce buyers tend to look the same on paper; the orchestration shape is what changes. We stay through the first eval-drift cycle and the first peak-traffic day, not the deploy.

[Talk to engineering](/contact/) [See the 7 use cases](#use-cases)

Use cases 7 · catalog · search · personalization · returns · ops

Engage MVP · Platform · Enterprise

Stack Claude · Pinecone · Algolia · Klaviyo · Shopify

Risk PCI-DSS · GDPR Art 22 · Brand-safety

001 / WHY NOW

## Why ecommerce and retail teams pick AI ecommerce development partners right now.

Ecommerce founders and CTOs in 2026 face three pressures running in parallel: AI feature parity with funded competitors, peak-traffic cost spikes that can blow up the AI line item in a single week, and catalog metadata gaps that no manual merchandising team can fill at scale. Each pressure on its own would be manageable. Together, they're why AI for ecommerce has moved from R&D experiment to board-level agenda since 2024, and why every AI in ecommerce conversation we walk into now starts with a CFO question rather than a CTO one. AI for retail used to sit in an innovation team; in 2026 it sits in the operating plan. The teams shipping well aren't replacing Shopify, Klaviyo, or Algolia — they're wrapping those primitives with an orchestration layer that makes them smarter. AI for Shopify retailers specifically lands inside an existing Shopify Plus or headless Shopify build without surgery.

0 –6w

Competitor AI feature cadence

Funded ecommerce competitors shipping AI features in 4–6 weeks; pre-AI roadmaps lose buyer comparisons on demo day.

0 –20×

Peak-day AI cost spike risk

Black Friday / Prime Day inference costs spike 5–20× without routing discipline; uncontrolled, AI becomes the most expensive P&L line that week.

0 –70%

Catalog SKUs with thin metadata

Mid-market catalogs (50K–500K SKUs) sit on 40–70% missing descriptions, alt text, attribute taxonomy. AI enrichment compresses cost-per-SKU $0.85 → $0.06–$0.12.

PRESSURE 01

Cadence: 4–6 week competitor AI feature sprints

Funded DTC brands and multi-brand marketplaces are shipping AI features in 4–6 week sprints — AI search, generative product descriptions, agent-driven cart recovery, returns triage — and the buyer comparison goes badly when the customer demoes a competitor that has them and you don't. We've watched a perfectly-run Shopify Plus retailer lose a wholesale buyer over a single missing AI feature the competitor shipped in five weeks on Claude Sonnet 4.6 plus Algolia rerank. The bottleneck isn't model capability — it's the eval framework, the brand-voice constraints, and the integration surface against your existing Klaviyo flows and your existing Algolia index. Those take 4–6 weeks regardless of which model you pick. Most AI ecommerce development teams underestimate the integration surface and over-budget for the model layer; we typically see the opposite distribution work better.

PRESSURE 02

Peak-day AI cost spike: the 5–20× problem

Black Friday inference costs spike 5–20× on naive AI feature builds. Prime Day for marketplace sellers does the same. Without a routing discipline — Claude Sonnet to Claude Haiku to a fine-tuned smaller model as load climbs — a successful AI feature becomes the single most expensive line item on the P&L that week. We've seen a Cyber Monday bill that ran 11× a normal Monday because the team shipped a generation-on-every-page-view pattern with no cap. The fix isn't to cap the feature; it's to route it. LiteLLM or OpenRouter for routing, Langfuse for per-use-case cost telemetry, and a smaller fine-tuned model hosted on Modal or Together or Fireworks for the peak tier. Cost stays predictable. Quality stays inside the eval-acceptable band. CFO stops side-eyeing the AI line every Monday.

PRESSURE 03

Catalog metadata gap: 40–70% of SKUs short of ship-ready

Mid-market catalogs sit on 40–70% items with thin descriptions, missing alt text, and inconsistent attribute taxonomy. The merchandising team caps out at 200–400 SKUs per week of manual enrichment. New seasons, new collabs, new brands on a marketplace — the backlog grows faster than headcount can shrink it, and the cost-per-SKU at $0.40–$1.20 means the math doesn't close. AI enrichment compresses cost-per-SKU to $0.06–$0.12 and lifts throughput 12–18× at quality your merchandising team will actually approve, as long as the brand-voice RAG layer and the human-sampling cadence are designed in. We've shipped this pattern across DTC fashion (180K SKUs), multi-brand marketplaces (~500K SKUs), and Shopify Plus retailers (~40K SKUs) and the cost-per-SKU and quality numbers hold across all three shapes.

The opinionated take

Most ecommerce AI projects fail because the team treats AI as a feature parallel to the stack instead of an orchestration layer inside it. Ecommerce that wins in 2026 doesn't replace Shopify, Klaviyo, or Algolia — it makes them smarter. The cost of choosing the wrong abstraction layer is typically 6–12 months of rebuilding the migration data once the AI feature scales beyond a pilot use case: the team rewires the catalog data flow, redoes the brand-safety gates, and frequently rebuilds the eval harness because the original one was bolted onto the wrong primitive. We don't get that number from theory.

— Paiteq engineering

002 / USE CASES

## The 7 highest-ROI AI use cases in ecommerce.

Below are the seven workflows we see ecommerce teams build first. They share three traits: each has a clear conversion-readable ROI number, each is deployable inside a 6–16 week window, and each compounds when you ship two or three together on shared infra rather than as standalone bets. The cards are dense on purpose — pain, with-AI workflow, named tools, and the ROI metric in the ecommerce buyer's vocabulary. Skim them, then read the two or three that match where your roadmap actually sits today.

USE CASE 01

### Catalog enrichment and AI-generated product descriptions

The Pain

A 50K–500K SKU catalog typically sits on 40–70% items with thin or missing descriptions, missing alt text, and inconsistent attribute taxonomy. Manual enrichment runs $0.40–$1.20 per SKU; merchandising teams cap out at 200–400 SKUs per week, and the backlog grows faster than headcount can shrink it.

With AI

A pipeline takes the product image plus the supplier feed plus your brand-voice guidelines and generates a structured attribute extraction, an 80–120 word SEO description, alt text, and a three-tier taxonomy assignment. Humans review on a sampling cadence rather than per item. Brand-voice RAG over your tone-of-voice doc keeps the generation inside your style guide; an image-attribute extractor reads the photo for colour, fit, material, and visible feature signals the supplier feed didn't capture.

12–18×

SKU enrichment throughput

Cost-per-SKU $0.85 → $0.06–$0.12; search relevance up 14–22% measured on click-through-to-conversion

Tools

Claude Sonnet 4.6GPT-4 VisionGeminiAlgoliaConstructordbtPineconeSnowflake

USE CASE 02

### AI search and intent-aware merchandising

The Pain

Algolia, Coveo, and native platform search return string matches. A query like "running shoes for wide feet under $120" returns brand-name matches not intent matches; 32–48% of search sessions end in "no relevant results" on long-tail queries. The conversion math gets ugly because the buyer is on a comparison day, not a browse day.

With AI

A query-understanding layer translates buyer intent into structured Algolia or Coveo facets and reranks the result set with semantic similarity over the catalog. We don't replace your search engine — we make its inputs and outputs smarter. The reranker reads a buyer's session signal alongside the query so a returning customer who's been looking at trail-running shoes for two weeks doesn't get reset to a cold result set.

18–28%

search → conversion lift

"No results" rate drops from ~40% → 8–14%; long-tail query coverage up measurably across the catalog

Tools

Claude HaikuPineconeTurbopufferCohere Rerank 3.5AlgoliaCoveoConstructor

USE CASE 03

### Personalization and recommendation explainability

The Pain

Klaviyo, Bloomreach, and Nosto recommendations work — but the merchandising team can't see WHY a customer got the "warm picks" row. When a campaign underperforms, the team can't debug; when legal asks why a particular customer sees a particular row, you can't answer. GDPR Article 22 makes the answer mandatory once enforcement bites.

With AI

An explanation layer on top of your existing recommender. It drafts "we showed this because" reasons for the merchandising team's review console and surfaces the same signal to customers on demand. Your recommender stays — Klaviyo, Nosto, Bloomreach, whichever — and the explanation engine reads the same signals the recommender used. The customer-facing transparency surface ships as a small Vercel AI SDK component you embed wherever you need the disclosure.

1–2 days

merchandising debug cycle (from 1–2 weeks)

Article 22 disclosure surface ships in ~4 weeks and becomes a sales asset for enterprise contracts that ask for it

Tools

Claude Sonnet 4.6KlaviyoNostoBloomreachPineconeVercel AI SDKLangfuse

USE CASE 04

### Agent-driven cart-recovery and checkout assistance

The Pain

Cart abandonment runs 68–82% across mid-market ecommerce. Email and SMS recovery captures 4–8% of abandoned carts; the rest is left on the table because the recovery message is generic and the buyer's actual blocker — sizing, stock, returns clarity — never gets named.

With AI

An agent reads the abandonment context — which products, time on page, prior purchase history, support history — and drafts a personalized recovery message. Where the buyer opted in, an in-app checkout-assist chat can answer "is this in stock in red size 9?" or "what's the return window for international orders?" against your live Shopify Storefront API, BigCommerce, or commercetools systems. The agent tool-calls; it doesn't guess.

11–18%

cart recovery rate (from 4–8%)

Checkout-assist conversion lift 8–14% on assisted sessions; AOV up 4–9% from accurate cross-sell context

Tools

LangGraphClaude Sonnet 4.6KlaviyoHubSpotShopify PlusBigCommercecommercetools

USE CASE 05

### Returns triage and RMA cost compression

The Pain

Returns processing burns 3–7 minutes per RMA — photo review, condition assessment, refund-vs-exchange decision, restock route. At 8–15% return rates, ops cost is 18–32% of the original sale margin on the returned cohort. That's where the DTC margin goes.

With AI

An agent reads the customer's return reason plus photos plus order context, classifies the condition (resellable, refurbish, dispose), routes to the right path, and drafts the customer response. Humans approve the exception cases; routine ones auto-process inside the policy guardrails your ops team set. The agent never refunds outside policy — it just removes the manual-trace step that ate the team's afternoon.

35–90s

RMA processing time (from 3–7 min)

Restock-vs-dispose accuracy up; customer refund-issued time compresses 24–72 hrs → 1–4 hrs

Tools

Claude Sonnet 4.6GPT-4 VisionLoopReturnlyHappy ReturnsNetSuiteBrightpearl

USE CASE 06

### Peak-traffic resilience and cost-cap discipline for AI features

The Pain

AI inference costs spike 5–20× on Black Friday, Prime Day, and flash-sale days. Without cost caps, a successful AI feature becomes the most expensive line item on the P&L for that week, and the CFO comes asking on Monday morning. We've watched it. It's not fun.

With AI

A model-routing layer that downshifts from Claude Sonnet to Claude Haiku to a fine-tuned smaller model based on traffic load and per-session cost budget. Quality stays in the eval-acceptable band; cost stays predictable. The cost telemetry is per-use-case so the CFO can read "search reranking cost us $X on Cyber Monday" instead of one undifferentiated AI line.

1.4–2.2×

peak-day AI cost spike (from 5–20×)

On-call pages from cost-runaway alerts drop to near-zero; per-use-case attribution lands in your existing BI

Tools

LiteLLMOpenRouterLangfuseLlama 4 70BMistral SmallModalTogetherFireworks

USE CASE 07

### Inventory and demand forecasting agents

The Pain

Demand planners run weekly forecasts; missed signals — TikTok virality, regional weather, competitor promo — create 2–6 week stockout windows. Excess inventory ties up 8–18% of working capital. Both ends of the error distribution hurt the P&L.

With AI

An agent reads sales velocity plus external signals (social trends, weather, licensed competitor-promo data) and surfaces "this SKU is about to spike — increase reorder by N units" recommendations for the planner. The planner approves; nothing auto-orders. The base forecast is classical ML; the agent's job is the signal-narrative layer that makes the planner's review faster.

8–15%

forecast accuracy lift on volatile SKUs

Stockout windows compress 35–55%; working capital tied in excess inventory drops 12–22%

Tools

XGBoostLightGBMMLflowClaude Sonnet 4.6SnowflakeBigQueryNetSuiteShopify Plus

A pattern worth flagging across all seven AI for ecommerce workflows above — and a working framing for AI in ecommerce more broadly: **the ROI numbers are the median of what we and similarly-shaped agencies have shipped**, not the headline outlier. Don't pick a use case for its ceiling. Pick the two with the cleanest conversion-readable ROI math for your stage — Shopify Plus retailers with a long-tail search problem start with UC-1 and UC-2; DTC brands with a cart-abandonment hole start with UC-4 and UC-3; multi-brand marketplaces with an ops drag start with UC-5 and UC-7. The next section maps each pain to the Paiteq service that does the actual engineering.

003 / SERVICE MAPPING

## How Paiteq services map to ecommerce needs.

Four common ecommerce pain shapes on the left, five Paiteq service pillars on the right. Hover any pain row to highlight which services we'd engage; hover a service to reverse-highlight the pains it solves. The descriptive anchors (not the service primary keyword) are deliberate — what matters to you is the workflow, not the service title.

AI feature parity pressure

Funded ecommerce competitors ship AI features in 4–6 weeks; pre-AI roadmaps lose buyer comparisons on demo day.

Catalog scale and content velocity

50K–500K SKU catalogs with 40–70% thin metadata; manual enrichment doesn't scale past 400 SKUs per week.

Peak-traffic resilience and cost discipline

AI inference spikes 5–20× on Black Friday and Prime Day; without routing discipline the AI line item runs the week.

Personalization without recommender lock-in

Klaviyo, Bloomreach, and Nosto recommendations work — explaining them under Article 22 and debugging them weekly doesn't.

[

Service

AI Agent Development

designing autonomous agent systems

](/services/ai-agent-development/)[

Service

RAG Development

grounded retrieval over product catalogs and policy docs

](/services/rag-development/)[

Service

LLM Development

model selection, fine-tuning, and evaluation

](/services/llm-development/)[

Service

Generative AI

brand-controlled content generation

](/services/generative-ai/)[

Service

AI Integration

drop-in AI integration into Shopify, BigCommerce, and Klaviyo stacks

](/services/ai-integration/)

Why the map looks like this

AI ecommerce development in 2026 is genuinely a multi-discipline engineering job — closer to platform integration work than to a typical Shopify-app build. Feature-parity pressure routes to three services because shipping AI search rerank is partly [designing autonomous agent systems](/services/ai-agent-development/), partly [model selection, fine-tuning, and evaluation](/services/llm-development/), and partly [drop-in AI integration into Shopify, BigCommerce, and Klaviyo stacks](/services/ai-integration/). Catalog scale routes to LLM work plus [brand-controlled content generation](/services/generative-ai/) plus agent orchestration because the enrichment pipeline isn't a single LLM call — it's an attribute-extraction step, a generation step, a brand-voice gate, and a quality-sampling routing decision.

Peak-traffic resilience routes to agent work plus integration because the model-routing layer (LiteLLM or OpenRouter) has to sit alongside your existing CDN and your existing autoscaling — not in a parallel system. Personalization without recommender lock-in routes to agent work, [grounded retrieval over product catalogs and policy docs](/services/rag-development/), and LLM work because the explanation layer reads the same signals your recommender used, retrieves the policy context, and drafts the merchant-facing or customer-facing transparency surface under Article 22. The discipline split isn't bureaucracy — it's how the engineering stays high-quality across a 16-week Platform build with merchandising, legal, and ops all watching the same use case.

004 / RISK

## Operational risk and data posture for ecommerce.

Three risk layers shape every AI for ecommerce engagement we run. PCI-DSS v4.0 is table stakes for anyone touching payment flows. GDPR Article 22 governs automated decisions — personalization, dynamic pricing, returns triage — and the EU DSA reinforces it with recommender-transparency obligations for large platforms. Brand-safety and AI content provenance close the loop on generated content, image search, and auto-merchandising. The ecommerce buyer's gate is brand safety plus payment scoping plus recommender transparency — not regulator-driven compliance in the fintech sense.

Audited annually · Continuous monitoring

-   PCI-DSS v4.0
    
    Card-data scope · tokenized metadata only
    
    AUDITED · 2026
    
-   GDPR Art 22
    
    Recommender transparency · explanation surface
    
    AUDITED · 2026
    
-   Brand-safety
    
    Content provenance · IP/licensing checks
    
    READY
    

Brand safety is the real gate, not a footnote

Every enterprise contract and every wholesale partner conversation now runs a brand-safety pass on your AI features. AI-generated product descriptions can hallucinate features the supplier didn't ship; AI image search can surface counterfeit or prohibited items if the embedding space isn't gated; auto-merchandising can violate brand or licensing guidelines if the IP check isn't wired in. FTC AI deception guidance covers all of it. The honest take: most "AI-powered ecommerce" marketing skips the brand-safety conversation entirely because it's uncomfortable. We don't. The brand-voice RAG layer and the IP/licensing gates are load-bearing, not optional add-ons.

PCI-DSS V4.0

PCI-DSS scope posture

AI features designed not to touch raw card data. The orchestration sits at the tokenized-metadata tier; vector stores never contain a PAN or CVV; LLM calls receive transaction and product context but not card primitives; observability traces redact PII at the logging layer. Secrets live in your existing PCI-scoped vault. The network-segment boundary stays where it already is — the assessor's report-on-compliance shouldn't change because we shipped a cart-recovery agent. Most ecommerce teams we engage with already have a clean PCI-scoped environment; our job is to design the AI work so the scope doesn't expand. Scope creep at audit is the failure mode here, and it's preventable. We've designed catalog-enrichment pipelines, cart-recovery agents, and returns-triage systems all to sit outside the cardholder-data environment, and the assessor sign-off has held across every engagement.

GDPR ART 22

Recommender transparency posture

Personalized recommendations, dynamic pricing, and returns-triage decisions are all automated decisions covered by Article 22, and reinforced by EU DSA recommender-transparency obligations for large platforms. Every automated decision in our builds is paired with a human-review fallback plus an explanation surface. Drafted disclosures — "we showed this because X, Y, Z" — are shippable in the merchandising console for debug and surfaced to the customer on demand. The transparency surface is a small Vercel AI SDK component, not a separate product, so it embeds wherever the disclosure needs to land. The pragmatic read: most enforcement we've watched lands not on the recommendation itself but on the merchant's inability to explain it when asked. The explanation layer turns that question into a 30-second answer.

BRAND-SAFETY

Content provenance and IP posture

AI-generated product descriptions can hallucinate features; AI image search can surface counterfeit or prohibited items; auto-merchandising can violate brand or licensing guidelines. FTC AI-deception guidance covers deceptive AI content, and the EU DSA adds platform-tier obligations for large marketplaces. Our generation pipeline gates on three things: a brand-voice RAG layer (the AI can't write outside your style guide), IP and licensing checks (image and trademark matching against a blocklist), and confidence-thresholded human review for high-risk categories — supplements, regulated goods, age-gated items. None of these are optional add-ons; they're the gates that keep the enterprise wholesale buyer's brand-safety pass from turning into a renegotiation. The honest take: most AI for retail vendors skip this entirely, and their customers find out the hard way at the first wholesale-partner review.

005 / ENGAGEMENT

## How an ecommerce AI engagement runs at Paiteq.

Five phases. Every phase has an explicit deliverable, a named owner inside your team, and a gate criterion that has to pass before the next phase starts. The cadence is weekly: a Monday standup with your Head of Ecommerce, Merchandising lead, Engineering lead, and Ops lead. Demo every Thursday. Brand-safety and PCI scoping track in parallel from week 1, not as a retrofit at security review.

Ecommerce AI Engagement · 16 weeks (typical Platform tier) 5 phases

WEEK 1–2 Discovery

Use-case prioritisation, conversion-side ROI scoping, stakeholder map (Head of Ecommerce + Merchandising + Engineering + Ops)

Single conversion-readable ROI number scoped per use case

WEEK 3–4 Architecture + Risk Scoping

Stack lock, PCI-DSS scope analysis, Article 22 explanation-surface design, brand-safety policy draft

Architecture signed by your ops lead and your legal contact before any prompt is written

WEEK 5–10 MVP Build

Runnable agent against eval set plus your real catalog, weekly demo, observability via Langfuse, peak-traffic cost caps wired in

Baseline accuracy hit on eval set; PCI scope unchanged; cost telemetry per use case

WEEK 11–16 Production + Peak Readiness

Hardening, fallback policies, model-routing for peak days, rollout, runbook for Black Friday / Prime Day on-call

All eval gates green; peak-day cost ceiling pre-tested at 3× expected load

WEEK 17+ Optimise + Handoff

Cost engineering, prompt iteration, runbook in your repo, eval-drift monitoring, ownership transfer

Two cadence notes for ecommerce specifically

The merchandising lead shows up week 1, not week 8. Half the use cases on this page — UC-1 catalog enrichment, UC-2 AI search, UC-3 personalization — depend on decisions that are genuinely merchandising decisions, not engineering ones (brand voice, attribute taxonomy, recommendation diversity). We've found the first-week unblock is almost always getting merchandising into the architecture conversation before the stack is locked, because changing the brand-voice RAG corpus or the attribute schema at week 4 costs 2–3× what it costs to design it in at week 1. The second cadence note: peak-day readiness lands at week 11–16, not after launch. Black Friday and Prime Day are real deadlines; we pre-test the model-routing layer at 3× expected load before sign-off so the first peak day isn't the first stress test. Most ecommerce AI vendors discover their cost ceiling at midnight on Cyber Monday. That's a bad way to learn it.

006 / TEAM & PRICING

## Team shape and pricing for an ecommerce AI engagement.

Two tier shapes cover roughly 80% of ecommerce AI engagements we run — across DTC brands, multi-brand marketplaces, and AI for Shopify Plus retailers specifically. MVP for a single high-clarity use case with the brand-safety scaffolding sized accordingly; Platform for the multi-use-case build on shared infra that most retailers in the $25M–$200M GMV band actually need. Enterprise tier (4 eng + 3 ML + 1 PM, $550K+, 28+ weeks) sits behind these for org-wide AI orchestration across merchandising, marketing, and ops simultaneously.

MVP tier — one use case

Platform tier — 3–5 use cases on shared infra

Scope

One use case end-to-end (e.g. UC-1 catalog enrichment or UC-5 returns triage)

3–5 use cases on shared infra plus brand-safety and PCI scoping

Team shape

2 eng + 1 ML + 0.5 PM

3 eng + 2 ML + 1 PM

Timeline

6–10 weeks

14–22 weeks

Indicative range

$70K–$120K

$220K–$380K

Ecommerce MVP starts lower than the equivalent fintech tier because the compliance scaffolding is lighter — PCI scoping is a defined surface, not an open-ended SR 11-7 inventory. **Platform tier is the median right answer** for ecommerce in the $25M–$200M GMV band. The Enterprise tier (4 eng + 3 ML + 1 PM, 28+ weeks, $550K+) only fits when the engagement is genuinely org-wide AI orchestration across merchandising, marketing, ops, and customer service simultaneously.

Eval framework

Single eval set, 30–50 examples

Shared eval harness across use cases, regression alarms in CI

Observability

Langfuse traces + cost dashboard

Langfuse + per-use-case cost attribution + peak-day load test harness

Stop-and-walk option

Yes — fixed scope, real option to stop after week 6

Phased gates at weeks 4 / 8 / 14; can collapse to a single-use-case build mid-flight

Click the indicative-range row for the take on which tier fits which GMV band. Enterprise tier scoped separately on request.

Sizing for catalog vs. checkout vs. ops workloads

Catalog enrichment (UC-1) and returns triage (UC-5) tend to fit cleanly inside the MVP tier because the eval gate is narrow and the integration surface is contained. AI search rerank (UC-2), personalization explainability (UC-3), and demand forecasting (UC-7) almost always need Platform tier because the eval harness, the feature store, and the retrieval infra are the load-bearing pieces. We've seen more than one mid-market retailer under-scope an AI-search build at MVP and lose 4–6 weeks rebuilding the eval set mid-flight because the merchandising team's quality bar arrived sharper than expected.

The cheapest tier isn't the cheapest outcome

If you're shipping more than one AI use case in the next 12 months — and most ecommerce teams that get to a serious AI strategy will — the MVP tier asks you to rebuild the eval framework, the model-routing layer, and the observability stack twice. The second rebuild costs more than the first. Platform tier is the median right answer for retailers in the $25M–$200M GMV band because the shared infra (eval harness, retrieval layer, model routing via LiteLLM, observability via Langfuse) amortises across three to five use cases instead of one. The MVP tier exists for two real cases: pre-scale retailers testing whether AI ecommerce development pays back at all, and Shopify Plus teams with a single high-clarity workflow they want to ship in 8 weeks before greenlighting the platform investment. Both are legitimate. Neither is most companies.

007 / WORK

## What we've shipped for ecommerce companies.

Three anonymised ecommerce engagements from the broader team's history. Segment and GMV band are real; metrics are real; the numbers were measured 60–90 days post-launch, not at deploy. Brand names removed under standard NDA. Anyone selling you headline outliers without the operating numbers under them is selling case-study theatre.

Merchandising

Mid-market DTC fashion · $40M GMV · NA

### Catalog enrichment + AI search rerank

A 180K SKU fashion catalog with ~58% thin descriptions and a long-tail search problem. We shipped the enrichment pipeline (Claude Sonnet plus GPT-4 Vision over the product photos) and the AI search rerank layer on top of Algolia in 9 weeks. Cost-per-SKU dropped $0.91 to $0.09 on enriched items; search-to-conversion lifted 21%; "no results" rate fell from 38% to 11% on long-tail queries.

0 %

search → conversion lift / 90d post-launch

Ops

Multi-brand marketplace · $120M GMV · EU

### Returns triage agent + brand-safety gates

Return rate sat at 14% across 40 brands with wildly different return policies. We shipped a triage agent (Claude Sonnet plus GPT-4 Vision for photo condition assessment) wired into Loop and Happy Returns, with per-brand policy guardrails and human-approval queues for exceptions. RMA processing compressed from 5.2 minutes mean to 48 seconds on routine cases. Customer refund-issued time dropped from 36 hours to 3 hours.

RMA processing 5.2m → 48s on routine cases

Growth

Shopify Plus retailer · $25M GMV · NA

### Cart-recovery agent + peak-traffic cost discipline

Cart abandonment at 76% with a 5.8% recovery rate via existing Klaviyo flows. We shipped a recovery agent (LangGraph plus Claude Sonnet with tool-calling into Shopify Storefront) plus a model-routing layer (LiteLLM) for peak days. Recovery rate climbed to 14.2%; Cyber Monday inference cost ran 1.7× a normal day instead of the previous year's 11×; AOV on assisted sessions up 7%.

0 %

Cart recovery → 14.2%

The shape across all three engagements

The conversion-readable ROI metric was scoped in week 2, before any code was written. The eval set grew during production via traces sampled monthly — not a static 50-example set left over from architecture. Handoff put the runbook in the client's repo, not in a shared doc. We engage as an ecommerce AI partner that stays through the first peak-traffic day and the first eval-drift cycle, not one that ships and disappears. Roughly half of the AI ecommerce engagements we close convert to a lighter-weight Run engagement after the build is in production; half don't, because the client's internal team has picked up ownership. Both outcomes are fine. The Run engagement is real work — prompt iteration, cost engineering, peak-day load testing, regression testing on new model releases — not a retainer hiding as a service.

008 / FAQ

## Ecommerce AI buyer FAQ.

Five questions we get on almost every AI for ecommerce first call, answered the way we'd answer them on the call. Specific numbers, named tools, the actual decision rules — not generic vendor-deck answers.

How much does it cost to add AI to our ecommerce site?

Three bands. An **MVP build of a single AI use case** — catalog enrichment, returns triage, or cart-recovery agent — runs $70K–$120K over 6–10 weeks (2 engineers, 1 ML engineer, 0.5 PM). A **Platform build covering 3–5 use cases on shared infra** runs $220K–$380K over 14–22 weeks. **Enterprise engagements** with org-wide AI orchestration across merchandising, marketing, and ops start at $550K and run 28+ weeks. Ecommerce MVP starts lower than fintech-equivalent because PCI scoping is a defined surface, not an open-ended model-risk inventory. Most AI ecommerce development work that ships well sits in the Platform tier — the shared infra (eval harness, model routing via LiteLLM, observability via Langfuse) amortises across the use cases instead of getting rebuilt every project.

Build vs. buy: when does in-house AI orchestration beat a Shopify-app vendor?

Buy when the AI feature is genuinely commodity — generic image tagging, basic recommendations, off-the-shelf chatbot — and a Shopify app fits inside your existing stack without surgery. Build the orchestration layer when AI touches your **differentiated catalog, your conversion-side decisioning, or your brand-safety surface**. Catalog enrichment with your tone-of-voice (UC-1), AI search rerank tuned to your category mix (UC-2), and recommendation explainability under Article 22 (UC-3) aren't workloads where a generic vendor's benchmark predicts performance on your catalog. We've watched mid-market retailers buy three different Shopify AI apps in 18 months, hit the second use case, and realise the apps don't compose. The rebuild costs more than a clean Platform build would have. Our [drop-in AI integration into Shopify, BigCommerce, and Klaviyo stacks](/services/ai-integration/) sits at the orchestration layer, not at the storefront — your existing apps stay where they earn their keep.

How do you handle PCI-DSS scope when adding AI features to checkout?

We design AI features so they don't touch raw card data. The orchestration sits at the tokenized-metadata tier: the vector store never contains a PAN or CVV, LLM calls receive transaction and product context but not card primitives, and observability traces redact PII at the logging layer. Secrets live in your existing PCI-scoped vault rather than a vendor's. The network-segment boundary stays where it already is — your assessor's report-on-compliance shouldn't change because we shipped a cart-recovery agent. Most ecommerce teams we engage with already have a clean PCI-scoped environment under PCI-DSS v4.0; our job is to design the AI work so the scope doesn't expand. Scope creep at audit is the failure mode here, and it's preventable. The [grounded retrieval over product catalogs and policy docs](/services/rag-development/) design pattern keeps the vector store on the metadata side of the line by default.

Which AI use cases have the highest ROI for mid-market ecommerce ($10M–$200M GMV)?

The four highest-ROI starting points we see in 2026 are: **catalog enrichment** (UC-1 — 12–18× SKU throughput, $0.85 → $0.06–$0.12 cost-per-SKU, search relevance up 14–22%), **AI search and rerank** (UC-2 — 18–28% search-to-conversion lift, "no results" rate down from 40% to 8–14%), **cart recovery agents** (UC-4 — recovery rate 4–8% → 11–18%, AOV up 4–9% on assisted sessions), and **returns triage** (UC-5 — RMA processing 3–7 min → 35–90 sec, refund-issued time 24–72 hrs → 1–4 hrs). Pick the two with the cleanest conversion-readable ROI math for your stage, ship them on shared infra, and let the eval data tell you which is next. Trying to ship five at once is how AI ecommerce development stalls — too many merchandising approvals running in parallel and the team loses the plot by week 8.

How long until we see conversion lift?

Honest answer: 8–14 weeks from kickoff for the first measurable conversion lift on a single use case, and the lift compounds for another 2–3 quarters as eval data trains the prompts and the rerank weights. AI for Shopify retailers tends to land faster than headless commerce builds because the integration surface is smaller. The fastest single-use-case wins we've shipped: catalog enrichment at 6 weeks to first measurable search lift; cart-recovery agent at 7 weeks to first recovery-rate delta. The slower wins: personalization explainability (UC-3) and demand forecasting (UC-7), which both need 12–16 weeks before the eval set is rich enough to trust the agent's outputs without heavy review. We'd rather scope conservatively and beat the timeline than promise a number that needs a CFO conversation in week 9. [Model selection, fine-tuning, and evaluation](/services/llm-development/) in week 1 is where the timeline either gets real or stays fictional.

009 / START AN ECOMMERCE AI ENGAGEMENT

## Book a discovery call. We'll name the *two AI features that'll move conversion or AOV* and quote a build window.

No deck. Forty-five minutes with an engineering lead, your real product context on the table, and a follow-up memo within 48 hours scoping the MVP or Platform tier sized to your catalog and traffic shape.

[Talk to engineering](/contact/) [See the 7 use cases again](#use-cases)

010 / OTHER INDUSTRIES

## Adjacent industries we engage.

Ecommerce sits next to three industries in our book where the AI build patterns rhyme — sometimes the workflow translates directly, sometimes the data posture changes the engineering. Brief signposts; full pillars land as each ships.

[

INDUSTRY · SAAS

AI for SaaS

Sales agents, RAG copilots, churn prediction, embedded product AI.

](/ai-for-saas/)[

INDUSTRY · FINTECH

AI for Fintech

KYC, fraud detection, model-risk governance under SR 11-7.

](/ai-for-fintech/)[

INDUSTRY · LOGISTICS

AI for Logistics

Routing agents, shipment Q&A, claims triage, ETA prediction.

](/ai-for-logistics/)


---

## SECTION: 5.5. Industry: ai-for-insurance

_Source: https://www.paiteq.com/ai-for-insurance/_

# Custom AI Insurance Development — Paiteq

> Paiteq does custom AI insurance development — claims, underwriting, FNOL triage, grounded chatbots — wrapped around your existing PAS/CMS/ACORD stack.

**HTML version:** https://www.paiteq.com/ai-for-insurance/

## Key facts

- Workflows: claims, underwriting, FNOL triage, grounded chatbots.
- Integrations: PAS, CMS, ACORD-aligned stacks.

## Related pages

- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)
- [AI Consulting](https://www.paiteq.com/services/ai-consulting/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI for Insurance · AI Insurance Development Company

# *AI for insurance* + custom AI insurance development — claims, underwriting governance, FNOL, and actuarial AI around your PAS/CMS stack.

Insurance leadership in 2026 sits in a three-way squeeze: combined-ratio pressure on every line as catastrophe-loss frequency keeps stair-stepping up, regulator scrutiny on every AI decision now that the NAIC AI Model Bulletin is moving through 20+ states' adoption pipelines, and a talent gap on actuarial plus claims that won't close in this hiring market. Paiteq is an AI insurance development company doing AI for insurance carriers, MGAs, brokers, and InsurTechs — we wrap your existing PAS (Guidewire, Duck Creek, Insurity), CMS (Snapsheet, Hi-Marley), and ACORD-data pipelines with AI decisioning across claims orchestration, underwriting model governance, FNOL triage, loss-run extraction, subrogation, and actuarial AI workflows. We're a custom AI insurance development partner that sits on top of Shift Technology, Gradient AI, and Roots.ai rather than competing with them. We stay through the first NAIC market-conduct review and the first eval-drift cycle, not the deploy.

[Talk to engineering](/contact/) [See the 7 use cases](#use-cases)

Use cases 7 · claims · UW gov · FNOL · actuarial · subrogation

Engage MVP · Platform · Enterprise

Stack Guidewire · Duck Creek · Claude · Pinecone · MLflow

Risk SOC 2 · NAIC AI Bulletin · MRM · ASOP 56

001 / WHY NOW

## Why insurance teams are evaluating AI for insurance from an AI insurance development company right now.

Chief claims officers, chief actuaries, and COOs at carriers and MGAs in 2026 face three pressures running in parallel: claims-leakage and LAE-ratio drift that no amount of adjuster headcount can fix, actuarial-governance load that absorbs senior talent every audit cycle even before a state regulator asks a question, and a service-call deflection gap that pushes AHT and E&O exposure simultaneously. Each pressure on its own would be manageable. Together, they're why ai in insurance industry conversations have moved from R&D experiment to operating-plan agenda since the 2023 NAIC AI Model Bulletin landed, and why every AI for insurance discussion we walk into now starts with a combined-ratio question rather than a tech question. The teams shipping insurance AI well aren't replacing Guidewire, Duck Creek, Insurity, Shift Technology, or Gradient AI — they're wrapping those primitives with an orchestration layer that makes them smarter, and that's the ai insurance development company shape every Round-3 zerovol SERP boutique now sells. The framing shift in 2026: AI in insurance industry stopped being a McKinsey slide and started being shipped code that tool-calls into the carrier's PAS.

0 –5d

FNOL → claim file assembled and routed

Each P&C claim file runs 18–40 documents (police reports, photos, repair estimates, prior policy, customer statements). Clerks spend 45–80 min assembling per file before adjuster review; leakage on missed coverage runs 3–8% of LAE.

0 –6w

Per-cycle actuarial governance time

Underwriting models drift quarterly but get reviewed annually. Each bias-test and disparate-impact pack runs two senior actuaries 3–6 weeks per audit cycle. State regulator questions land as 200-tab spreadsheets a month after the ask.

0 –14%

Subrogation recovery rate (industry baseline)

Mid-market P&C carriers recover on 7–14% of subrogation-eligible claims. Roughly 25–40% of recoverable opportunities never get worked because the subrogation review happens 60–120 days after claim close, not at FNOL.

The opinionated take

Most insurance AI projects fail because the team treats AI as a parallel system to the PAS instead of an orchestration layer inside it. A separate AI product that doesn't tool-call into ClaimCenter or PolicyCenter is a screen, not a decision. The cost of choosing the wrong abstraction layer is typically 5–12 months of rebuilding the integration scaffolding once the AI feature scales beyond a pilot — the team rewires the policy-forms ingestion, redoes the NPPI redaction layer, and almost always rebuilds the audit trail because the original one was bolted onto the wrong primitive and the chief actuary won't sign off on it. We don't get those numbers from theory; we've watched two carriers and one MGA do exactly this rebuild before engaging us.

— Paiteq engineering

002 / USE CASES

## The 7 highest-ROI AI use cases in insurance.

Below are the seven workflows we see insurance teams build first. They share three traits: each has a clear buyer-readable ROI number in insurance units (LAE points, combined-ratio movement, FNOL hours, severity-misclassification rate, subrogation recovery rate, underwriter time freed), each is deployable inside a 10–18 week window, and each compounds when you ship two or three together on a shared PAS/CMS integration layer rather than as standalone bets. The cards are dense on purpose — pain, with-AI workflow, named tools, and the ROI metric in the chief claims officer's or chief actuary's vocabulary. Skim them, then read the two or three that match where your roadmap actually sits today. The ai claims processing pattern below (UC-1) gets a deeper architecture treatment in a separate blog covering the FNOL-to-file agent loop.

USE CASE 01

### AI claims processing engine — FNOL to file-ready to routing

The Pain

AI claims processing is the workflow most carriers ask about first. From FNOL submission to claim file assembled and routed to the adjuster takes 2–5 business days on mid-market P&C carriers. Each file is 18–40 documents — police reports, photos, repair estimates, prior policy, customer statements — and claims clerks spend 45–80 min per file assembling before an adjuster ever reads it. Leakage from missed coverage details runs 3–8% of loss-adjustment expense. Most insurance AI vendors sell you an AI claims processing demo and a deck full of LAE-ratio screenshots; what they don't ship is the NAIC governance pack that lets your chief actuary sign off.

With AI

An agent reads the inbound FNOL (form submission, call transcript, email, photos, attached PDFs), classifies the claim type, pulls the matching policy plus prior claims plus relevant endorsements from your PAS, drafts a coverage analysis with a flag list of issues for the adjuster, and routes to the right segment queue. The adjuster approves or edits — no auto-decisions on coverage. The agent's job is the document assembly and the coverage read, not the liability call.

4–10 hrs

FNOL → file-ready (from 2–5 days)

Clerk time per file drops 45–80 min → 8–14 min; missed-coverage leakage drops 1.5–3 points; LAE ratio improves 2–4 points on mid-volume lines

Tools

Claude Sonnet 4.6GPT-4 VisionPineconeTurbopufferGuidewire ClaimCenterDuck Creek ClaimsSnapsheetHi-Marley

USE CASE 02

### Underwriting model risk and governance

The Pain

Underwriting models drift quarterly but most carriers review them annually. Bias-test artifacts get rebuilt manually by the actuarial team every audit cycle — 3–6 weeks of work per cycle. When the state regulator asks "show us the disparate-impact analysis for ZIP-code-correlated features", the answer is a 200-tab spreadsheet that took two senior actuaries a month to produce. AI underwriting without a continuous governance layer isn't AI underwriting; it's a model that'll get pulled at the next market-conduct exam.

With AI

A model-governance layer runs the bias tests, drift tests, and feature-stability tests continuously on every underwriting model in production. It surfaces a quarterly governance pack the chief actuary signs off and surfaces real-time alerts when a model crosses a drift threshold. The actuary stays the decision-maker; the platform produces the artifacts they need to sign off on. The base underwriting model stays classical ML (LightGBM or XGBoost) per ASOP No. 56 — the governance wrap is what's new.

2–4d

actuarial governance per cycle (from 3–6w)

Drift-related rate-inadequacy events catch 60–90 days earlier; state-regulator response time on AI-model questions compresses from weeks to hours

Tools

MLflowLightGBMXGBoostFairlearnAequitasEvidentlyClaude Sonnet 4.6

USE CASE 03

### Grounded policy assistant for service and agent workflows

The Pain

Policyholder service calls average 6–11 minutes; agent service calls average 12–22 minutes. 40–60% of those calls are answered by a policy provision a grounded assistant could have read out loud — but generic conversational AI hallucinates coverage details, and a wrong coverage answer is an E&O exposure with measurable downstream cost. A policy assistant that hallucinates a deductible isn't an assistant, it's a litigation primer.

With AI

A grounded policy assistant reads ONLY from your actual policy forms (with endorsements applied), the active claim record, and the agent's authorization scope. It refuses to answer outside that scope. Every answer cites the policy section it came from. It surfaces "I'm not sure — let me get an adjuster" instead of guessing. It handles policyholder FAQs, claim-status, and routine agent endorsement requests; complex coverage interpretation routes to a human. The AI insurance agent variant reads the broker's book and the carrier's appetite simultaneously, surfacing appetite-fit and renewal flags for the ai for insurance agents workflow.

32–48%

deflection rate on policyholder calls

Agent service-call AHT drops 18–28%; every AI answer is policy-cited and logged for E&O defensibility

Tools

Claude Sonnet 4.6PineconeLangChainGuidewire PolicyCenterDuck Creek ProducerInsurity

USE CASE 04

### FNOL triage and severity prediction

The Pain

Severity classification at FNOL is mostly heuristic — adjusters segment claims by line of business and dollar threshold. Total-loss vs. repairable on auto runs 12–18% wrong at intake; coverage-litigation flags on liability claims get missed 8–15% of the time and surface later as bad-faith exposure. Mid-tier carriers don't lose money on the big claims they triage right; they lose it on the medium claims that should've been triaged big.

With AI

A triage model reads the FNOL inputs — description, photos, location, prior-claims history, policy coverage — and predicts severity tier plus total-loss probability plus an early litigation-risk score. High-severity or high-litigation-risk claims route to senior adjusters immediately instead of waiting 5–14 days for re-triage. The triage call isn't auto-binding; it's a queue-priority signal with a confidence score the adjuster can override.

3–6%

severity-misclassification rate (from 12–18%)

High-severity claims hit senior-adjuster desks within 2–4 hours instead of 5–14 days; bad-faith flags catch 70–85% of cases that would have surfaced later

Tools

XGBoostLightGBMMLflowClaude Sonnet 4.6GPT-4 VisionGuidewire ClaimCenterDuck Creek Claims

USE CASE 05

### Loss-run extraction from commercial submissions

The Pain

Commercial underwriters receive loss runs as PDFs, Excels, and emailed text — every submission in a different shape. Manual extraction takes 25–60 min per loss run; data-entry errors run 4–9%. Underwriters spend 35–50% of their day keying loss runs, not doing actual underwriting judgment. That ratio's the real reason MGAs can't scale a UW team without a four-month ramp.

With AI

An extraction pipeline reads the inbound loss-run document (any format), normalizes to ACORD or your internal schema, validates against the submission's effective dates and prior-policy history, and pushes the structured record into your underwriting workbench. The underwriter reviews the parse and spot-checks; the underwriter never re-keys. Low-confidence fields surface with the source-document snippet so the spot-check is 30 seconds, not 30 minutes.

~0.5%

data-entry error rate (from 4–9%)

Loss-run keying drops 25–60 min → 90 sec – 4 min per submission; underwriter time freed for actual risk judgment up 30–45%

Tools

Claude Sonnet 4.6Mistral OCRAWS TextractPineconeGuidewire UnderwritingPortalDuck Creek ProducerACORD

USE CASE 06

### Subrogation analyzer and recovery surfacing

The Pain

Subrogation recovery rate sits at 7–14% across mid-market P&C carriers. Recoverable claims get missed because the subrogation review happens 60–120 days after claim close, and adjusters don't flag liability shifts at FNOL. Industry estimate: 25–40% of recoverable opportunities never get worked. The dollars are sitting on the floor — the carrier just doesn't have an analyst who reads every file in time.

With AI

An analyzer reads every claim file in real time, flags the ones with subrogation indicators (third-party fault, contractual recovery rights, comparative-negligence shifts), drafts the demand-letter outline, and surfaces the package to the subrogation team within 7–10 days of claim opening — not 60–120 days after close. The team reviews and pursues; the agent is the eyes on every file, not the negotiator.

13–22%

subrogation recovery rate (from 7–14%)

Missed-recovery rate drops 25–40% → 8–15%; recovery dollars hit the books 90–180 days earlier on average

Tools

Claude Sonnet 4.6PineconeGuidewire ClaimCenterDuck Creek ClaimsBTI Solutions

USE CASE 07

### Agentic ops for renewals, endorsements, and quote prep

The Pain

Renewal prep on commercial books burns 4–9 hours per account per cycle. Underwriting assistants assemble the renewal package — loss runs, exposure changes, market data, prior endorsements — before the underwriter looks at it. Endorsement processing carries a 5–11 business-day backlog on mid-size MGAs. The agentic ai insurance pattern that works here isn't "replace the underwriter"; it's "stop paying an UW assistant to do data entry".

With AI

A renewal agent assembles the prep package autonomously (pulls prior policy, runs the loss-run extraction, summarizes exposure changes, drafts the renewal narrative), then routes to the underwriter for decision — not data entry. An endorsement agent handles routine endorsements start-to-finish with policyholder confirmation; complex endorsements surface for underwriter review. The agent runs against your underwriting guidelines via grounded retrieval, not against generic LLM judgment.

25–55 min

renewal prep per account (from 4–9 hrs)

Endorsement backlog drops 5–11d → same-day to 2d on routine endorsements; UW assistants reallocate to broker support instead of data assembly

Tools

LangGraphClaude Sonnet 4.6PineconeVerisk LightSpeedISO MarketStanceACORDGuidewire PolicyCenterDuck Creek Producer

A pattern worth flagging across all seven workflows above: **the ROI numbers are the median of what we and similarly-shaped boutiques have shipped on custom AI insurance development engagements**, not the headline outlier. Don't pick a use case for its ceiling. Pick the two with the cleanest buyer-readable ROI math for your operating model — personal-lines carriers with FNOL backlog start with UC-1 and UC-4; commercial MGAs with UW keying drag start with UC-5 and UC-7; specialty carriers with a recovery gap start with UC-6 and UC-1. Adjacent specializations — actuarial AI for rate-adequacy work, agentic AI insurance patterns for renewal-prep loops, AI for insurance agents workflows on the broker side — get their own deeper treatments; this pillar is the AI for insurance orchestration view, not the per-workflow deep-dive. The next section maps each pain to the Paiteq service pillar that does the actual engineering.

003 / SERVICE MAPPING

## How Paiteq services map to insurance needs.

Four common insurance pain shapes on the left, five Paiteq service pillars on the right. Hover any pain row to highlight which services we'd engage; hover a service to reverse-highlight the pains it solves. The descriptive anchors (not the service primary keyword) are deliberate — what matters to the chief claims officer or chief actuary is the workflow, not the service title.

Claims leakage and FNOL backlog

FNOL → file-ready runs 2–5 days; missed-coverage leakage costs 3–8% of LAE. Document assembly burns 45–80 min per file before an adjuster reads anything.

Underwriting model risk and actuarial governance

UW models drift quarterly, reviewed annually. Bias-test packs take 3–6 weeks per audit cycle. State-regulator AI questions land as month-long spreadsheet exercises.

Policyholder and agent service-call load

40–60% of service calls answer a policy provision; generic chatbots hallucinate coverage and create E&O exposure. AHT on agent calls runs 12–22 min.

Commercial underwriting data-entry drag

Underwriters spend 35–50% of their day keying loss runs and ACORD forms. Error rate runs 4–9%; MGA UW teams can't scale without a four-month ramp.

[

Service

AI Agent Development

orchestrating renewal, endorsement, and subrogation agents that tool-call against your PAS

](/services/ai-agent-development/)[

Service

RAG Development

grounded retrieval over policy forms, endorsements, and prior-claim history

](/services/rag-development/)[

Service

LLM Development

claims-document classification, loss-run extraction, and model selection across the insurance document stack

](/services/llm-development/)[

Service

Machine Learning Development

training underwriting, severity, and total-loss models on your historical book

](/services/machine-learning-development/)[

Service

MLOps

model governance, drift monitoring, and the ASOP No. 56 plus SR 11-7-adjacent MRM scaffolding your chief actuary signs off on

](/services/mlops/)

Why the map looks like this

Building insurance AI in 2026 is genuinely a multi-discipline engineering job — closer to platform integration plus regulated-ML work than to a typical PAS-customisation build. Claims leakage routes to three services because a working claims agent is partly [orchestrating renewal, endorsement, and subrogation agents that tool-call against your PAS](/services/ai-agent-development/), partly [grounded retrieval over policy forms, endorsements, and prior-claim history](/services/rag-development/), and partly [claims-document classification, loss-run extraction, and model selection across the insurance document stack](/services/llm-development/). Underwriting governance routes to ML plus MLOps plus LLM because a working UW model wrap isn't a single LLM call — it's a base classical model (LightGBM or XGBoost) under ASOP No. 56, a continuous bias-test layer (Fairlearn or Aequitas), drift monitoring (Evidently), and an LLM narrative layer that drafts the chief actuary's governance pack.

Policyholder-service load routes to RAG plus agent plus LLM because the grounded chatbot pattern is fundamentally a retrieval problem — the AI's job is to read the policyholder's actual policy forms with endorsements applied and refuse to answer outside that scope, not to make the coverage call. Commercial UW data-entry drag routes to LLM plus RAG plus agent because loss-run extraction at scale needs [training underwriting, severity, and total-loss models on your historical book](/services/machine-learning-development/) for the base extraction, ACORD-form RAG for normalisation, and agent orchestration for the submission-to-workbench push. And every one of these touches [model governance, drift monitoring, and the ASOP No. 56 plus SR 11-7-adjacent MRM scaffolding your chief actuary signs off on](/services/mlops/) — MRM isn't a separate workstream, it's the spine the rest of the engagement hangs off. The discipline split isn't bureaucracy — it's how the engineering stays high-quality across a 24-week Platform build with claims, actuarial, compliance, and IT all watching the same use case land.

004 / RISK

## Operational risk and model governance for insurance.

Three risk layers shape every insurance AI engagement we run. SOC 2 Type II plus insurance-data posture is the B2B procurement gate — carriers, MGAs, and brokers won't let an AI vendor touch policy data, claim records, or NPPI without the attestation. The NAIC AI Model Bulletin (adopted 2023, now in 20+ states' adoption pipelines) plus state layers (Colorado SB 21-169 on algorithmic-discrimination in life UW; NY DFS Insurance Circular Letter No. 7 (2024) on third-party AI) sets the governance baseline. Model risk and actuarial governance under SR 11-7-adjacent MRM and ASOP No. 56 (Modeling, effective 2020 — its scope broadly covers AI/ML predictive models used in actuarial work; the American Academy of Actuaries' 2024 AI Practice Notes operationalize it for ML) closes the loop with the chief actuary. The insurance buyer's procurement gate is SOC 2 plus NAIC plus MRM — not regulator-driven privacy in the SaaS sense.

Audited annually · Continuous monitoring

-   SOC 2 Type II
    
    Insurance-data posture · per-tenant partitioning · NPPI redaction
    
    AUDITED · 2026
    
-   NAIC AI Model Bulletin
    
    5 AI Principles · Colorado SB 21-169 + NY DFS Circular Letter 7
    
    AUDITED · 2026
    
-   MRM + ASOP No. 56
    
    SR 11-7-adjacent · model cards · validation reports · champion-challenger
    
    AUDITED · 2026
    

Governance pack is the real gate, not the model choice

Every market-conduct exam and every reinsurance partner's questionnaire now asks how an AI system reached its decision and who reviewed it. AI-routed claims, AI-classified severity, AI-extracted loss runs, AI-drafted demand letters — each surfaces a reasoning trail plus confidence score plus the reviewing human's signature. The honest take: most insurance AI vendors skip the governance-pack conversation entirely because it's expensive engineering and a senior-actuary conversation they don't want to have, and their customers find out the hard way at the first state-regulator AI inquiry or the first internal MRM review. We don't. The model card, the validation report, and the champion-challenger setup are load-bearing, not optional add-ons.

SOC 2 TYPE II

Insurance-data posture

Carriers, MGAs, and brokers require SOC 2 attestation before any AI vendor touches policy data, claim records, or NPPI (non-public personal information — named insureds, SSN fragments, claim numbers tied to PII). Reinsurance partners ask the same question one layer up. We design AI features so insurance data never leaves your VPC: vector stores in Pinecone or Turbopuffer partition per insurer-tenant; embeddings never cross tenants; observability logs in Langfuse redact NPPI at the logging layer, not as a post-hoc scrub. The engagement signs DPA plus SOC 2 attestation review at kickoff. NAIC AI Model Bulletin §5 (third-party AI governance) applies — the carrier owns the AIS risk; our role is to design the controls that let them own it cleanly. Most carrier procurement teams we engage with already run a clean SOC 2 environment; our job is to design the AI scope so the attestation holds at next year's audit.

NAIC AI MODEL BULLETIN + STATE MOSAIC

AIS governance and state-level adoption

The NAIC Model Bulletin on the Use of AI Systems by Insurers (adopted 2023, now in 20+ states' adoption pipelines) sets the governance baseline. Colorado SB 21-169 (the first US state to enforce algorithmic-discrimination rules in life-insurance underwriting) and NY DFS Insurance Circular Letter No. 7 (2024) on third-party AI add layers. Every AIS we deploy carries a governance pack: intended purpose, training-data lineage, validation approach, bias-test results (Fairlearn or Aequitas), ongoing monitoring plan (Evidently), human-oversight protocol. The NAIC's five AI Principles — fair and ethical, accountable, compliant, transparent, and secure/safe/robust — map to specific engineering controls; they're not a checklist exercise. The framing matters: NAIC bulletin adoption is a state-by-state pipeline, not binding federal law everywhere, and the carrier's risk is the regulator they haven't met yet, not the regulator from last year.

MRM (SR 11-7-ADJACENT) + ASOP NO. 56

Model risk and actuarial governance

Insurers are extending banking-style Model Risk Management (the Fed's SR 11-7 framework, applied analogously rather than as a directly-binding rule) to AI/ML models in claims, underwriting, and pricing. The actuarial profession's ASOP No. 56 (Modeling, effective 2020) defines the chief actuary's responsibilities for model governance, validation, and ongoing review — and its scope already covers AI/ML predictive models used in actuarial work, operationalized by the American Academy of Actuaries' 2024 AI Practice Notes. Without MRM plus ASOP No. 56 alignment, your AI model can't be used in rate filings or reserves. Every production AI model carries a model card (intended use, limitations, performance bounds, known failure modes), a validation report, a drift-monitoring config (Evidently), a champion-challenger setup, and a documented human-override path. The chief actuary signs off; we build the artifacts they need to sign off on. Model risk isn't an afterthought — it's the architecture decision at week 3.

005 / ENGAGEMENT

## How an insurance AI engagement runs at Paiteq.

Five phases. Every phase has an explicit deliverable, a named owner inside your team, and a gate criterion that has to pass before the next phase starts. The cadence is weekly: a Monday standup with your Chief Claims Officer or Chief Underwriting Officer, the Chief Actuary, your Compliance lead, and your IT lead. Demo every Thursday. SOC 2 insurance-data posture, NAIC AI Model Bulletin governance-pack design, and the MRM scaffolding under ASOP No. 56 all track in parallel from week 1 — not as a retrofit at the security review.

Insurance AI Engagement · 18 weeks (typical Platform tier) 5 phases

WEEK 1–2 Discovery

Use-case prioritisation, LAE-ratio or recovery-rate ROI scoping, stakeholder map (Chief Claims Officer + Chief Actuary + Compliance + IT)

Single buyer-readable ROI number scoped per use case (LAE pts, combined-ratio pts, FNOL hours, recovery-rate %)

WEEK 3–4 Architecture + MRM Scoping

Stack lock against your PAS/CMS/ACORD layer, SOC 2 insurance-data posture review, NAIC AI Model Bulletin governance-pack design, MRM scaffolding scoped

Architecture signed by your chief actuary and chief claims officer before any prompt is written

WEEK 5–10 MVP Build

Runnable agent against eval set plus your real claims or UW data, weekly demo, observability via Langfuse, NPPI redaction wired in, model cards drafted

Baseline accuracy hit on eval set; vector partitioning per tenant verified; reasoning-trail logging in place

WEEK 11–18 Production + Governance Pack

Hardening, fallback policies for PAS or vector-store outages, rollout, NAIC governance pack assembled, MRM validation report drafted with the chief actuary

All eval gates green; champion-challenger setup live; governance pack signed by chief actuary; market-conduct-exam-ready

WEEK 19+ Optimise + Handoff

Cost engineering, prompt iteration, runbook in your repo, drift-monitoring alerts wired to actuarial team, ownership transfer

Two cadence notes for insurance specifically

The chief actuary shows up week 1, not week 12. Half the use cases on this page — UC-2 UW model governance, UC-4 FNOL triage, UC-7 agentic renewals — depend on decisions that are genuinely actuarial decisions, not engineering ones (what the confidence threshold is for an auto-routed claim, how the bias-test pack maps to the carrier's rate filings, when the agent stops and waits for a human). We've found the first-week unblock is almost always getting the chief actuary into the architecture conversation before the model registry is locked, because changing the validation approach or the human-in-loop threshold at week 8 costs 3–5× what it costs to design it in at week 1. The second cadence note: the governance pack assembly lands at week 11–18, not after launch. The first NAIC AIS principles review, the first MRM validation report, and the chief actuary's sign-off are pre-launch gates, not post-launch deliverables. We've seen too many insurance AI vendors ship a working model that then sits unused for two quarters waiting on a governance pack the team never scoped.

006 / TEAM & PRICING

## Team shape and pricing for an insurance AI engagement.

Two tier shapes cover roughly 80% of insurance AI engagements we run — across carriers, MGAs, and brokers. MVP for a single high-clarity use case with the PAS/CMS integration scaffolding sized accordingly; Platform for the multi-use-case build on shared infrastructure plus MRM scaffolding that most operators in the $100M–$1B written-premium band actually need. Enterprise tier (4 eng + 3 ML + 1 PM + 1 actuary-liaison, $680K+, 32+ weeks) sits behind these for org-wide AI orchestration across claims, underwriting, servicing, and MRM simultaneously. As an ai insurance development company we don't pretend the MVP tier is the right answer for everyone — it's a stepping stone for half our clients and a stop point for the other half.

MVP tier — one use case

Platform tier — 3–5 use cases + MRM scaffolding

Scope

One use case shipped to production (e.g. UC-4 FNOL triage or UC-5 loss-run extraction)

3–5 use cases on shared PAS/CMS integration layer plus MRM scaffolding

Team shape

2 eng + 1 ML + 0.5 PM

3 eng + 2 ML + 1 PM + 0.5 actuary-liaison

Timeline

10–14 weeks

18–28 weeks

Indicative range

$100K–$160K

$290K–$460K

Insurance MVP starts above logistics because NAIC AI Model Bulletin documentation plus MRM scaffolding adds 2–4 weeks vs. a logistics build. **Platform tier is the median right answer** for mid-market carriers and MGAs in the $100M–$1B written-premium band. The Enterprise tier (4 eng + 3 ML + 1 PM + 1 actuary-liaison, 32+ weeks, $680K+) only fits when the engagement is genuinely org-wide AI orchestration across claims, underwriting, servicing, and MRM simultaneously.

Eval framework

Single eval set, 50–100 claims or submissions

Shared eval harness across use cases, regression alarms in CI on every model release, drift monitors wired to actuarial team

Observability

Langfuse traces + cost dashboard + NPPI redaction logging

Langfuse + per-use-case cost attribution + model-card registry + drift alerts routed to chief actuary

Stop-and-walk option

Yes — fixed scope, real option to stop after week 10

Phased gates at weeks 4 / 10 / 18; can collapse to single-use-case build mid-flight

Click the indicative-range row for the take on which tier fits which written-premium band. Enterprise tier scoped separately on request.

Sizing for claims vs. UW vs. servicing workloads

FNOL triage (UC-4) and loss-run extraction (UC-5) tend to fit cleanly inside the MVP tier because the eval gate is narrow (severity-accuracy on a held-out claims set, extraction error rate on a sampled submissions set) and the integration surface is contained. Claims-processing engines (UC-1), grounded chatbots (UC-3), UW model governance (UC-2), and subrogation analyzers (UC-6) almost always need Platform tier because the eval harness, the governance-pack infrastructure, and the policy-forms corpus are the load-bearing pieces. We've seen more than one mid-market carrier under-scope a claims-agent build at MVP and lose 6–10 weeks rebuilding the policy-forms RAG layer mid-flight because the chief claims officer's coverage-read quality bar arrived sharper than expected.

The cheapest tier isn't the cheapest outcome

If you're shipping more than one AI use case in the next 12 months — and most insurance teams that get to a serious AI strategy will — the MVP tier asks you to rebuild the PAS/CMS integration layer, the eval framework, the NPPI redaction layer, and the MRM scaffolding twice. The second rebuild costs more than the first. Platform tier is the median right answer for mid-market carriers and MGAs in the $100M–$1B written-premium band because the shared infrastructure (eval harness, PAS adapters, RAG over policy forms and historical claims, model registry, governance-pack templates, observability via Langfuse) amortises across three to five use cases instead of one. We run MVP for two real cases: pre-scale operators testing whether insurance AI pays back at all, and specialty carriers with a single high-clarity workflow (usually loss-run extraction or subrogation surfacing) they want to ship in 12 weeks before greenlighting the platform investment. Both are legitimate; neither is most companies.

007 / WORK

## What we've shipped for insurance teams (anonymized).

Three anonymised insurance engagements from the broader team's history. Segment and written-premium band are real; metrics are real; the numbers were measured 90 days post-launch on the claims and recovery engagements and at the first audit cycle on the underwriting engagement, not at deploy. Brand names removed under standard NDA. Anyone selling you headline outliers without the operating numbers under them is selling case-study theatre.

Claims

Regional P&C carrier · $420M written premium · NA

### FNOL-to-file claims agent + coverage-read RAG

A P&C carrier with FNOL → file-ready sitting at 3.8 days mean and clerk time at 68 min per file. We shipped a claims agent (Claude Sonnet 4.6 plus GPT-4 Vision on damage photos and PDFs) with RAG over the policy-forms library and prior-claims corpus via Pinecone, tool-calling against Guidewire ClaimCenter for the case management. FNOL → file-ready compressed to 6.4 hours mean; clerk time dropped 68 min → 11 min; missed-coverage leakage dropped 2.1 LAE points; LAE ratio improved 3.4 points on auto-physical-damage lines inside two quarters.

FNOL → file-ready 3.8d → 6.4hr / 90d post-launch

Underwriting

Mid-market MGA · $180M written premium · commercial P&C · NA

### Loss-run extraction + UW model-governance wrap

An MGA with UW assistants spending ~42% of their day re-keying loss runs and bias-test prep eating 4.5 weeks per audit cycle. We shipped a loss-run extraction pipeline (Claude Sonnet 4.6 plus Mistral OCR, RAG over ACORD form definitions via Turbopuffer) and an MRM governance wrap (MLflow plus Fairlearn plus Evidently) on the underwriting models. Keying dropped to 2.1 min per submission at 0.4% error; actuarial governance compressed from 4.5 weeks to 2.8 days per cycle; the chief actuary signed off the first NAIC governance pack at week 18.

0 %

UW keying → 6% of day; gov pack 4.5w → 2.8d

Recovery

Specialty carrier · $90M written premium · liability lines · NA

### Subrogation analyzer + early-flag agent

Subrogation recovery rate sat at 9.1% with reviews happening ~85 days after claim close. We shipped a subrogation analyzer (Claude Sonnet 4.6 plus RAG over five years of historical subrogation files via Pinecone) that scored every open claim for liability-shift and third-party-fault indicators within 7 days of FNOL. Recovery rate climbed to 16.8%; missed-recovery dropped from ~32% to 11%; recovery dollars hit the books a mean 112 days earlier than baseline.

0 %

Subrogation recovery → 16.8%

The shape across all three engagements

The buyer-readable ROI metric was scoped in week 2, before any code was written — LAE-ratio target on the P&C carrier engagement, UW-time and governance-cycle target on the MGA engagement, recovery-rate target on the specialty engagement. The eval set grew during production via sampled traces, not a static set left over from architecture. The NAIC governance pack got signed by the chief actuary at week 18 or earlier on every engagement that needed one. Handoff put the runbook in the client's repo, not in a shared doc. We engage as a custom ai insurance development partner that stays through the first audit cycle and the first regulator AI inquiry, not one that ships and disappears. Roughly half of the insurance AI engagements we close convert to a lighter-weight Run engagement after the build is in production; half don't, because the client's internal team has picked up ownership. Both outcomes are fine. The Run engagement is real work — prompt iteration, cost engineering, drift retraining, governance-pack refresh on every model release — not a retainer hiding as a service.

008 / FAQ

## Insurance AI buyer FAQ.

Five questions we get on almost every insurance AI first call, answered the way we'd answer them on the call. Specific numbers, named tools, the actual decision rules — not generic vendor-deck answers.

How much does it cost to add AI to our PAS/CMS stack?

Three bands. An **MVP build of a single AI use case** — FNOL triage, loss-run extraction, or a grounded chatbot — runs $100K–$160K over 10–14 weeks (2 engineers, 1 ML engineer, 0.5 PM). A **Platform build covering 3–5 use cases on a shared PAS/CMS integration layer plus MRM scaffolding** runs $290K–$460K over 18–28 weeks (3 eng + 2 ML + 1 PM + 0.5 actuary-liaison). **Enterprise engagements** with org-wide AI orchestration across claims, underwriting, servicing, and MRM start at $680K and run 32+ weeks. Insurance MVP starts slightly above logistics-equivalent because the NAIC AI Model Bulletin governance pack plus model-risk scaffolding adds 2–4 weeks vs. a logistics build — every insurance AI development company that ships seriously now budgets that overhead from week 3, not as a retrofit at security review. Most insurance AI work that ships well sits in the Platform tier — the shared infrastructure (eval harness, model registry, PAS adapters, NPPI redaction, observability via Langfuse) amortises across the use cases instead of getting rebuilt every project.

Build vs. buy: when does in-house AI orchestration beat an InsurTech vendor (Shift Technology, Gradient AI, Roots.ai)?

Buy when the AI feature is genuinely commodity — basic claims-fraud scoring, off-the-shelf submission triage, generic chatbot — and the vendor's data and product surface already covers your line mix. Build the orchestration layer when AI touches your **differentiated claims handling, your underwriting decisioning, or your subrogation recovery**. FNOL-to-file claims agents with your specific coverage forms (UC-1), grounded chatbots tuned to your endorsements (UC-3), and subrogation analyzers reading your historical recovery patterns (UC-6) aren't workloads Shift Technology or Gradient AI will build for you — they sell platform breadth, not your specific decisioning. We've watched a $400M regional carrier buy two InsurTech tools in 18 months, realise neither composed into the chief claims officer's workflow, and rebuild the orchestration layer on top of both tools for less than the third tool's annual license. We design an [grounded retrieval layer over policy forms, endorsements, and prior-claim history](/services/rag-development/) that wraps the InsurTech you've licensed, not replaces it.

How do you handle NAIC AI Model Bulletin governance when an agent is making claims or underwriting decisions?

The first thing to be straight about: our agents never make binding coverage or underwriting decisions. They draft, route, flag, and assemble — humans approve. That alignment with the NAIC AI Principles (fair and ethical, accountable, compliant, transparent, and secure/safe/robust) that the Model Bulletin operationalizes is the architecture choice at week 3, not a documentation pass at week 14. Every AI system gets a governance pack: intended purpose, training-data lineage, validation approach, bias-test results (via Fairlearn or Aequitas), ongoing monitoring plan (Evidently), and a documented human-oversight protocol. State-level layers stack on top — Colorado SB 21-169 for life UW, NY DFS Insurance Circular Letter No. 7 (2024) for third-party AI — and the pack maps to each state's market-conduct-exam shape. The NAIC bulletin was adopted in 2023 and is moving through 20+ states' adoption pipelines; we document for the regulator the carrier hasn't met yet, not the regulator from last year. The opinionated take most insurance AI vendors skip: an AIS without a model card and a validation report isn't a compliance edge case, it's a market-conduct exam waiting to happen.

Which AI use cases have the highest ROI for a mid-market carrier or MGA ($100M–$1B written premium)?

The four highest-ROI starting points we see in 2026 are: **FNOL triage and severity prediction** (UC-4 — severity-misclassification drops 12–18% → 3–6%, high-severity claims hit senior desks within 2–4 hours), **loss-run extraction** (UC-5 — keying 25–60 min → 90 sec on high-confidence cases, error rate to ~0.5%), **FNOL-to-file claims processing** (UC-1 — file-ready 2–5d → 4–10hr, LAE ratio up 2–4 points), and for carriers with a recovery gap, **subrogation analysis** (UC-6 — recovery rate 7–14% → 13–22%). Pick the two with the cleanest buyer-readable ROI math for your operating model and let the eval data tell you which use case is next. Personal-lines carriers usually start with UC-1 and UC-4 because the claims leverage shows up inside two quarters; commercial MGAs start with UC-5 and UC-7 because the UW-assistant load is what's burning the team out. Trying to ship all four at once is how insurance AI engagements stall — too many actuarial and claims approvals running in parallel and the team loses the plot by week 16.

How long until we see combined-ratio or LAE-ratio improvement?

Honest answer: 12–18 weeks from kickoff for the first measurable LAE-ratio or recovery-rate lift on a single use case, and the lift compounds for another 2–3 quarters as the eval data tightens the agent's confidence thresholds and the actuarial team starts trusting the drift signals. The fastest single-use-case wins we've shipped: loss-run extraction at 10 weeks to first measurable underwriter-time delta; FNOL triage at 12 weeks to first severity-accuracy delta. The slower wins: claims-processing engines (UC-1) and subrogation analyzers (UC-6), which both need 14–18 weeks before the eval set covers enough line-of-business variability or recovery-pattern variability to trust the agent's outputs without heavy adjuster review. Combined-ratio movement on the whole book takes longer — typically 2–4 quarters of compounding LAE and leakage improvements before the financial statements show it. Anyone selling combined-ratio lift inside one quarter on a mid-market carrier hasn't actually shipped a claims agent against a real coverage corpus. [Model governance, drift monitoring, and the ASOP No. 56 plus SR 11-7-adjacent MRM scaffolding your chief actuary signs off on](/services/mlops/) in week 1 is where the timeline either gets real or stays fictional, and we'd rather scope conservatively and beat the timeline than promise a number that needs a board-level conversation in week 14.

009 / START AN INSURANCE AI ENGAGEMENT

## Book a discovery call. We'll name the *two AI features that'll move LAE ratio or recovery rate* and quote a build window.

No deck. Forty-five minutes with an engineering lead, your chief claims officer or chief actuary in the room, and a follow-up memo within 48 hours scoping the MVP or Platform tier sized to your line mix and written-premium band.

[Talk to engineering](/contact/) [See the 7 use cases again](#use-cases)

010 / OTHER INDUSTRIES

## Adjacent industries we engage.

Insurance sits next to three industries in our book where the AI build patterns rhyme — sometimes the workflow translates directly, sometimes the regulatory posture changes the engineering. Brief signposts; full pillars land as each ships.

[

INDUSTRY · FINTECH

AI for Fintech

KYC, fraud detection, model-risk governance under SR 11-7.

](/ai-for-fintech/)[

INDUSTRY · SAAS

AI for SaaS

Sales agents, RAG copilots, churn prediction, embedded product AI.

](/ai-for-saas/)[

INDUSTRY · LOGISTICS

AI for Logistics

Routing agents, shipment Q&A, claims triage, ETA prediction.

](/ai-for-logistics/)


---

## SECTION: 5.6. Industry: ai-for-logistics

_Source: https://www.paiteq.com/ai-for-logistics/_

# Logistics Software Development — AI — Paiteq

> Paiteq — logistics software development company building AI for routing, ETA, claims triage, and fleet ops around your existing TMS/WMS stack.

**HTML version:** https://www.paiteq.com/ai-for-logistics/

## Key facts

- Workflows: routing, ETA prediction, claims triage, fleet ops.
- Integrations: TMS, WMS.

## Related pages

- [Machine Learning Development](https://www.paiteq.com/services/machine-learning-development/)
- [AI Workflow Automation](https://www.paiteq.com/services/ai-workflow-automation/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI for Logistics · Logistics Software Development Company

# *AI for logistics* — logistics software development company and AI logistics development company for TMS/WMS-aware orchestration.

Logistics ops in 2026 sit in a three-way squeeze: shipper margin pressure pushing rate per mile down quarter over quarter, ELD-tightened driver capacity that won't loosen because the labor pool isn't there, and cross-border friction (CBP scrutiny, EU AI Act transparency expectations, customs error penalties) that turns a paperwork miss into a multi-week delay. Paiteq is a logistics software development company and an AI logistics development company that builds the AI orchestration layer inside your existing TMS/WMS/ELD/customs stack — McLeod, MercuryGate, Manhattan Active TM, Samsara, Motive, Descartes, CargoWise — wrapping AI route optimization, ETA prediction, AI fleet management, claims, customs, AI warehouse management, and AI in shipping visibility in a layer that makes the stack smarter without replacing it. We stay through the first eval-drift cycle and the first peak-volume push, not the deploy.

[Talk to engineering](/contact/) [See the 7 use cases](#use-cases)

Use cases 7 · routing · ETA · claims · customs · visibility

Engage MVP · Platform · Enterprise

Stack McLeod · Samsara · Descartes · Claude · Pinecone

Risk SOC 2 · FMCSA HOS · C-TPAT · EU AI Act

001 / WHY NOW

## Why logistics teams are evaluating AI for logistics right now.

Logistics COOs and VPs of Ops in 2026 face three pressures running in parallel: route-economics drift that no amount of static TMS planning can hold, claims and customs backlogs that absorb headcount the labor market won't replace, and fragmented visibility across project44, FourKites, OEM telematics, and ocean carrier APIs that surfaces problems after the shipper does. Each pressure on its own would be manageable. Together, they're why AI for logistics has moved from R&D experiment to operating-plan agenda since 2024, and why every AI in logistics conversation we walk into now starts with a cost-per-mile question rather than a tech question. The teams shipping logistics AI well aren't replacing McLeod, MercuryGate, Samsara, or Descartes — they're wrapping those primitives with an orchestration layer that makes them smarter, and that's the logistics software development company shape every Round-3 SERP boutique now sells. The framing shift in 2026: AI for supply chain stopped being a Deloitte slide and started being shipped code that tool-calls into the operator's stack.

0 –14%

Cost-per-mile drift above plan

Static TMS route plans fall apart by 10am; cost-per-mile creeps 8–14% above plan on volatile days, and OTD slips 6–11 points before the dispatcher finishes the manual re-route round.

0 –14d

Cargo claims FNOL → first response

Each claim file runs ~25 documents (BOL, POD, photos, carrier denial, recovery quote). Claims clerks spend 60–90 min assembling the file before an adjuster even reads it.

0 –14%

Cross-border HS-code error rate

Mid-volume freight forwarders misclassify HS codes 6–14% of the time. Each error is a CBP query (1–3 weeks delay), a duty overpayment, or a penalty exposure that turns into an audit.

The opinionated take

Most logistics AI projects fail because the team treats AI as a parallel system to the TMS instead of an orchestration layer inside it. A separate AI product that doesn't tool-call into McLeod or Manhattan Active TM is a chart, not a decision. The cost of choosing the wrong abstraction layer is typically 4–10 months of rebuilding the integration scaffolding once the AI feature scales beyond a pilot — the team rewires the ELD ingestion, redoes the HOS constraint layer, and frequently rebuilds the audit trail because the original one was bolted onto the wrong primitive. We don't get that number from theory; we've watched two carriers and one freight forwarder do exactly this rebuild before engaging us.

— Paiteq engineering

002 / USE CASES

## The 7 highest-ROI AI use cases in logistics.

Below are the seven workflows we see logistics teams build first. They share three traits: each has a clear buyer-readable ROI number in logistics units (cost-per-mile %, OTD points, claims AHT, HS-code accuracy %), each is deployable inside an 8–16 week window, and each compounds when you ship two or three together on a shared TMS/ELD integration layer rather than as standalone bets. The cards are dense on purpose — pain, with-AI workflow, named tools, and the ROI metric in the operator's vocabulary. Skim them, then read the two or three that match where your roadmap actually sits today. We've written about the routing-agent architecture pattern in detail in a separate blog covering ai route optimization at the constraint-solver layer.

USE CASE 01

### AI route optimization agent with live constraint re-planning

The Pain

Static route plans built nightly by your TMS fall apart by 10am. Traffic, weather, customer schedule slips, driver HOS limits, and inbound dock conflicts force dispatchers into manual re-routing 4–9 times per driver per day. Cost-per-mile creeps 8–14% above plan; OTD slips 6–11 points on volatile days. Most logistics AI vendors sell you a visibility dashboard and call it AI — that's a chart with a buzzword, not an agent that re-routes.

With AI

A routing agent reads the active state (driver positions, HOS remaining via ELD, dock-door availability via WMS, customer time-windows, weather, traffic) and proposes re-routes the dispatcher confirms with one click. The agent does NOT auto-dispatch — the dispatcher stays in the loop, but cycles move from "rebuild the route in OptimoRoute by hand" to "approve the agent's proposed change". The optimizer underneath is whatever you already license; the agent's job is the reasoning layer that picks which constraint to relax when two plans conflict.

4–8 pts

OTD recovery on volatile days

Cost-per-mile back to within 2–4% of plan; dispatcher load drops from ~85 re-routes/day to ~25 confirmation taps

Tools

LangGraphClaude Sonnet 4.6OR-ToolsOptimoRouteOnfleetRoutificDescartesMcLeodMercuryGateManhattan Active TMSamsaraMotiveGeotab

USE CASE 02

### ETA prediction with confidence intervals

The Pain

TMS-quoted ETAs are point estimates with no confidence — customers ask "where's my load?" and your CSR reads a number off a screen that's been wrong for 90 minutes. Customer-facing ETAs miss by ±2–6 hours on long-haul, ±20–45 min on last-mile. The honest take: a point-estimate ETA without an uncertainty band is a lie the CSR has to defend on the phone.

With AI

A forecasting model trained on historical lane performance plus live signals (driver location, fuel stops, dwell at origin, weather, border-crossing delay) produces an ETA plus an 80% confidence interval. CSRs and customers see both numbers; high-uncertainty loads get proactive shipper outreach before the call comes in. The base ETA is classical ML; the customer-facing narrative layer is what turns "±2 hrs uncertainty" into "we're tracking a 2-hour delay risk from Chicago weather".

30–45%

drop in "where is my load?" calls

ETA accuracy improves from ±2–6 hrs to ±35–80 min on long-haul; CSR handle-time per shipment drops 25–40%

Tools

XGBoostLightGBMMLflowClaude Haiku 4.5SnowflakeBigQueryproject44FourKites

USE CASE 03

### Claims triage and FNOL automation

The Pain

Cargo damage and loss claims take 5–14 days from FNOL to first carrier response. Each claim file is ~25 documents (BOL, POD with damage photos, customer statement, carrier denial letter, recovery quote). Claims clerks spend 60–90 min per claim assembling the file before adjuster review, and roughly half that time is re-keying data that already lives in your TMS.

With AI

An agent reads the inbound claim notification (email, EDI 998, photos), classifies the damage type, pulls the matching BOL and POD from your TMS, drafts the carrier demand letter, and surfaces the file with a recommended liability split for the claims adjuster. Adjuster approves or edits — no auto-decisions on liability. The RAG layer over historical claims gives the adjuster precedent matching on similar lanes and similar damage patterns, which is where recovery rate moves.

1–3d

FNOL → first carrier response (from 5–14d)

Clerk time per claim drops 60–90 min → 8–15 min; recovery rate on contested claims up 4–9 points

Tools

Claude Sonnet 4.6GPT-4 VisionPineconeTurbopufferCarrier411TIA-DAT

USE CASE 04

### AI warehouse management — slotting and pick-path optimization

The Pain

WMS slotting decisions (which SKU goes in which bin) get rebuilt quarterly by an analyst running a spreadsheet model. Between rebuilds, velocity drifts and pick paths grow 12–22% longer than optimal. On peak days, pickers walk 8–12 miles per shift unnecessarily, and the labor cost compounds across every dock door the facility runs.

With AI

A recommender continuously scores SKU-to-bin assignments against live velocity, seasonal forecasts, and physical zone constraints. It surfaces "move these 80 SKUs this weekend, expected pick-path saving 14%" recommendations for the WMS supervisor's review — the agent does NOT auto-rebalance the warehouse. The narrative layer explains WHY a SKU moved up the priority list so the supervisor can sanity-check before approving the wave.

10–18%

pick-path length reduction

Picker miles per shift drop 1.5–3 miles; SKU velocity-fit holds within 8% of optimal between quarterly resets

Tools

LightGBMMLflowClaude Sonnet 4.6Manhattan Active WMBlue YonderKörberSofteon

USE CASE 05

### Customs documentation and HS-code accuracy

The Pain

HS-code classification for cross-border shipments is wrong 6–14% of the time at mid-volume freight forwarders. Each misclassification triggers either a CBP query (1–3 weeks delay), a duty overpayment (lost margin), or an underpayment (penalty plus audit risk). Customs brokers manually re-key data from BOL and commercial invoice into ABI/AES every shipment — and the re-key step is where 60% of the errors enter the system.

With AI

A classifier reads the commercial invoice plus product description plus prior shipment history, proposes an HS code with a confidence score, and pre-fills the customs entry. The broker reviews high-confidence entries in seconds; low-confidence entries surface for manual research with a side-by-side history of similar prior classifications. EU AI Act-era transparency expectations (Art. 50 set the baseline): every AI-determined HS code logs with a reasoning trail for audit. The broker stays the decision-maker; the agent just removes the typing.

1.5–3%

HS-code error rate (from 6–14%)

Broker time per entry compresses 8–14 min → 90 sec on high-confidence cases; duty overpayments recovered, penalty exposure down measurably

Tools

Claude Sonnet 4.6PineconeTurbopufferCargoWiseDescartes OneViewWiseTech

USE CASE 06

### Multi-modal supply-chain visibility and anomaly detection

The Pain

Visibility data (project44, FourKites, MacroPoint, OEM telematics, ocean carrier APIs, rail tracers) lands in 6–9 separate systems. Operators only catch a delay when it's already a delay — by the time the shipper calls, the truck has been sitting 4 hours. Dashboards aren't a strategy; they're a record of what already broke.

With AI

An anomaly-detection layer reads the unified visibility feed, scores deviations against expected lane behavior, and surfaces "this rail container is 9 hours behind its 90% interval — likely Chicago yard congestion, recommend reroute via Memphis" exception alerts. Ops triages the top-N exceptions instead of watching dashboards. The unified ingest layer can be your existing visibility provider or a Kafka and Snowflake pipeline; the AI in shipping ops sits on top of whichever you've already paid for.

~25 min

time-to-detect on delayed shipments (was ~4 hrs)

Reactive customer escalations down 35–55%; on-time-in-full (OTIF) improves 3–7 points across the lane portfolio

Tools

Claude Sonnet 4.6SnowflakeLightGBMproject44FourKitesMacroPoint

USE CASE 07

### AI fleet management — driver-coaching and safety-event triage from ELD and dashcam data

The Pain

Safety managers review dashcam plus ELD events (hard braking, swerving, drowsiness flags, speeding) for hundreds of drivers per week. False-positive rate on triggered events is 40–65%, so coaches dismiss most of them — and miss the real risk events buried in the noise. The drivers who actually need coaching get the same dismissed-event treatment as the drivers who don't.

With AI

A triage layer classifies each event (true safety risk, coachable behavior, sensor false-positive), prioritizes the queue, and drafts the coaching note plus suggested conversation framing. The safety manager reviews the top-priority queue daily instead of all events weekly. The vision model reads the dashcam clip for context the ELD's accelerometer can't capture — a hard-brake event is very different when the dashcam shows a pedestrian dart-out versus a tailgating pattern.

10–18%

false-positive rate (from 40–65%)

Safety-manager time per driver per week drops from ~22 min → ~6 min; preventable-event recurrence down 18–28% within 90 days

Tools

GPT-4 VisionClaude Sonnet 4.6SamsaraMotiveLytxNetradyneKeepTruckin

A pattern worth flagging across all seven AI in logistics workflows above: **the ROI numbers are the median of what we and similarly-shaped boutiques have shipped**, not the headline outlier. Don't pick a use case for its ceiling. Pick the two with the cleanest buyer-readable ROI math for your operating model — asset-based carriers with a dispatch overload start with UC-1 and UC-2; 3PLs with a claims backlog start with UC-3 and UC-2; freight forwarders with cross-border exposure start with UC-5 and UC-6. The next section maps each pain to the Paiteq service that does the actual engineering. The cluster keywords — ai fleet management, ai warehouse management, ai supply chain visibility — get their own deeper blog treatment; this pillar is the AI for supply chain orchestration view, not the per-workflow deep-dive.

003 / SERVICE MAPPING

## How Paiteq services map to logistics needs.

Four common logistics pain shapes on the left, five Paiteq service pillars on the right. Hover any pain row to highlight which services we'd engage; hover a service to reverse-highlight the pains it solves. The descriptive anchors (not the service primary keyword) are deliberate — what matters to the operator is the workflow, not the service title.

Route-economics drift and dispatcher overload

Cost-per-mile drifts 8–14% above plan on volatile days; dispatchers manually re-route 4–9 times per driver per day against TMS plans that go stale by 10am.

ETA accuracy and CSR escalation load

Point-estimate ETAs miss by ±2–6 hrs on long-haul; CSRs absorb the customer call volume and have nothing to defend the number with.

Claims processing and customs accuracy drag

FNOL → first response runs 5–14 days; HS-code error rate sits at 6–14%. Both are document-heavy workflows AI can compress without auto-deciding.

Fragmented visibility across modes

project44, FourKites, OEM telematics, and ocean carrier APIs sit in 6–9 separate systems. Operators see the delay after the shipper does.

[

Service

AI Agent Development

orchestrating multi-step routing and claims-triage agents

](/services/ai-agent-development/)[

Service

RAG Development

grounded retrieval over BOLs, customs declarations, and lane history

](/services/rag-development/)[

Service

AI Workflow Automation

automating the dock-scheduling and claims-FNOL workflows that sit between your TMS and your back-office

](/services/ai-workflow-automation/)[

Service

Machine Learning Development

training ETA, demand-forecasting, and anomaly-detection models on your historical lane data

](/services/machine-learning-development/)[

Service

AI Integration

drop-in AI integration into McLeod, MercuryGate, Manhattan Active TM, Samsara, and Descartes stacks

](/services/ai-integration/)

Why the map looks like this

Building logistics AI in 2026 is genuinely a multi-discipline engineering job — closer to platform integration work than to a typical TMS-customisation build. Route-economics drift routes to three services because a working routing agent is partly [orchestrating multi-step routing and claims-triage agents](/services/ai-agent-development/), partly [automating the dock-scheduling and claims-FNOL workflows that sit between your TMS and your back-office](/services/ai-workflow-automation/), and partly [drop-in AI integration into McLeod, MercuryGate, Manhattan Active TM, Samsara, and Descartes stacks](/services/ai-integration/). ETA accuracy routes to ML plus RAG plus integration because a confidence-interval ETA isn't a single LLM call — it's a base forecasting model, a historical-lane retrieval step, a live-signal join (ELD position, weather, dwell), and a customer-facing narrative layer.

Claims and customs workflows route to agent work plus [grounded retrieval over BOLs, customs declarations, and lane history](/services/rag-development/) plus workflow automation because the FNOL flow and the HS-code flow are both document-heavy decision pipelines — the AI's job is to read 25 documents and surface the right answer with audit trail, not to make the liability or classification call. Fragmented visibility routes to [training ETA, demand-forecasting, and anomaly-detection models on your historical lane data](/services/machine-learning-development/) plus integration plus agent work because the anomaly model has to ingest the unified feed (project44, FourKites, OEM telematics), score deviations against expected lane behavior, and turn the result into an exception narrative the ops team can act on. The discipline split isn't bureaucracy — it's how the engineering stays high-quality across a 20-week Platform build with dispatch, claims, and IT all watching the same use case land.

004 / RISK

## Operational risk and data posture for logistics.

Three risk layers shape every logistics AI engagement we run. SOC 2 Type II plus shipper-data posture is the B2B procurement gate — shippers won't let an AI vendor touch their TMS data without the attestation. FMCSA Hours-of-Service rules (49 CFR Part 395) govern any AI that recommends or influences a dispatch decision. C-TPAT plus customs accuracy under CBP — reinforced by the EU AI Act transparency regime (Art. 50 set the baseline) that's raised the bar for how AI-generated customs declarations need to be documented — closes the cross-border loop. The logistics buyer's gate is shipper-data posture plus DOT/FMCSA fleet rules plus cross-border customs accuracy, not regulator-driven privacy in the SaaS or fintech sense.

Audited annually · Continuous monitoring

-   SOC 2 Type II
    
    Shipper-data posture · per-tenant partitioning
    
    AUDITED · 2026
    
-   FMCSA HOS
    
    49 CFR Part 395 · ELD read-only ingestion
    
    AUDITED · 2026
    
-   C-TPAT + EU AI Act transparency
    
    Customs accuracy · AI reasoning trail logged
    
    AUDITED · 2026
    

Audit trail is the real gate, not the model choice

Every C-TPAT review and every shipper SOC 2 questionnaire now asks how an AI system reached its decision. AI-determined HS codes, AI-recommended dispatch decisions, AI-classified claims liability — each has to surface a reasoning trail plus confidence score plus the reviewing human's signature. The honest take: most logistics AI vendors skip the audit-trail conversation entirely because it's expensive engineering, and their customers find out the hard way at the first CBP query or the first DOT compliance review. We don't. The reasoning-trail log and the read-only ELD ingestion pattern are load-bearing, not optional add-ons.

SOC 2 TYPE II

Shipper-data posture

Shippers (CPG brands, retail, manufacturing) require SOC 2 attestation before any AI vendor touches their TMS data, customer addresses, or lane pricing. Volume-discount pricing is competitive intel — a leak tanks the shipper relationship. We design AI features so shipper data never leaves your VPC: vector stores in Pinecone or Turbopuffer are partitioned per shipper-tenant; embeddings never cross tenants; observability logs in Langfuse redact PII (consignee names, addresses, phone numbers) at the logging layer, not as a post-hoc scrub. The engagement signs DPA plus SOC 2 attestation review at kickoff. Most shipper procurement teams we engage with already run a clean SOC 2 environment; our job is to design the AI work so the scope doesn't expand and the attestation holds at next year's audit.

FMCSA HOS · 49 CFR PART 395

ELD and Hours-of-Service posture

Any AI that recommends or influences a dispatch decision intersects with FMCSA HOS rules (49 CFR Part 395). Recommending a route a driver legally can't run is a compliance exposure plus a driver-pushback problem plus a safety risk all at once. The routing agent's constraint layer enforces live HOS remaining per driver, pulled read-only from your ELD (Samsara, Motive, Geotab) — the agent CANNOT propose a route that requires more drive time than the driver legally has available. The dispatcher confirms every recommendation; the agent never auto-dispatches. ELD ingestion stays read-only — we never write back to the ELD record, which preserves the audit trail DOT can subpoena under a roadside or compliance review. Most logistics AI vendors miss this; we make it the architecture gate at week 3.

C-TPAT + CBP + EU AI ACT ART 50

Customs accuracy and AI transparency posture

Cross-border freight forwarders and 3PLs operate under C-TPAT (Customs-Trade Partnership Against Terrorism) — AI-determined classifications, denied-party screening results, and documentation must all support a CBP audit. The EU AI Act transparency regime (Art. 50 set the baseline) has raised the bar for how AI-generated customs declarations and HS-code rationales for EU-bound shipments need to be documented. Every AI-determined HS code, screening result, and customs-form pre-fill logs with the reasoning trail plus confidence score plus the reviewing broker's signature. High-confidence entries auto-route to the broker queue with the AI rationale visible; low-confidence entries surface for manual research with side-by-side history of similar prior classifications via RAG over CargoWise or Descartes OneView data. The audit trail is what makes the AI feature defensible to CBP and EU customs — without it, the AI is a liability, not a tool.

005 / ENGAGEMENT

## How a logistics AI engagement runs at Paiteq.

Five phases. Every phase has an explicit deliverable, a named owner inside your team, and a gate criterion that has to pass before the next phase starts. The cadence is weekly: a Monday standup with your VP Ops, Dispatch or Claims lead, IT lead, and Safety or Compliance contact. Demo every Thursday. SOC 2 shipper-data posture, FMCSA HOS constraint design, and C-TPAT audit-trail design all track in parallel from week 1, not as a retrofit at security review.

Logistics AI Engagement · 16 weeks (typical Platform tier) 5 phases

WEEK 1–2 Discovery

Use-case prioritisation, route-economics or claims-AHT ROI scoping, stakeholder map (VP Ops + Dispatch lead + Claims manager + IT)

Single buyer-readable ROI number scoped per use case (cost-per-mile %, OTD pts, claims AHT, or HS-code accuracy %)

WEEK 3–4 Architecture + Risk Scoping

Stack lock against your TMS/ELD/WMS, SOC 2 shipper-data posture review, FMCSA HOS constraint design, C-TPAT audit-trail design

Architecture signed by your ops lead and your safety/compliance contact before any prompt is written

WEEK 5–10 MVP Build

Runnable agent against eval set plus your real lane data, weekly demo, observability via Langfuse, HOS-constraint enforcement wired in

Baseline accuracy hit on eval set; ELD ingestion read-only; AI reasoning trail logging in place

WEEK 11–16 Production + Peak Readiness

Hardening, fallback policies for ELD/visibility-feed outages, rollout, runbook for peak-volume days (produce season, holiday push)

All eval gates green; peak-day load tested at 3× expected volume; dispatcher confirmation flow validated

WEEK 17+ Optimise + Handoff

Cost engineering, prompt iteration, runbook in your repo, eval-drift monitoring, ownership transfer to your team

Two cadence notes for logistics specifically

The dispatch or claims lead shows up week 1, not week 8. Half the use cases on this page — UC-1 routing agent, UC-2 ETA prediction, UC-3 claims triage — depend on decisions that are genuinely operational decisions, not engineering ones (which constraints relax first when two plans conflict, what the recovery-rate threshold is for an auto-drafted demand letter, when the agent stops and waits for a human). We've found the first-week unblock is almost always getting the operating lead into the architecture conversation before the stack is locked, because changing the constraint hierarchy or the human-in-loop threshold at week 6 costs 2–3× what it costs to design it in at week 1. The second cadence note: peak-volume readiness lands at week 11–16, not after launch. Produce-season pushes, holiday surges, and end-of-quarter customer pulls are real deadlines for asset-based carriers and 3PLs; we pre-test the routing agent and the ETA model at 3× expected load before sign-off so the first peak day isn't the first stress test.

006 / TEAM & PRICING

## Team shape and pricing for a logistics AI engagement.

Two tier shapes cover roughly 80% of logistics AI engagements we run — across 3PLs, asset-based carriers, and freight forwarders. MVP for a single high-clarity use case with the TMS/ELD integration scaffolding sized accordingly; Platform for the multi-use-case build on shared infra that most operators in the $50M–$500M revenue band actually need. Enterprise tier (4 eng + 3 ML + 1 PM, $620K+, 30+ weeks) sits behind these for org-wide AI orchestration across routing, claims, customs, and visibility simultaneously. As a logistics software development company we don't pretend the MVP tier is the right answer for everyone — it's a stepping stone for half our clients and a stop point for the other half.

MVP tier — one use case

Platform tier — 3–5 use cases on shared infra

Scope

One use case shipped to production (e.g. UC-2 ETA prediction or UC-3 claims triage)

3–5 use cases on shared TMS/ELD integration layer

Team shape

2 eng + 1 ML + 0.5 PM

3 eng + 2 ML + 1 PM

Timeline

8–12 weeks

16–24 weeks

Indicative range

$90K–$140K

$260K–$420K

Logistics MVP starts slightly above ecommerce because TMS/ELD integration scaffolding is heavier — Samsara, McLeod, and Descartes don't have the same plug-and-play surface that a Shopify Storefront API does. **Platform tier is the median right answer** for 3PLs and asset-based carriers in the $50M–$500M revenue band. The Enterprise tier (4 eng + 3 ML + 1 PM, 30+ weeks, $620K+) only fits when the engagement is genuinely org-wide AI orchestration across routing, claims, customs, and visibility simultaneously.

Eval framework

Single eval set, 30–50 lane examples

Shared eval harness across use cases, regression alarms in CI on every model release

Observability

Langfuse traces + cost dashboard

Langfuse + per-use-case cost attribution + ELD/visibility-feed outage monitoring

Stop-and-walk option

Yes — fixed scope, real option to stop after week 8

Phased gates at weeks 4 / 10 / 16; can collapse to single-use-case build mid-flight

Click the indicative-range row for the take on which tier fits which revenue band. Enterprise tier scoped separately on request.

Sizing for routing vs. claims vs. customs workloads

ETA prediction (UC-2) and claims triage (UC-3) tend to fit cleanly inside the MVP tier because the eval gate is narrow (lane-level accuracy on a held-out set, claims-AHT against a baseline period) and the integration surface is contained. Routing-agent re-planning (UC-1), HS-code classification (UC-5), and multi-modal visibility (UC-6) almost always need Platform tier because the eval harness, the constraint-enforcement layer, and the unified-ingest infra are the load-bearing pieces. We've seen more than one mid-market 3PL under-scope a routing-agent build at MVP and lose 4–6 weeks rebuilding the HOS constraint layer mid-flight because the dispatcher's quality bar arrived sharper than expected.

The cheapest tier isn't the cheapest outcome

If you're shipping more than one AI use case in the next 12 months — and most logistics teams that get to a serious AI strategy will — the MVP tier asks you to rebuild the TMS/ELD integration layer, the eval framework, and the observability stack twice. The second rebuild costs more than the first. Platform tier is the median right answer for 3PLs and asset-based carriers in the $50M–$500M revenue band because the shared infra (eval harness, ELD adapters, RAG layer over historical lanes and claims, model routing, observability via Langfuse) amortises across three to five use cases instead of one. As a logistics software development company we run the MVP tier for two real cases: pre-scale operators testing whether logistics AI pays back at all, and freight forwarders with a single high-clarity workflow (usually HS-code classification) they want to ship in 10 weeks before greenlighting the platform investment. Both are legitimate; neither is most companies.

007 / WORK

## What we've shipped for logistics teams (anonymized).

Three anonymised logistics engagements from the broader team's history. Segment and revenue band are real; metrics are real; the numbers were measured 60–90 days post-launch, not at deploy. Brand names removed under standard NDA. Anyone selling you headline outliers without the operating numbers under them is selling case-study theatre.

Dispatch

Asset-based carrier · $180M revenue · NA

### Routing agent + HOS-aware re-planning

A 480-truck fleet with dispatchers re-routing 6–8 times per driver per day against a nightly TMS plan that went stale by mid-morning. We shipped a routing agent (LangGraph plus Claude Sonnet 4.6 with OR-Tools underneath, tool-calling against McLeod and Samsara) with hard HOS-remaining constraints pulled from the ELD in read-only mode. Cost-per-mile drift dropped from 11% above plan to 3% on volatile days; OTD recovered 5.4 points; dispatcher confirmation taps replaced ~82% of manual re-routes.

0 %

Cost-per-mile drift → 3% / 90d post-launch

Claims

Mid-market 3PL · $90M revenue · NA

### Claims-triage agent + RAG over precedent files

Cargo damage claims sat at 9.2 days average FNOL → first carrier response with ~24 documents per file. We shipped a triage agent (Claude Sonnet 4.6 plus GPT-4 Vision on damage photos) with RAG over five years of historical claims via Pinecone, tool-calling into the TMS for BOL/POD pulls. FNOL → first response compressed to 2.1 days; clerk time per claim dropped from 74 min mean to 12 min; recovery rate on contested claims up 6.8 points.

FNOL → first response 9.2d → 2.1d

Customs

Freight forwarder · $220M revenue · EU + NA

### HS-code classifier + ABI/AES pre-fill

HS-code error rate sat at 9.4% across ~14,000 cross-border entries per month, with brokers re-keying invoice data into CargoWise on every shipment. We shipped a classifier (Claude Sonnet 4.6 plus RAG over historical declarations on Turbopuffer) with confidence-thresholded human review and an EU AI Act-era reasoning trail logged per entry. Error rate dropped to 2.3%; broker time per high-confidence entry dropped from 11 min to 85 sec; duty-overpayment recovery covered the engagement cost inside 7 months.

0 %

HS-code error → 2.3%

The shape across all three engagements

The buyer-readable ROI metric was scoped in week 2, before any code was written — cost-per-mile drift target on the carrier engagement, FNOL-response target on the 3PL engagement, HS-code error-rate target on the freight-forwarder engagement. The eval set grew during production via sampled traces, not a static set left over from architecture. Handoff put the runbook in the client's repo, not in a shared doc. We engage as a logistics software development company that stays through the first peak-volume push and the first eval-drift cycle, not one that ships and disappears. Roughly half of the logistics AI engagements we close convert to a lighter-weight Run engagement after the build is in production; half don't, because the client's internal team has picked up ownership. Both outcomes are fine. The Run engagement is real work — prompt iteration, cost engineering, peak-day load testing, regression testing on new model releases — not a retainer hiding as a service.

008 / FAQ

## Logistics AI buyer FAQ.

Five questions we get on almost every logistics AI first call, answered the way we'd answer them on the call. Specific numbers, named tools, the actual decision rules — not generic vendor-deck answers.

How much does it cost to add AI to our TMS / WMS stack?

Three bands. An **MVP build of a single AI use case** — ETA prediction, claims triage, or HS-code classification — runs $90K–$140K over 8–12 weeks (2 engineers, 1 ML engineer, 0.5 PM). A **Platform build covering 3–5 use cases on a shared TMS/ELD integration layer** runs $260K–$420K over 16–24 weeks. **Enterprise engagements** with org-wide AI orchestration across routing, claims, customs, and visibility start at $620K and run 30+ weeks. Logistics MVP starts above ecommerce-equivalent because TMS and ELD integration scaffolding (McLeod, MercuryGate, Manhattan Active TM, Samsara, Motive) is heavier than a Shopify or Klaviyo build — these stacks don't have plug-and-play surfaces and most logistics software development company engagements spend 30–40% of the MVP budget just on the integration layer. Most logistics AI work that ships well sits in the Platform tier — the shared infra (eval harness, model routing, TMS/ELD adapters, observability via Langfuse) amortises across the use cases instead of getting rebuilt every project.

Build vs. buy: when does in-house AI orchestration beat a visibility-platform vendor (project44, FourKites)?

Buy when the AI feature is genuinely commodity — basic shipment tracking, off-the-shelf ETA, generic dashboard reporting — and the visibility vendor's data already covers your lane mix. Build the orchestration layer when AI touches your **differentiated dispatch decisioning, your claims recovery, or your customs accuracy**. Routing-agent re-planning with your live HOS constraints (UC-1), claims-triage with RAG over your precedent files (UC-3), and HS-code classification tuned to your product mix (UC-5) aren't workloads a visibility vendor will build for you — they sell platform breadth, not your specific decisioning. We've watched a $200M asset-based carrier buy three visibility tools in two years, realise the tools didn't compose into a dispatch decision, and rebuild on a clean orchestration layer that cost less than the third tool's annual license. Our [drop-in AI integration into McLeod, MercuryGate, Manhattan Active TM, Samsara, and Descartes stacks](/services/ai-integration/) sits at the decisioning layer, not at the tracking layer — your existing visibility license stays where it earns its keep.

How do you handle FMCSA HOS compliance when an AI agent is recommending dispatch decisions?

The routing agent's constraint layer includes live HOS-remaining per driver, pulled read-only from your ELD (Samsara, Motive, Geotab). The agent CANNOT propose a route that requires more drive time than the driver legally has available under 49 CFR Part 395 — the hard constraint is enforced before the LLM ever sees the candidate plan, not after as a post-hoc filter. The dispatcher confirms every recommendation; the agent never auto-dispatches. ELD ingestion is strictly read-only — we never write back to the ELD record, which preserves the audit trail DOT can subpoena. The opinionated take most logistics AI vendors skip: an AI that recommends a route a driver legally can't run isn't a compliance edge case, it's a product defect. The HOS constraint isn't a feature you bolt on at week 14; it's the architecture decision at week 3.

Which AI use cases have the highest ROI for a mid-market 3PL or asset-based carrier ($50M–$500M revenue)?

The four highest-ROI starting points we see in 2026 are: **ETA prediction** (UC-2 — 30–45% drop in "where is my load?" calls, CSR handle-time down 25–40%), **claims triage** (UC-3 — FNOL → first response 5–14d → 1–3d, clerk time 60–90 min → 8–15 min, recovery rate up 4–9 points), **routing-agent re-planning** (UC-1 — cost-per-mile back to within 2–4% of plan, OTD recovers 4–8 points), and for cross-border operators, **HS-code classification** (UC-5 — error rate 6–14% → 1.5–3%, broker time 8–14 min → 90 sec on high-confidence entries). Pick the two with the cleanest buyer-readable ROI math for your operating model and let the eval data tell you which use case is next. Asset-based carriers usually start with UC-1 and UC-2 because the dispatch leverage shows up in week 12; 3PLs with brokerage volume start with UC-2 and UC-3 because the CSR and claims load is what's burning the team out. Trying to ship all four at once is how logistics AI engagements stall — too many ops approvals running in parallel and the team loses the plot by week 14.

How long until we see cost-per-mile or OTD improvement?

Honest answer: 10–16 weeks from kickoff for the first measurable cost-per-mile or OTD lift on a single use case, and the lift compounds for another 2–3 quarters as eval data tightens the routing-agent's constraint weights. The fastest single-use-case wins we've shipped: ETA prediction at 8 weeks to first measurable accuracy improvement on long-haul lanes; claims-triage at 9 weeks to first FNOL-response delta. The slower wins: routing-agent re-planning (UC-1) and HS-code classification (UC-5), which both need 12–16 weeks before the eval set covers enough lane variability or product variability to trust the agent's outputs without heavy dispatcher or broker review. The pattern we won't promise: cost-per-mile lift inside 6 weeks. Anyone selling that number hasn't actually shipped a routing agent against a live HOS constraint. [Training ETA, demand-forecasting, and anomaly-detection models on your historical lane data](/services/machine-learning-development/) in week 1 is where the timeline either gets real or stays fictional, and we'd rather scope conservatively and beat the timeline than promise a number that needs a VP-Ops conversation in week 10.

009 / START A LOGISTICS AI ENGAGEMENT

## Book a discovery call. We'll name the *two AI features that'll move cost-per-mile or OTD* and quote a build window.

No deck. Forty-five minutes with an engineering lead, your real operating context on the table, and a follow-up memo within 48 hours scoping the MVP or Platform tier sized to your lane mix and fleet shape.

[Talk to engineering](/contact/) [See the 7 use cases again](#use-cases)

010 / OTHER INDUSTRIES

## Adjacent industries we engage.

Logistics sits next to three industries in our book where the AI build patterns rhyme — sometimes the workflow translates directly, sometimes the data posture changes the engineering. Brief signposts; full pillars land as each ships.

[

INDUSTRY · SAAS

AI for SaaS

Sales agents, RAG copilots, churn prediction, embedded product AI.

](/ai-for-saas/)[

INDUSTRY · FINTECH

AI for Fintech

KYC, fraud detection, model-risk governance under SR 11-7.

](/ai-for-fintech/)[

INDUSTRY · ECOMMERCE

AI for Ecommerce

Catalog enrichment, conversion-side search, recommendations.

](/ai-for-ecommerce/)


---

## SECTION: 6. Case Studies

_Source: https://www.paiteq.com/case-studies/_

# Case Studies — Paiteq

> Anonymized featured AI engineering work — agents, RAG systems, and intelligent automation shipped into production. Client names withheld at the client's request; named references available under NDA during discovery.

**HTML version:** https://www.paiteq.com/case-studies/

## Key facts

- Production engagements across ecommerce, fintech, healthcare, insurance, logistics, and SaaS.
- Anonymized by default. Named references shared under NDA during the discovery call.

## Related pages

- [Services hub](https://www.paiteq.com/services/)
- [Contact](https://www.paiteq.com/contact/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

Work

# *Anonymized* featured work.

Industry and segment are real; outcomes are real; brand names removed under standard NDA terms. Deep case studies land here as engagements close out and clients permit attribution.

001 / FEATURED

## Three engagements, three patterns.

Each card below is a real engagement. The function tells you the workload shape; the segment tells you the size; the outcome is the metric the client signed off on. Deeper case-study pages land here as clients permit attribution.

Sales

B2B SaaS · 11–50 emp

### Lead-qualification + outbound research agent

Pulls signals from LinkedIn, Crunchbase, the prospect's website, recent news. Scores fit against ICP, drafts personalised first-touch, escalates only above threshold.

0

SDR seats

Support

Health-tech · enterprise

### Tier-1 deflection agent

RAG over product docs + 18-month ticket archive. Resolves password / billing / onboarding without human touch. Clinical questions escalated with full context.

0 %

p1 ticket volume

Ops

Mfg · 200+ emp

### Invoice matching + AP routing agent

OCR + LLM extraction on PDF/scanned invoices. Matches against open POs in NetSuite, routes to approver via Slack. Exceptions go to ops lead with annotated diff.

<6 months

in

Start a project

## Want a *case study* of your own?

Pilot in 2–4 weeks. Production build in 8–16. Same-day response on every inbound.

[Talk to engineering](/contact/) [AI agent development](/services/ai-agent-development/)


---

## SECTION: 7. Blog Hub

_Source: https://www.paiteq.com/blog/_

# Blog — Paiteq

> Long-form posts on AI engineering, RAG, evals, agents, and production delivery. Authored in Sanity. For agents, every post is also available as markdown at `/blog/{slug}.md`.

**HTML version:** https://www.paiteq.com/blog/

## Key facts

- Topics: production AI engineering, RAG patterns, eval design, agent architectures, LLMOps.
- Markdown-for-agents: append `.md` to any blog URL for a parser-friendly rendition.

## Related pages

- [Services hub](https://www.paiteq.com/services/)
- [Case studies](https://www.paiteq.com/case-studies/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

● ENGINEERING NOTES

# Notes on *shipping AI*  
in production.

Technical writing on the things we build every day — agents, RAG, evaluation, framework trade-offs, voice systems, production failure modes. Every post is written by an engineer who ships the work, not a content-marketer.

FocusAgents · RAG · LLMs · Eval · Voice

Length2.5–4K words · technical depth

Posts live5 posts

[

![Frost-crystal lattice radiating from a central node — orchestration patterns in a multi-agent system](https://cdn.sanity.io/images/xr290ucr/production/73b0c0032d61a220527e6ff1e54f3c053cd3709b-1408x768.png?w=1200&q=75&auto=format&fit=max)

FEATURED May 17, 2026 · 27 min

## Multi-agent orchestration patterns: a 2026 production guide

Six multi-agent system patterns that actually ship in 2026 — supervisor, swarm, hierarchical, blackboard, sequential, hybrid — with framework picks and the production failure modes nobody warns you about.

Navin Sharma Read the post →

](/blog/multi-agent-orchestration-patterns/)

001 / ALL POSTS 5 total

-   [
    
    ![Macro photograph of frost dendrites on cold glass — the branching retrieval pattern of a customer service chatbot](https://cdn.sanity.io/images/xr290ucr/production/19608b3fa221073ea4d065bef54b13ee71b7c720-1408x768.png?w=600&q=70&auto=format&fit=max)
    
    ## Customer service chatbot: a 2026 buyer's guide
    
    A 2026 buyer's guide to customer service chatbots — RAG over your docs, eval gates on deflection, and what the LLM tier actually costs in production.
    
    Navin Sharma May 17, 2026 13 min
    
    
    ](/blog/customer-service-chatbot-buyers-guide/)
-   [
    
    ![Macro photograph of crystals forming on a microscope slide under polarised light — restrained single-frame laboratory documentation](https://cdn.sanity.io/images/xr290ucr/production/2114b823a72eb65cfc993c8551b5e2e9851f8126-1408x768.png?w=600&q=70&auto=format&fit=max)
    
    ## Generative AI services: a 2026 buyer's guide
    
    A 2026 buyer's guide to generative AI services — brand-controlled image, video, audio and multimodal pipelines, eval-graded outputs, and what the production pipeline actually costs.
    
    Navin Sharma May 17, 2026 17 min
    
    
    ](/blog/generative-ai-services-buyers-guide/)
-   [
    
    ![Abstract visualization of noise resolving into structure — diffusion and flow matching](https://cdn.sanity.io/images/xr290ucr/production/60323b12437696679107467a43f5b7f1e8705466-1408x768.png?w=600&q=70&auto=format&fit=max)
    
    ## Diffusion model vs flow matching: a 2026 buyer guide
    
    A 2026 buyer and builder guide to the diffusion model paradigm — flow matching, diffusion model architecture, sampling cost, and what to ship.
    
    Navin Sharma May 17, 2026 18 min
    
    
    ](/blog/diffusion-vs-flow-models/)
-   [
    
    ![Long-exposure photograph of fibre cabling and server status lights in a data centre — automation infrastructure](https://cdn.sanity.io/images/xr290ucr/production/6053843131f206a7235c526efad6e47a1f4b9e5e-1408x768.png?w=600&q=70&auto=format&fit=max)
    
    ## AI automation solutions: a 2026 buyer's guide
    
    A 2026 buyer's guide to AI automation solutions — what runs LLM-in-the-loop on n8n, Make and Temporal, where the cost lives, and how to ship eval-gated.
    
    Navin Sharma May 17, 2026 17 min
    
    
    ](/blog/ai-automation-solutions-buyers-guide/)

Want to ship AI?

## The inquiry form is *faster* than any post.

An engineer reads every inbound. Same business day on most replies.

[Talk to engineering](/contact/) [Explore services](/services/)


---

## SECTION: 8.1. Blog: multi-agent-orchestration-patterns

_Source: https://www.paiteq.com/blog/multi-agent-orchestration-patterns/_

# Multi-agent orchestration patterns: a 2026 production guide

> Six multi-agent system patterns that actually ship in 2026 — supervisor, swarm, hierarchical, blackboard, sequential, hybrid — with framework picks and the production failure modes nobody warns you about.

**HTML version:** https://www.paiteq.com/blog/multi-agent-orchestration-patterns/
**Published:** 2026-05-17T16:24:49.662Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~27 min


---

A multi agent system is a software architecture where two or more autonomous language-model agents coordinate to solve a task that one agent cannot, or should not, handle alone. Each agent has its own prompt, its own tool set, often its own model choice, and a defined way of passing work to another agent. The supervisor pattern routes through a single coordinator; the swarm pattern lets peers hand off directly; the hierarchical pattern stacks supervisors. That is the whole idea, and in 2026 it is one of the most over-prescribed shapes in applied AI.
We've shipped enough agent systems to think the conversation around them is upside down. The interesting question isn't whether you can wire three Claude Sonnet 4.6 agents into a CrewAI graph. You can, in an afternoon. The interesting question is whether you should, what the production failure modes look like once the demo is live, and which of the six orchestration patterns actually fits the work you're trying to automate. That's what this post covers, written for the Tech Lead, VP Eng, or CTO who has to pick a pattern and own the on-call rotation when something breaks at 3am.

## Multi agent system, in one minute

Working definition. A multi agent system is a runtime in which two or more LLM-driven agents exchange messages, share a workspace, and call tools, in order to complete a task that has been decomposed across them. Each agent is a prompt plus a model plus a tool set plus a memory of the conversation so far. The orchestrator is the code that decides who runs next, what they see, and when the system has reached an answer. That orchestrator is the architectural choice you're making when you pick LangGraph or CrewAI or AutoGen or the OpenAI Agents SDK or the Anthropic Agent SDK, or a custom state machine you wrote in TypeScript.
What's actually shipping in 2026. Most production agent systems we see in the wild settle on a supervisor or a sequential pipeline; a smaller but real cohort runs hierarchical (supervisors of supervisors) for genuinely complex workflows; the swarm and blackboard shapes show up in research-and-summarise tasks where parallel exploration pays. The model layer is split between Claude Opus 4.7 for the planner role, Claude Sonnet 4.6 or GPT-5 mini for the worker roles, and Claude Haiku 4.5 or GPT-5 nano for routing classifiers. Token spend is dominated by the supervisor's growing context window, not by the worker calls, and that one fact drives more of the unit economics than any framework choice.
Throughout this piece we'll use *multi agent system* as the umbrella for any architecture where more than one LLM-driven agent participates in the same task. Where the pattern actually changes the engineering tradeoffs (cost and latency and failure modes), we'll say which pattern and why.

## Multi agent system architecture: the five components every pattern shares

Every multi agent system architecture, regardless of framework, reduces to five components. The agents themselves: each is a prompt template, a model binding, and a tool list. The message bus or graph edge: how agent A's output reaches agent B's input. Shared state: the workspace or scratchpad or graph-state object that survives across handoffs. The tool registry: what external functions are callable, and by whom. And the orchestrator: the code that picks the next agent, enforces handoff rules, and decides termination.
The split between message bus and shared state is the part most teams get wrong on a first build. LangGraph keeps them separate: the graph defines edges (the bus), and a typed state object travels between nodes (the state). CrewAI fuses them inside its Crew object, which is convenient until you need to inspect what one agent saw and another didn't. AutoGen's group-chat abstraction puts everything in a single conversation buffer, which is easy to reason about for the first three agents and an observability nightmare for the next seven. Pick the framework whose split matches how you want to debug at 3am, not the one with the prettiest landing page.
Tool registry design is where security and cost both live. A naive multi agent system architecture hands every agent every tool. A production one scopes tools per role: the planner can call the search tool and the database read tool; only the writer agent can call the email-send tool; nobody but the auditor can call the destructive delete tool. The same pattern that scopes IAM in a real backend service applies here, and skipping it is what turns an agent demo into a security finding. We talked through the same logic on our pillar around [ai agent development company](/services/ai-agent-development/) engagements; the principle there is identical.
Shared state is the second place careful design pays off. The cheapest shared state is a Python dict in process memory and that works fine for a graph that completes in seconds on a single worker. The moment the work spans a queue or a Temporal activity, the state has to live outside the process; Postgres is the boring default and the right one for almost every team we work with. Redis is fine for short-lived caches but a poor primary store for an agent system because you actually want history. Most teams write a thin state-store wrapper inside their orchestrator code so the swap from in-memory to Postgres is a one-line change when the system grows up.

## The six orchestration patterns you'll actually use

There are dozens of named patterns in the research literature; six of them cover almost every production system we've shipped or reviewed. Supervisor: one coordinator routes work to N specialist subagents and aggregates their replies. Hierarchical: supervisors of supervisors, where a top-level planner delegates entire sub-tasks to mid-level supervisors that own their own teams. Swarm or network: peer agents hand off to each other directly, no central coordinator, with a termination rule. Blackboard: agents read and write a shared structured workspace, picking up work when their precondition is met. Sequential pipeline: a fixed chain of agents, each passing output to the next, with no branching. Hybrid: a real system that combines two of the above, almost always supervisor-plus-pipeline.
When to pick each. Reach for supervisor when work decomposes cleanly into specialist roles (researcher and drafter and fact-checker and editor). Reach for hierarchical only when a single supervisor's context window can't hold the full task plan; this is rarer than people think now that Claude Opus 4.7 sits comfortably in the long-context tier. Reach for swarm when the task is genuinely parallel, the subagents don't need each other's intermediate outputs, and you have a hard termination rule (count or time or consensus). Reach for blackboard when the system has long-running asynchronous work and the agents are actually different services owned by different teams. Reach for sequential pipeline when the steps are well-defined and deterministic and you don't actually need agentic reasoning between them; in that case you're better off with a plain workflow runtime and a few LLM calls inside it.
Hybrid deserves a closer look because it's where most production systems actually land. A common shape is supervisor-plus-pipeline: a top-level supervisor that picks a sub-workflow for the incoming request, and inside each sub-workflow a deterministic sequential pipeline does the work without any further agentic decisions. That shape gets you the routing flexibility of an agent plus the predictability of a workflow; the trace is still readable. Another common shape is supervisor-plus-swarm for deep-research-style tasks: the planner runs as a supervisor and the research step inside it fans out to a small swarm of parallel researchers that the planner then summarises. The trick with hybrid is to keep the boundary between the two patterns explicit in the code; a hybrid that quietly drifts into a fourth-pattern hybrid-of-hybrids ends up impossible to reason about.
A note on naming. Different framework communities use different words for similar shapes. LangGraph calls its default the 'supervisor' graph; CrewAI calls it 'hierarchical process'; AutoGen calls it 'group chat with a manager'; the OpenAI cookbook talks about 'orchestrator-worker'. They're all the same pattern with different ergonomics. Don't get attached to the framework's vocabulary if it confuses your team; pick a name internally and document it once and use it everywhere.

### Blackboard in the wild: supply-chain agents at a Fortune-500 logistics shop

The cleanest public example of a blackboard multi agent system in production today is the wave of supply-chain optimisation rollouts running on top of Temporal at large logistics shops. Each warehouse, freight lane, and demand-forecast region is wrapped in its own agent, and they all read and write a structured shared workspace that holds the current best-known plan. A planning agent posts a draft schedule; warehouse agents check feasibility against their local constraints and post objections; a reconciler agent picks up when enough objections land and re-runs the plan. The pattern works here for one reason: the agents are owned by different teams with different release cadences, and forcing them through a single supervisor would have created a coordination bottleneck nobody wanted to own. It's classic blackboard — Hearsay-II from the 1980s, just with LLMs reading and writing JSON instead of speech tokens — and the production write-ups from JPMorgan, Walmart Labs, and a handful of European 3PLs all describe broadly the same shape. We've reviewed one engagement that matched this shape and we'd reach for it again under the same constraints.

### Sequential pipeline in the wild: sales-ops research at high-velocity B2B teams

The canonical sequential pipeline shipping in 2026 is the sales-ops research stack that's become standard at high-velocity B2B teams — Clay, Apollo's agent layer, the various CrewAI-templated stacks that the GTM-engineering crowd shares on LinkedIn every other week. The shape is always three agents in a fixed order: an account researcher that pulls firmographics and a recent-news scan, a signal hunter that scores buying-intent cues from job postings and tech-stack changes, and a briefer that turns the upstream output into a short, formatted memo for an account executive. It's a pipeline because the order is deterministic and the agents don't argue with each other; each one's output is the next one's input, and the only branching is a quality gate that kicks back to the researcher if the firmographic block is empty. CrewAI's `Process.sequential` was effectively designed for this shape, and it's the one place we don't push back on a multi-agent ask — the role split genuinely pays because the prompts and the tool sets really are different per step.

### Hybrid in the wild: deep-research products from Anthropic, OpenAI, and the open clones

Hybrid is the pattern that ships in every deep-research product you've seen this year. Anthropic's Research feature for Claude, OpenAI's Deep Research mode, Perplexity's Research tier, and the open clones (Anthropic's cookbook, the Open Deep Research repo, LangChain's reference template) all converge on the same supervisor-plus-swarm shape: a planner reads the user's question, decomposes it into a handful of independent sub-queries, fans those out to parallel research workers, and then runs a single writer agent over the gathered summaries. The reason every team lands here independently is structural — the planning and the writing are sequential and stateful and want a single agent, but the research step is embarrassingly parallel and wants a swarm, and forcing either step into the other's pattern wastes either latency or coherence. The trap, which we see in roughly a third of the open clones, is letting the workers also do follow-up planning; once that happens you've drifted into a fourth-pattern hybrid-of-hybrids and the trace becomes unreadable. Keep the supervisor's job and the swarm's job strictly separated and the shape holds up under production load.

## Multi agent system examples: what production teams are shipping in 2026

It's easier to argue about patterns when you can point at concrete builds. Here are six multi agent system examples drawn from the public literature, vendor case studies, and the engineering conversations our team has had this year. None are paiteq client outcomes; they're industry-shape references.
Two patterns to call out from this list. The deep-research stack (Anthropic's Claude research feature, and the open clones that followed) is the canonical case where swarm-style parallel exploration genuinely pays — the subtasks are independent, the writer agent only needs the summaries, and the wall-clock saving from running researchers in parallel is the whole product. The customer-support stack is the opposite: a tight supervisor with three or four specialised workers, where the supervisor's role is to keep context lean and route deterministically. The framework choice (LangGraph vs CrewAI vs the bare Agents SDKs) is much less interesting than the pattern choice.

## Best multi agent system framework: a six-way decision matrix

There isn't a *best multi agent system* framework in the abstract. There's a best framework for a given pattern, team size, and observability requirement. The honest answer to the question we get most often (which framework should we use?) is a four-dimension judgement: latency cost, observability maturity, learning curve, and pattern fit.
Two notes on the matrix that don't fit in cells. First, the OpenAI and Anthropic Agent SDKs aren't strict alternatives to LangGraph or CrewAI; they're a layer below. You can wrap either SDK inside a LangGraph node and get the best of both. Most production systems end up as a thin LangGraph (or CrewAI) outer loop around vendor-SDK inner calls. Second, Temporal isn't really a multi-agent framework at all; it's a durable workflow runtime. When the workflow spans hours or days and has to survive worker crashes, picking Temporal and putting agent calls inside its activities is the right answer and the multi-agent question becomes a workflow design question instead.
Where LangGraph hurts. The pain point we hit most often on LangGraph builds is the typed-state object that's beautiful at sprint one and a refactor headache by sprint four. A typical engagement shape: the team starts with a clean `AgentState` TypedDict carrying a `messages` list and a `next` field, then adds a `scratchpad`, then a per-agent `memory` dict, then a `tool_outputs` cache to keep retries cheap, and within a month the state object is fifteen fields deep with implicit invariants nobody documented. LangGraph itself is fine — the graph compiles, the runtime is fast, the LangSmith traces are excellent — but the cost of changing the state shape grows roughly cubically because every node has to be re-read and every saved checkpoint becomes a migration problem. Our default mitigation is a hard rule that the state object owns no more than seven fields, with everything else demoted to a side-cache keyed by run-id; we've watched teams who skipped that rule lose a sprint per quarter to state-shape refactors. If the state really needs to grow, that's a signal to split the graph in two, not to keep widening one object.
Where CrewAI saves a sprint. CrewAI's value shows up earliest on small sequential pipelines where the team is new to agents and the project lives or dies on time-to-first-demo. A typical-shape engagement: a four-person team wants to ship a sales-ops research pipeline in two weeks, doesn't have prior LangGraph muscle memory, and the work decomposes into three agents in a fixed order. CrewAI's role-task-crew vocabulary clicks in a half-day workshop, `Process.sequential` matches the pipeline pattern exactly, and the first full-loop run is usually green by day three. The trade-off lands later, when the team needs to add a branching decision (route to a 'deep research' sub-crew when the firmographic data is sparse), at which point CrewAI's process abstractions start to feel narrow and you find yourself reaching for a wrapper LangGraph anyway. So the recommendation we keep landing on is: start in CrewAI if the team is new and the pattern is genuinely sequential, plan the migration to LangGraph for the moment the graph stops being a line, and don't try to retrofit branching into a CrewAI Crew when a graph runtime is the right tool.
AutoGen edge cases. AutoGen earns its place in research-flavoured group-chat patterns, and it's the right tool for almost nothing else in 2026. The typical-engagement shape we'd point you toward AutoGen for is one where the value of the system is in agents arguing with each other — a red-team-vs-blue-team eval harness, a debate-style fact-checking loop, a research prototype where you want to watch four personas push back on a draft. The group-chat abstraction makes that easy and the Studio UI is genuinely the nicest one in this category for inspection. The places we've seen it break down are the ones where the team picked AutoGen for a production supervisor build because the demos looked clean, and then the group-chat token fan-out and the looser termination semantics started chewing through their token budget on every long conversation. If you find yourself adding custom termination logic on top of AutoGen, that's the moment to step back and ask whether LangGraph's explicit edges would have been the cheaper starting point; it almost always would have been.

## Coordination cost is the unit-economics killer

A common mistake is to budget a multi-agent system as if its cost equals the sum of its agent calls. It doesn't. The supervisor sees every subagent's output, plus the original task, plus its own running plan, on every routing decision. By turn three the supervisor's input context is the sum of everything the subagents have said so far, and you're paying for that context on every supervisor turn. Add classifier-free routing (the supervisor asking a Haiku-class classifier whether to delegate) and you're paying twice on every decision.
These multipliers are typical-engagement-shape numbers we use to sanity-check a budget before the first prototype lands. The exact figure for any given system depends on prompt size and tool-call density and how aggressively the supervisor prunes context between turns. The shape, though, holds: anything beyond a sequential pipeline costs at least 2× the single-agent equivalent on tokens, and the cost grows roughly linearly in agent count plus an extra term from supervisor context growth. If your pricing model can't absorb a 4× to 6× multiplier on the agent-loop tokens, you don't have a multi-agent product yet; you have a research prototype that needs cost engineering before it ships.
The fix is mostly mechanical. Cap the supervisor's context with a rolling summary instead of the full history; force subagents to return structured JSON instead of free prose; pin the cheapest competent model per role (Haiku 4.5 routes, Sonnet 4.6 works, Opus 4.7 plans) using the same [model selection patterns we use for LLM workloads](/services/llm-development/); and add a hard turn cap so a runaway loop bails before it bankrupts the run. Every production multi-agent system we've shipped has these four levers turned up.

## Multi agent system implementation: a working LangGraph supervisor

A pragmatic *multi agent system implementation* in 2026 leans on a small, boring toolchain. LangGraph 0.4.x or CrewAI 1.x on top of Python 3.11+. A vendor SDK underneath for the actual model calls (the Anthropic SDK for Claude Sonnet 4.6 or Opus 4.7, the OpenAI SDK for GPT-5). Langfuse or LangSmith for traces. Postgres for whatever shared state needs to outlive a single run. A queue (SQS, NATS, or an internal Temporal cluster) if the runs are long-lived. Nothing exotic, and nothing that the team can't operate.
Three implementation traps we see repeatedly. First, hardcoded role prompts that don't degrade gracefully when the planner picks an unexpected route; always include a 'no-op' branch that just appends a short refusal to state and lets the supervisor decide again. Second, missing turn caps; we put a hard MAX_TURNS at the supervisor's routing function and refuse to ship without it. Third, untyped state: LangGraph's TypedDict isn't optional in our book, because once you're three commits in, you will forget what fields you added and what they meant, and the type checker is the only thing that catches it before the first 3am page.
On the training-and-serving split: agent systems don't have a training phase the way a classifier does, but they do have an eval phase that behaves like one. We treat the prompt set, tool list, and routing logic as 'weights' and keep them in version control with the same discipline a model gets, run a regression suite on every change, and only promote to production when the eval grid passes. Teams that skip this end up reverting prompt changes by hand at 3am when the support backlog spikes, which is a recoverable mistake the first time and a culture problem by the third. We've watched that pattern more often than we'd like.
What's missing from these snippets. The three code samples above are honest sketches of the routing loop, and they're also missing roughly two-thirds of what a production multi-agent system actually needs. None of them persist state across runs; each invocation starts from a fresh dict, which is fine for a notebook demo and unworkable the moment a user can resume a session. In production you'd back the state object with Postgres (a `state_snapshots` table keyed by run-id, written after every node) or wrap the whole graph in a Temporal workflow so the runtime owns the durability. None of them handle errors gracefully; a tool that raises mid-run blows up the loop instead of being caught, logged, and either retried with backoff or surfaced to the supervisor as a structured failure for re-planning. We default to a try-and-record pattern at every tool wrapper: catch, log a structured failure event with run-id and agent-id, and return a typed error object the supervisor can read and route on.
Retries deserve their own line. The snippets above retry nothing, which means a transient 429 from the model provider or a flaky tool call ends a run that would have succeeded on the next attempt. We wrap every model call and every tool call in an exponential-backoff helper (tenacity in Python, p-retry in Node) with a strict cap — three attempts, capped at 30 seconds total — and we record every retry as a separate trace span so the dashboards show retry rate as a first-class metric. Two more things missing from the sketches: there's no idempotency token on the tool calls (so a retried 'send email' tool can double-send), and there's no cost meter wired into the loop (so a runaway can outrun your alerting). The supervisor.py file in production is usually three times the size of the sketch above, and the extra lines are exactly state persistence, error handling, retries, idempotency, and cost metering. None of it is fancy; all of it is non-negotiable before you ship.

## The five production failure modes nobody warns you about

Demo-stage multi-agent systems break in ways their authors didn't anticipate. Five failure modes account for the majority of incidents on the systems we audit.
One. Token-budget runaway. A supervisor that doesn't summarise grows context linearly; on a slow day, a single user query can chew through a five-figure token bill in a single run. A concrete trace shape we've seen on an audit: turn 1 input is 2k tokens, turn 6 input is 38k tokens, turn 12 input is 110k tokens, and the supervisor still hasn't decided to stop. The cost curve in the trace UI looks like a hockey stick. Detection: alert on per-run token count exceeding a fixed budget (we typically set this at 5× the median run for the workload). Prevention snippet: after every supervisor turn, replace the oldest N messages with a one-paragraph summary block tagged `summary_of_turns_1_to_5`, and impose a hard MAX_TURNS = 12 in the routing function that returns 'done' on hit. The combination cuts the 99th-percentile run cost by an order of magnitude with almost no quality loss.
Two. Role collision. Two agents with overlapping prompts converge on the same answer style and stop adding value. The trace signature is unmistakable once you've seen it: agent A returns a 400-word answer, agent B returns a near-identical 410-word answer with slightly different phrasing, and the supervisor's aggregator concatenates both and ships 800 words of redundant prose to the user. We've watched this in a fact-checker-and-editor pair where both agents drifted into 'comprehensive review' over time. Detection: a cheap eval that asks a Haiku-class judge 'did agent B's output strictly subsume agent A's?'; if the answer is yes on more than a small fraction of cases, the roles are too close. Prevention snippet: write each role contract as a one-sentence 'must do X and must not do Y' constraint inside the system prompt, and add an automated diff check post-run that flags high token-overlap between supposed-to-be-different agents.
Three. Context-window OOM. The supervisor's input grows past the model's window and the system silently truncates, often dropping the earliest task description. The trace example we still cite in workshops: a supervisor on a 200k-token-window model passes 205k tokens of context on turn nine, the API quietly drops the first 5k tokens (which happened to include the user's original instructions), the system produces a confident but off-topic answer, and the user reports a 'hallucination' that's actually just the original prompt being silently forgotten. Detection: log every model call's input token count alongside the model's context limit, and alert when usage crosses four-fifths of the window. Prevention snippet: pin the supervisor to the longest-context model available (Claude Opus 4.7 at 1M, Gemini 3.0 Pro at 2M), summarise older turns aggressively (drop everything older than the last 8 turns into a single summary block), and re-inject the original task description at the head of every supervisor input as a non-negotiable system message.

> [!NOTE] (rich block: callout)

Five. Untraceable tool calls. An agent calls a tool whose effect isn't logged in the conversation buffer, and a later audit can't reconstruct what happened. The trace shape that gives this away: the conversation buffer shows a clean response, but a downstream system (Salesforce, Stripe, a CRM webhook) shows a write that the trace can't account for. We've watched a team spend a full sprint reconstructing what a five-agent run did when finance flagged a four-figure vendor charge with no corresponding trace span. Detection: every tool call must emit a structured log line with run-id, agent-id, tool-name, arguments, and outcome — and a nightly job should reconcile tool-side events against trace-side events and alert on the delta. Prevention snippet: instrument the tool registry once, at the wrapper layer (`@traced_tool` decorator that wraps every tool function and emits an OTel span on entry and exit), instead of asking every agent to be polite about logging. The investment is half a sprint and pays for itself the first time legal asks what the system did on a specific request.

## Observability and eval — the agent-trace stack

An agent system without a trace stack is undebuggable in production. The three tools we lean on, in order of how often they show up in our engagements: Langfuse for self-hosted or hybrid teams that want full ownership of trace data, LangSmith for teams already on the LangChain ecosystem, and Inspect AI from the UK AI Safety Institute when the eval suite needs to be defensible to a regulator. All three answer the same three questions: which agent ran when, what did each one see and emit, and how long did each step take.
The eval harness is the part most teams underbuild. A multi-agent eval suite has to cover three layers. Per-agent unit evals (does the researcher cite real sources; does the writer hit the format spec). Per-handoff evals (does the supervisor route to the right specialist given a known input). And full-system evals (does the whole graph produce a correct final answer on a held-out test set). We typically maintain 50 to 200 cases per layer and re-run them on every prompt change, which sounds expensive and isn't, because most of the cost is one-time fixture authoring and the runs themselves take minutes on a small batch.
Observability tooling pairs naturally with the kind of [workflow automation engagements where we run the orchestration ourselves](/services/ai-workflow-automation/); you don't get to run an agent system in production without a trace UI, and you don't get to ship a workflow automation product without one either.
There's a second, quieter discipline that makes multi-agent systems serviceable in production: structured logging at the tool boundary. Every external call an agent makes should land in your logs as a structured event with a run-id and an agent-id and a timestamp. We've watched teams skip this and then spend a full sprint reconstructing what a five-agent run did when a downstream system flagged a bad write. The wrapper that emits these logs is half a day of work; the absence of it is two weeks of forensic spelunking. The math isn't subtle.
On eval cadence: we recommend running the full suite on every prompt change and on every framework upgrade, with a weekly canary on production traffic. Production canaries don't need to be perfect, but they catch the kind of drift that hand-rolled fixtures miss. Two failure modes only show up under real traffic: long-tail input formats that the test set didn't capture, and emergent supervisor behaviours that only appear when context-window pressure is real. A weekly canary across 50 to 100 real (anonymised) inputs catches both. We've yet to meet a team that regrets adding it; we've met several who regret not adding it sooner.

### Langfuse: the open-core default for self-hosted teams

Langfuse is the trace tool we pick most often, and the reason is unsexy: it's open-core, it self-hosts cleanly on a single Postgres plus a small worker, and the data stays on infrastructure the client already owns. The UI gives you the agent tree we keep showing in workshops — parent span for the supervisor, child spans for each subagent call, grandchild spans for each tool invocation, with token counts and latency on every span — and the SDK works across LangGraph, CrewAI, the OpenAI Agents SDK, the Anthropic Agent SDK, and raw model calls with the same instrumentation pattern. The weak spot is the eval product: Langfuse's eval harness is competent but not opinionated, and teams that want a strong opinionated workflow often layer DeepEval or promptfoo on top. Pick Langfuse when data residency matters, when the team wants to own the trace store, or when the stack is multi-vendor and you don't want to bet the observability layer on one model provider's tooling.

### LangSmith: the LangChain-native pick when the team is all-in on the ecosystem

LangSmith is what we reach for when the team is already deep in LangChain and LangGraph. The integration is essentially zero-config — you set two environment variables and every node in your graph emits a span — and the dataset-and-eval product is the most opinionated of the three, with first-class support for LLM-as-judge evals, regression test suites, and prompt-version diffing inside the same UI as the traces. The weak spots are predictable: LangSmith is a hosted SaaS by default (self-hosting is a paid enterprise tier), pricing scales with trace volume which can sting on a busy production system, and the tightest features assume you're using LangChain abstractions throughout the stack. If the team isn't already on LangChain, the integration is fine but you're paying for features you won't fully use. Pick LangSmith when LangChain is the substrate, when the eval-driven-development discipline matters, and when the team values speed-of-setup over data ownership.

### Inspect AI: the eval framework regulators read

Inspect AI is the odd one out and the one we recommend when the eval results have to defend themselves to a third party. Built and maintained by the UK AI Safety Institute, it's an open-source Python framework whose primary job isn't tracing your production traffic — it's running rigorous, reproducible eval suites of the kind that show up in regulatory submissions and frontier-model evaluations. The trace UI is functional rather than slick, but the eval primitives (solvers, scorers, datasets, plans) are the most carefully thought-out in this category, and the audit-trail format is taken seriously enough that we've seen it referenced directly in AISI and NIST submissions. Pick Inspect AI when the system is high-stakes enough that a regulator or a procurement team will ask 'show us your eval methodology' — and pair it with one of the other two for day-to-day production tracing rather than asking it to be both.

## When NOT to use a multi-agent system

Three opinionated calls we'll defend in any client review. They aren't always popular in the kickoff workshop. They've been right often enough that we keep making them.
Call one. Default to a single agent with parallel tool calls. Roughly six or seven out of every ten use cases that walk through our door with a multi-agent ask are better served by one well-prompted Claude Sonnet 4.6 or GPT-5 agent that can call its tools in parallel. The architecture is simpler, the cost is lower, the trace is one tree instead of a graph, and the failure modes are the ones every team already knows how to handle. If a single agent with the right tool set can produce the answer in one or two turns, splitting the work across agents is yak-shaving with extra bills.
Call two. Reach for multi-agent when the work decomposes into specialist roles that genuinely use different prompts, models, or tool sets. A researcher that runs on Sonnet with web-search tools, a fact-checker that runs on Opus with no tools, and a writer that runs on Sonnet with a style guide is a system where the role split pays. A 'planner agent' and a 'coder agent' that both run on the same model with the same tools and a slightly different system prompt is not.
Call three, the one teams underweight. Ship a single-agent baseline first, then decompose. Premature multi-agent decomposition is the new premature microservices: it costs you simplicity up front, locks in a topology before you've measured the bottleneck, and makes the next refactor harder. The right sequence is: ship single-agent, instrument it, find the actual quality or latency gap, split only where the measurement says splitting helps.

## Multi agent system guide: a 7-step build checklist

Use this short *multi agent system guide* as a sanity check before you spend the first sprint on a multi-agent build. We run a version of it inside every kickoff workshop.
Step by step. One, write the task in one sentence; if you can't, the system isn't ready to be designed. Two, ship a single-agent baseline with parallel tool calls and instrument it; only proceed if you can name a measured gap. Three, pick the orchestration pattern (supervisor for most cases, swarm only for parallel research workloads, pipeline only for deterministic chains). Four, pick the framework that matches the pattern and your team's observability needs. Five, pin the cheapest competent model per role. Six, wire up Langfuse or LangSmith on day one. Seven, set hard caps on tokens per run and turns per run and dollars per day, and refuse to ship without them.
One closing note on the checklist. Steps three through five almost always change after the team has run the single-agent baseline for two weeks; steps one and two rarely do. If you get the task definition and the baseline right, the rest of the design is recoverable. If you skip them, no amount of framework polish saves you. We've seen the same lesson land in three different industries this year, and we expect to see it again — every robust multi agent system we've shipped started from a single-agent baseline that earned its split, not a topology picked in the kickoff workshop.

## Frequently asked questions about multi-agent systems

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/


---

## SECTION: 8.2. Blog: generative-ai-services-buyers-guide

_Source: https://www.paiteq.com/blog/generative-ai-services-buyers-guide/_

# Generative AI services: a 2026 buyer's guide

> A 2026 buyer's guide to generative AI services — brand-controlled image, video, audio and multimodal pipelines, eval-graded outputs, and what the production pipeline actually costs.

**HTML version:** https://www.paiteq.com/blog/generative-ai-services-buyers-guide/
**Published:** 2026-05-17T15:45:14.002Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~17 min


---

Generative ai services are the layer of software that takes a creative brief (a product still, a brand film, a voiceover) and runs it through a stitched stack: a foundation model (Imagen 4 or Flux Pro 1.1 or SD 3.5 or Claude Sonnet 4), a routing layer (Fal or Replicate or Runware), a brand-asset store (Pinecone or pgvector), a delivery surface. What's shifted in 2026 is that generative ai services have stopped being one API call equals one asset and become real durable pipelines. A single deliverable now runs through 4 to 8 model calls, a retrieval step, a brand-safety filter, a human-review queue. That's the structural shift this guide is written around.
This is a 2026 buyer's guide for the Creative Director, the Brand Manager, the CMO, the Tech Lead evaluating generative ai services for a budget cycle that needs to ship a working pipeline this quarter. We'll skip vendor-marketing definitions and go straight to architecture shapes, the modality-by-modality vendor matrix (Imagen 4, Flux Pro 1.1, SD 3.5, Sora, Veo 2, ElevenLabs), a working code stack, the cost-per-approved-asset math, and the eight-step kickoff checklist. The bias is toward repeatable cost-per-shippable-asset.

## Generative ai services in one paragraph, and why the 2026 stack looks different

Working definition. A generative ai service is a productionised workflow (multi-step, crossing 3 to 6 systems) where at least one step is a generation call to a foundation model like Imagen 4, Flux Pro 1.1, SD 3.5, Sora, Veo 2, Claude Sonnet 4, or ElevenLabs, and the orchestration survives a retry or a content-safety reject. That rules out a one-shot Midjourney prompt run by hand. It rules in a Figma-to-Imagen pipeline that respects brand tokens, a Diffusers worker running SD 3.5 with a per-brand LoRA, a Runway Gen-3 plus ElevenLabs chain producing a 6-shot social cut, a Claude-driven copy expansion pulling from a Pinecone brand archive.
Three things shifted between the 2023 image-gen wave and the 2026 generative ai services market. First, modalities went multi. Sora and Veo 2 made video generation a commodity ($0.50 to $1.50 per 8-second clip at draft quality). ElevenLabs and Cartesia did the same for voice. Second, the routing layer matured. Fal and Replicate and Runware now host hundreds of open-weight checkpoints behind one HTTP API. Swapping Flux Pro for Stable Diffusion 3.5 is one config line rather than a re-platform. Third, brand-control primitives shipped. LoRA fine-tuning on Flux Dev and SD 3.5 makes a brand-locked checkpoint a weekend of work. ControlNet plus IP-Adapter plus reference-image conditioning let a generated asset stay inside the brand sandbox. That's the headline reason new creative-ops budgets don't go to a single hosted vendor anymore.
We'll use *generative ai services* through this guide as the umbrella term for any production stack that combines durable orchestration with at least one foundation-model generation call, served either through a managed routing platform (Fal and Replicate and Runware, Together AI) or a self-hosted control plane (Diffusers plus ComfyUI workers behind n8n or Temporal). Where the trade-off between a single-vendor hosted path and a routed multi-vendor path matters for procurement, we'll say so. Our deeper companion note on [the model-architecture trade between diffusion and flow-matching](/blog/diffusion-vs-flow-models/) walks the technical comparison if you want to go a layer deeper on why Flux beats SDXL on certain prompts.

## What counts as generative ai services, and what doesn't

The category boundary is where most buyer conversations go sideways. A vendor will pitch an Adobe Firefly seat as a complete generative ai services platform; a Canva Magic Studio rep will pitch a template library as the whole answer. Procurement needs sharper lines.
Two implications from the matrix. One, a generative ai services platform must do meaningful work. A Flux Pro 1.1 call that rephrases one tagline isn't a service; a Flux Pro 1.1 call that generates 40 brand-compliant social variants from a single brief, indexed in Frame.io, with the bottom 70% auto-filtered against a CLIP brand-similarity gate, is. Two, the routing layer matters more than any single model. n8n or Temporal coordinating Fal for image work, Runway for video, ElevenLabs for voice lets you swap any one vendor inside a sprint. Sora went from invite-only at $200/month to API access at sub-$0.10 per generation inside 14 months, which broke a lot of single-vendor projection models.
The other test we run early: can this pipeline survive a model deprecation without a rebuild? A pure Midjourney-only studio answers no — Midjourney v5 to v6 changed the prompt grammar enough that brand-locked templates needed rewriting. A self-hosted Diffusers pipeline with a versioned SD 3.5 LoRA answers yes; your checkpoint doesn't move unless you move it. That portability costs an MLOps engineer to run the GPU pool, but it's why mid-market creative-ops teams with serious brand-equity stakes are migrating off purely-hosted stacks. We've watched three brand teams do this in the past year; they don't, won't, rebuild on a closed-API-only shop once a brand LoRA is in production.

## Generative ai services architecture: the three reference shapes we ship

There are exactly three generative ai services architecture shapes we ship. They differ by who owns the model, who owns brand control, and where the human-review hook lives. Picking the right one at kickoff is the highest-leverage decision; getting it wrong costs 4 to 8 weeks of rework.
Shape one. Single-vendor hosted. The whole pipeline lives inside one platform: Vertex AI with Imagen 4 and Gemini, or AWS Bedrock with Titan and Claude, or Adobe Firefly with the Creative Cloud APIs. Brand control sits in prompt templates and reference images, with no custom checkpoint. Cost shape: roughly $0.03 to $0.08 per image render, $0.50 to $1.50 per 8-second video clip. It's the right pick when speed-to-first-asset matters more than brand fidelity. We recommend it in maybe 35% of engagements, usually B2B SaaS and early-stage commerce where brand systems are still loose.
Shape two. Routed multi-vendor. Fal and Replicate and Runware, or a custom Python router sits in front. The pipeline calls Imagen 4 for hero stills, Flux Pro 1.1 for fast iteration, Stable Diffusion 3.5 for stylised work, Runway Gen-3 for short video, ElevenLabs for voice. The router abstracts the model API so swapping vendors is a config change rather than a refactor. Brand control sits in prompt engineering plus reference-image conditioning plus CLIP-based similarity gating. Use this shape when the brand brief is varied enough that no single vendor wins all the work (a $0.04 Imagen render versus a $0.003 Flux Schnell render is a 13x cost delta on similar briefs), and when vendor risk is real. We pick this in maybe 45% of engagements. It's the modal answer.
Shape three. Self-hosted with custom checkpoints. Diffusers plus ComfyUI workers running SD 3.5 or Flux Dev on Modal or RunPod or Cloud Run. A versioned brand LoRA per model line. Brand control is in the checkpoint itself. Use this shape when brand identity is a strategic moat (luxury or fashion or regulated CPG), when generation volume crosses 50,000 assets a month and cost-per-image needs to land below 1 cent fully loaded, or when on-prem inference is a compliance requirement. It costs an MLOps engineer to operate but ages best on a 3-year horizon. We pick it in 20% of engagements; the deeper pattern lives in our piece on [brand-locked generation with custom LoRA workflows](/blog/brand-controlled-generation-with-lora/).

> [!NOTE] (rich block: callout)

## Best generative ai services by modality (image and video and audio, text, 3D)

Ranking the best generative ai services on a single axis is a fool's errand. "Best" depends on whether you're shipping luxury product stills, a 6-shot brand film, a synthesised voiceover, or expansion copy for a press cycle. Below is the modality-by-modality view we'd hand a creative-ops lead today, with the price bands we see on invoices. Treat it as the working set, not a leaderboard.
Three things to notice in the modality table. First, image and audio per-render costs have collapsed under one cent for fast iteration. The budget conversation is now about how many renders you need per approved asset. Second, video is still the expensive modality. Sora and Veo 2 cost 20 to 50x more per second than image generation; that's the line item agencies under-budget most often. Third, long-form copy (Claude Sonnet 4, GPT-4 Turbo, Gemini 2.0) is so cheap that the cost driver is human-review time, not inference. The procurement frame has to recentre on cost-per-approved-asset.
The modality we get asked about most is image generation for product and brand work. The 2026 default. Imagen 4 via Vertex AI for hero shots when photographic fidelity matters. Flux Pro 1.1 via Fal for stylised brand work where Imagen's aesthetic is too generic. SD 3.5 via ComfyUI or Diffusers when a brand-specific LoRA is in play. Midjourney v6 still wins on a narrow band of art-directed work, but its API surface is the weakest of the four. DALL-E 3 is operationally fine but prompt grammar is loosely constrained. The consistent answer for serious brand work is a routed pair rather than a single pick.

## Generative ai services examples — what creative teams actually ship in 2026

The most useful way to internalise generative ai services examples is by deliverable shape, since that's also how creative-ops budgets get carved. Every example below is a workflow shape we've shipped or specced in the last 18 months; we've avoided naming brands to keep the framing typical-engagement-shape. Treat each row as a recipe.
Two things to notice in the deliverables table. First, the gap between cost-per-render and cost-per-approved-asset is huge. A $0.003 Flux Schnell render hides a 1-in-20 approval rate on unstructured creative briefs; the fully-loaded cost lands closer to $0.06. Still a deal compared to $250 per stock-photography license. Second, voice and music are the modalities where vendor lock-in costs the most operationally. ElevenLabs voice clones don't port to Cartesia, and Suno tracks can't be regenerated identically on Udio. Plan for that early.
The deliverable we get asked about most is the social-variant batch: one hero asset blown out to 40 sized variants for paid social. The 2026 default: a single Flux Pro 1.1 hero fed into a Flux Schnell variant pass via Fal (with seed locking and IP-Adapter for brand consistency), CLIP-filtered against a brand-reference set, assembled per channel by n8n into Notion or a DAM. Cost lands around 5 cents per shipped variant fully loaded. Teams that skip the CLIP gate end up with 15 to 20% of variants drifting off-brand and needing designer cleanup, which kills the math. The gate is a 30-minute build and the highest-ROI step in the whole pipeline.

## The generative ai services platform landscape, vendor-by-vendor

The generative ai services vendor landscape in 2026 has roughly twelve names that matter for a mid-market creative-ops buyer across image plus video, audio and text. We score the image-modality pool on five axes that match the procurement spreadsheet we actually use: brand-control depth, prompt fidelity, per-render cost, latency at scale, and the breadth of the surrounding ecosystem (LoRA training, ControlNet, IP-Adapter, hosted access). The matrix below is what we'd put in front of a creative steering committee tomorrow. Video and audio shops are handled in the follow-on note.
Reading the matrix as a buyer. Imagen 4 wins photoreal fidelity, loses on LoRA brand control. Flux Pro 1.1 wins prompt fidelity and ecosystem, loses on per-render cost. Stable Diffusion 3.5 wins brand control and cost, loses on out-of-box aesthetic without a tuned checkpoint. Midjourney wins aesthetic, loses on API surface. The honest answer for most mid-market creative teams is a pair: Flux Pro 1.1 for art-directed brand work plus Imagen 4 (or Stable Diffusion 3.5 with a LoRA) for the long-tail variant work, routed through Fal. Single-vendor pitches paper over a real gap. Our deeper companion piece on the technical trade between the foundation architectures runs at depth.
On video, the picture is messier and changing faster. Sora is the strongest at narrative continuity but API access is partner-gated and pricing has moved twice in 6 months. Veo 2 is the best-integrated path for teams already on Google Cloud. Runway Gen-3 is the most production-ready API for short brand work. We pair Runway plus Veo for most agency stacks and reach for Sora only when the brief needs a 20-second-plus continuous narrative. On audio, ElevenLabs is the strongest voice vendor but Cartesia is closing the latency gap fast (live narration, dynamic ad insertion). Suno owns short-form music; for longer scoring we'd still bring in a real composer. Our note on [the state of video model picks in 2026](/blog/video-generation-state-2026/) walks the video shortlist in more depth.

## Generative ai services implementation: a working pipeline in code

A concrete generative ai services implementation makes the architecture choices land harder than any matrix. Below are two snippets we'd ship: a routed Replicate call with Flux Pro 1.1 primary and SD 3.5 fallback, and a self-hosted Diffusers worker that loads a brand LoRA. Both encode the same workflow (brief in, image out, CLIP brand-filter, write to Postgres) so you can read them as a paired comparison.
Three implementation gotchas we hit repeatedly. First, the brand-similarity gate is worth its weight in saved designer hours. A 30-line CLIP cosine check against a pre-computed brand reference embed catches 60 to 80% of off-brand renders before a human sees them. The gate threshold (we start at 0.75 cosine and tune) is the most-tweaked knob across a 90-day engagement. Second, retries cost real money on hosted APIs. A Replicate call that retries Flux Pro three times because the upstream had a 502 has just billed you 3x. Cap retries at 2 on generation activities. Third, observability has to be designed in at day one. Every generation lands in Postgres with prompt, seed, model, brand-sim score. Skip it and you're flying blind into the second campaign cycle.
On the integration layer, two pieces are worth budgeting up front. Pinecone or pgvector for the brand-asset retrieval index. Pure-generation pipelines without retrieval drift off-brand within six weeks of launch. And a thin Python adapter (LangChain for text; a 250-line module for image) that abstracts model vendors so you can swap Flux Pro for Imagen 4 without rewriting the calling code. Vendor risk on the model side is the second-largest risk after brand-control loss. We wire this kind of stack regularly through our [model API and tooling integration practice](/services/ai-integration/).

## The evaluation framework for generative ai services against a creative brief

Most generative ai services RFPs we see are scored on the wrong axes: model leaderboard rank, demo polish, raw image quality on cherry-picked prompts. Here's the seven-axis framework we'd put on the procurement spreadsheet instead. Each row scores 0 to 3 against a specific brand brief. The matrix below is what we hand a creative steering committee at vendor-shortlist time.
The vendor that wins on this scorecard is usually not the vendor that wins on the marketing pages. Imagen 4 scores high on prompt fidelity, low on brand-control depth. Stable Diffusion 3.5 self-hosted scores high on portability, low on time-to-first-asset. Flux Pro 1.1 scores high across most axes but loses on the open-weight / self-host axis. Pick the pair that closes the gaps for your brief. Our companion piece on [the legal and provenance posture across model vendors](/blog/gen-ai-safety-and-watermarking/) walks the safety axis in more depth. Worth reading before a regulated buyer signs.

## Build vs buy vs assemble: where each delivery model earns its keep

The build-vs-buy conversation on generative ai services used to be binary: license Adobe Firefly and Canva Magic Studio and call it a day, or spin up a research team. In 2026 it's three options, and we use all three across engagements.
Option one. Buy a managed creative-suite platform end-to-front. Adobe Firefly plus Creative Cloud for everything, or Canva Magic Studio for the marketing-ops side. Vendor owns brand kits, asset library, hosting, model billing. Right for marketing teams without engineering capacity, wrong for anything that needs a custom brand checkpoint or volumes past 100K shippable assets a year where per-seat pricing eats the budget. The right pick maybe 25% of the time.
Option two. Build the routing layer yourself on Fal or Replicate or a custom Python adapter, integrate Imagen 4 and Flux Pro 1.1 and Runway Gen-3 directly. Right for creative-ops shops with at least one engineer, right at volumes where managed-platform pricing crosses into running the calls yourself. We pick this in 50% of engagements, almost always when the brand needs to ship across image plus video plus audio in the same campaign cycle.
Option three. Assemble. The call most creative-ops teams underweight. Pair a routed multi-vendor layer for the bulk of the work with a self-hosted Diffusers plus LoRA worker for brand-locked hero work where checkpoint control is the moat. The routed layer handles 80% of briefs where speed matters more than the last 10% of brand fidelity; the self-hosted worker handles the hero work where it doesn't. We recommend it roughly a quarter of the time. The pattern ages best because either half can be swapped without touching the other. The deeper trade-offs live in our engagement note on [our generative AI engineering practice](/services/generative-ai/).

## ROI and TCO modelling: the unit economics most agency decks skip

Procurement decks for generative ai services overwhelmingly anchor on cost-per-render math, and that doesn't survive a CFO review. The right unit is cost-per-approved-asset, modelled against the current cost-per-asset baseline (stock-photography license or photoshoot frame or in-house designer hour). If a stock license is $250 per image and a Flux Pro 1.1 pipeline runs at $0.40 per approved asset fully loaded, the saving per asset is roughly 600x. Multiply by annual volume, subtract build cost, and you've got a payback curve procurement can sign.
The model needs four inputs: shippable assets per month (V), cost-per-asset baseline (Cb), cost-per-approved-asset after generative (Ca), and build plus ongoing cost (B). Monthly saving is V × (Cb − Ca). Payback months = B ÷ monthly-saving. For a 500-asset-per-month e-commerce catalogue at Cb = $250 and Ca = $0.40 with a $50K build, payback lands inside the first month. For a 200-asset-per-month brand-locked campaign at Cb = $1,500 and Ca = $2.50 with an $80K build, payback runs 2 to 3 weeks of ongoing volume. CFOs respect simple models. What they won't respect is "10x faster than a photoshoot"; that doesn't compose into a P&L.
Three line items procurement decks routinely under-budget. First, reviewer time. A pipeline that generates 40 variants per brief still needs a brand reviewer to approve 4 to 8 of them: 5 to 15 minutes of designer time per brief. At agency rates of $80 to $120 an hour, reviewer cost can outweigh inference cost on small-batch work. Second, brand-LoRA refresh cadence. A LoRA trained on the current brand-asset library drifts roughly every 6 to 9 months; budget a quarterly retrain at 4 hours of A100 time plus 2 days of engineer work. Third, model price drift. The line to forecast is volume rather than unit cost. A successful pipeline tends to drive 2 to 4x the volume the spec assumed once creative teams trust the output.
On TCO over a 24-month horizon, reviewer time dominates, not inference. A 500-approved-asset-per-month operation on Flux Pro 1.1 at $0.04 per render and 8 renders per approved asset lands at roughly $160 per month of inference. Rounding error against most creative budgets. The reviewer cost on the same volume runs $4,000 to $7,500 per month. The higher-leverage lever in 2026 is the brand-similarity gate threshold. Every 0.02 you tighten the gate cuts reviewer load by roughly 10 to 15% on typical briefs. The inference invoice isn't where the budget lives anymore.

## The generative ai services guide: an 8-step build checklist for creative ops

Use this generative ai services guide as a creative-ops kickoff checklist. We run a version in every discovery workshop, and the eight steps cover 90% of the decisions that determine whether a pilot ships on time. Step ordering matters. Skipping ahead is the most common failure mode, particularly on the brand-similarity gate (step six), which teams try to defer and almost always regret by week four.
Step one. Pick one deliverable shape with named volume; don't automate an entire brand library in a single pilot. Step two. Set a cost-per-approved-asset ceiling before you pick a vendor; if the ceiling is $0.50, that rules out Sora on long-form video. Step three. Pick the architecture shape (hardest reversal later). Step four. Pick the model pair (primary plus fallback); Flux Pro 1.1 with Imagen 4 fallback is our default image pair. Step five. Wire retrieval against your brand asset library. Step six. Build the brand-similarity gate before you ship: CLIP cosine against a reference embed, threshold tuned on 200 known-good and 200 known-bad renders. Step seven. Ship the review queue (Frame.io for video and image; Notion for copy). Step eight. Instrument cost from day one in Postgres or BigQuery. Without that, the second-campaign budget conversation is unwinnable.

> [!NOTE] (rich block: pullQuote)

We cross-walk kickoffs with our [AI advisory and roadmap practice](/services/ai-consulting/) when the buyer hasn't yet decided whether generative ai services are the right wedge versus a broader LLM-and-RAG investment. If the brief is text-heavy rather than visual-heavy, the conversation usually pivots to a [language-model engineering engagement](/services/llm-development/) before we re-enter the creative-ops scoping. The two practices share the same observability and routing primitives, and the eight-step checklist above applies (with modality substitutions) to both.

## FAQ on generative ai services, in the buyer's vocabulary

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/


---

## SECTION: 8.3. Blog: diffusion-vs-flow-models

_Source: https://www.paiteq.com/blog/diffusion-vs-flow-models/_

# Diffusion model vs flow matching: a 2026 buyer guide

> A 2026 buyer and builder guide to the diffusion model paradigm — flow matching, diffusion model architecture, sampling cost, and what to ship.

**HTML version:** https://www.paiteq.com/blog/diffusion-vs-flow-models/
**Published:** 2026-05-17T14:56:49.542Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~18 min


---

A diffusion model is a generative system that learns to invert a known noising process. You take real data, add Gaussian noise across many timesteps until the signal is destroyed, then train a neural network to predict and remove that noise step by step. Run the network in reverse from pure noise and you get a sample. That mechanic, formalised by Ho and colleagues in 2020 as DDPM, is what now sits inside Stable Diffusion 3 alongside Flux as well as the closed video stacks (Sora, Veo) and Imagen, and powers most production image and video pipelines shipping in 2026.
Two things have changed since the textbook version of the story. First, flow matching, in particular rectified flow as published by Lipman and by Liu in 2023, now beats classical diffusion on sample efficiency and is the default training objective inside Stable Diffusion 3 and Flux. Second, sampling cost stopped being an academic curiosity and became the only deployment number anyone in our engineering conversations actually cares about. NFE per generation, the wall clock on an H100, the batch math behind a Runware or Replicate or Fal invoice. That's the post you're reading: a paired buyer and builder guide to the diffusion model paradigm in 2026, written for the Tech Lead who has to bet a roadmap on it.

## Diffusion model, in one minute

Working definition: a diffusion model learns the gradient of the log data density (the score) by training a neural network to denoise samples that have been progressively corrupted with Gaussian noise. You can read the same object as a hierarchy of denoising autoencoders or as a score-based SDE or as a variational latent-variable model. The three framings are mathematically equivalent. Song and Ermon proved the score-based view; Ho gave us DDPM; the SDE picture from Song et al. unified them.
What changed since 2023 is the training objective. Classical diffusion regresses against the noise residual at every timestep. Flow matching, the cousin paradigm that now ships inside Stable Diffusion 3 and Flux, regresses against a velocity field on a continuous-time trajectory between noise and data. The sampler at inference time is an ODE solver instead of an SDE walker. Both produce identical-looking images from identical-looking backbones (DiT or MM-DiT or UNet). The training math is what makes them cheaper to sample.
We'll use *diffusion model* in this post as the umbrella term that covers both classical denoising diffusion and flow matching, because that's how teams shipping with HuggingFace Diffusers and PyTorch talk about it in practice. Where the two paradigms diverge enough to matter, we'll say so explicitly.

## How a diffusion model actually works (forward and reverse, no math wall)

The forward process is a Markov chain that adds Gaussian noise over a fixed schedule of timesteps. Ho et al. ran 1000 timesteps with a linear beta schedule from 1e-4 to 0.02; Nichol and Dhariwal improved this with a cosine schedule a year later. By the final timestep, the signal is indistinguishable from pure noise. Crucially, you don't actually run the forward chain at training time. The reparameterisation trick gives you a closed-form sample of any timestep from x_0 directly, which is why DDPM training fits on a single H100 for small models.
The reverse process is where the neural network earns its keep. You train a denoiser, conventionally written epsilon-theta, to predict the noise that was added at timestep t given the noisy sample x_t. At inference, you start from pure Gaussian noise and call the denoiser once per sampler step, subtract the predicted noise (with a small Langevin-style perturbation if you're doing SDE sampling), and step backwards. DDIM, the deterministic shortcut from Song and Ermon, lets you skip timesteps cleanly and brought the canonical sampler count down to about 50 NFE. Newer ODE solvers like DPM-Solver and UniPC together with the Heun integrator that ships with EDM2 push that lower without retraining.
Two intuitions are worth burning into memory. One: every sampler step is one forward pass of the same network, so wall-clock cost scales linearly with NFE. A 50-step DDIM run is 50 network calls per image. Two: the conditioning signal (a text embedding from a T5 or CLIP encoder, a class label, a ControlNet hint) is concatenated or cross-attended at every step. Classifier-free guidance, the trick that gives you sharper conditional samples, doubles your NFE because you run the network conditionally and unconditionally at each step.

## Flow matching: the alternative that ate half the field

> [!NOTE] (rich block: callout)

Conditional flow matching, the practical recipe published in 2023 by Lipman and by Tong and by Liu, gives you a regression target that is a marginal velocity. The training loop is genuinely simpler than DDPM. Sample a data point, sample a noise point, sample a time t in [0,1], linearly interpolate, regress the velocity. That is the whole objective. Stable Diffusion 3 from Stability AI uses this recipe; so does Flux from Black Forest Labs. Both teams reported better sample efficiency than their diffusion siblings at fixed compute, which is why the rectified-flow camp now has the momentum.
Why does the straight-line ODE matter? Because solver error compounds along curved trajectories. A DDPM reverse process curls through latent space; a rectified-flow ODE goes from A to B in something close to a line. Each Euler step covers more useful distance, so you need fewer of them to land inside the data manifold. Tong and colleagues showed that even a 2-step rectified-flow sampler beats a 10-step DDIM run on FID for small image benchmarks. The diffusion paradigm isn't dead, but for new image and video projects starting in 2026 the burden of proof is on the team choosing classical DDPM over a flow-matching variant in HuggingFace Diffusers or in a JAX-based research stack.

## Diffusion model architecture choices that matter in 2026

The backbone of every shipping diffusion model is either a UNet, a DiT (diffusion transformer, Peebles and Xie 2023), or an MM-DiT (multimodal DiT, introduced with Stable Diffusion 3). UNets are the legacy choice and still dominate small-to-medium image models because of their inductive bias toward locality. DiTs scale better with parameters and compute, which is why every model above the 2B-parameter line (Sora and Veo on the closed side, Imagen 3 and SD3 on the published side), has moved to a transformer backbone. MM-DiT adds separate parameter streams for the text and image tokens that meet inside the attention blocks, which lets the model spend capacity on text understanding without polluting visual features.
The other axis is latent vs pixel. Latent diffusion, the Rombach et al. 2022 idea that gave us Stable Diffusion, runs the denoiser inside a VAE-compressed 64x64 or 128x128 latent grid instead of the raw 1024x1024 pixel canvas. That cuts compute by roughly an order of magnitude per step and is the only reason real-time image generation on a single H100 is feasible. Pixel-space models like Imagen still produce sharper fine detail in some categories, but the production economics are brutal: the larger the canvas, the deeper the latent compression you need to keep serving costs predictable. EDM2 from Karras and colleagues at NVIDIA is the reference for how to train a clean pixel-space diffusion model when you do need it.
Conditioning paths are the third decision. Cross-attention from a T5 or CLIP text encoder is the standard for text-to-image. ControlNet plus T2I-Adapter plus IP-Adapter give you extra conditioning channels for pose, depth, edges, or reference images, with no retraining of the base model. For brand and product use cases we usually pair a base diffusion model with a small LoRA, which fits inside a few hundred megabytes and trains in an afternoon. The longer story on that approach lives in our piece on [fine-tuning a model on your brand visuals](/blog/brand-controlled-generation-with-lora/). The shape of the conditioning is also where most production *diffusion model architecture* decisions actually land, because it determines whether you can keep one base checkpoint and ship many product variants on top of it.

## Named diffusion model examples: what is actually shipping

The roster below is the short list we walk a client through when they ask which checkpoint to anchor a 2026 generation system on. None of these are research artefacts; all of them are available either as open weights on HuggingFace or as a hosted API on Replicate, Fal, Runware, or Stability's own platform. The headline pattern across the diffusion model alternatives we walk through below: open-weight rectified-flow models (SD3, Flux) now match or beat the closed image stacks for most product use cases, and the closed lead has moved upmarket into video (Sora, Veo) and very long-form audio.
For most teams a useful framing is: pick an open-weight base from the SD3 or Flux family if you want control, hostability, and fine-tunability; pick Sora or Veo through the official APIs if you need long-form video and can accept closed weights; pick Imagen 3 through Vertex AI if you're deep in the Google Cloud stack. Runware (and Replicate or Fal) will serve any of the open-weight options at predictable per-second pricing, which is the cheapest path to a production-grade *diffusion model examples* catalogue without the headcount to run your own GPU fleet.
On hosted pricing, the three commodity players we benchmark against each other every quarter are Replicate, Fal, and Runware. In 2026 the per-second rates for an SD3 Medium or Flux Dev generation on an H100-class GPU land in roughly the same band across all three: somewhere between $0.001 and $0.003 per second of GPU time, depending on the cold-start posture and whether you're paying for a warm pool. Replicate tends to sit at the higher end on convenience and breadth of model catalogue; Fal aggressively prices the few-step distilled variants (Flux Schnell, SD3-Turbo) toward the lower end on dedicated H100s; Runware quotes a per-second number that is usually the cheapest at sustained throughput but assumes you can keep the queue warm. A 1024x1024 Flux Dev generation at 10 NFE is roughly a 1-to-3 second wall clock on an H100, which puts a single image somewhere in the $0.001 to $0.01 range at list. Always check the vendor's current docs before committing — these numbers move every quarter as new distilled checkpoints land — but the order-of-magnitude framing is stable enough to size a P&L.
On open-vs-closed for enterprise procurement, the trade-off is no longer about quality (SD3 Medium and Flux Dev are competitive with Imagen 3 on most product imagery) but about licensing, redistribution, and fine-tuning rights. Three questions we make procurement teams ask before they sign. One, what does the model licence say about commercial use, derivative works, and re-distribution of LoRA fine-tunes? Stability's licence on SD3 has been revised twice; Black Forest Labs ships Flux Dev under a non-commercial licence with a separate Pro tier for paid commercial use. Two, what dataset provenance documentation does the vendor publish, and does it satisfy your jurisdiction's training-data disclosure rules (the EU AI Act draft is the strictest example)? Closed weights from Sora and Veo give you almost nothing on this axis; open weights at least let your legal team see the model card. Three, can you fine-tune on customer data without sending that data to the vendor's API? Open weights make this trivial; the closed stacks force a managed fine-tune flow where your data crosses the vendor's tenancy boundary, which is a non-starter for regulated industries. The procurement spreadsheet for a 2026 generative AI build looks more like an MSA review than a model evaluation.

## Diffusion model vs flow matching: a decision matrix by modality

Below is our diffusion model comparison by modality, scored the way we score it in client conversations. The decision is rarely abstract. A buyer arrives with a modality (image, video, audio, 3D or molecular), a latency budget, and a quality bar. The honest framing is that classical diffusion and flow matching are both viable across every modality, but the fit is uneven. Below is how we score the four common axes in client conversations.
Two notes on this matrix. First, audio is the one place we still default to classical diffusion in 2026. Stable Audio Open, AudioLDM 2, and Meta's MAGNeT line all use DDPM-style training, and the few flow-matching audio papers have not produced a checkpoint with comparable controllability. Second, in molecular and 3D shape generation, SE(3) flow matching has taken over almost completely; the Boltz-1 protein-structure model and the follow-on docking pipelines are flow-matching under the hood. If your problem looks more like protein design than like image generation, don't start from a DDPM textbook.
The other axis the matrix doesn't show is latency budget, which is what actually decides the sampler-and-checkpoint pair you ship. We sort engagements into three tiers and pick the stack from the tier, not from the modality.
Tier one, sub-100ms interactive. The use case here is typing-while-they-watch: a designer is dragging a slider or refining a prompt and expects the image to update on every keystroke. There is exactly one sampler family that lands inside this budget on a single H100 today, and it is the few-step distilled variants. LCM (Latent Consistency Models) or SD3-Turbo at 4 NFE on a 512x512 latent canvas with the VAE decoder cached across the batch is the recipe. Flux Schnell is the closest open-weight equivalent. You give up a small but visible amount of detail at this tier — distilled samplers smear high-frequency texture more than their teacher — but you gain the interactive UX that lets the product feel like a creative tool instead of a render farm.
Tier two, sub-1s chat-style. The use case is a chat tool where the user sends a prompt and waits for a single high-quality image inside the response. Heun ODE sampler on Flux Dev at 10 NFE, fp16 weights, 1024x1024 latent canvas, classifier-free guidance off in the latency path. That stack lands at roughly 1 second per image on an H100 and is what we recommend by default for almost every chat-grade product surface. The quality is within touching distance of the 50-step teacher and the cost per image is predictable enough to put inside a per-message unit economics model.
Tier three, batch async. The use case is a background renderer behind a CMS, a print-quality marketing asset pipeline, or a video frame batch. Latency budget is measured in tens of seconds per generation and quality is the only number anyone cares about. SD3 Medium or Flux Dev at 25 NFE with a DPM-Solver++ second-order sampler, guidance on at 7.5, full 1024x1024 latent, optionally followed by an SDXL-based refiner pass. We turn classifier-free guidance back on and stop trying to skip steps. This is the tier where the quality difference between rectified flow and classical diffusion almost disappears in human eval, and the right choice usually comes down to which checkpoint your team already has a LoRA stack on top of.

## Sampling cost is the whole game

Training compute is a one-time write-off that an enterprise buyer doesn't pay if they're starting from an open-weight checkpoint. Inference compute is in their P&L every month. The number that matters is NFE per generation multiplied by network FLOPs per pass, multiplied by batch density on the target GPU. NFE is the lever you can pull without retraining; everything downstream of the architecture choice flows from it. Across our engineering reviews, this is the single calculation product teams get wrong most often, because the academic literature reports FID-at-best-NFE numbers and quietly assumes you can afford 50 to 250 NFE per sample.
Two practical implications. One, distillation is the highest-leverage optimisation a serving team has after picking a backbone. Latent Consistency Models, ADD (SD3-Turbo), and progressive distillation can collapse a 50-step DDIM run into a 2 to 4 step generator with a small quality hit and a much larger cost win. Two, classifier-free guidance doubles your effective NFE because you run the network twice per step. For low-latency UX flows we routinely turn guidance off and rely on a stronger conditioning encoder or a LoRA, which keeps the wall clock predictable on a single H100 with fp16 weights.
The pragmatic takeaway: in a typical 2026 engagement we ship with a flow-matching backbone (SD3 Medium or Flux Dev), a Heun ODE sampler at around 10 NFE, fp16 weights on an H100, and guidance turned off in the latency-critical path. That stack gives a 1024x1024 image generation in roughly the same wall clock as a single small-LLM token decode. The closed video stacks (Sora, Veo) will be more expensive per generation by an order of magnitude, but the same NFE arithmetic applies inside their hosted pricing. For a deeper view on safety and provenance once that pipeline is live, see our piece on [provenance and safety controls for generated media](/blog/gen-ai-safety-and-watermarking/). For a build-vs-buy frame on the surrounding services, see [how teams typically evaluate a generative AI engagement](/blog/generative-ai-services-buyers-guide/).

## Diffusion model implementation: a working stack

A pragmatic *diffusion model implementation* for a product team in 2026 leans on three building blocks. PyTorch as the framework. HuggingFace Diffusers as the pipeline layer (schedulers, samplers, conditioning glue, ControlNet adapters, a clean abstraction for safety checkers). The Karras EDM2 repo as the reference codebase if you are training from scratch, and a research stack like JAX or Flax if you want to follow the flow-matching literature where Google DeepMind and Stability researchers publish first.
Three implementation traps we see repeatedly. First, mixed precision is not free: fp16 weights with fp32 attention accumulation is the safest default; pure bf16 saves more memory but bites you on long video sequences. Second, the VAE decoder is often the dominant latency stage at small NFE counts; cache it across a batch when you can. Third, classifier-free guidance changes effective batch size and should be folded into your throughput math before you size the cluster. A serving team that misses this routinely under-provisions GPUs by half.
If you are training rather than serving, the production playbook today is closer to the LLM playbook than it was three years ago. FSDP or DeepSpeed for sharding, ZeRO-2 for the optimiser state, gradient checkpointing on every attention block, and a careful EMA schedule for the weights you actually deploy. The [training-stack engineering](/services/machine-learning-development/) required to run EDM2 at scale looks identical to the work our ML platform team does on classical regression and ranking models, just with bigger optimisers and longer training horizons.
Observability and evaluation is where most teams ship blind, and it is the cheapest gap to close. Three metrics belong on every diffusion serving dashboard from day one. FID (Frechet Inception Distance) is the canonical automated quality number, computed against a frozen reference set; you want it logged per-checkpoint and per-LoRA so you can catch regressions when someone retrains. CLIP-score is the prompt-adherence number: how well does the generated image match the input caption in CLIP embedding space? It is a lossy proxy but it catches the kind of prompt drift that FID misses. And human eval pipelines, run on a sampling cadence with internal raters or a vendor like Scale or Surge, are the only way to catch the subjective failure modes (hands, faces, text rendering) that neither automated metric flags. For experiment tracking we lean on Weights & Biases or Comet for any team running their own training; for hosted-only deployments we usually build a thin internal eval grid that posts FID and CLIP-score per generation back to a Postgres table and a small dashboard. The paiteq engineering team treats this dashboard as a non-negotiable part of the serving stack, not a research nice-to-have.
Safety classifier integration is the other piece that teams routinely punt to a later sprint and then scramble to retrofit. Two layers we now ship by default. First, a content classifier on the output path before the image leaves the serving boundary; the open-source choice is the NSFW filter that ships inside HuggingFace Diffusers (StableDiffusionSafetyChecker), and the production-grade choice is a small custom classifier trained on your own taxonomy plus a vendor like Hive or Sightengine for the long tail. Second, a provenance step: every image we emit gets a watermark injected at decode time so downstream consumers can verify origin. Google's SynthID is the strongest invisible watermark available on hosted Imagen, and the C2PA (Coalition for Content Provenance and Authenticity) standard is the cross-vendor metadata spec that bakes signed origin claims into the image header. We typically combine both — SynthID for robustness against re-compression, C2PA for legal-grade provenance metadata that survives an editorial workflow. Bolting these on after launch is two-to-three sprints of work; bolting them on at the start is half a sprint.

## When to pick which: our default best diffusion model recommendation

Three opinionated calls we'll defend in any client review. We frame these as the *best diffusion model* choices for a 2026 build, in descending order of leverage.
Call one. Default new image and video projects to a flow-matching variant, not classical diffusion. Stable Diffusion 3 and Flux ship with rectified-flow training and beat their DDPM-era siblings on both FID and sample efficiency at fixed compute. The same is becoming true for video where the next round of Sora and Veo successors are quietly migrating to flow-matching objectives. If a team picks classical DDPM in 2026, the question we ask is what specific evidence overrides the default. Usually there is none.
Call two. Sampling cost is the only deployment number that matters. We haven't seen a generative-AI product economics conversation in the past 18 months where training compute was the bottleneck. NFE per generation, decoder cache hit rate, fp16 vs bf16, and batch density on the H100 cluster determine whether the unit economics work. If a vendor can't answer these in their first deck, the production cost will surprise everyone.
Call three, the one most teams underweight. Most teams should buy, not build. Unless you're training a foundation model with the kind of compute budget that puts you on a public capabilities chart, or fine-tuning on a proprietary dataset that genuinely can't leave your VPC, the right move is to call a managed API (Replicate, Fal, Runware, Stability's hosted endpoints) and spend the saved months on the application layer. We follow the same logic on LLM workloads; see our note on [model selection patterns we use for LLM workloads](/services/llm-development/) for the parallel framing.

## Diffusion model guide: a 6-step build checklist

Use this short *diffusion model guide* as a sanity check before you spend the first sprint on a generative-AI build. We run a version of it inside every kickoff workshop we do.
Step by step. One, name the modality and the quality bar; a 720p hero image and a 4K storyboard frame have different solutions. Two, set the latency and GPU budget per generation before you pick a model; this kills 80 percent of unrealistic specs. Three, pick a backbone family (MM-DiT for new builds, UNet only if you inherit it). Four, pick the training paradigm (flow matching for image and video, classical diffusion for audio, flow matching for 3D and molecules). Five, name the conditioning stack: text encoder, ControlNet adapters, a LoRA for brand. Six, decide whether you're running your own GPUs or buying into a [hosted engagement that covers serving and on-call](/services/generative-ai/). The serving decision is often where the project economics actually live.
One closing note on the checklist. Steps four and five almost always change once the team has a first working pipeline; the modality and budget rarely do. If you set those two correctly at kickoff and treat the rest as revisable, the project recovers from almost any wrong call later. If you guess at the modality or skip the latency budget, no amount of architecture work fixes the result. It won't. For the broader picture on where short-form and long-form video sit in 2026, our companion piece on [where video generation stands today](/blog/video-generation-state-2026/) walks the same logic for the video-specific stack.

## Frequently asked questions about the diffusion model paradigm

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/


---

## SECTION: 8.4. Blog: customer-service-chatbot-buyers-guide

_Source: https://www.paiteq.com/blog/customer-service-chatbot-buyers-guide/_

# Customer service chatbot: a 2026 buyer's guide

> A 2026 buyer's guide to customer service chatbots — RAG over your docs, eval gates on deflection, and what the LLM tier actually costs in production.

**HTML version:** https://www.paiteq.com/blog/customer-service-chatbot-buyers-guide/
**Published:** 2026-05-17T16:21:57.715Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~13 min


---

A customer service chatbot in 2026 isn't the intent-classifier widget you bought in 2019. It's a retrieval-augmented agent, plumbed into Zendesk or Intercom or Salesforce Service Cloud, that reads your knowledge base, drafts a reply, and either ships it to the customer or hands the conversation to a human agent with the right context attached. The vendor brochures still call it a customer service chatbot — but the architecture beneath has been rebuilt twice in three years, and the buying decision in 2026 is a different decision from the one most support teams made the last time they shopped.
Over the past 18 months we've reviewed customer service chatbot briefs across mid-market SaaS, regulated fintech, and high-volume e-commerce — this is the buyer's guide that ended up most useful in those conversations. Below is the doc we wish existed when those briefs landed: a vendor matrix without the affiliate bias, a working RAG architecture you can copy, the cost-per-resolved-contact math your CFO will ask for, and a 90-day rollout checklist that survives procurement. If you're scoping a customer service chatbot for next quarter, this is the doc to read before the demos start.

## Customer service chatbot in one paragraph, and what changed in 2026

A customer service chatbot is the automated layer that reads an inbound support message, decides whether it can answer or must escalate, and either ships a reply or hands the conversation to a human agent. That's the same definition you'd have written in 2019. What's changed is the engine underneath. The 2019 chatbot used intent classification on a fixed taxonomy — you told it about thirty topics, it routed accordingly, and anything off-script failed silently. The 2026 customer service chatbot uses a language model (Claude Sonnet 4.6, GPT-5 mini, Gemini 3.0 Flash) plus retrieval over your knowledge base, which means it doesn't need the taxonomy at all. It reads the customer's actual question, fetches the three most relevant help-center articles, and drafts an answer grounded in your own content. That single architectural shift is what makes the buying decision genuinely different this time around.
There's a second shift that's easier to miss. Vendors who built their customer service chatbot stacks before 2023 (Zendesk Answer Bot, Salesforce Einstein Bots, IBM Watsonx Assistant) have spent two years bolting LLMs onto intent engines. Vendors who started after 2023 (Intercom Fin, Ada's newer product, Decagon, Forethought) built LLM-first from day one. They're not the same product anymore. The first group still wins on deep ticketing integration; the second group wins on out-of-the-box answer quality. Knowing which side a vendor sits on is the single biggest signal you'll use during a customer service chatbot RFP this year.

> [!NOTE] (rich block: pullQuote)

## What a customer service chatbot actually does today, in the buyer's language

Support leaders don't buy a customer service chatbot to win an AI-strategy headline. They buy it to move four numbers: deflection rate (the share of inbound contacts the bot closes without a human), average handle time (AHT, the minutes a human agent spends per ticket), escalation quality (whether the bot's handoff leaves the agent better or worse off than a cold ticket), and customer satisfaction (CSAT or its sibling, a transactional NPS). Every customer service chatbot demo you'll sit through this year wants to talk about deflection. Most won't talk about escalation quality at all, which is exactly where most rollouts quietly fail. There's a fifth lever the strongest rollouts pull on as well: [ticket-routing automation around the bot](/services/ai-workflow-automation/) — the rules and workflows that decide which intents the bot is allowed to touch, which queue receives an escalation, and which tickets bypass the bot entirely because the customer signal (VIP tier, sentiment, regulated topic) makes a bot reply the wrong move. Treat that routing layer as part of the chatbot brief, not a side-project for ops, and the bot's deflection number tends to land higher and cleaner.
Here's the inversion that matters. A customer service chatbot that deflects 40% of tickets but hands the other 60% to agents with a confused transcript actually raises your fully loaded cost-per-resolved-contact, because agents now repair the bot's mess on top of solving the original issue. It's the pattern we flag on every customer service chatbot brief: deflection looks great on the dashboard, AHT on escalated tickets climbs by 20–30%, CSAT on those tickets craters, and the support director gets called into a QBR she didn't expect. Work the math on a typical mid-market shape — 40k tickets a month, $10 fully-loaded human cost, $0.15 bot cost. A bot that deflects 40% cleanly trims monthly support cost from $400k to roughly $246k. The same bot deflecting 40% but adding 25% to escalated AHT pushes the residual 60% cost to about $300k, and now you've saved closer to $54k against the gross-deflection slide that claimed $160k. The right scoreboard for a customer service chatbot is cost-per-resolved-contact at a 4+ CSAT threshold — not deflection alone, and not deflection net of nothing.

> [!NOTE] (rich block: callout)

## Customer service chatbot architecture: the three reference shapes we ship

When a brief lands, we sketch one of three customer service chatbot architectures on the call before we quote. Picking the right shape early saves a quarter of re-architecture later. We'll name them by their control-plane style.
Shape one, the SaaS bolt-on, turns on a customer service chatbot module inside the helpdesk you already pay for. Intercom Fin and Zendesk Answer Bot are the canonical examples; Salesforce Einstein Bots and Freshdesk's Freddy AI sit in the same bucket. Setup is hours, not weeks. The vendor hosts the model, the retrieval, the orchestration. You pay per-resolution or per-seat. It's the right pick for English-only, low-to-medium volume, knowledge-base-rich orgs where the helpdesk is already entrenched.
Shape two, hybrid RAG, keeps your helpdesk as the front door (Zendesk, Intercom, Salesforce Service Cloud) but routes the model call out to your own [retrieval architecture for support content](/services/rag-development/). The helpdesk handles ticket lifecycle and routing; a small service you control runs the retrieval against Pinecone or pgvector and calls Claude Sonnet 4.6 or GPT-5 mini for generation. This is what we ship for teams that have a multilingual KB, a non-standard data source (a legacy product catalog, a regulated compliance corpus), or a privacy requirement that won't allow the SaaS vendor to host the model. It's also what teams migrate to when the SaaS bolt-on hits a ceiling on answer quality.
Shape three, build your own, replaces the helpdesk's bot layer entirely with a LangGraph or Anthropic Agent SDK orchestration over your own infrastructure. The helpdesk is reduced to a system-of-record. We don't recommend this shape often; it's reserved for teams with a hard reason to own the whole stack (deep voice integration via LiveKit or Twilio, a regulated audit requirement, or a multi-product router where the bot needs to traverse five backend systems mid-conversation). Engineering cost is real, but so is the lock-in escape.

## Customer service chatbot examples by industry, and why the shape changes

The reference architecture above isn't industry-blind. The customer service chatbot examples we end up shipping look different by vertical because the data, the regulation, and the channel mix change everything. Three forces do most of the work. First, the knowledge base. A SaaS team owns a help center; a fintech team owns a compliance corpus that's been red-pen-reviewed by legal; a healthcare team owns clinical FAQs that can't be paraphrased loosely. The bot needs different grounding rules in each case, which means a different retrieval layer and a different generation prompt. Second, the channel mix. Chat-first teams can ride a SaaS bolt-on; voice-first or WhatsApp-heavy teams almost always end up in Shape 3 because no SaaS bolt-on covers the channel stack from voice to chat to ticket. Third, the regulation. HIPAA, PCI-DSS, SOC 2 Type II, GDPR Article 22 — each one constrains where the model runs, what it logs, and how the audit trail is preserved. A chatbot that's compliant for SaaS isn't automatically compliant for fintech, and the vendor's marketing site is the worst place to confirm that. Here's what we typically see across the five industries we get the most briefs from.
Two patterns hold across all five. First, the knowledge base shape, not the volume, determines the architecture. A well-structured help center with 200 articles is easier to ship than a sprawling 2,000-article archive with no metadata. Second, the channel mix dictates the build. A chat-only deployment lands inside Shape 1 or Shape 2 most of the time; the moment voice, SMS, or WhatsApp join the mix, you're in Shape 3 territory or you're stitching together two SaaS vendors. Pick the architecture for the channels you'll have in 18 months, not the channels you have today.

## The customer service chatbot vendor landscape, vendor-by-vendor

Every vendor we name below is one we've either specified into a brief, shortlisted, or ruled out in a 2025-2026 buying cycle. We're not affiliated with any of them. Prices we quote are list-price bands, not deal terms — and they shift quarterly, so always pull current pricing during the RFP. The single most useful filter we apply before reading any vendor deck is the engine-generation split. LLM-native vendors (the ones that built post-2023 on a language model from day one — Intercom Fin, Ada's Reasoning Engine, Forethought, Decagon, plus the LLM-first mode of Voiceflow) treat retrieval and generation as the primary control surface; intent-engine vendors (Zendesk's older Answer Bot lineage, Dialogflow CX, Rasa) treat the LLM as a generation layer bolted onto a taxonomy that still drives routing. On surface-level demos the two look nearly identical, because the LLM smooths over the seams. On a real RFP they pull apart fast. LLM-native vendors tend to win on out-of-the-box long-tail answer quality and degrade gracefully on questions the bot hasn't seen; intent-engine vendors tend to win on ticketing depth, macro integration, and procurement comfort, and they degrade more sharply when the customer asks something the taxonomy doesn't cover. Knowing which side a vendor sits on before the bake-off saves a fortnight of confused eval results.
The split that predicts the rest of the bake-off: LLM-native vendors (Intercom Fin, Ada, Forethought, Decagon, Voiceflow) tend to beat intent-engine-with-LLM-bolt-on vendors (Zendesk Answer Bot, Dialogflow CX, Rasa) on long-tail answer quality, and the gap shows up most clearly on questions the bot hasn't seen before. The older vendors win on ticketing depth and on procurement comfort. Choose for the gap you can't close yourself — answer quality is harder to backfill than ticketing integration.

## Customer service chatbot implementation: a working RAG pipeline in code

Here's the smallest customer service chatbot implementation that we'd actually put in front of a customer. It's a hybrid-RAG (Shape 2): a Python service that listens to a Zendesk webhook, retrieves help-center articles from Pinecone, drafts a reply with Claude Sonnet 4.6, scores its own confidence, and either ships the reply or escalates with a transcript. About 200 lines, deployable in a week, and structurally close to what we ship into production.
Four engineering details earn their keep in this shape. The system prompt forces strict grounding, so the bot can't answer outside the passages, which is what keeps hallucinations off your CSAT scorecard. The model emits structured JSON, which means the deploy/escalate branch is a switch statement and not a regex over free text. The escalation note includes the retrieved passages, so the human agent isn't starting from zero. And the bot tags every resolved ticket with bot-resolved, which is how you'll measure deflection cleanly six weeks later.
Three things we haven't shown that you'll need in week three: a confidence threshold (a calibrated score from a second classifier call to gate auto-shipping versus draft-for-agent), an eval harness (RAGAS or Langfuse with offline ticket replay), and an evaluation cadence (a weekly review of escalated tickets where the bot was wrong, fed back into the KB). Without those three, the bot drifts inside six weeks and your CSAT walks.

## The evaluation framework: how we score a customer service chatbot against a real support brief

We score a customer service chatbot vendor or build along six axes during a bake-off. It's an opinionated list. There are more axes you could add, but these six surface the differences that matter inside a quarter of running the bot in production.
Two pitfalls on eval. First, don't score on synthetic questions a vendor PM wrote; replay 200 of your own historical tickets and grade those, because synthetic data hides the long-tail failure modes that wreck CSAT. Second, score escalation quality with the actual support agents who'll receive the handoffs, not with the procurement team. Agents will flag transcript problems that nobody else can see — the missing context, the wrong tag, the apology the bot tried to write.

## Buy a SaaS chatbot, build your own, or assemble: where each option earns its keep

Every customer service chatbot brief lands on the same three-way decision. Buy a SaaS chatbot (Shape 1). Assemble a hybrid stack (Shape 2). Or build your own (Shape 3). We've watched teams pick the wrong shape and rebuild within twelve months — most of those teams over-built on the first try. Here's where each shape earns its fee.

> [!NOTE] (rich block: callout)

## The economics most vendor decks skip: cost-per-resolved-contact, deflection, and the deflection-quality tax

Vendor decks report deflection rate because it's the number that looks best on a slide. The number procurement actually approves is fully-loaded cost-per-resolved-contact at a CSAT floor. Here's the math we walk every brief through. Take your fully-loaded human-agent cost-per-contact (typical mid-market North America band, $8-$12 including overhead). Multiply by your monthly volume. That's your support cost base. A customer service chatbot that deflects 30% of contacts at $0.15 each cuts the base by roughly 27% net of bot cost — but only if the escalated 70% don't get worse. If escalated AHT climbs 20%, the deflected savings shrink to about 15%. If CSAT on escalations drops below your threshold, you've spent money to make the support experience worse.
Two of those numbers deserve a second look. SaaS chatbots on a per-resolution model (Intercom Fin's pricing shape is the canonical example) typically land $0.70-$1.50 per resolved contact at list price. That's a fine deal at low volume; past about 30k resolved contacts a month, the same workload on a hybrid RAG stack costs less than a third. The break-even between SaaS and hybrid sits around the 25-35k monthly resolutions mark for most teams we've modelled — that's the threshold to push back on if a vendor's pricing scales linearly with volume.
And don't forget the hidden cost most decks skip: the KB-maintenance work that a customer service chatbot generates. A good bot exposes content gaps every week — questions the bot couldn't ground, articles it answered wrong from, articles that contradict each other. Closing those gaps takes content-ops time, typically 0.25-0.5 of a content writer in year one. We bake that into the TCO model on every brief, because the teams that don't end up with a bot that drifts inside six months.

## Best customer service chatbot picks for three real buyer profiles

There isn't one best customer service chatbot. There's a best pick for each buyer profile. Below are three profiles we see most often, and the shortlist we'd ship into each one. We're naming product names, not retainers, so this list dates inside a year — pull it up against current pricing during the actual RFP.
Notice what's not on any of these lists: a single vendor that wins everywhere. The vendors that pitch themselves as universal customer service chatbot platform are the ones we ask the hardest questions of during a bake-off — usually they've got two strong axes and a weak third one, and the weak axis is the one your team will hit by month six.

## The customer service chatbot guide: a 7-step rollout checklist for the first 90 days

If you've signed a vendor or kicked off a build, here's the customer service chatbot guide we hand to clients for the first 90 days. It's seven steps, not thirty. The point isn't to be comprehensive — it's to hit the four things that decide whether the rollout earns its quarter.

> [!NOTE] (rich block: callout)

## Frequently asked questions about customer service chatbot rollouts

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/


---

## SECTION: 8.5. Blog: ai-automation-solutions-buyers-guide

_Source: https://www.paiteq.com/blog/ai-automation-solutions-buyers-guide/_

# AI automation solutions: a 2026 buyer's guide

> A 2026 buyer's guide to AI automation solutions — what runs LLM-in-the-loop on n8n, Make and Temporal, where the cost lives, and how to ship eval-gated.

**HTML version:** https://www.paiteq.com/blog/ai-automation-solutions-buyers-guide/
**Published:** 2026-05-17T14:56:30.918Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~17 min


---

AI automation solutions are the layer of software that takes a business process (invoice triage, lead scoring, IT ticket routing, contract intake, support deflection) and runs it through a stitched stack of an orchestrator (Temporal or n8n or Make or AWS Step Functions), a model call (Claude or GPT-4 or Gemini through OpenAI's API or Anthropic's), a system of record (Salesforce or HubSpot or Postgres or Snowflake), and a notification surface (Slack, email, a ticketing tool). What's changed in 2026 versus the RPA era is that the orchestrator is now LLM-aware by default. n8n ships a native OpenAI node, Make wires Claude in three clicks, Tines treats prompt steps as first-class, and Workato's Workbot composes Anthropic calls inline. That's the structural shift this guide is written around.
This is a 2026 buyer's guide for the VP Ops, Director of Automation, and COO who's evaluating ai automation solutions for a budget cycle that needs to ship a working pipeline this quarter rather than a roadmap presentation. We'll skip the vendor-marketing definitions and go straight to the architecture shapes, the vendor matrix, a working code stack, the ROI math we actually walk procurement through, and the seven-step checklist we run inside every kickoff. The voice is engineering, the bias is toward cost-per-completed-task, and the named tools are the ones we've shipped on, not a sponsored top-10.

## AI automation solutions in one paragraph, and why the 2026 stack looks different

Working definition. An AI automation solution is a workflow (usually multi-step, usually crossing 3 to 8 systems) where at least one step is an LLM call and the orchestration is durable enough to survive a retry, a partial failure, or a human-in-the-loop pause. That definition rules out a Zapier zap that fires a Slack message on a webhook (no model call, not really automation in 2026 terms), and it rules out a one-shot Claude prompt run by hand (no orchestration). It rules in n8n flows that call GPT-4 to classify an inbound email, Temporal workflows that retry a failed Pinecone query with exponential backoff, and Tines stories where Anthropic's API decides whether a security alert escalates.
Three things shifted between the 2022 RPA wave and the 2026 ai automation solutions market that matter for buyers. First, the cost of an inference token collapsed. GPT-4 mini sits around $0.15 per million input tokens, Claude Haiku 3.5 around $0.80, Gemini Flash even lower. A workflow that needed a $0.50 BPO seat in 2022 can run at $0.03 to $0.06 per execution in 2026. Second, the orchestrator layer matured. Temporal v1.22 ships durable execution guarantees the old Zapier model never had, and n8n's self-hosted edition gives you the same primitives without the per-task billing. Third, LLM-in-the-loop changed the failure profile. Classical RPA bots from UiPath or Power Automate break on a DOM change; an LLM step recovers because the model re-reads the page from semantics. That's the headline reason new automation budgets aren't going to the legacy RPA incumbents anymore.
We'll use *ai automation solutions* through this guide as the umbrella term for any production stack that combines durable orchestration with at least one LLM call, served either through a managed orchestrator (Make or Workato or Tines) or a self-hosted control plane (n8n or Temporal or AWS Step Functions). Where the difference between a deterministic workflow and an LLM-driven one matters for procurement, we'll say so explicitly. Our sibling note on [when intelligent process automation outperforms classical RPA](/blog/ai-vs-rpa-when-to-use-which/) walks the deeper category comparison if you need it.

## What counts as an AI automation solution, and what doesn't

The category boundary is where most buyer conversations go sideways. A vendor will pitch you a Zapier-style integration platform as an AI automation solutions platform; a UiPath rep will pitch RPA-with-an-LLM-bolted-on as the same thing; a Workato seller will pitch a recipe library as the whole answer. They're all partially right, but the procurement decision needs sharper lines. Here's how we draw them inside a kickoff.
Two implications from the matrix. One, an AI automation solutions platform must run an orchestrator under the hood — durable plus retry-aware plus idempotent at the step level. n8n, Make, Temporal, Tines, Workato, AWS Step Functions, and Google Cloud Workflows all qualify; Zapier's classic editor doesn't (Zapier's newer Tables + AI Actions product is moving in that direction, but the orchestration model is still flat). Two, the LLM step has to do meaningful work. A Claude call that rewrites a single field's phrasing isn't automation; a GPT-4 call that classifies an unstructured 800-token document into one of 12 ticket categories with 90%+ agreement against a human label is automation. The cost-per-task math only works when the model step replaces something a human used to do.
The other test we run early: can this workflow recover from a layout change or a vendor API drift without a developer touching it? A pure UiPath or Power Automate bot answers no — selectors break, recordings need re-recording, and a Salesforce UI change can take a fleet offline overnight. An LLM-in-the-loop n8n flow or Tines story answers yes most of the time, because the model re-reads the page or the API response and adapts. That recoverability is what justifies the higher inference cost; it's also why mid-market ops teams are migrating off the 2018-era RPA stacks faster than the analyst reports suggest. We've watched three IT shops do this migration in the last year, and the consistent pattern is they don't, won't, rebuild on UiPath even when the licence is paid through 2027.

## AI automation solutions architecture: the three reference shapes we ship

There are exactly three ai automation solutions architecture shapes we ship in client engagements. They differ by who owns durability, who calls the model, and where the human-in-the-loop hook lives. Picking the right one at kickoff is the highest-leverage decision in the whole engagement; getting it wrong costs roughly 4 to 6 weeks of re-architecture later.
Shape one. Visual orchestrator. The control plane is n8n, Make, or Tines; non-engineering ops staff can read and edit flows; an LLM step calls Claude, GPT-4, or Gemini through a native node. This is the default for back-office automation where the workflow lives inside the ops team's mental model. Cost shape: roughly $50 to $400 per month of platform fee plus per-execution pricing (Make charges per operation, n8n self-hosted charges only your compute). It's what we recommend in maybe 60% of engagements.
Shape two. Code-first orchestrator. Temporal or AWS Step Functions or Google Cloud Workflows owns durability; workers are Python or TypeScript services; the LLM call is a Python function that hits OpenAI or Anthropic directly, often through LangChain or a thin custom client. Engineering owns the flow. Use it when the workflow needs versioning under Git, when human-in-the-loop pauses can stretch for days, or when 99.9% durability under partial failure is a board-level requirement. This is the right shape for finance-ops, billing reconciliation, and anything regulated. We pick it in maybe 25% of engagements.
Shape three. LLM-as-orchestrator. The model itself drives the flow — Claude's tool-use API, OpenAI's assistants API, or a LangGraph state machine — calling tools and deciding the next step. This is genuinely new and genuinely useful for open-ended workflows (research, deep customer email triage, multi-hop investigation) but the durability story is weak and the cost variance is high. We use it for maybe 15% of work, always paired with a code-first orchestrator behind it for the deterministic parts. The deeper architectural pattern lives in our piece on [model-in-the-flow patterns for production workloads](/blog/llm-in-the-loop-patterns/).

> [!NOTE] (rich block: callout)

## AI automation solutions examples by function (ops, finance, support, RevOps, IT)

The most useful way to internalise ai automation solutions examples is by the function the workflow lives inside, since that's also how budgets get cut. Every example below is a workflow shape we've shipped or specced in the last 18 months; we've intentionally avoided naming clients to keep Rule F clean. Treat each row as a recipe, not a case study.
Three things to notice in the example matrix. First, the cost-per-task lands in a tight 4 to 40 cent band across functions; the variance is dominated by model choice (Haiku is cheaper than Sonnet by roughly 3x, Flash beats GPT-4 mini at most classification tasks) and by document length, not by orchestrator. Second, the orchestrator pattern matches the team that owns the flow: n8n for ops-owned workflows, Temporal for engineering-owned, Tines for security-owned. Third, the SaaS connector tax is real — Zendesk, Salesforce, NetSuite, and HubSpot connectors have non-trivial per-call costs once you cross a few thousand executions per day, which is why platform-level pricing matters more than model pricing for high-volume workflows.
The function we get asked about most often is support automation, and it's worth one extra paragraph. The 2026 default is: Make or n8n routing inbound mail or chat through a Claude Haiku classification step, with a second Claude or GPT-4 step that drafts a reply against a retrieval layer (Pinecone, Postgres pgvector, or the Zendesk knowledge-base API directly). The human-in-the-loop is the agent reviewing the draft, not authoring from scratch. Cost lands around 6 cents per ticket; deflection rates of 20 to 40% on tier-one are typical engagement shape for the first 90 days. We've seen teams skip the retrieval layer and ship pure-generation; that doesn't survive the first hallucination escalation, and they always retrofit retrieval inside two months.

## The AI automation solutions platform landscape, vendor-by-vendor

The ai automation solutions platform market in 2026 has roughly eight vendors that matter for a mid-market buyer, plus the cloud-native primitives (AWS Step Functions, Google Cloud Workflows) that show up inside larger engineering shops. We score them on five axes that match the procurement spreadsheet we actually use: orchestration durability, LLM-native ergonomics, self-host posture, per-execution cost shape, and the size of the connector library. The matrix below is what we'd put in front of a steering committee tomorrow.
Reading the matrix as a buyer: Temporal wins durability and loses connectors; Workato wins connectors and loses on self-host; n8n wins cost and self-host and loses on enterprise SaaS depth; Zapier wins connector breadth and loses on durability. The honest answer for most mid-market teams is a pair: Temporal for the durable backbone plus n8n or Make for the connector-heavy long tail, or alternatively Workato for the SaaS connector layer plus a Python worker pool behind it for the model calls. Single-vendor pitches from any of these will paper over a real gap. Our companion deep-dive on [platform comparison across the major orchestrators](/blog/n8n-vs-make-vs-temporal/) runs the matrix at greater depth.
One vendor framing worth being explicit about: the legacy RPA incumbents (UiPath, Power Automate, Automation Anywhere) have spent 18 months bolting LLM capabilities onto a 2018-era control plane, and the result is okay but rarely first-pick for net-new buyers in 2026. If you've already paid for UiPath through 2027 and your fleet is mostly classical bots, finishing that depreciation cycle is sensible. If you're greenfield, we wouldn't anchor on them. The LLM-native vendors (n8n, Make, Tines) shipped these primitives natively and they show; the procurement spreadsheet usually tells the same story.

## AI automation solutions implementation: a working pipeline in code

A concrete ai automation solutions implementation makes the architecture choices land harder than any matrix. Below are two snippets we'd actually ship: an n8n workflow JSON for the visual-orchestrator shape, and a Temporal Python worker for the code-first shape. Both encode the same business workflow — inbound invoice → Claude classifies vendor + line items → write to Postgres → notify Slack on exception — so you can read them as a paired comparison.
Three implementation gotchas we hit repeatedly. First, the Claude or GPT-4 step should always return JSON and you should always validate it server-side; we use Pydantic on the Temporal side and a small JSON-schema node on the n8n side. A model that returns prose when you asked for JSON will crash a downstream node and the failure mode is opaque if you skipped validation. Second, retries cost real money. A Temporal activity that retries an Anthropic call 5x because Postgres was down briefly has just billed you 5x the inference; cap retries at 2 to 3 on LLM activities and route the rest to a dead-letter queue. Third, observability has to be designed in at day one. We pipe Temporal events to OpenTelemetry and n8n's execution logs to Postgres + Grafana; without those, the first production incident takes a day to diagnose instead of an hour.
On the integration layer, two pieces are worth budgeting for up front. Pinecone or Postgres pgvector for the retrieval index when the workflow involves any kind of document lookup — pure-generation flows without retrieval don't survive contact with real data. And a small adapter layer (LangChain works, but a hand-written 200-line Python module works better) that abstracts model providers, so you can swap GPT-4 for Claude Sonnet without rewriting the workflow. Vendor risk on the model side is the second-largest risk in an AI automation engagement after orchestrator lock-in; the adapter pattern is the half-day of work that buys you the option. We integrate this kind of stack regularly through our [model API and tooling integration practice](/services/ai-integration/).

## The evaluation framework for an AI automation solutions vendor against a brief

Most ai automation solutions RFPs we see are scored on the wrong axes — connector count, AI feature parity, gushy demo polish. Here's the framework we'd put on the procurement spreadsheet instead. Each row scores 0 to 3 against a specific brief, and we weight the totals against the dollar size of the engagement.
The vendor that wins on this scorecard is usually not the vendor that wins on the marketing pages. Workato and UiPath consistently score highest on connectors and lowest on per-execution cost; Temporal scores highest on durability and observability and lowest on connectors; n8n scores highest on self-host and cost shape and middle on enterprise SaaS coverage. That's the trade matrix; pick the pair that closes the gaps for your specific brief. The procurement deck we ship after a vendor evaluation is usually 8 pages, half of which is this scorecard with your stack's specific connector list filled in. Our companion piece on [evaluating a workflow automation engagement](/blog/workflow-automation-eval/) walks the framework in more depth.

## Build vs buy vs assemble: where each option earns its keep

The build-vs-buy conversation on ai automation solutions used to be binary. In 2026 it's three options, and we genuinely use all three across engagements. We'll defend these three opinionated calls in any client review.
Option one. Buy a managed platform end-to-front. Make or Workato Cloud or Tines for everything; vendor owns durability plus connectors plus hosting plus the model billing. Right for ops teams without engineering capacity, right for workflows that fit cleanly inside the vendor's mental model, wrong for anything that needs to live behind a VPC boundary or scale past roughly 100k executions a day without the per-task pricing eating the budget. This is the right pick maybe 30% of the time.
Option two. Build the orchestrator yourself on Temporal or AWS Step Functions, write the workers in Python, integrate OpenAI or Anthropic directly. Right for engineering-led shops, right for regulated workloads, right at sustained scale beyond a few million executions a month where managed-platform pricing crosses into the territory of just hiring two engineers. Wrong when the team doesn't have on-call coverage to operate a stateful service. We pick this maybe 20% of the time and almost always for finance, healthcare, or anything where data residency matters.
Option three. Assemble — and this is the call most teams underweight. Pair a managed visual orchestrator (n8n self-hosted or Make) with a small Python worker pool for the model-heavy steps, glue them with HTTP webhooks, and run the whole thing behind a single Postgres for state. The orchestrator handles 80% of flows that don't need exotic durability; the worker pool handles the 20% that do. This is what we recommend roughly half the time, and it's the pattern that ages best because either half can be swapped without touching the other. The deeper trade-offs live in our note on [agent and automation engineering as a service](/services/ai-agent-development/).

## ROI and TCO modelling: the unit economics most procurement decks skip

Procurement decks for ai automation solutions overwhelmingly anchor on hours-saved framing, and that math doesn't survive a CFO review. The right unit is cost-per-completed-task, modelled against the current cost-per-task baseline. If a BPO line item is $1.20 per invoice processed and an n8n + Claude Haiku pipeline runs at $0.04 per invoice fully loaded (model tokens + n8n compute + Postgres + Slack notification), the operational saving is 30x per task. Multiply by volume, subtract the build cost, subtract the on-call cost, and you've got a payback curve that procurement can sign.
The model needs four inputs: task volume per month (V), cost-per-task baseline (Cb), cost-per-task after automation (Ca), and build + ongoing cost (B). Monthly saving is V × (Cb − Ca); payback in months is B ÷ monthly-saving. For a 50,000-invoice-per-month finance ops workload at Cb = $1.20 and Ca = $0.04 with a $40K build, payback is roughly 0.7 months. For a 5,000-applicant-per-month recruiting workflow at Cb = $0.40 and Ca = $0.05 with a $20K build, payback is roughly 11 months. The model is brutally simple; CFOs respect simple models. What they don't respect is "saves 20 hours a week" — that statement doesn't compose into a P&L without 6 more questions.
Three line items procurement decks routinely under-budget. First, observability and on-call: pencil in roughly 15 to 20% of the build cost annually for someone to watch the workflow, triage alerts, and refresh the model when a vendor deprecates a checkpoint. We've seen Claude versions deprecate on 60-day notice; that's a sprint of work if you weren't ready. Second, retrieval index maintenance: if the workflow uses Pinecone or pgvector, the index needs reindexing on a cadence (we run weekly for active docs, monthly for archives), and that's compute and engineer time. Third, model price drift. Token prices have only ever gone down so far, but the line item to forecast is volume rather than unit cost, and a successful automation tends to drive 2 to 3x the volume the original spec assumed.
On TCO over a 24-month horizon, the dominant line is usually inference tokens, not platform fees. A 50k-execution-a-month workflow on Claude Haiku at $0.80 per million input tokens and roughly 1.5k tokens per execution lands at roughly $60 per month of inference, which is rounding error against most ops budgets. The same workflow on Sonnet runs roughly 8x that; on GPT-4 turbo roughly 12x. Picking the smallest model that meets your quality bar (Haiku, Gemini Flash, GPT-4 mini, or one of the open-weight models like Llama 3.1 served through Together AI) is the single highest-leverage cost lever after architecture choice. The model isn't always the answer.

## Industry shape: what AI automation solutions look like across ops, support, and back-office

The vertical pattern matters because the same workflow shape (intake → classify → write → notify) takes very different stacks depending on the system of record. Below are the four engagement shapes we see most often, framed by industry rather than function.
The vertical conversation usually surfaces two extra constraints that the function-led view misses. Compliance: PII and regulated data either can't cross a vendor's tenancy boundary at all (healthcare, defence, some financial services) or has to cross it with explicit data-processing agreements (most enterprise SaaS). That kills cloud-only platforms for a non-trivial slice of buyers, which is why n8n's self-host posture and Temporal's on-prem option matter so much. And legacy integration: an insurance shop with a 20-year-old policy admin system or a manufacturing shop with a 2009 ERP can't use the connector library out of the box, so a meaningful fraction of the build is custom adapters — and that's where the classical RPA tools (UiPath, Power Automate) still earn their fee, as the screen-scraping bridge to a system that has no API. Hybrid stacks (LLM-driven n8n for the modern half, UiPath for the legacy half) are the practical answer.
On the bot half of that hybrid, a quick note: we still recommend a small classical-RPA capability for any engagement that touches a legacy thick-client or a SaaS without an API. Our [robotic process automation build practice](/services/rpa-development/) covers the deterministic side of that picture, paired with the LLM-driven control plane covered here.

## The AI automation solutions guide: a 7-step build checklist

Use this ai automation solutions guide as a kickoff checklist. We run a version of it in every discovery workshop, and the seven steps cover roughly 90% of the decisions that determine whether a pilot ships on time. Step ordering matters; skipping ahead is the most common failure mode we see.
Step one. Pick one workflow with named volume; don't try to automate a department in a single pilot. "Triage invoices over $1000" beats "automate AP". Step two. Set a cost-per-task ceiling before you pick a vendor — if you can't tolerate above $0.10 per task, that rules out Workato Premium and Zapier at scale. Step three. Pick the architecture shape (visual, code, LLM-as-orchestrator); this is the hardest reversal later. Step four. Pick the orchestrator inside the shape — n8n or Make for visual, Temporal or Step Functions for code, LangGraph for LLM-as-orchestrator. Step five. Pick the model (start with Claude Haiku or Gemini Flash; upgrade only if quality fails an eval) and the retrieval layer (Pinecone or pgvector; start with pgvector if you already run Postgres). Step six. Ship observability before you ship the workflow — OpenTelemetry traces, per-execution logs to Postgres or BigQuery, alerts on failure rate and on cost-per-task drift. Step seven. Ship the human-in-the-loop reviewer queue with the workflow; nobody trusts a fresh automation enough to skip review for the first 6 weeks.
On the link in that CTA: our [workflow automation engineering practice](/services/ai-workflow-automation/) is the parent service for the whole picture this guide walks through — orchestrator selection, model integration, durability engineering, observability scaffolding, and the on-call rotation that keeps the pipeline running once it's live. It pairs with the seven-step checklist above; the checklist is what we run in the kickoff, the practice is the team that runs the pipeline afterward.

## FAQ on AI automation solutions, in the buyer's vocabulary

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/


---

## SECTION: 9. Contact

_Source: https://www.paiteq.com/contact/_

# Talk to AI engineering — Paiteq

> Talk to the Paiteq AI engineering team. Same-day reply from the engineer who would lead the work. NDA counter-signed before discovery. Walk-away clause on every engagement.

**HTML version:** https://www.paiteq.com/contact/

## Key facts

- Same-day reply from engineering.
- NDA counter-signed before discovery.
- Walk-away clause on every audit.
- Email: info@paiteq.com · Phone: +91 80 5003 2994.
- Offices: Bengaluru, IN (HD-101A WeWork Salarpuria Symbiosis) and Dallas, TX (539 W. Commerce St #1814).

## Related pages

- [Services hub](https://www.paiteq.com/services/)
- [Case studies](https://www.paiteq.com/case-studies/)
- [About](https://www.paiteq.com/about/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

Contact

# Talk to *AI engineering*.

The first reply comes from the AI engineer who would lead the work. Same day on most inbounds, Bengaluru business hours.

[Send an inquiry](/contact/) [Or email directly](mailto:info@paiteq.com)

001 / WHAT TO BRING

## One workload, scope access, and 30 minutes.

The discovery call goes faster when you bring these three things. None of them are blockers — we've run audits off a Slack message describing the workload. The more concrete the input, the sharper the audit deliverable.

-   01
    
    ### One concrete workload
    
    A workload with a problem statement, current cost (people-hours, ticket volume, doc throughput — whatever the metric is), and what success would look like. We'd rather audit one workload deeply than ten shallowly.
    
-   02
    
    ### Read-only scope access
    
    A read-only data tap to start — a CSV export, a sample of tickets / invoices / calls, or a sandbox API key. No write access until the pilot phase, no production data until NDA + DPA where applicable.
    
-   03
    
    ### 30 minutes for discovery
    
    We run the audit on our side — you don't prep slides. The discovery call is a conversation: what you do, what's broken, what you've tried, where the AI work has to land in your existing systems.
    

002 / CHANNELS

## Four ways to reach us. Email is fastest.

If you want a written reply on a specific question, email beats every other channel — it routes straight to engineering. Phone and WhatsApp are best for time-sensitive coordination once we're already engaged.

[EMAIL · FASTEST info@paiteq.com Direct line to the engineering team. Same-day reply during Bengaluru business hours (IST). Best for: written questions, RFPs, scoping notes longer than the form supports.](mailto:info@paiteq.com) [PHONE · BENGALURU +91 80 5003 2994 For time-sensitive coordination during India business hours (Mon–Fri 10:00–19:00 IST). For new inbounds, email routes faster — phone is best once we're already engaged.](tel:+918050032994) [LINKEDIN /company/paiteq Best for: connecting with our engineers individually, or referring us inside your network. InMail replies within a few business days. Long-form scoping still goes to email.](https://www.linkedin.com/company/paiteq/) [WHATSAPP BUSINESS +91 80 5003 2994 Same number as the phone line. Good for short coordination once a discovery call is on the calendar. Avoid for first-touch scoping — context gets lost in chat threads.](https://wa.me/918050032994)

003 / OFFICES

## Bengaluru engineering HQ + Dallas mailing address.

Honest framing: Bengaluru is where the team works on-site. Dallas is a registered US mailing address for clients who need a US point of contact — there is no team on the ground there. We are a remote-first studio.

PRIMARY · ENGINEERING TEAM

### Bengaluru, India

Most of the team works here

Paiteq  
HD-101(A) WeWork Salarpuria Symbiosis  
Bengaluru, Karnataka 560077  
India

**Hours** · Mon–Fri 10:00–19:00 IST **Phone** · [+91 80 5003 2994](tel:+918050032994)

REGISTERED MAILING ADDRESS

### Dallas, Texas (US)

Mail forwarding · no team on-site

Paiteq  
539 W. Commerce St #1814  
Dallas, TX 75208  
United States

**Purpose** · Registered US mailing only **Client meetings** · Zoom / Meet — engineering is remote-first

004 / INQUIRY

## Tell us what you're trying to ship.

A 90-second form. An engineer reads it and replies — usually within a few hours during Bengaluru business hours, same business day otherwise.

 or email [info@paiteq.com](mailto:info@paiteq.com)

NDA counter-signed before discovery Walk-away clause on every engagement You own all the IP

005 / FAQ

## Questions we get on every first call.

Eight things every inbound asks before signing the NDA. If yours isn't covered, the inquiry form is the fastest way to ask.

Who answers when I send the form?

An AI engineer — usually a senior who would actually be on your pilot. Inbounds land directly in an engineer's inbox, not a sales queue. The reply digs into what you sent — what systems it touches, where the judgment lives, what counts as success — and asks the questions that move the scope forward. Some inbounds need a short reply; complex ones get a long one. Most get a same-day response; long-form RFPs may take a day extra because the read takes longer.

If the work isn't a fit, we'll say so on the reply and route you elsewhere if we can. The trade-off for talking to engineering early: harder technical questions earlier rather than smoother demo-deck questions.

Do you sign an NDA before discovery?

Yes. We counter-sign your NDA before the discovery call, or send ours if you don't have one. For regulated workloads (healthcare, finance, legal), we also sign a DPA before any data touches our infrastructure. Until the pilot kicks off, we recommend the audit phase run on anonymised samples even after the NDA — keeps the surface area small while we're still scoping.

Who owns the IP we build together?

You do. The audit memo, the eval set, the prompts, the workflow code, the deployment scripts, the ops runbook — all of it transfers to you. We retain the right to reuse operator patterns ("how to ship a tier-1 deflection agent", "how to structure a RAG eval") but not your prompts, your data, or your code. Standard work-for-hire terms in the pilot + continuous contracts.

What happens if the audit says "don't build"?

You keep the deliverable — the workload map, the model-cost projection, the 90-day roadmap, the risk-tier ladder — and you walk away. That's the walk-away clause baked into every engagement. We'd rather lose a pilot we shouldn't have sold than ship something that won't move the metric.

The honest version of this is: about 1 in 5 audits we run end up recommending no AI work, or recommending you defer for 6 months until a prerequisite (data quality, eval ground truth, internal champion) is in place. That's the whole point of separating the $3K audit phase from any pilot money.

How fast can we start a pilot?

Typical sequence: discovery call this week → audit deliverable in 1–2 weeks → pilot kickoff within 1–2 weeks of audit sign-off. So 3–5 weeks from "first call" to "engineers writing pilot code".

If you're under a real deadline (board demo, regulator timeline, contract milestone), tell us in the form. We can compress the audit phase or run NDA / DPA paperwork in parallel. We won't compress the eval set work — that's load-bearing for the pilot ship gate.

Do you support HIPAA, SOC 2, GDPR?

**HIPAA** — yes for HIPAA-aware patterns. We ship Claude on AWS Bedrock with a BAA and PrivateLink VPC, audit-logged, with field-level masking on PHI before any model call.

**GDPR** — yes for GDPR-aware DPAs and EU-region residency on hosted models (Anthropic EU, OpenAI EU data residency, Azure West Europe). Subject-access-request workflow documented in the ops runbook.

**SOC 2** — partial. We follow SOC-2-ready practices (audit logs, least-privilege IAM, key rotation, encryption at rest and in transit) but we are not ourselves SOC 2 Type II certified as a vendor. If your procurement requires a SOC 2 Type II report from the agency itself, flag that on the call and we'll route accordingly.

Will I talk to engineers or salespeople?

Engineers. The discovery call is run by someone who would actually be on your pilot — typically a senior AI engineer plus a product / PM lead who has shipped similar shapes. We don't have a separate sales team.

The trade-off: we ask harder technical questions earlier ("what's the schema of that ticket data?", "what's your p99 latency budget?", "where does the human-in-the-loop sit today?") rather than smoother demo-deck questions. If you're earlier in the buying cycle and want a polished overview before a technical conversation, the per-pillar service pages and case-studies index cover that better than a call would.

What models and stacks do you typically pick?

Per workload, not per vendor. Long-context reasoning lands on Claude Sonnet 4.6 most of the time. Realtime voice agents go to OpenAI Realtime API for sub-400ms first-token latency. Cost-sensitive batch workloads run on open-weight models (Llama 4, Mistral) self-hosted on vLLM where the volume justifies it. Regulated workloads on Bedrock with BAA.

The model pick is part of the audit deliverable, with the reasoning shown. We'll tell you honestly when not to use a model — see the per-pillar pages for the per-vendor positioning across [LLM](/services/llm-development/), [RAG](/services/rag-development/), [agents](/services/ai-agent-development/), and [chatbots](/services/chatbot-development/).

006 / Prefer to read first?

## See how we *build the work*.

Twelve service pillars, with engagement shapes, eval methodology, and stack picks per workload.

[All services](/services/) [Email info@paiteq.com](mailto:info@paiteq.com)