# AI Agent Development Services — Paiteq

> Paiteq is an AI agent development company building production AI agents on LangGraph, CrewAI, AutoGen, Composio, and DSPy. Plan/act/reflect loops, multi-agent systems, voice agents — with evaluation built in. AI agent development services from pilot to production.

**HTML version:** https://www.paiteq.com/services/ai-agent-development/

## Key facts

- Stacks: LangGraph, CrewAI, AutoGen, Composio, DSPy.
- Architectures: plan/act/reflect loops, multi-agent systems, voice agents.
- Eval-instrumented from day one; defined kill point before the build starts.

## Related pages

- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [LLM Development](https://www.paiteq.com/services/llm-development/)
- [Chatbot Development](https://www.paiteq.com/services/chatbot-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content

AI Agent Development

# The *AI agent development company* production teams trust to ship.

Paiteq is an AI agent development company building production agents on LangGraph, CrewAI, AutoGen, and Composio. We offer AI agent development services from pilot to production: plan/act/reflect loops, multi-agent systems, and voice agents, with evaluation built in.

[Talk to engineering](/contact/) [See build process](#process)

Stack LangGraph · CrewAI · Composio

Engage Pilot · Build · Run

Eval Task success · Latency · Halluc.

Compliance SOC-2-ready · HIPAA · GDPR

001 / WHAT WE BUILD

## Agents shipped across eight surfaces.

We don't list "AI for everything." Each surface below is a workload we've shipped to production, with the eval methodology, the framework choice, and the failure-mode story already worked out. [See three of them in detail →](#use-cases)

AI agent development sorts cleanly by function, not by industry. A sales agent for a B2B SaaS uses the same plan/act/reflect loop as one we ship for a manufacturer; the integration surface differs, the eval anchor differs, but the shape of the build doesn't. Sorting by function lets us reuse the eval framework, the observability rig, and the prompt-iteration playbook across clients. Sorting by industry, the way most listicle competitors organize their pages, hides where the engineering actually lives. Every custom AI agent development engagement we run reuses the same scaffolding; only the workload-specific tool surface changes.

[

01 / SALES ↗

Sales agents

Lead qualification, outbound research, CRM action. Replace SDR work that doesn't need a human.

B2B · Outbound · Enrichment

](#use-cases)[

02 / SUPPORT ↗

Support agents

Tier-1 deflection, ticket triage, knowledge-base augmented. Escalates with full context, not raw transcripts.

Tier-1 · RAG · Voice

](#use-cases)[

03 / OPS ↗

Ops agents

Invoice matching, AP routing, inventory triage. Replaces structured-but-tedious back-office work.

AP / AR · Workflows

](#use-cases)[

04 / CODING ↗

Coding agents

Repo-aware code review, refactor, doc-gen. Wired into your CI as a PR gate, not a chat surface.

Repo-aware · PR gate

](#use-cases)[

05 / RESEARCH ↗

Research agents

Multi-step deep research over the open web + your internal corpora. Cited, structured outputs.

Multi-step · Citations

](#use-cases)[

06 / VOICE ↗

Voice agents

Phone and in-app voice agents with low-latency turn-taking. Built on LiveKit, Vapi, or Pipecat.

LiveKit · <400ms

](#use-cases)[

07 / MULTI-AGENT ↗

Multi-agent systems

Supervisor + worker patterns for tasks that don't fit one agent. Planner, executor, critic.

Supervisor · Critic

](#architecture)[

08 / EVAL ↗

Eval & observability

Every agent ships with task-success scoring, hallucination metrics, and weekly eval runs.

Langfuse · Braintrust

](#eval)

FUNCTION × INDUSTRY

Where we've shipped each agent class. Strength reflects production volume, not theoretical fit; empty cells mean we either haven't done it yet or the workload didn't justify an agent.

-   **Sales agents:** B2B SaaS, Mfg, Fin-tech, E-comm, Ed-tech; Health-tech, Legal, Logistics
-   **Support agents:** B2B SaaS, Health-tech, Mfg, Fin-tech, Legal, E-comm, Ed-tech, Logistics
-   **Ops / back-office:** B2B SaaS, Health-tech, Mfg, Fin-tech, Legal, E-comm, Logistics; Ed-tech
-   **Research agents:** B2B SaaS, Health-tech, Fin-tech, Legal, Ed-tech; Mfg, E-comm, Logistics
-   **Voice agents:** B2B SaaS, Health-tech, Fin-tech, Ed-tech, Logistics; Mfg, Legal, E-comm
-   **Multi-agent:** B2B SaaS, Health-tech, Mfg, Fin-tech, Legal, E-comm, Ed-tech, Logistics

Legend: Possible fit · Good fit · Primary vertical

The shipped-volume bias is intentional. Sales, support, and ops are the three functions we run the most; they're where evaluable agent work meets clear ROI math. Voice and research agents are growing fastest in 2026; multi-agent systems remain a smaller share of the book because we recommend a single-agent loop unless the workload actually has separable sub-tasks. We'll talk you out of multi-agent if the work doesn't fit it. [More on that under reference architectures.](#architecture)

002 / SERVICES

## AI agent development services: pick where to start.

Four engagement shapes. Each is fixed-scope and fixed-duration. You always know what's coming, when, and what counts as done. [Full engagement-model breakdown below →](#engage)

Choosing an AI agent development company is mostly about choosing the right starting shape, because how you start usually decides how the project ends. Buyers who walk in with a single scoped workload and an eval set in mind ship to production 70% of the time. Buyers who walk in with "we want AI agents" but no target workload ship 25% of the time, usually after a re-scope. We've built the four shapes below to map cleanly onto those starting points. Pick the one that matches what you actually have, not what you wish you had. Each shape is an AI agent development service we've shipped 10+ times; the deliverables and gate criteria are locked in by repetition, not invented for your engagement.

[

01 / PILOT ↗

Agent Pilot

One agent, one scoped workflow, intake to live in 2–4 weeks. Eval framework included.

2–4 wks · Fixed scope

](#engage)[

02 / BUILD ↗

Custom Agent Build

Full production build with integrations, eval gates, observability, and post-launch iteration.

8–16 wks · Fixed scope

](#engage)[

03 / MULTI-AGENT ↗

Multi-Agent Systems

Supervisor + worker + critic orchestration for tasks one agent can't handle. LangGraph or CrewAI.

10–20 wks

](#engage)[

04 / VOICE ↗

Voice / Conversational Agents

Sub-400ms turn-taking voice agents. Inbound, outbound, in-app. LiveKit + your LLM of choice.

6–12 wks · <400ms

](#engage)

A practical decision tree: if the workload is scoped but unproven, start with a **Pilot**. If the workload is proven (one of yours is already working manually or in a janky prototype) and you need production discipline, start with a **Custom Agent Build**. If the workload has 3+ sub-tasks that fight each other in a single prompt, start with **Multi-Agent Systems**. Voice is a separate workstream (latency budget, turn-taking, telephony), so it gets its own shape. Compared to other AI agent development companies, we build the eval framework before writing code, not after the agent ships. [Week-by-week scope on each, further down →](#engage)

003 / STACK

## Frameworks we build on.

Framework choice follows the workload, not the other way around. We don't have a house framework. The six profiled under Framework Picks below cover ~95% of what we ship; the rest live in **Vercel AI SDK**, **OpenAI Agents**, or hand-rolled SDK wrappers when the surface is small enough.

-   LangChain
-   LangGraph
-   CrewAI
-   AutoGen
-   DSPy
-   Composio
-   Pydantic-AI
-   Phidata
-   AG2
-   Vercel AI SDK
-   OpenAI Agents
-   Anthropic SDK

FRAMEWORK PICKS

For each framework: what it's strongest at, when we pick it, when we don't, and the specific Paiteq pattern we use with it. We've shipped production agents on every one of these; the "when we don't" lines come from actual builds, not theory.

LangGraph

Strengths

Explicit graph control over a stateful agent loop. Checkpointing, time-travel debugging, and human-in-the-loop interrupts are first-class.

When We Pick

Plan/act/reflect loops where you need to inspect, replay, or branch state. Long-running agents that survive crashes. Any agent that needs an explicit retry policy.

When We Don't

Single-turn extraction. Stateless tool calls. Anything that fits in a Vercel AI SDK route: don't pay the graph tax if you don't need it.

Paiteq Pattern

We use LangGraph for ~70% of production agents. The checkpoint store goes to Postgres so resume-after-crash is one redeploy away.

Stateful · Checkpoints · Multi-step

CrewAI

Strengths

Supervisor / worker orchestration with role-prompt scaffolding. Less code than LangGraph for the same 3-agent pattern.

When We Pick

Research workflows. Content pipelines (planner → writer → critic). Multi-agent prototypes where the orchestration topology won't change.

When We Don't

Once you need explicit state graph control, retries, or runtime topology changes, graduate to LangGraph. CrewAI's role abstraction starts to fight you.

Paiteq Pattern

Pilots that need to demo a multi-agent loop in week 3 start in CrewAI. About a third graduate to LangGraph for production.

Supervisor · Multi-agent

AutoGen

Strengths

Multi-agent conversation patterns from Microsoft Research. Strong code-execution agents and group-chat orchestration.

When We Pick

Code-generation agents that need a sandboxed runtime. Group-chat patterns where 3+ agents debate before acting.

When We Don't

Single-agent loops. Latency-sensitive paths. AutoGen's chat metaphor adds round-trips that LangGraph avoids with direct state passing.

Paiteq Pattern

We use AutoGen Studio when a client wants to author agent conversations themselves before we wire it into production.

Code-exec · Group chat

Composio

Strengths

Pre-built tool surface. 250+ integrations (Salesforce, Slack, GitHub, Linear, Zendesk) with auth + rate-limit handling already solved.

When We Pick

Agents that touch 4+ external systems and we'd otherwise spend the first three weeks writing OAuth and webhook plumbing.

When We Don't

Internal-only tools or anything off the catalog. The wrapper costs ~80ms per call and you're subject to Composio's rate-limit policy; for hot paths, write the tool yourself.

Paiteq Pattern

Sales and support agents almost always start on Composio. We migrate to native SDKs only when latency or rate limits become the bottleneck.

250+ tools · OAuth

DSPy

Strengths

Compile prompts the way you'd compile code. Optimizers (BootstrapFewShot, MIPRO) tune the prompt against your eval set automatically.

When We Pick

Agents where the prompt is the bottleneck and we have an eval set ≥50 examples. Hand-tuned prompts plateau; DSPy gets another 5–10 points.

When We Don't

Eval set too small (DSPy needs signal). Workflows where the bottleneck is tool design, not prompting. Prototypes, start with hand-crafted, compile later.

Paiteq Pattern

Phase 2 of any Build engagement. We hand-craft the prompts in pilot, then DSPy-compile them once the eval set is mature.

Prompt opt · MIPRO

LiveKit

Strengths

Voice agent infrastructure: WebRTC transport, STT/LLM/TTS pipeline, turn detection. Sub-400ms mic-to-speaker on Claude or GPT-5 realtime.

When We Pick

Phone or in-app voice agents. Anywhere the user expects a human-cadence conversation, not a chat UI with TTS bolted on.

When We Don't

Long-form generation. Voice agents that need 5+ seconds to think; LiveKit's turn detector will keep interrupting them.

Paiteq Pattern

Pipecat for prototypes (faster to demo), LiveKit when going to production. We've shipped voice agents with p95 turn-take at 320ms.

Voice · Sub-400ms · WebRTC

Two patterns are worth flagging on every custom AI agent development engagement we lead. First, we benchmark **two frameworks against the eval set** before locking the stack, usually LangGraph vs whichever lighter option fits the workload. The eval set decides, not the framework's marketing. Second, we keep an out: every enterprise AI agent ships with prompts in portable YAML for re-hosting on a different framework if needed. Most AI agent development companies don't scope portability into the SOW; we do it by default.

003b / MODELS & INTEGRATIONS

## Models, integrations, and the tool surface.

Framework choice gets its own heading, but it is rarely the call that matters most. Model and integration choice usually matter more for production behaviour. We benchmark at least two models per workload, name every integration in scope on day one, and pick between Composio's pre-built tool surface and native SDK wrappers per call site, not as a blanket policy.

MODELS WE DEPLOY

Four model families cover ~98% of what we ship. We benchmark candidates against your eval set before locking; the leader on cost-adjusted quality wins, regardless of which vendor we'd default to. Routing across multiple families in one production deployment is increasingly the norm.
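One way to read "leader on cost-adjusted quality" is sketched below. The scoring rule (best eval score per dollar among candidates that clear a quality floor), the `Candidate` type, the placeholder model names, and every number are illustrative assumptions, not a published benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                 # placeholder label, not a real benchmark entry
    task_success: float       # fraction of eval examples passed, 0..1
    cost_per_task_usd: float  # average spend per eval task


def pick_model(candidates, min_success=0.90):
    """Return the candidate with the best success-per-dollar above the floor."""
    eligible = [c for c in candidates if c.task_success >= min_success]
    if not eligible:
        raise ValueError("no candidate clears the quality floor")
    return max(eligible, key=lambda c: c.task_success / c.cost_per_task_usd)


candidates = [
    Candidate("model-a", task_success=0.95, cost_per_task_usd=0.12),
    Candidate("model-b", task_success=0.93, cost_per_task_usd=0.05),
    Candidate("model-c", task_success=0.85, cost_per_task_usd=0.01),  # cheap but below the floor
]
winner = pick_model(candidates)
print(winner.name)
```

The floor matters: without it, the cheapest model would win on ratio alone even though it fails the eval set.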

Claude (Sonnet / Opus)

Strengths

Tool use, structured output, long-context reasoning. Our default for agent planning loops where the prompt is doing the heavy lifting.

When We Pick

Stateful agents with complex tool surfaces. Long-context RAG. Anywhere prompt-following matters more than raw speed. Sonnet for ~80% of production; Opus when reasoning depth justifies the cost.

When We Don't

Hyper-latency-sensitive paths (Sonnet TTFT runs 400-800ms). Very high-volume routine calls where GPT-5 mini or Llama-3 are cheaper.

Paiteq Pattern

Sonnet is the default planner in our agent loops. We benchmark it head-to-head against GPT-5 on every new engagement's eval set.

Tool use · Structured · Long context

GPT (4o / 4o-mini)

Strengths

Lowest latency on the hosted side (4o realtime TTFT ~250ms). Strong vision. 4o-mini is the price-per-token king for routing.

When We Pick

Voice agents needing realtime TTFT. Multimodal apps (vision + text). High-volume classifier or router tier where 4o-mini's cost wins.

When We Don't

Heavy tool-using planning loops; Claude usually wins our eval. Workflows requiring strict structured output without retries.

Paiteq Pattern

GPT-5 realtime is the standard for voice agents. 4o-mini routes ~70% of high-volume traffic in cost-engineered deployments.

Realtime · Vision · Routing

Gemini (2.0 Flash / Pro)

Strengths

Massive context window (1M+ tokens). Strongest cost-per-token at the frontier tier. Native multimodal across video.

When We Pick

Workloads with very large context (whole-codebase analysis, long document Q&A) where chunking would lose signal. Video understanding.

When We Don't

Tool-using agents that don't need the context window; Gemini's tool-call accuracy still trails Claude on our evals. Production paths where Google Cloud lock-in is a concern.

Paiteq Pattern

We use Gemini Flash for long-document agents (legal contracts, codebase audits). Rarely as a primary agent planner, yet.

1M context · Multimodal · Video

Llama 4 / Mistral / Qwen

Strengths

Self-hosted on your cloud. Fixed infra cost. No data leaves your perimeter. LoRA fine-tuneable for domain language.

When We Pick

Regulated data rules (HIPAA, FedRAMP, EU residency). Very high token volume where dedicated GPU amortizes. Workloads where prompt + small fine-tune beats hosted prompt-only.

When We Don't

Tool-using agents on the frontier; open weights still trail Claude/GPT-5 by 5-15 points on tool-call accuracy. Engagements with no ops capacity to run inference infrastructure.

Paiteq Pattern

vLLM on dedicated A100/H100 for self-hosted. LoRA fine-tunes on Llama 4 70B for domain-specific classification or extraction agents. Hybrid: hosted planner + self-hosted worker is increasingly common.

Self-hosted · vLLM · LoRA

INTEGRATIONS WE SHIP AGAINST

Tool-call accuracy against your real systems is one of the four eval metrics. We don't trust integrations until we've graded them. Below: the systems we've shipped agent integrations against in the last 12 months. Adding to the list takes a few days, not a re-architecture.

CRM & Sales

-   Salesforce
-   HubSpot
-   Pipedrive
-   Apollo
-   Clearbit

Support & Ticketing

-   Zendesk
-   Intercom
-   Freshdesk
-   Linear
-   Jira

Data Warehouse & Search

-   Snowflake
-   BigQuery
-   Databricks
-   Pinecone
-   Qdrant

Communication & Files

-   Slack
-   Microsoft Teams
-   Gmail / Google Workspace
-   SharePoint
-   Notion

Code & DevOps

-   GitHub
-   GitLab
-   CircleCI
-   Linear
-   PagerDuty

Voice & Telephony

-   LiveKit
-   Twilio
-   Vapi
-   Plivo
-   Telnyx

TOOL SURFACE DESIGN

Every agent has a tool surface: the set of functions it can call to do its job. Sizing that surface is one of the most consequential decisions in the build. Too small and the agent can't do the work; too large and it loses focus, tool-call accuracy drops, and latency balloons.

01 **≤8 tools per planning loop.** Beyond that, accuracy starts dropping on every eval we've run. Decompose into supervisor + workers.

02 **Composio for breadth, native SDK for hot paths.** Composio's 250+ integrations save 2-3 weeks on OAuth + webhook plumbing but add ~80ms per call. Native SDKs when latency or throughput rules.

03 **Confirmation step in front of write actions.** Read tools call freely; write tools (send email, create ticket, charge card) ship with a confirmation gate unless tool-call accuracy is above 99.5%.

04 **Structured outputs over freeform.** Every tool input gets a Pydantic / Zod schema. We use Anthropic and OpenAI structured-output mode by default; retries on schema violation are cheaper than guessing what the agent meant.
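Rules 01 and 03 above can be sketched in plain Python. This is a minimal illustration, not Paiteq's implementation: `Tool`, `ToolSurface`, and the example tool names are hypothetical, and schema validation (rule 04) is left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_TOOLS = 8          # cap per planning loop (rule 01)
CONFIRM_BELOW = 0.995  # write tools are gated unless accuracy is above this (rule 03)


@dataclass
class Tool:
    name: str
    fn: Callable[..., object]
    writes: bool = False  # True for side-effecting actions such as sending email


@dataclass
class ToolSurface:
    tool_call_accuracy: float  # measured on the eval set
    tools: dict = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        if len(self.tools) >= MAX_TOOLS:
            raise ValueError("tool surface full: decompose into supervisor + workers")
        self.tools[tool.name] = tool

    def call(self, name: str, confirmed: bool = False, **kwargs):
        tool = self.tools[name]
        needs_gate = tool.writes and self.tool_call_accuracy <= CONFIRM_BELOW
        if needs_gate and not confirmed:
            # Hand the proposed call back for approval instead of executing it.
            return {"status": "needs_confirmation", "tool": name, "args": kwargs}
        return {"status": "ok", "result": tool.fn(**kwargs)}


surface = ToolSurface(tool_call_accuracy=0.99)
surface.register(Tool("lookup_account", lambda account_id: {"id": account_id}))
surface.register(Tool("send_email", lambda to, body: f"sent to {to}", writes=True))

read_result = surface.call("lookup_account", account_id="a-1")
pending = surface.call("send_email", to="x@example.com", body="hi")
sent = surface.call("send_email", confirmed=True, to="x@example.com", body="hi")
```

Read tools run immediately; the write tool is held until a caller passes `confirmed=True`, which is where a human or policy check would sit.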

The model and integration choices are where engagement scope quietly grows. Buyers ask for "an AI agent for our sales workflow" without specifying which CRM, which ICP scoring fields, which model. We force those choices into the spec in week 2, naming the model, naming the four CRM endpoints we'll integrate, naming the cost band per request. Decisions made explicit at scope time stop being re-litigations during build.

004 / PROCESS

## Six steps from discovery to running.

The same process runs across both a 2-week pilot and a 16-week custom build. The gates change in depth, not in shape. Every step has an explicit deliverable, a named owner, and a gate criterion: pass or rework, no "we'll figure it out next week."

WEEK 1–2

### Discovery

We map the workflow, scope the agent's job, and identify the eval surface: what counts as the agent doing its job correctly?

WEEK 2–3

### Spec

Tools, prompts, guardrails, model choice, and the first 30–50 eval examples. Signed off before any code.

WEEK 3–6

### Prototype

First runnable agent against the scoped eval set. We iterate prompts and tools until baseline accuracy is hit.

WEEK 6–10

### Eval gates

Task success, hallucination rate, tool-call accuracy, and latency thresholds all green before production wire-up begins.

WEEK 10+

### Deploy

Production integration, auth, rate limits, observability via Langfuse, retry + fallback policies, on-call runbook.

ONGOING

### Running

Weekly eval runs, prompt + tool iteration, and a regression alarm if any metric drops by more than 5%.

01

### Discovery

We map the full workload, every decision point, handoff, and exception, before scoping any agent. That means watching one of your team members do the work today, recording every decision point, and identifying which decisions are deterministic (rule-based) vs judgment-based (LLM-fit). The week-1 output is a workload map + a draft eval surface: what counts as the agent doing the job correctly?

**Owners:** Paiteq AI engineer + your subject-matter expert. ~6 hours of their time across the week.

**Gate:** Workload boundary signed off. If sub-tasks straddle a fuzzy boundary, we shrink scope rather than guess.

02

### Spec

Stack picks (framework + model + observability), prompt sketches, tool surface, guardrails policy, and the first 30–50 eval examples. The eval examples come from your domain expert grading real candidate outputs, not from us guessing. Signed off as a one-pager before code starts.

**Owners:** Paiteq AI engineer + senior architect review.

**Gate:** Eval examples graded. If your team can't agree on a grade for the example outputs, the spec isn't done.

03

### Prototype

First runnable agent against the scoped eval set. We iterate on prompts, tool design, retrieval (if RAG), and model choice. Multiple models get benchmarked against the same eval set; the leader on cost-adjusted quality wins, regardless of which vendor we'd default to.

**Owners:** Paiteq AI engineer building; weekly demo to your team.

**Gate:** Baseline accuracy hit on the eval set. Below baseline, we revise the spec rather than the threshold.

04

### Eval gates

Four thresholds must all be green before any production wire-up: task success rate, hallucination rate, tool-call accuracy, and p95 latency. Hallucination is dual-scored (LLM-as-judge + human spot-check on disputed examples). Tool-call accuracy is separately measured because a wrong tool call can succeed at the wrong thing.

**Owners:** Paiteq AI engineer + your domain expert verifying the human spot-check.

**Gate:** All four metrics green or the build doesn't deploy. Period. We've shipped exactly zero agents that bypassed this gate.

05

### Deploy

Production integrations, auth, rate-limit, observability via Langfuse, fallback policies, cost guardrails, on-call runbook. We wire the eval set into the deploy pipeline so regression alarms fire automatically when an upstream model change drops scores. The handoff includes the runbook in your repo, not in a doc somewhere.

**Owners:** Paiteq AI engineer + your platform/SRE team.

**Gate:** Runbook drilled (we simulate an outage + rollback before the actual go-live).
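A regression alarm of the kind wired into the deploy pipeline might look like the sketch below. It assumes the 5% threshold is a relative change against the last accepted baseline; the metric names and numbers are hypothetical.

```python
def regressions(baseline, current, max_drop=0.05,
                lower_is_better=("hallucination_rate", "p95_latency_s")):
    """Return metric names that worsened by more than max_drop (relative to baseline)."""
    flagged = []
    for name, base in baseline.items():
        now = current[name]
        if name in lower_is_better:
            change = (now - base) / base  # an increase is a regression
        else:
            change = (base - now) / base  # a decrease is a regression
        if change > max_drop:
            flagged.append(name)
    return flagged


baseline = {"task_success": 0.94, "tool_call_accuracy": 0.99,
            "hallucination_rate": 0.02, "p95_latency_s": 2.0}
current = {"task_success": 0.88, "tool_call_accuracy": 0.99,
           "hallucination_rate": 0.02, "p95_latency_s": 2.05}
alarms = regressions(baseline, current)
print(alarms)
```

Here task success fell by about 6.4% relative to baseline and is flagged, while the 2.5% latency increase stays under the threshold.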

06

### Running

Four weeks of post-launch iteration are part of every Build engagement: weekly eval review, prompt iteration on edge cases, regression alarm triage. After week 16, ongoing iteration moves to a Run engagement (separate monthly SOW) only if the workload genuinely benefits. About half of completed builds graduate to Run.

**Owners:** Paiteq AI engineer (decreasing % of time) + your team picking up ownership.

**Gate:** Ongoing; weekly eval review never stops while we're engaged.

Two notes that matter. **Eval gates are non-negotiable**: we will not wire an agent into production traffic until task success rate, hallucination rate, tool-call accuracy, and latency are all green against the eval set scoped during discovery. **Running is a real phase**, not an afterthought. The first 4 weeks post-launch are part of every Build engagement, with weekly eval runs and prompt iteration baked into the SOW.

005 / DECISION

## AI agents vs. chatbots: when do you need which?

This is the most common scoping mistake we see. Buyers ask for "an AI agent" when a chatbot is enough, or ask for "a chatbot" when the workload genuinely needs autonomy. The eight dimensions below cover most of the call.

| Dimension | Chatbots | AI Agents |
| --- | --- | --- |
| Turns | Single, request-response | Multi-turn, planning loop |
| State | Stateless or thin context | Stateful, often memory-backed |
| Tool use | None or one-shot lookup | Core: APIs, code, retrieval, other agents |
| Autonomy | Scripted by intent map | Goal-driven, decides its own steps |
| Eval surface | Intent classification accuracy | Task success rate + sub-step accuracy |
| Failure mode | Wrong intent → wrong reply | Wrong plan → cascading bad actions |
| Best for | FAQ, lookups, routing | Multi-step workflows, research, ops |
| Cost (rough) | $$ | $$$$; per-task LLM cost dominates |

"Tool use" is the line buyers most often miss. A chatbot can *call one API* on intent match; that's not tool use, that's a function call. Real tool use means the model decides **which** tool to call, **when**, with **what arguments**, and how to react when the tool errors. That decision loop is the agent.

Autonomy is the scope dial. Scripted flows are easier to test, cheaper to run, and won't surprise you in production, but they cap at the conversation tree you drew. Agents will solve problems you didn't anticipate, which is the value *and* the risk. Most production deployments end up with bounded autonomy: the agent decides within a fixed toolset and rejects with an explanation outside it.

Cost flips the typical recommendation. At **$0.001/resolution for a tuned chatbot** vs **$0.05–$0.20/task for an agent**, the volume math is brutal. A chatbot handling 10k tickets/day at $0.001 costs $300/mo. The same volume on an agent at $0.10 is $30k/mo. Agents earn their cost on **multi-step work** (research, ops, integration), not on volume. If the workload is bounded and high-volume, chatbot every time.
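The volume math above is easy to re-run with your own numbers. The sketch assumes a 30-day month; the function name is ours.

```python
DAYS_PER_MONTH = 30  # assumption used to turn a daily volume into a monthly bill


def monthly_cost(tasks_per_day, cost_per_task_usd, days=DAYS_PER_MONTH):
    """Monthly spend for a given daily volume and per-task cost."""
    return tasks_per_day * cost_per_task_usd * days


chatbot = monthly_cost(10_000, 0.001)  # tuned chatbot at $0.001 per resolution
agent = monthly_cost(10_000, 0.10)     # agent at $0.10 per task
print(f"chatbot: ${chatbot:,.0f}/mo, agent: ${agent:,.0f}/mo, ratio: {agent / chatbot:.0f}x")
```

At equal volume the agent costs about 100 times as much, so the agent only pays off where each task replaces multi-step work.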

Full breakdown, [when to pick which](/blog/ai-agents-vs-chatbots/)

Rule of thumb: if the work is **look something up and respond**, you want a chatbot. If the work is **understand a goal, take several steps, and use tools along the way**, you want an agent. For anything in between, the decision tree below walks you through a few diagnostic questions; most projects fit cleanly into one of five outcomes.

DECISION TREE

Answer four questions about your workload. We've used these same questions to right-size scope on every engagement since 2023.


006 / ARCHITECTURE

## Three patterns we deploy.

Most production agents reduce to one of three patterns. The taxonomy isn't ours; it's standard in the LangGraph and CrewAI communities. The *deployment choices* are where engineering judgment lives: when to pick which, what fails first, which eval metric becomes the anchor.

  

01

### Single-agent + tools

The simplest production pattern. One agent runs a plan/act/reflect loop with a fixed tool surface, one LLM call per turn. This is where most production agents land: sales research, support deflection, ops routing. State is small (recent turns + scratchpad), the topology is fixed, and the eval anchor is end-task success plus tool-call accuracy. Around 60-70% of our production agents fit this pattern. Don't reach for multi-agent until single-agent demonstrably fails the eval set.

Pick when

-   The workload is bounded with stable tools.
-   A single planning loop covers the task.
-   Tool surface ≤8.
-   Most pilots start here.

Skip when

-   Sub-tasks fight each other in one prompt.
-   Task needs >15 sequential tool calls (latency budget breaks).
-   Workflow has clearly separable specialised roles.

Stack

LangGraph · Claude Sonnet 4.6 · Composio
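The single-agent pattern can be sketched without any framework. In the sketch below, `plan` and `reflect` stand in for LLM calls and are stubbed with plain functions; the tool names and the 15-step cap (borrowed from the "skip when" rule above) are illustrative assumptions.

```python
from typing import Callable


def run_agent(goal: str, plan: Callable, tools: dict, reflect: Callable, max_steps: int = 15):
    """Plan/act/reflect loop: plan picks a tool call, act runs it, reflect decides whether to stop."""
    scratchpad = []  # small state: prior steps and their observations
    for _ in range(max_steps):
        tool_name, args = plan(goal, scratchpad)        # plan
        try:
            observation = tools[tool_name](**args)      # act
        except Exception as exc:                        # tool errors become observations
            observation = f"error: {exc}"
        scratchpad.append((tool_name, args, observation))
        done, answer = reflect(goal, scratchpad)        # reflect
        if done:
            return answer, scratchpad
    return None, scratchpad  # step budget exhausted


def stub_plan(goal, scratchpad):
    # A real planner would be a model call; this stub follows a fixed two-step script.
    if not scratchpad:
        return "search_crm", {"company": goal}
    return "score_lead", {"record": scratchpad[-1][2]}


def stub_reflect(goal, scratchpad):
    last_tool, _, observation = scratchpad[-1]
    return last_tool == "score_lead", observation


tools = {
    "search_crm": lambda company: {"company": company, "employees": 250},
    "score_lead": lambda record: "qualified" if record["employees"] >= 100 else "unqualified",
}
answer, trace = run_agent("Acme", stub_plan, tools, stub_reflect)
```

The returned `trace` is what tool-call accuracy gets scored against: each entry records which tool was chosen, with what arguments, and what came back.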

02

### Supervisor / worker

One supervisor agent plans and routes; workers specialise (research, draft, execute, critique). Used when no single agent's prompt can hold the full task without quality collapse. The supervisor's job is decomposition + routing, not execution; keeping its prompt focused on "which worker, with what input?" beats letting it also try to do the work. Per-worker success and supervisor routing accuracy become separate eval metrics; either failing tells you something different about what to fix.

Pick when

-   Task has clearly separable sub-tasks (research → draft → critique).
-   Single mega-prompt is producing worse results than orchestrating focused agents.
-   You can score each worker's output independently.

Skip when

-   Workflow is linear with no decision points (just chain LLM calls).
-   Latency budget tight (each handoff adds 800-1500ms).
-   Sub-tasks share too much context.

Stack

LangGraph · CrewAI · Claude 4 · GPT-5
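Supervisor routing accuracy, scored separately from worker output, can be illustrated as follows. The supervisor here is a rule-based stub standing in for an LLM call, and the worker names, task fields, and labelled examples are hypothetical.

```python
def supervisor(task: dict) -> str:
    """Stub router: a real supervisor would be an LLM call returning a worker name."""
    if task["stage"] == "research":
        return "researcher"
    if task["stage"] == "draft":
        return "writer"
    return "critic"


workers = {
    "researcher": lambda task: f"notes on {task['topic']}",
    "writer": lambda task: f"draft about {task['topic']}",
    "critic": lambda task: f"review of {task['topic']}",
}


def routing_accuracy(router, labelled_tasks):
    """Share of eval tasks the supervisor sent to the expected worker."""
    hits = sum(router(task) == expected for task, expected in labelled_tasks)
    return hits / len(labelled_tasks)


eval_set = [
    ({"stage": "research", "topic": "pricing"}, "researcher"),
    ({"stage": "draft", "topic": "pricing"}, "writer"),
    ({"stage": "critique", "topic": "pricing"}, "critic"),
    ({"stage": "publish", "topic": "pricing"}, "writer"),  # deliberate miss: the stub routes this to critic
]
accuracy = routing_accuracy(supervisor, eval_set)
first_output = workers[supervisor(eval_set[0][0])](eval_set[0][0])
```

A low routing score with healthy per-worker scores points at the supervisor prompt; the reverse points at a worker.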

03

### RAG-augmented agent

The agent treats retrieval as a tool it can call mid-loop, not as a fixed pre-step before generation. The vector store sits behind a retriever the agent invokes when grounding is needed. This fits when context grounding matters more than autonomy depth: clinical Q&A, contract review, regulatory research. Eval anchors shift: retrieval recall (did we find the relevant chunks?) and answer faithfulness (did the agent stay grounded in what was retrieved?) matter more than tool-call accuracy. We hand-build the chunking + reranker per corpus; defaults are bad.

Pick when

-   Output must be grounded in your corpus (docs, tickets, contracts).
-   Corpus too large or too fresh to fit in the prompt.
-   Citation enforcement is a hard requirement.

Skip when

-   Workload is mostly generative (writing, image).
-   Corpus fits in context window with room to spare.
-   You don't have ground-truth answers for eval.

Stack

LangGraph · Pinecone · BGE rerank · Claude 4
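Retrieval recall, the first of the two RAG eval anchors, is straightforward to compute once ground-truth chunks are labelled. The sketch below uses recall@k as the concrete metric, which is an assumption on our part, and hypothetical chunk IDs; answer faithfulness needs a judge model and is not shown.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks."""
    if not relevant_ids:
        raise ValueError("need at least one ground-truth chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(set(relevant_ids))


def mean_recall_at_k(examples, k):
    """Average recall@k over (retrieved, relevant) pairs from the eval set."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in examples) / len(examples)


examples = [
    (["c7", "c2", "c9", "c4"], ["c2", "c4"]),  # one relevant chunk falls outside the top 3
    (["c1", "c5", "c3", "c8"], ["c3"]),
]
score = mean_recall_at_k(examples, k=3)
```

Tracking this per corpus shows whether a chunking or reranker change actually surfaces more of the relevant chunks.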

A common scoping mistake we see in enterprise AI agent projects: clients ask for pattern 02 (multi-agent) when pattern 01 + a better prompt would have shipped in half the time. The supervisor/worker abstraction is seductive (it *sounds* rigorous), but every extra agent doubles the eval surface and adds 800-1500ms of latency per handoff. Default to pattern 01. Move up only when the eval set tells you to. Most enterprise AI agent deployments we audit land back on pattern 01 within 90 days.
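The handoff cost is easiest to see as arithmetic against a latency budget. The sketch below uses the 800-1500ms per-handoff range quoted above and, as an assumption, the 2.4s p95 target from the eval section as the total budget.

```python
P95_BUDGET_MS = 2400      # the <2.4s p95 target from the eval section
HANDOFF_MS = (800, 1500)  # per-handoff latency range quoted above


def remaining_budget_ms(handoffs, budget_ms=P95_BUDGET_MS, handoff_ms=HANDOFF_MS):
    """Return (worst_case, best_case) milliseconds left for model and tool time."""
    low, high = handoff_ms
    return budget_ms - handoffs * high, budget_ms - handoffs * low


for n in range(4):
    worst, best = remaining_budget_ms(n)
    print(f"{n} handoffs: {worst}ms to {best}ms left")
```

Under these assumptions, a second handoff can already push the worst case past the budget, and a third leaves nothing even in the best case.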

007 / EVAL

## Four metrics on every agent we ship.

Most "agent" projects fail in production because nobody scoped what success looked like before writing code. We invert that. The eval set lands in week 2, before the first prompt is written.

94%

Task success rate

Did the agent complete the user's goal start-to-finish, scored against the eval set's expected outputs.

<2%

Hallucination rate

LLM-as-judge scoring with weekly human spot-check on disputed examples. Hard gate before production wire-up.

99.2%

Tool-call accuracy

Right tool, right args. Separately scored from end-task success because a wrong tool call can succeed at the wrong thing.

<2.4s

P95 latency

Measured across the full call chain including tool invocations. Voice agents target sub-400ms turn-taking. Budget reviewed weekly.

Numbers shown are illustrative target ranges for new engagements until eval data from production work is published.

EVAL GATES

The four gates aren't suggestions. All four must be green before we wire the agent into production traffic. Each has an explicit methodology, a target, and a fail-state, all codified before the first prompt is written.

1.  Task success
    
    ≥94%
    
    Domain-expert graded eval set, 30–50 examples covering main flow plus edge cases. Re-graded weekly. Production traces sampled into the eval set monthly.
    
    If <90%, the agent doesn't ship. We revise the spec before retrying.
    
2.  Hallucination
    
    <2%
    
    LLM-as-judge with Claude Sonnet 4.6 scoring each output, then human spot-check on the 5% of outputs the judge marked disputed.
    
    If ≥3%, hard gate before production wire-up. No exceptions.
    
3.  Tool-call accuracy
    
    ≥99%
    
    Right tool + right args. Scored independently of end-task success because a wrong tool call can accidentally succeed.
    
    If <97%, the agent ships with a tool-confirmation step in front of write actions.
    
4.  P95 latency
    
    <2.4s
    
    Measured across the full call chain including tool invocations. Voice agents target <400ms turn-take. Budget reviewed weekly; regression alarm if breached for 24h.
    
    If breached for >72h, we revisit model routing or tool design.
    

Two methodology notes that matter. First, we use **LLM-as-judge with Claude Sonnet 4.6** as the default scorer because it produces the most consistent grades against human ground-truth on the eval sets we've shipped. Hallucination disputes (typically 5-8% of outputs) get a human spot-check by your domain expert; we never let LLM-as-judge stand alone for the hard cases. Second, the eval set **grows during production**: real traces are sampled monthly into the eval set, with regression alarms when an upstream model change drops scores. The eval set we hand you on day 1 is not the eval set you have on day 365.
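The "all four green" rule reduces to a small check. The sketch below hard-codes the targets listed in the gates above; the `EvalResult` type and function names are ours, and the softer fail-state thresholds (for example the <90% task-success cutoff) are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    task_success: float        # fraction of eval examples completed correctly
    hallucination_rate: float  # fraction of outputs flagged after judge + spot-check
    tool_call_accuracy: float  # right tool with right arguments
    p95_latency_s: float       # seconds across the full call chain


def gate_report(r: EvalResult) -> dict:
    """Map each gate to True (green) or False, using the targets listed above."""
    return {
        "task_success": r.task_success >= 0.94,
        "hallucination": r.hallucination_rate < 0.02,
        "tool_call_accuracy": r.tool_call_accuracy >= 0.99,
        "p95_latency": r.p95_latency_s < 2.4,
    }


def can_deploy(r: EvalResult) -> bool:
    """Production wire-up is allowed only when every gate is green."""
    return all(gate_report(r).values())


passing = EvalResult(0.95, 0.015, 0.992, 2.1)
failing = EvalResult(0.93, 0.015, 0.992, 2.1)
```

`gate_report(failing)` shows which gate is red, which is more useful in a pipeline log than a bare boolean.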

Eval and observability stack we deploy by default:

Langfuse · Braintrust · Promptfoo · LangSmith · Helicone · Inspect AI

007b / SECURITY · COMPLIANCE · COST

## Security, compliance, and cost engineering.

Three concerns enterprise buyers always ask about before procurement. We address each one explicitly in the spec, not as a "we'll figure it out at the security review" promise.

### Security & guardrails

Defense in depth, not a single classifier. Every production agent ships with input filtering, output filtering, system-prompt isolation, and an adversarial eval set we re-run on every model swap.

-   **Input classifier:** Llama Guard 3 or a custom policy classifier blocks known prompt-injection patterns before they hit the planner.
-   **Structured output enforcement:** Pydantic / Zod schemas with retry on violation. Cuts most "agent decided to do something weird" failure modes.
-   **System-prompt isolation:** user content can never override system instructions. We test this with an adversarial eval on every deploy.
-   **Output filtering:** Llama Guard or Presidio on outbound responses for PII leakage, prohibited content, hallucinated tool calls.
-   **Tool confirmation:** write actions (send email, charge card, update CRM) gate behind a confirmation step unless tool-call accuracy is ≥99.5%.
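
The tool-confirmation rule in the last bullet can be sketched as a thin wrapper around tool execution. This is an illustrative sketch only: the tool names, the `confirm` and `run` callables, and the return shape are hypothetical, and the 99.5% threshold is taken from the bullet above as a default parameter.

```python
from typing import Any, Callable, Dict

# Hypothetical names for the write actions mentioned above.
WRITE_ACTIONS = {"send_email", "charge_card", "update_crm"}


def needs_confirmation(tool: str, tool_call_accuracy: float,
                       threshold: float = 0.995) -> bool:
    """Write actions need confirmation unless measured accuracy meets the threshold."""
    return tool in WRITE_ACTIONS and tool_call_accuracy < threshold


def execute(tool: str, args: Dict[str, Any], tool_call_accuracy: float,
            confirm: Callable[[str, Dict[str, Any]], bool],
            run: Callable[[str, Dict[str, Any]], Any]) -> Dict[str, Any]:
    """Run a tool call, asking `confirm` first when the gate requires it."""
    if needs_confirmation(tool, tool_call_accuracy) and not confirm(tool, args):
        return {"status": "rejected", "tool": tool}
    return {"status": "done", "tool": tool, "result": run(tool, args)}
```

Read-only tools pass straight through, so the confirmation step only adds friction where a wrong call has side effects.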

### Compliance posture

Default posture covers most enterprise procurement bars. Regulated workloads (clinical, financial, EU) layer in additional controls, scoped into the SOW, not retrofitted at security review.

SOC 2-ready

Practices, not certified · default posture

HIPAA-aligned

PII-scrubbed prompts · BAAs · log redaction

GDPR / EU AI Act

EU residency · DPA · model-card disclosures

On-prem / VPC deployment is available: Llama 4, Mistral, or Qwen on your cloud via vLLM. This is the standard pattern for healthcare and defense-adjacent engagements.

### Cost engineering

Token cost is the second-highest line item on most production agents after engineering time. We model expected cost during discovery and cut it 40-70% on the average build through routing, caching, and batch APIs.

40–70%

Token-cost reduction

Via model routing on a typical mid-volume agent

92%

Cache hit rate

On stable system prompts using Anthropic / OpenAI prompt caching

5–10×

Batch API throughput

On overnight enrichment / classification workloads

-   **Model routing:** a classifier routes by query complexity. Easy queries go to GPT-5 mini or Claude Haiku at 1/20th the cost; hard ones go to the frontier model. Quality holds via the eval gate.
-   **Prompt caching:** Anthropic / OpenAI prompt caching on stable system prompts and tool definitions. 90%+ cache hit rate on most agents within two weeks of launch.
-   **Batch API for async:** overnight enrichment, classification, scoring. 50% cost cut vs the sync API, 5–10× throughput.
-   **Token budget per request:** hard ceilings on context size and tool-call chain length. Outliers get circuit-broken, not silently bloated.
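
To see how model routing alone could land in the 40–70% band, here is a back-of-the-envelope sketch. It assumes every query costs the same on a given model, that the small model costs 1/20th of the frontier model (the figure from the routing bullet), and that the classifier itself is free; the function names and the complexity threshold are hypothetical.

```python
def route(complexity_score: float, threshold: float = 0.5) -> str:
    """Send queries below the (assumed) complexity threshold to the small model."""
    return "small" if complexity_score < threshold else "frontier"


def blended_cost_reduction(easy_share: float,
                           cheap_cost_ratio: float = 1 / 20) -> float:
    """Fractional cost saved versus sending every query to the frontier model.

    easy_share: fraction of traffic routed to the small model (0..1).
    cheap_cost_ratio: small-model cost as a fraction of frontier cost.
    """
    if not 0.0 <= easy_share <= 1.0:
        raise ValueError("easy_share must be between 0 and 1")
    blended = easy_share * cheap_cost_ratio + (1.0 - easy_share)
    return 1.0 - blended
```

Under these assumptions the saving is `0.95 * easy_share`, so routing half of traffic to the small model saves about 47.5%, and the 40–70% band corresponds to roughly 42–74% of queries being classified as easy.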

All three concerns share a pattern: the discipline is in the spec, not in the build. We name the threat model, the compliance posture, and the cost band during the discovery phase, and the build executes against those targets. Security and cost aren't add-on phases that happen after the agent works. They're how it gets to work.

008 / USE CASES

## Where teams have shipped agents.

Engagements anonymised. Industry and segment are real; metrics are real; brand names removed under standard NDA terms.

Use cases are organised by function (sales, support, ops, coding, research, voice), not by industry. The same plan/act/reflect loop ships to a B2B SaaS and a manufacturer; what changes is the integration surface and the eval anchor, not the agent shape. Below are six representative engagements: three flagship cases (full numbers) and three additional function stubs (recent shipments where the metric narrative is short).

Sales

B2B SaaS · 11–50 emp

### Lead-qualification + outbound research agent

Pulls signals from LinkedIn, Crunchbase, the prospect's own website, and recent news. Scores fit against the ICP, drafts a personalised first-touch message citing the strongest 2 signals, and only hands off to an AE when the score crosses a tuned threshold. Replaced 2.5 SDR seats in the first six months. The AEs report higher-quality top-of-funnel and shorter first-call discovery.

2.5

SDR seats replaced

Support

Health-tech · enterprise

### Tier-1 deflection agent

RAG over product docs and a redacted 18-month ticket archive. Resolves password resets, billing edits, and onboarding questions without any human touch. Clinical questions are escalated immediately to a human, with the agent's draft attached so the responder has full context. Cut p1 ticket volume by 38% over 90 days, with zero clinical false negatives in the eval set.

−38%

p1 ticket volume

Ops

Mfg · 200+ emp

### Invoice matching + AP routing agent

Reads PDF and scanned invoices, runs OCR + LLM extraction, matches against open POs in NetSuite, and routes to the correct approver via Slack with a structured summary. Exceptions go to the ops lead with an annotated diff explaining why the agent didn't match. Reached ROI inside six months. The ops lead now handles the 8% of invoices that genuinely need judgment.

ROI in <6 months

Coding

Dev-tools SaaS · 50–200 emp

### Repo-aware code review agent as a PR gate

Wired into GitHub Actions on every PR. Pulls the repo's conventions, runs a critic loop on the diff, leaves inline review comments. Flagged a missed null check on 12% of merged PRs in the first month.

0 %

\+ issue catch rate

Research

Fin services · 1,000+ emp

### Regulatory research agent across 6 jurisdictions

Multi-step research over published regulations + the firm's internal interpretation memos. Cited outputs, refuses on out-of-corpus, escalates ambiguity to a compliance reviewer rather than guessing.

0

8 days → min per memo

Voice

Health-tech · enterprise

### Intake triage voice agent on LiveKit

Phone intake agent with sub-400ms turn-take. Asks the standard intake questions, escalates clinical-judgment cases to a nurse with full context. PII-scrubbed transcripts; HIPAA-aligned deployment.

p95 turn-take 320ms

Patterns across all six engagements: **the metric anchor was scoped in week 2**, before code; **the eval set grew during production** via sampled traces; **handoff included the runbook in the client's repo**, not in a doc somewhere. Outcome numbers are what your team measured at week 8 post-launch, not at deploy. The work that matters happens after the agent ships, so picking an AI agent development company that stays for that work is the most underrated criterion in vendor selection. As an agentic AI company, we run post-launch eval reviews as part of the standard SOW, not as an add-on.

009 / ENGAGE

## Three ways to start.

Every AI agent development engagement is fixed-scope and fixed-duration. The first phase is small enough that stopping is a real option: about a third of our pilots end at the pilot for legitimate scoping reasons. That's a feature, not a bug. It is cheap to discover that the workload doesn't fit in week 4, and expensive to discover it 12 weeks in.

Pilot · 2–4 weeks

Pilot · 4 weeks · 4 phases

-   **Week 1, Discovery:** Workload map + eval surface scoped. Workload boundary signed off.
-   **Week 2, Spec:** Stack picks + 30–50 graded eval examples. Eval examples agreed by your domain expert.
-   **Week 3, Prototype:** First runnable agent + baseline scores. Baseline accuracy hit.
-   **Week 4, Demo:** Demo + scoping memo for next phase.

Build · 8–16 weeks

Build · 16 weeks · 6 phases

-   **Week 1–2, Discovery + Spec:** Workload map, stack lock, eval scope.
-   **Week 3–6, Prototype:** Runnable agent against eval set. Baseline accuracy hit.
-   **Week 6–10, Eval gates:** Four metrics green vs target. All four green.
-   **Week 10, Deploy:** Auth, observability, rollback drilled.
-   **Week 11–14, Iteration:** Weekly eval review + prompt iteration.
-   **Week 15–16, Handoff:** Runbook in your repo, ownership transferred.

Multi-Agent + Voice · 10–20 weeks

Multi-Agent + Voice · 20 weeks · 5 phases

-   **Week 1–3, Discovery + Spec:** Workload graph + per-agent eval surfaces.
-   **Week 4–8, Prototype:** Supervisor + 2 workers + critic running.
-   **Week 9–14, Eval + Voice:** Per-agent eval gates green; voice latency tuned. Per-agent success + routing accuracy.
-   **Week 15–18, Production:** Telephony / SDK integration + observability.
-   **Week 19–20, Handoff:** Multi-agent runbook + on-call rotation.

01 Agent Pilot Fixed scope

2–4 weeks

### Pilot one agent, intake to live.

In scope

-   One scoped use case and workflow map
-   Eval framework with 30–50 graded examples
-   Working prototype against your real data
-   Demo, scoring report, and a recommendation memo for the next phase

Out of scope

-   Production deploy and integrations
-   Multi-agent orchestration
-   Voice / sub-400ms latency work

02 Custom Agent Build Fixed scope

8–16 weeks

### Production build with eval gates.

In scope

-   Everything in the Pilot
-   Production integrations, auth, observability, rate limits, fallback policies
-   Eval gates baked into the deploy pipeline (regression alarms enabled)
-   Four weeks of post-launch iteration with weekly eval runs
-   On-call runbook and ownership transfer

Out of scope

-   Open-ended Run engagements after week 16 (separate SOW)

03 Multi-Agent + Voice Fixed scope

10–20 weeks

### Multi-agent or voice systems.

In scope

-   Supervisor / worker / critic orchestration on LangGraph or CrewAI
-   Or voice agents with sub-400ms turn-taking on LiveKit / Pipecat / Vapi
-   Eval focus on per-agent success and inter-agent routing accuracy
-   Production wire-up including telephony or in-app SDK integration

04 Agent Rescue Fixed scope

4–6 weeks

### Diagnose and fix a struggling agent.

In scope

-   Trace + eval audit of the existing agent (tool-call accuracy, loop rate, p95 latency, cost per task)
-   Root-cause memo naming where it actually fails: prompt, planner, tool surface, retrieval, or evals
-   Targeted rebuild of the failing layer with regression tests before swap-in
-   Handover with a sustainable eval gate so the next regression is caught in CI, not by users

Out of scope

-   Rewrite from scratch (becomes Custom Agent Build)
-   Migrations to a different orchestrator unless root-cause requires it

Want ongoing iteration after week 16? A **Run engagement** is a separate monthly SOW: typically one AI engineer half-time, a weekly eval review, and a fixed iteration budget. We move you to Run only if the workload genuinely benefits from continued investment, which is roughly half of completed builds. As an agentic AI company we're built for this: custom AI agent development doesn't end at deploy.

010 / FAQ

## Common buyer questions about AI agent development.

If the answer you need isn't here, the contact form is faster than email; an engineer triages it the same day.

How is AI agent development different from chatbot development?

Chatbots are **single-turn or short-turn conversational systems** with minimal autonomy. The user asks, the chatbot answers. State is small or none, tool use is minimal (usually a single RAG retrieval), and the eval anchor is "intent accuracy + answer relevance."

AI agents are **autonomous, goal-driven systems** that take multiple steps to complete a task. They plan, call tools, reflect on intermediate results, and decide their next move. State is rich and stateful loops survive crashes via checkpointing. The eval anchor is "task success + tool-call accuracy + latency budget."

The decision is rarely binary. Most failed projects we audit picked the wrong shape: a chatbot when the work needed autonomy, or an agent when a chatbot would have shipped in half the time. [Our flagship piece breaks down the seven dimensions](/blog/ai-agents-vs-chatbots/) that decide between them.

Do you work with our existing AI stack (Claude / GPT / Gemini / Llama)?

Yes. We're model-agnostic by default: we benchmark **at least two models against your eval set** before locking the choice. The leader on cost-adjusted quality wins, regardless of which vendor we'd default to.

-   **Hosted:** Claude 4/Sonnet, GPT-5, Gemini 3.0/Flash, Mistral hosted
-   **Self-hosted:** Llama 4, Mistral, Qwen on vLLM/TGI on your cloud
-   **Routing:** Production agents often route across 2–3 models by query complexity to cut cost 40–70% without quality loss

If you have an existing contract with a specific provider, we work within it. If you don't, we'll recommend the routing pattern that fits the workload.

Who owns the code, prompts, and eval sets at the end?

You do. Everything we ship transfers into your repository under the SOW:

-   All agent code (framework wrappers, tool definitions, integration glue)
-   All prompts in portable YAML, re-hostable on a different framework if needed
-   The eval set (30–50+ graded examples with criteria)
-   Infrastructure-as-code (Terraform / Pulumi) for the deployment
-   Runbook and on-call procedures

Paiteq retains zero rights to your prompts, eval data, fine-tuned weights, or domain examples. We keep the **engineering learnings** (patterns and methodologies) for our internal playbook. That's it.

How long does it take to ship a production AI agent?

An Agent Pilot ships in 2–4 weeks. A Custom Agent Build with eval gates, integrations, and observability runs 8–16 weeks. Voice agents and multi-agent systems are longer because of latency tuning and orchestration complexity. We always scope a fixed-duration first phase so you can stop or scale up after seeing the prototype.

What frameworks do you build on, and how do you choose?

We default to LangGraph for stateful agents that need explicit graph control, CrewAI for multi-agent supervisor / worker patterns, Vercel AI SDK or the OpenAI Agents SDK for simpler tool-calling, and Composio when the tool surface is large and pre-built integrations matter. Framework choice follows the workload, not the other way around. We do not have a house framework we push regardless of fit.

How is an AI agent different from a chatbot?

Chatbots are single-turn, stateless, scripted by intent maps, and measured on intent classification accuracy. Agents are multi-turn, stateful, goal-driven, use tools autonomously, and are measured on task success rate. Picking the wrong one is the most common scoping mistake; we cover this in detail in our piece on AI agents vs chatbots.

How do you measure agent quality and prevent hallucinations?

Every agent ships with an eval set scoped during discovery, usually 30 to 50 graded examples covering the agent's main cases and the edge cases that worry the business. We track task success rate, tool-call accuracy, hallucination rate (via LLM-as-judge plus human spot-check), and p95 latency. Eval runs weekly post-deploy. If any metric drops more than 5%, a regression alarm fires and the build is paused.
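
The regression alarm described above can be sketched as a comparison between a baseline eval run and the latest one. The text does not say whether "more than 5%" is relative or in percentage points; this sketch assumes a relative change, and the metric names and the `LOWER_IS_BETTER` set are hypothetical.

```python
from typing import Dict, List

# Metrics where a higher number is worse (assumed names).
LOWER_IS_BETTER = {"hallucination", "p95_latency_s"}


def worsening(name: str, baseline: float, current: float) -> float:
    """Relative change in the bad direction; positive means the metric got worse."""
    if baseline <= 0:
        raise ValueError(f"baseline for {name!r} must be positive")
    change = (current - baseline) / baseline
    return change if name in LOWER_IS_BETTER else -change


def regressions(baseline: Dict[str, float], current: Dict[str, float],
                max_worsening: float = 0.05) -> List[str]:
    """Names of metrics that worsened by more than the allowed fraction."""
    return sorted(
        name for name, base in baseline.items()
        if worsening(name, base, current[name]) > max_worsening
    )
```

A non-empty result is the point at which the alarm fires and the build is paused; wiring that into CI or a scheduler is left out of the sketch.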

Who owns the IP, code, and prompts?

You do. All code, prompts, eval sets, and architecture diagrams are delivered into your repository under a transfer-of-ownership clause in the SOW. We retain no rights to your prompts or data. Paiteq keeps non-identifying engineering learnings (frameworks, patterns, eval methodologies) for our internal playbook.

How do you handle security, PII, and compliance?

Default posture is SOC-2-ready practices — audit logs, least-privilege IAM, key rotation, encryption at rest and in transit. We can deploy fully on your cloud (AWS, GCP, Azure) with no data leaving your perimeter, run prompt-level PII scrubbing via Presidio or your existing DLP, and use private-link endpoints to model providers where required. HIPAA and GDPR evidence work is included in regulated engagements.

Can we start with a pilot and scale to production?

Yes. The Pilot is designed to graduate into a Custom Build: the eval framework, prompts, and architecture carry forward. About 70% of pilots we run convert to a production engagement. The 30% that don't either pivoted scope based on what the pilot revealed, or decided the workflow wasn't yet ready for AI. Both are valid outcomes.

What's the typical team shape on an engagement?

One AI engineering lead, one senior AI engineer, and a fractional product manager for scope and stakeholder management. Multi-agent and voice projects add a second AI engineer. We run two-week iteration cycles with a weekly demo. You always have a direct Slack channel with the build team, with no account-management buffer.

011 / FURTHER READING

## Where this practice connects.

The interesting question isn't whether you can wire three agents into a CrewAI graph; it's which shape to pick. Our [multi-agent system orchestration patterns in production](/blog/multi-agent-orchestration-patterns/) covers supervisor, swarm, hierarchical, and the failure modes that only surface after the demo goes live.

When the workload needs deterministic rules on clean paths and LLM judgment on exceptions, the build crosses into our [AI automation agency](/services/ai-workflow-automation/) practice; n8n, Make, Temporal, and custom orchestrators, every workflow eval-graded before it ships. The retrieval substrate, when grounded answers matter inside the agent loop, comes from our [RAG development services](/services/rag-development/) bench. And the upstream strategic frame — should this even be an agent — usually starts in [AI consulting services](/services/ai-consulting/).

Industry routes: [AI for SaaS companies](/ai-for-saas/) is the most common entry shape — sales agents, RAG copilots, embedded AI search. [AI for fintech](/ai-for-fintech/) agents (KYC workflows, transaction-triage co-pilots) typically need the stricter eval-gate posture; that's the engagement shape we lead with for regulated buyers. [AI for ecommerce](/ai-for-ecommerce/) agents (browse-recovery, merchandising) and [AI healthcare software development](/ai-for-healthcare/) agents (utilization review, prior-auth drafting) round out the most-shipped industry mix. When the upstream workflow is rules-heavy and migrating from a legacy automation estate, route through [RPA development services](/services/rpa-development/) for the bridge. The wider engineering context lives on [the Paiteq practice page](/about/); the full [AI development services](/services/) menu shows adjacent practices, with the broader [AI development company](/) story on the homepage and the founder profile of [Navin Sharma](/team/navin-sharma/).

012 / Related practices

## Adjacent services.

-   [RAG Development](/services/rag-development/): Retrieval-augmented generation systems with evaluation built in.
-   [LLM Development](/services/llm-development/): Custom LLM apps covering RAG, fine-tuning, evaluation, deployment.
-   [AI Workflow Automation](/services/ai-workflow-automation/): Intelligent workflows on n8n, Make, and custom agent orchestration.

013 / Start a project

## Let's *build* something that ships.

Pilot in 2–4 weeks. Custom build in 8–16. Same-day response on every inbound.

[Talk to engineering](/contact/) [Architecture review](/contact/?topic=arch-review)