# Chatbot Development Services — Paiteq

> Chatbot development services from an engineering-led AI chatbot development services agency. Custom chatbot development, enterprise chatbot development, LLM chatbot development, RAG chatbot, and voice chatbot development. Fixed scope, eval-instrumented, multi-channel.

**HTML version:** https://www.paiteq.com/services/chatbot-development/

## Key facts

- Types: custom, enterprise, LLM, RAG, voice.
- Channels: web, WhatsApp, Slack, Teams, voice.
- Eval-instrumented; fixed scope.

## Related pages

- [RAG Development](https://www.paiteq.com/services/rag-development/)
- [AI Agent Development](https://www.paiteq.com/services/ai-agent-development/)
- [Services hub](https://www.paiteq.com/services/)

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements. Same-day reply from engineering. NDA counter-signed before discovery. Walk-away clause on every engagement.

**Site index for agents:** https://www.paiteq.com/llms.txt
**Full content for agents:** https://www.paiteq.com/llms-full.txt
**Book a call:** https://www.paiteq.com/contact/

---

## Full content


# *Chatbot development services*: grounded chat, tool-using chat, voice chat, hybrid AI + human handoff

Chatbot development services from an engineering-led custom chatbot development team. Conversational AI development for customer service, sales, internal Q&A, voice deflection, and regulated grounded chat: eval-instrumented, multi-channel, with a warm-handoff layer to the human agent queue. We default to Claude Sonnet 4.6 with a multi-vendor routing layer above; we don't ship single-vendor chat stacks.

[Talk to a partner](/contact/) [See engagement shapes](#engage)

- Practice: Conversational AI development
- Shapes: Grounded · Tool-using · Voice · Hybrid
- Default: Claude Sonnet 4.6 + GPT-5 mini fallback
- Engagements: 3–12 weeks · fixed scope

001 / FRAME

## Chatbot or agent? The most expensive misframing in this category.

Most buyers arrive at chatbot development services with an agent-shaped problem, or vice versa. Pricing the wrong shape burns one to two quarters of engineering before anyone notices. The grid below is the first frame we run: single-turn-ish conversational chat on the left, multi-step autonomous agent on the right, scored on the dimensions that actually drive the architectural call. Read it once; the rest of this page assumes you've landed on the chat side. If you're on the agent side, route to our [AI agent development](/services/ai-agent-development/) practice.

| Dimension | Chatbot (this pillar) | Agent (sibling pillar) |
| --- | --- | --- |
| Decision shape | Single-turn-ish: answer or escalate within one conversation | Multi-step autonomous: plan, act, observe, iterate over N turns |
| Default response budget | Sub-2-second perceived latency; voice ≈ sub-400ms turn-take | 5–60 seconds is normal; long-running tasks measured in minutes |
| Tool calls per session | Zero to three: retrieve, lookup-by-id, maybe a write | Often ten-plus; explicit state machine over the tool surface |
| Grounding source | RAG over enterprise corpus or product KB; tight context window | Open-ended retrieval, web search, multi-corpus, internal tools |
| Failure mode | Hallucinated answer, missed escalation, wrong tone | Tool-call loops, plan drift, partial state, idempotency bugs |
| Right model class | Claude Sonnet 4.6 or GPT-5 mini: fast, cheap, well-grounded | Claude Opus 4.7 / GPT-5 reasoning, for the plan layer |
| Right framework | Direct SDK + a thin orchestrator (or LangGraph for state) | LangGraph with proper state-graph control; never CrewAI for serious builds |
| Eval surface | Faithfulness, escalation precision, tone, latency p95 | Trajectory eval, tool-call success, plan coherence, end-state correctness |
| Where we recommend it | Customer service, sales-qualifying, internal Q&A, voice deflection | Multi-system workflows, research agents, ops automation; see siblings |

Speed is a feature in customer-facing chat: users abandon threads after 4–5 seconds of silence. The sub-2-second budget isn't arbitrary; it's the threshold where **chat still feels like chat** rather than a loading screen. Agents trade latency for capability by design, which is the right call for back-office work but the wrong call on a live support surface.

Tool depth is the clearest capability signal. Three tool calls is the practical ceiling for a chatbot before the state tracking collapses and latency blows the budget. If your build needs more than that, you're not designing a chatbot; you're designing an agent with a chat surface. The explicit state machine that agents require isn't overhead; it's the mechanism that keeps tool-call ten coherent with tool-call one.

Tight scope is a chatbot advantage, not a limitation. Grounding against a controlled corpus means every retrieval chunk can be audited, every answer traced back to a source document. Agents need open-ended retrieval because their tasks demand it, but that breadth trades away the faithfulness guarantees that [RAG-grounded chatbots](/services/rag-development/) can provide and regulated industries require.

Eval surface is where chatbot builds most often cut corners. Faithfulness and escalation precision sound simple, but measuring them at scale requires a dedicated harness: RAGAS for retrieval quality, Langfuse traces for latency p95, and a held-out escalation set that runs on every deploy. Agents have it harder (trajectory evals are genuinely more complex), but chatbot teams who skip the eval build are the ones regretting it in month three.

Most production builds aren't pure one-or-the-other: a chatbot can call a small agent loop for a sub-task; an agent can expose a chat surface to its operator. The point of the frame is to lock the primary shape so the architecture, the model class, the framework, and the eval surface all line up.

If your buyer is asking for "an AI agent" but the actual user-facing surface is a single-turn customer support widget, you're in the chat pillar; the agent vocabulary is procurement-side language, not architecture. We'll settle that on the framing call and route accordingly.
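The grid above can be read as a rough decision rule. The sketch below is one hedged way to encode it; `BuildProfile`, its field names, and `primary_shape` are hypothetical, and the thresholds (three tool calls, two seconds) are simply lifted from the grid, not a formal rubric.

```python
from dataclasses import dataclass


@dataclass
class BuildProfile:
    """Hypothetical summary of a proposed build, filled in during framing."""
    max_tool_calls_per_session: int
    latency_budget_seconds: float
    user_waits_in_thread: bool  # True if the user sees one thread that resolves or escalates


def primary_shape(profile: BuildProfile) -> str:
    """Return "chatbot" or "agent" using thresholds taken from the grid."""
    if not profile.user_waits_in_thread:
        return "agent"  # the user kicks off a job and comes back later
    if profile.max_tool_calls_per_session > 3:
        return "agent"  # past the practical chatbot ceiling of three tool calls
    if profile.latency_budget_seconds > 2.0:
        return "agent"  # chat is expected to feel sub-2-second
    return "chatbot"
```

A real framing call weighs more than three fields; the value of writing it down is that the primary shape gets decided explicitly rather than drifting.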

002 / SHAPES

## Five chatbot shapes. Every chatbot development services engagement maps to one.

Conversational AI development isn't a single shape; it's five distinct engineering envelopes with different default models, different frameworks, different eval surfaces, and different cost curves. The five shapes below cover roughly 100% of inbound. We won't sell you a voice chatbot development engagement when the deflection use case lives on web; we won't sell you a tool-using build when the actual ask is RAG-grounded.

-   01 · GROUNDED
    
    ### RAG chatbot over a curated corpus
    
    Customer service deflection, internal Q&A, regulated policy chat, product knowledge base. The corpus is curated, the answer is grounded, the eval surface is faithfulness plus retrieval precision. Roughly 60% of engagements. Default stack: Claude Sonnet 4.6, pgvector + BM25 hybrid retrieval, RAGAS eval, Langfuse traces.
    
-   02 · TOOL-USING
    
    ### Chat that calls structured tools
    
    Order-status, return-initiation, appointment-booking, CRM-aware chat. Three to eight tools is the typical envelope; tool-call success rate matters more than reasoning depth. Unlike an agent, the conversation stays single-thread. Default stack: GPT-5 mini or Sonnet 4.6, Zod / Pydantic schemas, function-calling SDK, eval on tool arg correctness.
    
-   03 · VOICE
    
    ### Sub-400ms voice chatbot development
    
    Phone-channel deflection, clinical intake, sales prospecting outbound, IVR replacement. Latency budget is the headline constraint. LiveKit Agents or Pipecat as the runtime, Deepgram for streaming STT, ElevenLabs for TTS, Sonnet 4.6 as the chat brain. Eval includes turn-take p50/p95 and barge-in recovery; neither shows up in a text eval suite.
    
-   04 · HYBRID + HANDOFF
    
    ### AI tier-one with warm transfer to humans
    
    Enterprise support where AI handles tier-one and escalates to a human agent with full conversation context. Roughly 40% of enterprise chatbot development engagements. The handoff layer is the highest-value engineering decision: warm transfer with context wins; cold handoff that re-asks the customer's name loses. Default: LangGraph state-graph, Zendesk Sunshine handoff, co-pilot suggestions in the agent's queue.
    
-   05 · MODERNISE
    
    ### Legacy chatbot replacement
    
    Drift, Ada, Intercom Fin, first-gen LLM chatbot, or rule-based legacy. Migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline. Target stack usually keeps the channel surface (Zendesk, Intercom, Twilio) and replaces the brain. Ships behind progressive rollout; cutover is reversible until the eval baseline is matched or exceeded.
    

A chatbot can call a small agent loop for one sub-task and still be a chatbot. The primary shape is the one that owns the user-facing surface: if the user sees a single chat thread that resolves or escalates, you're in chat; if the user kicks off a job and comes back later for the result, you're in [agents](/services/ai-agent-development/). Most inbound for a support chatbot or customer service chatbot lands in shape 01 (grounded) or shape 04 (hybrid handoff); a support chatbot with a real human queue almost always wants 04.
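Shape 05 mentions shipping behind a progressive rollout with a reversible cutover. A minimal sketch of one way to do that is deterministic percentage bucketing; `in_rollout` and the hashing scheme are illustrative assumptions, not a description of any particular feature-flag product.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in or out of the new-chatbot cohort."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Because the bucket depends only on the user ID, raising `percent` only adds users to the cohort, and lowering it routes them back to the legacy bot, which is what keeps the cutover reversible.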

003 / NUMBERS

## What production chatbot development services look like at the latency level.

Performance targets we hold ourselves to before any chatbot development services engagement ships. The numbers below are typical-workload defaults; actual targets land in writing during the discovery phase and become the ship gate. We don't ship below the latency or faithfulness threshold; we'd rather miss the launch date than ship a chat that hallucinates under production traffic.

| Metric | Target | Context |
| --- | --- | --- |
| Grounded chat response | p95 < 1.8s | RAG + Claude Sonnet 4.6, typical workload |
| Voice turn-take | p50 ≈ 320ms | LiveKit + Pipecat, Deepgram streaming |
| Average grounded reply | $0.004 | Sonnet 4.6 + pgvector retrieval, mid-volume |
| Tier-1 deflection target | 0 –94% | Per use case; eval-validated, not vendor-claimed |
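A per-reply cost figure comes down to token counts times per-million-token prices. The sketch below shows only the arithmetic; the token counts are hypothetical, the prices are picked from the $0.3–3 input / $1.5–15 output band quoted in the stack section, and the inputs behind the $0.004 figure are not stated on this page, so this is not an attempt to reproduce it.

```python
def reply_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one model call, given per-million-token prices in USD."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical reply: 1,000 prompt tokens (system prompt + retrieved chunks)
# and 150 completion tokens, priced at $3 / $15 per million tokens.
example = reply_cost_usd(1_000, 150, 3.0, 15.0)
```

With these assumed inputs the call comes to $0.00525. Retrieval and embedding costs are left out of the sketch.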

004 / STACK

## The six families of the modern chatbot stack.

A modern enterprise chatbot development build touches six families: chat model, conversation framework, retrieval, voice (if voice is in scope), observability + eval, and channel + handoff. The categories below name the default pick, the cost-floor alternative, and the conditions under which we'd revisit. Per-family recommendations land in writing during discovery; the chat-stack inventory is the artefact that survives the engagement.

### Chat model layer (Claude Sonnet 4.6 · GPT-5 mini · Gemini 3 Flash)

Strengths

The mid-tier hosted models are where production chat lives in 2026. Claude Sonnet 4.6 leads on grounded answers and tone control. GPT-5 mini wins on raw throughput at scale. Gemini 3 Flash is competitive at the cheap end. Pricing is in the $0.3–3 input / $1.5–15 output per million tokens band, about an order of magnitude cheaper than the frontier tier and indistinguishable from it on most chat workloads we audit.

When We Pick

Default pick for almost every chatbot development services engagement. Customer service deflection, internal Q&A, sales qualifying, support knowledge bases. Below ~500M monthly tokens the hosted-mid economics beat self-hosted by 3–8×. C-suite buyers who want a named vendor on the contract.

When We Don't

Strict data residency where the provider's region map can't close: route to [self-hosted Llama 4](/services/llm-development/) on your infra. High-frequency low-value chat (FAQ-style triage) where a smaller distilled model wins on cost. Reasoning-heavy multi-step shapes: that's not a chatbot, that's an [agent](/services/ai-agent-development/).

Paiteq Pattern

Default architecture: Sonnet 4.6 for grounded chat, GPT-5 mini as the multi-vendor backstop with a thin routing layer above. Costs roughly a week of engineering and saves the renegotiation when one provider hikes prices in month nine. We've never recommended a single-vendor chat stack for any production engagement.

Mid-hosted · Multi-vendor · Grounded
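The thin routing layer described in the pattern above can be sketched as an ordered fallback. This is a minimal illustration under assumptions: `Provider`, `route`, and `AllProvidersFailed` are hypothetical names, and each real vendor SDK call would have to be wrapped to fit the simple prompt-in, text-out shape used here.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# Real vendor SDK calls would be wrapped to fit this shape (assumption).
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider errored."""


def route(prompt: str, providers: Sequence[tuple[str, Provider]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch the vendor's specific error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the reply lets traces record which vendor actually served each turn.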

### Conversation framework (LangGraph · OpenAI Agents SDK · raw SDK)

Strengths

Chat doesn't always need a framework. A thin orchestrator over the SDK is the cleanest pattern for single-turn-ish chat: a system prompt, a retrieval call, a response, a handoff branch. LangGraph earns its keep when conversation state matters across turns: long context, multiple tool calls, structured handoff to a human agent. The OpenAI Agents SDK is fine for narrow OpenAI-only builds; we don't recommend it for multi-vendor chat.

When We Pick

LangGraph when state across turns matters: escalation flows, multi-step diagnostic chat, conversational forms with branching logic. Raw SDK + thin orchestrator when chat is genuinely single-turn-ish. Always wire eval and observability before the second sprint, regardless of framework.

When We Don't

CrewAI for serious chat: the role abstraction starts to fight you once you need state-graph control, and we've yet to ship a CrewAI-based chat into production without a re-platforming sprint. AutoGen: stalled relative to LangGraph; not recommended for new builds.

Paiteq Pattern

Roughly two-thirds of our chatbot development services engagements ship on a thin SDK orchestrator. The other third use LangGraph, usually because the chat has a handoff layer (warm transfer to human, escalation router, multi-channel context sync) where the state graph carries real weight.

State-machine · Thin-SDK · LangGraph
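The thin-orchestrator pattern named above (a system prompt, a retrieval call, a response, a handoff branch) fits in a few lines. The sketch below is illustrative only: `handle_turn`, the `Retriever` and `ChatModel` type aliases, and the prompt text are assumptions, with the retriever and model injected as plain callables so no vendor API is implied.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]   # question -> context chunks
ChatModel = Callable[[str, str], str]    # (system_prompt, user_message) -> reply

SYSTEM_PROMPT = "Answer only from the provided context. If unsure, say so."


def handle_turn(question: str, retrieve: Retriever, chat: ChatModel) -> dict:
    """One single-turn-ish exchange: retrieve, then answer or branch to handoff."""
    chunks = retrieve(question)
    if not chunks:
        # Handoff branch: nothing to ground on, so escalate instead of guessing.
        return {"action": "handoff", "reply": None, "context": []}
    context = "\n\n".join(chunks)
    reply = chat(SYSTEM_PROMPT, f"Context:\n{context}\n\nQuestion: {question}")
    return {"action": "answer", "reply": reply, "context": chunks}
```

Returning the chunks with the reply keeps every answer traceable to its sources, which is what a faithfulness eval needs downstream.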

### Retrieval layer (pgvector · Pinecone · Qdrant · BM25 hybrid)

Strengths

RAG chatbot work lives or dies on the retrieval layer. pgvector is the cheapest, lowest-friction pick when Postgres is already in the stack, and Postgres is already in roughly nine out of ten enterprise stacks. Pinecone Serverless cuts ops bandwidth to near zero at a premium tier. Qdrant self-hosts cleanly when data residency is non-negotiable. BM25 hybrid (sparse + dense) wins on technical-product corpora where the exact-keyword match still beats embeddings half the time.

When We Pick

Any grounded chatbot: clinical, legal, regulated, internal knowledge base, product Q&A. In almost every chatbot engagement we ship, we recommend retrieval-augmented chat as the spine, not a generic conversational LLM generating answers from training data. We route the deeper retrieval-pipeline work to our [RAG development services](/services/rag-development/) practice when the scope earns it.

When We Don't

Pure tone-and-brand chat with no enterprise knowledge to ground against (rare, usually a sign the product team hasn't found the use case yet). Tiny corpora under 5k chunks where in-context retrieval beats a vector store and the index is overkill.

Paiteq Pattern

Default recommendation: pgvector when Postgres is already there; Pinecone Serverless when ops capacity is the constraint; Qdrant self-hosted when data residency requires it. We don't recommend Weaviate for a chat-only build unless multi-modal retrieval is the headline requirement.

RAG · Hybrid · pgvector-first
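Hybrid (sparse + dense) retrieval needs some way to merge the BM25 ranking with the vector ranking. The page does not say which fusion method is used; the sketch below uses reciprocal rank fusion as one common choice, and `reciprocal_rank_fusion` and the chunk IDs are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)  # higher-ranked items contribute more
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


dense = ["a", "b", "c"]    # hypothetical vector-search order
sparse = ["c", "a", "d"]   # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Rank-based fusion avoids having to normalise BM25 scores against cosine similarities, which live on different scales.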

### Voice stack (LiveKit Agents · Pipecat · Deepgram · ElevenLabs)

Strengths

Voice chatbot development is its own engineering discipline. LiveKit Agents and Pipecat both land sub-400ms voice turn-take in production, and the latency budget is the whole game. Deepgram leads on streaming STT accuracy at scale. ElevenLabs leads on voice quality; the open-source side (Whisper Large v3, F5-TTS) is closing fast but still trails on edge cases. Sub-second perceived turn-take is the difference between a voice agent that feels conversational and one that feels like an IVR with extra steps.

When We Pick

Support deflection at scale where call-centre cost is the headline number. Clinical intake, sales prospecting, after-hours triage. Anywhere the buyer journey involves a human picking up a phone. Roughly 25% of our voice chatbot development engagements in 2026 have replaced an existing IVR rather than building net-new.

When We Don't

Voice as a CEO whim with no buyer-journey evidence: the build is hard, the deflection economics are real but specific, and a voice agent without a deflection use case is theatre. Multi-step task automation: that's an [agent](/services/ai-agent-development/) with a voice front, not a voice chatbot.

Paiteq Pattern

Default voice stack: LiveKit Agents + Claude Sonnet 4.6 + Deepgram + ElevenLabs. Open-source substitutes priced as a phase-two option when the volume crossover lands inside 12 months. We've shipped this exact stack three times in 2026 and never had to re-platform the voice layer.

Voice · Sub-400ms · Deflection

### Observability + eval (Langfuse · Braintrust · RAGAS · Inspect)

Strengths

Every chatbot we ship costs more in instrumentation than the buyer expected and earns it back inside the first month. Langfuse leads OSS observability with traces, prompt versioning, and a usable eval surface. Braintrust dominates closed-source eval workflows with a clean diffing UX. RAGAS is the default retrieval-and-faithfulness harness for grounded chat. Inspect AI (UK AISI-backed) is the rigour pick for safety-critical chat in regulated industries.

When We Pick

Every chat engagement. Day-one cost line, not a phase-three nice-to-have. Most stalled chat pilots we audit failed because nobody knew which prompts were drifting, which retrievals were silently returning irrelevant chunks, or which escalations were silently being misrouted. Instrumentation is the cheapest insurance in the stack.

When We Don't

Never. We've never shipped a production chatbot without observability wired before the second sprint. Toy demos and POCs are the only exception, and the moment the demo earns a budget, observability lands in the next ticket.

Paiteq Pattern

Default: Langfuse self-hosted for teams with data-control concerns; Braintrust for teams with budget and no ops capacity. RAGAS as the retrieval eval harness regardless of trace backend. We don't recommend bare logging-without-traces for anything past prototype; it's the false economy that creates the month-six stall.

Day-one · Traces · Eval-first

### Channel + handoff (Twilio · Intercom · Zendesk · Slack · web)

Strengths

The channel layer is where most chatbot pilots quietly fail. Twilio carries SMS, WhatsApp, and voice transport. Intercom and Zendesk are the dominant support-channel anchors; Zendesk Sunshine is the canonical handoff API for enterprise support chat. Slack is the default for internal-Q&A chat. The web widget is the simplest channel and the easiest to instrument; mobile native chat carries half the engineering cost and twice the polish.

When We Pick

Multi-channel chat is the default ask in 2026: buyers expect the same chatbot across web, Intercom, Slack, and a voice line. The handoff layer to a human agent is the single highest-value engineering decision: warm transfer with conversation context wins; cold handoff via ticket creation loses every time.

When We Don't

Single-channel chat where the channel is fixed and there's no plausible expansion path; usually the buyer hasn't priced the multi-channel ask yet. We'd still build the abstraction; cost is roughly a week and the option value is enormous.

Paiteq Pattern

Default: channel abstraction layer above whichever vendor (Intercom / Zendesk / Twilio) carries the surface. Warm-transfer-with-context as the handoff default. We've migrated chat across three channel vendors for two clients this year; the abstraction earns its keep on the migration.

Multi-channel · Warm-handoff · Twilio
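A channel abstraction layer of the kind described above usually amounts to one small interface plus one adapter per vendor. The sketch below assumes a hypothetical `Channel` protocol with a single `send` method; `InMemoryChannel` is a stand-in adapter for illustration, not any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Channel(Protocol):
    """What the chatbot core needs from any channel vendor (hypothetical interface)."""
    name: str

    def send(self, conversation_id: str, text: str) -> None: ...


@dataclass
class InMemoryChannel:
    """Stand-in adapter; a real one would wrap the vendor's SDK or webhook API."""
    name: str
    sent: list[tuple[str, str]] = field(default_factory=list)

    def send(self, conversation_id: str, text: str) -> None:
        self.sent.append((conversation_id, text))


def reply_on(channel: Channel, conversation_id: str, text: str) -> None:
    """The chatbot core talks to the interface, never to a vendor directly."""
    channel.send(conversation_id, text)
```

Migrating vendors then means writing one new adapter; the chatbot core and its eval harness stay untouched.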

005 / ARCHITECTURES

## Four conversation architectures. Pick the one that fits the buyer journey.

RAG-grounded, tool-using, voice, and hybrid-with-handoff are the four architectures we ship across roughly 100% of chatbot development services engagements. The architecture determines the model class, the framework, the eval surface, and the team shape. We won't sell you a tool-using build when the buyer journey is information retrieval; we won't sell you grounded chat when the buyer journey is task completion. The framing call is free.

   

01

### RAG-GROUNDED

The most common chatbot shape we ship. A retrieval pipeline (pgvector + BM25 hybrid is the typical default) feeds the chat model context from a curated corpus: product docs, knowledge base, ticket history, regulated policy. Roughly 60% of chatbot development services engagements land here. The headline failure mode is silent retrieval drift: chunks rank well on cosine similarity but answer the wrong question. Eval is faithfulness + retrieval precision tracked over time, not a quarterly cherry-pick.

Pick when

-   Customer service deflection on a known product surface
-   Internal Q&A over a curated KB
-   Regulated chat where every answer must be grounded in policy
-   Sales-assist over a product catalogue
-   B2B onboarding where the corpus is stable

Skip when

-   Open-ended chat with no curated corpus: that's a generic LLM, not a chatbot we'd build
-   Multi-step workflow execution: that's an agent
-   Pure entertainment chat: different design vocabulary entirely

Stack

Claude Sonnet 4.6 · pgvector + BM25 · RAGAS eval · Langfuse traces

02

### TOOL-USING

Chat that needs to look up an order status, file a return, schedule a callback, or update a CRM record. Unlike an agent build, the tool surface is narrow (typically three to eight tools), the conversation stays single-thread, and the tool-call success rate matters more than reasoning depth. Roughly 20% of engagements. Eval is tool-call success rate, argument correctness, and refusal precision when the user asks for something the tool surface doesn't cover.

Pick when

-   Order-status + return chat for ecommerce
-   Appointment-booking chat with calendar write
-   CRM-aware sales chat with lead-write
-   Support chat with ticket-creation
-   Banking-info chat with read-only account lookups

Skip when

-   Multi-step task automation: that's an agent loop, not a tool-using chatbot
-   Pure information retrieval: RAG-grounded is cheaper and faster
-   Workflows that need branching execution state: escalate to a state-graph architecture

Stack

GPT-5 mini / Sonnet 4.6 · Function-calling SDK · Zod / Pydantic schemas · Eval on tool args
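Eval on tool arguments can start as a comparison of expected versus actual calls over a labelled set. The sketch below is a simplified illustration: `ToolCall`, `tool_call_scores`, and the exact-match rule are assumptions, and a production harness would likely validate arguments against the Zod or Pydantic schema rather than compare dicts.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


def tool_call_scores(cases: list[tuple[ToolCall, ToolCall]]) -> dict[str, float]:
    """Score (expected, actual) pairs for tool choice and exact argument match."""
    if not cases:
        raise ValueError("need at least one eval case")
    right_tool = sum(1 for exp, act in cases if exp.name == act.name)
    right_args = sum(1 for exp, act in cases
                     if exp.name == act.name and exp.args == act.args)
    return {
        "tool_selection_rate": right_tool / len(cases),
        "arg_correctness_rate": right_args / len(cases),
    }
```

Separating tool selection from argument correctness shows whether failures come from picking the wrong tool or from filling the right tool badly.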

03

### VOICE

Voice chatbot development is its own engineering discipline. The latency budget is the headline constraint: anything past 500ms perceived turn-take feels like an IVR. LiveKit Agents and Pipecat dominate the production stack; streaming STT (Deepgram) plus parallel TTS (ElevenLabs) plus a small-context chat model is the canonical shape. Roughly 15% of engagements but the highest-value tier per minute of build. Eval is turn-take p50/p95, interruption handling, and barge-in recovery, none of which are visible in a text-chat eval suite.

Pick when

-   Support deflection where the existing channel is voice
-   Clinical intake at scale
-   Sales prospecting outbound
-   After-hours triage routing
-   IVR replacement where the legacy DTMF tree has lost the war on customer patience

Skip when

-   Voice as a CEO whim with no deflection use case
-   Visual-context-required interactions (returns with photo evidence, document review)
-   Anything where sub-second turn-take isn't actually the constraint; text chat is cheaper

Stack

LiveKit Agents · Pipecat · Deepgram streaming STT · ElevenLabs TTS
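Turn-take p50/p95 can be computed directly from logged per-turn latencies. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; `percentile`, `turn_take_report`, and the sample values are illustrative assumptions rather than output from any voice runtime.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the sample at 1-based rank ceil(pct% of n)."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def turn_take_report(samples_ms: list[float]) -> dict[str, float]:
    """Summarise turn-take latency at the two percentiles named in the eval."""
    return {"p50": percentile(samples_ms, 50), "p95": percentile(samples_ms, 95)}


# Hypothetical turn-take samples in milliseconds, including one slow outlier.
samples = [300, 310, 320, 330, 340, 350, 360, 370, 380, 900]
report = turn_take_report(samples)
```

The single slow turn leaves p50 untouched but lands squarely in p95, which is why both percentiles are tracked.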

04

### HYBRID + HANDOFF

Production support chat is rarely AI-only. The hybrid shape (AI handles tier-one, then escalates to a human agent with full conversation context, often with AI co-pilot suggestions in the agent's queue) is roughly 40% of enterprise chatbot development engagements. The headline engineering choice is the handoff layer: warm transfer with conversation context wins; cold handoff that re-asks the customer's name loses every customer who's been re-asked their name. We've migrated five clients off cold-handoff vendors in 2026; every one of them lifted CSAT inside the first month.

Pick when

-   Enterprise support with a real human agent team
-   Regulated chat where AI deflects but a licensed human signs the resolution
-   Complex products with long-tail questions AI can't reliably answer
-   Brands where escalation latency is itself a customer-experience metric

Skip when

-   Pure self-service products with no human support team
-   Volume-only deflection where the human agent doesn't exist
-   Workflows where AI cannot escalate quickly enough; that's an architecture flag, not an engagement

Stack

LangGraph state-graph · Zendesk Sunshine handoff · Co-pilot suggest layer · Eval on handoff precision
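Warm transfer comes down to what travels with the conversation when it reaches the human queue. The sketch below shows a hypothetical handoff packet; `Turn`, `HandoffPacket`, `build_handoff`, and the field names are assumptions for illustration and do not mirror the Zendesk Sunshine API or any other vendor schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str


@dataclass
class HandoffPacket:
    customer_name: str
    reason: str
    transcript: list[Turn]
    last_user_message: str


def build_handoff(customer_name: str, reason: str, transcript: list[Turn]) -> dict:
    """Bundle what the human agent needs so the customer is not re-asked anything."""
    user_turns = [t.text for t in transcript if t.role == "user"]
    packet = HandoffPacket(
        customer_name=customer_name,
        reason=reason,
        transcript=transcript,
        last_user_message=user_turns[-1] if user_turns else "",
    )
    return asdict(packet)  # plain dict, ready to serialise for the support platform
```

The cold-handoff failure described above is the case where this packet is missing and the agent starts from a blank ticket.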

006 / PHASES

## What a six-week custom chatbot development engagement actually ships.

A grounded chatbot pilot is roughly five distinct engineering phases against a fixed timeline. The phases below are the standard shape for a single-channel grounded chat; multi-channel adds a parallel integration sprint, voice chatbot development adds a barge-in and turn-take tuning phase, legacy modernisation prepends a contract-and-baseline read. The phases ship in series; eval lands in code before the chat model touches the corpus.

1.  01
    
    ### Discovery + use-case lock
    
    Sixty-minute exec session locks the use case, the channel, the deflection target, and the handoff posture. Existing channel data (call recordings, ticket archive, chat transcripts) is read for tone, escalation patterns, and the long tail of questions the human agent team currently absorbs. Output is a one-page chatbot spec everyone signs off in writing before any code lands. Some engagements end at this phase because the right answer is "do a discovery sprint first, build later"; we still ship the spec and bill the phase.
    
2.  02
    
    ### Corpus + retrieval scaffold + gold-set eval
    
    Ingest the corpus, chunk, embed, build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns the engineering time. First eval run on a 50-question gold set built with the buyer team: questions the human agent team actually fields, with the correct answer in the corpus and the wrong answer adjacent. Faithfulness and retrieval precision land in writing before the chat model touches the corpus. If the gold set fails at this stage, the corpus needs work before the chat does.
    
3.  03
    
    ### Chat surface + tool wiring
    
    System prompt iteration, tool-call wiring, response shaping. Chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface stays small and explicit: three to eight tools is the envelope; ten-plus tools is an agent in disguise. For grounded chat, the system prompt names the corpus, the failure modes, and the escalation triggers in writing.
    
4.  04
    
    ### Eval harness + handoff layer
    
    RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, a custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context (Zendesk Sunshine or the channel-native equivalent), never cold handoff. Eval thresholds locked in writing as the ship gate. The harness is the artefact that survives the engagement; the buyer's team owns it on day one of phase five.
    
5.  05
    
    ### Channel integration + UAT + ship
    
    Web widget, Intercom, Zendesk, Slack, or WhatsApp: whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass; voice can't reuse text eval, ever. Progressive rollout behind a feature flag; first-month tuning loop wired with the on-call team. Engagement closes with a written handover memo plus a thirty-day check-in.
    

Clean handoff is the default. The chatbot, the eval harness, the observability dashboards, and the channel integration code are all owned by the buyer's engineering team on day one of phase five. About a third of pilot engagements convert to a build engagement under shapes 02–04; we don't push that conversion, and the memo names the call either way.
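The gold-set run in phase 02 can be approximated with a plain precision-at-k calculation before a fuller harness such as RAGAS is wired in. The sketch below is a simplified stand-in, not the RAGAS metric; `precision_at_k`, `gold_set_precision`, and the chunk IDs are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    # Divides by the number actually returned, so a short result list is not penalised twice.
    return sum(1 for cid in top_k if cid in relevant) / len(top_k)


def gold_set_precision(gold: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Mean precision@k over (retrieved_ids, relevant_ids) pairs, one per gold question."""
    if not gold:
        raise ValueError("gold set is empty")
    return sum(precision_at_k(r, rel, k) for r, rel in gold) / len(gold)
```

Tracking this number per deploy, rather than once, is what turns it into a regression gate for retrieval drift.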

007 / CHANNEL

## Channel picker. Where the chatbot lives, by use case.

The channel decision drives the engagement scope more than buyers realise. A single-channel pilot is roughly three to four weeks; a multi-channel build is six to eight; a voice build is eight to twelve. The grid below is the call we'd make on the rubric: channel across the top, use case down the side. Yes-cells are the default; maybe-cells need a follow-up call; no-cells are usually a misframing we'd route differently.

| Use case | Web widget | Mobile native | Voice (phone) | Messaging (Slack / WhatsApp) |
| --- | --- | --- | --- | --- |
| Customer service deflection | Default | Strong | High-value | Common |
| Sales qualifying / lead-capture | Default | Useful | Outbound only | WhatsApp wins |
| Internal employee Q&A | Possible | Rare | Skip | Slack default |
| Clinical / regulated intake | Yes, gated | Yes, app-context | High-value | Avoid SMS |
| Onboarding / activation | Default | In-product | Skip | Useful nudge |
| Booking / scheduling | Default | In-app | Strong fit | WhatsApp common |
| Knowledge-base / search-replace | Default | Useful | Voice search rare | Slack for internal |

Yes = default channel for this use case. Maybe = depends on buyer-journey specifics. No = usually a misframing; we'd route to a different shape.

Multi-channel chat is the default ask in 2026. The channel abstraction layer is roughly a week of engineering and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year; both migrations validated the abstraction call we made at week two.

008 / GATES

## Six eval gates an honest chatbot ship clears.

A chatbot eval memo is only as honest as the gates the team runs before ship. Below is the screen we apply to every chatbot development services engagement, and the same screen we use when we're hired to second-opinion a chatbot a different vendor already shipped. Second-opinion work routinely flags at least one gate the original build silently skipped, usually faithfulness or handoff precision.

-   01
    
    ### Faithfulness against retrieval
    
    Does the answer match what retrieval returned, or is the model generating answers from its training corpus? RAGAS plus a custom harness against a clinician- or domain-expert-built gold set. Tracked over time, regression flagged. Faithfulness is the headline failure mode for grounded chatbot work, and the easiest one for a vendor to hide behind a quarterly cherry-pick.
    
-   02
    
    ### Retrieval precision per query
    
    Did the right chunks rank above the wrong ones? Per-query precision tracked over time, with the long-tail of low-precision queries flagged for corpus work. Retrieval drift is the silent failure that turns a six-month-old chatbot into a ticket generator. We don't ship without a regression gate on this metric.
    
-   03
    
    ### Escalation precision
    
    When the chat should hand off to a human, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set built with the buyer's support leadership. Escalation false-negatives hurt CSAT; false-positives drown the human agent queue. Both directions matter and both get a threshold.
    
-   04
    
    ### Tone + brand alignment
    
    Does the chat sound like the brand? Human-rated against a rubric the buyer's brand and CX teams sign off in writing. Scored over time. The rubric is shared, not run on a private spreadsheet. Tone is the failure mode buyers feel before they notice the faithfulness gap: a chat that sounds wrong loses customer trust before the eval team catches the drift.
    
-   05
    
    ### Latency p50 + p95
    
    Sub-2-second p95 for grounded chat. Sub-400ms turn-take p50 for voice. Measured in production, alerted on regression. Latency regression usually traces to retrieval-layer drift or a model swap, and both are the kind of thing observability surfaces in week one of the regression, not month three.
    
-   06
    
    ### Failure mode named in writing
    
    What's the single most likely way this chatbot fails at month six? If the memo can't answer that question, it isn't a ship-ready eval; it's a vendor demo. We name the failure mode, the leading indicator, and the threshold at which the trigger fires. Roughly one in five engagements names a failure mode that catches the regression before the customer-facing incident; that's the metric we measure ourselves on.
    

Six-out-of-six clean is the ship gate; we don't launch below threshold. Two or fewer clean is the trigger for a methodology intervention: the build needs more eval work before any prompt tuning. Eval rigour is the cheapest insurance in the chatbot stack and the most-skipped line item in the vendor proposal.
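The ship-gate rule can be sketched as a small check. This is a minimal illustration, not the harness itself: the `GateResult` type, the gate identifiers, and the `"hold"` outcome for three to five clean gates are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    clean: bool  # True when the gate met its agreed threshold


def ship_decision(results: list[GateResult]) -> str:
    """Map six gate results to a decision, following the rule in the text."""
    if len(results) != 6:
        raise ValueError("expected exactly six gate results")
    clean = sum(r.clean for r in results)
    if clean == 6:
        return "ship"
    if clean <= 2:
        return "methodology-intervention"
    # Assumption: three to five clean gates means keep working, do not launch.
    return "hold"


GATES = [
    "faithfulness",
    "retrieval_precision",
    "escalation_precision",
    "tone",
    "latency",
    "failure_mode_named",
]
decision = ship_decision([GateResult(g, True) for g in GATES])
```

Keeping the decision a pure function of the gate results makes it easy to run in CI next to the eval harness.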

009 / WHERE

## Six chatbot shapes across six industries, where we've shipped.

Capability-by-industry heatgrid for chatbot development services we've actually built, not what the brochure promises. Strength reflects engagement depth: dark cells are repeat patterns; light cells are honest about depth we haven't built yet.

Function by industry:

-   Customer service chatbot: B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics, EdTech
-   Sales / lead-qualifying: B2B SaaS, Ecommerce, Fintech, EdTech; Healthcare, Logistics
-   Internal Q&A bot: B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics, EdTech
-   Voice chatbot: B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics; EdTech
-   Regulated grounded chat: B2B SaaS, Healthcare, Fintech, Logistics, EdTech; Ecommerce
-   Hybrid AI + human handoff: B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics, EdTech

Legend: Possible fit · Good fit · Primary vertical

Dark cells: repeat engagement pattern. Medium: shipped at least once. Light: scoped but not yet completed. Empty: not yet relevant to the industry.

010 / PROCESS

## Six steps. Six weeks. One shipped chatbot.

Eval-first, gold-set-anchored, channel-aware custom chatbot development methodology, refined across grounded chat, tool-using chat, voice chat, and legacy modernisation engagements. The sequence below is the standard six-week build for a single-channel grounded chat. Voice adds a barge-in tuning phase; multi-channel adds a parallel integration sprint; legacy modernisation prepends a contract-and-baseline read. None of these run on a time-and-materials clock: fixed scope, fixed fee, fixed timeline.

WEEK 1

### Discovery + use-case lock

60-minute exec session to lock the use case, the channel, the deflection target, and the handoff posture. Read of the existing channel data: call recordings, ticket archive, chat transcripts. Output is a one-page chatbot spec everyone signs off in writing before any code lands.

WEEK 1–2

### Corpus + retrieval scaffold

For grounded chat: ingest the corpus, chunk, embed, and build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns it. First eval run on a 50-question gold set built with the buyer team. Faithfulness and retrieval precision land in writing before the chat model touches the corpus.
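One common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The text does not say which fusion method is used, so treat this as a minimal sketch under that assumption; the document IDs and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists (each best-first) into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["policy-12", "policy-07", "policy-03"]  # e.g. from a pgvector query
bm25_hits = ["policy-07", "policy-44", "policy-12"]    # e.g. from a BM25 index
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that rank well in both lists rise to the top, which is the behaviour a hybrid retrieval scaffold is after; the fused list is what the gold-set eval would then score.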

WEEK 2–3

### Chat surface + tool wiring

System prompt iteration, tool-call wiring, response shaping. The chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface is small and explicit; three to eight tools is the typical envelope.
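The idea behind schema-validated tool calls can be shown without any dependency. The sketch below uses a standard-library dataclass as a stand-in for what a Pydantic model (or a Zod schema in TypeScript) would do; the `order_lookup` tool, its fields, and the `ORD-` ID format are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderLookupArgs:
    order_id: str
    email: str

    def __post_init__(self) -> None:
        # Hypothetical format rules, for illustration only.
        if not re.fullmatch(r"ORD-\d{6}", self.order_id):
            raise ValueError(f"bad order_id: {self.order_id!r}")
        if "@" not in self.email:
            raise ValueError(f"bad email: {self.email!r}")


# A small, explicit tool surface: every allowed tool has a schema.
TOOL_SCHEMAS = {"order_lookup": OrderLookupArgs}


def validate_tool_call(name: str, raw_args: dict) -> object:
    """Reject unknown tools and malformed arguments before anything executes."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        raise ValueError(f"unknown tool: {name}")
    try:
        return schema(**raw_args)
    except TypeError as exc:  # missing, unexpected, or wrongly typed arguments
        raise ValueError(f"bad arguments for {name}: {exc}") from exc


args = validate_tool_call(
    "order_lookup", {"order_id": "ORD-123456", "email": "a@example.com"}
)
```

The model's proposed call is only executed once it passes this gate; a rejected call can be logged in the trace and fed back to the model as an error message.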

WEEK 3–4

### Eval harness + handoff layer

RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context, Zendesk Sunshine or the channel-native equivalent. Eval thresholds locked in writing as the ship gate.
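A custom harness for handoff precision can start as a confusion-matrix count over labelled scenarios. This is a sketch under assumptions: each scenario is reduced to a `(should_escalate, did_escalate)` pair, and the sample data is invented for illustration.

```python
def escalation_metrics(cases: list[tuple[bool, bool]]) -> dict[str, float]:
    """Score handoff behaviour; each case is (should_escalate, did_escalate)."""
    tp = sum(1 for should, did in cases if should and did)
    fp = sum(1 for should, did in cases if not should and did)  # floods the agent queue
    fn = sum(1 for should, did in cases if should and not did)  # strands the customer
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positives": fp,
        "false_negatives": fn,
    }


scenarios = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
]
metrics = escalation_metrics(scenarios)
```

Reporting false positives and false negatives separately lets each direction carry its own threshold in the ship gate.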

WEEK 4–5

### Channel integration + UAT

Web widget, Intercom, Zendesk, Slack, or WhatsApp, whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass; voice can't reuse text eval.

WEEK 5–6

### Ship + observability tuning

Live launch behind progressive rollout. Langfuse dashboards tuned with the on-call team. First-month tuning loop wired in: drift detection on retrieval, faithfulness regression flags, escalation precision regression flags. Engagement closes with a written handover memo plus a 30-day check-in.
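Progressive rollout is often implemented by hashing a stable user ID into a bucket and comparing it with the current rollout percentage. The text does not describe the mechanism used, so the function below is an assumed, minimal version; the `salt` value and the user IDs are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "chatbot-v1") -> bool:
    """Deterministically place a user inside or outside the rollout cohort."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent


users = [f"user-{i}" for i in range(200)]
cohort_10 = {u for u in users if in_rollout(u, 10)}
cohort_50 = {u for u in users if in_rollout(u, 50)}
```

Because the bucket depends only on the salt and the user ID, raising the percentage only adds users; nobody who already has the new chatbot is moved back to the old one mid-rollout.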

011 / WHY PAITEQ

## Why teams pick us for enterprise chatbot development.

-   01
    
    ### Eval before prompt-tuning
    
    We ship the eval harness in code before the chat model touches the corpus. Most stalled chatbot builds we audit failed because the team prompt-tuned for three months without a regression gate. Faithfulness, retrieval precision, escalation, tone, latency: all measured in code, all regression-flagged, all owned by the buyer's team on day one.
    
-   02
    
    ### Multi-vendor chat by default
    
    We don't ship single-vendor chat. The routing layer above the model SDKs costs roughly a week of engineering and saves the contract renegotiation that always arrives in month nine. Sonnet 4.6 plus a GPT-5 mini fallback is the default; the buyer's team can swap providers in two weeks of vendor support effort, not six months of re-platforming.
    
-   03
    
    ### Warm-handoff, never cold
    
    Every chatbot we ship into a support context wires a warm-handoff layer with full conversation context. Cold handoff that re-asks the customer their name loses the customer who's been re-asked their name. The handoff layer is the highest-value engineering decision in any hybrid chat, and the most-skipped line item in legacy vendor proposals.
    
-   04
    
    ### Voice as its own discipline
    
    Voice chatbot development isn't text chat with a TTS layer slapped on. Sub-400ms turn-take, barge-in handling, interruption recovery, call-recording-as-eval-set: all of it is engineering surface that doesn't exist in a text-chat build. We've shipped voice on LiveKit + Pipecat across healthcare, fintech, and logistics, and have never had to re-platform the voice layer.
    
-   05
    
    ### Channel abstraction from week two
    
    Multi-channel chat is the default ask in 2026. The channel abstraction is roughly a week of engineering at week two and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year; the abstraction call holds up every time.
    
-   06
    
    ### Fixed scope, written deliverable
    
    Three to twelve weeks per engagement; no time-and-materials clock; no vendor lock-in. The chatbot, the eval harness, the observability dashboards, the channel integration code: all owned by the buyer's team at handoff. About a third of pilot engagements convert to a follow-up shape; we don't push the conversion, and the memo names the call either way.
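To make the warm-handoff point (item 03 above) concrete, here is a sketch of the context packet a chatbot might pass to the human agent queue. The field names and the `max_turns` cap are assumptions for illustration; this is not the payload format of Zendesk Sunshine or any other vendor.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Turn:
    role: str  # "customer" or "bot"
    text: str


@dataclass
class HandoffPacket:
    customer_name: str
    channel: str
    reason: str
    transcript: list[Turn] = field(default_factory=list)


def build_handoff(customer_name: str, channel: str, reason: str,
                  turns: list[Turn], max_turns: int = 20) -> str:
    """Serialise what the human agent needs so nothing has to be re-asked."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    packet = HandoffPacket(customer_name, channel, reason, turns[-max_turns:])
    return json.dumps(asdict(packet))


history = [Turn("customer", f"msg {i}") for i in range(30)]
payload = json.loads(
    build_handoff("Ana", "web", "refund over limit", history, max_turns=5)
)
```

The agent sees the customer's name, the escalation reason, and the most recent turns before saying a word, which is the difference between a warm and a cold handoff.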
    

012 / SHAPES

## Four ways to start a chatbot development services engagement.

The four shapes as picker cards. Fixed-scope, fixed-fee, written deliverable. Pick the closest match; the framing call refines if needed.

[

01 / PILOT ↗

Chatbot pilot, single channel

Three to four weeks, fixed scope. One channel, one use case, one corpus. Grounded chat with eval harness wired, observability live, handoff layer scoped. Deliverable is a working chatbot behind a feature flag plus a written handover memo. The default entry shape for AI chatbot development services, usually a customer service chatbot or a support chatbot pilot.

3–4 wks · Fixed

](#engage)[

02 / BUILD ↗

Custom chatbot development

Six to eight weeks, fixed scope. Multi-channel, tool-using or RAG-grounded, with eval and observability wired. Handoff layer included. Most enterprise chatbot development engagements land here. Ships behind progressive rollout with first-month tuning loop.

6–8 wks · Fixed

](#engage)[

03 / VOICE ↗

Voice chatbot development

Eight to twelve weeks, fixed scope. Sub-400ms turn-take voice chat with LiveKit + Pipecat + Deepgram + ElevenLabs. Includes barge-in handling, interruption recovery, and an IVR-replacement path where the legacy is in scope. Eval rigour at production-voice depth.

8–12 wks · Fixed

](#engage)[

04 / MODERNISE ↗

Legacy chatbot modernisation

Six to ten weeks, fixed scope. Replace a stalled rule-based or first-gen LLM chatbot with a grounded, eval-instrumented modern stack. Migration plan against the existing channel and handoff vendor contracts. We've shipped this against four legacy chatbot vendors in 2026.

6–10 wks · Fixed

](#engage)

013 / USE CASES

## Where the chat has landed.

Three typical-shape engagement patterns. Function, segment, and deliverable are real shapes; specific client metrics land in case studies once shipped engagements clear NDA.

Healthcare

Multi-state payer · HIPAA-grounded chat

### Grounded member-services chatbot against a regulated policy corpus

Typical shape: a US healthcare payer needs a member-services chatbot that grounds every answer in the actual benefit policy document, never the model's training corpus. We build the chat surface on Sonnet 4.6 + pgvector hybrid retrieval against the policy library, wire RAGAS eval against a clinician-built gold set, and ship the handoff layer as a warm transfer with context into the existing Zendesk queue. Faithfulness is the headline gate; we don't ship below the threshold.

Deliverable: grounded chat + RAGAS eval harness + warm-handoff into Zendesk Sunshine

Ecommerce

DTC retail · multi-channel post-purchase chat

### Returns + order-status chatbot across web, Intercom, and WhatsApp

Typical shape: a DTC retailer wants to deflect tier-one post-purchase volume across three channels without asking the customer for their order number twice. We build a tool-using chatbot on GPT-5 mini with order-lookup, return-initiation, and tracking tools; ship the channel abstraction above Intercom plus a WhatsApp Business adapter; wire the warm-handoff layer to the human agent queue with full conversation context. Eval on tool-call accuracy and escalation precision against a stratified live-traffic shadow.

Deliverable: multi-channel tool-using chat + channel abstraction + handoff layer
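A channel abstraction of the kind described in this pattern can be sketched with a small adapter interface. The payload shapes and the `MAX_CHARS` limit below are invented for illustration; they are not the real Intercom or WhatsApp Business message formats.

```python
from typing import Protocol


class ChannelAdapter(Protocol):
    name: str

    def format_reply(self, text: str) -> dict: ...


class WebWidgetAdapter:
    name = "web"

    def format_reply(self, text: str) -> dict:
        return {"type": "message", "body": text}


class WhatsAppAdapter:
    name = "whatsapp"
    MAX_CHARS = 1000  # illustrative limit, not a documented platform value

    def format_reply(self, text: str) -> dict:
        return {"type": "text", "text": text[: self.MAX_CHARS]}


def reply(adapter: ChannelAdapter, answer: str) -> dict:
    """The chat brain produces plain text; the adapter owns the channel shape."""
    return adapter.format_reply(answer)


web_payload = reply(WebWidgetAdapter(), "Your order ships tomorrow.")
wa_payload = reply(WhatsAppAdapter(), "x" * 2000)
```

Swapping a channel vendor then means writing one new adapter class; the retrieval, tool, and handoff code above it does not change.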

Fintech

Pre-Series-B lending · regulated voice chat

### Voice chatbot for KYC pre-fill and after-hours intake

Typical shape: a regulated lending platform wants to compress KYC intake to a five-minute voice conversation with a sub-second turn-take target. We ship LiveKit Agents + Deepgram + ElevenLabs + Claude Sonnet 4.6 as the chat brain. Barge-in handling, interruption recovery, and call-recording-as-eval-set wired before launch. Handoff layer warm-transfers to a licensed loan officer when the conversation reaches a regulated decision point.

Deliverable: voice chat stack + sub-400ms turn-take + regulated-handoff layer

014 / STACK

## The stack we ship against.

Chat models, conversation frameworks, retrieval, voice, observability, and channel: the surface a 2026 chatbot build actually touches.

-   Claude Sonnet 4.6
-   GPT-5 mini
-   Gemini 3 Flash
-   Llama 4
-   LangGraph
-   OpenAI Agents SDK
-   pgvector
-   Pinecone
-   Qdrant
-   Langfuse
-   Braintrust
-   RAGAS
-   LiveKit
-   Pipecat
-   Deepgram
-   ElevenLabs
-   Twilio
-   Intercom
-   Zendesk Sunshine
-   Slack

015 / FAQ

## What buyers ask before signing.

What's the difference between chatbot development services and AI agent development?

Different shape, different engineering discipline. Chatbot development services here cover single-turn-ish conversational systems: the user asks, the chat answers (often grounded in a retrieval pipeline), and the conversation either resolves or hands off to a human. [AI agent development](/services/ai-agent-development/) covers multi-step autonomous task execution (plan, act, observe, iterate), often over minutes or hours of runtime with ten-plus tool calls per session. The model class is different (Sonnet 4.6 for chat, Opus 4.7 for the plan layer of agents); the framework is different (thin SDK or LangGraph for chat, full state-graph for agents); the eval surface is different (faithfulness and escalation for chat, trajectory and tool-call success for agents); the latency budget is different (sub-2-second for chat, 5–60 seconds is normal for agents). If your use case is customer service, sales-qualifying, internal Q&A, or voice deflection, you're in the right pillar. If it's multi-system workflow automation, route to the agent practice.

Why do you default to Claude Sonnet 4.6 instead of GPT-5 for grounded chat?

Tone and grounding. Sonnet 4.6 leads on faithfulness-against-retrieved-context in our eval runs: measurably less hallucinated content when the retrieval layer feeds it on-topic chunks, and a noticeably better refusal posture when retrieval misses. GPT-5 mini wins on raw throughput and is our backstop in the routing layer; we ship a multi-vendor abstraction above both. Gemini 3 Flash is competitive at the cheap end for low-stakes chat; we've shipped it on two internal-Q&A engagements where the cost economics flipped the spreadsheet. The honest answer is: it depends on the use case, and the default is Sonnet 4.6 plus a routing fallback. We've never recommended a single-vendor chat stack for any production engagement; vendor risk is real and the abstraction costs roughly a week of engineering.
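The routing-fallback idea can be sketched as an ordered list of provider callables. The two provider functions below are stubs, not real SDK calls; in a real build each would wrap a vendor client, and the error handling would catch narrower exception types.

```python
from typing import Callable

Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    pass


def route(prompt: str, providers: list[tuple[str, Provider]]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # sketch only: catch vendor-specific errors in practice
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary_stub(prompt: str) -> str:
    raise TimeoutError("simulated outage")  # stands in for the default model


def fallback_stub(prompt: str) -> str:
    return f"fallback answer to: {prompt}"  # stands in for the backstop model


used, answer = route("Where is my order?", [("primary", primary_stub),
                                            ("fallback", fallback_stub)])
```

Because callers only see `route`, swapping or reordering vendors is a change to the provider list rather than to the chat surface.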

How do you eval a chatbot before it ships?

Five surfaces. Faithfulness: does the answer match what retrieval returned? RAGAS plus a custom harness against a gold set built with the buyer team. Retrieval precision: did the right chunks rank above the wrong ones? Tracked per query, regression flagged. Escalation precision: when the chat should hand off, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set. Tone: does the chat sound like the brand? Human-rated against a rubric, scored over time. Latency p50 / p95: measured in production, alerted on regression. For voice chat, add turn-take p50 / p95, barge-in success, and interruption-recovery. We ship eval in code before the chat model touches the corpus; the harness is the artefact that survives the engagement.
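As an illustration of the latency surface, the sketch below computes p50 and p95 with the nearest-rank method and checks p95 against the sub-2-second budget for grounded chat. The choice of nearest-rank and the sample data are assumptions; production monitoring tools may interpolate differently.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_gate(samples_ms: list[float], p95_budget_ms: float = 2000.0) -> dict:
    """Summarise latency and flag whether p95 is inside the budget."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    return {"p50_ms": p50, "p95_ms": p95, "within_budget": p95 < p95_budget_ms}


fast = latency_gate([float(i * 20) for i in range(1, 101)])  # 20 ms .. 2000 ms
slow = latency_gate([float(i * 25) for i in range(1, 101)])  # 25 ms .. 2500 ms
```

The same function can gate voice turn-take by passing the turn-take samples and a different budget.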

Do you build voice chatbots on LiveKit or build the voice layer custom?

LiveKit Agents or Pipecat for every voice chatbot we ship. We don't build the voice transport layer custom: Twilio carries the PSTN side, LiveKit handles the WebRTC and the agent runtime, Deepgram does the streaming STT, and ElevenLabs handles TTS. The custom work is the chat brain: system prompt, retrieval over the right corpus, tool wiring, handoff layer, and the eval harness. Building voice transport custom is a six-month engineering yak-shave that no buyer has ever earned the ROI on. LiveKit lands sub-400ms turn-take out of the box; the engineering work is making the chat brain not stupid inside that latency budget, which is where every voice chatbot development engagement actually lives.

What does an enterprise chatbot development engagement cost?

Fixed scope, fixed fee. A single-channel chatbot pilot runs three to four weeks at the lower end of the band. A full multi-channel custom chatbot development engagement runs six to eight weeks. Voice chatbot development runs eight to twelve weeks (the engineering surface is bigger). Legacy chatbot modernisation runs six to ten weeks depending on the migration depth. We quote exact numbers after a 30-minute scoping call. None of our chatbot development services engagements run on a time-and-materials clock; we sell a working chatbot against a fixed scope, not hours. Pricing scales with channel count, corpus complexity, and handoff depth; the lower end is single-channel grounded chat, the upper end is multi-channel voice + handoff + IVR replacement.

Can you migrate us off an existing chatbot vendor (Drift, Ada, Intercom Fin, etc.)?

Yes. Legacy chatbot modernisation is a defined engagement shape. We've migrated chat off four legacy vendor stacks in 2026, every one for the same reasons: eval depth was vendor-controlled, model swap was vendor-blocked, channel abstraction didn't exist, and renewal economics had walked away from the value. The migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline (where one exists). The target stack is usually Sonnet 4.6 + pgvector + Langfuse + your existing channel anchor (Zendesk, Intercom, Twilio): vendor surface kept, brain replaced. Migration runs six to ten weeks; we ship behind a progressive feature flag so the cutover is reversible until the eval baseline is matched-or-exceeded.

How is this different from your conversational AI development or LLM chatbot development work?

Same pillar, different language. Conversational AI development is the head-term: the discipline of building conversational systems, regardless of channel. Chatbot development services is the buyer-side phrase: what the procurement team types into the search bar. LLM chatbot development is the architectural sub-category: chat where the brain is an LLM (which is roughly 100% of 2026 builds; rule-based chat is a legacy modernisation target, not a greenfield shape). All three terms describe work we ship under this pillar. The sibling practices: [AI agent development](/services/ai-agent-development/) for multi-step autonomous task execution; [RAG development services](/services/rag-development/) for retrieval pipelines deeper than the chat use case requires; [LLM development services](/services/llm-development/) for the model-engineering layer (fine-tuning, hosted-vs-self-hosted, cost engineering) beneath a chat surface.

016 / FURTHER READING

## Where this practice connects.

If you're on the agent side of that decision grid, route to our [AI agent development company](/services/ai-agent-development/) practice; the architecture and eval shape are meaningfully different from chat. The sibling practices: [RAG development services](/services/rag-development/) for retrieval pipelines deeper than the chat use case requires; [LLM development services](/services/llm-development/) for fine-tuning, hosted-vs-self-hosted, and cost engineering beneath a chat surface.

For chatbots that sit on the auth boundary (account-takeover surface, transaction-confirmation flows), [our AI fraud detection at the auth boundary](/blog/ai-fraud-detection-at-auth-boundary/) walkthrough covers the hybrid rules + ML + LLM architecture we deploy underneath. Almost every chatbot engagement we ship recommends a [retrieval-augmented generation pipeline](/services/rag-development/) as the spine, not a generic conversational LLM pulling answers from training data. Most inbound for a support chatbot or customer service chatbot lands in shape 01 (grounded) or shape 04 (hybrid handoff); the [2026 customer service chatbot guide](/blog/customer-service-chatbot-buyers-guide/) has the vendor matrix and cost-per-resolved-contact math.

Industry routes: [AI healthcare software development](/ai-for-healthcare/) chatbots are the single most common regulated entry shape (BAA-backed model hosting, PHI masking before any model call), with [AI for fintech](/ai-for-fintech/) support chatbots a close second when the SOC 2 + FFIEC posture is the gate, and [custom AI insurance development](/ai-for-insurance/) chatbots growing fast for policyholder self-service. For mobile-first chat surfaces (where the in-app chat lives inside a Flutter or React Native shell), the engineering bench is the same team that maintains [GetWidget](https://www.getwidget.dev/) (the open-source Flutter UI library, 4,800+ stars on the [github repo](https://github.com/ionicfirebaseapp/getwidget), 2026-Q2). UI components, voice stack, and LLM gateway sit in one team, not three vendors with three eval rubrics.

017 / Related practices

## Adjacent services.

[

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

](/services/ai-agent-development/)[

RAG DEVELOPMENT

RAG Development

Retrieval-augmented generation systems with evaluation built in.

](/services/rag-development/)[

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

](/services/llm-development/)

018 / Start a chatbot engagement

## Ship a *grounded* chatbot in six weeks.

Grounded chatbot pilot in 3–4 weeks. Custom chatbot development in 6–8. Voice chatbot development in 8–12. Legacy chatbot modernisation in 6–10.

[Talk to a partner](/contact/) [See engagement shapes](#engage)
