Chatbot development services — grounded chat, tool-using chat, voice chat, hybrid AI + human handoff
Chatbot development services from an engineering-led custom chatbot development team. Conversational AI development for customer service, sales, internal Q&A, voice deflection, and regulated-grounded chat — eval-instrumented, multi-channel, with a warm-handoff layer to the human agent queue. We default to Claude Sonnet 4.6 with a multi-vendor routing layer above; we don't ship single-vendor chat stacks.
Chatbot or agent? The most expensive misframing in this category.
Most buyers arrive at chatbot development services with an agent-shaped problem, or vice versa. Pricing the wrong shape costs roughly $80k–$240k of engineering before anyone notices. The grid below is the first frame we run — single-turn-ish conversational chat on the left, multi-step autonomous agent on the right, scored on the dimensions that actually drive the architectural call. Read it once; the rest of this page assumes you've landed on the chat side. If you're on the agent side, route to our AI agent development practice.
| Dimension | Chatbot (this pillar) | Agent (sibling pillar) |
|---|---|---|
| Decision shape | Single-turn-ish — answer or escalate within one conversation | Multi-step autonomous — plan, act, observe, iterate over N turns |
| Default response budget | Sub-2-second perceived latency; voice ≈ sub-400ms turn-take | 5–60 seconds is normal; long-running tasks measured in minutes |
| Speed is a feature in customer-facing chat — users abandon threads after 4–5 seconds of silence. The sub-2-second budget isn't arbitrary; it's the threshold where chat still feels like chat rather than a loading screen. Agents trade latency for capability by design, which is the right call for back-office work but the wrong call on a live support surface. | ||
| Tool calls per session | Zero to three — retrieve, lookup-by-id, maybe a write | Often ten-plus; explicit state machine over the tool surface |
| Tool depth is the clearest capability signal. Three tool calls is the practical ceiling for a chatbot before the state tracking collapses and latency blows the budget — if your build needs more than that, you're not designing a chatbot, you're designing an agent with a chat surface. The explicit state machine that agents require isn't overhead; it's the mechanism that keeps tool-call ten coherent with tool-call one. | ||
| Grounding source | RAG over enterprise corpus or product KB; tight context window | Open-ended retrieval, web search, multi-corpus, internal tools |
| Tight scope is a chatbot advantage, not a limitation. Grounding against a controlled corpus means every retrieval chunk can be audited, every answer traced back to a source document. Agents need open-ended retrieval because their tasks demand it — but that breadth trades away the faithfulness guarantees that RAG-grounded chatbots can provide and regulated industries require. | ||
| Failure mode | Hallucinated answer, missed escalation, wrong tone | Tool-call loops, plan drift, partial state, idempotency bugs |
| Right model class | Claude Sonnet 4.6 or GPT-5 mini — fast, cheap, well-grounded | Claude Opus 4.7 / GPT-5 reasoning — for the plan layer |
| Right framework | Direct SDK + a thin orchestrator (or LangGraph for state) | LangGraph with proper state-graph control; never CrewAI for serious builds |
| Eval surface | Faithfulness, escalation precision, tone, latency p95 | Trajectory eval, tool-call success, plan coherence, end-state correctness |
| Eval surface is where chatbot builds most often cut corners. Faithfulness and escalation precision sound simple but measuring them at scale requires a dedicated harness — RAGAS for retrieval quality, Langfuse traces for latency p95, and a held-out escalation set that fires on every deploy. Agents have it harder (trajectory evals are genuinely more complex), but chatbot teams who skip the eval build are the ones regretting it in month three. | ||
| Where we recommend it | Customer service, sales-qualifying, internal Q&A, voice deflection | Multi-system workflows, research agents, ops automation — see siblings |
If your buyer is asking for "an AI agent" but the actual user-facing surface is a single-turn customer support widget, you're in the chat pillar — the agent vocabulary is procurement-side language, not architecture. We'll make that call on the framing call and route accordingly.
Five chatbot shapes. Every chatbot development services engagement maps to one.
Conversational AI development isn't a single shape — it's five distinct engineering envelopes with different default models, different frameworks, different eval surfaces, and different cost curves. The five shapes below cover roughly 100% of inbound. We won't sell you a voice chatbot development engagement when the deflection use case lives on web; we won't sell you a tool-using build when the actual ask is RAG-grounded.
-
01 · GROUNDED: RAG chatbot over a curated corpus
Customer service deflection, internal Q&A, regulated policy chat, product knowledge base. The corpus is curated, the answer is grounded, the eval surface is faithfulness plus retrieval precision. Roughly 60% of engagements. Default stack: Claude Sonnet 4.6, pgvector + BM25 hybrid retrieval, RAGAS eval, Langfuse traces.
-
02 · TOOL-USING: Chat that calls structured tools
Order-status, return-initiation, appointment-booking, CRM-aware chat. Three to eight tools is the typical envelope; tool-call success rate matters more than reasoning depth. Different from agents — the conversation stays single-thread. Default stack: GPT-5 mini or Sonnet 4.6, Zod / Pydantic schemas, function-calling SDK, eval on tool arg correctness.
-
03 · VOICE: Sub-400ms voice chatbot development
Phone-channel deflection, clinical intake, sales prospecting outbound, IVR replacement. Latency budget is the headline constraint. LiveKit Agents or Pipecat as the runtime, Deepgram for streaming STT, ElevenLabs for TTS, Sonnet 4.6 as the chat brain. Eval includes turn-take p50/p95 and barge-in recovery — neither shows up in a text eval suite.
-
04 · HYBRID + HANDOFF: AI tier-one with warm transfer to humans
Enterprise support where AI handles tier-one and escalates to a human agent with full conversation context. Roughly 40% of enterprise chatbot development engagements. The handoff layer is the highest-value engineering decision — warm transfer with context wins, cold handoff that re-asks the customer's name loses. Default: LangGraph state-graph, Zendesk Sunshine handoff, co-pilot suggestions in the agent's queue.
-
05 · MODERNISE: Legacy chatbot replacement
Drift, Ada, Intercom Fin, first-gen LLM chatbot, or rule-based legacy. Migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline. Target stack usually keeps the channel surface (Zendesk, Intercom, Twilio) and replaces the brain. Ships behind progressive rollout — cutover is reversible until the eval baseline is matched-or-exceeded.
A chatbot can call a small agent loop for one sub-task and still be a chatbot. The primary shape is the one that owns the user-facing surface — if the user sees a single chat thread that resolves or escalates, you're in chat; if the user kicks off a job and comes back later for the result, you're in agents. Most inbound for a support chatbot or customer service chatbot lands in shape 01 (grounded) or shape 04 (hybrid handoff); a support chatbot with a real human queue almost always wants 04.
What production chatbot development services look like at the latency level.
Performance targets we hold ourselves to before any chatbot development services engagement ships. The numbers below are typical-workload defaults — actual targets land in writing during the discovery phase and become the ship gate. We don't ship below the latency or faithfulness threshold; we'd rather miss the launch date than ship a chat that hallucinates under production traffic.
The six families of the modern chatbot stack.
A modern enterprise chatbot development build touches six families — chat model, conversation framework, retrieval, voice (if voice is in scope), observability + eval, and channel + handoff. The categories below name the default pick, the cost-floor alternative, and the conditions under which we'd revisit. Per-family recommendations land in writing during discovery; the chat-stack inventory is the artefact that survives the engagement.
The mid-tier hosted models are where production chat lives in 2026. Claude Sonnet 4.6 leads on grounded answers and tone control. GPT-5 mini wins on raw throughput at scale. Gemini 3 Flash is competitive at the cheap end. Pricing is in the $0.3–3 input / $1.5–15 output per million tokens band — about an order of magnitude cheaper than the frontier tier and indistinguishable from it on most chat workloads we audit.
Default pick for almost every chatbot development services engagement. Customer service deflection, internal Q&A, sales qualifying, support knowledge bases. Below ~500M monthly tokens the hosted-mid economics beat self-hosted by 3–8×. C-suite buyers who want a named vendor on the contract.
Strict data residency where the provider's region map can't close — route to <a href="/services/llm-development/">self-hosted Llama 4</a> on your infra. High-frequency low-value chat (FAQ-style triage) where a smaller distilled model wins on cost. Reasoning-heavy multi-step shapes — that's not chatbot, that's an <a href="/services/ai-agent-development/">agent</a>.
Default architecture: Sonnet 4.6 for grounded chat, GPT-5 mini as the multi-vendor backstop with a thin routing layer above. Costs roughly a week of engineering and saves the renegotiation when one provider hikes prices in month nine. We've never recommended a single-vendor chat stack for any production engagement.
Chat doesn't always need a framework. A thin orchestrator over the SDK is the cleanest pattern for single-turn-ish chat — a system prompt, a retrieval call, a response, a handoff branch. LangGraph earns its keep when conversation state matters across turns: long context, multiple tool calls, structured handoff to a human agent. The OpenAI Agents SDK is fine for narrow OpenAI-only builds; we don't recommend it for multi-vendor chat.
LangGraph when state across turns matters — escalation flows, multi-step diagnostic chat, conversational forms with branching logic. Raw SDK + thin orchestrator when chat is genuinely single-turn-ish. Always wire eval and observability before the second sprint, regardless of framework.
CrewAI for serious chat — the role abstraction starts to fight you once you need state-graph control, and we've yet to ship a CrewAI-based chat into production without a re-platforming sprint. AutoGen — stalled relative to LangGraph; not recommended for new builds.
Roughly two-thirds of our chatbot development services engagements ship on a thin SDK orchestrator. The other third use LangGraph, usually because the chat has a handoff layer (warm transfer to human, escalation router, multi-channel context sync) where the state graph carries real weight.
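The thin-orchestrator pattern (system prompt, retrieval call, response, handoff branch) can be sketched in a few lines. This is an illustrative sketch, not our production code: `handle_turn`, `TurnResult`, the prompt text, and the `min_chunks` rule for triggering handoff are all assumptions, and `retrieve` and `generate` stand in for a real retrieval client and model SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative prompt text; a real system prompt names the corpus and escalation triggers.
SYSTEM_PROMPT = "Answer only from the provided context. If unsure, escalate."


@dataclass
class TurnResult:
    action: str  # "answer" or "handoff"
    text: str


def handle_turn(
    user_message: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, str, list[str]], str],
    min_chunks: int = 1,
) -> TurnResult:
    """One chat turn: retrieve, then either answer or branch to handoff."""
    chunks = retrieve(user_message)
    if len(chunks) < min_chunks:
        # Nothing to ground against: escalate rather than answer from model memory.
        return TurnResult("handoff", "Routing you to a human agent.")
    answer = generate(SYSTEM_PROMPT, user_message, chunks)
    return TurnResult("answer", answer)
```

Because the retrieval and generation steps are injected, the same function can run against stubs in an eval harness and against live clients in production.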
RAG chatbot work lives or dies on the retrieval layer. pgvector is the cheapest, lowest-friction pick when Postgres is already in the stack — and Postgres is already in roughly nine out of ten enterprise stacks. Pinecone Serverless cuts ops bandwidth to near zero at a premium tier. Qdrant self-hosts cleanly when data residency is non-negotiable. BM25 hybrid (sparse + dense) wins on technical-product corpora where the exact-keyword match still beats embeddings half the time.
Any grounded chatbot: clinical, legal, regulated, internal knowledge base, product Q&A. Almost every chatbot engagement we ship recommends retrieval-augmented chat as the spine, not a generic conversational LLM pulling answers from training data. We route the deeper retrieval-pipeline work to our <a href="/services/rag-development/">RAG development services</a> practice when the scope earns it.
Pure tone-and-brand chat with no enterprise knowledge to ground against (rare, usually a sign the product team hasn't found the use case yet). Tiny corpora under 5k chunks where in-context retrieval beats a vector store and the index is overkill.
Default recommendation: pgvector when Postgres is already there; Pinecone Serverless when ops capacity is the constraint; Qdrant self-hosted when data residency requires it. We don't recommend Weaviate for a chat-only build unless multi-modal retrieval is the headline requirement.
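One common way to combine a dense (embedding) ranking with a sparse (BM25) ranking is reciprocal rank fusion. The page does not name a fusion method, so treat RRF here as an assumed technique, and treat the chunk ids and the constant `k = 60` as illustrative values.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids; each input list is ordered best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c3", "c1", "c7"]   # hypothetical pgvector results
sparse_hits = ["c1", "c9", "c3"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Chunks that appear near the top of both lists (`c1`, `c3`) outrank chunks that only one retriever found, which is the behaviour a hybrid setup is after on keyword-heavy technical corpora.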
Voice chatbot development is its own engineering discipline. LiveKit Agents and Pipecat both land sub-400ms voice turn-take in production — and the latency budget is the whole game. Deepgram leads on streaming STT accuracy at scale. ElevenLabs leads on voice quality; the open-source side (Whisper Large v3, F5-TTS) is closing fast but still trails on edge cases. Sub-second perceived turn-take is the difference between a voice agent that feels conversational and one that feels like an IVR with extra steps.
Support deflection at scale where call-centre cost is the headline number. Clinical intake, sales prospecting, after-hours triage. Anywhere the buyer journey involves a human picking up a phone. Roughly 25% of our voice chatbot development engagements in 2026 have replaced an existing IVR rather than building net-new.
Voice as a CEO whim with no buyer-journey evidence — the build is hard, the deflection economics are real but specific, and a voice agent without a deflection use case is theatre. Multi-step task automation — that's an <a href="/services/ai-agent-development/">agent</a> with a voice front, not a voice chatbot.
Default voice stack: LiveKit Agents + Claude Sonnet 4.6 + Deepgram + ElevenLabs. Open-source substitutes priced as a phase-two option when the volume crossover lands inside 12 months. We've shipped this exact stack three times in 2026 — never had to re-platform the voice layer.
Every chatbot we ship costs more in instrumentation than the buyer expected and earns it back inside the first month. Langfuse leads OSS observability with traces, prompt versioning, and a usable eval surface. Braintrust dominates closed-source eval workflows with a clean diffing UX. RAGAS is the default retrieval-and-faithfulness harness for grounded chat. Inspect AI (UK AISI-backed) is the rigour pick for safety-critical chat in regulated industries.
Every chat engagement. Day-one cost line, not a phase-three nice-to-have. Most stalled chat pilots we audit failed because nobody knew which prompts were drifting, which retrievals were silently returning irrelevant chunks, or which escalations were silently being misrouted. Instrumentation is the cheapest insurance in the stack.
Never. We've never shipped a production chatbot without observability wired before the second sprint. Toy demos and POCs are the only exception — and the moment the demo earns a budget, observability lands in the next ticket.
Default: Langfuse self-hosted for teams with data-control concerns; Braintrust for teams with budget and no ops capacity. RAGAS as the retrieval eval harness regardless of trace backend. We don't recommend bare logging-without-traces for anything past prototype — it's the false-economy that creates the month-six stall.
The channel layer is where most chatbot pilots quietly fail. Twilio carries SMS, WhatsApp, and voice transport. Intercom and Zendesk are the dominant support-channel anchors — Zendesk Sunshine is the canonical handoff API for enterprise support chat. Slack is the default for internal-Q&A chat. The web widget is the simplest channel and the easiest to instrument; mobile native chat carries half the engineering cost and twice the polish.
Multi-channel chat is the default ask in 2026: buyers expect the same chatbot across web, Intercom, Slack, and a voice line. The handoff layer to a human agent is the single highest-value engineering decision: warm transfer with conversation context wins; cold handoff via ticket creation loses every time.
Single-channel chat where the channel is fixed and there's no plausible expansion path — usually the buyer hasn't priced the multi-channel ask yet. We'd still build the abstraction; cost is roughly a week and the option value is enormous.
Default: channel abstraction layer above whichever vendor (Intercom / Zendesk / Twilio) carries the surface. Warm-transfer-with-context as the handoff default. We've migrated chat across three channel vendors for two clients this year — the abstraction earns its keep on the migration.
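A sketch of how a channel abstraction with warm handoff might be shaped, assuming each vendor (Intercom, Zendesk, Twilio) gets its own adapter behind one interface. `Channel`, `HandoffContext`, `InMemoryChannel`, and `escalate` are hypothetical names; no vendor API is called here.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class HandoffContext:
    """What the human agent sees on warm transfer (fields are illustrative)."""
    customer_name: str
    transcript: list[str]
    escalation_reason: str


class Channel(Protocol):
    """Interface every vendor adapter would implement."""
    def send(self, text: str) -> None: ...
    def hand_off(self, context: HandoffContext) -> None: ...


@dataclass
class InMemoryChannel:
    """Test double standing in for a real vendor adapter."""
    sent: list[str] = field(default_factory=list)
    handoffs: list[HandoffContext] = field(default_factory=list)

    def send(self, text: str) -> None:
        self.sent.append(text)

    def hand_off(self, context: HandoffContext) -> None:
        self.handoffs.append(context)


def escalate(channel: Channel, name: str, transcript: list[str], reason: str) -> None:
    """Warm transfer: tell the user, then pass the full context to the agent queue."""
    channel.send("Connecting you to a colleague who can see this conversation.")
    channel.hand_off(HandoffContext(name, list(transcript), reason))
```

The chat logic only ever calls `escalate` against the `Channel` interface, so migrating vendors means writing one new adapter while the handoff payload stays the same.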
Four conversation architectures. Pick the one that fits the buyer journey.
RAG-grounded, tool-using, voice, and hybrid-with-handoff are the four architectures we ship across roughly 100% of chatbot development services engagements. The architecture determines the model class, the framework, the eval surface, and the team shape. We won't sell you a tool-using build when the buyer journey is information retrieval; we won't sell you grounded chat when the buyer journey is task completion. The framing call is free.
RAG-GROUNDED
The most common chatbot shape we ship. A retrieval pipeline (pgvector + BM25 hybrid is the typical default) feeds the chat model context from a curated corpus — product docs, knowledge base, ticket history, regulated policy. Roughly 60% of chatbot development services engagements land here. The headline failure mode is silent retrieval drift: chunks rank well on cosine similarity but answer the wrong question. Eval is faithfulness + retrieval precision tracked over time, not a quarterly cherry-pick.
- Customer service deflection on a known product surface
- Internal Q&A over a curated KB
- Regulated chat where every answer must be grounded in policy
- Sales-assist over a product catalogue
- B2B onboarding where the corpus is stable
- Open-ended chat with no curated corpus — that's a generic LLM, not a chatbot we'd build
- Multi-step workflow execution — that's an agent
- Pure entertainment chat — different design vocabulary entirely
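Retrieval precision tracked per query can be as simple as precision@k over a gold set. The sketch below assumes a gold set that maps each question to the chunk ids a domain expert marked relevant; the questions, chunk ids, and the 0.5 threshold are made-up illustrations.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def flag_low_precision(per_query: dict[str, float], threshold: float) -> list[str]:
    """Queries whose precision fell below the threshold, for corpus follow-up."""
    return sorted(q for q, p in per_query.items() if p < threshold)


# Hypothetical gold set: question -> (retrieved ids in rank order, relevant ids).
GOLD = {
    "how do I reset my password": (["kb-12", "kb-40", "kb-03"], {"kb-12", "kb-03"}),
    "what is the refund window": (["kb-77", "kb-08", "kb-51"], {"kb-20"}),
}
per_query = {q: precision_at_k(ret, rel, k=3) for q, (ret, rel) in GOLD.items()}
low = flag_low_precision(per_query, threshold=0.5)
```

Storing `per_query` on every deploy is what turns this into a drift signal: a query that slides from high to low precision shows up as a diff rather than as a support ticket.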
TOOL-USING
Chat that needs to look up an order status, file a return, schedule a callback, or update a CRM record. Different shape from agents — the tool surface is narrow (typically three to eight tools), the conversation stays single-thread, and the tool-call success rate matters more than reasoning depth. Roughly 20% of engagements. Eval is tool-call success rate, argument correctness, and refusal precision when the user asks for something the tool surface doesn't cover.
- Order-status + return chat for ecommerce
- Appointment-booking chat with calendar write
- CRM-aware sales chat with lead-write
- Support chat with ticket-creation
- Banking-info chat with read-only account lookups
- Multi-step task automation — that's an agent loop, not a tool-using chatbot
- Pure information retrieval — RAG-grounded is cheaper and faster
- Workflows that need branching execution state — escalate to a state-graph architecture
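Argument correctness is easier to measure when every tool call passes through an explicit validator before it runs. The sketch below uses only the standard library as a stand-in for the Zod or Pydantic schemas mentioned elsewhere on this page; the `order_status` tool, its `ORD-` id format, and the function names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderStatusArgs:
    """Arguments for a hypothetical `order_status` tool."""
    order_id: str


ORDER_ID_PATTERN = re.compile(r"ORD-\d{6}")  # illustrative id format


def parse_order_status_args(raw: dict) -> OrderStatusArgs:
    """Validate model-proposed arguments before the tool runs."""
    unexpected = set(raw) - {"order_id"}
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    order_id = raw.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        raise ValueError("order_id must look like ORD-123456")
    return OrderStatusArgs(order_id)


def is_valid_call(raw: dict) -> bool:
    """True when the arguments pass validation; feeds an arg-correctness metric."""
    try:
        parse_order_status_args(raw)
    except ValueError:
        return False
    return True
```

Running `is_valid_call` over logged tool calls gives a per-tool argument-correctness rate, and rejecting unexpected keys keeps the tool surface narrow and explicit.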
VOICE
Voice chatbot development is its own engineering discipline. The latency budget is the headline constraint — anything past 500ms perceived turn-take feels like an IVR. LiveKit Agents and Pipecat dominate the production stack; streaming STT (Deepgram) plus parallel TTS (ElevenLabs) plus a small-context chat model is the canonical shape. Roughly 15% of engagements but the highest-value tier per minute of build. Eval is turn-take p50/p95, interruption handling, and barge-in recovery — none of which are visible in a text-chat eval suite.
- Support deflection where the existing channel is voice
- Clinical intake at scale
- Sales prospecting outbound
- After-hours triage routing
- IVR replacement where the legacy DTMF tree has lost the war on customer patience
- Voice as a CEO whim with no deflection use case
- Visual-context-required interactions (returns with photo evidence, document review)
- Anything where sub-second turn-take isn't actually the constraint — text chat is cheaper
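Turn-take p50 and p95 can be computed from raw per-turn timings with a nearest-rank percentile. This is a sketch under assumptions: the sample timings are invented, and the 400 ms default simply mirrors the voice target stated elsewhere on this page.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct must be in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def turn_take_gate(samples_ms: list[float], p50_budget_ms: float = 400.0) -> bool:
    """True when median turn-take is under the budget."""
    return percentile(samples_ms, 50) < p50_budget_ms


# Invented turn-take timings in milliseconds.
SAMPLES_MS = [310, 280, 350, 420, 390, 300, 610, 330, 295, 360]
p50 = percentile(SAMPLES_MS, 50)
p95 = percentile(SAMPLES_MS, 95)
```

In this sample the median passes the budget while the p95 is dominated by a single slow turn, which is why both numbers are worth tracking rather than the average.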
HYBRID + HANDOFF
Production support chat is rarely AI-only. The hybrid shape — AI handles tier-one, escalates to a human agent with full conversation context, often with AI co-pilot suggestions in the agent's queue — is roughly 40% of enterprise chatbot development engagements. The headline engineering choice is the handoff layer: warm transfer with conversation context wins; cold handoff that re-asks the customer's name loses every customer who's been re-asked their name. We've migrated five clients off cold-handoff vendors in 2026 — every one of them lifted CSAT inside the first month.
- Enterprise support with a real human agent team
- Regulated chat where AI deflects but a licensed human signs the resolution
- Complex products with long-tail questions AI can't reliably answer
- Brands where escalation latency is itself a customer-experience metric
- Pure self-service products with no human support team
- Volume-only deflection where the human agent doesn't exist
- Workflows where AI cannot escalate quickly enough — that's an architecture flag, not an engagement
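Whether the AI tier escalates when it should can be scored against a labelled scenario set. The sketch below assumes each scenario carries an agreed "should escalate" label and the chatbot's observed behaviour; `Scenario`, `escalation_metrics`, and the sample data are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    should_escalate: bool  # label agreed with support leadership
    did_escalate: bool     # what the chatbot actually did


def escalation_metrics(scenarios: list[Scenario]) -> dict[str, float]:
    """Precision penalises needless escalations; recall penalises missed ones."""
    tp = sum(s.should_escalate and s.did_escalate for s in scenarios)
    fp = sum((not s.should_escalate) and s.did_escalate for s in scenarios)
    fn = sum(s.should_escalate and (not s.did_escalate) for s in scenarios)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


SCENARIOS = (
    [Scenario(True, True)] * 3      # correct escalations
    + [Scenario(True, False)]       # missed escalation
    + [Scenario(False, True)] * 2   # needless escalations
    + [Scenario(False, False)] * 4  # correctly stayed in flow
)
metrics = escalation_metrics(SCENARIOS)
```

Low recall means customers who needed a human did not get one; low precision means the human queue is absorbing chats the AI tier should have resolved. Each direction gets its own threshold.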
What a six-week custom chatbot development engagement actually ships.
A grounded chatbot pilot is roughly five distinct engineering phases against a fixed timeline. The phases below are the standard shape for a single-channel grounded chat; multi-channel adds a parallel integration sprint, voice chatbot development adds a barge-in and turn-take tuning phase, legacy modernisation prepends a contract-and-baseline read. The phases ship in series; eval lands in code before the chat model touches the corpus.
- 01
Discovery + use-case lock
Sixty-minute exec session locks the use case, the channel, the deflection target, and the handoff posture. Existing channel data — call recordings, ticket archive, chat transcripts — read for tone, escalation patterns, and the long-tail of questions the human agent team currently absorbs. Output is a one-page chatbot spec everyone signs off in writing before any code lands. Some engagements end at this phase because the right answer is "do a discovery sprint first, build later" — we still ship the spec and bill the phase.
- 02
Corpus + retrieval scaffold + gold-set eval
Ingest the corpus, chunk, embed, build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns the engineering time. First eval run on a 50-question gold set built with the buyer team — questions the human agent team actually fields, with the correct answer in the corpus and the wrong answer adjacent. Faithfulness and retrieval precision land in writing before the chat model touches the corpus. If the gold set fails at this stage, the corpus needs work before the chat does.
- 03
Chat surface + tool wiring
System prompt iteration, tool-call wiring, response shaping. Chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface stays small and explicit — three to eight tools is the envelope; ten-plus tools is an agent in disguise. For grounded chat, the system prompt names the corpus, the failure modes, and the escalation triggers in writing.
- 04
Eval harness + handoff layer
RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, a custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context — Zendesk Sunshine or the channel-native equivalent — never cold handoff. Eval thresholds locked in writing as the ship gate. The harness is the artefact that survives the engagement; the buyer's team owns it on day one of phase five.
- 05
Channel integration + UAT + ship
Web widget, Intercom, Zendesk, Slack, or WhatsApp — whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass — voice can't reuse text eval, ever. Progressive rollout behind a feature flag; first-month tuning loop wired with the on-call team. Engagement closes with a written handover memo plus a thirty-day check-in.
Clean handoff is the default. The chatbot, the eval harness, the observability dashboards, the channel integration code — all owned by the buyer's engineering team on day one of phase five. About a third of pilot engagements convert to a build engagement under shapes 02–04; we don't push that conversion, the memo names the call either way.
Channel picker. Where the chatbot lives, by use case.
The channel decision drives the engagement scope more than buyers realise. A single-channel pilot is roughly three to four weeks; a multi-channel build is six to eight; a voice build is eight to twelve. The grid below is the call we'd make on the rubric — channel across the top, use case down the side. Yes-cells are the default; maybe-cells need a follow-up call; no-cells are usually a misframing we'd route differently.
| Use case | Web widget | Mobile native | Voice (phone) | Messaging (Slack / WhatsApp) |
|---|---|---|---|---|
| Customer service deflection | Default | Strong | High-value | Common |
| Sales qualifying / lead-capture | Default | Useful | Outbound only | WhatsApp wins |
| Internal employee Q&A | Possible | Rare | Skip | Slack default |
| Clinical / regulated intake | Yes — gated | Yes — app-context | High-value | Avoid SMS |
| Onboarding / activation | Default | In-product | Skip | Useful nudge |
| Booking / scheduling | Default | In-app | Strong fit | WhatsApp common |
| Knowledge-base / search-replace | Default | Useful | Voice search rare | Slack for internal |
Multi-channel chat is the default ask in 2026. The channel abstraction layer is roughly a week of engineering and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year — every one validated the abstraction call we made at week two.
Six eval gates an honest chatbot ship clears.
A chatbot eval memo is only as honest as the gates the team runs before ship. Below is the screen we apply to every chatbot development services engagement — and the same screen we use when we're hired to second-opinion a chatbot a different vendor already shipped. Second-opinion work routinely flags at least one gate the original build silently skipped, usually faithfulness or handoff precision.
-
01 Faithfulness against retrieval
Does the answer match what retrieval returned, or is the model pulling answers from its training corpus? RAGAS plus a custom harness against a clinician- or domain-expert-built gold set. Tracked over time, regression flagged. Faithfulness is the headline failure mode for grounded chatbot work, and the easiest one for a vendor to hide behind a quarterly cherry-pick.
-
02 Retrieval precision per query
Did the right chunks rank above the wrong ones? Per-query precision tracked over time, with the long-tail of low-precision queries flagged for corpus work. Retrieval drift is the silent failure that turns a six-month-old chatbot into a ticket generator. We don't ship without a regression gate on this metric.
-
03 Escalation precision
When the chat should hand off to a human, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set built with the buyer's support leadership. Escalation false-negatives hurt CSAT; false-positives drown the human agent queue. Both directions matter and both get a threshold.
-
04 Tone + brand alignment
Does the chat sound like the brand? Human-rated against a rubric the buyer's brand and CX teams sign off in writing. Scored over time. The rubric is shared, not run on a private spreadsheet. Tone is the failure mode buyers feel before they notice the faithfulness gap — a chat that sounds wrong loses customer trust before the eval team catches the drift.
-
05 Latency p50 + p95
Sub-2-second p95 for grounded chat. Sub-400ms turn-take p50 for voice. Measured in production, alerted on regression. Latency regression usually traces to retrieval-layer drift or a model swap — and both are the kind of thing observability surfaces in week one of the regression, not month three.
-
06 Failure mode named in writing
What's the single most likely way this chatbot fails at month six? If the memo can't answer that question, it isn't a ship-ready eval — it's a vendor demo. We name the failure mode, the leading indicator, and the threshold at which the trigger fires. Roughly one in five engagements names a failure mode that catches the regression before the customer-facing incident; that's the metric we measure ourselves on.
Six-out-of-six clean is the ship gate; we don't launch below threshold. Two or fewer clean is the trigger for a methodology intervention — the build needs more eval work before any prompt tuning. Eval rigour is the cheapest insurance in the chatbot stack and the most-skipped line item in the vendor proposal.
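The ship rule above reduces to a small decision function. This is a sketch: the gate keys are shorthand for the six gates listed, and the intermediate `"hold"` outcome for three to five clean gates is our reading of "we don't launch below threshold", labelled here as an assumption.

```python
GATES = (
    "faithfulness",
    "retrieval_precision",
    "escalation_precision",
    "tone",
    "latency",
    "failure_mode_named",
)


def ship_decision(results: dict[str, bool]) -> str:
    """Map per-gate pass/fail results to a ship decision."""
    missing = [gate for gate in GATES if gate not in results]
    if missing:
        raise ValueError(f"missing gate results: {missing}")
    clean = sum(bool(results[gate]) for gate in GATES)
    if clean == len(GATES):
        return "ship"
    if clean <= 2:
        return "methodology_intervention"  # more eval work before prompt tuning
    return "hold"  # assumed outcome: below threshold, but not an intervention case
```

Raising on a missing gate is deliberate: a gate that was never run should not be silently counted as either a pass or a fail.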
Six chatbot shapes across six industries — where we've shipped.
Capability-by-industry heatgrid for chatbot development services we've actually built, not what the brochure promises. Strength reflects engagement depth — dark cells are repeat patterns; light cells are honest about depth we haven't built yet.
Six steps. Six weeks. One shipped chatbot.
Eval-first, gold-set-anchored, channel-aware custom chatbot development methodology — refined across grounded chat, tool-using chat, voice chat, and legacy modernisation engagements. The sequence below is the standard six-week build for a single-channel grounded chat. Voice adds a barge-in tuning phase; multi-channel adds a parallel integration sprint; legacy modernisation prepends a contract-and-baseline read. None of these run on a time-and-materials clock — fixed scope, fixed fee, fixed timeline.
Discovery + use-case lock
60-minute exec session to lock the use case, the channel, the deflection target, and the handoff posture. Read of the existing channel data — call recordings, ticket archive, chat transcripts. Output is a one-page chatbot spec everyone signs off in writing before any code lands.
Corpus + retrieval scaffold
For grounded chat: ingest the corpus, chunk, embed, and build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns it. First eval run on a 50-question gold set built with the buyer team. Faithfulness and retrieval precision land in writing before the chat model touches the corpus.
Chat surface + tool wiring
System prompt iteration, tool-call wiring, response shaping. The chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface is small and explicit — three to eight tools is the typical envelope.
Eval harness + handoff layer
RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context — Zendesk Sunshine or the channel-native equivalent. Eval thresholds locked in writing as the ship gate.
Channel integration + UAT
Web widget, Intercom, Zendesk, Slack, or WhatsApp — whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass; voice can't reuse text eval.
Ship + observability tuning
Live launch behind progressive rollout. Langfuse dashboards tuned with the on-call team. First-month tuning loop wired in — drift detection on retrieval, faithfulness regression flags, escalation precision regression flags. Engagement closes with a written handover memo plus a 30-day check-in.
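Progressive rollout is often implemented by hashing a stable user id into a bucket and comparing it to the current rollout percentage. The page does not specify a mechanism, so this is an assumed approach, and `in_rollout` is a hypothetical helper.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user inside or outside the rollout cohort."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Because the bucket depends only on the user id, a user who sees the new chatbot at 10% keeps seeing it at 50%, and dropping the percentage back to 0 reverses the cutover without any per-user state.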
Why teams pick us for enterprise chatbot development.
-
01 Eval before prompt-tuning
We ship the eval harness in code before the chat model touches the corpus. Most stalled chatbot builds we audit failed because the team prompt-tuned for three months without a regression gate. Faithfulness, retrieval precision, escalation, tone, latency — all measured in code, all regression-flagged, all owned by the buyer's team on day one.
02 Multi-vendor chat by default
We don't ship single-vendor chat. The routing layer above the model SDKs costs roughly a week of engineering and saves the contract renegotiation that always arrives in month nine. Sonnet 4.6 plus a GPT-5 mini fallback is the default; the buyer's team can swap providers in two weeks of vendor support effort, not six months of re-platforming.
03 Warm-handoff, never cold
Every chatbot we ship into a support context wires a warm-handoff layer with full conversation context. A cold handoff that makes the customer repeat their name is how you lose that customer. The handoff layer is the highest-value engineering decision in any hybrid chat, and the most-skipped line item in legacy vendor proposals.
04 Voice as its own discipline
Voice chatbot development isn't text chat with a TTS layer slapped on. Sub-400ms turn-take, barge-in handling, interruption recovery, call-recording-as-eval-set — all engineering surface that doesn't exist in a text-chat build. We've shipped voice on LiveKit + Pipecat across healthcare, fintech, and logistics; never had to re-platform the voice layer.
05 Channel abstraction from week two
Multi-channel chat is the default ask in 2026. The channel abstraction is roughly a week of engineering at week two and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year — the abstraction call holds up every time.
06 Fixed scope, written deliverable
Three to twelve weeks per engagement; no time-and-materials clock; no vendor lock-in. The chatbot, the eval harness, the observability dashboards, the channel integration code — all owned by the buyer's team at handoff. About a third of pilot engagements convert to a follow-up shape; we don't push the conversion, the memo names the call either way.
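The routing layer described under "Multi-vendor chat by default" can be sketched as an ordered list of providers tried in turn. The provider callables below are stand-ins; real ones would wrap each vendor's SDK client and catch that SDK's specific error types.

```python
from typing import Callable, Sequence

class AllProvidersFailed(Exception):
    """Raised when every provider in the route has errored."""

def route_chat(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in practice, catch the SDK's specific errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))

# Stand-in callables for illustration only.
def primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")

def fallback(prompt: str) -> str:
    return f"echo: {prompt}"
```

Because callers only see `route_chat`, changing the provider order or adding a third vendor is a configuration change and not a re-platforming.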
Four ways to start a chatbot development services engagement.
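The four shapes: grounded chatbot pilot, custom chatbot development, voice chatbot development, and legacy chatbot modernisation. Each is fixed-scope and fixed-fee with a written deliverable. Pick the closest match; the framing call refines it if needed.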
Where the chat has landed.
Three typical-shape engagement patterns. Function, segment, and deliverable are real shapes; specific client metrics land in case studies once shipped engagements clear NDA.
Grounded member-services chatbot against a regulated policy corpus
Typical shape: a US healthcare payer needs a member-services chatbot that grounds every answer in the actual benefit policy document, never the model's training corpus. We build the chat surface on Sonnet 4.6 + pgvector hybrid retrieval against the policy library, wire RAGAS eval against a clinician-built gold set, and ship the handoff layer as a warm transfer with context into the existing Zendesk queue. Faithfulness is the headline gate; we don't ship below the threshold.
Returns + order-status chatbot across web, Intercom, and WhatsApp
Typical shape: a DTC retailer wants to deflect tier-one post-purchase volume across three channels without asking the customer for their order number twice. We build a tool-using chatbot on GPT-5 mini with order-lookup, return-initiation, and tracking tools; ship the channel abstraction above Intercom plus a WhatsApp Business adapter; wire the warm-handoff layer to the human agent queue with full conversation context. Eval on tool-call accuracy and escalation precision against a stratified live-traffic shadow.
Voice chatbot for KYC pre-fill and after-hours intake
Typical shape: a regulated lending platform wants to compress KYC intake to a five-minute voice conversation with a sub-second turn-take target. We ship LiveKit Agents + Deepgram + ElevenLabs + Claude Sonnet 4.6 as the chat brain. Barge-in handling, interruption recovery, and call-recording-as-eval-set wired before launch. Handoff layer warm-transfers to a licensed loan officer when the conversation reaches a regulated decision point.
The stack we ship against.
Chat models, conversation frameworks, retrieval, voice, observability, and channel — the surface a 2026 chatbot build actually touches.
- Claude Sonnet 4.6
- GPT-5 mini
- Gemini 3 Flash
- Llama 4
- LangGraph
- OpenAI Agents SDK
- pgvector
- Pinecone
- Qdrant
- Langfuse
- Braintrust
- RAGAS
- LiveKit
- Pipecat
- Deepgram
- ElevenLabs
- Twilio
- Intercom
- Zendesk Sunshine
- Slack
What buyers ask before signing.
What's the difference between chatbot development services and AI agent development?
Different shape, different engineering discipline. Chatbot development services here cover single-turn-ish conversational systems — the user asks, the chat answers (often grounded in a retrieval pipeline), the conversation either resolves or hands off to a human. AI agent development covers multi-step autonomous task execution — plan, act, observe, iterate, often over minutes or hours of runtime with ten-plus tool calls per session. The model class is different (Sonnet 4.6 for chat, Opus 4.7 for the plan layer of agents); the framework is different (thin SDK or LangGraph for chat, full state-graph for agents); the eval surface is different (faithfulness and escalation for chat, trajectory and tool-call success for agents); the latency budget is different (sub-2-second for chat, 5–60 seconds is normal for agents). If your use case is customer service, sales-qualifying, internal Q&A, voice deflection — you're in the right pillar. If it's multi-system workflow automation — route to the agent practice.
Why do you default to Claude Sonnet 4.6 instead of GPT-5 for grounded chat?
Tone and grounding. Sonnet 4.6 leads on faithfulness-against-retrieved-context in our eval runs — measurably less hallucinated content when the retrieval layer feeds it on-topic chunks, and a noticeably better refusal posture when retrieval misses. GPT-5 mini wins on raw throughput and is our backstop in the routing layer; we ship a multi-vendor abstraction above both. Gemini 3 Flash is competitive at the cheap end for low-stakes chat — we've shipped it on two internal-Q&A engagements where the cost economics flipped the spreadsheet. The honest answer is: it depends on the use case, and the default is Sonnet 4.6 plus a routing fallback. We've never recommended a single-vendor chat stack for any production engagement — vendor risk is real and the abstraction costs roughly a week of engineering.
How do you eval a chatbot before it ships?
Five surfaces. Faithfulness — does the answer match what retrieval returned? RAGAS plus a custom harness against a gold set built with the buyer team. Retrieval precision — did the right chunks rank above the wrong ones? Tracked per query, regression flagged. Escalation precision — when the chat should hand off, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set. Tone — does the chat sound like the brand? Human-rated against a rubric, scored over time. Latency p50 / p95 — measured in production, alerted on regression. For voice chat add turn-take p50 / p95, barge-in success, and interruption-recovery. We ship eval in code before the chat model touches the corpus — the harness is the artefact that survives the engagement.
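Two of the surfaces above reduce to short formulas. The sketch below assumes precision@k for retrieval precision and the nearest-rank method for latency percentiles; the answer does not specify either, so treat both as illustrative choices.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are in the relevant set.

    If fewer than k chunks were retrieved, the denominator is the number retrieved.
    """
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant) / len(top_k)

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-turn latencies in milliseconds.
latencies_ms = [120, 340, 200, 1500, 410, 380, 290, 310, 260, 330]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

The single slow turn barely moves p50 but sets p95, which is why both are tracked and alerted on separately.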
Do you build voice chatbots on LiveKit or build the voice layer custom?
LiveKit Agents or Pipecat for every voice chatbot we ship. We don't build the voice transport layer custom — Twilio carries the PSTN side, LiveKit handles the WebRTC and the agent runtime, Deepgram does the streaming STT, ElevenLabs handles TTS. The custom work is the chat brain — system prompt, retrieval over the right corpus, tool wiring, handoff layer, and the eval harness. Building voice transport custom is a six-month engineering yak-shave that no buyer has ever earned the ROI on. LiveKit lands sub-400ms turn-take out of the box; the engineering work is making the chat brain not stupid inside that latency budget, which is where every voice chatbot development engagement actually lives.
What does an enterprise chatbot development engagement cost?
Fixed scope, fixed fee. A single-channel chatbot pilot runs three to four weeks at the lower end of the band. A full multi-channel custom chatbot development engagement runs six to eight weeks. Voice chatbot development runs eight to twelve weeks (the engineering surface is bigger). Legacy chatbot modernisation runs six to ten weeks depending on the migration depth. We quote exact numbers after a 30-minute scoping call. None of our chatbot development services engagements run on a time-and-materials clock — we sell a working chatbot against a fixed scope, not hours. Pricing scales with channel count, corpus complexity, and handoff depth; the lower end is single-channel grounded chat, the upper end is multi-channel voice + handoff + IVR replacement.
Can you migrate us off an existing chatbot vendor (Drift, Ada, Intercom Fin, etc.)?
Yes — legacy chatbot modernisation is a defined engagement shape. We've migrated chat off four legacy vendor stacks in 2026, every one for the same reasons: eval depth was vendor-controlled, model swap was vendor-blocked, channel abstraction didn't exist, and renewal economics had walked away from the value. The migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline (where one exists). The target stack is usually Sonnet 4.6 + pgvector + Langfuse + your existing channel anchor (Zendesk, Intercom, Twilio) — vendor surface kept, brain replaced. Migration runs six to ten weeks; we ship behind a progressive feature flag so the cutover is reversible until the eval baseline is matched-or-exceeded.
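A progressive, reversible cutover is commonly implemented by hashing a stable ID into a percentage bucket. The sketch below assumes that approach and a hypothetical `conversation_id` key; the answer above does not say which flagging system is used.

```python
import hashlib

def in_rollout(conversation_id: str, percent: int) -> bool:
    """Deterministically place a conversation in the new stack's rollout cohort.

    Hashing keeps a conversation on the same stack across turns, and
    setting percent back to 0 reverses the cutover.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent
```

Raising `percent` only ever adds conversations to the cohort, so traffic moves to the new stack gradually while the eval baseline is compared.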
How is this different from your conversational AI development or LLM chatbot development work?
Same pillar, different language. Conversational AI development is the head-term — the discipline of building conversational systems, regardless of channel. Chatbot development services is the buyer-side phrase — what the procurement team types into the search bar. LLM chatbot development is the architectural sub-category — chat where the brain is an LLM (which is roughly 100% of 2026 builds; rule-based chat is a legacy modernisation target, not a greenfield shape). All three terms describe work we ship under this pillar. The sibling practices: AI agent development for multi-step autonomous task execution; RAG development services for retrieval pipelines deeper than the chat use case requires; LLM development services for the model-engineering layer (fine-tuning, hosted-vs-self-hosted, cost engineering) beneath a chat surface.
Ship a grounded chatbot in six weeks.
Grounded chatbot pilot in 3–4. Custom chatbot development in 6–8. Voice chatbot development in 8–12. Legacy chatbot modernisation in 6–10.