
Chatbot development services, grounded chat, tool-using chat, voice chat, hybrid AI + human handoff

Chatbot development services from an engineering-led custom chatbot development team. We build conversational AI for customer service, sales, internal Q&A, voice deflection, and regulated grounded chat. Every build is eval-instrumented and multi-channel, with a warm-handoff layer to the human agent queue. We default to Claude Sonnet 4.6 with a multi-vendor routing layer above; we don't ship single-vendor chat stacks.


Practice Conversational AI development

Shapes Grounded · Tool-using · Voice · Hybrid

Default Claude Sonnet 4.6 + GPT-5 mini fallback

Engagements 3–12 weeks · fixed scope

001 / FRAME

Chatbot or agent? The most expensive misframing in this category.

Most buyers arrive at chatbot development services with an agent-shaped problem, or vice versa. Pricing the wrong shape burns one to two quarters of engineering before anyone notices. The grid below is the first frame we run: single-turn-ish conversational chat on the left, multi-step autonomous agent on the right, scored on the dimensions that actually drive the architectural call. Read it once; the rest of this page assumes you've landed on the chat side. If you're on the agent side, route to our AI agent development practice.

Dimension	Chatbot (this pillar)	Agent (sibling pillar)
Decision shape	Single-turn-ish: answer or escalate within one conversation	Multi-step autonomous: plan, act, observe, iterate over N turns
Default response budget	Sub-2-second perceived latency; voice ≈ sub-400ms turn-take	5–60 seconds is normal; long-running tasks measured in minutes
Speed is a feature in customer-facing chat: users abandon threads after 4–5 seconds of silence. The sub-2-second budget isn't arbitrary; it's the threshold where chat still feels like chat rather than a loading screen. Agents trade latency for capability by design, which is the right call for back-office work but the wrong call on a live support surface.
Tool calls per session	Zero to three: retrieve, lookup-by-id, maybe a write	Often ten-plus; explicit state machine over the tool surface
Tool depth is the clearest capability signal. Three tool calls is the practical ceiling for a chatbot before the state tracking collapses and latency blows the budget. If your build needs more than that, you're not designing a chatbot; you're designing an agent with a chat surface. The explicit state machine that agents require isn't overhead; it's the mechanism that keeps tool-call ten coherent with tool-call one.
Grounding source	RAG over enterprise corpus or product KB; tight context window	Open-ended retrieval, web search, multi-corpus, internal tools
Tight scope is a chatbot advantage, not a limitation. Grounding against a controlled corpus means every retrieval chunk can be audited, every answer traced back to a source document. Agents need open-ended retrieval because their tasks demand it, but that breadth trades away the faithfulness guarantees that RAG-grounded chatbots can provide and regulated industries require.
Failure mode	Hallucinated answer, missed escalation, wrong tone	Tool-call loops, plan drift, partial state, idempotency bugs
Right model class	Claude Sonnet 4.6 or GPT-5 mini: fast, cheap, well-grounded	Claude Opus 4.7 / GPT-5 reasoning for the plan layer
Right framework	Direct SDK + a thin orchestrator (or LangGraph for state)	LangGraph with proper state-graph control; never CrewAI for serious builds
Eval surface	Faithfulness, escalation precision, tone, latency p95	Trajectory eval, tool-call success, plan coherence, end-state correctness
Eval surface is where chatbot builds most often cut corners. Faithfulness and escalation precision sound simple, but measuring them at scale requires a dedicated harness: RAGAS for retrieval quality, Langfuse traces for latency p95, and a held-out escalation set that runs on every deploy. Agents have it harder (trajectory evals are genuinely more complex), but chatbot teams who skip the eval build are the ones regretting it in month three.
Where we recommend it	Customer service, sales-qualifying, internal Q&A, voice deflection	Multi-system workflows, research agents, ops automation (see siblings)

Most production builds aren't purely one or the other: a chatbot can call a small agent loop for a sub-task; an agent can expose a chat surface to its operator. The point of the frame is to lock the primary shape so the architecture, the model class, the framework, and the eval surface all line up.

If your buyer is asking for "an AI agent" but the actual user-facing surface is a single-turn customer support widget, you're in the chat pillar; the agent vocabulary is procurement-side language, not architecture. We'll settle that on the framing call and route accordingly.

002 / SHAPES

Five chatbot shapes. Every chatbot development services engagement maps to one.

Conversational AI development isn't a single shape; it's five distinct engineering envelopes with different default models, different frameworks, different eval surfaces, and different cost curves. The five shapes below cover nearly all inbound. We won't sell you a voice chatbot development engagement when the deflection use case lives on web; we won't sell you a tool-using build when the actual ask is RAG-grounded.

01 · GROUNDED
RAG chatbot over a curated corpus

Customer service deflection, internal Q&A, regulated policy chat, product knowledge base. The corpus is curated, the answer is grounded, the eval surface is faithfulness plus retrieval precision. Roughly 60% of engagements. Default stack: Claude Sonnet 4.6, pgvector + BM25 hybrid retrieval, RAGAS eval, Langfuse traces.
02 · TOOL-USING
Chat that calls structured tools

Order-status, return-initiation, appointment-booking, CRM-aware chat. Three to eight tools is the typical envelope; tool-call success rate matters more than reasoning depth. Unlike an agent build, the conversation stays single-thread. Default stack: GPT-5 mini or Sonnet 4.6, Zod / Pydantic schemas, function-calling SDK, eval on tool arg correctness.
03 · VOICE
Sub-400ms voice chatbot development

Phone-channel deflection, clinical intake, sales prospecting outbound, IVR replacement. Latency budget is the headline constraint. LiveKit Agents or Pipecat as the runtime, Deepgram for streaming STT, ElevenLabs for TTS, Sonnet 4.6 as the chat brain. Eval includes turn-take p50/p95 and barge-in recovery; neither shows up in a text eval suite.
04 · HYBRID + HANDOFF
AI tier-one with warm transfer to humans

Enterprise support where AI handles tier-one and escalates to a human agent with full conversation context. Roughly 40% of enterprise chatbot development engagements. The handoff layer is the highest-value engineering decision: warm transfer with context wins; cold handoff that re-asks the customer's name loses. Default: LangGraph state-graph, Zendesk Sunshine handoff, co-pilot suggestions in the agent's queue.
05 · MODERNISE
Legacy chatbot replacement

Replaces Drift, Ada, Intercom Fin, a first-gen LLM chatbot, or a rule-based legacy bot. The migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline. The target stack usually keeps the channel surface (Zendesk, Intercom, Twilio) and replaces the brain. Ships behind progressive rollout; cutover is reversible until the eval baseline is matched or exceeded.

A chatbot can call a small agent loop for one sub-task and still be a chatbot. The primary shape is the one that owns the user-facing surface: if the user sees a single chat thread that resolves or escalates, you're in chat; if the user kicks off a job and comes back later for the result, you're in agents. Most inbound for a support chatbot or customer service chatbot lands in shape 01 (grounded) or shape 04 (hybrid handoff); a support chatbot with a real human queue almost always wants 04.

003 / NUMBERS

What production chatbot development services look like at the latency level.

Performance targets we hold ourselves to before any chatbot development services engagement ships. The numbers below are typical-workload defaults; actual targets land in writing during the discovery phase and become the ship gate. We don't ship below the latency or faithfulness threshold; we'd rather miss the launch date than ship a chat that hallucinates under production traffic.

004 / STACK

The six families of the modern chatbot stack.

A modern enterprise chatbot development build touches six families: chat model, conversation framework, retrieval, voice (if voice is in scope), observability + eval, and channel + handoff. The categories below name the default pick, the cost-floor alternative, and the conditions under which we'd revisit. Per-family recommendations land in writing during discovery; the chat-stack inventory is the artefact that survives the engagement.

Chat model layer (Claude Sonnet 4.6 · GPT-5 mini · Gemini 3 Flash)

Strengths

The mid-tier hosted models are where production chat lives in 2026. Claude Sonnet 4.6 leads on grounded answers and tone control. GPT-5 mini wins on raw throughput at scale. Gemini 3 Flash is competitive at the cheap end. Pricing is in the $0.3–3 input / $1.5–15 output per million tokens band, about an order of magnitude cheaper than the frontier tier and indistinguishable from it on most chat workloads we audit.

When We Pick

Default pick for almost every chatbot development services engagement. Customer service deflection, internal Q&A, sales qualifying, support knowledge bases. Below ~500M monthly tokens the hosted-mid economics beat self-hosted by 3–8×. C-suite buyers who want a named vendor on the contract.

When We Don't

Strict data residency where the provider's region map can't close the gap: route to self-hosted Llama 4 on your infra. High-frequency low-value chat (FAQ-style triage) where a smaller distilled model wins on cost. Reasoning-heavy multi-step shapes: that's not a chatbot, that's an agent.

Paiteq Pattern

Default architecture: Sonnet 4.6 for grounded chat, GPT-5 mini as the multi-vendor backstop with a thin routing layer above. Costs roughly a week of engineering and saves the renegotiation when one provider hikes prices in month nine. We've never recommended a single-vendor chat stack for any production engagement.

Mid-hosted · Multi-vendor · Grounded
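
A minimal sketch of what a thin multi-vendor routing layer could look like, assuming each vendor SDK is wrapped behind a plain callable. The names `Provider`, `route`, and the two stub functions are hypothetical; a production layer would also need timeouts, retries, and per-provider prompt adaptation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: every vendor client is wrapped as "prompt in, text out".
ChatFn = Callable[[str], str]


@dataclass(frozen=True)
class Provider:
    name: str
    call: ChatFn


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def route(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors: list[str] = []
    for provider in providers:
        try:
            return provider.name, provider.call(prompt)
        except Exception as exc:  # broad on purpose: any vendor failure triggers fallback
            errors.append(f"{provider.name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stubs standing in for the primary and fallback vendor clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def stub_fallback(prompt: str) -> str:
    return f"fallback reply to: {prompt}"


chain = [Provider("primary", flaky_primary), Provider("fallback", stub_fallback)]
```

Calling `route("hi", chain)` falls through to the second provider because the first raises. Keeping the chain as ordered data is what makes a provider swap a configuration change rather than a re-platforming.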

Conversation framework (LangGraph · OpenAI Agents SDK · raw SDK)

Strengths

Chat doesn't always need a framework. A thin orchestrator over the SDK is the cleanest pattern for single-turn-ish chat: a system prompt, a retrieval call, a response, a handoff branch. LangGraph earns its keep when conversation state matters across turns: long context, multiple tool calls, structured handoff to a human agent. The OpenAI Agents SDK is fine for narrow OpenAI-only builds; we don't recommend it for multi-vendor chat.

When We Pick

LangGraph when state across turns matters: escalation flows, multi-step diagnostic chat, conversational forms with branching logic. Raw SDK + thin orchestrator when chat is genuinely single-turn-ish. Always wire eval and observability before the second sprint, regardless of framework.

When We Don't

CrewAI for serious chat: the role abstraction starts to fight you once you need state-graph control, and we've yet to ship a CrewAI-based chat into production without a re-platforming sprint. AutoGen: stalled relative to LangGraph; not recommended for new builds.

Paiteq Pattern

Roughly two-thirds of our chatbot development services engagements ship on a thin SDK orchestrator. The other third use LangGraph, usually because the chat has a handoff layer (warm transfer to human, escalation router, multi-channel context sync) where the state graph carries real weight.

State-machine · Thin-SDK · LangGraph
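
The thin-orchestrator pattern described in this card (system prompt, retrieval call, response, handoff branch) can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: `handle_turn`, `Turn`, and the stub retriever and generator are hypothetical names, and the "escalate when nothing is retrieved" rule is one simple choice of handoff trigger.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    reply: str
    escalated: bool
    sources: list[str]


def handle_turn(
    question: str,
    retrieve: Callable[[str], list[tuple[str, str]]],  # returns (doc_id, chunk_text) pairs
    generate: Callable[[str, str], str],               # (system_prompt, user_prompt) -> reply
    system_prompt: str = "Answer only from the provided context.",
) -> Turn:
    chunks = retrieve(question)
    if not chunks:
        # Handoff branch: nothing to ground on, so escalate instead of guessing.
        return Turn(reply="Connecting you with a colleague.", escalated=True, sources=[])
    context = "\n".join(text for _, text in chunks)
    reply = generate(system_prompt, f"Context:\n{context}\n\nQuestion: {question}")
    return Turn(reply=reply, escalated=False, sources=[doc_id for doc_id, _ in chunks])


# Hypothetical stubs so the sketch runs without a vendor SDK or a vector store.
def stub_retrieve(question: str) -> list[tuple[str, str]]:
    if "return" in question.lower():
        return [("kb-12", "Returns are accepted within 30 days.")]
    return []


def stub_generate(system_prompt: str, user_prompt: str) -> str:
    return "Returns are accepted within 30 days."
```

Because retrieval and generation are injected as callables, the same function can be exercised in an eval harness with stubs and in production with real clients.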

Retrieval layer (pgvector · Pinecone · Qdrant · BM25 hybrid)

Strengths

RAG chatbot work lives or dies on the retrieval layer. pgvector is the cheapest, lowest-friction pick when Postgres is already in the stack, and Postgres is already in roughly nine out of ten enterprise stacks. Pinecone Serverless cuts ops bandwidth to near zero at a premium tier. Qdrant self-hosts cleanly when data residency is non-negotiable. BM25 hybrid (sparse + dense) wins on technical-product corpora where the exact-keyword match still beats embeddings half the time.

When We Pick

Any grounded chatbot: clinical, legal, regulated, internal knowledge base, product Q&A. Almost every chatbot engagement we ship uses retrieval-augmented chat as the spine, not a generic conversational LLM pulling answers from training data. We route the deeper retrieval-pipeline work to our RAG development services practice when the scope earns it.

When We Don't

Pure tone-and-brand chat with no enterprise knowledge to ground against (rare, usually a sign the product team hasn't found the use case yet). Tiny corpora under 5k chunks where in-context retrieval beats a vector store and the index is overkill.

Paiteq Pattern

Default recommendation: pgvector when Postgres is already there; Pinecone Serverless when ops capacity is the constraint; Qdrant self-hosted when data residency requires it. We don't recommend Weaviate for a chat-only build unless multi-modal retrieval is the headline requirement.

RAG · Hybrid · pgvector-first
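
One common way to merge a sparse (BM25) ranking with a dense (embedding) ranking is reciprocal rank fusion. The sketch below is an assumption about how a hybrid layer might combine results, since the text above does not name a fusion method; the document IDs are made up, and `k = 60` is the conventional default constant for this technique.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
bm25_ranking = ["sku-404", "faq-7", "faq-2"]
dense_ranking = ["faq-7", "faq-9", "sku-404"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["faq-7", "sku-404", "faq-9", "faq-2"]
```

Documents that appear in both lists rise to the top, which is the behaviour wanted on technical-product corpora where an exact keyword hit and a semantic hit reinforce each other.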

Voice stack (LiveKit Agents · Pipecat · Deepgram · ElevenLabs)

Strengths

Voice chatbot development is its own engineering discipline. LiveKit Agents and Pipecat both land sub-400ms voice turn-take in production, and the latency budget is the whole game. Deepgram leads on streaming STT accuracy at scale. ElevenLabs leads on voice quality; the open-source side (Whisper Large v3, F5-TTS) is closing fast but still trails on edge cases. Sub-second perceived turn-take is the difference between a voice agent that feels conversational and one that feels like an IVR with extra steps.

When We Pick

Support deflection at scale where call-centre cost is the headline number. Clinical intake, sales prospecting, after-hours triage. Anywhere the buyer journey involves a human picking up a phone. Roughly 25% of our voice chatbot development engagements in 2026 have replaced an existing IVR rather than building net-new.

When We Don't

Voice as a CEO whim with no buyer-journey evidence: the build is hard, the deflection economics are real but specific, and a voice agent without a deflection use case is theatre. Multi-step task automation: that's an agent with a voice front, not a voice chatbot.

Paiteq Pattern

Default voice stack: LiveKit Agents + Claude Sonnet 4.6 + Deepgram + ElevenLabs. Open-source substitutes are priced as a phase-two option when the volume crossover lands inside 12 months. We've shipped this exact stack three times in 2026 and never had to re-platform the voice layer.

Voice · Sub-400ms · Deflection

Observability + eval (Langfuse · Braintrust · RAGAS · Inspect)

Strengths

Every chatbot we ship costs more in instrumentation than the buyer expected and earns it back inside the first month. Langfuse leads OSS observability with traces, prompt versioning, and a usable eval surface. Braintrust dominates closed-source eval workflows with a clean diffing UX. RAGAS is the default retrieval-and-faithfulness harness for grounded chat. Inspect AI (UK AISI-backed) is the rigour pick for safety-critical chat in regulated industries.

When We Pick

Every chat engagement. Day-one cost line, not a phase-three nice-to-have. Most stalled chat pilots we audit failed because nobody knew which prompts were drifting, which retrievals were silently returning irrelevant chunks, or which escalations were silently being misrouted. Instrumentation is the cheapest insurance in the stack.

When We Don't

Never. We've never shipped a production chatbot without observability wired before the second sprint. Toy demos and POCs are the only exception, and the moment the demo earns a budget, observability lands in the next ticket.

Paiteq Pattern

Default: Langfuse self-hosted for teams with data-control concerns; Braintrust for teams with budget and no ops capacity. RAGAS as the retrieval eval harness regardless of trace backend. We don't recommend bare logging without traces for anything past prototype; it's the false economy that creates the month-six stall.

Day-one · Traces · Eval-first

Channel + handoff (Twilio · Intercom · Zendesk · Slack · web)

Strengths

The channel layer is where most chatbot pilots quietly fail. Twilio carries SMS, WhatsApp, and voice transport. Intercom and Zendesk are the dominant support-channel anchors; Zendesk Sunshine is the canonical handoff API for enterprise support chat. Slack is the default for internal-Q&A chat. The web widget is the simplest channel and the easiest to instrument; mobile native chat carries half the engineering cost and twice the polish.

When We Pick

Multi-channel chat is the default ask in 2026: the buyer expects the same chatbot across web, Intercom, Slack, and a voice line. The handoff layer to a human agent is the single highest-value engineering decision: warm transfer with conversation context wins; cold handoff with a ticket creation loses every time.

When We Don't

Single-channel chat where the channel is fixed and there's no plausible expansion path. Usually the buyer hasn't priced the multi-channel ask yet. We'd still build the abstraction; cost is roughly a week and the option value is enormous.

Paiteq Pattern

Default: channel abstraction layer above whichever vendor (Intercom / Zendesk / Twilio) carries the surface. Warm-transfer-with-context as the handoff default. We've migrated chat across three channel vendors for two clients this year; the abstraction earns its keep on the migration.

Multi-channel · Warm-handoff · Twilio
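
To make the channel abstraction and the warm-transfer default concrete, here is a small sketch under stated assumptions. `Channel`, `HandoffContext`, `InMemoryChannel`, and `escalate` are hypothetical names; a real adapter would wrap the vendor's own API, which is deliberately not modelled here.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class HandoffContext:
    """What the human agent receives, so the customer is not asked to repeat themselves."""
    customer_name: str
    summary: str
    transcript: list[str] = field(default_factory=list)


class Channel(Protocol):
    """The only surface the chatbot core sees; one adapter per vendor implements it."""
    def send(self, conversation_id: str, text: str) -> None: ...
    def hand_off(self, conversation_id: str, context: HandoffContext) -> None: ...


class InMemoryChannel:
    """Stand-in adapter for tests; a real adapter would call a vendor API here."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []
        self.handoffs: list[tuple[str, HandoffContext]] = []

    def send(self, conversation_id: str, text: str) -> None:
        self.sent.append((conversation_id, text))

    def hand_off(self, conversation_id: str, context: HandoffContext) -> None:
        self.handoffs.append((conversation_id, context))


def escalate(channel: Channel, conversation_id: str, context: HandoffContext) -> None:
    """Warm transfer: tell the customer, then pass the full context to the agent queue."""
    channel.send(
        conversation_id,
        f"Thanks {context.customer_name}, I'm bringing in a colleague who has the full history.",
    )
    channel.hand_off(conversation_id, context)
```

Because `escalate` depends only on the `Channel` protocol, moving from one vendor to another means writing a new adapter rather than touching the chat logic.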

005 / ARCHITECTURES

Four conversation architectures. Pick the one that fits the buyer journey.

RAG-grounded, tool-using, voice, and hybrid-with-handoff are the four architectures we ship across roughly 100% of chatbot development services engagements. The architecture determines the model class, the framework, the eval surface, and the team shape. We won't sell you a tool-using build when the buyer journey is information retrieval; we won't sell you grounded chat when the buyer journey is task completion. The framing call is free.

RAG-GROUNDED

The most common chatbot shape we ship. A retrieval pipeline (pgvector + BM25 hybrid is the typical default) feeds the chat model context from a curated corpus: product docs, knowledge base, ticket history, regulated policy. Roughly 60% of chatbot development services engagements land here. The headline failure mode is silent retrieval drift: chunks rank well on cosine similarity but answer the wrong question. Eval is faithfulness + retrieval precision tracked over time, not a quarterly cherry-pick.

Pick when

Customer service deflection on a known product surface
Internal Q&A over a curated KB
Regulated chat where every answer must be grounded in policy
Sales-assist over a product catalogue
B2B onboarding where the corpus is stable

Skip when

Open-ended chat with no curated corpus: that's a generic LLM, not a chatbot we'd build
Multi-step workflow execution: that's an agent
Pure entertainment chat: different design vocabulary entirely

Stack

Claude Sonnet 4.6 · pgvector + BM25 · RAGAS eval · Langfuse traces

006 / PHASES

What a six-week custom chatbot development engagement actually ships.

A grounded chatbot pilot is roughly five distinct engineering phases against a fixed timeline. The phases below are the standard shape for a single-channel grounded chat; multi-channel adds a parallel integration sprint, voice chatbot development adds a barge-in and turn-take tuning phase, legacy modernisation prepends a contract-and-baseline read. The phases ship in series; eval lands in code before the chat model touches the corpus.

01
Discovery + use-case lock

Sixty-minute exec session locks the use case, the channel, the deflection target, and the handoff posture. Existing channel data (call recordings, ticket archive, chat transcripts) is read for tone, escalation patterns, and the long tail of questions the human agent team currently absorbs. Output is a one-page chatbot spec everyone signs off in writing before any code lands. Some engagements end at this phase because the right answer is "do a discovery sprint first, build later"; we still ship the spec and bill the phase.
02
Corpus + retrieval scaffold + gold-set eval

Ingest the corpus, chunk, embed, build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns the engineering time. First eval run on a 50-question gold set built with the buyer team: questions the human agent team actually fields, with the correct answer in the corpus and the wrong answer adjacent. Faithfulness and retrieval precision land in writing before the chat model touches the corpus. If the gold set fails at this stage, the corpus needs work before the chat does.
03
Chat surface + tool wiring

System prompt iteration, tool-call wiring, response shaping. Chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface stays small and explicit: three to eight tools is the envelope; ten-plus tools is an agent in disguise. For grounded chat, the system prompt names the corpus, the failure modes, and the escalation triggers in writing.
04
Eval harness + handoff layer

RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, a custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context (Zendesk Sunshine or the channel-native equivalent), never cold handoff. Eval thresholds locked in writing as the ship gate. The harness is the artefact that survives the engagement; the buyer's team owns it on day one of phase five.
05
Channel integration + UAT + ship

Web widget, Intercom, Zendesk, Slack, or WhatsApp: whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass; voice can't reuse text eval, ever. Progressive rollout behind a feature flag; first-month tuning loop wired with the on-call team. Engagement closes with a written handover memo plus a thirty-day check-in.

Clean handoff is the default. The chatbot, the eval harness, the observability dashboards, and the channel integration code are all owned by the buyer's engineering team on day one of phase five. About a third of pilot engagements convert to a build engagement under shapes 02–04; we don't push that conversion, and the memo names the call either way.
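
Phase 03 validates tool calls against a schema before anything executes. The sketch below shows the idea using only the standard library so it runs anywhere; in a real build a Pydantic model would typically replace the hand-written checks. The tool (`OrderStatusArgs`), its fields, and the `ORD-` prefix rule are hypothetical.

```python
import json
from dataclasses import dataclass


class ToolArgError(ValueError):
    """Raised when model-produced tool arguments fail validation."""


@dataclass(frozen=True)
class OrderStatusArgs:
    order_id: str
    include_tracking: bool = False

    def __post_init__(self) -> None:
        # Reject wrong types and malformed IDs before the tool ever runs.
        if not isinstance(self.order_id, str) or not self.order_id.startswith("ORD-"):
            raise ToolArgError("order_id must be a string like 'ORD-1234'")
        if not isinstance(self.include_tracking, bool):
            raise ToolArgError("include_tracking must be a boolean")


def parse_tool_call(raw_arguments: str) -> OrderStatusArgs:
    """Parse the JSON argument string a model emits and validate it against the schema."""
    try:
        payload = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ToolArgError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ToolArgError("arguments must be a JSON object")
    try:
        return OrderStatusArgs(**payload)
    except TypeError as exc:  # unknown or missing fields
        raise ToolArgError(str(exc)) from exc
```

Counting `ToolArgError` occurrences per tool across an eval set gives a simple tool-argument correctness metric.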

007 / CHANNEL

Channel picker. Where the chatbot lives, by use case.

The channel decision drives the engagement scope more than buyers realise. A single-channel pilot is roughly three to four weeks; a multi-channel build is six to eight; a voice build is eight to twelve. The grid below is the call we'd make on the rubric: channel across the top, use case down the side. Yes-cells are the default; maybe-cells need a follow-up call; no-cells are usually a misframing we'd route differently.

Use case	Web widget	Mobile native	Voice (phone)	Messaging (Slack / WhatsApp)
Customer service deflection	Default	Strong	High-value	Common
Sales qualifying / lead-capture	Default	Useful	Outbound only	WhatsApp wins
Internal employee Q&A	Possible	Rare	Skip	Slack default
Clinical / regulated intake	Yes, gated	Yes, app-context	High-value	Avoid SMS
Onboarding / activation	Default	In-product	Skip	Useful nudge
Booking / scheduling	Default	In-app	Strong fit	WhatsApp common
Knowledge-base / search-replace	Default	Useful	Voice search rare	Slack for internal

Yes = default channel for this use case. Maybe = depends on buyer-journey specifics. No = usually a misframing; we'd route to a different shape.

Multi-channel chat is the default ask in 2026. The channel abstraction layer is roughly a week of engineering and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year, and every migration validated the abstraction call we made at week two.

008 / GATES

Six eval gates an honest chatbot ship clears.

A chatbot eval memo is only as honest as the gates the team runs before ship. Below is the screen we apply to every chatbot development services engagement, and the same screen we use when we're hired to second-opinion a chatbot a different vendor already shipped. Second-opinion work routinely flags at least one gate the original build silently skipped, usually faithfulness or handoff precision.

01
Faithfulness against retrieval

Does the answer match what retrieval returned, or is the model pulling answers from its training corpus? RAGAS plus a custom harness against a clinician- or domain-expert-built gold set. Tracked over time, regression flagged. Faithfulness is the headline failure mode for grounded chatbot work, and the easiest one for a vendor to hide behind a quarterly cherry-pick.
02
Retrieval precision per query

Did the right chunks rank above the wrong ones? Per-query precision tracked over time, with the long-tail of low-precision queries flagged for corpus work. Retrieval drift is the silent failure that turns a six-month-old chatbot into a ticket generator. We don't ship without a regression gate on this metric.
03
Escalation precision

When the chat should hand off to a human, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set built with the buyer's support leadership. Escalation false-negatives hurt CSAT; false-positives drown the human agent queue. Both directions matter and both get a threshold.
04
Tone + brand alignment

Does the chat sound like the brand? Human-rated against a rubric the buyer's brand and CX teams sign off in writing. Scored over time. The rubric is shared, not run on a private spreadsheet. Tone is the failure mode buyers feel before they notice the faithfulness gap: a chat that sounds wrong loses customer trust before the eval team catches the drift.
05
Latency p50 + p95

Sub-2-second p95 for grounded chat. Sub-400ms turn-take p50 for voice. Measured in production, alerted on regression. Latency regression usually traces to retrieval-layer drift or a model swap, and both are the kind of thing observability surfaces in week one of the regression, not month three.
06
Failure mode named in writing

What's the single most likely way this chatbot fails at month six? If the memo can't answer that question, it isn't a ship-ready eval; it's a vendor demo. We name the failure mode, the leading indicator, and the threshold at which the trigger fires. Roughly one in five engagements names a failure mode that catches the regression before the customer-facing incident; that's the metric we measure ourselves on.
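
Three of the gates above reduce to small, checkable computations. The sketch below shows one plausible way to compute them, assuming per-query relevance labels, boolean escalation labels, and a list of latency samples; the function names are hypothetical, and the nearest-rank percentile is one of several valid percentile definitions.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Gate 02: fraction of the top-k retrieved chunk IDs that are labelled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def escalation_precision_recall(
    predicted: list[bool], expected: list[bool]
) -> tuple[float, float]:
    """Gate 03: precision penalises needless handoffs, recall penalises missed ones."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if e and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def percentile(samples: list[float], pct: float) -> float:
    """Gate 05: nearest-rank percentile over latency samples (seconds)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Tracking both escalation precision and recall matters because, as the gate says, false negatives and false positives hurt in different places.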

Six-out-of-six clean is the ship gate; we don't launch below threshold. Two or fewer clean is the trigger for a methodology intervention: the build needs more eval work before any prompt tuning. Eval rigour is the cheapest insurance in the chatbot stack and the most-skipped line item in the vendor proposal.
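
The ship rule in this section is simple enough to encode directly. The sketch below is an illustration only: `GateResult` and `ship_decision` are hypothetical names, and the "hold" label for the in-between case (three to five gates clean) is an assumption, since the text names only the two ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    value: float
    threshold: float
    higher_is_better: bool = True  # set False for latency-style gates

    @property
    def clean(self) -> bool:
        if self.higher_is_better:
            return self.value >= self.threshold
        return self.value <= self.threshold


def ship_decision(results: list[GateResult]) -> str:
    """All gates clean -> ship; two or fewer clean -> methodology intervention."""
    clean = sum(1 for result in results if result.clean)
    if clean == len(results):
        return "ship"
    if clean <= 2:
        return "methodology intervention"
    return "hold"  # assumption: not specified in the text above
```

Thresholds would come from the written discovery output rather than from code defaults, which is why the sketch takes them as data.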

009 / WHERE

Six chatbot shapes across six industries, where we've shipped.

Capability-by-industry heatgrid for chatbot development services we've actually built, not what the brochure promises. Strength reflects engagement depth: dark cells are repeat patterns; light cells are honest about depth we haven't built yet.

Function	Industries (grouped as in the grid)
Customer service chatbot	B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics, EdTech
Sales / lead-qualifying	B2B SaaS, Ecommerce, Fintech, EdTech | Healthcare, Logistics
Internal Q&A bot	B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics, EdTech
Voice chatbot	B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics | EdTech
Regulated grounded chat	B2B SaaS, Healthcare, Fintech, Logistics, EdTech | Ecommerce
Hybrid AI + human handoff	B2B SaaS, Ecommerce, Healthcare, Fintech, Logistics, EdTech

Legend: Possible fit · Good fit · Primary vertical

Dark cells: repeat engagement pattern. Medium: shipped at least once. Light: scoped but not yet completed. Empty: not yet relevant to the industry.

010 / PROCESS

Six steps. Six weeks. One shipped chatbot.

Eval-first, gold-set-anchored, channel-aware custom chatbot development methodology, refined across grounded chat, tool-using chat, voice chat, and legacy modernisation engagements. The sequence below is the standard six-week build for a single-channel grounded chat. Voice adds a barge-in tuning phase; multi-channel adds a parallel integration sprint; legacy modernisation prepends a contract-and-baseline read. None of these run on a time-and-materials clock: fixed scope, fixed fee, fixed timeline.

WEEK 1

Discovery + use-case lock

60-minute exec session to lock the use case, the channel, the deflection target, and the handoff posture. Read of the existing channel data: call recordings, ticket archive, chat transcripts. Output is a one-page chatbot spec everyone signs off in writing before any code lands.

WEEK 1–2

Corpus + retrieval scaffold

For grounded chat: ingest the corpus, chunk, embed, and build the retrieval scaffold. Hybrid (pgvector + BM25) by default; per-corpus tuning where it earns it. First eval run on a 50-question gold set built with the buyer team. Faithfulness and retrieval precision land in writing before the chat model touches the corpus.

WEEK 2–3

Chat surface + tool wiring

System prompt iteration, tool-call wiring, response shaping. The chat model layer ships behind a feature flag with full Langfuse tracing on every turn. Tool calls validated against Zod or Pydantic schemas. For tool-using chat, the tool surface is small and explicit; three to eight tools is the typical envelope.

WEEK 3–4

Eval harness + handoff layer

RAGAS for retrieval + faithfulness, Inspect or DeepEval for behaviour eval, custom harness for handoff precision and tone. Handoff layer wired with warm-transfer-with-context (Zendesk Sunshine or the channel-native equivalent). Eval thresholds locked in writing as the ship gate.

WEEK 4–5

Channel integration + UAT

Web widget, Intercom, Zendesk, Slack, or WhatsApp: whichever channel the discovery phase locked. UAT against the gold set plus a stratified live-traffic shadow run. Voice builds add a parallel barge-in and interruption-handling pass; voice can't reuse text eval.

WEEK 5–6

Ship + observability tuning

Live launch behind progressive rollout. Langfuse dashboards tuned with the on-call team. First-month tuning loop wired in: drift detection on retrieval, faithfulness regression flags, escalation precision regression flags. Engagement closes with a written handover memo plus a 30-day check-in.
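
Progressive rollout behind a feature flag is often implemented with deterministic hash bucketing, so a given user stays in or out of the new experience as the percentage grows. The sketch below assumes that approach; `in_rollout` and the `salt` value are hypothetical, and a managed feature-flag service would normally take this role.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "chatbot-v2") -> bool:
    """Return True if this user falls inside the first `percent` of 100 stable buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return bucket < percent
```

Because the bucket depends only on the salt and the user ID, raising the percentage only ever adds users, and setting it back to zero reverses the cutover cleanly.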

011 / WHY PAITEQ

Why teams pick us for enterprise chatbot development.

01
Eval before prompt-tuning

We ship the eval harness in code before the chat model touches the corpus. Most stalled chatbot builds we audit failed because the team prompt-tuned for three months without a regression gate. Faithfulness, retrieval precision, escalation, tone, and latency are all measured in code, all regression-flagged, and all owned by the buyer's team on day one.
02
Multi-vendor chat by default

We don't ship single-vendor chat. The routing layer above the model SDKs costs roughly a week of engineering and saves the contract renegotiation that always arrives in month nine. Sonnet 4.6 plus a GPT-5 mini fallback is the default; the buyer's team can swap providers in two weeks of vendor support effort, not six months of re-platforming.
03
Warm-handoff, never cold

Every chatbot we ship into a support context wires a warm-handoff layer with full conversation context. Cold handoff that re-asks the customer their name loses the customer who's been re-asked their name. The handoff layer is the highest-value engineering decision in any hybrid chat, and the most-skipped line item in legacy vendor proposals.
04
Voice as its own discipline

Voice chatbot development isn't text chat with a TTS layer slapped on. Sub-400ms turn-take, barge-in handling, interruption recovery, call-recording-as-eval-set: all of it is engineering surface that doesn't exist in a text-chat build. We've shipped voice on LiveKit + Pipecat across healthcare, fintech, and logistics, and never had to re-platform the voice layer.
05
Channel abstraction from week two

Multi-channel chat is the default ask in 2026. The channel abstraction is roughly a week of engineering at week two and earns its keep on the first vendor renegotiation. We've migrated chat across three channel vendors for two clients this year; the abstraction call holds up every time.
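A channel abstraction usually comes down to one adapter per channel that converts vendor payloads to and from a shared message type. The sketch below assumes that design; the adapter classes and payload field names are invented for illustration and do not mirror any vendor's real schema.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable


@dataclass
class InboundMessage:
    channel: str
    user_id: str
    text: str


class ChannelAdapter(ABC):
    """Translates between one channel's payloads and the shared message type."""

    @abstractmethod
    def parse(self, payload: dict) -> InboundMessage: ...

    @abstractmethod
    def render(self, reply: str) -> dict: ...


class WebWidgetAdapter(ChannelAdapter):
    # Hypothetical payload shape for an in-house web widget.
    def parse(self, payload: dict) -> InboundMessage:
        return InboundMessage("web", payload["session_id"], payload["message"])

    def render(self, reply: str) -> dict:
        return {"type": "text", "body": reply}


class MessagingAppAdapter(ChannelAdapter):
    # Hypothetical payload shape for a messaging-app webhook.
    def parse(self, payload: dict) -> InboundMessage:
        return InboundMessage("messaging", payload["from"], payload["text"]["body"])

    def render(self, reply: str) -> dict:
        return {"text": {"body": reply}}


def handle(adapter: ChannelAdapter, payload: dict, brain: Callable[[str], str]) -> dict:
    """Run one turn: the chat brain never sees channel-specific fields."""
    message = adapter.parse(payload)
    return adapter.render(brain(message.text))
```

Migrating to a new channel vendor then means writing one new adapter; the chat brain, eval harness, and handoff layer stay untouched.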
06
Fixed scope, written deliverable

Three to twelve weeks per engagement; no time-and-materials clock; no vendor lock-in. The chatbot, the eval harness, the observability dashboards, and the channel integration code are all owned by the buyer's team at handoff. About a third of pilot engagements convert to a follow-up shape; we don't push the conversion, and the memo names the call either way.

012 / SHAPES

Four ways to start a chatbot development services engagement.

The four shapes as picker cards. Fixed-scope, fixed-fee, written deliverable. Pick the closest match; the framing call refines it if needed.

01 / PILOT ↗

Chatbot pilot, single channel

Three to four weeks, fixed scope. One channel, one use case, one corpus. Grounded chat with eval harness wired, observability live, handoff layer scoped. Deliverable is a working chatbot behind a feature flag plus a written handover memo. The default entry shape for AI chatbot development services, usually a customer service chatbot or a support chatbot pilot.

3–4 wks · Fixed

02 / BUILD ↗

Custom chatbot development

Six to eight weeks, fixed scope. Multi-channel, tool-using or RAG-grounded, with eval and observability wired. Handoff layer included. Most enterprise chatbot development engagements land here. Ships behind progressive rollout with first-month tuning loop.

6–8 wks · Fixed

03 / VOICE ↗

Voice chatbot development

Eight to twelve weeks, fixed scope. Sub-400ms turn-take voice chat with LiveKit + Pipecat + Deepgram + ElevenLabs. Includes barge-in handling, interruption recovery, and an IVR-replacement path where the legacy is in scope. Eval rigour at production-voice depth.

8–12 wks · Fixed

04 / MODERNISE ↗

Legacy chatbot modernisation

Six to ten weeks, fixed scope. Replace a stalled rule-based or first-gen LLM chatbot with a grounded, eval-instrumented modern stack. Migration plan against the existing channel and handoff vendor contracts. We've shipped this against four legacy chatbot vendors in 2026.

6–10 wks · Fixed

013 / USE CASES

Where the chat has landed.

Three typical-shape engagement patterns. Function, segment, and deliverable are real shapes; specific client metrics land in case studies once shipped engagements clear NDA.

Healthcare

Multi-state payer · HIPAA-grounded chat

Grounded member-services chatbot against a regulated policy corpus

Typical shape: a US healthcare payer needs a member-services chatbot that grounds every answer in the actual benefit policy document, never the model's training corpus. We build the chat surface on Sonnet 4.6 + pgvector hybrid retrieval against the policy library, wire RAGAS eval against a clinician-built gold set, and ship the handoff layer as a warm transfer with context into the existing Zendesk queue. Faithfulness is the headline gate; we don't ship below the threshold.

Deliverable: grounded chat + RAGAS eval harness + warm-handoff into Zendesk Sunshine

Ecommerce

DTC retail · multi-channel post-purchase chat

Returns + order-status chatbot across web, Intercom, and WhatsApp

Typical shape: a DTC retailer wants to deflect tier-one post-purchase volume across three channels without asking the customer for their order number twice. We build a tool-using chatbot on GPT-5 mini with order-lookup, return-initiation, and tracking tools; ship the channel abstraction above Intercom plus a WhatsApp Business adapter; and wire the warm-handoff layer to the human agent queue with full conversation context. Eval covers tool-call accuracy and escalation precision against a stratified live-traffic shadow.

Deliverable: multi-channel tool-using chat + channel abstraction + handoff layer

Fintech

Pre-Series-B lending · regulated voice chat

Voice chatbot for KYC pre-fill and after-hours intake

Typical shape: a regulated lending platform wants to compress KYC intake to a five-minute voice conversation with a sub-second turn-take target. We ship LiveKit Agents + Deepgram + ElevenLabs, with Claude Sonnet 4.6 as the chat brain. Barge-in handling, interruption recovery, and call-recording-as-eval-set are wired before launch. The handoff layer warm-transfers to a licensed loan officer when the conversation reaches a regulated decision point.

Deliverable: voice chat stack + sub-400ms turn-take + regulated-handoff layer

014 / STACK

The stack we ship against.

Chat models, conversation frameworks, retrieval, voice, observability, and channel: the surface a 2026 chatbot build actually touches.

Claude Sonnet 4.6
GPT-5 mini
Gemini 3 Flash
Llama 4
LangGraph
OpenAI Agents SDK
pgvector
Pinecone
Qdrant
Langfuse
Braintrust
RAGAS
LiveKit
Pipecat
Deepgram
ElevenLabs
Twilio
Intercom
Zendesk Sunshine
Slack

015 / FAQ

What buyers ask before signing.

What's the difference between chatbot development services and AI agent development?

Different shape, different engineering discipline. Chatbot development services here cover single-turn-ish conversational systems: the user asks, the chat answers (often grounded in a retrieval pipeline), and the conversation either resolves or hands off to a human. AI agent development covers multi-step autonomous task execution (plan, act, observe, iterate), often over minutes or hours of runtime with ten-plus tool calls per session. The model class is different (Sonnet 4.6 for chat, Opus 4.7 for the plan layer of agents); the framework is different (thin SDK or LangGraph for chat, full state-graph for agents); the eval surface is different (faithfulness and escalation for chat, trajectory and tool-call success for agents); the latency budget is different (sub-2-second for chat, 5–60 seconds is normal for agents). If your use case is customer service, sales qualification, internal Q&A, or voice deflection, you're in the right pillar. If it's multi-system workflow automation, route to the agent practice.

Why do you default to Claude Sonnet 4.6 instead of GPT-5 for grounded chat?

Tone and grounding. Sonnet 4.6 leads on faithfulness-against-retrieved-context in our eval runs: measurably less hallucinated content when the retrieval layer feeds it on-topic chunks, and a noticeably better refusal posture when retrieval misses. GPT-5 mini wins on raw throughput and is our backstop in the routing layer; we ship a multi-vendor abstraction above both. Gemini 3 Flash is competitive at the cheap end for low-stakes chat; we've shipped it on two internal-Q&A engagements where the cost economics flipped the spreadsheet. The honest answer is that it depends on the use case, and the default is Sonnet 4.6 plus a routing fallback. We've never recommended a single-vendor chat stack for any production engagement: vendor risk is real and the abstraction costs roughly a week of engineering.

How do you eval a chatbot before it ships?

Five surfaces. Faithfulness: does the answer match what retrieval returned? RAGAS plus a custom harness against a gold set built with the buyer team. Retrieval precision: did the right chunks rank above the wrong ones? Tracked per query, regression-flagged. Escalation precision: when the chat should hand off, does it? When it shouldn't, does it stay in flow? Custom harness against a stratified scenario set. Tone: does the chat sound like the brand? Human-rated against a rubric, scored over time. Latency p50 / p95: measured in production, alerted on regression. For voice chat, add turn-take p50 / p95, barge-in success, and interruption recovery. We ship eval in code before the chat model touches the corpus; the harness is the artefact that survives the engagement.
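Two of these surfaces reduce to simple arithmetic, sketched below under stated assumptions: escalation results are given as parallel lists of booleans (what the chat did versus what the scenario label says it should have done), and latency percentiles use the nearest-rank method. The function names are illustrative and not taken from RAGAS or any other tool.

```python
import math


def escalation_precision(predicted: list[bool], expected: list[bool]) -> float:
    """Of the conversations the chat escalated, the share that should have been."""
    escalated = [exp for pred, exp in zip(predicted, expected) if pred]
    return sum(escalated) / len(escalated) if escalated else 0.0


def escalation_recall(predicted: list[bool], expected: list[bool]) -> float:
    """Of the conversations that should have escalated, the share that did."""
    should = [pred for pred, exp in zip(predicted, expected) if exp]
    return sum(should) / len(should) if should else 0.0


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Precision answers "when it hands off, was that right?" and recall answers "when it should hand off, does it?"; tracking both catches a chat that escalates everything as well as one that never escalates.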

Do you build voice chatbots on LiveKit, or build the voice layer custom?

LiveKit Agents or Pipecat for every voice chatbot we ship. We don't build the voice transport layer custom: Twilio carries the PSTN side, LiveKit handles the WebRTC and the agent runtime, Deepgram does the streaming STT, and ElevenLabs handles TTS. The custom work is the chat brain: system prompt, retrieval over the right corpus, tool wiring, handoff layer, and the eval harness. Building voice transport custom is a six-month engineering yak-shave that no buyer has ever earned the ROI on. LiveKit lands sub-400ms turn-take out of the box; the engineering work is making the chat brain not stupid inside that latency budget, which is where every voice chatbot development engagement actually lives.

What does an enterprise chatbot development engagement cost?

Fixed scope, fixed fee. A single-channel chatbot pilot runs three to four weeks at the lower end of the band. A full multi-channel custom chatbot development engagement runs six to eight weeks. Voice chatbot development runs eight to twelve weeks (the engineering surface is bigger). Legacy chatbot modernisation runs six to ten weeks depending on the migration depth. We quote exact numbers after a 30-minute scoping call. None of our chatbot development services engagements run on a time-and-materials clock; we sell a working chatbot against a fixed scope, not hours. Pricing scales with channel count, corpus complexity, and handoff depth; the lower end is single-channel grounded chat, the upper end is multi-channel voice + handoff + IVR replacement.

Can you migrate us off an existing chatbot vendor (Drift, Ada, Intercom Fin, etc.)?

Yes. Legacy chatbot modernisation is a defined engagement shape. We've migrated chat off four legacy vendor stacks in 2026, every one for the same reasons: eval depth was vendor-controlled, model swap was vendor-blocked, channel abstraction didn't exist, and renewal economics had walked away from the value. The migration plan reads the existing channel contracts, the existing handoff layer, and the existing eval baseline (where one exists). The target stack is usually Sonnet 4.6 + pgvector + Langfuse + your existing channel anchor (Zendesk, Intercom, Twilio): vendor surface kept, brain replaced. Migration runs six to ten weeks; we ship behind a progressive feature flag so the cutover is reversible until the eval baseline is matched or exceeded.

How is this different from your conversational AI development or LLM chatbot development work?

Same pillar, different language. Conversational AI development is the head term: the discipline of building conversational systems, regardless of channel. Chatbot development services is the buyer-side phrase: what the procurement team types into the search bar. LLM chatbot development is the architectural sub-category: chat where the brain is an LLM (which is roughly 100% of 2026 builds; rule-based chat is a legacy modernisation target, not a greenfield shape). All three terms describe work we ship under this pillar. The sibling practices: AI agent development for multi-step autonomous task execution; RAG development services for retrieval pipelines deeper than the chat use case requires; LLM development services for the model-engineering layer (fine-tuning, hosted-vs-self-hosted, cost engineering) beneath a chat surface.

016 / FURTHER READING

Where this practice connects.

If you're on the agent side of that decision grid, route to our AI agent development company practice; the architecture and eval shape are meaningfully different from chat. The sibling practices: RAG development services for retrieval pipelines deeper than the chat use case requires; LLM development services for fine-tuning, hosted-vs-self-hosted, and cost engineering beneath a chat surface.

For chatbots that sit on the auth boundary (account-takeover surface, transaction-confirmation flows), our AI fraud detection at the auth boundary walkthrough covers the hybrid rules + ML + LLM architecture we deploy underneath. In almost every chatbot engagement we recommend a retrieval-augmented generation pipeline as the spine, not a generic conversational LLM generating answers from training data. Most inbound for a support chatbot or customer service chatbot lands in shape 01 (grounded) or shape 04 (hybrid handoff); the 2026 customer service chatbot guide has the vendor matrix and cost-per-resolved-contact math.

Industry routes: AI healthcare software development chatbots are the single most common regulated entry shape (BAA-backed model hosting, PHI masking before any model call), with AI for fintech support chatbots a close second when the SOC 2 + FFIEC posture is the gate, and custom AI insurance development chatbots growing fast for policyholder self-service. For mobile-first chat surfaces (where the in-app chat lives inside a Flutter or React Native shell), the engineering bench is the same team that maintains GetWidget (the open-source Flutter UI library, 4,800+ stars on GitHub as of 2026-Q2). UI components, voice stack, and LLM gateway sit in one team, not three vendors with three eval rubrics.

017 / Related practices

Adjacent services.

AI AGENT DEVELOPMENT

AI Agent Development

Autonomous, tool-using AI agents for production workloads.

RAG DEVELOPMENT

RAG Development

Retrieval-augmented generation systems with evaluation built in.

LLM DEVELOPMENT

LLM Development

Custom LLM apps — RAG, fine-tuning, evaluation, deployment.

018 / Start a chatbot engagement

Ship a grounded chatbot in six weeks.

Grounded chatbot pilot in 3–4 weeks. Custom chatbot development in 6–8. Voice chatbot development in 8–12. Legacy chatbot modernisation in 6–10.

Talk to a partner See engagement shapes