We will buy a customer service chatbot when we can audit what it knows, see where it draws its answers from, and measure how often it gets the answer right on tickets we already have ground truth on. Until then, every vendor pitch is just a deflection-rate number with no methodology under it.
Enterprise AI chatbot, in production.
42% tier-1 deflection in 8 weeks.
A Series C B2B SaaS company was running a 6-hour first-response time on its tier-1 support queue. We shipped a RAG-grounded enterprise AI chatbot over the help center and 90 days of resolved tickets. Claude Haiku 4.5 classifies intent in 280 ms p95. Claude Sonnet 4.6 returns the grounded reply with every claim cited to a retrieved chunk. A confidence gate at 0.7 routes the output into one of four lanes: autonomous reply, escalate with draft, refuse and queue, or out-of-scope. Five-week pilot, two weeks of internal-note shadow, three weeks of staged rollout. By week 8 (2026-Q1) the chatbot was deflecting 42% of tier-1 traffic at $800 per month all-in run cost.
What this case study shows
A Series C B2B SaaS company shipped a customer service chatbot on Claude Sonnet 4.6 plus a Haiku 4.5 intent classifier, grounded in 480 help-center articles and 90 days of resolved Zendesk tickets. By week 8 the chatbot deflected 42% of tier-1 traffic at steady state (resolution-counted, not first-touch routing), with groundedness at 0.95 and escalation-draft acceptance at 78%. First-response time on deflected tickets dropped from a 6-hour median to 12 minutes. Run cost: about $800 per month all-in. Stack: Claude Sonnet 4.6, Claude Haiku 4.5, pgvector, Zendesk function-calling, Langfuse. Five-week pilot plus three-week staged rollout, with a shadow-mode pause that caught three production failure modes before any customer saw a reply. The KB layer underneath is the same engagement shape we ship as a standalone AI knowledge base when retrieval is the primary surface.
Where the support queue
actually was.
A 6-hour median first-response time. 14 agents on follow-the-sun rotation. Three-quarters of tier-1 inbound was the same five ticket shapes. Generic chatbots had been tried; all three were turned down.
The client is a Series C B2B SaaS company. Their support floor sat at 14 agents on a follow-the-sun rotation, with an inbound Zendesk queue averaging 6-hour first-response time on tier-1 traffic and an 18 to 24-hour tail during US-morning peak. The Head of Support Engineering ran the numbers in our discovery week and named the binding constraint. 76% of tier-1 ticket volume mapped to five repeating shapes: password resets, billing questions like "why was I charged on the 4th, not the 1st", feature how-tos lifted directly from the help center, plan-tier entitlement questions, and SSO connection errors with a known knowledge-base remedy.
The team had previously evaluated three off-the-shelf AI customer support software vendors and turned all of them down. Objections were consistent. Tone did not match the brand voice. Product knowledge was weak or wrong on entitlement edge cases the marketing site did not cover. And there was no honest measurement methodology behind the deflection-rate claims those vendors had put in their pitch decks. The Head of Support Engineering said it plainly: we will buy a customer service chatbot when we can audit what it knows, see where it draws its answers from, and measure how often it gets the answer right on tickets we already have ground truth on.
What we built,
and why.
Two models. One confidence gate. Four routing outcomes. One eval set the customer's own support engineering team labelled. Every reply has to cite the chunk it pulled from, or the schema validator rejects.
We shipped a customer service chatbot that an engineering team will recognize as eval-disciplined rather than vendor-pitched. The intent-classify step runs on Claude Haiku 4.5: every inbound message routes into one of 9 ticket categories or out-of-scope. The reply step runs on Claude Sonnet 4.6 with retrieval over the help center plus 90 days of resolved Zendesk tickets. Every Sonnet reply must cite the chunk it retrieved the answer from, or the Zod validator rejects the output and the agent retries once with a stricter prompt, then fails closed to a human ticket.
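The cite-or-reject loop described above can be sketched in a few lines. This is a minimal illustration, not the production validator: the build uses a Zod schema, and here a plain Python check stands in for it. The `Claim` and `Reply` shapes and the `generate` callable (which takes a "strict prompt" flag) are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Claim:
    text: str
    chunk_id: Optional[str]  # id of the retrieved chunk this claim cites, if any


@dataclass
class Reply:
    claims: list[Claim]
    confidence: float


def is_grounded(reply: Reply, retrieved_ids: set[str]) -> bool:
    """True only if there is at least one claim and every claim cites a retrieved chunk."""
    return bool(reply.claims) and all(
        claim.chunk_id in retrieved_ids for claim in reply.claims
    )


def generate_with_retry(
    generate: Callable[[bool], Reply],  # hypothetical: generate(strict) -> Reply
    retrieved_ids: set[str],
) -> Optional[Reply]:
    """Try the normal prompt, retry once with the stricter prompt, then fail closed."""
    for strict in (False, True):
        reply = generate(strict)
        if is_grounded(reply, retrieved_ids):
            return reply
    return None  # fail closed: the caller opens a human ticket instead of replying
```

Returning `None` rather than a best-effort reply keeps the failure mode explicit: the caller has to handle the human-ticket path and cannot accidentally send an uncited answer.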
The confidence gate sits at 0.7. Above threshold, the chatbot replies to the customer directly. Below 0.7 and above 0.4, the chatbot does not reply to the customer at all. Instead it escalates the ticket to the human queue with the AI-drafted reply attached as an internal Zendesk note. The agent reads the draft, edits or rewrites it, and sends. We measured that draft-attached escalation path as its own metric because we suspected (correctly) that it would carry a meaningful share of the productivity gain even on tickets the chatbot could not close autonomously. Below 0.4 the chatbot refuses outright; the ticket queues for a human with retrieval candidates pre-attached so the agent does not start cold.
Zendesk function-calling handles every write action. The chatbot can create a ticket, update ticket status, attach the conversation transcript to a parent ticket, and tag the case for routing. It cannot issue refunds. It cannot change plan tiers. It cannot read or write to billing systems. Every write tool has a policy file the agent runtime imports at startup, and the agent refuses to call a tool whose policy preconditions fail. The widget itself lives inside the customer dashboard, not the marketing site. This is a customer support chatbot for authenticated users with an account context the agent reads at session start. The retrieval scope is narrower than a marketing chatbot, entitlement questions are answerable from authenticated context, and the failure modes are different.
Two-model pipeline · Haiku 4.5 classifies, Sonnet 4.6 replies
- we rejected
- Single Sonnet model on every inbound message
- because
- Roughly 60% of tier-1 inbound resolves to one of 9 narrow intents we classify in 280 ms p95 with Haiku at about 5% of Sonnet's per-call cost. The classifier also lets us refuse out-of-scope (refunds, plan changes, billing disputes) at the cheap step, before retrieval ever fires. Routing-only queries terminate at Haiku and never pay for grounded reasoning they do not need.
Confidence gate at 0.7 with escalate-with-draft below
- we rejected
- Single threshold at 0.9 · reply only when very confident, otherwise human cold-start
- because
- We shipped 0.9 initially and watched it kick autonomous-resolvable tickets into the escalation lane. Shadow data showed the model was strong-and-safe on password resets, SSO errors, and feature how-tos at the 0.7 band; the conservative threshold cost real resolutions. The bigger productivity gain was below the gate: AI drafts attached as internal Zendesk notes were accepted by agents with light edits 78% of the time, and first-response on the escalation lane dropped from 6 hr to 1 hr 40 min.
Authenticated dashboard widget · not a marketing-site bot
- we rejected
- Public marketing-site chat with anonymous visitor context
- because
- Authenticated session at start means we read entitlement context from the customer's plan tier and Salesforce contract before the model generates anything. The bot answers 'why was I charged on the 4th, not the 1st' against the customer's real billing schedule, not a generic pricing page. Marketing widgets cannot do this: retrieval scope is different, failure modes are different, and the user intent is different.
Six stages of one
customer service chatbot turn.
From inbound message to logged trace. Each stage carries its own latency budget, model pick, and failure mode. Skip any one and you ship the demo bot competitors do.
- 01 · Classify: Intent routing. Claude Haiku 4.5 · 9-way classifier · 280 ms p95 · 60-token system prompt. $0.0006 / turn
- 02 · Retrieve: RAG over help center + ticket history. pgvector top-k 6 · 480 articles + 18,000 resolved tickets · 512-token chunks. ~1,200 in tokens
- 03 · Tool call: Zendesk function execution. create ticket · update status · tag · attach transcript · 4-second timeout. 0–2 calls / turn
- 04 · Generate: Grounded reply with citations. Claude Sonnet 4.6 · forced JSON · every claim cites a chunk · streamed. ~420 out tokens
- 05 · Guardrail: Confidence gate + policy check. Zod schema validate · ≥0.7 reply · 0.4–0.7 escalate with draft · <0.4 refuse. fail-closed
- 06 · Log: Trace + nightly eval replay. Langfuse self-hosted · 500-item eval re-runs · alert on >1.5 pt drop. evaled nightly
Latency and token numbers are 30-day rolling p95 from production traces. The 0.7 confidence gate is the policy threshold tuned in shadow week 2.
One model output.
Four routing outcomes.
Every Sonnet response carries a confidence score. The score plus the intent category from Haiku decides what the customer actually sees. Autonomous reply on the high-confidence path, escalate-with-draft in the middle band, refusal queue at the bottom, out-of-scope intents never reach Sonnet at all. The router is the customer-experience contract.
- Autonomous reply · ≥ 0.7 · schema-valid grounded answer · streamed to customer · ticket auto-tagged + resolved · ≈ 42% of inbound
- Escalate with draft · 0.4 – 0.7 · AI draft posted as internal Zendesk note · agent reads, edits, sends · 78% acceptance rate
- Refuse + queue · < 0.4 · no customer-facing reply · ticket routed to human queue · retrieval candidates pre-attached for the agent
- Out-of-scope · direct to human · refunds, plan changes, billing disputes · never reach Sonnet · routed at the Haiku step
Threshold tuned in shadow mode week 2. 0.9 was too conservative (kicked safe autonomous-resolvables into the escalation lane). 0.5 was too aggressive (surfaced low-grounded replies to customers). 0.7 sits where draft-attached acceptance and autonomous-resolution both stay healthy.
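The four-lane router reduces to a small pure function. The sketch below assumes the boundaries exactly as listed (0.7 and above replies, 0.4 up to 0.7 escalates, below 0.4 refuses) and treats a score of exactly 0.4 as escalate. The intent labels and the `route` signature are illustrative, not the production names.

```python
from enum import Enum
from typing import Optional

REPLY_THRESHOLD = 0.7   # at or above: autonomous reply
REFUSE_THRESHOLD = 0.4  # below: refuse and queue

# Illustrative labels for the intents the classifier hands straight to a human.
OUT_OF_SCOPE_INTENTS = {"refund", "plan_change", "billing_dispute"}


class Lane(Enum):
    AUTONOMOUS_REPLY = "autonomous_reply"
    ESCALATE_WITH_DRAFT = "escalate_with_draft"
    REFUSE_AND_QUEUE = "refuse_and_queue"
    OUT_OF_SCOPE = "out_of_scope"


def route(intent: str, confidence: Optional[float] = None) -> Lane:
    """Pick the lane from the classifier intent and the reply model's confidence."""
    # Out-of-scope is decided at the classifier step, before any reply exists,
    # so no confidence score is needed (or consulted) for it.
    if intent in OUT_OF_SCOPE_INTENTS:
        return Lane.OUT_OF_SCOPE
    if confidence is None:
        raise ValueError("in-scope intents need a confidence score")
    if confidence >= REPLY_THRESHOLD:
        return Lane.AUTONOMOUS_REPLY
    if confidence >= REFUSE_THRESHOLD:
        return Lane.ESCALATE_WITH_DRAFT
    return Lane.REFUSE_AND_QUEUE
```

Keeping the thresholds as named constants makes a retune a one-line change that the frozen eval set can replay before it ships.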
The tool policy file, in a table
| action | can do? | audit log | policy gate |
|---|---|---|---|
| Reply to customer with cited answer | yes · ≥0.7 conf | Langfuse + Zendesk | schema validates every claim |
| Create internal Zendesk note (draft) | yes | Zendesk audit | always logged · never auto-sent |
| Update ticket status (open → resolved) | yes · post auto reply | Zendesk audit | agent over-ride is 1-click |
| Tag for routing | yes | Zendesk audit | 9-tag vocabulary · hard-coded |
| Attach conversation transcript to parent | yes | Zendesk audit | PII scrub before attach |
| Issue refund | no | — | billing-system tool not exposed |
| Change plan tier or entitlement | no | — | Salesforce write-path not exposed |
| Read or write billing system | no | — | out of scope by policy file |
| Send email outside Zendesk | no | — | no SMTP tool available to runtime |
Policy file lives in the customer's repo. Agent runtime imports it at startup. Any tool call whose preconditions fail is refused before the call is made: no dry-runs, no soft failures. Tier-2 actions (refund, plan change, billing write) are not in this build by design; we scope them as a separate engagement with a stricter risk model.
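A refuse-before-call policy gate can be sketched as a lookup table of preconditions. Everything named here is hypothetical (tool names, context keys, the `call_tool` helper); the real policy file format is not shown in this write-up. Tools absent from the table are treated as not exposed, which mirrors the "no" rows above, and the 9-tag vocabulary check is left out for brevity.

```python
from typing import Callable, Dict


class PolicyViolation(Exception):
    """Raised before a tool runs when policy does not allow the call."""


# Hypothetical tool names and context keys. Each entry maps a tool to the
# precondition that must hold before the runtime is allowed to invoke it.
POLICY: Dict[str, Callable[[dict], bool]] = {
    "reply_to_customer": lambda ctx: ctx.get("confidence", 0.0) >= 0.7,
    "create_internal_note": lambda ctx: True,
    "update_ticket_status": lambda ctx: ctx.get("auto_reply_sent", False),
    "tag_for_routing": lambda ctx: True,
    "attach_transcript": lambda ctx: ctx.get("pii_scrubbed", False),
}


def call_tool(name: str, ctx: dict, tools: Dict[str, Callable[[dict], object]]):
    """Check policy first; the tool function is only reached if the check passes."""
    precondition = POLICY.get(name)
    if precondition is None:
        raise PolicyViolation(f"{name}: not exposed to the runtime")
    if not precondition(ctx):
        raise PolicyViolation(f"{name}: precondition failed")
    return tools[name](ctx)
```

Because the check raises before `tools[name]` is ever looked up, a refused call has no side effects to roll back.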
Stack we shipped,
all of it audit-able.
What the chatbot reads from and how
| surface | count | chunking / treatment | refresh cadence |
|---|---|---|---|
| Help center articles | 480 articles | 512-token chunks · 80-token overlap · sentence-anchored | daily incremental on CMS publish |
| Resolved Zendesk tickets (90d window) | ≈ 18,000 tickets | subject + first agent reply + final resolution · dedup by canonical resolution | nightly rolling window |
| Entitlement context (Salesforce) | live · per-session | read at session start · authenticated user id only | per-session · no batch refresh |
| Intent categories | 9 + out-of-scope | password reset · billing · feature how-to · plan/entitlement · SSO · refund · cancellation · escalation · feature-flag | frozen · labelled by client's support engineering |
| Confidence threshold | 0.7 reply / 0.4 refuse | tuned in shadow mode week 2 · drift-monitored nightly | review quarterly |
Embeddings: voyage-3-large at 1,024 dimensions. Retrieval: pgvector top-k 6, sentence-anchored chunks. Ticket dedup is canonical-resolution based (we group tickets that resolve to the same root cause and keep the cleanest exemplar). Salesforce entitlement context is read live per-session and never cached.
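For readers who want a concrete picture of "512-token chunks, 80-token overlap, sentence-anchored", here is a rough sketch. It makes two loud assumptions: tokens are approximated by whitespace-separated words (a real build would count with the embedding model's tokenizer), and sentences are split with a simple regex. It does not implement the "never split mid-quote" rule, and a single sentence longer than the budget is emitted as its own oversized chunk.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break on whitespace that follows ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def n_tokens(s: str) -> int:
    """Stand-in token count: whitespace-separated words."""
    return len(s.split())


def total(sentences: list[str]) -> int:
    return sum(n_tokens(s) for s in sentences)


def chunk(text: str, max_tokens: int = 512, overlap_tokens: int = 80) -> list[str]:
    """Pack whole sentences into chunks, carrying trailing sentences as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        if current and total(current) + n_tokens(sentence) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last few sentences forward, up to the overlap budget.
            carried: list[str] = []
            budget = overlap_tokens
            for prev in reversed(current):
                if n_tokens(prev) > budget:
                    break
                carried.insert(0, prev)
                budget -= n_tokens(prev)
            # Drop carried sentences that would push the new chunk over budget.
            while carried and total(carried) + n_tokens(sentence) > max_tokens:
                carried.pop(0)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever break between sentences, the actual overlap is "up to 80 tokens" rather than exactly 80.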
The timeline,
including the week we almost paused.
Five stages, milestone-billed. The week-5 shadow run caught three production failure modes (entitlement edge cases, billing-dispute auto-replies, stale release-notes answers), all before any customer saw a reply. The honest version of '8 weeks' includes the days we spent fixing them.
- Week 1
Discovery + 500-ticket eval set
One week with the Head of Support Engineering and two senior agents. We sampled 500 tier-1 tickets from the previous 90 days, stratified across the 9 intent categories. Their team labelled the correct reply, the correct ticket disposition (resolve, escalate, route), and the help-center article ids that should have grounded the answer. We wrote the eval harness. They wrote the answers. That organizational split was load-bearing for trust when results were reviewed two months later.
Frozen 500-item eval set · intent rubric · groundedness calibration corpus
- Weeks 2–3
Corpus build + retrieval tuning
Indexed 480 help-center articles and 90 days of resolved Zendesk tickets (about 18,000 after dedup) into pgvector on the customer's existing Postgres. voyage-3-large embeddings at 1,024 dimensions. Chunked at 512 tokens with 80-token overlap, sentence-anchored, never splitting mid-quote. Week 2 cost us four days on a macro-library cleanup we had not scoped: the Zendesk macro library had drifted across two reorgs and 'resolved' tickets had been closed with the wrong macro applied. Future projects: we audit historical resolution data before the corpus build, not during.
Hybrid retrieval at recall@5 of 0.92 on the eval set
- Week 4
Two-model pipeline + confidence gate
Claude Haiku 4.5 wired as the intent classifier (9 categories + out-of-scope). Claude Sonnet 4.6 generating cited replies under forced JSON. Confidence gate at 0.7 with escalate-with-draft routing below. Zendesk function-calling for ticket create, update, tag, attach. Tool policy file enforced at runtime: agent refuses to call any tool whose policy preconditions fail.
End-to-end pipeline behind a feature flag · CI green on the frozen eval
- Weeks 5–6
Shadow mode · three failure modes caught
Two full weeks where the chatbot generated a reply for every tier-1 ticket, but the reply landed in an internal-note field, never the customer. A senior agent reviewed 10% of replies daily. Three production failures surfaced. One: confident replies on enterprise entitlement edge cases, fixed by reading Salesforce at session start. Two: auto-replies to billing disputes that should have escalated, tightened with adversarial examples in the classifier. Three: stale release-notes answers on feature-flag questions, fixed by excluding release notes from retrieval and routing those questions to the human queue.
Three retrieval-scope changes shipped · groundedness lifted 0.91 → 0.94 · Walk-away point
- Weeks 7–8
Staged rollout · 42% holds at steady state
Three-week staged rollout. 10% of tier-1 traffic in week 6. 50% in week 7. 100% by end of week 7. Deflection held at 41 to 42% across all three stages. Escalation-draft acceptance climbed from 71% to 78% as agents learned to trust and rely on the draft. Nightly Langfuse eval replay caught one regression: a Sonnet model update added subtle verbosity that hurt groundedness by 1.8 points. We held the promotion, rolled the prompt, and re-promoted two days later.
42% tier-1 deflection sustained · 0.95 groundedness · 78% draft accept rate
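The nightly replay that held the promotion on a 1.8-point groundedness drop comes down to comparing two metric snapshots against the 1.5-point limit. The sketch below assumes scores are expressed in percentage points on a 0 to 100 scale and that a metric missing from the candidate run should block promotion. The function names and the example values in the usage note are illustrative.

```python
MAX_DROP_POINTS = 1.5  # promotion limit, in percentage points


def regressions(
    prior: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = MAX_DROP_POINTS,
) -> dict[str, float]:
    """Return each metric that fell by more than max_drop, with the size of the drop."""
    drops: dict[str, float] = {}
    for name, before in prior.items():
        after = candidate.get(name)
        if after is None:
            # Assumption: a metric that was not measured cannot be waved through.
            drops[name] = float("inf")
            continue
        drop = before - after
        if drop > max_drop:
            drops[name] = round(drop, 2)
    return drops


def can_promote(prior: dict[str, float], candidate: dict[str, float]) -> bool:
    """A candidate promotes only if no metric regressed past the limit."""
    return not regressions(prior, candidate)
```

With illustrative numbers, a prior cut of `{"groundedness": 95.0}` and a candidate of `{"groundedness": 93.2}` reports a 1.8-point drop and blocks promotion, while a candidate at 94.0 passes.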
How we know
42% deflection is real.
The eval set is frozen. Every prompt change, retrieval change, threshold change, and model upgrade replays the full 500. Nothing promotes if any metric drops more than 1.5 points from the prior cut. Numbers below are the current production cut against the four staged-rollout reference points.
Deflection is resolution-counted: share of inbound tier-1 where the chatbot replied autonomously AND the ticket did not return within 14 days for the same root cause. Denominator is full tier-1 inbound, not just chatbot-attempted. Groundedness is LLM-as-judge on the 500-item eval, calibrated against 80 human-judged samples. Escalation-draft acceptance is the share of AI drafts the human agent kept as-is or lightly edited (versus rewriting from scratch). 14-day reopen is the share of auto-replied tickets that reopened for the same root cause within 14 days.
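The two definitions above (resolution-counted deflection over full tier-1 inbound, and 14-day reopen over the auto-replied subset) can be written down directly. This is a sketch under stated assumptions: the `Ticket` fields are hypothetical, `reopened_on` is taken to mean a return for the same root cause, and "within 14 days" is treated as inclusive.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

REOPEN_WINDOW = timedelta(days=14)


@dataclass
class Ticket:
    auto_replied: bool
    replied_on: Optional[date] = None
    reopened_on: Optional[date] = None  # same-root-cause return, if any


def reopened_in_window(t: Ticket) -> bool:
    """True if an auto-replied ticket came back within the 14-day window."""
    return (
        t.auto_replied
        and t.replied_on is not None
        and t.reopened_on is not None
        and t.reopened_on - t.replied_on <= REOPEN_WINDOW
    )


def deflection_rate(tickets: list[Ticket]) -> float:
    """Auto-replied and not reopened in window, over ALL tier-1 inbound."""
    if not tickets:
        return 0.0
    resolved = sum(1 for t in tickets if t.auto_replied and not reopened_in_window(t))
    return resolved / len(tickets)


def reopen_rate(tickets: list[Ticket]) -> float:
    """Share of auto-replied tickets that reopened within the window."""
    auto = [t for t in tickets if t.auto_replied]
    if not auto:
        return 0.0
    return sum(1 for t in auto if reopened_in_window(t)) / len(auto)
```

Note the different denominators: deflection divides by every inbound ticket, while the reopen rate divides only by the tickets the chatbot answered on its own.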
An SSO error,
end to end.
One of the 9 intent categories: SSO connection failure with a known help-center remedy. In this trace, Haiku 4.5 classifies the intent in 240 ms, pgvector retrieves three cited chunks, Sonnet 4.6 generates a grounded reply, Zendesk function-calling creates a tagged ticket, and Langfuse logs the whole trace for nightly eval replay.
Illustrative per-decision economics — typical engagement-band figures, not literal client numbers.
- Customer: Hi — I cannot log in. The dashboard keeps showing 'SSO connection failed'.
- Bot: Looking at your SSO config — one moment.
- Bot: Your IdP signing cert expired on Wednesday. Easiest fix: rotate the cert in your Okta admin (steps 1–4 in our SSO renewal guide). Want me to attach the article to a ticket so you have it on hand?
- Customer: Yes please, and tag urgent.
- Bot: Creating the ticket and attaching the article…
- Bot: Ticket #4421 created, tagged urgent, SSO renewal article attached. If you hit a snag at step 3, reply here and I will loop in a human agent.
- Bot: Logged · trace lf_8e3a · 4 turns · total cost $0.014 · eval queued for nightly replay
The honest read on
the numbers you have seen elsewhere.
You have probably seen bigger numbers. Klarna said its OpenAI assistant handled two-thirds of customer service chats. Intercom Fin publishes a 67% resolution rate across 40 million conversations. Vendor pitch decks regularly promise 80% and up. Our 42% sits below all of them. Here is the read.
Handled is not resolved. The most-cited Klarna number — two-thirds of chats — measures which channel touched the customer first, not whether the customer's problem ended there. Klarna's CEO acknowledged in 2025 that the cost-first organization of support produced lower quality, and the company moved back to a hybrid AI-plus-human staffing model. We measure resolution on a frozen eval set with ground-truth disposition labelled by the customer's own support engineering team. A reply that lands but generates a second ticket next week does not count toward our 42%.
Industry RAG band is 40 to 60% for resolution. Across published benchmarks for retrieval-augmented chatbots in 2026, the strong-performance range is 40 to 60% resolution at confidence-gated routing. Below that band, the chatbot is doing rule-based matching at premium model cost. Above it, you are usually looking at routing metrics rather than resolution. Our 42% sits at the lower end of the strong band on a deliberately narrow tier-1 scope. Adding tier-2 paths (refunds, plan changes, escalation reasoning) is the route to lifting the number; we have scoped that as a separate 6-week build, not a feature add to this one.
Different ticket mix means different ceilings. Klarna's volume is consumer fintech with high-frequency, narrow-shape queries: delivery status, refund status, BNPL repayment dates. Our client is B2B SaaS with technical entitlement questions, SSO config, billing edge cases, and feature-flag rollout questions. The intent distribution is structurally different. A 42% number on B2B SaaS tier-1 is not directly comparable to a 67% number on consumer fintech. The share of inherently-resolvable inbound is itself a function of the customer's product surface, and consumer-fintech inbound has a much higher share of queryable-from-records intents than B2B SaaS technical inbound does.
Deflected does not mean resolved. The most expensive failure mode for any AI customer-service chatbot is the confident-wrong autonomous reply that closes a ticket the customer then reopens. We measured the 14-day reopen rate on autonomously-replied tickets at 6.2% and excluded those reopens from the 42% numerator. Vendor pitch decks rarely report this number. When you evaluate any chatbot vendor, the second question after "what is your deflection rate" should be "what is your 14-day reopen rate on the auto-replied subset?" The two together are the honest measurement.
CSAT held through rollout. First-response time on deflected tickets dropped from a 6-hour median to 12 minutes (most of those 12 minutes is the customer's own reading and follow-up latency, not the model). On escalated tickets, the agent starts from a drafted reply rather than reading the ticket cold; first-response on the escalation lane dropped from 6 hours to 1 hour 40. Customer-satisfaction score on deflected tickets held within 0.3 points of the all-human baseline through the staged rollout. We are not chasing a deflection number at the cost of CSAT, which is the Klarna lesson rendered into engineering practice.
The 42% in our title is the honest version of the number. The bigger numbers in other vendors' decks are usually a different number wearing the same name.
The four shapes
we turn down.
A customer service chatbot built on this architecture will mislead users in any of the following situations. We turn down the engagement before a pilot is scoped, and we mean it.
Ticket volume under 500 per month
RAG infrastructure plus a 500-ticket eval set is overkill below this volume. Measured against adding a tier-1 agent, the pilot takes 12+ months of run cost to pay back. Below 500/month an off-the-shelf vendor's flat fee usually wins.
Help center is thin or stale
If under 30% of inbound is recoverable from existing docs, the model has nothing to ground on. We turn this engagement down. The right answer is: hire a docs writer for two months and we revisit. An agent over thin docs is a hallucination engine.
Tier-2 write-paths required on day one
Refunds, plan changes, billing writes need a stricter risk model than tier-1 deflection. We scope tier-2 as a separate 6-week pilot with per-action audit and dollar caps. Buyers who insist on day-one tier-2 are buyers we cannot serve well.
Support engineering will not own the eval set
If the client cannot commit one support engineer to weekly eval review and ground-truth labelling for the first six months, the deflection number drifts. The eval is the contract. Without an owner on the customer side, the number degrades silently and nobody catches it.
How an engagement
| stage | duration | what it covers |
|---|---|---|
| Discovery audit | 1–2 weeks | 500-ticket eval set + retrieval architecture memo + model + run-cost projection · client keeps every artifact regardless of next step |
| Pilot | 4–6 weeks + 2 wk shadow | End-to-end chatbot · two-model pipeline · Zendesk function-calling · confidence gate · escalate-with-draft · Langfuse · CI-wired eval harness |
| Staged rollout | 3 weeks | 10% → 50% → 100% rollout · daily eval review · drift monitoring · prompt and retrieval tuning |
| Continuous delivery | ongoing · 30-day notice | Embedded engineer · weekly eval review · drift monitoring · scope extensions · prompt and retrieval updates |
Audit deliverable (eval set, architecture memo, model + run-cost projection) is the client's to keep regardless of whether they continue. Pilot is milestone-billed against the CI-wired eval set. Continuous engagement runs on 30-day notice on either side. Run cost lands around the Claude API + pgvector + Langfuse spend at the observed volume — token math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing (Sonnet 4.6 at $3/1M input, $15/1M output; Haiku 4.5 at $1/1M input, $5/1M output) as of May 2026. Talk to our team for engagement scoping.
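As a worked example of the token math, the per-turn model cost can be estimated from the figures quoted in this case study: roughly 1,200 input tokens and 420 output tokens on Sonnet 4.6, plus $0.0006 for the Haiku classify step. Treating the 1,200 retrieved tokens as the whole Sonnet input is an assumption (system prompt and conversation history would add to it), and the estimate leaves out embeddings, pgvector, and Langfuse hosting.

```python
# Published prices quoted above, in USD per 1M tokens (as of May 2026).
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00

# Per-turn classifier cost from the pipeline stage list.
HAIKU_COST_PER_TURN = 0.0006


def sonnet_cost(in_tokens: int, out_tokens: int) -> float:
    """Cost of one Sonnet call in USD."""
    return (
        in_tokens / 1_000_000 * SONNET_INPUT_PER_M
        + out_tokens / 1_000_000 * SONNET_OUTPUT_PER_M
    )


def turn_cost(in_tokens: int = 1200, out_tokens: int = 420) -> float:
    """Classifier plus one grounded reply, in USD."""
    return HAIKU_COST_PER_TURN + sonnet_cost(in_tokens, out_tokens)


if __name__ == "__main__":
    # 0.0006 + (1200 * 3 + 420 * 15) / 1e6 = 0.0006 + 0.0036 + 0.0063
    print(f"${turn_cost():.4f} per turn")
```

Under these assumptions a single grounded turn costs about $0.0105, with output tokens accounting for the largest share.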
What buyers ask first.
Real answers, no hedging.
Will this customer service chatbot work on Intercom, Salesforce Service Cloud, or Help Scout instead of Zendesk?
How long for our team to reach 42% deflection on tier-1?
What about EU data residency and GDPR?
Can this chatbot handle tier-2 — refunds, plan changes, billing writes?
How is the 42% measured? Is this deflection or resolution?
What happens when Anthropic releases a new Claude model?
Do we own the eval set, prompts, and retrieval config after the engagement ends?
What is the difference between this and Zendesk Answer Bot or Intercom Fin?
Where this case study
points back to.
Pillars and sibling case studies that share architecture, model stack, or distribution surface with this customer service chatbot build.
AI Chatbot Development
The chatbot pillar. RAG-grounded support bots, confidence gates, escalation-with-draft patterns. Same eval-first loop used on this build.
Claude Development
Sonnet 4.6 + Haiku 4.5 integration patterns. Forced JSON, intent-classify routing, prompt cache shapes we used on this customer service chatbot.
OpenAI Development
Alternative model stack. GPT-4o + GPT-4o-mini routing if your platform team has an OpenAI commitment or function-calling preference.
AI Integration Services
Zendesk, Salesforce, Intercom, and ticketing-system function-calling. The integration layer behind every tier-1 deflection build.
Claude RAG for Product Docs
Sibling RAG case study. Same retrieval shape, different distribution surface — docs search instead of chat widget. 64% deflection on docs-recoverable queries.
OpenAI Realtime Voice Agent
Voice-channel sibling. Same intent-classify routing, different modality and latency budget.
AI App Development Services
How a chatbot-deflection build fits inside a broader AI development engagement — widget shell + retrieval + classifier + Zendesk integration + dashboard.
AI Consulting
This engagement started with our standard discovery audit — workflow inventory, deflection-rate baseline, eval-set design, and the kill point that gated the pilot.
Customer Support Automation
Customer-support deflection in production. The workflow our customer-support automation pattern was built from.
Want a chatbot case study like this
for your support stack?
Start with a discovery audit. We will sample 100 of your tier-1 tickets, scope the eval set, recommend a model + retrieval recipe, project token + run cost, and tell you honestly whether your support volume is chatbot-shaped. About one audit in five ends with 'you do not need this — buy the platform, here is the SOW for integration.'