The thing that scares us isn't a wrong answer; wrong answers we can recover from. What scares us is one second of silence on the line. The customer has already decided they're talking to a robot, and they're done. Show us the latency tail before you show us the deflection number.
A voice bot for a SaaS AI contact center,
built on the OpenAI Realtime API.
A US-based mid-market B2B SaaS support team's tier-1 voice queue was averaging four-minute waits at peak; five inbound questions accounted for 62% of call volume; their legacy IVR bounced 80%+ to a human. We built the AI contact center software on the OpenAI Realtime API (gpt-realtime-2) over their help-center corpus — Twilio in, streaming tokens to TTS, confidence-gated handoff to a live rep. Buyer framings layer on the same build: voice bot for procurement, AI-powered call center for CX, AI customer service software for finance. Eval-first, 10 weeks, with a kill point in week 5 we used.
What this case study shows
A SaaS customer-support team shipped an OpenAI Realtime voice agent over their help-center RAG corpus, fronted by Twilio for telephony. Across n=11,400 calls, the agent deflected 38% of tier-1 voice traffic (95% CI 33%–43%) at $0.10 per call against a $4 live-agent baseline, with p95 first-token latency of 580 milliseconds. Stack: gpt-realtime-2, Whisper-large-v3, pgvector 0.7, Twilio Voice, Cloudflare Workers, Langfuse. The function-calling tool handoff_to_human triggered when confidence dropped below 0.7. Ten weeks from discovery to production. This is one shape of a broader voice agent services engagement: the same pipeline carries over to telephony, kiosk, and in-app voice surfaces.
p95 first-token, visualised.
Total budget — caller-mouth to caller-ear — is 580ms. Each band's width is its share of that budget. The reasoning + RAG step is the long pole; the rest are kept honest by Cloudflare Workers and the Twilio media edge.
- Caller speech ingress 62ms
- STT (gpt-realtime-2 audio in) 118ms
- Reasoning + RAG retrieval 264ms
- TTS first-audio frame 88ms
- Twilio egress to caller 48ms
Deterministic replay — these bars are not a recording; they are a layout-stable visualisation of the p95 first-token latency budget. Per-stage numbers are pulled from Langfuse trace aggregates over a 30-day production window.
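The budget arithmetic can be checked in a few lines. This is a minimal sketch, assuming only the five per-stage p95 figures listed above; the dictionary keys and the `budget_shares` helper are illustrative names, not part of the production code.

```python
# Per-stage p95 figures (ms) as listed in the budget bars above.
# Key names are illustrative, not identifiers from the real build.
P95_BUDGET_MS = {
    "caller_speech_ingress": 62,
    "stt_audio_in": 118,
    "reasoning_and_rag": 264,
    "tts_first_audio_frame": 88,
    "twilio_egress": 48,
}


def budget_shares(stages: dict[str, int]) -> dict[str, float]:
    """Return each stage's fraction of the total budget."""
    total = sum(stages.values())
    return {name: ms / total for name, ms in stages.items()}


total_ms = sum(P95_BUDGET_MS.values())
shares = budget_shares(P95_BUDGET_MS)
long_pole = max(shares, key=shares.get)

print(total_ms)   # 580
print(long_pole)  # reasoning_and_rag
```

The reasoning + RAG band comes out at a little under half of the 580ms budget, which is why it is described as the long pole.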
A tier-1 voice queue
that wasn't worth a human.
A US-based mid-market B2B SaaS ($80M+ ARR) with a 14-rep tier-1 team handling ~41,000 calls/month. 62% of inbound volume on the same 5 questions. IVR bouncing 80%+ to a queue with a 4-minute peak wait. Tier-1 reps burning 71% of their day on questions any of them could answer in their sleep.
The binding constraint was latency, not accuracy. The support lead and the head of CX had piloted two text-channel chatbots the previous year and shelved both when CSAT dipped. What changed: US callers reliably report the experience as "robotic" when a synchronous voice line holds more than ~700ms of dead air after they finish speaking. Past that ceiling, deflected calls don't actually deflect — they bounce to a human angrier than when they arrived. Vendor buy-vs-build: legacy contact-center suites bolted LLMs onto IVR trees, conversational-IVR layers carried per-minute pricing, voice-bot SaaS priced on monthly seats. None carried a frozen eval, a p95 first-token SLO, or a per-call cost number reconcilable from public vendor pricing. The conversation we walked into was: show us how a voice agent could miss the 700ms ceiling, and tell us how you'd catch it before a customer hears it.
Native speech-to-speech vs chained STT → text-LLM → TTS bought back ~350ms of p95.
Voice bot pipeline — six stages,
one branch out to a human.
Twilio ingress → Cloudflare Workers edge proxy → gpt-realtime-2 native speech-to-speech (Whisper-large-v3 fallback) → hybrid pgvector + Pinecone A/B retrieval → bge-reranker-large → 3-tool surface (lookup_article, handoff_to_human, schedule_callback). Zero write tools. The streaming dots in the diagram are real — gpt-realtime-2 does not wait until generation finishes before TTS starts. That property is what gets p95 first-token under 600ms.
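The latency table on this page lists reciprocal rank fusion (RRF, k=60) as the step that merges retrieval candidates before the reranker. A minimal sketch of standard RRF follows; which ranked lists the build actually fuses is an assumption here, and the short `kb_*` ids are placeholders.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical candidate lists from two retrievers.
list_a = ["kb_a", "kb_b", "kb_c"]
list_b = ["kb_b", "kb_d", "kb_a"]
fused = rrf_fuse([list_a, list_b])
print(fused)  # ['kb_b', 'kb_a', 'kb_d', 'kb_c']
```

RRF only needs ranks, not comparable scores, which makes it a common choice when the two sources score on different scales.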
gpt-realtime-2 speech-to-speech as the primary path
- We rejected: Chained STT → text-LLM → TTS pipeline (Whisper + GPT-5 + ElevenLabs)
- Because: On the eval we ran, chained got us to p95 ≈ 940ms first-token, already past the 700ms `feels-robotic` threshold US callers reported. Native speech-to-speech buys us back ~350ms. Whisper still ships as a fallback when the Realtime audio path can't decode accent or noise.
handoff_to_human as a function-calling tool, not a fallback timeout
- We rejected: Confidence threshold on the model's own self-reported probability
- Because: Self-reported confidence on Realtime models is poorly calibrated under stream pressure (Anthropic and OpenAI both publish this). A first-class tool the model can call explicitly is more honest: the model knows what it doesn't know better than it knows how sure it is.
pgvector 0.7 primary + Pinecone serverless on a 50/50 A/B mirror
- We rejected: Pick one vector store and commit
- Because: Help-center retrieval recall was the second-biggest determinant of deflection (after first-token latency) on the eval. Running both in production for 6 weeks let us measure not just recall@5 but cost per query and tail latency under real traffic. pgvector won on cost-per-query; Pinecone won on tail-latency variance. We kept pgvector primary and the mirror stays as a watch-the-shop sanity check.
Every component has a
separately measurable contract.
When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.
Realtime decision model
Tier-1 deflection precision at 0.7 confidence on the frozen 240-item eval. Function-calling handoff is a deliberate tool, not a fallback timeout — the model knows what it doesn't know better than it knows how sure it is.
Telephony ingress
Round-trip latency from carrier to edge. Cloudflare Workers buys back ~28ms median vs origin.
Hybrid retrieval
Recall@5 + cost-per-query on the eval. pgvector primary, Pinecone serverless as A/B watchdog.
Cross-encoder rerank
Top-1 precision on the held-out slice.
TTS first-frame
Token-to-playback latency.
Handoff to human
PagerDuty page-to-pickup time. Warm transfer with transcript.
The realtime voice agent,
round-trip.
Caller speaks. Audio streams to gpt-realtime-2, which answers over the help-center RAG corpus. The model either answers, streaming tokens straight back to TTS so the first audio frame leaves the edge inside ~580ms, or calls the handoff_to_human tool, and PagerDuty pages a live agent.
first-token p95 580ms end-to-end · streaming tokens flow continuously from gpt-realtime-2 to TTS · branch fires on confidence < 0.7
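The handoff branch can be sketched as a small runtime gate. This assumes the rule stated in the handoff_to_human schema on this page: the 0.7 threshold applies to the low_confidence reason and the other reasons bypass it. The `HandoffRequest` type and `should_page_human` function are hypothetical names for illustration, not the production runtime.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # threshold quoted in the case study

# Reasons that bypass the threshold, per the schema's description text.
BYPASS_REASONS = frozenset(
    {"out_of_scope", "caller_request", "must_refuse_category", "multilingual_handoff"}
)


@dataclass(frozen=True)
class HandoffRequest:
    """Hypothetical shape of the arguments the model passes to the tool."""
    reason: str
    confidence: float


def should_page_human(req: HandoffRequest) -> bool:
    """Decide whether the side effect (page + warm transfer) may fire."""
    if req.reason in BYPASS_REASONS:
        return True
    if req.reason == "low_confidence":
        return req.confidence < CONFIDENCE_THRESHOLD
    raise ValueError(f"unknown handoff reason: {req.reason!r}")


print(should_page_human(HandoffRequest("low_confidence", 0.65)))  # True
print(should_page_human(HandoffRequest("low_confidence", 0.70)))  # False
print(should_page_human(HandoffRequest("caller_request", 0.95)))  # True
```

Keeping the gate outside the model means the model can ask for a handoff at any time, while the page to a live agent only fires when the runtime agrees.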
Voice bot + OpenAI Realtime stack — named tools,
named versions.
Everything in the build is a thing your security team can write a question about. Nothing is `our proprietary AI`. The eval set, prompts, and tool schemas are all checked into the customer's repo. Vendor swap-out cost is bounded by design.
Production shape,
under the hood.
The numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses OpenAI's public Realtime API pricing as of May 2026; eval composition is the frozen 240-item set the CI gates on.
How fast is the OpenAI Realtime API?
Per-stage P50 / P95 (ms)
| stage | p50 | p95 | tooling |
|---|---|---|---|
| Twilio ingress + edge proxy | 38 | 62 | Twilio Programmable Voice · Cloudflare Workers Durable Objects |
| STT (Realtime audio in) | 82 | 118 | gpt-realtime-2 native audio · Whisper-large-v3 fallback on miss |
| Hybrid retrieval | 64 | 96 | pgvector 0.7 top-40 ∥ Pinecone serverless top-40 (A/B) → RRF k=60 |
| Cross-encoder rerank | 44 | 72 | BAAI bge-reranker-large · g5.xlarge in customer VPC · top-12 |
| gpt-realtime-2 decision | 196 | 264 | OpenAI Realtime API · function-calling · ~2,800 in · streaming out |
| TTS first audio | 84 | 124 | gpt-realtime-2 native TTS · ElevenLabs Turbo v2.5 fallback |
| Twilio egress to caller | 32 | 48 | media stream reverse leg · jitter buffer ≤ 80ms |
| Total to first-token | 480 | 580 | agent boundary (excludes caller-side jitter buffer) |
p50/p95 from 30-day rolling window over n ≈ 41,200 production calls. SLO is p95 ≤ 700 ms first-token; current burn ≈ 83%. The kill-point fix (multilingual cache invalidation) is the only regression event in the last 60 days.
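For readers who want to reproduce the SLO check from raw trace durations, here is a minimal sketch. It assumes a nearest-rank percentile and reads "burn" as p95 divided by the SLO; both are assumptions about how the figures were derived, and the sample data is synthetic.

```python
import math

SLO_P95_MS = 700  # first-token SLO quoted above


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_burn(p95_ms: float, slo_ms: float = SLO_P95_MS) -> float:
    """Fraction of the SLO budget consumed by the observed p95."""
    return p95_ms / slo_ms


# Synthetic first-token durations: 401, 402, ..., 600 ms.
samples = [float(ms) for ms in range(401, 601)]
p95 = percentile(samples, 95)

print(p95)                          # 590.0
print(round(slo_burn(580) * 100))   # 83
```

With the published p95 of 580ms, the same ratio gives the ≈ 83% burn quoted above.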
Where the 700ms SLO budget goes.
Anything slower than 700ms first-token reads as a robot to a US caller, which is the binding constraint on this whole engagement. Current p95 is 580ms; the 120ms wedge below 700 is the headroom left for future prompt growth or for a third-party fallback that slows down.
- Twilio ingress 62ms
- STT (Realtime/Whisper) 118ms
- RAG + reasoning 264ms
- TTS first audio 88ms
- Twilio egress 48ms
- SLO threshold 700ms
- Headroom under SLO 120ms
// realtime/tools/handoff_to_human.tool.json
// Function-calling JSON schema registered on session.update.tools[].
// Confidence threshold is checked on the call-state object BEFORE the
// model is allowed to invoke this tool — the model can request handoff
// for any reason, but the runtime gates the side-effect (PagerDuty page,
// warm transfer to live agent) on confidence < 0.7 OR explicit caller
// request OR a must-refuse category match.
{
  "type": "function",
  "name": "handoff_to_human",
  "description": "Transfer this call to a live tier-1 support agent. Use when the caller's intent falls outside the help-center corpus, when the model's own confidence in the retrieved answer is below 0.7, when the caller explicitly asks for a human, or when the conversation hits a must-refuse category (billing dispute, churn-save, legal escalation).",
  "parameters": {
    "type": "object",
    "required": ["reason", "confidence", "call_state"],
    "properties": {
      "reason": {
        "type": "string",
        "enum": [
          "low_confidence",
          "out_of_scope",
          "caller_request",
          "must_refuse_category",
          "multilingual_handoff"
        ],
        "description": "Why the handoff is being requested. Used for routing + analytics."
      },
      "confidence": {
        "type": "number",
        "minimum": 0,
        "maximum": 1,
        "description": "Model's confidence in the retrieved answer at the moment of handoff. The runtime gate trips on < 0.7 for the low_confidence reason; other reasons bypass the threshold."
      },
      "call_state": {
        "type": "object",
        "required": ["call_sid", "language", "transcript_summary", "retrieved_chunk_ids"],
        "properties": {
          "call_sid": { "type": "string", "pattern": "^CA[a-f0-9]{32}$" },
          "language": { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
          "transcript_summary": { "type": "string", "maxLength": 800 },
          "retrieved_chunk_ids": {
            "type": "array",
            "items": { "type": "string", "pattern": "^kb_[a-f0-9]{12}$" },
            "minItems": 0,
            "maxItems": 12
          }
        }
      }
    }
  }
}
How much does a voice bot cost?
Per-call and monthly cost math (≈ 41k calls/mo)
| line item | $ / call | $ / month | note |
|---|---|---|---|
| gpt-realtime-2 audio input | $0.0240 | $984 | ~2 min avg call · $24/1M audio-input tokens (May 2026) |
| gpt-realtime-2 audio output | $0.0480 | $1,968 | ~45 sec agent speech avg · $48/1M audio-output tokens |
| text-embedding-3-large (query) | $0.0003 | $13 | ≈ 2,400 tokens × $0.13 / 1M per call |
| Whisper fallback (5% of calls) | $0.0030 | $123 | self-hosted Whisper-large-v3 on g5.xlarge — amortised |
| pgvector + Postgres 16 RDS | fixed | $284 | db.m6i.large · embeddings + tsvector + traces |
| bge-reranker on g5.xlarge | fixed | $378 | shared with Whisper fallback · 24/7 |
| Pinecone serverless (A/B 50%) | $0.0008 | $33 | watchdog mirror · expected to drop after the audit |
| Twilio Voice (inbound) | $0.0170 | $697 | $0.0085/min × 2 min avg per call |
| Cloudflare Workers + R2 | $0.0006 | $26 | edge proxy + audio chunk store |
| Langfuse self-hosted | fixed | $67 | t3.medium · 30-day hot / 1-yr cold |
| All-in per deflected call | ≈ $0.10 | ≈ $4,573 / mo | vs. $4.00 loaded live-agent cost per call · ~40× cheaper at the deflection rate |
Token costs use OpenAI's public Realtime API pricing as of May 2026: $24/1M audio-input, $48/1M audio-output. Twilio costs are list price. Infra costs are AWS us-east-2 list. Loaded live-agent cost ($4.00/call) is the client's own internal blend (wage + benefits + AHT + occupancy + tooling); we used their number, not a market average. Monthly figures assume 41,200 calls/mo at the current 38% deflection rate. Per-call all-in is ~$0.10 on the agent path and ~$2.48 on the handoff path, for a weighted blend of ≈ $1.05. The published math headlines the per-deflected-call number, because that is the relevant comparison against a live agent handling the same call.
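The monthly column of the cost table can be re-added from its line items. A minimal sketch, assuming only the figures printed in the table; `Decimal` is used so the per-call sum is exact, and the dictionary layout is illustrative.

```python
from decimal import Decimal

# (per-call USD, or None for fixed items; monthly USD), copied from the cost table.
LINE_ITEMS = {
    "gpt-realtime-2 audio input": (Decimal("0.0240"), 984),
    "gpt-realtime-2 audio output": (Decimal("0.0480"), 1968),
    "text-embedding-3-large (query)": (Decimal("0.0003"), 13),
    "Whisper fallback": (Decimal("0.0030"), 123),
    "pgvector + Postgres 16 RDS": (None, 284),
    "bge-reranker on g5.xlarge": (None, 378),
    "Pinecone serverless": (Decimal("0.0008"), 33),
    "Twilio Voice (inbound)": (Decimal("0.0170"), 697),
    "Cloudflare Workers + R2": (Decimal("0.0006"), 26),
    "Langfuse self-hosted": (None, 67),
}

monthly_total = sum(monthly for _, monthly in LINE_ITEMS.values())
variable_per_call = sum(
    per_call for per_call, _ in LINE_ITEMS.values() if per_call is not None
)
fixed_monthly = sum(
    monthly for per_call, monthly in LINE_ITEMS.values() if per_call is None
)

print(monthly_total)      # 4573
print(variable_per_call)  # 0.0937
print(fixed_monthly)      # 729
```

The variable items alone come to a little over nine cents per call before the fixed infrastructure is amortised across the month's volume.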
What's in the frozen 240-item set
| category | items | what it checks | ci-gate threshold |
|---|---|---|---|
| Top-5 question golds | 100 | labelled correct answer + retrieved chunk IDs on the 5 questions accounting for 62% of volume | ≥ 0.92 groundedness |
| Latency soak (concurrent) | 20 | 50-concurrent-call replay against the staging Realtime endpoint | p95 ≤ 700ms first-token |
| Accent + noise | 30 | ASR-stress eval drawn from the Common Voice multi-accent slice + a 12-clip noise overlay set | ≥ 0.85 transcript accuracy |
| Must-refuse | 26 | billing disputes, churn-save asks, legal escalations, retention offers, refund promises | 100% refusal · 100% handoff |
| Multilingual handoff | 24 | Spanish-to-English switch mid-call (added after the kill-point) | p99 ≤ 250ms switch latency |
| Adversarial | 40 | jailbreak attempts, role-play coercion, prompt injection through caller statements | ≥ 0.98 refusal |
Eval set is frozen: items are added, never edited. The support lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Per-item replay is deterministic: same audio, same prompt, same retrieved chunks fed via fixture.
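The release rule above can be sketched as a small gate function. This is an illustration under stated assumptions: scores are expressed in points on a 0–100 scale, a category missing from the current cut counts as a full drop, and `gate_release` and its parameters are hypothetical names rather than the real CI job.

```python
def gate_release(
    prior: dict[str, float],
    current: dict[str, float],
    max_drop_points: float = 1.0,
    signed_override: bool = False,
) -> tuple[bool, list[str]]:
    """Return (passed, regressed_categories) for a candidate release.

    A category regresses when its score falls by more than max_drop_points
    relative to the prior cut. A signed override lets the release through
    but still reports the regressions.
    """
    regressions = [
        category
        for category, previous in prior.items()
        if previous - current.get(category, 0.0) > max_drop_points
    ]
    passed = not regressions or signed_override
    return passed, sorted(regressions)


prior_cut = {"must_refuse": 100.0, "adversarial": 98.5, "top5_golds": 93.0}
candidate = {"must_refuse": 100.0, "adversarial": 97.0, "top5_golds": 92.4}

print(gate_release(prior_cut, candidate))
# (False, ['adversarial'])
print(gate_release(prior_cut, candidate, signed_override=True))
# (True, ['adversarial'])
```

Returning the regressed categories even on an override keeps the signed CHANGELOG entry tied to a specific, named drop.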
Your monthly call volume, your monthly bill.
Drag the slider to your tier-1 inbound voice volume. The numbers below recompute against the published $0.10/call agent cost and a $4/call loaded live-agent baseline. The bar chart shows where the agent ROI compounds against pure-human staffing.
- agent monthly $: $103,742
- 100% human baseline: $164,800
- monthly savings: $61,058
- savings vs baseline: 37.0%
Math assumptions: $0.10/call all-in (Realtime API tokens + RAG infra + edge), $4.00/call loaded live-agent cost (wage + benefits + AHT + occupancy + tooling), 38% tier-1 deflection rate (95% CI 33%–43%, n=11,400). Switch any assumption and the slider stays honest; the page math doesn't.
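The slider arithmetic can be reproduced directly. A minimal sketch, assuming deflected calls cost $0.10 each and handed-off calls cost the full $4.00 live-agent rate with no agent cost added on top; that reading is an assumption, chosen because it reproduces the figures shown at 41,200 calls/month. The `monthly_costs` function is an illustrative name.

```python
def monthly_costs(
    calls: int,
    deflection_rate: float = 0.38,
    agent_cost: float = 0.10,
    human_cost: float = 4.00,
) -> dict[str, float]:
    """Compare agent-assisted monthly spend against an all-human baseline."""
    deflected = calls * deflection_rate
    handed_off = calls - deflected
    agent_monthly = deflected * agent_cost + handed_off * human_cost
    baseline = calls * human_cost
    savings = baseline - agent_monthly
    return {
        "agent_monthly": agent_monthly,
        "baseline": baseline,
        "savings": savings,
        "savings_pct": 100 * savings / baseline,
    }


result = monthly_costs(41_200)
print(round(result["agent_monthly"]))  # 103742
print(round(result["baseline"]))       # 164800
print(round(result["savings"]))        # 61058
```

Under these assumptions the savings percentage does not depend on call volume: it is the deflection rate multiplied by the per-call cost gap, so only the deflection rate and the two unit costs move it.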
What runs every week,
and who owns it.
Production ops is part of the build, not an afterthought. Four controls keep latency and CSAT honest after cutover.
Monday flagged-turn review
Every handoff, every groundedness near-miss, every latency outlier past p95. Patterns that repeat (>3 same flag/wk) become a JIRA ticket against the eval set.
Trace retention
Langfuse in customer VPC. Per-turn: audio segment, STT transcript, retrieved chunks, model output, tool invocations, call-state at handoff, final caller-facing audio.
On-call rotation
One SRE per week. 99.5% pipeline-availability SLO + p95 ≤ 700ms first-token SLO on the LLM-routed path.
CSAT sample review
40 deflected calls / month. CX team listens, scores, feeds the prompt + retrieval iteration loop. Doesn't touch the frozen eval.
The timeline
including the week we sat on our hands.
Five stages, milestone-billed. The week-5 shadow run found a 1.4-second multilingual latency spike that would have torched the SLO in production. We halted the cutover, fixed the cache invalidation bug, added a tier-cached language-detection prefetch, and only then promoted the AI contact center to primary. The honest version of shipping in 10 weeks includes the week we didn't ship.
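The bug class is easy to show in miniature. The sketch below is a hypothetical reconstruction, not the production routing code: it keeps a per-call language entry that is overwritten on every detection (the step the write-up says was missing) and warms the supported languages up front as a stand-in for the prefetch. `LanguageRouteCache`, its methods, and the short call ids are invented for illustration.

```python
class LanguageRouteCache:
    """Per-call language cache that is refreshed whenever detection changes."""

    def __init__(self, supported: list[str]) -> None:
        # Stand-in for the prefetch: every supported language has a warm route.
        self._warm_routes = {lang: f"route:{lang}" for lang in supported}
        self._by_call: dict[str, str] = {}

    def observe(self, call_sid: str, detected: str) -> bool:
        """Record the latest detection; return True if the caller switched language."""
        previous = self._by_call.get(call_sid)
        # Always overwrite. Caching only the first result is the failure
        # mode described above.
        self._by_call[call_sid] = detected
        return previous is not None and previous != detected

    def route_for(self, call_sid: str, default: str = "en") -> str:
        """Return the warm route for the call's most recently detected language."""
        language = self._by_call.get(call_sid, default)
        return self._warm_routes.get(language, self._warm_routes[default])


cache = LanguageRouteCache(supported=["en", "es"])
print(cache.observe("call-1", "es"))  # False
print(cache.observe("call-1", "en"))  # True
print(cache.route_for("call-1"))      # route:en
```

Because every supported route is warm before the first turn, a mid-call switch becomes a dictionary lookup rather than a cold path.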
- Weeks 1–2
Discovery + frozen eval set
Two weeks shadowing the existing tier-1 voice queue. Pulled six months of call recordings (de-identified, customer consent already on file) and let the support lead label them. 240 frozen eval items — the 5 questions accounting for 62% of volume plus 30% adversarial (accent, background noise, multi-turn corrections) and 8% must-refuse (legal escalations, billing disputes, churn-save asks).
240-item eval set, must-refuse list, and latency SLO of 700ms
- Week 3
Stack bake-off
Two pipelines built in parallel: native gpt-realtime-2 speech-to-speech and a chained Whisper → GPT-5 → ElevenLabs path. Both wired to the same RAG over the help-center corpus. Ran 240 eval items through each, plus a soak test at 50 concurrent calls. Realtime won on p95 first-token by ~360ms; chained won on cost per minute by 28% but failed the latency SLO at the 95th percentile. Picked Realtime primary, chained kept as a documented fallback for the multilingual lane.
Realtime primary, chained fallback, SLO-passing prototype
- Week 4
Help-center RAG + tool surface
Ingested 8,200 help-center articles into pgvector 0.7 (and mirrored into Pinecone serverless for the cost / tail-latency A/B). 480-token chunks, 80-token overlap, embeddings via text-embedding-3-large, cross-encoder rerank with bge-reranker-large. Three tools wired into the Realtime function-calling surface: lookup_article, handoff_to_human, schedule_callback. Zero write tools; the agent cannot mutate a customer record.
Hybrid retrieval at 0.89 recall@5. 3-tool surface frozen.
- Week 5
Shadow run — multilingual latency spike
Two weeks shadowing the live queue (silent — calls still went to humans; the agent's response was logged but not played). Day 9 the SRE on rotation flagged a p99 latency spike on Spanish-to-English handoff calls: 1.4 seconds of dead air at the language switch. Root cause was a cache invalidation bug in the language-detection routing — first detection result was cached per call SID but never invalidated when the caller switched language mid-call. We halted prod cutover, added a tier-cached language detection prefetch at call start (every supported language warmed in the cache before the model needs it), and re-ran the soak. The honest version of `4-week shadow` includes this week.
p99 multilingual latency: 1,420ms before the fix, 210ms after
Walk-away point
- Weeks 6–10
Production cutover + cost lock-in
Promoted to primary on the tier-1 inbound queue with the live agent line in warm-standby on a 1-second timer. Weeks 6–8 ran at 20% traffic with the support lead reviewing every flagged conversation. Weeks 9–10 ramped to 100% tier-1. The unit-economics table on this page is the production-cut math at the 41k-call/month volume we currently see (not a projection).
Full cutover. $0.10/call published. Per-call trace store on hot retention.
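The week-4 chunking parameters (480-token chunks, 80-token overlap) describe a sliding window with a stride of 400 tokens. A minimal sketch follows, assuming a pre-tokenised article; the real tokenizer and any boundary handling are not stated in the case study, and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(
    tokens: list[str], size: int = 480, overlap: int = 80
) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the next."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the article
    return chunks


# Synthetic 1,000-token article.
article = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(article)

print(len(chunks))                        # 3
print(len(chunks[-1]))                    # 200
print(chunks[0][-80:] == chunks[1][:80])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some tokens twice.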
How we know
it works.
The voice bot eval set is frozen. Every model bump, prompt change, retrieval tweak, and tool-schema change re-runs the full 240. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2 points across all rows over the last 30 days.
Sample size for the deflection number is n=11,400 inbound calls across the 6-week shadow + 4-week production cut. The 38% point estimate has a 95% confidence interval of 33%–43%. First-token latency p95 is measured at the agent boundary (caller-side jitter buffer excluded). Per-call cost is the all-in deflected-call number; weighted-blended cost across deflected + handoff paths is ~$1.05/call. Multilingual handoff latency is measured at language-switch detection; per the kill-point fix, p99 now sits at 210ms.
Voice bot vs chatbot — what's the difference?
A voice bot answers a phone call. A chatbot answers a text message in a web widget or messaging app. Same model family can power both, but the engineering is different: voice bots stream bidirectional audio with sub-second latency, run an STT/TTS layer (or a unified Realtime model), and have to handle interruption and barge-in. Chatbots have turn-based text — no streaming-audio latency budget, no acoustic environment. A contact center running both surfaces typically shares the knowledge corpus and the eval set but ships them as two products with two latency SLOs.
Voice bot vs IVR — what's the difference?
An IVR (interactive voice response) is a touch-tone or fixed-phrase menu that routes the caller through a decision tree. It cannot understand free-form speech and cannot reason across a knowledge corpus. A voice bot understands what the caller actually says ("I'm calling because my last invoice charged me twice"), retrieves the relevant policy, and answers — or hands off when confidence is low. IVRs sit in front of voice bots in many deployments: the IVR handles caller authentication and routing; the voice bot handles the conversational answer.
The four shapes we turn down
before scoping a pilot.
A voice bot built on the OpenAI Realtime API patterns above will burn money or hurt CSAT in any of the following situations. We turn down the engagement before a pilot is scoped.
HIPAA-equivalent regulated voice
Healthcare patient lines, financial advisory, legal client calls. OpenAI's Realtime API is not BAA-eligible as of May 2026. The workflow needs a different architecture (Whisper-on-prem + GPT-4o-mini behind Azure OpenAI + ElevenLabs Enterprise) at materially higher per-call cost.
Accent or dialect not in the eval
The accent + noise eval has 30 items from Common Voice: mostly US callers, plus EN-IN, EN-AU, and EN-GB. If inbound traffic is meaningfully different (heavy AAVE, regional Caribbean English, L2 speakers), the eval has to grow before the model takes calls, and the 38% deflection number does not transfer.
Low-bandwidth / feature-phone callers
First-token math depends on a continuous 8kHz+ stream. Callers on flaky cellular, hotel landlines, or feature phones over G.711 µ-law degrade Whisper WER 3–5x. The latency tail blows out and CSAT inverts. The right product is async IVR with callback, not a real-time voice agent.
Recording obligations the agent can't satisfy
Two-party-consent states (CA, FL, MA, MD, MT, NV, NH, PA, WA), MiFID II / Dodd-Frank, regulated medical advice. Recording + retention + disclosure posture exists at week 1, or the pilot doesn't get signed.
What buyers ask first.
Real answers, no hedging.
What is a voice bot?
How does the OpenAI Realtime API differ from chained STT + LLM + TTS?
How much does a voice bot cost per call?
How fast is the OpenAI Realtime API?
Can a voice bot transfer to a human agent?
How does a voice bot integrate with Twilio?
How long does it take to build a voice bot?
When should we NOT ship a voice bot?
Where this case study
points back to.
Each link below covers a pillar that fed into this voice bot / AI contact center build, or that a similar OpenAI Realtime API build on your stack would draw from.
OpenAI Development
OpenAI Realtime API depth, GPT-5 / GPT-5-mini routing, OpenAI voice API integration patterns, Azure OpenAI for regulated deployments. The model pillar this voice bot sits on.
AI Voice Agents
The voice-agent pillar — telephony, barge-in tuning, IVR replacement, multilingual support. Same eval-first loop used on this AI contact center build.
AI Chatbot Development
Async-text sibling — use when an AI voicebot is the wrong primitive or async-first beats real-time on cost.
AI Agent Development
Function-calling, tool surfaces, multi-step agents. The voice bot is one interface; the same agentic stack runs underneath.
All AI Case Studies
Six case studies — voice bot, AI fraud detection, AI knowledge base, AI triage, AI legal assistant, ecommerce chatbot. Same operator detail across every page.
AI Consulting
Fixed-fee voice bot / AI contact center audit. We'll model your call volume against the cost slider above and tell you honestly if AI customer service software pays back.
AI Development Services
How a Realtime voice agent fits inside a broader AI development services engagement — STT + GPT + TTS + telephony + CRM integration + monitoring.
Want a voice bot like this
for your AI contact center?
Book a fixed-fee voice bot / AI contact center audit. We'll review the inbound voice workflow, model your call volume against the cost slider on this page, scope the eval set, recommend a stack (OpenAI Realtime API / chained / hybrid), project run-cost, and tell you honestly whether AI customer service software is the right shape for your traffic, or whether an AI-powered call center buy beats a build. About one audit in five ends with `keep the humans, here's the smaller automation we'd ship instead.`