B2B SaaS contact center · Tier-1 support Voice bot · OpenAI voice agent on the gpt realtime stack · function-calling handoff
gpt-realtime-2 · Whisper-large-v3 · pgvector 0.7 · Twilio Voice · Cloudflare Workers
voice bot · chatgpt voice agent · openai case study · 2026 · anonymized

A voice bot for a SaaS AI contact center,
built on the OpenAI Realtime API.

A US-based mid-market B2B SaaS support team's tier-1 voice queue was averaging four-minute waits at peak; five inbound questions accounted for 62% of call volume; their legacy IVR bounced 80%+ to a human. We built the AI contact center software on the OpenAI Realtime API (gpt-realtime-2) over their help-center corpus — Twilio in, streaming tokens to TTS, confidence-gated handoff to a live rep. Buyer framings layer on the same build: voice bot for procurement, AI-powered call center for CX, AI customer service software for finance. Eval-first, 10 weeks, with a kill point in week 5 we used.

≈ 38%
tier-1 deflection (95% CI 33%–43% · n=11,400 calls)
p95 580ms
first-token latency · SLO 700ms
$0.10
all-in per-deflected-call · vs $4 live-agent loaded cost
10 wks
discovery + 6-wk shadow + 4-wk prod cut
shipped
10 weeks · 3 engineers · 1 support lead
Summary

What this case study shows

A SaaS customer-support team shipped an OpenAI Realtime voice agent over their help-center RAG corpus, fronted by Twilio for telephony. Across n=11,400 calls, the agent deflected 38% of tier-1 voice traffic (95% CI 33%–43%) at $0.10 per deflected call against a $4 live-agent baseline, with p95 first-token latency of 580 milliseconds. Stack: gpt-realtime-2, Whisper-large-v3, pgvector 0.7, Twilio Voice, Cloudflare Workers, Langfuse. The handoff_to_human function-calling tool triggered when confidence dropped below 0.7. Ten weeks from discovery to production. This is one shape of a broader voice agent services engagement; the same pipeline carries over to telephony, kiosk, and in-app voice surfaces.

latency budget

p95 first-token, visualised.

Total budget — caller-mouth to caller-ear — is 580ms. Each band's width is its share of that budget. The reasoning + RAG step is the long pole; the rest are kept honest by Cloudflare Workers and the Twilio media edge.

  1. Caller speech ingress 62ms
  2. STT (gpt-realtime-2 audio in) 118ms
  3. Reasoning + RAG retrieval 264ms
  4. TTS first-audio frame 88ms
  5. Twilio egress to caller 48ms

Deterministic replay — these bars are not a recording; they are a layout-stable visualisation of the p95 first-token latency budget. Per-stage numbers are pulled from Langfuse trace aggregates over a 30-day production window.

4 min
average tier-1 voice queue wait at peak
62%
of inbound call volume tied to the same 5 questions
80%+
IVR bounce rate to a human (existing tree)
700ms
first-token ceiling before callers hear 'robot'
the problem

A tier-1 voice queue
that wasn't worth a human.

A US-based mid-market B2B SaaS ($80M+ ARR) with a 14-rep tier-1 team handling ~41,000 calls/month. 62% of inbound volume on the same 5 questions. IVR bouncing 80%+ to a queue with a 4-minute peak wait. Tier-1 reps burning 71% of their day on questions any of them could answer in their sleep.

today vs · with the agent

today · tier-1 voice queue

Caller dials → IVR tree (press 1 for…) → Hold · 4 min peak → Live agent · 5 same Qs
outcome: Long wait. Agents burned on repetitive tier-1 calls.

with the agent

Caller dials → Twilio + edge audio (<60ms ingress) → gpt-realtime-2 + RAG (streaming · confidence-gated) → Decision branch (answer · or handoff_to_human)
outcome: Resolved · ≈ 38%
outcome: Handoff to human · live transfer with transcript
outcome: Failsafe queue · model self-refuses

The binding constraint was latency, not accuracy. The support lead and the head of CX had piloted two text-channel chatbots the previous year and shelved both when CSAT dipped. What changed: US callers reliably report the experience as "robotic" when a synchronous voice line holds more than ~700ms of dead air after they finish speaking. Past that ceiling, deflected calls don't actually deflect — they bounce to a human angrier than when they arrived. Vendor buy-vs-build: legacy contact-center suites bolted LLMs onto IVR trees, conversational-IVR layers carried per-minute pricing, voice-bot SaaS priced on monthly seats. None carried a frozen eval, a p95 first-token SLO, or a per-call cost number reconcilable from public vendor pricing. The conversation we walked into was: show us how a voice agent could miss the 700ms ceiling, and tell us how you'd catch it before a customer hears it.

voice's defining constraint
"robotic" threshold: 700ms
our p95 first-token: 580ms
our p50 first-token: 480ms
chained baseline (rejected): 940ms

Native speech-to-speech vs chained STT → text-LLM → TTS bought back ~360ms of p95 (940ms → 580ms).

discovery · week 2

The thing that scares us isn't a wrong answer; wrong answers we can recover from. What scares us is one second of silence on the line. The customer has already decided they're talking to a robot, and they're done. Show us the latency tail before you show us the deflection number.

Head of CX B2B SaaS · tier-1 support · 41k calls/mo
the approach · voice bot pipeline

Voice bot pipeline — six stages,
one branch out to a human.

Twilio ingress → Cloudflare Workers edge proxy → gpt-realtime-2 native speech-to-speech (Whisper-large-v3 fallback) → hybrid pgvector + Pinecone A/B retrieval → bge-reranker-large → 3-tool surface (lookup_article, handoff_to_human, schedule_callback). Zero write tools. The streaming dots in the diagram are real — gpt-realtime-2 does not wait until generation finishes before TTS starts. That property is what gets p95 first-token under 600ms.

three decisions that shaped the voice bot build
design decision · 01

gpt-realtime-2 speech-to-speech as the primary path

we rejected
Chained STT → text-LLM → TTS pipeline (Whisper + GPT-5 + ElevenLabs)
because
On the eval we ran, chained got us to p95 ≈ 940ms first-token, already past the 700ms `feels-robotic` threshold US callers reported. Native speech-to-speech buys us back ~360ms (580ms p95). Whisper still ships as a fallback when the Realtime audio path can't decode accent or noise.
design decision · 02

handoff_to_human as a function-calling tool, not a fallback timeout

we rejected
Confidence threshold on the model's own self-reported probability
because
Self-reported confidence on Realtime models is poorly calibrated under stream pressure (Anthropic and OpenAI both publish this). A first-class tool the model can call explicitly is more honest: the model knows what it doesn't know better than it knows how sure it is.
design decision · 03

pgvector 0.7 primary + Pinecone serverless on a 50/50 A/B mirror

we rejected
Pick one vector store and commit
because
Help-center retrieval recall was the second-biggest determinant of deflection (after first-token latency) on the eval. Running both in production for 6 weeks let us measure not just recall@5 but cost per query and tail latency under real traffic. pgvector won on cost-per-query; Pinecone won on tail-latency variance. We kept pgvector primary and the mirror stays as a watch-the-shop sanity check.
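One way to run a stable 50/50 split like the one described above is to hash a per-call identifier into a bucket, so a given call always lands on the same store. The sketch below assumes the split is made per call and keyed on the Twilio call SID; the function name `retrieval_arm` and the arm labels are hypothetical, not taken from the build.

```python
import hashlib


def retrieval_arm(call_sid: str, mirror_share: float = 0.5) -> str:
    """Deterministically assign a call to the 'pinecone' or 'pgvector' arm.

    Assumption: the A/B split is keyed on the call SID. Hashing keeps the
    assignment stable across retries without storing any state.
    """
    digest = hashlib.sha256(call_sid.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "pinecone" if bucket < mirror_share else "pgvector"


if __name__ == "__main__":
    sid = "CA" + "0123456789abcdef" * 2  # made-up SID for illustration
    print(sid, "->", retrieval_arm(sid))
```

Because the bucket depends only on the SID, per-arm recall, cost, and tail-latency numbers can be joined back to traces after the fact.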
why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.

Realtime decision model

Tier-1 deflection precision at 0.7 confidence on the frozen 240-item eval. Function-calling handoff is a deliberate tool, not a fallback timeout — the model knows what it doesn't know better than it knows how sure it is.

Telephony ingress

Round-trip latency from carrier to edge. Cloudflare Workers buys back ~28ms median vs origin.

Hybrid retrieval

Recall@5 + cost-per-query on the eval. pgvector primary, Pinecone serverless as A/B watchdog.

Cross-encoder rerank

Top-1 precision on the held-out slice.

TTS first-frame

Token-to-playback latency.

Handoff to human

PagerDuty page-to-pickup time. Warm transfer with transcript.

under the hood

The realtime voice agent,
round-trip.

Caller speaks. Audio streams to gpt-realtime-2, grounded on the help-center RAG. The model either answers, streaming tokens straight back to TTS so the first audio frame leaves the edge inside ~580ms, or calls the handoff_to_human tool and PagerDuty pages a live agent.

first-token p95 580ms end-to-end · streaming tokens flow continuously from gpt-realtime-2 to TTS · branch fires on confidence < 0.7

11,400
shadow + production calls used for the deflection CI
0
autonomous policy changes; agent only answers tier-1 from the help-center RAG
p50 480ms
first-token median; tail-latency budget detailed below
1 SRE on call
24/7 rotation; Langfuse + PagerDuty wired for sub-second cutover
the stack · voice bot + OpenAI Realtime

Voice bot + OpenAI Realtime stack — named tools,
named versions.

Everything in the build is a thing your security team can write a question about. Nothing is `our proprietary AI`. The eval set, prompts, and tool schemas are all checked into the customer's repo. Vendor swap-out cost is bounded by design.

gpt-realtime-2 · OpenAI Realtime API (2026-04) · role: primary speech-to-speech
Whisper-large-v3 · OpenAI, self-hosted on g5.xlarge · role: STT fallback
pgvector 0.7 · Postgres 16 · role: embedding retrieval
BAAI bge-reranker-large · v2.5 · role: cross-encoder rerank
Pinecone serverless · us-east-1 · role: A/B mirror vector store
Twilio Programmable Voice · SIP · 2026-03 API · role: telephony
Cloudflare Workers · Durable Objects · role: edge audio transport
ElevenLabs Turbo v2.5 · Multilingual · role: TTS fallback / handoff voice
Langfuse · self-hosted · t3.medium · role: per-call trace + override review
PagerDuty · role: human handoff incident routing
how it actually runs

Production shape,
under the hood.

The numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses OpenAI's public Realtime API pricing as of May 2026; eval composition is the frozen 240-item set the CI gates on.

How fast is the OpenAI Realtime API?

voice bot latency budget · voice ai latency p95

Per-stage P50 / P95 (ms)

stage | p50 (ms) | p95 (ms) | tooling
Twilio ingress + edge proxy | 38 | 62 | Twilio Programmable Voice · Cloudflare Workers Durable Objects
STT (Realtime audio in) | 82 | 118 | gpt-realtime-2 native audio · Whisper-large-v3 fallback on miss
Hybrid retrieval | 64 | 96 | pgvector 0.7 top-40 ∥ Pinecone serverless top-40 (A/B) → RRF k=60
Cross-encoder rerank | 44 | 72 | BAAI bge-reranker-large · g5.xlarge in customer VPC · top-12
gpt-realtime-2 decision | 196 | 264 | OpenAI Realtime API · function-calling · ~2,800 in · streaming out
TTS first audio | 84 | 124 | gpt-realtime-2 native TTS · ElevenLabs Turbo v2.5 fallback
Twilio egress to caller | 32 | 48 | media stream reverse leg · jitter buffer ≤ 80ms
Total to first-token | 480 | 580 | agent boundary (excludes caller-side jitter buffer)

p50/p95 from 30-day rolling window over n ≈ 41,200 production calls. SLO is p95 ≤ 700 ms first-token; current burn ≈ 83%. The kill-point fix (multilingual cache invalidation) is the only regression event in the last 60 days.
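The hybrid-retrieval row names reciprocal rank fusion (RRF) with k=60. As a minimal sketch of that step, each document scores the sum of 1/(k + rank) over every ranked list it appears in. The function name `rrf_fuse`, the example chunk IDs, and the choice to keep 12 candidates for the reranker are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=12):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    rankings: iterable of lists, each ordered best-first.
    k:        smoothing constant (60 is the value named in the table).
    top_n:    how many fused candidates to keep (assumed to be 12 here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


if __name__ == "__main__":
    dense = ["kb_a", "kb_b", "kb_c"]    # hypothetical ranked list 1
    lexical = ["kb_b", "kb_d", "kb_a"]  # hypothetical ranked list 2
    print(rrf_fuse([dense, lexical]))
```

RRF only needs ranks, not comparable scores, which is why it suits fusing lists from stores that score differently.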

slo headroom

Where the 700ms SLO budget goes.

Anything slower than 700ms first-token reads as a robot to a US caller, which is the binding constraint on this whole engagement. Current p95 is 580ms; the 120ms wedge below 700ms is the headroom available to absorb future prompt growth or a slow third-party fallback.

  • Twilio ingress 62ms
  • STT (Realtime/Whisper) 118ms
  • RAG + reasoning 264ms
  • TTS first audio 88ms
  • Twilio egress 48ms
  • SLO threshold 700ms
  • Headroom under SLO 120ms
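The arithmetic behind the bands above can be checked in a few lines. This sketch treats the five bands as additive because that is how the page presents them; p95 values of separate stages do not add up in general, so read it as a budget layout rather than a statistical identity. The function name is hypothetical.

```python
# Band values copied from the list above (milliseconds).
P95_BANDS_MS = {
    "Twilio ingress": 62,
    "STT (Realtime/Whisper)": 118,
    "RAG + reasoning": 264,
    "TTS first audio": 88,
    "Twilio egress": 48,
}
SLO_MS = 700


def budget_summary(bands, slo_ms):
    """Return total budget, headroom under the SLO, and SLO burn fraction."""
    total = sum(bands.values())
    return {
        "total_ms": total,
        "headroom_ms": slo_ms - total,
        "burn": total / slo_ms,
        "shares": {name: ms / total for name, ms in bands.items()},
    }


if __name__ == "__main__":
    summary = budget_summary(P95_BANDS_MS, SLO_MS)
    print(summary["total_ms"], summary["headroom_ms"], f"{summary['burn']:.0%}")
```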
realtime/tools/handoff_to_human.tool.json jsonc
// realtime/tools/handoff_to_human.tool.json
// Function-calling JSON schema registered on session.update.tools[].
// The model can request handoff for any reason by calling this tool. The
// runtime then gates the side-effect (PagerDuty page, warm transfer to a
// live agent): it fires on confidence < 0.7 OR an explicit caller request
// OR a must-refuse category match. The confidence value is the top-level
// "confidence" argument below, not a field of call_state.
{
  "type": "function",
  "name": "handoff_to_human",
  "description": "Transfer this call to a live tier-1 support agent. Use when the caller's intent falls outside the help-center corpus, when the model's own confidence in the retrieved answer is below 0.7, when the caller explicitly asks for a human, or when the conversation hits a must-refuse category (billing dispute, churn-save, legal escalation).",
  "parameters": {
    "type": "object",
    "required": ["reason", "confidence", "call_state"],
    "properties": {
      "reason": {
        "type": "string",
        "enum": [
          "low_confidence",
          "out_of_scope",
          "caller_request",
          "must_refuse_category",
          "multilingual_handoff"
        ],
        "description": "Why the handoff is being requested. Used for routing + analytics."
      },
      "confidence": {
        "type": "number",
        "minimum": 0,
        "maximum": 1,
        "description": "Model's confidence in the retrieved answer at the moment of handoff. The runtime gate trips on < 0.7 for the low_confidence reason; other reasons bypass the threshold."
      },
      "call_state": {
        "type": "object",
        "required": ["call_sid", "language", "transcript_summary", "retrieved_chunk_ids"],
        "properties": {
          "call_sid":            { "type": "string", "pattern": "^CA[a-f0-9]{32}$" },
          "language":            { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
          "transcript_summary":  { "type": "string", "maxLength": 800 },
          "retrieved_chunk_ids": {
            "type": "array",
            "items": { "type": "string", "pattern": "^kb_[a-f0-9]{12}$" },
            "minItems": 0,
            "maxItems": 12
          }
        }
      }
    }
  }
}
The handoff tool schema registered on session.update.tools[]. The runtime gates the side-effect (PagerDuty page, warm transfer) on confidence < 0.7 OR explicit caller request OR a must-refuse category match. The model can call the tool for any reason, but the gate decides whether it actually fires.
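A minimal sketch of the runtime gate described above, assuming it receives the `reason` and `confidence` arguments from the tool call. Per the schema's own description, only the `low_confidence` reason is subject to the 0.7 threshold and the other reasons bypass it. The function and constant names are hypothetical, and the real gate may consult more call state than this.

```python
CONFIDENCE_FLOOR = 0.7

# Reasons that bypass the threshold, per the "confidence" field description.
BYPASS_REASONS = frozenset({
    "out_of_scope",
    "caller_request",
    "must_refuse_category",
    "multilingual_handoff",
})


def should_fire_handoff(reason: str, confidence: float) -> bool:
    """Decide whether a requested handoff triggers the side-effect."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if reason in BYPASS_REASONS:
        return True
    if reason == "low_confidence":
        return confidence < CONFIDENCE_FLOOR
    raise ValueError(f"unknown handoff reason: {reason}")


if __name__ == "__main__":
    print(should_fire_handoff("low_confidence", 0.62))  # below the floor
    print(should_fire_handoff("caller_request", 0.95))  # bypasses the floor
```

Keeping the gate outside the model means the page and transfer happen only when the runtime agrees with the request.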

How much does a voice bot cost?

voice bot unit economics · ai contact center cost math

Per-call and monthly cost math (≈ 41k calls/mo)

line item | $ / call | $ / month | note
gpt-realtime-2 audio input | $0.0240 | $984 | ~2 min avg call · $24/1M audio-input tokens (May 2026)
gpt-realtime-2 audio output | $0.0480 | $1,968 | ~45 sec agent speech avg · $48/1M audio-output tokens
text-embedding-3-large (query) | $0.0003 | $13 | ≈ 2,400 tokens × $0.13 / 1M per call
Whisper fallback (5% of calls) | $0.0030 | $123 | self-hosted Whisper-large-v3 on g5.xlarge (amortised)
pgvector + Postgres 16 RDS | fixed | $284 | db.m6i.large · embeddings + tsvector + traces
bge-reranker on g5.xlarge | fixed | $378 | shared with Whisper fallback · 24/7
Pinecone serverless (A/B 50%) | $0.0008 | $33 | watchdog mirror · expected to drop after the audit
Twilio Voice (inbound) | $0.0170 | $697 | $0.0085/min × 2 min avg per call
Cloudflare Workers + R2 | $0.0006 | $26 | edge proxy + audio chunk store
Langfuse self-hosted | fixed | $67 | t3.medium · 30-day hot / 1-yr cold
All-in per deflected call | ≈ $0.10 | ≈ $4,573 / mo | vs. $4.00 loaded live-agent cost per call · ~40× cheaper at the deflection rate

Token costs use OpenAI's public Realtime API pricing as of May 2026: $24/1M audio-input, $48/1M audio-output. Twilio costs are list price. Infra costs are AWS us-east-2 list. Loaded live-agent cost ($4.00/call) is the client's own internal blend (wage + benefits + AHT + occupancy + tooling); we used their number, not a market average. Monthly figures assume 41,200 calls/mo at the current 38% deflection rate. Per-call all-in is ~$0.10 on the agent path and ~$2.48 on the handoff path, ≈ $1.05 blended across both. The published math headlines the per-deflected-call number because that is the relevant comparison against a live agent handling the same call.
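The monthly column of the cost table can be re-added as a quick reconciliation check. The figures below are copied from the table; the grouping into OpenAI, Twilio, and everything else is an illustrative choice, not the authors' own breakdown.

```python
# Monthly figures copied from the cost table above (USD per month).
MONTHLY_USD = {
    "gpt-realtime-2 audio input": 984,
    "gpt-realtime-2 audio output": 1968,
    "text-embedding-3-large (query)": 13,
    "Whisper fallback (5% of calls)": 123,
    "pgvector + Postgres 16 RDS": 284,
    "bge-reranker on g5.xlarge": 378,
    "Pinecone serverless (A/B 50%)": 33,
    "Twilio Voice (inbound)": 697,
    "Cloudflare Workers + R2": 26,
    "Langfuse self-hosted": 67,
}

# Illustrative grouping: lines billed by OpenAI.
OPENAI_LINES = (
    "gpt-realtime-2 audio input",
    "gpt-realtime-2 audio output",
    "text-embedding-3-large (query)",
)

total = sum(MONTHLY_USD.values())
openai_spend = sum(MONTHLY_USD[name] for name in OPENAI_LINES)
twilio_spend = MONTHLY_USD["Twilio Voice (inbound)"]
other_spend = total - openai_spend - twilio_spend

# Query-embedding cost per call, from the table note: 2,400 tokens at $0.13/1M.
embedding_per_call = 2_400 * 0.13 / 1_000_000

if __name__ == "__main__":
    print(f"total ${total:,} = OpenAI ${openai_spend:,} "
          f"+ Twilio ${twilio_spend:,} + other ${other_spend:,}")
    print(f"embedding per call ${embedding_per_call:.6f}")
```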

voice bot eval composition

What's in the frozen 240-item set

category | items | what it checks | ci-gate threshold
Top-5 question golds | 100 | labelled correct answer + retrieved chunk IDs on the 5 questions accounting for 62% of volume | ≥ 0.92 groundedness
Latency soak (concurrent) | 20 | 50-concurrent-call replay against the staging Realtime endpoint | p95 ≤ 700ms first-token
Accent + noise | 30 | ASR-stress eval drawn from the Common Voice multi-accent slice + a 12-clip noise overlay set | ≥ 0.85 transcript accuracy
Must-refuse | 26 | billing disputes, churn-save asks, legal escalations, retention offers, refund promises | 100% refusal · 100% handoff
Multilingual handoff | 24 | Spanish-to-English switch mid-call (added after the kill-point) | p99 ≤ 250ms switch latency
Adversarial | 40 | jailbreak attempts, role-play coercion, prompt injection through caller statements | ≥ 0.98 refusal

Eval set is frozen: items are only added, never edited. The support lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Per-item replay is deterministic: same audio, same prompt, same retrieved chunks fed via fixture.
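The regression rule above (fail if any category drops more than 1 point from the prior cut, unless a signed override exists) might look like the following in a CI step. This is a sketch under assumptions: scores are expressed in points on a 0–100 scale, and the function name and the boolean override flag are hypothetical stand-ins for whatever the real pipeline checks.

```python
def gate_release(prior, current, max_drop_points=1.0, signed_override=False):
    """Compare per-category eval scores (in points) between two cuts.

    Returns (passed, regressions), where regressions maps each failing
    category to the size of its drop. A category missing from the current
    cut is treated as a score of 0, so it always fails the gate.
    """
    regressions = {}
    for category, prior_score in prior.items():
        drop = prior_score - current.get(category, 0.0)
        if drop > max_drop_points:
            regressions[category] = drop
    passed = not regressions or signed_override
    return passed, regressions


if __name__ == "__main__":
    prior = {"must_refuse": 100.0, "adversarial": 98.5}   # made-up scores
    current = {"must_refuse": 100.0, "adversarial": 97.0}
    print(gate_release(prior, current))
```

Returning the regressions alongside the verdict lets the CI log say which category moved.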

interactive cost math

Your monthly call volume, your monthly bill.

Set the slider to your tier-1 inbound voice volume. The figures recompute against the published $0.10/call agent cost and a $4/call loaded live-agent baseline, and show how the agent's savings scale against pure-human staffing.

monthly inbound call volume: 41,200 (slider range 1,000–500,000)
agent monthly $: $103,742
100% human baseline: $164,800
monthly savings: $61,058
savings vs baseline: 37.0%

Math assumptions: $0.10/call all-in (Realtime API tokens + RAG infra + edge), $4.00/call loaded live-agent cost (wage + benefits + AHT + occupancy + tooling), 38% tier-1 deflection rate (95% CI 33%–43%, n=11,400). Change any assumption and the slider figures recompute; the static numbers elsewhere on this page do not.
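The displayed figures are consistent with a simple blend: deflected calls are billed at the agent cost and every other call at the loaded live-agent cost. The sketch below reproduces the 41,200-call numbers under that inferred formula; the function name and defaults are illustrative, and the page's own calculator may differ in detail.

```python
def monthly_bill(calls, deflection=0.38, agent_cost=0.10, human_cost=4.00):
    """Blend agent-handled and human-handled calls into a monthly bill (USD)."""
    baseline = calls * human_cost  # every call handled by a live agent
    with_agent = (calls * deflection * agent_cost
                  + calls * (1 - deflection) * human_cost)
    savings = baseline - with_agent
    return {
        "agent_monthly": with_agent,
        "baseline": baseline,
        "savings": savings,
        "savings_pct": 100 * savings / baseline,
    }


if __name__ == "__main__":
    bill = monthly_bill(41_200)
    for key, value in bill.items():
        print(f"{key}: {value:,.2f}")
```

At fixed per-call costs the savings percentage depends only on the deflection rate, so the 33%–43% confidence interval moves it far more than call volume does.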

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. Four controls keep latency and CSAT honest after cutover.

Monday flagged-turn review

Every handoff, every groundedness near-miss, every latency outlier past p95. Patterns that repeat (>3 same flag/wk) become a JIRA ticket against the eval set.

Trace retention

Langfuse in customer VPC. Per-turn: audio segment, STT transcript, retrieved chunks, model output, tool invocations, call-state at handoff, final caller-facing audio.

On-call rotation

One SRE per week. 99.5% pipeline-availability SLO + p95 ≤ 700ms first-token SLO on the LLM-routed path.

CSAT sample review

40 deflected calls / month. CX team listens, scores, feeds the prompt + retrieval iteration loop. Doesn't touch the frozen eval.
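The Monday review rule (a flag repeating more than 3 times in a week becomes a ticket against the eval set) reduces to a count over the week's flags. A minimal sketch, with hypothetical flag labels and function name:

```python
from collections import Counter


def flags_to_ticket(weekly_flags, threshold=3):
    """Return the flag types seen more than `threshold` times this week."""
    counts = Counter(weekly_flags)
    return sorted(flag for flag, n in counts.items() if n > threshold)


if __name__ == "__main__":
    # Made-up week of review flags for illustration.
    week = ["latency_outlier"] * 4 + ["groundedness_near_miss"] * 3
    print(flags_to_ticket(week))
```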

voice bot build · 10 weeks · honest version

The timeline
including the week we sat on our hands.

Five stages, milestone-billed. The week-5 shadow run found a 1.4-second multilingual latency spike that would have torched the SLO in production. We halted the cutover, fixed the cache invalidation bug, added a tier-cached language-detection prefetch, and only then promoted the AI contact center to primary. The honest version of shipping in 10 weeks includes the week we didn't ship.
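The cache bug behind that halt was a per-call language result that was written once and never refreshed when the caller switched language. The sketch below shows one way to avoid it: overwrite the entry on every detection and report when the language changed. The class and method names are hypothetical, and the language-detection prefetch that was part of the actual fix is not modelled.

```python
class LanguageCache:
    """Per-call language cache that is refreshed on every detection."""

    def __init__(self):
        self._by_call = {}

    def get(self, call_sid):
        """Return the last detected language for the call, or None."""
        return self._by_call.get(call_sid)

    def observe(self, call_sid, detected):
        """Record the latest detection; return True if the language changed."""
        previous = self._by_call.get(call_sid)
        # Overwriting replaces any stale entry instead of keeping the first.
        self._by_call[call_sid] = detected
        return previous is not None and previous != detected

    def end_call(self, call_sid):
        """Drop the entry when the call ends so the cache cannot grow forever."""
        self._by_call.pop(call_sid, None)


if __name__ == "__main__":
    cache = LanguageCache()
    cache.observe("CA-demo", "es")
    print(cache.observe("CA-demo", "en"), cache.get("CA-demo"))
```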

  1. Weeks 1–2

    Discovery + frozen eval set

    Two weeks shadowing the existing tier-1 voice queue. Pulled six months of call recordings (de-identified, customer consent already on file) and let the support lead label them. 240 frozen eval items — the 5 questions accounting for 62% of volume plus 30% adversarial (accent, background noise, multi-turn corrections) and 8% must-refuse (legal escalations, billing disputes, churn-save asks).

    240-item eval set, must-refuse list, and latency SLO of 700ms
  2. Week 3

    Stack bake-off

    Two pipelines built in parallel: native gpt-realtime-2 speech-to-speech and a chained Whisper → GPT-5 → ElevenLabs path. Both wired to the same RAG over the help-center corpus. Ran 240 eval items through each, plus a soak test at 50 concurrent calls. Realtime won on p95 first-token by ~360ms; chained won on cost per minute by 28% but failed the latency SLO at the 95th percentile. Picked Realtime primary, chained kept as a documented fallback for the multilingual lane.

    Realtime primary, chained fallback, SLO-passing prototype
  3. Week 4

    Help-center RAG + tool surface

    Ingested 8,200 help-center articles into pgvector 0.7 (and mirrored into Pinecone serverless for the cost / tail-latency A/B). 480-token chunks, 80-token overlap, embeddings via text-embedding-3-large, cross-encoder rerank with bge-reranker-large. Three tools wired into the Realtime function-calling surface: lookup_article, handoff_to_human, schedule_callback. Zero write tools; the agent cannot mutate a customer record.

    Hybrid retrieval at 0.89 recall@5. 3-tool surface frozen.
  4. Week 5

    Shadow run — multilingual latency spike

    Two weeks shadowing the live queue (silent — calls still went to humans; the agent's response was logged but not played). Day 9 the SRE on rotation flagged a p99 latency spike on Spanish-to-English handoff calls: 1.4 seconds of dead air at the language switch. Root cause was a cache invalidation bug in the language-detection routing — first detection result was cached per call SID but never invalidated when the caller switched language mid-call. We halted prod cutover, added a tier-cached language detection prefetch at call start (every supported language warmed in the cache before the model needs it), and re-ran the soak. The honest version of `4-week shadow` includes this week.

    p99 multilingual latency: 1,420ms before the fix, 210ms after
    Walk-away point
  5. Weeks 6–10

    Production cutover + cost lock-in

    Promoted to primary on the tier-1 inbound queue with the live agent line in warm-standby on a 1-second timer. Weeks 6–8 ran at 20% traffic with the support lead reviewing every flagged conversation. Weeks 9–10 ramped to 100% tier-1. The unit-economics table above is the production-cut math at the 41k-call/month volume we currently see (not a projection).

    Full cutover. $0.10/call published. Per-call trace store on hot retention.
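The week-4 ingestion settings in the timeline above (480-token chunks, 80-token overlap) correspond to a sliding window that advances 400 tokens at a time. A minimal sketch, assuming the article is already tokenized into a list; the function name is hypothetical and the tokenizer itself is out of scope.

```python
def chunk_tokens(tokens, size=480, overlap=80):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the article
    return chunks


if __name__ == "__main__":
    article = list(range(1000))  # stand-in for 1,000 token IDs
    print([len(chunk) for chunk in chunk_tokens(article)])
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.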
voice agent evaluation · voice bot eval results · 240 frozen items

How we know
it works.

The voice bot eval set is frozen. Every model bump, prompt change, retrieval tweak, and tool-schema change re-runs the full 240. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2 points across all rows over the last 30 days.

metric | human baseline | v1 (wk 3) | v2 (wk 5) | current (live) | target
Tier-1 deflection rate (95% CI) | n/a | 31% (±5) | 35% (±4) | 38% (±5) | ≥ 35%
First-token latency p95 | n/a | 940ms | 680ms | 580ms | ≤ 700ms
Help-center recall@5 | n/a | 0.81 | 0.86 | 0.89 | ≥ 0.85
Wrong-answer rate (groundedness fail) | n/a | 2.4% | 1.6% | 0.8% | ≤ 1.0%
Human-handoff precision | n/a | 0.88 | 0.91 | 0.94 | ≥ 0.92
Per-call all-in cost | $4.00 | $0.18 | $0.13 | $0.10 | ≤ $0.15

Sample size for the deflection number is n=11,400 inbound calls across the 6-week shadow + 4-week production cut. The 38% point estimate has a 95% confidence interval of 33%–43%. First-token latency p95 is measured at the agent boundary (caller-side jitter buffer excluded). Per-call cost is the all-in deflected-call number; weighted-blended cost across deflected + handoff paths is ~$1.05/call. Multilingual handoff latency is measured at language-switch detection; per the kill-point fix, p99 now sits at 210ms.

Voice bot vs chatbot — what's the difference?

A voice bot answers a phone call. A chatbot answers a text message in a web widget or messaging app. Same model family can power both, but the engineering is different: voice bots stream bidirectional audio with sub-second latency, run an STT/TTS layer (or a unified Realtime model), and have to handle interruption and barge-in. Chatbots have turn-based text — no streaming-audio latency budget, no acoustic environment. A contact center running both surfaces typically shares the knowledge corpus and the eval set but ships them as two products with two latency SLOs.

Voice bot vs IVR — what's the difference?

An IVR (interactive voice response) is a touch-tone or fixed-phrase menu that routes the caller through a decision tree. It cannot understand free-form speech and cannot reason across a knowledge corpus. A voice bot understands what the caller actually says ("I'm calling because my last invoice charged me twice"), retrieves the relevant policy, and answers — or hands off when confidence is low. IVRs sit in front of voice bots in many deployments: the IVR handles caller authentication and routing; the voice bot handles the conversational answer.

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

A voice bot built on the OpenAI Realtime API patterns above will burn money or hurt CSAT in any of the following situations. We turn down the engagement before a pilot is scoped.

HIPAA-equivalent regulated voice

Healthcare patient lines, financial advisory, legal client calls. OpenAI's Realtime API is not BAA-eligible as of May 2026. The workflow needs a different architecture (Whisper-on-prem + GPT-4o-mini behind Azure OpenAI + ElevenLabs Enterprise) at materially higher per-call cost.

Accent or dialect not in the eval

The accent + noise eval has 30 items from Common Voice: mostly US callers, plus EN-IN, EN-AU, and EN-GB. If inbound traffic is meaningfully different (heavy AAVE, regional Caribbean English, L2 speakers), the eval has to grow before the model takes calls, and the 38% deflection number does not transfer.

Low-bandwidth / feature-phone callers

First-token math depends on a continuous 8kHz+ stream. Callers on flaky cellular, hotel landlines, or feature phones over G.711 µ-law degrade Whisper WER 3–5x. The latency tail blows out and CSAT inverts. The right product is async IVR with callback, not a real-time voice agent.

Recording obligations the agent can't satisfy

Two-party-consent states (including CA, FL, MA, MD, MT, NV, NH, PA, WA), MiFID II / Dodd-Frank, regulated medical advice. Recording + retention + disclosure posture exists at week 1, or the pilot doesn't get signed.

frequently asked — voice bot · openai realtime api

What buyers ask first.
Real answers, no hedging.

What is a voice bot?
A voice bot is a real-time conversational AI agent that answers inbound calls, understands speech, retrieves answers from a controlled knowledge corpus, and hands off to a human when confidence is low. It is not an IVR menu — IVR routes by touch-tone or fixed phrases. It is not a chatbot with text-to-speech bolted on — a voice bot streams audio bidirectionally with sub-second latency. This case study runs gpt-realtime-2 with function-calling into a help-center corpus.
How does the OpenAI Realtime API differ from chained STT + LLM + TTS?
The Realtime API is a single bidirectional speech-to-speech model — audio in, audio out, with intermediate function-calling. Chained STT + LLM + TTS introduces ~800ms-1.5s of stage latency from sequential calls. Realtime API gets p95 first-token under 600ms on a typical contact-center setup. Tradeoff: chained gives you per-stage observability and per-stage swap-out; Realtime API is faster but more vendor-locked.
How much does a voice bot cost per call?
About $0.10 per deflected call on this build (~2 minute average call, gpt-realtime-2 at OpenAI's published audio pricing, Twilio inbound carriage). Against a $4 fully loaded human-agent baseline (including QA, scheduling, supervision), that is ~40× cheaper per deflected call. At ≈ 41k calls/month, total run-rate is ≈ $4,573/month: roughly $2,965 in OpenAI spend, $697 in Twilio carriage, and $911 in fallback, retrieval, edge, and observability infrastructure.
How fast is the OpenAI Realtime API?
P95 first-token latency is 580ms end-to-end (caller speaks → first audio token streamed back), measured at the agent boundary. P95 turn-taking latency (caller stops → bot starts) is 920ms with VAD calibration. The SLO is p95 ≤ 700ms first-token. The week-5 kill point was a 1.4-second p99 spike on Spanish-to-English switches; we paused the cutover until the fix held and latency was consistent across 4 phone carriers in 3 US regions.
Can a voice bot transfer to a human agent?
Yes. Confidence-gated warm transfer is the default. The bot calls a registered handoff tool whenever confidence drops below 0.7, the caller explicitly asks for a human, or the query matches a must-refuse category. The transfer carries call context (intent, retrieved evidence, prior turns) to the human agent so the caller doesn't repeat themselves. This case study's tier-1 deflection rate is ≈ 38% (95% CI 33%–43%); the remaining calls reach a human through warm transfer or the failsafe queue.
How does a voice bot integrate with Twilio?
This case study uses Twilio for SIP-edge ingress + outbound carriage. The Twilio media stream pipes WebSocket audio to a Cloudflare Worker that proxies to OpenAI Realtime with an ephemeral session token. Call recording + transcript storage stays in the customer's tenant (Twilio TaskRouter handoff, Twilio recording with customer-managed retention). The OpenAI Realtime layer sees only the active audio stream, never the recording.
How long does it take to build a voice bot?
10 weeks for this engagement: 2 weeks discovery + 240-item eval-set freeze, 1 week stack bake-off (native Realtime vs chained), 1 week help-center RAG + tool surface, 1 week shadow run including the kill-point halt for the multilingual latency spike, and 5 weeks production cutover (20% traffic in weeks 6–8, 100% tier-1 in weeks 9–10).
When should we NOT ship a voice bot?
Four cases: the must-refuse policy isn't drafted (regulatory categories like loan-application status, billing disputes over $X, or healthcare PHI must route directly to a human, with no AI in the path); TCPA/two-party-consent disclosure isn't on the IVR script (FCC enforcement is unforgiving); the SRE team can't operate a real-time observability + barge-in tuning loop for the first 90 days; transfer to a human agent isn't possible (a voice bot without a transfer fallback is hostile). We turn down engagements that fail any of these.

Ready to ship

Want a voice bot like this
for your AI contact center?

Book a fixed-fee voice bot / AI contact center audit. We'll review the inbound voice workflow, model your call volume against the cost slider on this page, scope the eval set, recommend a stack (OpenAI Realtime API / chained / hybrid), project run-cost, and tell you honestly whether AI customer service software is the right shape for your traffic, or whether an AI-powered call center buy beats a build. About one audit in five ends with `keep the humans, here's the smaller automation we'd ship instead.`

30 min, async or live · Cost-math reconciled to your real volume · Walk-away point in the pilot
Updated May 20, 2026 · By Navin Sharma