B2B SaaS contact center · Tier-1 support Voice bot · OpenAI voice agent on the gpt realtime stack · function-calling handoff

gpt-realtime-2 · Whisper-large-v3 · pgvector 0.7 · Twilio Voice · Cloudflare Workers

voice bot · chatgpt voice agent · openai case study · 2026 · anonymized

A voice bot for a SaaS AI contact center,
built on the OpenAI Realtime API.

A US-based mid-market B2B SaaS support team's tier-1 voice queue was averaging four-minute waits at peak; five inbound questions accounted for 62% of call volume; their legacy IVR bounced 80%+ to a human. We built the AI contact center software on the OpenAI Realtime API (gpt-realtime-2) over their help-center corpus — Twilio in, streaming tokens to TTS, confidence-gated handoff to a live rep. Buyer framings layer on the same build: voice bot for procurement, AI-powered call center for CX, AI customer service software for finance. Eval-first, 10 weeks, with a kill point in week 5 we used.

≈ 38%

tier-1 deflection (95% CI 33%–43% · n=11,400 calls)

p95 580ms

first-token latency · SLO 700ms

$0.10

all-in per-deflected-call · vs $4 live-agent loaded cost

10 wks

discovery + 6-wk shadow + 4-wk prod cut

shipped

10 weeks · 3 engineers · 1 support lead

Summary

What this case study shows

A SaaS customer-support team shipped an OpenAI Realtime voice agent over their help-center RAG corpus, fronted by Twilio for telephony. Across n=11,400 calls, the agent deflected 38% of tier-1 voice traffic (95% CI 33%–43%) at $0.10 per deflected call against a $4 live-agent baseline, with p95 first-token latency of 580 milliseconds. Stack: gpt-realtime-2, Whisper-large-v3, pgvector 0.7, Twilio Voice, Cloudflare Workers, Langfuse. The handoff_to_human function-calling tool triggered when confidence dropped below 0.7. Ten weeks from discovery to production. This is one shape of a broader voice agent services engagement: the same pipeline carries over to telephony, kiosk, and in-app voice surfaces.

latency budget

p95 first-token, visualised.

Total budget — caller-mouth to caller-ear — is 580ms. Each band's width is its share of that budget. The reasoning + RAG step is the long pole; the rest are kept honest by Cloudflare Workers and the Twilio media edge.

62ms Caller speech ingress

118ms STT (gpt-realtime-2 audio in)

264ms Reasoning + RAG retrieval

88ms TTS first-audio frame

48ms Twilio egress to caller


Deterministic replay — these bars are not a recording; they are a layout-stable visualisation of the p95 first-token latency budget. Per-stage numbers are pulled from Langfuse trace aggregates over a 30-day production window.
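As a quick check on the budget above, the following sketch sums the five bands and computes each band's share of the total. The stage names and millisecond values are copied from the bars; the function and variable names are illustrative, not part of the build.

```python
# Per-stage p95 budget in milliseconds, copied from the bars above.
STAGE_BUDGET_MS = {
    "Caller speech ingress": 62,
    "STT (gpt-realtime-2 audio in)": 118,
    "Reasoning + RAG retrieval": 264,
    "TTS first-audio frame": 88,
    "Twilio egress to caller": 48,
}


def budget_shares(stages):
    """Return each stage's share of the total budget, as a fraction of 1."""
    total = sum(stages.values())
    return {name: ms / total for name, ms in stages.items()}


total_ms = sum(STAGE_BUDGET_MS.values())
shares = budget_shares(STAGE_BUDGET_MS)
# The "long pole" is simply the stage with the largest budget.
long_pole = max(STAGE_BUDGET_MS, key=STAGE_BUDGET_MS.get)

for name, ms in STAGE_BUDGET_MS.items():
    print(f"{name:32s} {ms:4d} ms  {shares[name]:6.1%}")
print(f"total {total_ms} ms, long pole: {long_pole}")
```

The five bands add up to the 580ms total, and the reasoning + RAG band takes a little under half of it.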

4 min

average tier-1 voice queue wait at peak

62%

of inbound call volume tied to the same 5 questions

80%+

IVR bounce rate to a human (existing tree)

700ms

first-token ceiling before callers hear 'robot'

the problem

A tier-1 voice queue
that wasn't worth a human.

A US-based mid-market B2B SaaS ($80M+ ARR) with a 14-rep tier-1 team handling ~41,000 calls/month. 62% of inbound volume on the same 5 questions. IVR bouncing 80%+ to a queue with a 4-minute peak wait. Tier-1 reps burning 71% of their day on questions any of them could answer in their sleep.

today vs. with the agent

today · tier-1 voice queue
Caller dials → IVR tree (press 1 for…) → Hold (4 min peak) → Live agent (5 same Qs)
outcome: Long wait. Agents burned on repetitive tier-1 calls.

with the agent
Caller dials → Twilio + edge audio (<60ms ingress) → gpt-realtime-2 + RAG (streaming · confidence-gated) → Decision branch (answer · or handoff_to_human)
outcome: Resolved · ≈ 38%
outcome: Handoff to human · live transfer with transcript
outcome: Failsafe queue · model self-refuses

The binding constraint was latency, not accuracy. The support lead and the head of CX had piloted two text-channel chatbots the previous year and shelved both when CSAT dipped. What changed: US callers reliably report the experience as "robotic" when a synchronous voice line holds more than ~700ms of dead air after they finish speaking. Past that ceiling, deflected calls don't actually deflect — they bounce to a human angrier than when they arrived. Vendor buy-vs-build: legacy contact-center suites bolted LLMs onto IVR trees, conversational-IVR layers carried per-minute pricing, voice-bot SaaS priced on monthly seats. None carried a frozen eval, a p95 first-token SLO, or a per-call cost number reconcilable from public vendor pricing. The conversation we walked into was: show us how a voice agent could miss the 700ms ceiling, and tell us how you'd catch it before a customer hears it.

voice's defining constraint

"robotic" threshold · 700ms

our p95 first-token · 580ms

our p50 first-token · 480ms

chained baseline (rejected) · 940ms

Native speech-to-speech vs chained STT → text-LLM → TTS bought back ~360ms of p95 (940ms down to 580ms).

The thing that scares us isn't a wrong answer; wrong answers we can recover from. What scares us is one second of silence on the line. The customer has already decided they're talking to a robot, and they're done. Show us the latency tail before you show us the deflection number.

Head of CX B2B SaaS · tier-1 support · 41k calls/mo

the approach · voice bot pipeline

Voice bot pipeline — six stages,
one branch out to a human.

Twilio ingress → Cloudflare Workers edge proxy → gpt-realtime-2 native speech-to-speech (Whisper-large-v3 fallback) → hybrid pgvector + Pinecone A/B retrieval → bge-reranker-large → 3-tool surface (lookup_article, handoff_to_human, schedule_callback). Zero write tools. The streaming dots in the diagram are real — gpt-realtime-2 does not wait until generation finishes before TTS starts. That property is what gets p95 first-token under 600ms.

three decisions that shaped the voice bot build

design decision · 01

gpt-realtime-2 speech-to-speech as the primary path

we rejected: Chained STT → text-LLM → TTS pipeline (Whisper + GPT-5 + ElevenLabs)
because: On the eval we ran, chained got us to p95 ≈ 940ms first-token, already past the 700ms `feels-robotic` threshold US callers reported. Native speech-to-speech buys us back ~360ms. Whisper still ships as a fallback when the Realtime audio path can't decode accent or noise.

design decision · 02

handoff_to_human as a function-calling tool, not a fallback timeout

we rejected: Confidence threshold on the model's own self-reported probability
because: Self-reported confidence on Realtime models is poorly calibrated under stream pressure (Anthropic and OpenAI both publish this). A first-class tool the model can call explicitly is more honest: the model knows what it doesn't know better than it knows how sure it is.

design decision · 03

pgvector 0.7 primary + Pinecone serverless on a 50/50 A/B mirror

we rejected: Pick one vector store and commit
because: Help-center retrieval recall was the second-biggest determinant of deflection (after first-token latency) on the eval. Running both in production for 6 weeks let us measure not just recall@5 but cost per query and tail latency under real traffic. pgvector won on cost-per-query; Pinecone won on tail-latency variance. We kept pgvector primary and the mirror stays as a watch-the-shop sanity check.
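One way to run a 50/50 split like this is to assign each call to an arm deterministically from its call SID, so every turn of a call hits the same store. The sketch below is an assumption about how such a split could be done, not the build's actual router; `retrieval_arm` and `pinecone_share` are hypothetical names.

```python
import hashlib


def retrieval_arm(call_sid: str, pinecone_share: float = 0.5) -> str:
    """Deterministically assign a call to a retrieval arm.

    Hashing the call SID (rather than drawing a random number per turn)
    keeps all turns of one call on the same vector store.
    """
    digest = hashlib.sha256(call_sid.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "pinecone" if bucket < pinecone_share else "pgvector"


if __name__ == "__main__":
    sid = "CA" + "0123456789abcdef" * 2  # made-up SID in Twilio's CA + 32 hex shape
    print(sid, "->", retrieval_arm(sid))
```

Because the assignment is a pure function of the SID, the arm can be recomputed later from a trace without storing it.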

why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.

Realtime decision model

Tier-1 deflection precision at 0.7 confidence on the frozen 240-item eval. Function-calling handoff is a deliberate tool, not a fallback timeout — the model knows what it doesn't know better than it knows how sure it is.

Telephony ingress

Round-trip latency from carrier to edge. Cloudflare Workers buys back ~28ms median vs origin.

Hybrid retrieval

Recall@5 + cost-per-query on the eval. pgvector primary, Pinecone serverless as A/B watchdog.

Cross-encoder rerank

Top-1 precision on the held-out slice.

TTS first-frame

Token-to-playback latency.

Handoff to human

PagerDuty page-to-pickup time. Warm transfer with transcript.

under the hood

The realtime voice agent,
round-trip.

Caller speaks. Audio streams to gpt-realtime-2 over the help-center RAG. The model either answers, streaming tokens straight back to TTS so the first audio frame leaves the edge inside ~580ms, or calls the handoff_to_human tool and PagerDuty pages a live agent.

first-token p95 580ms end-to-end · streaming tokens flow continuously from gpt-realtime-2 to TTS · branch fires on confidence < 0.7

11,400

shadow + production calls used for the deflection CI

0

autonomous policy changes; agent only answers tier-1 from the help-center RAG

p50 480ms

first-token median; tail-latency budget detailed below

1 SRE on call

24/7 rotation; Langfuse + PagerDuty wired for sub-second cutover

the stack · voice bot + OpenAI Realtime

Voice bot + OpenAI Realtime stack — named tools,
named versions.

Everything in the build is a thing your security team can write a question about. Nothing is `our proprietary AI`. The eval set, prompts, and tool schemas are all checked into the customer's repo. Vendor swap-out cost is bounded by design.

gpt-realtime-2 OpenAI Realtime API (2026-04)

Whisper-large-v3 OpenAI, self-hosted on g5.xlarge

pgvector 0.7 Postgres 16

BAAI bge-reranker-large v2.5

Pinecone serverless us-east-1

Twilio Programmable Voice SIP · 2026-03 API

Cloudflare Workers Durable Objects

ElevenLabs Turbo v2.5 Multilingual

Langfuse self-hosted · t3.medium

PagerDuty

how it actually runs

Production shape,
under the hood.

The numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses OpenAI's public Realtime API pricing as of May 2026; eval composition is the frozen 240-item set the CI gates on.

How fast is the OpenAI Realtime API?

voice bot latency budget · voice ai latency p95

Per-stage P50 / P95 (ms)

stage	p50	p95	tooling
Twilio ingress + edge proxy	38	62	Twilio Programmable Voice · Cloudflare Workers Durable Objects
STT (Realtime audio in)	82	118	gpt-realtime-2 native audio · Whisper-large-v3 fallback on miss
Hybrid retrieval	64	96	pgvector 0.7 top-40 ∥ Pinecone serverless top-40 (A/B) → RRF k=60
Cross-encoder rerank	44	72	BAAI bge-reranker-large · g5.xlarge in customer VPC · top-12
gpt-realtime-2 decision	196	264	OpenAI Realtime API · function-calling · ~2,800 in · streaming out
TTS first audio	84	124	gpt-realtime-2 native TTS · ElevenLabs Turbo v2.5 fallback
Twilio egress to caller	32	48	media stream reverse leg · jitter buffer ≤ 80ms
Total to first-token	480	580	agent boundary (excludes caller-side jitter buffer)

p50/p95 from 30-day rolling window over n ≈ 41,200 production calls. SLO is p95 ≤ 700 ms first-token; current burn ≈ 83%. The kill-point fix (multilingual cache invalidation) is the only regression event in the last 60 days.
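The hybrid-retrieval row fuses the two top-40 lists with RRF at k=60. A minimal sketch of reciprocal rank fusion under that setting follows; the function name, the tie-break rule, and the example chunk IDs are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Reciprocal rank fusion over several ranked lists of document IDs.

    Each list contributes 1 / (k + rank) per document, with rank starting
    at 1. Ties are broken by document ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    fused = [doc_id for doc_id, _ in ordered]
    return fused if top_n is None else fused[:top_n]


if __name__ == "__main__":
    pgvector_hits = ["kb_a", "kb_b"]   # made-up IDs
    pinecone_hits = ["kb_b", "kb_c"]
    print(rrf_fuse([pgvector_hits, pinecone_hits]))
```

A chunk that appears in both lists outranks one that appears in only one, which is the property the fusion step relies on before the cross-encoder rerank.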

slo headroom

Where the 700ms SLO budget goes.

Anything slower than 700ms first-token reads as a robot to a US caller, which makes it the binding constraint on this whole engagement. Current p95 is 580ms; the 120ms gap below 700ms is the headroom left to absorb future prompt growth or a slower third-party fallback.

Twilio ingress 62ms
STT (Realtime/Whisper) 118ms
RAG + reasoning 264ms
TTS first audio 88ms
Twilio egress 48ms
SLO threshold 700ms
Headroom under SLO 120ms
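The headroom and burn figures follow directly from the p95 and the SLO. A small sketch of that arithmetic, with hypothetical helper names:

```python
def slo_headroom(p95_ms, slo_ms):
    """Return (headroom in ms, fraction of the SLO budget already used)."""
    return slo_ms - p95_ms, p95_ms / slo_ms


def can_absorb(extra_ms, p95_ms=580, slo_ms=700):
    """True if adding extra_ms to the current p95 still meets the SLO."""
    headroom_ms, _ = slo_headroom(p95_ms, slo_ms)
    return extra_ms <= headroom_ms


headroom_ms, burn = slo_headroom(p95_ms=580, slo_ms=700)
print(f"headroom {headroom_ms} ms, budget used {burn:.0%}")
# prints "headroom 120 ms, budget used 83%"
```

Under this reading, a change that adds 100ms at p95 still fits and one that adds 150ms does not.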

realtime/tools/handoff_to_human.tool.json jsonc

// realtime/tools/handoff_to_human.tool.json
// Function-calling JSON schema registered on session.update.tools[].
// The model can call this tool for any reason. The runtime checks the
// call-state object BEFORE any side-effect (PagerDuty page, warm transfer
// to live agent) runs, and fires it only on confidence < 0.7 OR explicit
// caller request OR a must-refuse category match.
{
  "type": "function",
  "name": "handoff_to_human",
  "description": "Transfer this call to a live tier-1 support agent. Use when the caller's intent falls outside the help-center corpus, when the model's own confidence in the retrieved answer is below 0.7, when the caller explicitly asks for a human, or when the conversation hits a must-refuse category (billing dispute, churn-save, legal escalation).",
  "parameters": {
    "type": "object",
    "required": ["reason", "confidence", "call_state"],
    "properties": {
      "reason": {
        "type": "string",
        "enum": [
          "low_confidence",
          "out_of_scope",
          "caller_request",
          "must_refuse_category",
          "multilingual_handoff"
        ],
        "description": "Why the handoff is being requested. Used for routing + analytics."
      },
      "confidence": {
        "type": "number",
        "minimum": 0,
        "maximum": 1,
        "description": "Model's confidence in the retrieved answer at the moment of handoff. The runtime gate trips on < 0.7 for the low_confidence reason; other reasons bypass the threshold."
      },
      "call_state": {
        "type": "object",
        "required": ["call_sid", "language", "transcript_summary", "retrieved_chunk_ids"],
        "properties": {
          "call_sid":            { "type": "string", "pattern": "^CA[a-f0-9]{32}$" },
          "language":            { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
          "transcript_summary":  { "type": "string", "maxLength": 800 },
          "retrieved_chunk_ids": {
            "type": "array",
            "items": { "type": "string", "pattern": "^kb_[a-f0-9]{12}$" },
            "minItems": 0,
            "maxItems": 12
          }
        }
      }
    }
  }
}

The handoff tool schema registered on session.update.tools[]. The runtime gates the side-effect (PagerDuty page, warm transfer) on confidence < 0.7 OR explicit caller request OR a must-refuse category match. The model can call the tool for any reason, but the gate decides whether it actually fires.
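A minimal sketch of a runtime gate like the one described, assuming it follows the rule in the schema's `confidence` description: the threshold applies to the `low_confidence` reason and other reasons bypass it. `HandoffRequest` and `should_fire_handoff` are hypothetical names, and only a subset of the schema's fields is validated here.

```python
import re
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7
# Reason values copied from the schema's enum.
REASONS = {
    "low_confidence",
    "out_of_scope",
    "caller_request",
    "must_refuse_category",
    "multilingual_handoff",
}
CALL_SID_RE = re.compile(r"CA[a-f0-9]{32}")  # same shape as the schema pattern


@dataclass(frozen=True)
class HandoffRequest:
    reason: str
    confidence: float
    call_sid: str


def should_fire_handoff(req: HandoffRequest) -> bool:
    """Decide whether the side-effect (page + warm transfer) should run."""
    if req.reason not in REASONS:
        raise ValueError(f"unknown reason: {req.reason!r}")
    if not 0.0 <= req.confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    if CALL_SID_RE.fullmatch(req.call_sid) is None:
        raise ValueError("call_sid does not match the expected pattern")
    if req.reason == "low_confidence":
        # Only this reason is subject to the numeric threshold.
        return req.confidence < CONFIDENCE_THRESHOLD
    return True
```

Keeping the gate outside the model means a malformed or over-eager tool call is rejected before anyone is paged.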

How much does a voice bot cost?

voice bot unit economics · ai contact center cost math

Per-call and monthly cost math (≈ 41k calls/mo)

line item	$ / call	$ / month	note
gpt-realtime-2 audio input	$0.0240	$984	~2 min avg call · $24/1M audio-input tokens (May 2026)
gpt-realtime-2 audio output	$0.0480	$1,968	~45 sec agent speech avg · $48/1M audio-output tokens
text-embedding-3-large (query)	$0.0003	$13	≈ 2,400 tokens × $0.13 / 1M per call
Whisper fallback (5% of calls)	$0.0030	$123	self-hosted Whisper-large-v3 on g5.xlarge — amortised
pgvector + Postgres 16 RDS	fixed	$284	db.m6i.large · embeddings + tsvector + traces
bge-reranker on g5.xlarge	fixed	$378	shared with Whisper fallback · 24/7
Pinecone serverless (A/B 50%)	$0.0008	$33	watchdog mirror · expected to drop after the audit
Twilio Voice (inbound)	$0.0170	$697	$0.0085/min × 2 min avg per call
Cloudflare Workers + R2	$0.0006	$26	edge proxy + audio chunk store
Langfuse self-hosted	fixed	$67	t3.medium · 30-day hot / 1-yr cold
All-in per deflected call	≈ $0.10	≈ $4,573 / mo	vs. $4.00 loaded live-agent cost per call · ~40× cheaper at the deflection rate

Token costs use OpenAI's public Realtime API pricing as of May 2026: $24/1M audio-input, $48/1M audio-output. Twilio costs are list price. Infra costs are AWS us-east-2 list. Loaded live-agent cost ($4.00/call) is the client's own internal blend (wage + benefits + AHT + occupancy + tooling); we used their number, not a market average. Monthly figures assume 41,200 calls/mo at the current 38% deflection rate. Per-call all-in reconciles to ~$0.10 (agent path) and ~$2.48 (handoff path), blended ≈ $1.05 weighted. The published math headlines the per-deflected-call number, which is the relevant comparison vs. a live agent on a deflected call.
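The table can be re-added mechanically. The sketch below copies the line items as printed and sums them; the 41,000 calls/month divisor is an assumption taken from the "≈ 41k calls/mo" heading, and the variable names are illustrative.

```python
# (line item, $ per call or None for fixed infra, $ per month), from the table.
LINE_ITEMS = [
    ("gpt-realtime-2 audio input", 0.0240, 984),
    ("gpt-realtime-2 audio output", 0.0480, 1968),
    ("text-embedding-3-large (query)", 0.0003, 13),
    ("Whisper fallback (5% of calls)", 0.0030, 123),
    ("pgvector + Postgres 16 RDS", None, 284),
    ("bge-reranker on g5.xlarge", None, 378),
    ("Pinecone serverless (A/B 50%)", 0.0008, 33),
    ("Twilio Voice (inbound)", 0.0170, 697),
    ("Cloudflare Workers + R2", 0.0006, 26),
    ("Langfuse self-hosted", None, 67),
]
CALLS_PER_MONTH = 41_000  # assumption: the "≈ 41k calls/mo" table heading

monthly_total = sum(month for _, _, month in LINE_ITEMS)
variable_per_call = sum(per for _, per, _ in LINE_ITEMS if per is not None)
fixed_monthly = sum(month for _, per, month in LINE_ITEMS if per is None)
# Fixed infra is spread evenly over the month's calls.
all_in_per_call = variable_per_call + fixed_monthly / CALLS_PER_MONTH

print(f"monthly total:     ${monthly_total:,}")
print(f"variable per call: ${variable_per_call:.4f}")
print(f"all-in per call:   ${all_in_per_call:.4f}")
```

The monthly column sums to the $4,573 shown. Spreading the fixed infra over 41,000 calls gives roughly eleven cents per call, close to the table's ≈ $0.10.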

voice bot eval composition

What's in the frozen 240-item set

category	items	what it checks	ci-gate threshold
Top-5 question golds	100	labelled correct answer + retrieved chunk IDs on the 5 questions accounting for 62% of volume	≥ 0.92 groundedness
Latency soak (concurrent)	20	50-concurrent-call replay against the staging Realtime endpoint	p95 ≤ 700ms first-token
Accent + noise	30	ASR-stress eval drawn from the Common Voice multi-accent slice + a 12-clip noise overlay set	≥ 0.85 transcript accuracy
Must-refuse	26	billing disputes, churn-save asks, legal escalations, retention offers, refund promises	100% refusal · 100% handoff
Multilingual handoff	24	Spanish-to-English switch mid-call (added after the kill-point)	p99 ≤ 250ms switch latency
Adversarial	40	jailbreak attempts, role-play coercion, prompt injection through caller statements	≥ 0.98 refusal

Eval set is frozen: items are added only, never edited. The support lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Per-item replay is deterministic: same audio, same prompt, same retrieved chunks fed via fixture.
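The release rule can be sketched as a comparison of per-category scores between the prior cut and the candidate. This is an illustration of the stated rule only; the category keys, the scores, and treating a missing category as a failure are assumptions.

```python
def regressed_categories(prior, current, max_drop_points=1.0):
    """Return categories whose score fell by more than max_drop_points.

    Scores are percentage points (0-100). A category missing from the
    current run is treated as a regression.
    """
    failures = []
    for category, prior_score in prior.items():
        if category not in current:
            failures.append(category)
        elif prior_score - current[category] > max_drop_points:
            failures.append(category)
    return failures


def release_allowed(prior, current, signed_override=False):
    """The release passes if nothing regressed or a signed override exists."""
    return signed_override or not regressed_categories(prior, current)


if __name__ == "__main__":
    prior_cut = {"top5_golds": 94.0, "adversarial": 99.0}   # made-up scores
    candidate = {"top5_golds": 93.5, "adversarial": 97.5}
    print(regressed_categories(prior_cut, candidate))
```

With these made-up scores the adversarial category drops 1.5 points and blocks the release, while the 0.5-point dip on the top-5 golds does not.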

interactive cost math

Your monthly call volume, your monthly bill.

The figures below are computed at the current tier-1 inbound voice volume, against the published $0.10/call agent cost and a $4/call loaded live-agent baseline. Deflected calls are costed at the agent rate; every other call is costed at the live-agent rate.

monthly inbound call volume: 41,200

agent monthly $: $103,742
100% human baseline: $164,800
monthly savings: $61,058
savings vs baseline: 37.0%

Math assumptions: $0.10/call all-in (Realtime API tokens + RAG infra + edge), $4.00/call loaded live-agent cost (wage + benefits + AHT + occupancy + tooling), 38% tier-1 deflection rate (95% CI 33%–43%, n=11,400). Change any assumption and the calculator recomputes; the static figures elsewhere on this page do not.
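The calculator's figures can be reproduced with a few lines, assuming deflected calls cost the agent rate and all other calls cost the live-agent rate. The function name and the returned keys are hypothetical.

```python
def monthly_cost(calls, deflection_rate=0.38, agent_cost=0.10, human_cost=4.00):
    """Blend agent and live-agent cost for one month of tier-1 calls."""
    deflected = calls * deflection_rate
    handed_off = calls - deflected
    agent_monthly = deflected * agent_cost + handed_off * human_cost
    baseline = calls * human_cost  # every call answered by a human
    savings = baseline - agent_monthly
    return {
        "agent_monthly": agent_monthly,
        "baseline": baseline,
        "savings": savings,
        "savings_fraction": savings / baseline if baseline else 0.0,
    }


result = monthly_cost(41_200)
for key in ("agent_monthly", "baseline", "savings"):
    print(f"{key:14s} ${result[key]:,.0f}")
```

At 41,200 calls this gives the same dollar figures as above. Passing the ends of the confidence interval (`deflection_rate=0.33` or `0.43`) shows how far the savings move with the deflection estimate.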

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. Four controls keep latency and CSAT honest after cutover.

Monday flagged-turn review

Every handoff, every groundedness near-miss, every latency outlier past p95. Patterns that repeat (more than 3 of the same flag per week) become a Jira ticket against the eval set.

Trace retention

Langfuse in customer VPC. Per-turn: audio segment, STT transcript, retrieved chunks, model output, tool invocations, call-state at handoff, final caller-facing audio.

On-call rotation

One SRE per week. 99.5% pipeline-availability SLO + p95 ≤ 700ms first-token SLO on the LLM-routed path.

CSAT sample review

40 deflected calls / month. CX team listens, scores, feeds the prompt + retrieval iteration loop. Doesn't touch the frozen eval.

voice bot build · 10 weeks · honest version

The timeline
including the week we sat on our hands.

Five stages, milestone-billed. The week-5 shadow run found a 1.4-second multilingual latency spike that would have torched the SLO in production. We halted the cutover, fixed the cache invalidation bug, added a tier-cached language-detection prefetch, and only then promoted the AI contact center to primary. The honest version of shipping in 10 weeks includes the week we didn't ship.

Weeks 1–2

Discovery + frozen eval set

Two weeks shadowing the existing tier-1 voice queue. Pulled six months of call recordings (de-identified, customer consent already on file) and let the support lead label them. 240 frozen eval items — the 5 questions accounting for 62% of volume plus 30% adversarial (accent, background noise, multi-turn corrections) and 8% must-refuse (legal escalations, billing disputes, churn-save asks).

240-item eval set, must-refuse list, and latency SLO of 700ms

Week 3

Stack bake-off

Two pipelines built in parallel: native gpt-realtime-2 speech-to-speech and a chained Whisper → GPT-5 → ElevenLabs path. Both wired to the same RAG over the help-center corpus. Ran 240 eval items through each, plus a soak test at 50 concurrent calls. Realtime won on p95 first-token by ~360ms; chained won on cost per minute by 28% but failed the latency SLO at the 95th percentile. Picked Realtime primary, chained kept as a documented fallback for the multilingual lane.

Realtime primary, chained fallback, SLO-passing prototype

Week 4

Help-center RAG + tool surface

Ingested 8,200 help-center articles into pgvector 0.7 (and mirrored into Pinecone serverless for the cost / tail-latency A/B). 480-token chunks, 80-token overlap, embeddings via text-embedding-3-large, cross-encoder rerank with bge-reranker-large. Three tools wired into the Realtime function-calling surface: lookup_article, handoff_to_human, schedule_callback. Zero write tools; the agent cannot mutate a customer record.
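The chunking parameters above (480-token chunks, 80-token overlap) amount to a sliding window with a 400-token step. A minimal sketch, assuming the article is already tokenized; the tokenizer itself is not specified here, so the example uses placeholder tokens.

```python
def chunk_tokens(tokens, size=480, overlap=80):
    """Split a token list into windows of `size` sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the article
    return chunks


if __name__ == "__main__":
    article = [f"tok{i}" for i in range(1000)]  # placeholder tokens
    pieces = chunk_tokens(article)
    print([len(piece) for piece in pieces])
```

The early `break` avoids emitting a trailing chunk that is fully contained in the previous one.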

Hybrid retrieval at 0.89 recall@5. 3-tool surface frozen.

Week 5

Shadow run — multilingual latency spike

Two weeks shadowing the live queue (silent — calls still went to humans; the agent's response was logged but not played). Day 9 the SRE on rotation flagged a p99 latency spike on Spanish-to-English handoff calls: 1.4 seconds of dead air at the language switch. Root cause was a cache invalidation bug in the language-detection routing — first detection result was cached per call SID but never invalidated when the caller switched language mid-call. We halted prod cutover, added a tier-cached language detection prefetch at call start (every supported language warmed in the cache before the model needs it), and re-ran the soak. The honest version of `4-week shadow` includes this week.

p99 multilingual latency: 1,420ms before the fix, 210ms after
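To make the failure mode concrete, here is a sketch of a per-call language cache that is updated when detection changes mid-call and that warms a route for every supported language up front. It illustrates the pattern described, not the team's actual fix; `LanguageRouteCache`, `load_route`, and the route strings are all hypothetical.

```python
from typing import Callable, Dict, Iterable


class LanguageRouteCache:
    """Per-call language routing with prefetch and invalidation on switch."""

    def __init__(self, supported: Iterable[str], load_route: Callable[[str], str]):
        self._load_route = load_route
        # Prefetch: pay the slow load for every supported language up front.
        self._routes: Dict[str, str] = {lang: load_route(lang) for lang in supported}
        self._call_language: Dict[str, str] = {}

    def route_for(self, call_sid: str, detected_language: str) -> str:
        # The buggy shape kept the first language seen for this call SID.
        # Here the entry is overwritten whenever detection changes mid-call.
        if self._call_language.get(call_sid) != detected_language:
            self._call_language[call_sid] = detected_language
        if detected_language not in self._routes:
            # Cold path for a language that was not prefetched.
            self._routes[detected_language] = self._load_route(detected_language)
        return self._routes[detected_language]

    def end_call(self, call_sid: str) -> None:
        """Drop per-call state when the call finishes."""
        self._call_language.pop(call_sid, None)
```

With both languages warmed at construction, a Spanish-to-English switch resolves from memory instead of triggering a load at the moment the caller changes language.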

Walk-away point

Weeks 6–10

Production cutover + cost lock-in

Promoted to primary on the tier-1 inbound queue with the live agent line in warm-standby on a 1-second timer. Weeks 6–8 ran at 20% traffic with the support lead reviewing every flagged conversation. Weeks 9–10 ramped to 100% tier-1. The unit-economics table above is the production-cut math at the 41k-call/month volume we currently see (not a projection).

Full cutover. $0.10/call published. Per-call trace store on hot retention.

voice agent evaluation · voice bot eval results · 240 frozen items

How we know
it works.

The voice bot eval set is frozen. Every model bump, prompt change, retrieval tweak, and tool-schema change re-runs the full 240. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2 points across all rows over the last 30 days.

metric	human baseline	v1 (wk 3)	v2 (wk 5)	current (live)	target
Tier-1 deflection rate (95% CI)	n/a	31% (±5)	35% (±4)	38% (±5)	≥ 35%
First-token latency p95	n/a	940ms	680ms	580ms	≤ 700ms
Help-center recall@5	n/a	0.81	0.86	0.89	≥ 0.85
Wrong-answer rate (groundedness fail)	n/a	2.4%	1.6%	0.8%	≤ 1.0%
Human-handoff precision	n/a	0.88	0.91	0.94	≥ 0.92
Per-call all-in cost	$4.00	$0.18	$0.13	$0.10	≤ $0.15

Sample size for the deflection number is n=11,400 inbound calls across the 6-week shadow + 4-week production cut. The 38% point estimate has a 95% confidence interval of 33%–43%. First-token latency p95 is measured at the agent boundary (caller-side jitter buffer excluded). Per-call cost is the all-in deflected-call number; weighted-blended cost across deflected + handoff paths is ~$1.05/call. Multilingual handoff latency is measured at language-switch detection; per the kill-point fix, p99 now sits at 210ms.

Voice bot vs chatbot — what's the difference?

A voice bot answers a phone call. A chatbot answers a text message in a web widget or messaging app. Same model family can power both, but the engineering is different: voice bots stream bidirectional audio with sub-second latency, run an STT/TTS layer (or a unified Realtime model), and have to handle interruption and barge-in. Chatbots have turn-based text — no streaming-audio latency budget, no acoustic environment. A contact center running both surfaces typically shares the knowledge corpus and the eval set but ships them as two products with two latency SLOs.

Voice bot vs IVR — what's the difference?

An IVR (interactive voice response) is a touch-tone or fixed-phrase menu that routes the caller through a decision tree. It cannot understand free-form speech and cannot reason across a knowledge corpus. A voice bot understands what the caller actually says ("I'm calling because my last invoice charged me twice"), retrieves the relevant policy, and answers — or hands off when confidence is low. IVRs sit in front of voice bots in many deployments: the IVR handles caller authentication and routing; the voice bot handles the conversational answer.

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

A voice bot built on the OpenAI Realtime API patterns above will burn money or hurt CSAT in any of the following situations. We turn down the engagement before a pilot is scoped.

HIPAA-equivalent regulated voice

Healthcare patient lines, financial advisory, legal client calls. OpenAI's Realtime API is not BAA-eligible as of May 2026. The workflow needs a different architecture (Whisper-on-prem + GPT-4o-mini behind Azure OpenAI + ElevenLabs Enterprise) at materially higher per-call cost.

Accent or dialect not in the eval

The accent + noise eval has 30 items from Common Voice: mostly US callers, plus EN-IN, EN-AU, and EN-GB. If inbound traffic is meaningfully different (heavy AAVE, regional Caribbean English, L2 speakers), the eval has to grow before the model takes calls, and the 38% deflection number does not transfer.

Low-bandwidth / feature-phone callers

First-token math depends on a continuous 8kHz+ stream. Callers on flaky cellular, hotel landlines, or feature phones over G.711 µ-law degrade Whisper WER 3–5x. The latency tail blows out and CSAT inverts. The right product is async IVR with callback, not a real-time voice agent.

Recording obligations the agent can't satisfy

Two-party-consent states (including CA, FL, MA, MD, MT, NV, NH, PA, WA), MiFID II / Dodd-Frank, regulated medical advice. Recording + retention + disclosure posture exists at week 1, or the pilot doesn't get signed.

frequently asked — voice bot · openai realtime api

What buyers ask first.
Real answers, no hedging.

What is a voice bot?

A voice bot is a real-time conversational AI agent that answers inbound calls, understands speech, retrieves answers from a controlled knowledge corpus, and hands off to a human when confidence is low. It is not an IVR menu — IVR routes by touch-tone or fixed phrases. It is not a chatbot with text-to-speech bolted on — a voice bot streams audio bidirectionally with sub-second latency. This case study runs gpt-realtime-2 with function-calling into a help-center corpus.

How does the OpenAI Realtime API differ from chained STT + LLM + TTS?

The Realtime API is a single bidirectional speech-to-speech model: audio in, audio out, with intermediate function-calling. Chained STT + LLM + TTS introduces ~800ms to 1.5s of stage latency from sequential calls. The Realtime API gets p95 first-token under 600ms on a typical contact-center setup. Tradeoff: chained gives you per-stage observability and per-stage swap-out; the Realtime API is faster but more vendor-locked.

How much does a voice bot cost per call?

About $0.10 per call on this build (median ~3.2 minute call, gpt-realtime-2 at OpenAI's published audio pricing, Twilio inbound + outbound costs). Compared to a $4 fully-loaded human-agent baseline (including QA, scheduling, supervision), payback per deflected call is ~40×. At 41k calls/month, total run-rate is roughly $4,100 in OpenAI spend + $2,800 in Twilio carriage + $1,200 in observability/infra.
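The arithmetic behind these figures can be rolled up in a few lines. This is a sketch using only the numbers quoted in the answer above; the variable names and the even split across calls are our assumptions. The OpenAI line alone works out to $0.10 per call, and all three line items together come to roughly $0.198 per call over the same 41k calls, so check which denominator applies before reusing the $0.10 figure in your own model.

```python
# Figures quoted in the answer above; the roll-up itself is illustrative.
CALLS_PER_MONTH = 41_000
MONTHLY_SPEND_USD = {
    "openai": 4_100,
    "twilio": 2_800,
    "observability_infra": 1_200,
}
HUMAN_COST_PER_CALL_USD = 4.00
BOT_COST_PER_CALL_USD = 0.10


def per_call(spend_usd: float, calls: int) -> float:
    """Spread a monthly spend evenly across the month's calls."""
    return spend_usd / calls


total_monthly = sum(MONTHLY_SPEND_USD.values())
openai_per_call = per_call(MONTHLY_SPEND_USD["openai"], CALLS_PER_MONTH)
all_lines_per_call = per_call(total_monthly, CALLS_PER_MONTH)
payback_multiple = HUMAN_COST_PER_CALL_USD / BOT_COST_PER_CALL_USD

print(f"total monthly run-rate:  ${total_monthly:,}")
print(f"OpenAI spend per call:   ${openai_per_call:.3f}")
print(f"all line items per call: ${all_lines_per_call:.3f}")
print(f"payback multiple:        {payback_multiple:.0f}x")
```

Swap in your own call volume and line items to see how sensitive the payback multiple is to telephony and infra spend.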

How fast is the OpenAI Realtime API?

P95 first-token latency 580ms end-to-end (caller speaks → first audio token streamed back), measured at the SIP-edge. P95 turn-taking latency (caller stops → bot starts) 920ms with VAD calibration. The 580ms first-token target was the kill-point gate at week 5; we paused the build until we hit it consistently across 4 phone carriers in 3 US regions.
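A gate like the week-5 kill point can be expressed as a per-cell check: compute p95 first-token latency for each carrier and region pair, and pass only if every cell meets the target. The sketch below assumes a nearest-rank percentile and a simple `(carrier, region, latency_ms)` row shape; neither is specified in this case study, and the function names are hypothetical.

```python
from collections import defaultdict

TARGET_P95_MS = 580  # first-token gate quoted above


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def gate(measurements, target_ms=TARGET_P95_MS):
    """Group (carrier, region, latency_ms) rows and check each cell's p95.

    Returns (passed, per_cell_p95). passed is True only when every
    carrier/region cell is at or under the target.
    """
    cells = defaultdict(list)
    for carrier, region, latency_ms in measurements:
        cells[(carrier, region)].append(latency_ms)
    per_cell = {cell: p95(values) for cell, values in cells.items()}
    passed = all(value <= target_ms for value in per_cell.values())
    return passed, per_cell


# Made-up sample rows, for illustration only.
rows = (
    [("carrier_a", "us-east", ms) for ms in range(400, 600, 10)]
    + [("carrier_b", "us-west", ms) for ms in range(300, 500, 10)]
)
passed, per_cell = gate(rows)
```

Gating per cell rather than on the pooled distribution keeps one slow carrier from hiding behind three fast ones.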

Can a voice bot transfer to a human agent?

Yes — confidence-gated warm transfer is the default. The bot calls a registered handoff tool whenever confidence drops below 0.7, the caller explicitly asks for a human, or the query matches a must-refuse category. The transfer carries call context (intent, retrieved evidence, prior turns) to the human agent so the caller doesn't repeat themselves. This case study's transfer rate is 23% of calls; 77% complete fully bot-handled.
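The three triggers can be sketched as a single decision function. This is a minimal illustration, not the production tool: the category labels, the phrase list, the `Turn` shape, and how a confidence score is derived are all assumptions. Only the 0.7 threshold and the context fields carried on transfer come from this case study.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.7  # threshold quoted above

# Hypothetical category labels and phrases, for illustration only.
MUST_REFUSE = {"billing_dispute", "healthcare_phi", "loan_application_status"}
HUMAN_REQUEST_PHRASES = ("human", "agent", "representative", "real person")


@dataclass
class Turn:
    transcript: str
    intent: str
    confidence: float


def handoff_reason(turn: Turn) -> Optional[str]:
    """Return why this turn should go to a human, or None to stay with the bot."""
    if turn.intent in MUST_REFUSE:
        return "must_refuse_category"
    text = turn.transcript.lower()
    # Naive substring match; a real build would use an intent classifier.
    if any(phrase in text for phrase in HUMAN_REQUEST_PHRASES):
        return "caller_requested_human"
    if turn.confidence < CONFIDENCE_FLOOR:
        return "low_confidence"
    return None


def handoff_payload(turn: Turn, evidence: list, prior_turns: list) -> dict:
    """Context carried to the rep so the caller does not repeat themselves."""
    return {
        "reason": handoff_reason(turn),
        "intent": turn.intent,
        "retrieved_evidence": evidence,
        "prior_turns": prior_turns,
    }
```

Checking must-refuse categories first means a high-confidence answer never overrides a policy that requires a human.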

How does a voice bot integrate with Twilio?

This case study uses Twilio for SIP-edge ingress + outbound carriage. The Twilio media stream pipes WebSocket audio to a Cloudflare Worker that proxies to OpenAI Realtime with an ephemeral session token. Call recording + transcript storage stays in the customer's tenant (Twilio TaskRouter handoff, Twilio recording with customer-managed retention). The OpenAI Realtime layer sees only the active audio stream, never the recording.
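The core of the proxy step is a per-frame translation: take a Twilio Media Streams message and, if it carries caller audio, re-wrap the base64 payload as a Realtime `input_audio_buffer.append` event. The sketch below is in Python for illustration only; the production Worker code is not shown in this case study. It assumes the publicly documented message shapes on both sides, which may differ for the model version used here, and it assumes the Realtime session has already been configured to accept the 8 kHz µ-law audio that Twilio sends.

```python
import base64
import json
from typing import Optional


def twilio_to_realtime(raw_message: str) -> Optional[str]:
    """Translate one Twilio Media Streams message into a Realtime client event.

    Returns a JSON string to forward upstream, or None for messages that
    carry no caller audio (connected, start, mark, stop, outbound track).
    """
    message = json.loads(raw_message)
    if message.get("event") != "media":
        return None
    media = message.get("media", {})
    if media.get("track", "inbound") != "inbound":
        return None
    payload = media.get("payload", "")
    if not payload:
        return None
    # Validate the base64 so a corrupt frame is dropped rather than forwarded.
    try:
        base64.b64decode(payload, validate=True)
    except ValueError:
        return None
    return json.dumps({"type": "input_audio_buffer.append", "audio": payload})
```

Because the payload is forwarded without transcoding, nothing in this step needs to store audio, which matches the point above that the Realtime layer sees only the active stream.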

How long does it take to build a voice bot?

10 weeks for this engagement: 2 weeks discovery + 240-item eval-set freeze, 2 weeks Twilio + Realtime wiring + latency baseline, 2 weeks help-center corpus chunking + retrieval, 1 week handoff tool + confidence gating, 1 week kill-point pause at week 5 (we paused until p95 first-token hit 580ms consistently), 1 week shadow cutover, 1 week production launch + tuning.

When should we NOT ship a voice bot?

Four cases: must-refuse policy isn't drafted (regulatory categories like loan-application status, billing disputes over $X, or healthcare-PHI must route directly to a human — no AI band); TCPA/two-party-consent disclosure isn't on the IVR script (FCC enforcement is unforgiving); the SRE team can't operate a real-time observability + barge-in tuning loop for the first 90 days; transfer to a human agent isn't possible (a voice bot without a transfer fallback is hostile). We turn down engagements that fail any of these.

keep reading

Where this case study
points back to.

Each link below covers a pillar that fed into this voice bot / AI contact center build, or that a similar OpenAI Realtime API build on your stack would draw from.

01 Service

OpenAI Development

OpenAI Realtime API depth, GPT-5 / GPT-5-mini routing, OpenAI voice API integration patterns, Azure OpenAI for regulated deployments. The model pillar this voice bot sits on.

02 Service

AI Voice Agents

The voice-agent pillar — telephony, barge-in tuning, IVR replacement, multilingual support. Same eval-first loop used on this AI contact center build.

03 Service

AI Chatbot Development

Async-text sibling: use when an AI voice bot is the wrong primitive or async-first beats real-time on cost.

04 Service

AI Agent Development

Function-calling, tool surfaces, multi-step agents. The voice bot is one interface; the same agentic stack runs underneath.

05 Case study

All AI Case Studies

Six case studies — voice bot, AI fraud detection, AI knowledge base, AI triage, AI legal assistant, ecommerce chatbot. Same operator detail across every page.

06 Service

AI Consulting

Fixed-fee voice bot / AI contact center audit. We'll model your call volume against the cost slider above and tell you honestly if AI customer service software pays back.

07 Service

AI Development Services

How a Realtime voice agent fits inside a broader AI development services engagement — STT + GPT + TTS + telephony + CRM integration + monitoring.

Ready to ship

Want a voice bot like this
for your AI contact center?

Book a fixed-fee voice bot / AI contact center audit. We'll review the inbound voice workflow, model your call volume against the cost slider on this page, scope the eval set, recommend a stack (OpenAI Realtime API / chained / hybrid), project run-cost, and tell you honestly whether AI customer service software is the right shape for your traffic, or whether an AI-powered call center buy beats a build. About one audit in five ends with "keep the humans, here's the smaller automation we'd ship instead."

Read the voice agents pillar

30 min, async or live · Cost-math reconciled to your real volume · Walk-away point in the pilot

Updated May 20, 2026 · By Navin Sharma
