OpenAI Realtime API
Speech-to-speech · Sub-600ms TTFT · barge-in · in-app voice copilots
AI voice agent development company shipping production AI voice agents on three stacks: realtime voice ai on the OpenAI Realtime API (speech-to-speech), chained STT+LLM+TTS, and Deepgram Voice Agent. Sub-600ms first-token latency, $0.005–$0.06/min honest cost range, eval-gated against your real call recordings. Telephony (Twilio · Telnyx · LiveKit · Daily) and mobile-native voice inside Flutter / iOS / Android / web. First pilot live in 5–7 weeks, behind a feature flag, with a walk-away point if the metric won't move.
A sub-second AI voice agent is a real-time conversational system that completes a full speech-in to speech-out round-trip in roughly 600 milliseconds, the latency budget at which a phone call feels natural rather than robotic. Built on streaming speech-to-speech models like OpenAI Realtime (gpt-realtime-2) or Gemini Live, it pairs server-side voice activity detection for interruption handling with function-calling against business systems mid-turn. Unlike interactive voice response (IVR) trees, a voice agent reasons over open-ended speech rather than matching a fixed phone-tree branch. Unlike traditional Whisper-plus-TTS chains, the streaming model handles speech end-to-end without per-leg latency stacking. Production deployments use Twilio for telephony, a hybrid retrieval index (pgvector + BM25) for grounded answers, and Langfuse for per-call eval logging.
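One common way to merge a vector index and a BM25 index into a single ranking is reciprocal rank fusion. The sketch below is an assumption about how such a hybrid could be combined, not a description of the production retrieval code. The document IDs and the constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it returns
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is stable
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from each retriever for one caller question
vector_hits = ["refund-policy", "store-hours", "billing-faq"]
bm25_hits = ["store-hours", "password-reset", "refund-policy"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that both retrievers return rise to the top, which is the property a grounded voice answer usually wants when only a few passages fit in the prompt.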
Every voice ai agent pattern below has been shipped from this exact playbook: telephony inbound and outbound, in-app voice copilots, IVR replacement, multilingual support, voice-first ecommerce. Each one comes with an eval suite, audit logging, barge-in tuning, and a per-minute cost target — not a demo reel.
The crown-jewel use case for voice ai for customer service. A real phone number (Twilio, Telnyx, or your existing SIP trunk) answered by a voice ai agent that handles tier-1 calls: appointment changes, order status, billing balance, store hours, password resets. Confidence gate at 0.7; anything below escalates to a human queue with a structured handoff summary, not a re-asked greeting. We ship these with barge-in, hold detection, and DTMF fallback for callers who hit a frustration ceiling.
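The confidence gate and structured handoff described above can be sketched roughly as follows. Only the 0.7 threshold comes from the text; the `Turn` fields, the `route` function, and the summary keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

CONFIDENCE_GATE = 0.7  # threshold quoted in the text


@dataclass
class Turn:
    intent: str
    confidence: float
    slots: dict = field(default_factory=dict)  # details collected so far


def route(turn, transcript):
    """Return ("agent", None) or ("human", handoff_summary)."""
    if turn.confidence >= CONFIDENCE_GATE:
        return "agent", None
    # Below the gate: hand the human what the agent already knows,
    # so the caller is not asked to start over.
    summary = {
        "intent_guess": turn.intent,
        "confidence": turn.confidence,
        "collected": turn.slots,
        "last_caller_utterance": transcript[-1] if transcript else "",
    }
    return "human", summary
```

For example, `route(Turn("billing_balance", 0.55, {"account": "1234"}), ["I think I was charged twice"])` escalates and carries the account number and the last utterance into the human queue.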
Outbound voice agents for appointment reminders, payment-due nudges, NPS surveys, and lead qualification. Per-call cost lands at $0.05–$0.15 (vs $4–$8 for a human callback) and you control the cadence + script + escalation rules. Compliance baked in: TCPA-aware time windows, opt-out detection on every turn, full call recording with consent prompts where required. Built on a chained stack to keep cost low at outbound scale.
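A minimal sketch of the two compliance checks named above: a calling-window test and per-turn opt-out detection. The 8 a.m. to 9 p.m. callee-local window is an assumption based on the federal telemarketing rule (state rules can be stricter), and the phrase list is illustrative only. A real deployment would confirm both with counsel.

```python
from datetime import datetime, time

WINDOW_START = time(8, 0)   # assumption: 8 a.m. in the callee's local time
WINDOW_END = time(21, 0)    # assumption: 9 p.m. in the callee's local time
OPT_OUT_PHRASES = (         # illustrative, not an exhaustive list
    "stop calling",
    "do not call",
    "remove me",
    "unsubscribe",
)


def within_calling_window(callee_local_now):
    """True if a call may be placed at this callee-local datetime."""
    return WINDOW_START <= callee_local_now.time() < WINDOW_END


def is_opt_out(utterance):
    """Run on every caller turn; True means stop and record the opt-out."""
    text = utterance.lower()
    return any(phrase in text for phrase in OPT_OUT_PHRASES)
```

The caller passes a datetime already converted to the callee's timezone, which keeps the timezone lookup (by area code, address on file, or similar) outside this check.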
Voice agents that live inside your iOS / Android / Flutter / web app, not a phone call. Conversational ai voice for hands-free interfaces in food delivery, automotive, field service, and accessibility surfaces. Mobile-native delivery is a differentiator: we wrote the UI kit a lot of Flutter screens run on, so the voice surface ships with proper interrupt handling and mic-permission UX, not a half-baked WebRTC bolt-on.
Direct replacement of legacy IVR menus with a single open-ended voice ai agent that routes by intent instead of by keypad. The average call shortens by 40–90 seconds when callers can just say what they want, and the same agent can answer the question the caller would have been routed to anyway. Plays cleanly inside existing contact-center platforms (Genesys, Five9, Amazon Connect) via SIP REFER or media-streaming.
Voice agents that operate across English, Spanish, Hindi, Portuguese, Arabic, Mandarin, French, and German out of the box. Chained stacks shine here: Deepgram or Whisper for STT in the caller's language, GPT-5 or Sonnet 4.6 for the reasoning step (both genuinely multilingual without separate models), and ElevenLabs or Cartesia for TTS that doesn't sound like a 2018 phone tree. Language detection on the first turn, no menu required.
Voice agents for hands-busy commerce — kitchen-side reordering, drive-thru, in-car, accessibility-first browsing. Grounded in your Shopify / WooCommerce / commercetools catalog with function calls into order-management, payment-tokenization, and 3PL APIs. Realtime API is the right choice here when TTFT under 600ms is non-negotiable for the natural conversational feel customers expect.
The single biggest decision in voice ai development is which architecture you pick. Most listicles dodge it. OpenAI Realtime API, chained STT+LLM+TTS, and unified-vendor stacks each win different workloads. An honest per-dimension comparison is below; the audit picks the architecture for your workload.
Numbers are typical production traces from shipped pilots. Per-workload picks vary with your eval data and call volume.
Five tactics stacked, ordered by impact on chained voice stacks. Most voice pilots see per-minute cost drop 60–80% at the same eval-suite quality after this pass. Realtime API has a tighter optimization surface (~$0.06/min floor); chained is where the real cost engineering lives. This optimization pass is included in every voice agent pilot, post-cutover.
The voice agent is half the story; the telephony layer underneath is the other half. Twilio for ubiquity (twilio voice ai via Media Streams is the most common production path we ship), Telnyx for margin on high-volume outbound, LiveKit for in-app voice copilots that share a codebase across mobile + web + phone, Daily for the most opinionated open-source stack via Pipecat. Pick a platform to see the integration code, auth model, and timeline.
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Connect
# 1. Incoming call → TwiML returns <Connect><Stream/> to our WS endpoint
def incoming_call(request):
    response = VoiceResponse()
    connect = Connect()
    connect.stream(url="wss://voice.api.example.com/media")
    response.append(connect)
    return str(response)
# 2. Media WS receives 8kHz µ-law audio frames every 20ms
# 3. We re-encode to PCM, send into OpenAI Realtime or Whisper
# 4. TTS audio (ElevenLabs Flash) streams back to Twilio media WS
# 5. Barge-in cancels in-flight TTS within ~280ms via VAD signal
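Steps 2 and 3 above (receiving µ-law frames and re-encoding to PCM) can be sketched in plain Python. This assumes the Twilio Media Streams message shape, where each `media` event carries a base64 µ-law payload. The function names are hypothetical, and the decoder is a straightforward G.711 µ-law expansion written by hand rather than a library call.

```python
import base64
import json


def ulaw_byte_to_pcm16(b):
    """Expand one G.711 µ-law byte to a signed 16-bit PCM sample."""
    u = ~b & 0xFF                 # µ-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw):
    """Return PCM samples for a 'media' event, or [] for any other event."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return []
    ulaw = base64.b64decode(msg["media"]["payload"])
    return [ulaw_byte_to_pcm16(b) for b in ulaw]


# One 20 ms frame at 8 kHz is 160 µ-law bytes; 0xFF encodes silence
silent_frame = json.dumps(
    {"event": "media",
     "media": {"payload": base64.b64encode(bytes([0xFF] * 160)).decode()}}
)
samples = decode_media_message(silent_frame)
```

The resulting samples are still 8 kHz. Any resampling the downstream model needs is a separate step and is not shown here.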
A chained voice ai agent is three vendors stacked — STT in, LLM in the middle, TTS out. Each layer has 2–3 production-grade choices in 2026 with meaningful tradeoffs on cost, latency, and language coverage. Here's the default stack we ship; we re-pick per workload when the eval data demands it.
Deepgram Nova-3 is our default for telephony (8kHz µ-law, streaming, sub-200ms partial transcripts, multilingual). Whisper-large-v3 still wins on rare-language coverage and noisy environments. We pick per workload. OpenAI Realtime API rolls its own STT internally so this layer disappears entirely when you go speech-to-speech. We benchmark on your real call audio before recommending; vendor benchmarks lie about phone-quality audio constantly.
GPT-5 / Sonnet 4.6 for grounded multi-turn reasoning, function calling, and tool use (the part that turns a voice ai agent into something that does work, not just talks). GPT-5-mini or Haiku 4.5 for cheap classify + routing layers in front. Realtime API ties you to OpenAI's reasoning model — fine when latency dominates the requirement, less fine when your eval data says Claude wins for your domain. Model-agnostic by default; locked-in only when you ask for it.
ElevenLabs Flash v2.5 for sub-200ms TTFT and natural prosody (the default for in-app voice copilots and premium telephony). Cartesia Sonic for the cheapest-per-character production-grade voice (winning on long outbound calls where per-minute cost dominates). OpenAI TTS (gpt-4o-mini-tts) when you want a single-vendor stack. The choice is a $/min + brand-voice decision, not a quality one. All three sound human in 2026.
The disqualifying signal when shortlisting any voice ai company is asking how they measure quality and getting marketing-speak back. Here are the four numbers we publish on every voice agent pilot, measured on a held-out eval set from your real call recordings. Below the targets, we don't roll out. We tune.
On a held-out eval set of 200+ real call snippets from your domain, what fraction does the agent classify and respond to correctly? Most production voice ai agents we see land at 84–93% on tier-1 intents after tuning. Below 80% on a frozen eval set means the prompt or retrieval needs work before rollout, not after.
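As a rough sketch, the intent half of this metric is a plain accuracy score over the frozen eval set, with the 80% floor from the text as the rollout gate. The function names are hypothetical, and scoring whether the spoken response was also correct would need a separate grader that is not shown.

```python
ROLLOUT_FLOOR = 0.80  # "below 80% ... needs work before rollout"


def intent_match_accuracy(expected, predicted):
    """Fraction of eval snippets whose predicted intent matches the label."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("eval set must be non-empty and aligned")
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


def ready_for_rollout(accuracy):
    return accuracy >= ROLLOUT_FLOOR


# Tiny illustrative eval set; a real one would hold 200+ snippets
labels = ["order_status", "billing", "hours", "reschedule", "billing"]
preds = ["order_status", "billing", "hours", "reschedule", "hours"]
accuracy = intent_match_accuracy(labels, preds)
```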
How often does the agent invent a policy, a balance, opening hours, or a product feature that doesn't exist? Measured by sampled human review (200 calls/week minimum during pilot). Target is <2 per 100 calls before production rollout. Anything above 5 means your RAG grounding is missing the source documents the agent needs.
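The arithmetic behind this metric is simple enough to sketch. The two thresholds (under 2 per 100, above 5 per 100) come from the text; the function names and the three status labels are hypothetical.

```python
def hallucinations_per_100(flagged_calls, reviewed_calls):
    """Rate of reviewed calls in which a human reviewer flagged an invention."""
    if reviewed_calls <= 0:
        raise ValueError("need at least one reviewed call")
    return flagged_calls * 100 / reviewed_calls


def hallucination_status(rate):
    if rate < 2:
        return "rollout_ok"       # under the production target
    if rate > 5:
        return "fix_grounding"    # likely missing source documents
    return "keep_tuning"          # between the two thresholds


weekly_rate = hallucinations_per_100(flagged_calls=3, reviewed_calls=200)
```

With 3 flagged calls in a 200-call weekly sample the rate is 1.5 per 100, which clears the target.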
When the caller interrupts the agent mid-sentence, how fast does the agent stop talking? Target: <300ms p50, <500ms p95. Above 500ms and the conversation feels robotic; callers start over-talking instead of waiting. This is the metric that decides whether a voice agent feels alive or feels like 2015 IVR.
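One way to check the p50 and p95 targets above against logged interruptions is a nearest-rank percentile over the measured stop latencies. The sample values below are made up for illustration, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def barge_in_passes(stop_latencies_ms):
    """Targets from the text: p50 under 300 ms and p95 under 500 ms."""
    return (percentile(stop_latencies_ms, 50) < 300
            and percentile(stop_latencies_ms, 95) < 500)


# Illustrative latencies (ms) between caller speech onset and agent silence
latencies = [210, 240, 250, 260, 270, 280, 290, 310, 350, 480]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

Reporting both percentiles matters because a good median can hide a slow tail, and the tail is what callers remember.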
From call start to the moment the agent does the thing the caller wanted (booked the appointment, looked up the order, queued the refund). Target depends on the workflow, but we publish the baseline you'll be measured against during the audit. Most replacements of human tier-1 calls cut MTTA by 40–70% just by removing the queue wait.
Same pricing as our other engagements. Most voice ai company shortlists hide pricing. We publish it. Audit first to scope architecture + telephony, run a 5–7 week pilot on the highest-ROI workflow, then continuous if you want to ship the next 2–3.
Find the voice workflow worth shipping, in the right architecture, before any build commitment.
One voice ai agent shipped end-to-end on your chosen architecture and telephony stack, with eval data, not a demo reel.
Embedded squad shipping new voice workflows + tuning the live ones.
Four stages, milestone-billed, with a walk-away point at the vendor-pick decision. Most voice agent failures happen because the team picked the architecture by ideology, not eval. Here the eval set and the vendor pick both land in weeks 1 and 2, not after the pilot bill arrives.
Harvest 50–200 real call snippets from your call recordings (or run a structured intake if you're greenfield). Build the eval set the voice ai agent will be measured against — intent-match accuracy on each call type, expected MTTA per workflow. Scope locked: which call types in, which out, who escalates to whom.
Wire the chosen telephony stack (Twilio · Telnyx · LiveKit · Daily) into a sandbox number. Run STT options against your real call audio (Whisper vs Deepgram vs Realtime) and pick per WER on YOUR audio, not vendor benchmark audio. TTS A/B for brand voice. RAG corpus ingested if grounding is in scope.
Wire the full anatomy: telephony → STT → LLM (with tool use + RAG) → TTS → guardrails → log. Tune the hard parts: barge-in thresholds (<300ms p50), endpointing (when has the caller stopped talking?), hold-music handling, DTMF fallback, escalation handoff with structured summary. Behind a feature flag on a sub-100-call cohort.
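Endpointing, the "has the caller stopped talking?" question above, is often approximated with a trailing-silence rule over per-frame VAD decisions. The sketch below assumes 20 ms frames and a 500 ms silence threshold. Both values and the function name are illustrative, and production systems typically tune the threshold per workflow.

```python
def find_end_of_turn(vad_frames, silence_ms=500, frame_ms=20):
    """Return the frame index where the turn is declared over, else None.

    vad_frames is an iterable of booleans, True when a frame holds speech.
    """
    needed = silence_ms // frame_ms   # consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0            # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None                       # still talking, or never started


# 200 ms of lead-in silence, 600 ms of speech, then 500 ms of silence
frames = [False] * 10 + [True] * 30 + [False] * 25
end_index = find_end_of_turn(frames)
```

A shorter threshold makes the agent feel quicker but cuts off callers who pause mid-sentence, so it is usually tuned against the same real recordings as the eval set.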
Run the 4-metric eval against shadow traffic for 2 weeks. Hallucination review by human sample. Roll out 10% → 50% → 100% gated by eval movement. Token-optimization pass post-cutover: route to cheaper models per turn, cache the system prompt, idle-trim. Most voice pilots see per-minute cost drop 60–80% at the same eval-suite quality.
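The 10% → 50% → 100% rollout gated by eval movement can be sketched as a stage ladder plus a deterministic cohort check. Hashing the call ID into a 0–99 bucket is an assumption about how the feature flag might be implemented, and the names below are hypothetical.

```python
import hashlib

STAGES = [10, 50, 100]  # percent of traffic, as listed in the text


def in_cohort(call_id, percent):
    """Deterministically place a call into the rollout cohort."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # 0..99
    return bucket < percent


def next_stage(current, eval_passed):
    """Advance one stage only when the eval gate passes; never skip ahead."""
    if not eval_passed:
        return current
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]
```

Because the bucket depends only on the call ID, a call that falls in the 10% cohort stays in the cohort at 50% and 100%, which keeps before-and-after comparisons clean.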
Four anonymized capability patterns drawn from real voice ai agent engagements: two on the Realtime API (inbound telephony and mobile-native), one chained (outbound telephony), and one chained-multilingual (IVR replacement). Named references shared under NDA once we know what you're building.
Tier-1 voice queue averaging 4-minute wait at peak. Five inbound questions accounted for 62% of call volume. Existing IVR bouncing 80%+ to a human. The binding constraint was sub-700ms first-token; anything slower and US callers report a robot.
gpt-realtime-2 speech-to-speech over a help-center pgvector RAG, Twilio + Cloudflare Workers for sub-60ms ingress, handoff_to_human as a function-calling tool with confidence gating at 0.7. Whisper fallback for accent + noise. Kill point caught + fixed at week 5 (multilingual cache invalidation).
Customers re-ordering favourites mid-commute or hands-busy in the kitchen had a 7-tap nested-menu reorder flow. Drop-off at step 3 was 38%. Voice was the obvious primitive, but every hosted voice SDK added 200–400ms of round-trip that broke the conversational feel.
OpenAI Realtime API integrated directly into the Flutter shell via a thin Dart-to-WebRTC bridge. Function calls into the existing order-management API (no rewrite required). Sub-600ms TTFT measured end-to-end on the user's actual device, not on a wired desktop. Barge-in handled natively. Mic-permission UX and on-screen visual feedback shipped in the same release.
Multi-location clinic running 4,000 outbound reminder calls weekly through a human contact-center, each call $4–6 fully-loaded. Reschedule rate captured at the call was <30% because reps couldn't see the calendar live.
Chained stack (Telnyx + Deepgram Nova-3 + GPT-5-mini + ElevenLabs Flash) for outbound cost economics. Function calls into the EHR's scheduling API to offer real open slots live during the call. TCPA-aware calling windows, opt-out detection per turn, audit logging of every consent prompt. Escalation to a human queue when the caller asked anything outside reschedule scope.
Legacy IVR for a multi-country retail brand routed calls across 6 languages with 4-level keypad menus. Average pre-agent wait was 90 seconds; 22% of callers abandoned before reaching a human; non-English callers fared worst because the menu was English-first.
Single open-ended voice agent answers in the caller's language (detected on first turn) and routes by intent, not by keypad. Chained stack: Whisper-large-v3 for STT (best rare-language coverage), Sonnet 4.6 for multilingual reasoning, ElevenLabs Multilingual v2 for TTS. Integrated into the existing Genesys contact center via SIP REFER so the human-handoff queues didn't change.
Voice is rarely the only AI surface you're building. These sibling pages cover the adjacent decisions: text channels, multi-step agents, the Realtime API itself, and mobile-native delivery.
The text-channel sibling: when async beats voice or voice is the wrong primitive.
Multi-step agents with tool use — voice is one interface to the same agentic stack.
A deeper look at OpenAI Realtime API development: the speech-to-speech path.
Plug your voice agent into Salesforce, Zendesk, HubSpot, EHR, OMS.
Mobile-native voice copilots inside your Flutter / iOS / Android app.
When the voice turn is one step in a larger agentic process automation pipeline.
Pre-build discovery audit when voice may or may not be the right primitive (often, it's not — we'll say so).
When voice is one channel in a wider custom AI build — app shell, retrieval, agent layer, monitoring. The umbrella AI development company pillar.
Voice agents run the same multi-step planning loop as text agents. The reliability rubric (pass@1, error recovery, mean steps) carries over directly.
Book a free AI voice agent audit. We'll review your call recordings or use-case shortlist, recommend the architecture (Realtime · chained · unified) and telephony stack (Twilio · Telnyx · LiveKit · Daily), project your per-minute cost, design the 4-metric eval set, and give you a 90-day voice agent roadmap. No deck, no obligation to build.