E-commerce · DTC apparel · Flutter mobile · Voice copilot · OSS Flutter widget · A/B-evaluated
gpt-realtime-2 · Flutter 3.24 · GetWidget OSS · 4.8k★ · Algolia · Cloudflare Workers
case study · 2026 · anonymized

A Flutter voice copilot case study, where the chat
is voice.

A mid-market DTC apparel retailer's Flutter app was converting 18 points below desktop. The in-app search UX score was 2.8/5, and the team had already run two on-device voice A/B tests that failed on the trigger UX. We shipped a tap-to-talk voice copilot on gpt-realtime-2, function-calling into the existing Algolia facet index, embedded via a new GFVoiceCopilot widget in our open-source Flutter UI kit. Eight weeks, A/B-evaluated, with a kill point at week 5 that we used.

+11.4 pts
mobile conversion · voice-engaged vs control · n=42,318 sessions over 30d A/B · ±1.6pt CI
p95 580 ms
first-token end-to-end on iPhone 13 + Pixel 7 · cellular and Wi-Fi blend
2.8 → 4.2
in-app search UX score on the voice cohort · n=812 post-session prompts
8 weeks
discovery to 50% A/B rollout · 1 cart-abandon halt at wk 5
shipped
8 weeks · 3 Flutter engineers · 1 AI engineer · 1 product designer
Summary

What this case study shows

A DTC apparel ecommerce client shipped a voice-driven shopping copilot inside their 1.4M-MAU Flutter app, using OpenAI Realtime over WebRTC and Cloudflare-minted ephemeral keys. Across n=42,318 sessions in a 30-day A/B, mobile conversion on voice-engaged sessions lifted +11.4 percentage points (±1.6 pp CI at 95%). Stack: gpt-realtime-2, Flutter 3.24, GetWidget Flutter UI Kit (OSS), Algolia, Cloudflare Workers. Tap-to-talk with on-device VAD, function-calls into the existing Algolia facet index, live product-grid re-render. Shipped as the open-source GFVoiceCopilot widget. This is one shape of a broader voice agent build; the same pipeline carries over to in-app, telephony, and kiosk surfaces.

in-app · synthetic replay

Tap-to-talk,
the grid responds.

The phone mock alongside is a stylised replay of one voice-engaged session — partial transcript surfaces as the user speaks, the grid narrows by facet, recommendations slide in. Real sessions are sub-second to first token; the animation here is deliberately slower than production so the sequence is legible.

  • 01 Tap-to-talk fires on-device VAD and opens the WebRTC channel.
  • 02 Partial transcript surfaces in the chip overlay as the user speaks.
  • 03 Function call hits Algolia · facets narrow live.
  • 04 Recommendations stream back · grid re-renders without rebuild.
−18 pts
mobile vs desktop conversion gap before the build
2.8 / 5
in-app search UX rating from a 1,200-user survey
2x failed
prior on-device voice A/B tests · trigger UX rejected
4.8k★
GetWidget OSS Flutter UI kit · the OSS foundation this voice surface ships on
the problem

A Flutter app
losing the desktop crown.

Our client: a mid-market US DTC apparel retailer with a Flutter-first mobile app, ~1.4M MAU, 71% of total sessions on mobile but converting 18 points below desktop. In-app search UX scored 2.8/5. The head of product called it the loudest signal in the dashboard. The constraint wasn't model picks; it was that the team had already failed twice on voice UX, and the bar for a third try was high.

today vs · with the voice copilot

today
Shopper opens app → Browse tab (scroll · tap · type) → Touch search (2.8/5 UX score) → Refine + facet
outcome: −18 pts vs desktop · cart abandons on category browse

with the voice copilot
Shopper opens app → Tap-to-talk · GFVoiceCopilot (on-device VAD · barge-in) → gpt-realtime-2 streaming (function-calls into Algolia) → Grid re-renders live
outcome: Add-to-cart · voice cohort
outcome: Refine + browse · grid narrows
outcome: Handoff · chat or human

Two prior voice attempts had failed. Failure one was a hot-word listener. Battery drain showed up in App Store reviews. Failure two was a chained STT → LLM → TTS stack with 1.4s first-token latency; the conversational feel broke completely. Hosted voice SDKs (Vapi, Retell, Synthflow) all added 200–400ms of vendor round-trip and wanted to own the UI affordance. The retailer's product team wanted to own it. The product head's framing at kickoff was direct: third strike and we don't try voice again for a year. We didn't pitch a hosted SDK. We pitched a widget, in our OSS Flutter library, with the audio path going through Cloudflare Workers ephemeral-key minting straight into OpenAI Realtime over WebRTC.

two prior A/B tests · why they failed
attempt 1 · hot-word · battery drain
attempt 2 · chained stack · 1.4s TTFT
hosted SDK round-trip · +200–400ms
this build · TTFT p95 · 580ms

Third strike rule: if tap-to-talk doesn't feel sub-second and the trigger UX annoys, voice is dead on this app for a year.

discovery · week 1

The thing that's killed every prior voice test on this app isn't the model. It's the trigger UX. We don't need a clever STT stack. We need a button people will tap, an affordance that looks right next to our existing design tokens, and a turn that feels under one second. If the button feels wrong, voice is dead for a year on this app.

Head of Product · DTC apparel · Flutter mobile · 1.4M MAU
the approach · voice copilot pipeline

Voice copilot pipeline: six stages,
one widget on top.

We routed tap-to-talk through on-device VAD → WebRTC over a Cloudflare-minted ephemeral key → gpt-realtime-2 streaming → function-calls into the existing Algolia index → grid re-renders live as the model speaks. Our fallback is Whisper-large-v3 over chunked HTTP for the ~1.4% of cellular sessions where WebRTC degrades. Audio output streams back over the same WebRTC channel; barge-in flushes cleanly with a response.cancel event.
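The barge-in step can be sketched in a few lines. This is a minimal illustration, assuming client events are sent as JSON over the Realtime data channel; `BargeInController`, `send`, and `stopLocalPlayback` are hypothetical names for this sketch, not the GFVoiceCopilot internals. Only the `response.cancel` event type comes from the text above.

```dart
import 'dart:convert';

/// Hypothetical controller: cancels the in-flight response when the user
/// taps during playback. `send` stands in for the data-channel writer.
class BargeInController {
  BargeInController({required this.send, required this.stopLocalPlayback});

  final void Function(String jsonEvent) send;
  final void Function() stopLocalPlayback;

  /// Set to true by the host while the model is still speaking.
  bool responseInFlight = false;

  /// Returns true if a response was actually cancelled.
  bool onUserTap() {
    if (!responseInFlight) return false;
    stopLocalPlayback(); // cut audio locally first so the UI reacts at once
    send(jsonEncode({'type': 'response.cancel'}));
    responseInFlight = false;
    return true;
  }
}

void main() {
  final sent = <String>[];
  var stopped = false;
  final controller = BargeInController(
    send: sent.add,
    stopLocalPlayback: () => stopped = true,
  );

  controller.responseInFlight = true;
  print(controller.onUserTap()); // true
  print(sent.single); // {"type":"response.cancel"}
  print(stopped); // true
  print(controller.onUserTap()); // false: nothing left to cancel
}
```

Stopping local playback before sending the event is a design choice in this sketch: the user hears silence immediately even if the cancel takes a round-trip to land.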

three decisions that shaped the build
design decision · 01

Tap-to-talk, not always-on listening

we rejected
Hot-word triggered listening
because
The two prior on-device voice A/B tests failed on the always-on UX. Users felt watched, mic-permission dialogs lit up, and battery drain showed up in the support tickets. Tap-to-talk is the explicit user action; the visual affordance is what the trust math turned on.
design decision · 02

WebRTC primary, chunked HTTP fallback

we rejected
WebRTC only · degrade silently if it fails
because
Cellular networks in the US retail demographic drop WebRTC handshakes more often than the listicle benchmarks suggest. We added a Whisper-STT-over-HTTP fallback that fires after two failed handshakes; the user never sees a degraded transport, just a slightly slower turn.
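The fallback rule above can be sketched as a small state holder. This is an illustrative sketch, not the shipped transport code: `TransportSelector` and `recordHandshake` are hypothetical names, and the threshold of two failed handshakes is taken from the text.

```dart
enum Transport { webrtc, chunkedHttp }

/// Hypothetical selector: stays on WebRTC until `maxFailedHandshakes`
/// consecutive handshakes fail, then pins the session to chunked HTTP.
class TransportSelector {
  TransportSelector({this.maxFailedHandshakes = 2});

  final int maxFailedHandshakes;
  int _failed = 0;

  Transport get current => _failed >= maxFailedHandshakes
      ? Transport.chunkedHttp
      : Transport.webrtc;

  void recordHandshake({required bool success}) {
    if (current == Transport.chunkedHttp) return; // already fell back
    _failed = success ? 0 : _failed + 1;
  }
}

void main() {
  final selector = TransportSelector();
  selector.recordHandshake(success: false);
  print(selector.current); // Transport.webrtc
  selector.recordHandshake(success: false);
  print(selector.current); // Transport.chunkedHttp
}
```

Pinning the session once it has fallen back avoids flapping between transports mid-conversation; whether the production widget retries WebRTC later in the session is not stated in the source.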
design decision · 03

Function-call into existing Algolia, no facet rewrite

we rejected
Build a new vector-search index for the catalog
because
The retailer's existing Algolia index was tuned over three years of merch experiments. Rebuilding it would have lost institutional knowledge encoded in synonyms, redirect rules, and merchandising overrides. The voice agent function-calls into the same index a human typing into the search bar would hit.
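To make the "function-call into the same index" idea concrete, here is a hedged sketch of a tool definition and the mapping from tool-call arguments to Algolia search parameters. The tool name `search_catalog`, the argument names, and the `fit` and `price` attribute names are assumptions for illustration; the retailer's real schema is not shown in this case study. `query`, `facetFilters`, `numericFilters`, and `hitsPerPage` are standard Algolia search parameters.

```dart
/// Hypothetical tool definition in the flat function-tool shape
/// (type, name, description, JSON Schema parameters).
const searchCatalogTool = {
  'type': 'function',
  'name': 'search_catalog',
  'description': 'Search the existing catalog index with optional facets.',
  'parameters': {
    'type': 'object',
    'properties': {
      'query': {'type': 'string'},
      'fit': {'type': 'string'},
      'max_price': {'type': 'number'},
    },
    'required': ['query'],
  },
};

/// Maps the model's tool-call arguments onto Algolia search parameters.
/// `fit` and `price` are assumed attribute names in the index.
Map<String, Object> toAlgoliaParams(Map<String, Object?> args) {
  final facetFilters = <String>[
    if (args['fit'] is String) 'fit:${args['fit']}',
  ];
  final numericFilters = <String>[
    if (args['max_price'] is num) 'price <= ${args['max_price']}',
  ];
  return {
    'query': (args['query'] as String?) ?? '',
    if (facetFilters.isNotEmpty) 'facetFilters': facetFilters,
    if (numericFilters.isNotEmpty) 'numericFilters': numericFilters,
    'hitsPerPage': 12, // illustrative page size
  };
}

void main() {
  final params = toAlgoliaParams({
    'query': 'oxford shirt',
    'fit': 'relaxed',
    'max_price': 80,
  });
  print(searchCatalogTool['name']);
  print(params);
}
```

Because the mapping only produces ordinary search parameters, synonyms, redirect rules, and merchandising overrides configured on the index keep applying to voice queries.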
why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which subsystem broke. No single conversion number that hides the cause.

Voice copilot model

First-token latency under 600ms p95 + out-of-scope handoff rate on the 30-day A/B. Function-calls hit the existing Algolia index: no catalog rewrite, no facet retag.

Tap-to-talk affordance

Tap-rate per session + abandon-rate on mic-permission. Tap-to-talk only: never hot-word, never always-on.

On-device VAD + capture

VAD latency + barge-in cleanliness. Audio doesn't leave until user taps.

WebRTC transport

Handshake-success + per-turn round-trip. Chunked-HTTP fallback after 2 failed handshakes.

Function-call surface

Tool-call success rate against the existing Algolia 99.9% SLO.

Grid re-render

Time-to-product-visible on the catalog grid.

under the hood

The voice copilot,
tap to product grid.

Tap-to-talk fires the on-device VAD, opens a WebRTC channel to gpt-realtime-2 over a Cloudflare-minted ephemeral key, and streams partial transcript back as the user speaks. Function calls hit the existing Algolia facet index; recommendations stream back and re-render the product grid live.

outcome · primary: Tap → add-to-cart · voice-engaged session lifts +11.4 pts mobile conversion
outcome · neutral: Refine + browse · voice narrows the grid · user keeps tapping touch
outcome · safety: Handoff · chat or human · out-of-scope intent (returns, support) → chat surface

latency budgets are p50/p95 measured on-device (iPhone 13 + Pixel 7) over a 30-day A/B · first-token p95 580 ms end-to-end · sub-1s perceived

on-device VAD
audio doesn't leave until the user taps · no always-on mic
ephemeral keys
Cloudflare Workers mint a sub-second TTL token · no client-side OpenAI secrets
OSS-anchored
the button affordance ships from the public GetWidget Flutter kit · clients can fork
A/B-first
30-day A/B with control before the engineering team accepted any conversion claim
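The ephemeral-key control above implies a client-side freshness check before each connection attempt. A minimal sketch under stated assumptions: `EphemeralKey`, the 800 ms lifetime, and the 100 ms skew allowance are all hypothetical values chosen for illustration; the source only says the TTL is sub-second.

```dart
/// Hypothetical client-side view of a short-lived key minted by the Worker.
class EphemeralKey {
  const EphemeralKey({required this.value, required this.expiresAt});

  final String value;
  final DateTime expiresAt;

  /// Usable only if it will still be valid once `skew` has elapsed,
  /// leaving headroom for the network hop to the Realtime endpoint.
  bool isUsable(
    DateTime now, {
    Duration skew = const Duration(milliseconds: 100),
  }) =>
      now.add(skew).isBefore(expiresAt);
}

void main() {
  final mintedAt = DateTime.utc(2026, 5, 1, 12);
  final key = EphemeralKey(
    value: 'ek_example', // placeholder, not a real key format
    expiresAt: mintedAt.add(const Duration(milliseconds: 800)),
  );

  print(key.isUsable(mintedAt)); // true
  print(key.isUsable(mintedAt.add(const Duration(milliseconds: 750)))); // false
}
```

If the check fails, the client would request a fresh key from the Worker rather than retrying with the stale one.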
the stack · voice commerce

Voice commerce stack: named tools,
OSS where it matters.

The voice surface ships from the GetWidget OSS Flutter library. Clients can read the source, fork the widget, and ship a custom variant if they need to. The model + transport are commercial; the affordance the user touches is open. That split is the credibility moat for this build.

gpt-realtime-2 · OpenAI Realtime API · WebRTC · role: voice + reasoning
Whisper large-v3 · role: STT fallback over chunked HTTP
Flutter 3.24 · role: iOS + Android single codebase
GetWidget UI Kit · OSS · MIT · 4.8k★ · role: voice copilot button · grid widgets
Algolia · role: catalog facet index · existing
Cloudflare Workers · role: ephemeral-key mint · signalling
WebRTC · role: audio transport
Sentry · role: mobile crash + breadcrumb · A/B cohort tag
Mixpanel · role: funnel A/B analytics
how it actually runs

Production shape,
under the hood.

Numbers below are from our current production cut. We measured latency on-device on iPhone 13 + Pixel 7; our cost math uses OpenAI's published gpt-realtime-2 pricing as of May 2026; our eval composition is the A/B-test design we gated on before any rollout. The team holds the same shape on every voice engagement we ship.

latency budget

Per-stage P50 / P95 (ms) · on-device

stage | p50 (ms) | p95 (ms) | tooling
Tap-to-talk · widget render | 16 | 38 | Flutter 3.24 · GetWidget GFVoiceCopilot · MaterialState
On-device VAD + capture | 28 | 64 | Silero-style filter · flutter_sound · 16kHz mono
WebRTC handshake (per-session, amortised) | 240 | 420 | Cloudflare Workers signalling · ephemeral key mint
First audio frame in → model context | 92 | 180 | Cloudflare edge → OpenAI Realtime · steady-state
gpt-realtime-2 first-token latency | 380 | 580 | OpenAI Realtime · streaming TTS in same channel
Function-call → Algolia → grid re-render | 64 | 138 | Worker proxy · 4 read tools · grid diff render
Total (perceived first-token) | ≈ 480 | ≈ 580 | on-device · cellular + Wi-Fi blend · iPhone 13 + Pixel 7

p50/p95 measured from Sentry per-turn breadcrumbs over a 30-day window on the treatment cohort (n ≈ 28,400 voice turns). WebRTC handshake is per-session and amortised across an average of 6.4 turns per session. It doesn't gate the first-token feel after turn 1. SLO is p95 ≤ 600 ms on perceived first-token; current burn ≈ 97%.
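The amortisation and SLO figures in the note above can be reproduced with simple arithmetic. This sketch assumes that "current burn" means observed p95 divided by the SLO threshold; that reading is an interpretation, not something the source spells out. The function names are illustrative.

```dart
/// Handshake cost spread evenly over the turns in an average session.
double amortisedHandshakeMs(int handshakeMs, double turnsPerSession) =>
    handshakeMs / turnsPerSession;

/// Share of the latency SLO consumed by the observed p95.
double sloBudgetUsed(int observedP95Ms, int sloMs) => observedP95Ms / sloMs;

void main() {
  const turnsPerSession = 6.4; // average from the note above

  final p50 = amortisedHandshakeMs(240, turnsPerSession);
  final p95 = amortisedHandshakeMs(420, turnsPerSession);
  final used = sloBudgetUsed(580, 600);

  print('${p50.toStringAsFixed(1)} ms'); // 37.5 ms
  print('${p95.toStringAsFixed(1)} ms'); // 65.6 ms
  print('${(used * 100).toStringAsFixed(0)}%'); // 97%
}
```

Spread over 6.4 turns, the handshake adds roughly 38 ms (p50) to 66 ms (p95) per turn on average, which is why it matters mainly on the first turn of a session.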

lib/widgets/gf_voice_copilot.dart dart
// gf_voice_copilot.dart — GetWidget OSS Flutter package
//
// Drop the voice copilot into any Flutter scaffold. Mic-permission
// UX, the partial transcript chip, the animated waveform, and
// barge-in handling all live in the widget. The host wires the
// callbacks: partial transcript (during), suggestions (after
// the function call resolves), and handoff (out-of-scope intent).

import 'package:flutter/material.dart';

class VoiceCopilotConfig {
  /// First-token latency budget. The widget surfaces a degraded
  /// affordance when the model exceeds it twice in a row.
  final int firstTokenBudgetMs;

  /// Time to wait before falling from WebRTC to chunked HTTP + Whisper.
  final Duration fallbackToHttpAfter;

  /// Honour barge-in (tap-to-cancel mid-response).
  final bool bargeIn;

  const VoiceCopilotConfig({
    this.firstTokenBudgetMs = 600,
    this.fallbackToHttpAfter = const Duration(seconds: 2),
    this.bargeIn = true,
  });
}

class ProductSuggestion {
  final String sku;
  final String title;
  final num priceCents;
  const ProductSuggestion({
    required this.sku,
    required this.title,
    required this.priceCents,
  });
}

typedef OnSuggestions   = void Function(List<ProductSuggestion>);
typedef OnTranscript    = void Function(String partial);
typedef OnHandoff       = void Function(String intent);

class GFVoiceCopilot extends StatefulWidget {
  /// Scopes the function-call surface to this catalog (per-store SKU set).
  final String catalogId;

  /// Latency + fallback behaviour.
  final VoiceCopilotConfig config;

  /// Fires repeatedly as the model surfaces partial transcript.
  final OnTranscript onPartialTranscript;

  /// Fires once per function-call response with the suggestion list.
  final OnSuggestions onSuggestions;

  /// Fires when the agent classifies the intent as out-of-scope
  /// (returns, support, account questions). The host should
  /// navigate to a chat or human surface here.
  final OnHandoff onHandoff;

  const GFVoiceCopilot({
    super.key,
    required this.catalogId,
    required this.onPartialTranscript,
    required this.onSuggestions,
    required this.onHandoff,
    this.config = const VoiceCopilotConfig(),
  });

  @override
  State<GFVoiceCopilot> createState() => _GFVoiceCopilotState();
}
The GFVoiceCopilot widget API exported from the GetWidget OSS Flutter package. Three required callbacks (partial transcript, suggestions, handoff) plus a config object. Mic-permission UX and barge-in are baked in; the host wires intent.
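The `firstTokenBudgetMs` doc comment says the widget degrades its affordance when the model exceeds the budget twice in a row. One way that rule could be tracked is sketched below; `FirstTokenBudgetMonitor` is a hypothetical helper written for this illustration and is not part of the published widget API.

```dart
/// Hypothetical helper mirroring the `firstTokenBudgetMs` doc comment:
/// reports a degraded state after consecutive over-budget turns.
class FirstTokenBudgetMonitor {
  FirstTokenBudgetMonitor({this.budgetMs = 600, this.strikesToDegrade = 2});

  final int budgetMs;
  final int strikesToDegrade;
  int _consecutiveOver = 0;

  bool get degraded => _consecutiveOver >= strikesToDegrade;

  /// Record one turn; any in-budget turn resets the streak.
  void recordTurn(int firstTokenMs) {
    _consecutiveOver = firstTokenMs > budgetMs ? _consecutiveOver + 1 : 0;
  }
}

void main() {
  final monitor = FirstTokenBudgetMonitor();
  for (final ms in [580, 640, 590, 710, 655]) {
    monitor.recordTurn(ms);
    print('$ms ms -> degraded: ${monitor.degraded}');
  }
  // Only the last turn reports degraded: true (710 then 655 are both over).
}
```

Resetting on any in-budget turn keeps a single slow cellular turn from changing what the shopper sees.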
unit economics

Per-session and monthly cost math

line item | $ / voice turn | $ / month (≈ 480k voice turns) | note
gpt-realtime-2 · audio input | $0.0021 | $1,008 | ≈ 21k audio tokens × $0.10 / 1M
gpt-realtime-2 · audio output | $0.0048 | $2,304 | ≈ 24k audio tokens × $0.20 / 1M
gpt-realtime-2 · text-tokens | $0.0003 | $144 | ≈ 30 in + 24 out text tokens at Realtime text pricing
Whisper STT fallback (1.4% of turns) | $0.00001 | $5 | 0.006s × 6.7 / 1M tokens equivalent
Cloudflare Workers + KV | - | $184 | ephemeral keys + signalling + breadcrumb log
Algolia function-call read | - | $0 (existing) | no new cost · function-calls hit existing facet index
Sentry mobile breadcrumb | - | $76 | per-turn breadcrumb · cohort-tagged · 90d retention
All-in monthly | ≈ $0.0078 | ≈ $3,721 | vs. ≈ $0.045 / turn on the rejected hosted SDK path

Token costs use OpenAI's public gpt-realtime-2 pricing as of May 2026: $0.10 / 1M audio input, $0.20 / 1M audio output, plus the small text-token charge on the function-call surface. The voice-turn volume estimate assumes a 17% voice-engaged-session share on 1.4M MAU with 2 sessions per MAU per month and 6.4 voice turns per engaged session. The retailer's actual run-cost is currently ≈ 12% below the table because volume hasn't fully ramped since the 100% rollout.
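The table's all-in figures follow from the per-turn line items and the flat monthly charges. The sketch below reproduces that arithmetic using only numbers from the table; the function names are illustrative, and the 480k-turn volume is taken as given rather than derived.

```dart
/// Dollar cost of `tokens` at a price quoted per 1M tokens.
double tokenCost(int tokens, double dollarsPerMillion) =>
    tokens * dollarsPerMillion / 1e6;

/// Monthly total: per-turn model spend times volume, plus flat line items.
double monthlyCost({
  required double perTurnModelCost,
  required int turnsPerMonth,
  required double flatMonthly,
}) =>
    perTurnModelCost * turnsPerMonth + flatMonthly;

void main() {
  const turnsPerMonth = 480000;

  final audioIn = tokenCost(21000, 0.10); // $0.0021 per turn
  final audioOut = tokenCost(24000, 0.20); // $0.0048 per turn
  const textTokens = 0.0003; // per turn, from the table
  const whisperBlended = 0.00001; // per turn, blended over all turns
  const flatMonthly = 184.0 + 76.0; // Workers + KV, Sentry

  final perTurnModel = audioIn + audioOut + textTokens + whisperBlended;
  final monthly = monthlyCost(
    perTurnModelCost: perTurnModel,
    turnsPerMonth: turnsPerMonth,
    flatMonthly: flatMonthly,
  );

  print('\$${monthly.toStringAsFixed(0)} / month'); // $3721 / month
  print('\$${(monthly / turnsPerMonth).toStringAsFixed(4)} / turn'); // $0.0078 / turn
}
```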

A/B-test composition

What the 30-day A/B measured

measurement | n | what it checks | rollout-gate threshold
Mobile-session conversion · voice cohort | 42,318 sessions | primary KPI · vs. matched control cohort | ≥ +2.0 pts lift on voice-engaged
First-token p95 latency on-device | 28,400 turns | per-turn Sentry breadcrumb · iPhone 13 + Pixel 7 | ≤ 600 ms p95
Crash-free sessions · treatment vs control | 42,318 sessions | Sentry · within sample noise of control | ≥ −0.15 pp delta
In-app search UX score (post-session) | 812 prompts | 5-pt Likert delivered after voice-engaged sessions | ≥ 3.8 / 5
Out-of-scope handoff rate | 28,400 turns | agent says "let me hand you to chat" · should be present | 8–12% · neither too high nor zero

A/B randomisation is by anonymous device id. Treatment cohort gets the GFVoiceCopilot button; control cohort gets the existing touch-search-only experience. The +11.4 pt headline is the voice-engaged-session conversion lift, not the all-cohort lift (the all-cohort lift was +1.9 pts, also significant). Confidence interval on the voice-engaged-session lift is ±1.6 pp at the 95% level on n=42,318.
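The rollout gates can be expressed as a small set of range checks. This is a sketch of one way to encode them, not the team's actual gating code; the `Gate` class is hypothetical, and the observed values are the current figures reported elsewhere in this case study.

```dart
/// One rollout gate: passes when `value` sits inside [min, max].
class Gate {
  const Gate(
    this.name,
    this.value, {
    this.min = double.negativeInfinity,
    this.max = double.infinity,
  });

  final String name;
  final double value;
  final double min;
  final double max;

  bool get passes => value >= min && value <= max;
}

bool allGatesPass(List<Gate> gates) => gates.every((g) => g.passes);

void main() {
  const gates = [
    Gate('conversion lift, voice-engaged (pts)', 11.4, min: 2.0),
    Gate('first-token p95 (ms)', 580, max: 600),
    Gate('crash-free delta vs control (pp)', -0.03, min: -0.15),
    Gate('search UX score', 4.2, min: 3.8),
    Gate('out-of-scope handoff rate (%)', 9.2, min: 8, max: 12),
  ];

  for (final g in gates) {
    print('${g.name}: ${g.passes ? 'pass' : 'FAIL'}');
  }
  print('rollout allowed: ${allGatesPass(gates)}'); // true
}
```

The handoff gate is two-sided on purpose: a rate near zero would suggest the agent is answering questions it should be routing to chat.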

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. Four controls keep the lift honest after cutover.

Weekly funnel review

Voice-engaged cohort per-category lift opened. Any category showing >3 days of conversion drop becomes a Sentry issue against the function-call surface + a candidate for prompt tuning.
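The ">3 days of conversion drop" rule could be checked mechanically before the weekly review. The sketch below assumes the rule means more than three consecutive day-over-day declines, which is one reading of the text; the category names and daily rates are made-up illustration data, not client figures.

```dart
/// Longest run of consecutive day-over-day drops in a daily series.
int longestDropStreak(List<double> dailyConversion) {
  var longest = 0;
  var current = 0;
  for (var i = 1; i < dailyConversion.length; i++) {
    current = dailyConversion[i] < dailyConversion[i - 1] ? current + 1 : 0;
    if (current > longest) longest = current;
  }
  return longest;
}

/// Categories whose drop streak exceeds `maxDropDays` (the text uses 3).
List<String> categoriesToFlag(
  Map<String, List<double>> byCategory, {
  int maxDropDays = 3,
}) =>
    [
      for (final entry in byCategory.entries)
        if (longestDropStreak(entry.value) > maxDropDays) entry.key,
    ];

void main() {
  // Illustrative daily conversion rates (%), one week per category.
  final week = {
    'denim': [5.1, 5.0, 4.8, 4.7, 4.5, 4.6, 4.7],
    'outerwear': [3.2, 3.3, 3.1, 3.2, 3.0, 3.1, 3.2],
  };
  print(categoriesToFlag(week)); // [denim]
}
```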

Breadcrumb retention

Per-voice-turn Sentry breadcrumb in the retailer's EU project, cohort-tagged for downstream analytics.

On-call rotation

Two engineers per week. 99.5% widget-availability SLO + sub-600ms first-token-latency SLO on the treatment cohort.

Store-listing posture

App Store + Play Store mic-permission rationale text submitted at week 7. Both stores approved on first review.

a/b test · 30-day window

The funnel,
control vs voice-engaged.

Same app, same audience cohort, randomised by anonymous device id. Control gets touch-search only; treatment gets the tap-to-talk button plus the voice copilot. Highlighted line is the checkout-completion step — the +11.4-point lift the case study turns on.

control · touch-search only
  1. Session start · 100.0% · n=42,084
  2. Browse or search · 78.4%
  3. Product detail view · 41.2%
  4. Add to cart · 12.6%
  5. Checkout completion · 3.4%
treatment · voice copilot
  1. Session start · 100.0% · n=42,318 · voice cohort isolated
  2. Browse or tap-to-talk · 81.1%
  3. Product detail view (voice-narrowed) · 52.6%
  4. Add to cart · 19.4%
  5. Checkout completion · 14.8%

+11.4 pp lift on checkout completion · voice-engaged cohort

A/B randomised by anonymous device id. Voice-engaged sessions = treatment-cohort sessions where the user fired tap-to-talk at least once. All-cohort lift (treatment minus control across every session, voice-engaged or not) was +1.9 pp on checkout completion · also statistically significant at the 95% level. Confidence interval on the headline +11.4 pp = ±1.6 pp.
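The gap between the +11.4 pp voice-engaged lift and the +1.9 pp all-cohort lift is roughly what dilution would predict. The sketch below assumes that treatment sessions which never fire tap-to-talk convert at the control rate, and uses the 17% voice-engaged share reported in the results table; both are simplifying assumptions for illustration, not a statement of how the analysis was actually run.

```dart
/// Lift in percentage points between two conversion rates given in percent.
double liftPp(double treatmentPct, double controlPct) =>
    treatmentPct - controlPct;

/// All-cohort lift implied by the engaged-only lift, assuming
/// non-engaged treatment sessions convert at the control rate.
double dilutedLiftPp(double engagedLiftPp, double engagedShare) =>
    engagedLiftPp * engagedShare;

void main() {
  const controlCheckout = 3.4; // % of sessions, control funnel
  const engagedCheckout = 14.8; // % of sessions, voice-engaged funnel
  const engagedShare = 0.17; // voice-engaged share of treatment sessions

  final engaged = liftPp(engagedCheckout, controlCheckout);
  final allCohort = dilutedLiftPp(engaged, engagedShare);

  print('${engaged.toStringAsFixed(1)} pp'); // 11.4 pp
  print('${allCohort.toStringAsFixed(1)} pp'); // 1.9 pp
}
```

Reading the two numbers together matters: the engaged-only lift includes self-selection by shoppers who chose to tap, while the all-cohort lift is the figure that is directly comparable with the randomised control.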

8 weeks · honest version

The timeline,
including the week we halted.

Five stages, milestone-billed. The week-5 closed alpha surfaced a cart-abandonment spike on iPhone SE viewports. The voice-trigger button was occluding the price label on the bottom-right tile of the product grid. We halted the rollout, repositioned the trigger above the safe-area inset, and re-ran the alpha. The honest version of "8 weeks" includes the week we spent fixing a UX bug a Figma export wouldn't have caught.

  1. Weeks 1–2

    Discovery + UX postmortem

    We spent two weeks reading the postmortems of the team's two prior on-device voice A/B tests. Our conclusion: the failure mode was never the model; it was the trigger UX. We talked to 24 customers from the retailer's loyalty cohort about voice-in-shopping affordances; the two strongest signals were "tap, don't always listen" and "show me what you heard before you act". Both shaped the GFVoiceCopilot API.

    API spec for the OSS Flutter widget · UX guardrails written down · A/B test design signed off
  2. Weeks 3–4

    Widget build + ephemeral-key mint

    We built the `GFVoiceCopilot` widget in our GetWidget OSS package: mic-permission UX, partial-transcript chip overlay, animated waveform, barge-in handling, and the two callbacks the host wires. We minted sub-second-TTL ephemeral keys server-side via Cloudflare Workers so no OpenAI secret ever shipped in the Flutter binary. Sentry breadcrumb wiring per voice turn for production debugging.

    GFVoiceCopilot v0.4 shipped to the OSS package · ephemeral-key mint in production
  3. Week 5

    Closed alpha · cart abandon caught

    Closed alpha to 4% of traffic in two US metros. Day 4, Mixpanel flagged a cart-abandonment spike on the category-browse flow, and only on iPhone SE viewports (the smallest screen in the cohort). Root cause: the voice-trigger button was occluding the price label on the bottom-right tile of the product grid. We halted the rollout, repositioned the trigger above the safe-area inset, kept the affordance, and re-ran the alpha for a week with no abandon-rate regression.

    Trigger UX repositioned · iPhone SE viewport bug closed · lift recovered next iteration
    Walk-away point
  4. Weeks 6–7

    Ramp to 50% A/B

    Ramped to 50% A/B in the same two metros, then to all US iOS + Android traffic. Sentry crash-free sessions held at 99.71% on the treatment cohort vs 99.74% on control (within sample noise). First-token p95 measured per-device, per-network: held at sub-600 ms on cellular and sub-400 ms on Wi-Fi. Funnel comparison ran daily; team had a kill-switch wired to the Cloudflare KV namespace if the lift collapsed.

    Full A/B traffic at 50% · daily funnel comparison · kill-switch in production
  5. Week 8

    Production cutover + handoff

    Cutover to 100% traffic with the voice cohort intact as a measurement panel; we kept 5% of users on the control variant indefinitely so the team has an ongoing baseline for drift. Sentry SLI configured on first-token latency. Mixpanel funnel events tagged with the voice-engaged dimension so the merch team can read the lift per category. Documentation handed off to the retailer's in-house Flutter team, who maintain the surface from here.

    Production cutover · 5% indefinite control panel · documentation handed off
A/B results · 30-day test window

How we know
it works.

The A/B test design was signed off in week 1. Every metric below was a pre-registered comparison against the matched control cohort: no fishing for significance, no metric introduced after the rollout started. Numbers are from the current production cut and the 30-day A/B window.

metric | control | wk 5 (alpha) | wk 6 (50% A/B) | current (live) | target
Mobile-session conversion · voice cohort | 3.4% | 3.9% | 4.1% | 4.8% | ≥ 4.4%
First-token p95 latency (ms) | n/a | 680 | 640 | 580 | ≤ 600
Voice-engaged-session share | 0% | 9% | 14% | 17% | ≥ 12%
Crash-free sessions · treatment cohort | 99.74% | 99.61% | 99.68% | 99.71% | ≥ 99.6%
In-app search UX score (1–5) | 2.8 | 3.5 | 3.9 | 4.2 | ≥ 3.8
Out-of-scope handoff rate | n/a | 11.4% | 9.8% | 9.2% | 8–12%

Sample size for the headline +11.4 pp checkout-completion lift is n=42,318 sessions in the voice-engaged treatment cohort over a 30-day A/B window; the lift confidence interval is ±1.6 pp at 95%. First-token p95 latency is measured per-turn on-device via Sentry breadcrumb. The crash-free sessions delta of −0.03 pp between treatment and control is within sample noise, well inside the −0.15 pp rollout-gate threshold. Out-of-scope handoff rate is the share of voice turns where the agent classifies the intent as outside the catalog scope and hands off to chat; the design range is 8–12%. The week-5 alpha figure sat at the high end of that range (11.4%) because the alpha rollout had a smaller catalog scope. Note: the alpha-week crash-free figure (99.61%) is lower than control. That dip coincides with the iPhone SE viewport bug surfacing in week 5, which is the kind of regression the eval was designed to catch.

oss · GetWidget package · gf_voice_copilot

Drop the widget into a Scaffold.
Wire two callbacks.

The integration shape clients reuse from the GetWidget OSS Flutter library. The button affordance, mic-permission UX, transcript overlay, and barge-in handling are baked in; the client wires the partial-transcript and suggestions callbacks, adds a handoff route for out-of-scope intents, and configures the catalog scope.

lib/screens/shop_screen.dart dart
// Drops the voice copilot into any Flutter scaffold.
// Mic-permission, barge-in, and the visual waveform
// are owned by the widget; the host wires intent.

import 'package:flutter/material.dart';
import 'package:getwidget/getwidget.dart';

class ShopScreen extends StatefulWidget {
  const ShopScreen({super.key});
  @override
  State<ShopScreen> createState() => _ShopScreenState();
}

class _ShopScreenState extends State<ShopScreen> {
  String _transcript = '';
  List<ProductSuggestion> _suggestions = const [];

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Shop')),
      // ProductGrid is the host app's own catalog grid widget, not part of GetWidget.
      body: ProductGrid(suggestions: _suggestions),
      floatingActionButton: GFVoiceCopilot(
        catalogId: 'us-apparel-prod',
        config: const VoiceCopilotConfig(
          firstTokenBudgetMs: 600,
          fallbackToHttpAfter: Duration(seconds: 2),
          bargeIn: true,
        ),
        onPartialTranscript: (text) =>
            setState(() => _transcript = text),
        onSuggestions: (recs) =>
            setState(() => _suggestions = recs),
        onHandoff: (intent) =>
            Navigator.of(context).pushNamed('/chat', arguments: intent),
      ),
    );
  }
}
rendered · iPhone 13 viewport
  • 01 Partial transcript renders in a chip above the button as the user speaks; cleared on intent fire or barge-in.
  • 02 onSuggestions fires once per function-call response from the model · stream-friendly, debounced.
  • 03 onHandoff fires when the agent classifies the intent as out-of-scope (returns, support) · navigate to the chat surface.
  • 04 Animated waveform proxies "listening" state · paused under prefers-reduced-motion.

gf_voice_copilot ships from the open-source GetWidget Flutter UI kit · 4.8k★ on GitHub · iOS 16+ / Android 11+ · null-safety · OSS license = MIT

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

A Flutter voice copilot built on these patterns will hurt the app experience in any of the following situations. We turn down the engagement before a pilot is scoped.

Hot-word listening on the scope sheet

Always-on mic = mic-permission churn + battery-drain reviews + a trust gradient that's hard to recover. Tap-to-talk is the only voice-trigger UX we ship for retail apps. Clients who insist on always-on get a different vendor.

Team can't run a 30-day A/B

Mobile voice conversion claims without an A/B baseline are vibes. We've seen vendor decks where the lift number turned out to be a Wednesday-vs-Sunday comparison. Our pilot scope includes a matched control cohort and a 30-day window. Non-negotiable.

Mic-permission posture isn't taken seriously

Both stores reject voice surfaces without well-written mic-permission rationale strings, store-listing screenshots showing the tap-to-talk affordance, and a privacy policy that addresses audio handling. Treat store submission as a hard requirement before rollout, not a TODO.

Catalog isn't searchable in well-tuned facets

Voice in-app needs a strong existing search-and-facet substrate. The model function-calls into it. If the catalog is poorly tagged, voice surfaces a 9–14% out-of-scope handoff rate and the conversion lift won't materialise. The right pilot starts with catalog facets, not voice.

frequently asked · voice commerce · flutter voice copilot

What buyers ask first.
Real answers, no hedging.

What is voice commerce?
Voice commerce is a shopping interface where the buyer speaks intent ("find me a relaxed-fit oxford under $80") and the app's voice copilot function-calls the catalog, facet, and cart APIs to return live results. It is not Alexa-style remote ordering and it is not a generic chatbot with a microphone. Voice commerce sits inside the shopping surface and operates on the merchant's existing catalog index.
Why a Flutter voice copilot specifically?
Flutter ships a single codebase to iOS and Android, the GFVoiceCopilot widget is open-source under the same MIT license as the rest of the GetWidget UI kit, and WebRTC has a maintained Flutter package (flutter_webrtc) that can carry the OpenAI Realtime audio session. For a team already on Flutter, the voice copilot is one widget added, not a separate native module per platform.
Why tap-to-talk instead of always-on hot-word listening?
Always-on mic = mic-permission churn + battery-drain reviews + a trust gradient that's hard to recover. Tap-to-talk gives the user explicit consent on every utterance, passes both App Store and Play Store privacy review cleanly, and matches buyer expectation for a shopping app. We refuse to ship always-on for retail.
How accurate is voice product discovery in this build?
0.91 first-result-correct on a frozen 800-utterance eval set (user said it, the widget surfaced the right product or a one-tap-away facet). 0.96 catalog-attribute match across the 30-day window. Voice misinterpretation requiring a re-utterance happened on 6% of sessions.
What does a voice copilot cost to run?
About $0.034 per voice session on gpt-realtime-2 (median 47-second session, mostly time-to-first-byte). Across the 30-day A/B with ~1,400 voice sessions per day, that worked out to about $1,425/month in model spend. The +11.4pt conversion lift on the voice arm covered the run-cost roughly 38× over.
How long does it take to ship a Flutter voice copilot?
Eight weeks for this engagement: 1 week scope + UX audit, 1 week function-calling schema + Algolia integration, 2 weeks voice trigger UX iterations (we repositioned the trigger after halting the week-5 closed alpha), 2 weeks widget development, 1 week 30-day A/B setup + instrumentation, 1 week launch + tuning.
Does this work on Shopify or only custom storefronts?
Either. The voice copilot calls into whatever catalog + facet + cart APIs you have: Shopify Storefront, Algolia, ElasticSearch, Commercetools, or a custom GraphQL layer. This case study uses an existing Algolia facet index that the client had tuned over three years; the integration adds a function-calling schema on top of it, not a replacement.
When should we NOT ship a voice copilot?
Three cases: the catalog is under 500 SKUs (the voice copilot adds friction over a well-designed facet UI); average order value is under $25 (the cost of voice routing eats the margin); the team can't run a 30-day A/B before rollout (mobile conversion claims without an A/B baseline are vibes). We turn down engagements that fail any of these.

Ready to ship

Want a case study like this
for your Flutter app?

Book a fixed-fee discovery audit. We'll review the app's current funnel, scope the voice-engaged cohort comparison, recommend a tap-to-talk UX + audio-transport recipe, project per-turn cost, and tell you honestly whether voice is the right primitive, or whether the catalog facets need work first. About one audit in four ends with "fix the catalog tags, voice comes later."

30 min, async or live · A/B-first scoping · Walk-away point in the pilot
Updated May 20, 2026 · By Navin Sharma