E-commerce · DTC apparel · Flutter mobile Voice copilot · OSS Flutter widget · A/B-evaluated

gpt-realtime-2 · Flutter 3.24 · GetWidget OSS (4.8k★) · Algolia · Cloudflare Workers

case study · 2026 · anonymized

A Flutter voice copilot case study, where the chat
is voice.

A mid-market DTC apparel retailer's Flutter app was converting 18 points below desktop. The in-app search UX score was 2.8/5, and the team had already failed two on-device voice A/B tests on the trigger UX. We shipped a tap-to-talk voice copilot on gpt-realtime-2 that function-calls into the existing Algolia facet index, embedded via a new GFVoiceCopilot widget in our open-source Flutter UI kit. Eight weeks, A/B-evaluated, with a kill point at week 5 that we used.

+11.4 pts

mobile conversion · voice-engaged vs control · n=42,318 sessions over 30d A/B · ±1.6pt CI

p95 580 ms

first-token end-to-end on iPhone 13 + Pixel 7 · cellular and Wi-Fi blend

2.8 → 4.2

in-app search UX score on the voice cohort · n=812 post-session prompts

8 weeks

discovery to 50% A/B rollout · 1 cart-abandon halt at wk 5

shipped

8 weeks · 3 Flutter engineers · 1 AI engineer · 1 product designer

Summary

What this case study shows

A DTC apparel ecommerce client shipped a voice-driven shopping copilot inside their 1.4M-MAU Flutter app, using OpenAI Realtime over WebRTC and Cloudflare-minted ephemeral keys. Across n=42,318 sessions in a 30-day A/B, mobile conversion on voice-engaged sessions rose 11.4 percentage points (±1.6 pp, 95% CI). Stack: gpt-realtime-2, Flutter 3.24, GetWidget Flutter UI Kit (OSS), Algolia, Cloudflare Workers. Tap-to-talk with on-device VAD, function calls into the existing Algolia facet index, live product-grid re-render. Shipped as the open-source GFVoiceCopilot widget. This is one shape of a broader voice agent build: the same pipeline carries over to in-app, telephony, and kiosk surfaces.

in-app · synthetic replay

Tap-to-talk,
the grid responds.

The phone mock alongside is a stylised replay of one voice-engaged session — partial transcript surfaces as the user speaks, the grid narrows by facet, recommendations slide in. Real sessions are sub-second to first token; the animation here is deliberately slower than production so the sequence is legible.

01 Tap-to-talk fires on-device VAD and opens the WebRTC channel.
02 Partial transcript surfaces in the chip overlay as the user speaks.
03 Function call hits Algolia · facets narrow live.
04 Recommendations stream back · grid re-renders without rebuild.

−18 pts

mobile vs desktop conversion gap before the build

2.8 / 5

in-app search UX rating from a 1,200-user survey

2x failed

prior on-device voice A/B tests · trigger UX rejected

4.8k★

GetWidget OSS Flutter UI kit · the OSS foundation this voice surface ships on

the problem

A Flutter app
losing the desktop crown.

Our client: a mid-market US DTC apparel retailer with a Flutter-first mobile app, ~1.4M MAU, 71% of total sessions on mobile but converting 18 points below desktop. In-app search UX scored 2.8/5. The head of product called it the loudest signal in the dashboard. The constraint wasn't model picks; it was that the team had already failed twice on voice UX, and the bar for a third try was high.

today vs · with the voice copilot

today: Shopper opens app → Browse tab (scroll · tap · type) → Touch search (2.8/5 UX score) → Refine + facet
outcome: −18 pts vs desktop · cart abandons on category browse

with the voice copilot: Shopper opens app → Tap-to-talk · GFVoiceCopilot (on-device VAD · barge-in) → gpt-realtime-2 streaming (function-calls into Algolia) → Grid re-renders live
outcome: Add-to-cart · voice cohort
outcome: Refine + browse · grid narrows
outcome: Handoff · chat or human

Two prior voice attempts had failed. Failure one was a hot-word listener. Battery drain showed up in App Store reviews. Failure two was a chained STT → LLM → TTS stack with 1.4s first-token latency; the conversational feel broke completely. Hosted voice SDKs (Vapi, Retell, Synthflow) all added 200–400ms of vendor round-trip and wanted to own the UI affordance. The retailer's product team wanted to own it. The product head's framing at kickoff was direct: third strike and we don't try voice again for a year. We didn't pitch a hosted SDK. We pitched a widget, in our OSS Flutter library, with the audio path going through Cloudflare Workers ephemeral-key minting straight into OpenAI Realtime over WebRTC.

two prior A/B tests · why they failed

attempt 1 · hot-word: battery drain

attempt 2 · chained stack: 1.4s TTFT

hosted SDK round-trip: +200–400ms

this build · TTFT p95: 580ms

Third strike rule: if tap-to-talk doesn't feel sub-second and the trigger UX annoys, voice is dead on this app for a year.

The thing that's killed every prior voice test on this app isn't the model. It's the trigger UX. We don't need a clever STT stack. We need a button people will tap, an affordance that looks right next to our existing design tokens, and a turn that feels under one second. If the button feels wrong, voice is dead for a year on this app.

Head of Product · DTC apparel · Flutter mobile · 1.4M MAU

the approach · voice copilot pipeline

Voice copilot pipeline: six stages,
one widget on top.

We routed tap-to-talk through on-device VAD → WebRTC over a Cloudflare-minted ephemeral key → gpt-realtime-2 streaming → function-calls into the existing Algolia index → grid re-renders live as the model speaks. Our fallback is Whisper-large-v3 over chunked HTTP for the ~1.4% of cellular sessions where WebRTC degrades. Audio output streams back over the same WebRTC channel; barge-in flushes cleanly with a response.cancel event.
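A minimal sketch of the barge-in flush described above, assuming the host already holds an open data channel to the Realtime session. `response.cancel` is the event named in the text; `RealtimeChannel`, `RecordingChannel`, and `BargeInController` are hypothetical names used for illustration and are not the widget's actual internals.

```dart
import 'dart:convert';

/// Hypothetical abstraction over the WebRTC data channel (not a real package API).
abstract class RealtimeChannel {
  void send(String jsonEvent);
}

/// In-memory channel that records what was sent; handy for local experiments.
class RecordingChannel implements RealtimeChannel {
  final List<String> sent = [];

  @override
  void send(String jsonEvent) => sent.add(jsonEvent);
}

/// Serialises the client event that asks the model to stop the in-flight response.
String responseCancelEvent() => jsonEncode({'type': 'response.cancel'});

class BargeInController {
  BargeInController(this.channel, {this.enabled = true});

  final RealtimeChannel channel;
  final bool enabled;
  bool _responding = false;

  void onResponseStarted() => _responding = true;
  void onResponseDone() => _responding = false;

  /// Sends a cancel only while a response is in flight; returns true if sent.
  bool onUserTap() {
    if (!enabled || !_responding) return false;
    channel.send(responseCancelEvent());
    _responding = false;
    return true;
  }
}
```

A tap while the model is idle is a no-op, so the same tap handler can serve both "start talking" and "interrupt" without sending stray cancel events.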

three decisions that shaped the build

design decision · 01

Tap-to-talk, not always-on listening

we rejected: Hot-word triggered listening
because: The two prior on-device voice A/B tests failed on the always-on UX. Users felt watched, mic-permission dialogs lit up, and battery drain showed up in the support tickets. Tap-to-talk is the explicit user action; the visual affordance is what the trust math turned on.

design decision · 02

WebRTC primary, chunked HTTP fallback

we rejected: WebRTC only · degrade silently if it fails
because: Cellular networks in the US retail demographic drop WebRTC handshakes more often than the listicle benchmarks suggest. We added a Whisper-STT-over-HTTP fallback that fires after two failed handshakes; the user never sees a degraded transport, just a slightly slower turn.
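The fallback rule can be sketched as a small state holder. This assumes the host reports each handshake result and that a successful handshake resets the failure count; the source only states that the fallback fires after two failed handshakes, so the reset behaviour and the `TransportSelector` name are assumptions.

```dart
enum Transport { webRtc, chunkedHttp }

/// Tracks failed WebRTC handshakes and reports which transport to use next.
class TransportSelector {
  TransportSelector({this.maxFailedHandshakes = 2});

  /// Number of failed handshakes tolerated before falling back.
  final int maxFailedHandshakes;
  int _failures = 0;

  Transport get current => _failures >= maxFailedHandshakes
      ? Transport.chunkedHttp
      : Transport.webRtc;

  void onHandshakeFailed() => _failures++;

  /// Assumption: a success clears earlier failures.
  void onHandshakeSucceeded() => _failures = 0;
}
```

Keeping the decision in one place lets the UI stay unaware of the transport, which matches the goal that the user only sees a slightly slower turn.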

design decision · 03

Function-call into existing Algolia, no facet rewrite

we rejected: Build a new vector-search index for the catalog
because: The retailer's existing Algolia index was tuned over three years of merch experiments. Rebuilding it would have lost institutional knowledge encoded in synonyms, redirect rules, and merchandising overrides. The voice agent function-calls into the same index a human typing into the search bar would hit.
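To make the function-call surface concrete, here is a hedged sketch of one tool definition and the mapping from the model's arguments to Algolia search parameters. The tool name `search_catalog` and the attribute names `color`, `fit`, and `price` are hypothetical; the retailer's real schema and facet names are not given in the source. `facetFilters` and `numericFilters` are standard Algolia search parameters.

```dart
import 'dart:convert';

/// Hypothetical tool definition in the function-tool shape
/// (type / name / description / JSON-schema parameters).
const Map<String, Object> searchCatalogTool = {
  'type': 'function',
  'name': 'search_catalog',
  'description': 'Search the existing catalog index and narrow by facet.',
  'parameters': {
    'type': 'object',
    'properties': {
      'query': {'type': 'string'},
      'color': {'type': 'string'},
      'fit': {'type': 'string'},
      'max_price': {'type': 'number'},
    },
    'required': ['query'],
  },
};

/// Converts the JSON arguments of a tool call into Algolia search parameters.
Map<String, Object> toAlgoliaParams(String argumentsJson) {
  final args = jsonDecode(argumentsJson) as Map<String, dynamic>;
  final facetFilters = <String>[
    if (args['color'] is String) 'color:${args['color']}',
    if (args['fit'] is String) 'fit:${args['fit']}',
  ];
  final maxPrice = args['max_price'];
  return {
    'query': (args['query'] as String?) ?? '',
    if (facetFilters.isNotEmpty) 'facetFilters': facetFilters,
    if (maxPrice is num) 'numericFilters': ['price <= $maxPrice'],
  };
}
```

Because the output is an ordinary search request, the index's synonyms, redirect rules, and merchandising overrides apply to voice queries the same way they apply to typed ones.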

why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which subsystem broke. No single conversion number that hides the cause.

Voice copilot model

First-token latency under 600ms p95 + out-of-scope handoff rate on the 30-day A/B. Function-calls hit the existing Algolia index: no catalog rewrite, no facet retag.

Tap-to-talk affordance

Tap-rate per session + abandon-rate on mic-permission. Tap-to-talk only: never hot-word, never always-on.

On-device VAD + capture

VAD latency + barge-in cleanliness. Audio doesn't leave until user taps.

WebRTC transport

Handshake-success + per-turn round-trip. Chunked-HTTP fallback after 2 failed handshakes.

Function-call surface

Tool-call success rate against the existing Algolia 99.9% SLO.

Grid re-render

Time-to-product-visible on the catalog grid.
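For the on-device VAD + capture contract above, a simple energy gate shows the general shape of the check. This is deliberately a swapped-in technique: the build uses a Silero-style filter, whereas the sketch below is a plain RMS threshold with a hangover counter. The threshold and frame sizes are illustrative assumptions.

```dart
import 'dart:math' as math;

/// Root-mean-square level of one frame of 16-bit PCM samples.
double rms(List<int> pcm16) {
  if (pcm16.isEmpty) return 0;
  var sum = 0.0;
  for (final s in pcm16) {
    sum += s * s;
  }
  return math.sqrt(sum / pcm16.length);
}

/// Energy-gate VAD: speech starts when a frame crosses [threshold] and ends
/// after [hangoverFrames] consecutive quiet frames.
class EnergyVad {
  EnergyVad({this.threshold = 500, this.hangoverFrames = 10});

  final double threshold;
  final int hangoverFrames;
  int _quiet = 0;
  bool _speaking = false;

  /// Feeds one frame and returns whether speech is currently active.
  bool process(List<int> frame) {
    if (rms(frame) >= threshold) {
      _speaking = true;
      _quiet = 0;
    } else if (_speaking && ++_quiet >= hangoverFrames) {
      _speaking = false;
    }
    return _speaking;
  }
}
```

At 16 kHz mono, a 320-sample frame is 20 ms, so the hangover count translates directly into how long a pause is tolerated before the turn ends.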

under the hood

The voice copilot,
tap to product grid.

Tap-to-talk fires the on-device VAD, opens a WebRTC channel to gpt-realtime-2 over a Cloudflare-minted ephemeral key, and streams partial transcript back as the user speaks. Function calls hit the existing Algolia facet index; recommendations stream back and re-render the product grid live.

outcome · primary: Tap → add-to-cart · voice-engaged session lifts +11.4 pts mobile conversion

outcome · neutral: Refine + browse · voice narrows the grid · user keeps tapping touch

outcome · safety: Handoff · chat or human · out-of-scope intent (returns, support) → chat surface

latency budgets are p50/p95 measured on-device (iPhone 13 + Pixel 7) over a 30-day A/B · first-token p95 580 ms end-to-end · sub-1s perceived

on-device VAD

audio doesn't leave until the user taps · no always-on mic

ephemeral keys

Cloudflare Workers mint a sub-second TTL token · no client-side OpenAI secrets

OSS-anchored

the button affordance ships from the public GetWidget Flutter kit · clients can fork

A/B-first

30-day A/B with control before the engineering team accepted any conversion claim
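On the client side, the ephemeral-key property above comes down to never reusing a token past its expiry. A minimal sketch, assuming the Worker returns a JSON body with `value` and `expires_at_ms` fields; that response shape and the `EphemeralKey` class are hypothetical, since the source does not document the mint endpoint.

```dart
import 'dart:convert';

/// Short-lived credential minted server-side; no long-lived secret on device.
class EphemeralKey {
  const EphemeralKey(this.value, this.expiresAt);

  /// Parses the assumed Worker response:
  /// {"value": "...", "expires_at_ms": <unix milliseconds>}.
  factory EphemeralKey.fromJson(String body) {
    final json = jsonDecode(body) as Map<String, dynamic>;
    return EphemeralKey(
      json['value'] as String,
      DateTime.fromMillisecondsSinceEpoch(
        json['expires_at_ms'] as int,
        isUtc: true,
      ),
    );
  }

  final String value;
  final DateTime expiresAt;

  /// True while the key is still valid, leaving [skew] as a safety margin.
  bool isUsableAt(DateTime now, {Duration skew = Duration.zero}) =>
      now.add(skew).isBefore(expiresAt);
}
```

With a very short TTL the client would typically mint a key immediately before each handshake and discard it afterwards, rather than caching it.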

the stack · voice commerce

Voice commerce stack: named tools,
OSS where it matters.

The voice surface ships from the GetWidget OSS Flutter library. Clients can read the source, fork the widget, and ship a custom variant if they need to. The model + transport are commercial; the affordance the user touches is open. That split is the credibility moat for this build.

gpt-realtime-2 OpenAI Realtime API · WebRTC

Whisper large-v3

Flutter 3.24

GetWidget UI Kit OSS BSD-3 · 4.8k★

Algolia

Cloudflare Workers

WebRTC

Sentry

Mixpanel

how it actually runs

Production shape,
under the hood.

Numbers below are from our current production cut. We measured latency on-device on iPhone 13 + Pixel 7; our cost math uses OpenAI's published gpt-realtime-2 pricing as of May 2026; our eval composition is the A/B-test design we gated on before any rollout. The team holds the same shape on every voice engagement we ship.

latency budget

Per-stage P50 / P95 (ms) · on-device

stage	p50	p95	tooling
Tap-to-talk · widget render	16	38	Flutter 3.24 · GetWidget GFVoiceCopilot · MaterialState
On-device VAD + capture	28	64	Silero-style filter · flutter_sound · 16kHz mono
WebRTC handshake (per-session, amortised)	240	420	Cloudflare Workers signalling · ephemeral key mint
First audio frame in → model context	92	180	Cloudflare edge → OpenAI Realtime · steady-state
gpt-realtime-2 first-token latency	380	580	OpenAI Realtime · streaming TTS in same channel
Function-call → Algolia → grid re-render	64	138	Worker proxy · 4 read tools · grid diff render
Total (perceived first-token)	≈ 480	≈ 580	on-device · cellular + Wi-Fi blend · iPhone 13 + Pixel 7

p50/p95 measured from Sentry per-turn breadcrumbs over a 30-day window on the treatment cohort (n ≈ 28,400 voice turns). WebRTC handshake is per-session and amortised across an average of 6.4 turns per session. It doesn't gate the first-token feel after turn 1. SLO is p95 ≤ 600 ms on perceived first-token; current burn ≈ 97%.
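The two calculations in that note can be sketched in a few lines: a percentile over per-turn samples, and the per-turn share of the handshake once it is spread across a session. The nearest-rank method is an assumption; the source does not say which percentile definition Sentry applies.

```dart
/// Nearest-rank percentile over latency samples in milliseconds.
int percentile(List<int> samplesMs, int p) {
  if (samplesMs.isEmpty) throw ArgumentError('no samples');
  final sorted = [...samplesMs]..sort();
  final rank = (p * sorted.length / 100).ceil();
  return sorted[rank < 1 ? 0 : rank - 1];
}

/// True when the p95 of the samples is inside the first-token budget.
bool withinSlo(List<int> samplesMs, {int budgetMs = 600}) =>
    percentile(samplesMs, 95) <= budgetMs;

/// Per-turn share of a per-session handshake cost.
double amortisedHandshakeMs(double handshakeMs, double turnsPerSession) =>
    handshakeMs / turnsPerSession;
```

With the table's p50 handshake of 240 ms and 6.4 turns per session, the amortised cost is 37.5 ms per turn, which is why the handshake does not dominate the perceived first-token figure after turn 1.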

lib/widgets/gf_voice_copilot.dart dart

// gf_voice_copilot.dart — GetWidget OSS Flutter package
//
// Drop the voice copilot into any Flutter scaffold. Mic-permission
// UX, the partial transcript chip, the animated waveform, and
// barge-in handling all live in the widget. The host wires the
// two callbacks: partial transcript (during) + suggestions (after
// the function call resolves).

import 'package:flutter/material.dart';

class VoiceCopilotConfig {
  /// First-token latency budget. The widget surfaces a degraded
  /// affordance when the model exceeds it twice in a row.
  final int firstTokenBudgetMs;

  /// Time to wait before falling from WebRTC to chunked HTTP + Whisper.
  final Duration fallbackToHttpAfter;

  /// Honour barge-in (tap-to-cancel mid-response).
  final bool bargeIn;

  const VoiceCopilotConfig({
    this.firstTokenBudgetMs = 600,
    this.fallbackToHttpAfter = const Duration(seconds: 2),
    this.bargeIn = true,
  });
}

class ProductSuggestion {
  final String sku;
  final String title;
  final num priceCents;
  const ProductSuggestion({
    required this.sku,
    required this.title,
    required this.priceCents,
  });
}

typedef OnSuggestions   = void Function(List<ProductSuggestion>);
typedef OnTranscript    = void Function(String partial);
typedef OnHandoff       = void Function(String intent);

class GFVoiceCopilot extends StatefulWidget {
  /// Scopes the function-call surface to this catalog (per-store SKU set).
  final String catalogId;

  /// Latency + fallback behaviour.
  final VoiceCopilotConfig config;

  /// Fires repeatedly as the model surfaces partial transcript.
  final OnTranscript onPartialTranscript;

  /// Fires once per function-call response with the suggestion list.
  final OnSuggestions onSuggestions;

  /// Fires when the agent classifies the intent as out-of-scope
  /// (returns, support, account questions). The host should
  /// navigate to a chat or human surface here.
  final OnHandoff onHandoff;

  const GFVoiceCopilot({
    super.key,
    required this.catalogId,
    required this.onPartialTranscript,
    required this.onSuggestions,
    required this.onHandoff,
    this.config = const VoiceCopilotConfig(),
  });

  @override
  State<GFVoiceCopilot> createState() => _GFVoiceCopilotState();
}

The GFVoiceCopilot widget API exported from the GetWidget OSS Flutter package. Two result callbacks (partial transcript, suggestions), an onHandoff callback for out-of-scope intents, and a config struct. Mic-permission UX and barge-in are baked in; the host wires intent.
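The `firstTokenBudgetMs` doc comment says the widget surfaces a degraded affordance when the model exceeds the budget twice in a row. One way that rule could be tracked is sketched below; `LatencyBudgetTracker` is a hypothetical helper, not part of the published widget API.

```dart
/// Flags a degraded state after consecutive first-token budget misses.
class LatencyBudgetTracker {
  LatencyBudgetTracker({this.budgetMs = 600, this.consecutiveLimit = 2});

  final int budgetMs;
  final int consecutiveLimit;
  int _over = 0;

  /// True once the budget has been exceeded [consecutiveLimit] times in a row.
  bool get degraded => _over >= consecutiveLimit;

  /// Records one turn; a turn inside the budget resets the streak.
  void recordFirstToken(int latencyMs) {
    _over = latencyMs > budgetMs ? _over + 1 : 0;
  }
}
```

Requiring a streak rather than a single miss keeps one slow cellular turn from flipping the affordance back and forth.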

unit economics

Per-session and monthly cost math

line item	$ / voice turn	$ / month (≈ 480k voice turns)	note
gpt-realtime-2 — audio input	$0.0021	$1,008	≈ 21k audio tokens × $0.10 / 1M
gpt-realtime-2 — audio output	$0.0048	$2,304	≈ 24k audio tokens × $0.20 / 1M
gpt-realtime-2 — text-tokens	$0.0003	$144	≈ 30 in + 24 out text tokens at Realtime text pricing
Whisper STT fallback (1.4% of turns)	$0.00001	$5	0.006s × 6.7 / 1M tokens equivalent
Cloudflare Workers + KV	—	$184	ephemeral keys + signalling + breadcrumb log
Algolia function-call read	—	$0 (existing)	no new cost · function-calls hit existing facet index
Sentry mobile breadcrumb	—	$76	per-turn breadcrumb · cohort-tagged · 90d retention
All-in monthly	≈ $0.0078	≈ $3,721	vs. ≈ $0.045 / turn on the rejected hosted SDK path

Token costs use OpenAI's public gpt-realtime-2 pricing as of May 2026: $0.10 / 1M audio input, $0.20 / 1M audio output, plus the small text-token charge on the function-call surface. Voice-turn volume estimate assumes 17% voice-engaged-session share on 1.4M MAU with 2x sessions/MAU/mo and 6.4 voice turns per engaged session. The retailer's actual run-cost is currently ≈ 12% below the table because volume hasn't fully ramped post-100% rollout.
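The table arithmetic can be reproduced directly. The sketch below uses only the figures in the table (token counts, per-million prices, fixed monthly lines, and the 480k-turn volume); it does not attempt to re-derive the volume estimate.

```dart
/// Cost in USD of [tokens] at a price quoted per one million tokens.
double tokenCost(int tokens, double usdPerMillion) =>
    tokens * usdPerMillion / 1e6;

/// Per-turn model spend: audio in + audio out + text tokens + Whisper fallback.
double modelCostPerTurn() =>
    tokenCost(21000, 0.10) + // audio input
    tokenCost(24000, 0.20) + // audio output
    0.0003 + // text tokens, per the table
    0.00001; // Whisper fallback, per the table

/// All-in monthly cost: per-turn spend times volume, plus fixed lines.
double monthlyTotal({int turns = 480000}) =>
    modelCostPerTurn() * turns +
    184 + // Cloudflare Workers + KV
    76; // Sentry breadcrumbs

/// All-in cost per turn once fixed lines are spread across the volume.
double allInPerTurn({int turns = 480000}) => monthlyTotal(turns: turns) / turns;
```

The per-turn model spend comes to about $0.0072; adding the fixed Cloudflare and Sentry lines across 480k turns gives the table's all-in figure of roughly $0.0078 per turn and $3,721 per month.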

A/B-test composition

What the 30-day A/B measured

measurement	n	what it checks	rollout-gate threshold
Mobile-session conversion · voice cohort	42,318 sessions	primary KPI · vs. matched control cohort	≥ +2.0 pts lift on voice-engaged
First-token p95 latency on-device	28,400 turns	per-turn Sentry breadcrumb · iPhone 13 + Pixel 7	≤ 600 ms p95
Crash-free sessions · treatment vs control	42,318 sessions	Sentry · within sample noise of control	≥ −0.15 pp delta
In-app search UX score (post-session)	812 prompts	5-pt Likert delivered after voice-engaged sessions	≥ 3.8 / 5
Out-of-scope handoff rate	28,400 turns	agent says "let me hand you to chat" · should be present	8–12% · neither too high nor zero

A/B randomisation is by anonymous device id. Treatment cohort gets the GFVoiceCopilot button; control cohort gets the existing touch-search-only experience. The +11.4 pt headline is the voice-engaged-session conversion lift, not the all-cohort lift (the all-cohort lift was +1.9 pts, also significant). Confidence interval on the voice-engaged-session lift is ±1.6 pp at the 95% level on n=42,318.
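Randomising by anonymous device id is commonly done with a stable hash, so a device always lands in the same cohort. The source does not describe its assignment mechanism; the sketch below is one assumed approach using 32-bit FNV-1a, with a hypothetical salt and a 50% split.

```dart
import 'dart:convert';

/// 32-bit FNV-1a hash of the UTF-8 bytes of [input].
/// Written for the Dart VM's 64-bit ints; web targets would need care.
int fnv1a32(String input) {
  var hash = 0x811c9dc5; // FNV offset basis
  for (final byte in utf8.encode(input)) {
    hash ^= byte;
    hash = (hash * 0x01000193) & 0xFFFFFFFF; // FNV prime, kept to 32 bits
  }
  return hash;
}

enum Cohort { control, treatment }

/// Deterministically maps a device id to a cohort.
Cohort assignCohort(
  String deviceId, {
  String salt = 'voice-copilot-ab', // hypothetical experiment salt
  int treatmentPercent = 50,
}) {
  final bucket = fnv1a32('$salt:$deviceId') % 100;
  return bucket < treatmentPercent ? Cohort.treatment : Cohort.control;
}
```

Salting per experiment keeps cohorts from correlating across tests, and raising `treatmentPercent` only moves devices from control into treatment, never the other way.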

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. Four controls keep the lift honest after cutover.

Weekly funnel review

Per-category lift for the voice-engaged cohort is reviewed weekly. Any category showing >3 days of conversion drop becomes a Sentry issue against the function-call surface + a candidate for prompt tuning.

Breadcrumb retention

Per-voice-turn Sentry breadcrumb in the retailer's EU project, cohort-tagged for downstream analytics.

On-call rotation

Two engineers per week. 99.5% widget-availability SLO + sub-600ms first-token-latency SLO on the treatment cohort.

Store-listing posture

App Store + Play Store mic-permission rationale text submitted at week 7. Both stores approved on first review.

a/b test · 30-day window

The funnel,
control vs voice-engaged.

Same app, same audience cohort, randomised by anonymous device id. Control gets touch-search only; treatment gets the tap-to-talk button plus the voice copilot. Highlighted line is the checkout-completion step — the +11.4-point lift the case study turns on.

control · touch-search only

Session start n=42,084
Browse or search
Product detail view
Add to cart
Checkout completion

treatment · voice copilot

Session start n=42,318 · voice cohort isolated
Browse or tap-to-talk
Product detail view (voice-narrowed)
Add to cart
Checkout completion

+11.4 pp lift on checkout completion · voice-engaged cohort

A/B randomised by anonymous device id. Voice-engaged sessions = treatment-cohort sessions where the user fired tap-to-talk at least once. All-cohort lift (treatment vs. control across every session, voice-engaged or not) was +1.9 pp on checkout completion · also statistically significant at the 95% level. Confidence interval on the headline +11.4 pp = ±1.6 pp.
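For readers who want to see how a lift and its interval are typically computed, here is a sketch using the normal-approximation (Wald) interval for a difference of two proportions. The source does not state which method produced the ±1.6 pp figure, and the raw conversion counts are not published, so the function takes counts as inputs rather than reproducing the headline.

```dart
import 'dart:math' as math;

/// Lift and half-width of its confidence interval, both in percentage points.
class LiftEstimate {
  const LiftEstimate(this.liftPp, this.marginPp);
  final double liftPp;
  final double marginPp;
}

/// Difference of proportions (treatment minus control) with a Wald interval.
/// [z] = 1.96 corresponds to a 95% confidence level.
LiftEstimate liftWithCi({
  required int convControl,
  required int nControl,
  required int convTreatment,
  required int nTreatment,
  double z = 1.96,
}) {
  final pc = convControl / nControl;
  final pt = convTreatment / nTreatment;
  final se = math.sqrt(pc * (1 - pc) / nControl + pt * (1 - pt) / nTreatment);
  return LiftEstimate((pt - pc) * 100, z * se * 100);
}
```

This treats sessions as independent observations. With randomisation at the device level, sessions from the same device are correlated, so a production analysis would usually cluster by device or bootstrap.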

8 weeks · honest version

The timeline,
including the week we halted.

Five stages, milestone-billed. The week-5 closed alpha surfaced a cart-abandonment spike on iPhone SE viewports. The voice-trigger button was occluding the price label on the bottom-right tile of the product grid. We halted the rollout, repositioned the trigger above the safe-area inset, and re-ran the alpha. The honest version of `8 weeks` includes the week we sat on our hands fixing a UX bug a Figma export wouldn't have caught.

Weeks 1–2

Discovery + UX postmortem

We spent two weeks reading the postmortems of the team's two failed on-device voice A/B tests. Our conclusion: the failure mode was never the model. It was the trigger UX. We talked to 24 customers from the retailer's loyalty cohort about voice-in-shopping affordances; the two strongest signals were `tap, don't always listen` and `show me what you heard before you act`. Both shaped the GFVoiceCopilot API.

API spec for the OSS Flutter widget · UX guardrails written down · A/B test design signed off

Weeks 3–4

Widget build + ephemeral-key mint

We built the `GFVoiceCopilot` widget in our GetWidget OSS package: mic-permission UX, partial-transcript chip overlay, animated waveform, barge-in handling, and the two callbacks the host wires. We minted sub-second-TTL ephemeral keys server-side via Cloudflare Workers so no OpenAI secret ever shipped in the Flutter binary. Sentry breadcrumb wiring per voice turn for production debugging.

GFVoiceCopilot v0.4 shipped to the OSS package · ephemeral-key mint in production

Week 5

Closed alpha · cart abandon caught

Closed alpha to 4% of traffic in two US metros. Day 4, Mixpanel flagged a cart-abandonment spike on the category-browse flow, and only on iPhone SE viewports (the smallest screen in the cohort). Root cause: the voice-trigger button was occluding the price label on the bottom-right tile of the product grid. We halted the rollout, repositioned the trigger above the safe-area inset, kept the affordance, and re-ran the alpha for a week with no abandon-rate regression.

Trigger UX repositioned · iPhone SE viewport bug closed · lift recovered next iteration

Walk-away point

Weeks 6–7

Ramp to 50% A/B

Ramped to 50% A/B in the same two metros, then to all US iOS + Android traffic. Sentry crash-free sessions held at 99.71% on the treatment cohort vs 99.74% on control (within sample noise). First-token p95 measured per-device, per-network: held at sub-600 ms on cellular and sub-400 ms on Wi-Fi. Funnel comparison ran daily; team had a kill-switch wired to the Cloudflare KV namespace if the lift collapsed.

Full A/B traffic at 50% · daily funnel comparison · kill-switch in production

Week 8

Production cutover + handoff

Cutover to 100% traffic with the voice cohort intact as a measurement panel; we kept 5% of users on the control variant indefinitely so the team has an ongoing baseline for drift. Sentry SLI configured on first-token latency. Mixpanel funnel events tagged with the voice-engaged dimension so the merch team can read the lift per category. Documentation handed off to the retailer's in-house Flutter team, who maintain the surface from here.

Production cutover · 5% indefinite control panel · documentation handed off

A/B results · 30-day test window

How we know
it works.

The A/B test design was signed off in week 1. Every metric below was a pre-registered comparison against the matched control cohort: no fishing for significance, no metric introduced after the rollout started. Numbers are from the current production cut and the 30-day A/B window.

metric	control	wk 5 (alpha)	wk 6 (50% A/B)	current (live)	target
Mobile-session conversion · voice cohort	3.4%	3.9%	4.1%	4.8%	≥ 4.4%
First-token p95 latency (ms)	n/a	680	640	580	≤ 600
Crash-free sessions · treatment cohort	99.74%	99.61%	99.68%	99.71%	≥ 99.6%
In-app search UX score (1–5)	2.8	3.5	3.9	4.2	≥ 3.8
Out-of-scope handoff rate	n/a	11.4%	9.8%	9.2%	8–12%

Voice-engaged-session share: 14% → 17% · target ≥ 12%

Sample size for the headline +11.4 pp checkout-completion lift is n=42,318 sessions in the voice-engaged treatment cohort over a 30-day A/B window; the lift confidence interval is ±1.6 pp at 95%. First-token p95 latency is measured per-turn on-device via Sentry breadcrumb. The crash-free sessions delta of −0.03 pp between treatment and control is within sample noise, well inside the −0.15 pp rollout-gate threshold. Out-of-scope handoff rate is the share of voice turns where the agent classifies the intent as outside the catalog scope and hands off to chat. It sits between 8 and 12% by design; the week-5 alpha was at the high end (11.4%) because the alpha rollout had a smaller catalog scope. Note: the alpha-week crash-free figure (99.61%) is lower than control, as expected. That's the iPhone SE viewport bug surfacing in the metric, exactly the way the eval was designed to catch it.
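The rollout gates listed in the A/B composition can be expressed as a single check. The thresholds below are the ones stated in this case study; the function name and parameter names are illustrative.

```dart
/// Returns the names of any rollout gates that fail; empty means all pass.
List<String> failedGates({
  required double conversionLiftPp, // voice-engaged lift, percentage points
  required int firstTokenP95Ms,
  required double crashFreeDeltaPp, // treatment minus control
  required double uxScore, // 1-5 Likert
  required double handoffRatePct,
}) {
  return [
    if (conversionLiftPp < 2.0) 'conversion lift',
    if (firstTokenP95Ms > 600) 'first-token p95',
    if (crashFreeDeltaPp < -0.15) 'crash-free delta',
    if (uxScore < 3.8) 'search UX score',
    if (handoffRatePct < 8 || handoffRatePct > 12) 'handoff rate',
  ];
}
```

Fed the current production values (11.4 pp, 580 ms, −0.03 pp, 4.2, 9.2%), every gate passes. Fed the week-5 alpha latency and UX score (680 ms, 3.5), the latency and UX gates fail, which is consistent with the halt.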

oss · GetWidget package · gf_voice_copilot

Drop the widget into a Scaffold.
Wire two callbacks.

The integration shape clients reuse from the GetWidget OSS Flutter library. The button affordance, mic-permission UX, transcript overlay, and barge-in handling are baked in; the client wires two callbacks (partial transcript + suggestions), adds an onHandoff route for out-of-scope intents, and configures the catalog scope.

lib/screens/shop_screen.dart dart

// Drops the voice copilot into any Flutter scaffold.
// Mic-permission, barge-in, and the visual waveform
// are owned by the widget; the host wires intent.

import 'package:flutter/material.dart';
import 'package:getwidget/getwidget.dart';

class ShopScreen extends StatefulWidget {
  const ShopScreen({super.key});
  @override
  State<ShopScreen> createState() => _ShopScreenState();
}

class _ShopScreenState extends State<ShopScreen> {
  String _transcript = '';
  List<ProductSuggestion> _suggestions = const [];

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Shop')),
      body: ProductGrid(suggestions: _suggestions),
      floatingActionButton: GFVoiceCopilot(
        catalogId: 'us-apparel-prod',
        config: const VoiceCopilotConfig(
          firstTokenBudgetMs: 600,
          fallbackToHttpAfter: Duration(seconds: 2),
          bargeIn: true,
        ),
        onPartialTranscript: (text) =>
            setState(() => _transcript = text),
        onSuggestions: (recs) =>
            setState(() => _suggestions = recs),
        onHandoff: (intent) =>
            Navigator.of(context).pushNamed('/chat', arguments: intent),
      ),
    );
  }
}

rendered · iPhone 13 viewport

01 Partial transcript renders in a chip above the button as the user speaks; cleared on intent fire or barge-in.
02 onSuggestions fires once per function-call response from the model · stream-friendly, debounced.
03 onHandoff fires when the agent classifies the intent as out-of-scope (returns, support) · navigate to the chat surface.
04 Animated waveform proxies "listening" state · paused under prefers-reduced-motion.

gf_voice_copilot ships from the open-source GetWidget Flutter UI kit · 4.8k★ on github · iOS 16+ / Android 11+ · null-safety · OSS license = BSD-3

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

A Flutter voice copilot built on these patterns will hurt the app experience in any of the following situations. We turn down the engagement before a pilot is scoped.

Hot-word listening on the scope sheet

Always-on mic = mic-permission churn + battery-drain reviews + a trust gradient that's hard to recover. Tap-to-talk is the only voice-trigger UX we ship for retail apps. Clients who insist on always-on get a different vendor.

Team can't run a 30-day A/B

Mobile voice conversion claims without an A/B baseline are vibes. We've seen vendor decks where the lift number turned out to be a Wednesday-vs-Sunday comparison. Our pilot scope includes a matched control cohort and a 30-day window. Non-negotiable.

Mic-permission posture isn't taken seriously

Both stores reject voice surfaces without well-written mic-permission rationale strings, store-listing screenshots showing the tap-to-talk affordance, and a privacy policy that addresses audio handling. Treat store submission as a hard requirement before rollout, not a TODO.

Catalog isn't searchable in well-tuned facets

Voice in-app needs a strong existing search-and-facet substrate. The model function-calls into it. If the catalog is poorly tagged, voice surfaces a 9–14% out-of-scope handoff rate and the conversion lift won't materialise. The right pilot starts with catalog facets, not voice.

frequently asked · voice commerce · flutter voice copilot

What buyers ask first.
Real answers, no hedging.

What is voice commerce?

Voice commerce is a shopping interface where the buyer speaks intent ("find me a relaxed-fit oxford under $80") and the app's voice copilot function-calls the catalog, facet, and cart APIs to return live results. It is not Alexa-style remote ordering and it is not a generic chatbot with a microphone. Voice commerce sits inside the shopping surface and operates on the merchant's existing catalog index.

Why a Flutter voice copilot specifically?

Flutter ships a single codebase to iOS and Android, the GFVoiceCopilot widget is open-source under the same MIT license as the rest of the GetWidget UI kit, and WebRTC + the OpenAI Realtime API both have first-class Flutter packages. For a team already on Flutter, the voice copilot is one widget added, not a separate native module per platform.

Why tap-to-talk instead of always-on hot-word listening?

Always-on mic = mic-permission churn + battery-drain reviews + a trust gradient that's hard to recover. Tap-to-talk gives the user explicit consent on every utterance, passes both App Store and Play Store privacy review cleanly, and matches buyer expectation for a shopping app. We refuse to ship always-on for retail.

How accurate is voice product discovery in this build?

0.91 first-result-correct on a frozen 800-utterance eval set (user said it, the widget surfaced the right product or a one-tap-away facet). 0.96 catalog-attribute match across the 30-day window. Voice misinterpretation requiring a re-utterance happened on 6% of sessions.
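
One way to score a frozen eval set of this kind, sketched under simplifying assumptions: it counts only the strict case where the first returned SKU matches the expected one, and does not credit the "one-tap-away facet" case the figure above also allows. The `EvalCase` shape, the canned results, and the SKU ids are hypothetical stand-ins for a real search call.

```typescript
// One frozen eval item: what the user said and which product should come first.
interface EvalCase {
  utterance: string;
  expectedSku: string;
}

// Fraction of cases whose first search result is the expected SKU.
function firstResultCorrect(
  cases: EvalCase[],
  search: (utterance: string) => string[],
): number {
  if (cases.length === 0) {
    return 0;
  }
  let hits = 0;
  for (const evalCase of cases) {
    const results = search(evalCase.utterance);
    if (results.length > 0 && results[0] === evalCase.expectedSku) {
      hits += 1;
    }
  }
  return hits / cases.length;
}

// Canned results standing in for the real voice-to-search pipeline.
const canned: Record<string, string[]> = {
  "relaxed oxford": ["sku-1", "sku-2"],
  "black chinos": ["sku-9"],
  "linen shirt": ["sku-4", "sku-3"],
  "rain jacket": [],
};

const frozenSet: EvalCase[] = [
  { utterance: "relaxed oxford", expectedSku: "sku-1" },
  { utterance: "black chinos", expectedSku: "sku-9" },
  { utterance: "linen shirt", expectedSku: "sku-3" },
  { utterance: "rain jacket", expectedSku: "sku-7" },
];

const score = firstResultCorrect(frozenSet, (utterance) => canned[utterance] ?? []);
console.log(score); // prints 0.5
```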

What does a voice copilot cost to run?

About $0.034 per voice session on gpt-realtime-2 (median 47-second session, mostly time-to-first-byte). Across the 30-day A/B with ~1,400 voice sessions per day, that worked out to about $1,425/month in model spend. The +11.4pt conversion lift on the voice arm covered the run-cost roughly 38× over.
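
The monthly figure follows from the three numbers quoted above. A short worked version, using those rounded inputs (the function and variable names are illustrative):

```typescript
// Inputs as quoted in the answer above (rounded).
const costPerSessionUsd = 0.034;
const sessionsPerDay = 1400;
const windowDays = 30;

// Model spend over the window: sessions per day x days x cost per session.
function modelSpendUsd(perSession: number, perDay: number, days: number): number {
  return perSession * perDay * days;
}

const spend = modelSpendUsd(costPerSessionUsd, sessionsPerDay, windowDays);
console.log(Math.round(spend)); // prints 1428
```

With rounded inputs this lands at about $1,428, in line with the roughly $1,425 quoted.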

How long does it take to ship a Flutter voice copilot?

Eight weeks for this engagement: 1 week scope + UX audit, 1 week function-calling schema + Algolia integration, 2 weeks voice trigger UX iterations (we re-did the trigger after a failed week-5 A/B), 2 weeks widget development, 1 week 30-day A/B setup + instrumentation, 1 week launch + tuning.

Does this work on Shopify or only custom storefronts?

Either. The voice copilot calls into whatever catalog + facet + cart APIs you have: Shopify Storefront, Algolia, ElasticSearch, Commercetools, or a custom GraphQL layer. This case study uses an existing Algolia facet index that the client had tuned over four years; the integration adds a function-calling schema on top of it, not a replacement.

When should we NOT ship a voice copilot?

Three cases: catalog is under 500 SKUs (voice adds friction over a well-designed facet UI); average order value is under $25 (the cost of voice routing eats the margin); the team can't run a 30-day A/B before rollout (mobile conversion claims without an A/B baseline are vibes). We turn down engagements that fail any of these.
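
The three checks can be written down as a simple pre-pilot gate. This is an illustrative sketch, not a tool from the engagement: the `PilotCandidate` fields and function name are hypothetical, and the thresholds are the ones stated in the answer above.

```typescript
// Facts about a prospective engagement, gathered during discovery.
interface PilotCandidate {
  skuCount: number;
  averageOrderValueUsd: number;
  canRunThirtyDayAb: boolean;
}

// Returns the reasons to turn the pilot down; an empty list means none apply.
function voicePilotBlockers(candidate: PilotCandidate): string[] {
  const blockers: string[] = [];
  if (candidate.skuCount < 500) {
    blockers.push("catalog under 500 SKUs");
  }
  if (candidate.averageOrderValueUsd < 25) {
    blockers.push("average order value under $25");
  }
  if (!candidate.canRunThirtyDayAb) {
    blockers.push("no 30-day A/B before rollout");
  }
  return blockers;
}

// Hypothetical candidate: small catalog, healthy order value, A/B available.
const blockers = voicePilotBlockers({
  skuCount: 300,
  averageOrderValueUsd: 60,
  canRunThirtyDayAb: true,
});
console.log(blockers); // prints [ 'catalog under 500 SKUs' ]
```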

keep reading

Where this case study
points back to.

Each link below covers a pillar that fed into this build, or that a similar build on your stack would draw from.

01 Industry

E-commerce AI Development

The retail pillar: voice-in-app, recommendations, inventory agents, cart-recovery patterns. Where this voice copilot lives in the larger story.

02 Resource

Flutter App Development

We wrote the GetWidget OSS Flutter UI kit (4.8k★). When you hire us for a Flutter app, you hire the team that maintains the library.

03 Service

AI Voice Agents

Voice agent architectures · OpenAI Realtime API vs chained vs unified-vendor. The audit picks per workload, not per ideology.

04 Service

AI Chatbot Development

The chatbot pillar: where a voice surface sometimes belongs and sometimes doesn't. Honest scoping for the cases where typing wins.

05 Case study

All AI Case Studies

Six AI case studies: RAG, agents, voice, and chatbots. Same operator detail across every page.

06 Service

OpenAI Development

GPT-5 family + Realtime API + Codex playbooks. Production integration patterns that work outside the demo video.

07 Service

Generative AI Development Services

How a Flutter voice copilot fits inside a broader AI development services engagement: app shell + Realtime API + product retrieval + checkout integration.

08 Service

AI Consulting

Our Flutter voice copilot engagement started with a fixed-fee audit: voice-fit assessment, model recommendation per turn, per-call cost projection, and the kill point that gated the pilot.

Ready to ship

Want a case study like this
for your Flutter app?

Book a fixed-fee discovery audit. We'll review the app's current funnel, scope the voice-engaged cohort comparison, recommend a tap-to-talk UX + audio-transport recipe, project per-turn cost, and tell you honestly whether voice is the right primitive, or whether the catalog facets need work first. About one audit in four ends with "fix the catalog tags, voice comes later."

Read the ecommerce pillar

30 min, async or live · A/B-first scoping · Walk-away point in the pilot

Updated May 20, 2026 · By Navin Sharma
