Fintech · Mid-market US bank AI fraud detection · agentic · forced JSON · BAA-scoped
Claude Sonnet 4.6 · XGBoost 2.0 · pgvector 0.7 · LangGraph 0.2 · AWS PrivateLink
ai fraud detection case study · ai agent case study · 2026 · anonymized

AI fraud detection that holds up
in a regulator audit.

A US mid-market bank's fraud-ops team needed AI-powered fraud detection that could clear the auto-pass band silently, produce a regulator-defensible case note on every queue entry, and escalate regulatory-severity cases to a senior analyst with a two-eye signature on the dispatch. We built the AI fraud detection agent on Claude Sonnet 4.6 and Haiku 4.5, with XGBoost on the velocity short-circuit, hybrid retrieval over a four-year KYC and case-note corpus, and a policy-as-code layer that gates every write tool. Eleven weeks, BAA-scoped over AWS PrivateLink, with a kill point at week 7 that we used. It runs as the bank's AI transaction monitoring layer, with the legacy rules engine on active-standby for the first 60 days.

≥ 0.96
precision at 1% FPR · n=412 frozen eval · ±0.012 CI · (2026-Q1)
8.0 → 0.8 min
case-note review-prep per case · n=204 timed sessions · baseline (2025-Q4) → agent (2026-Q1)
~92%
of routed cases include the right evidence on first read · n=1,840
11 weeks
discovery to shadow-mode go-live · 1 calibration halt at wk 7
shipped
11 weeks · 5 engineers · 1 senior analyst lead · 1 model-risk lead
Summary

What this case study shows

A US mid-market bank shipped a Claude Sonnet 4.6 fraud-disposition agent across card, wire, ACH, and RTP rails with a regulator-defensible audit trail. Across n=412 frozen eval items plus 1,840 production decisions (±0.012 CI), the agent holds precision at or above 0.96 at a 1% false-positive rate. Stack: Claude Sonnet 4.6 plus Haiku 4.5 router, pgvector 0.7, XGBoost 2.0 velocity scorer, LangGraph 0.2, AWS PrivateLink, Langfuse. Compliance: SR 11-7, FFIEC IT Examination Handbook, OCC Bulletin 2011-12, BSA/AML. Multi-quarter ongoing engagement.

1.2B / yr
transactions across card · wire · ACH · RTP
18%
false-positive rate on the legacy rules engine
$14 / case
fully-loaded analyst review-prep cost on flagged cases
8 min / case
median analyst review-prep before the agent shipped
the problem

A rules engine
under load.

A US mid-market bank's 50-seat fraud-ops floor was running a hybrid rules + ML overlay last tuned in 2023. The bank was too small to keep a fully custom model fresh and too large to outsource the SAR-track decision to a vendor library. The buying decision was for AI fraud detection that could defend every cleared case in a regulator audit, not a higher-accuracy score on its own.

today vs · with the agent

today

Auth-boundary stream → Rules engine (300+ static rules · last tuned 2023) → Analyst queue (≈ 8 min/case manual write-up) → SAR triage
outcome · 18% FPR · analyst burnout · audit prep fragile

with the agent

Auth-boundary stream → XGBoost velocity score (skip-LLM band for score < 0.18) → Claude Sonnet 4.6 · forced JSON (evidence-cited disposition) → Policy + 2-eye + audit log
outcome · Clear · silent · audit row
outcome · Case-note · queue · 0.8 min
outcome · Escalate · 2-eye · SAR-track

The legacy rules engine cleared roughly 92% of auth-boundary traffic silently and flagged the rest into an analyst queue with an 18% false-positive rate. Fully-loaded analyst cost per flagged case (review-prep + second-look + audit-packet write-up) averaged $14, with a median of 8 minutes. Generic AI fraud detection vendors had pitched higher-accuracy scores with opaque rationale and no in-VPC option; all of them were turned down against the same operator requirements: no autonomous regulatory dispositions, no PAN egress, no explainability claim without chunk-cited evidence, and no metric that cannot be measured on a senior-analyst-labelled eval set.

legacy rules engine · baseline
false-positive rate · 18%
analyst cost / case · $14
median review-prep · 8 min
audit defensibility · manual

Pre-build cut from the bank's own dashboard, the slice this engagement displaced.

discovery · week 1

The thing that scares us is not the missed fraud; there are well-trodden processes for that. What scares us is a confident-sounding agent producing a case note that we can't defend in a regulator audit because the chain of evidence isn't reproducible. Show us how every disposition reconstructs from the trace, or we're not signing.

Head of Compliance · US mid-market bank · Fintech
the approach · ai-driven fraud detection pipeline

AI fraud detection pipeline — seven stages,
three outcome lanes.

An XGBoost velocity short-circuit on the auto-clear band, hybrid retrieval over a four-year KYC and case-note corpus, a forced-JSON Claude Sonnet 4.6 disposition, and a policy-as-code gate before any tool dispatch.

three decisions that shaped the ai fraud detection build
design decision · 01

Skip the LLM on the auto-clear band

we rejected
Run Claude on every transaction
because
92% of auth-boundary traffic is below the velocity-score threshold. Burning Sonnet tokens on cases the rules engine already cleared is the indefensible line in the cost math; the XGBoost short-circuit is what makes the unit economics work.
design decision · 02

Forced JSON with a severity enum

we rejected
Free-text disposition + downstream parser
because
The regulator audit needs the disposition packet to be reproducible from the trace. A schema-bounded severity enum (low | med | high | regulatory) is what makes the SAR-track decision deterministic; the model can't smuggle a fifth severity into the output.
design decision · 03

Two-eye gate on regulatory severity

we rejected
Auto-route regulatory severity to SAR queue
because
Anything that touches the FinCEN clock starts on a human signature, not a model output. The senior-analyst approval row is checked into the policy file — the runtime refuses to dispatch the escalation tool without it. We accepted a slower escalate path for an audit-defensible one.
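Decisions 02 and 03 both depend on the model output being checked before anything downstream reads it. The following is a minimal sketch of that check, not the bank's runtime: the stack table lists Zod for schema validation, whereas this version is hand-rolled so it stays dependency-free, and `validateDisposition`, `Disposition`, and `Verdict` are hypothetical names modelled on the synthetic packet shown later in this case study.

```typescript
// Hypothetical sketch: names and field layout follow the synthetic packet in
// this case study, not a published API.
const SEVERITIES = ["low", "med", "high", "regulatory"] as const;
type Severity = (typeof SEVERITIES)[number];

interface Evidence {
  claim: string;
  evidence_id: string;
  source: string;
}

interface Disposition {
  case_id: string;
  severity: Severity;
  confidence: number;
  evidence: Evidence[];
}

type Verdict =
  | { ok: true; value: Disposition }
  | { ok: false; errors: string[] };

function isSeverity(value: unknown): value is Severity {
  return typeof value === "string" && (SEVERITIES as readonly string[]).indexOf(value) !== -1;
}

function validateDisposition(raw: string, retrievedIds: Set<string>): Verdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, errors: ["output is not a JSON object"] };
  }
  const d = parsed as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof d.case_id !== "string" || d.case_id.length === 0) {
    errors.push("case_id is required");
  }
  // The enum is closed: a fifth severity fails here instead of reaching a parser.
  if (!isSeverity(d.severity)) {
    errors.push("severity must be low | med | high | regulatory");
  }
  if (typeof d.confidence !== "number" || d.confidence < 0 || d.confidence > 1) {
    errors.push("confidence must be a number in [0, 1]");
  }

  const evidence: unknown[] = Array.isArray(d.evidence) ? d.evidence : [];
  if (evidence.length === 0) errors.push("at least one evidence item is required");
  for (const item of evidence) {
    const e = (item ?? {}) as Partial<Evidence>;
    if (typeof e.claim !== "string" || typeof e.evidence_id !== "string" || typeof e.source !== "string") {
      errors.push("evidence items need claim, evidence_id, and source");
    } else if (!retrievedIds.has(e.evidence_id)) {
      // A cited chunk must be one the retrieval stage actually returned.
      errors.push(`evidence_id ${e.evidence_id} was not in the retrieved set`);
    }
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: d as unknown as Disposition };
}
```

A rejected verdict carries every failed rule rather than only the first, which is the shape an audit row would want; whether the real runtime records failures that way is not stated in the source.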
why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage to look at. No single end-to-end number that hides which subsystem broke.

Decision model

Labelled severity-correctness + groundedness on the frozen 412-item eval. Forced-JSON schema enforces evidence-chunk citation on every claim — the model cannot reason its way past the validator.

Retrieval

Top-k recall on the frozen eval. RRF + reranker tuned against this number, not end-to-end accuracy.

Reranker

Top-1 precision on the held-out slice. Margin gate catches over-confidence before Sonnet ever sees it.

Velocity model

ROC-AUC + ECE on the auto-clear band.

Case-note generator

Regulator-audit acceptance · 10% weekly sample.

Guardrails

Policy-rejection vs analyst-override rate.
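The retrieval contract above is measured on fused rankings, so it may help to see what reciprocal-rank fusion does to two candidate lists. This is an illustrative sketch only: `k = 60` is the value quoted in this case study, while `reciprocalRankFusion`, `recallAtK`, and the toy ids are assumptions made for the example.

```typescript
// Reciprocal-rank fusion: each list contributes 1 / (k + rank) per document.
// k = 60 matches the value quoted in this case study; names are illustrative.
interface FusedHit {
  id: string;
  score: number;
}

function reciprocalRankFusion(rankings: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id so the order is reproducible.
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Share of the relevant ids that appear in the first k results.
function recallAtK(rankedIds: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) throw new Error("recall is undefined without relevant ids");
  const hits = rankedIds.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Toy lists standing in for the vector and BM25 candidates.
const vectorHits = ["a", "b", "c"];
const lexicalHits = ["b", "d", "a"];
const fused = reciprocalRankFusion([vectorHits, lexicalHits]).map((h) => h.id);
// fused is ["b", "a", "d", "c"]: "b" wins because it ranks well in both lists.
```

Because the fused score only uses ranks, the two retrievers never need comparable raw scores, which is one reason the metric can be tuned per stage.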

under the hood

The fraud agent,
auth to audit.

Every transaction enters at the top. The XGBoost velocity score skips the LLM on the auto-clear band; anything above the threshold runs hybrid retrieval over a four-year KYC and case-note corpus, reranks the evidence, and lets Claude Sonnet 4.6 produce a forced-JSON disposition.

outcome · ~93.4% · Clear (silent) · auto-pass band · audit row written · no analyst touch
outcome · ~5.1% · Case-note · queue · structured case file · evidence cited · 0.8 min review
outcome · ~1.5% · Escalate · regulatory · senior-analyst 2-eye gate · SAR-track on confirm

latency budgets are p50/p95 from a 30-day production window · end-to-end p95 inside the 2.6s decision SLA
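The three lanes above can be sketched as a single routing function. This is a hedged illustration rather than the production graph: the 0.18 cut-off is the skip-LLM band quoted earlier, but `routeTransaction`, the `decide` callback, and the rule that every non-regulatory severity lands in the case-note queue are assumptions made for the example.

```typescript
// Hypothetical sketch. The 0.18 cut-off is the skip-LLM band quoted in this
// case study; the lane mapping for non-regulatory severities is an assumption.
type Severity = "low" | "med" | "high" | "regulatory";
type Lane = "clear" | "case_note" | "escalate";

interface Routed {
  lane: Lane;
  llmCalled: boolean;
}

const AUTO_CLEAR_BELOW = 0.18;

function routeTransaction(vScore: number, decide: () => Severity): Routed {
  if (!(vScore >= 0 && vScore <= 1)) {
    throw new RangeError(`velocity score out of range: ${vScore}`);
  }
  // Auto-clear band: the LLM is never invoked, so no tokens are spent.
  if (vScore < AUTO_CLEAR_BELOW) return { lane: "clear", llmCalled: false };

  // Everything else pays for retrieval plus the forced-JSON decision.
  const severity = decide();
  return {
    lane: severity === "regulatory" ? "escalate" : "case_note",
    llmCalled: true,
  };
}

// Three rows shaped like the synthetic replay tape.
const grocery = routeTransaction(0.09, () => "low");
const retail = routeTransaction(0.71, () => "high");
const wire = routeTransaction(0.92, () => "regulatory");
```

Passing the decision step in as a callback keeps the short-circuit testable without a model call, which is the property the cost math relies on.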

BAA-scoped
Anthropic over AWS PrivateLink · no PAN leaves the customer VPC
0
autonomous regulatory escalations · senior-analyst signs every one
7-year
audit retention · WORM-equivalent S3 object lock · per BSA / SAR rules
shadow-first
three weeks in silent shadow against the rules engine before any cutover
deterministic replay · synthetic data

A 0.32-second window
at the auth boundary.

Eight rows from a synthetic replay tape — the same shape the production stream sees at ~38k transactions/sec peak. The agent fans out into three lanes per row: silent clear, queued case-note, or senior-analyst escalation. No real PAN, no real merchant; this is a replay viewer, not a live feed.

| ts | card | merchant | amount | v-score | decision | reason |
|---|---|---|---|---|---|---|
| 14:02:18.041 | •••• 4019 | Grocery · POS | $42.18 | 0.09 | clear | low-risk merchant · habitual |
| 14:02:18.092 | •••• 7124 | Online retail | $1,840.00 | 0.71 | case-note | amount p99 · novel beneficiary |
| 14:02:18.137 | •••• 3055 | Fuel · CRIND | $58.40 | 0.14 | clear | in-pattern · velocity normal |
| 14:02:18.184 | •••• 8801 | Wire · cross-border | $9,250.00 | 0.92 | escalate | structured-pattern hit · senior-analyst 2-eye |
| 14:02:18.226 | •••• 2236 | Streaming · sub | $14.99 | 0.04 | clear | recurring · pre-allow |
| 14:02:18.271 | •••• 6498 | Electronics | $612.00 | 0.48 | case-note | ip-geo drift · low-confidence |
| 14:02:18.318 | •••• 5712 | Restaurant | $78.25 | 0.11 | clear | habitual · merchant in cohort |
| 14:02:18.366 | •••• 9043 | Crypto on-ramp | $4,500.00 | 0.86 | escalate | first-seen on-ramp · regulatory routing |

replay clock advances 41 ms per row · in production roughly 7 of every 8 rows fall in the auto-allow band (v-score < 0.18); the replay over-samples flagged rows for legibility

the stack · ai fraud detection software

AI fraud detection stack — named tools,
named versions.

Everything in the build is a thing the model-risk committee can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, policies, and feature definitions are all checked into the bank's repo — not ours.

Claude Sonnet 4.6 · Anthropic API · forced JSON · role: decision
Claude Haiku 4.5 · role: routing + case-note narrative
XGBoost 2.0 · role: velocity score · 142 features
pgvector 0.7 · role: embedding retrieval · KYC corpus
BM25 (Postgres tsvector) · role: lexical retrieval
BAAI bge-reranker-large · role: cross-encoder rerank · g5.xlarge in-VPC
LangGraph 0.2.x · role: agent orchestrator
Langfuse · role: per-decision trace · 90d hot / 7yr cold
AWS PrivateLink · role: in-VPC Anthropic inference · zero egress
how it actually runs

Production shape,
under the hood.

Numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 412-item set the CI gates on.

latency budget

Per-stage P50 / P95 (ms)

| stage | p50 | p95 | tooling |
|---|---|---|---|
| Kafka consumer + parse | 8 | 18 | Confluent · ISO 8583 superset · per-tenant partition key |
| XGBoost velocity score | 4 | 9 | XGBoost 2.0 · 142 features · auto-clear band short-circuit |
| Hybrid retrieval | 38 | 92 | pgvector cosine top-40 ∥ tsvector BM25 top-40 → RRF k=60 |
| Cross-encoder rerank | 62 | 138 | BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-12 |
| Claude Sonnet 4.6 decision | 1,420 | 2,080 | Anthropic API over AWS PrivateLink · ~3,200 in / ~420 out tokens |
| Claude Haiku 4.5 case-note | 780 | 1,180 | narrates from Sonnet's evidence ids · ~1,100 in / ~340 out |
| Policy + 2-eye + audit log | 11 | 22 | TypeScript runtime · Zod schema · WORM-equivalent audit row |
| Total (LLM-routed path) | 2,323 | 3,537 | agent boundary · ~8% of traffic; auto-clear path < 50ms total |

p50/p95 from a 30-day rolling window over n ≈ 28,400 LLM-routed decisions / mo (~92% of traffic short-circuits before the LLM call). SLO is p95 ≤ 3,500 ms on the LLM-routed path. The current p95 of 3,537 ms is ≈ 101% of that budget, so we're in active tuning on the reranker timeout to bring the tail in; the table above reports the number as measured, not the one we hope to ship.
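The burn figure is simply the observed p95 divided by the SLO. A small sketch of that arithmetic follows, assuming a nearest-rank percentile; the source does not say which percentile method the dashboards use, and `percentile` and `sloBurn` are illustrative names.

```typescript
// Nearest-rank percentile plus a burn ratio against a latency SLO.
// Function names are illustrative; 3,537 ms and 3,500 ms come from the text above.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  // Integer arithmetic first, so the 1-based rank is not nudged by float error.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

function sloBurn(observedP95Ms: number, sloMs: number): number {
  return observedP95Ms / sloMs;
}

const burn = sloBurn(3537, 3500); // just over 1.0, i.e. about 101% of budget
```

Anything above 1.0 means the tail is outside the SLO, which is why the reranker timeout is the current tuning target.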

triage/tools/escalate_case.policy.ts typescript
// triage/tools/escalate_case.policy.ts
//
// Every write tool the agent can reach for has a policy file.
// The runtime imports these at startup and refuses to dispatch
// any tool call that doesn't pass. Regulatory severity needs a
// senior-analyst signature before the SAR-track is touched.

import { Policy } from "@gw/agent-runtime";

export const escalate_case: Policy = {
  description: "Send a flagged transaction to the human-review queue.",
  inputs: {
    case_id:     "uuid, exists in cases table, has not been escalated",
    severity:    "enum: low | med | high | regulatory",
    evidence:    "array, min 1, items have {claim, evidence_id, source}",
    confidence:  "number in [0,1]; required, no default",
    reasoning:   "string, 80–600 chars, grounded in evidence",
  },
  preconditions: [
    "agent.confidence_calibrated === true",
    "transaction.amount > 0",
    "no_pending_escalation_for(case_id)",
    "every(evidence, e => retrieval.contains(e.evidence_id))",
  ],

  rate_limits: { perAgent: "30/min", perCase: "1" },

  audit: {
    redact:    ["pan", "ssn", "iban", "routing"],
    retain:    "7y",
    store:     "s3:bsa-audit-log/worm",  // WORM-equivalent object lock
    log_shape: ["case_id", "severity", "evidence", "model_version",
                "retrieval_chunks", "policy_verdict", "approver"],
  },

  // Two-eye rule. Regulatory severity needs a senior-analyst sign-off
  // before the runtime dispatches the SAR-track integration.
  approval: {
    required: ({ severity }) => severity === "regulatory",
    approver: "role:senior-analyst",
    deadline_mins: 30, // ages back to the queue with an "aged out" tag
  },
};
The policy file for the regulatory-escalation tool inside the AI fraud detection runtime. The agent imports it at startup and refuses to dispatch a tool call that doesn't pass. Two-eye rule is enforced in code, not a config flag, and the same pattern ships on every write tool.
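To make the `approval` block concrete, here is a minimal sketch of how a runtime might evaluate it before dispatch. It is an assumption-laden illustration: the real `@gw/agent-runtime` internals are not shown in this case study, so `twoEyeGate`, `EscalationRequest`, and the reason codes are hypothetical; only the role string and the 30-minute deadline come from the policy file.

```typescript
// Hypothetical gate mirroring the policy's approval block. Only the role
// string and the 30-minute deadline are taken from the policy file above.
type Severity = "low" | "med" | "high" | "regulatory";

interface Approval {
  approverRole: string;
  signedAtMs: number;
}

interface EscalationRequest {
  caseId: string;
  severity: Severity;
  requestedAtMs: number;
  approval?: Approval;
}

type GateResult =
  | { dispatch: true }
  | { dispatch: false; reason: "awaiting_approval" | "wrong_role" | "aged_out" };

const APPROVER_ROLE = "role:senior-analyst";
const DEADLINE_MINS = 30;

function twoEyeGate(req: EscalationRequest, nowMs: number): GateResult {
  // Only regulatory severity needs the second pair of eyes.
  if (req.severity !== "regulatory") return { dispatch: true };

  const deadlineMs = req.requestedAtMs + DEADLINE_MINS * 60 * 1000;
  if (!req.approval) {
    return nowMs > deadlineMs
      ? { dispatch: false, reason: "aged_out" }
      : { dispatch: false, reason: "awaiting_approval" };
  }
  if (req.approval.approverRole !== APPROVER_ROLE) {
    return { dispatch: false, reason: "wrong_role" };
  }
  // A signature that lands after the deadline does not revive the request.
  if (req.approval.signedAtMs > deadlineMs) {
    return { dispatch: false, reason: "aged_out" };
  }
  return { dispatch: true };
}
```

The gate returns a reason instead of throwing so the refusal itself can be written to the audit row alongside the policy verdict.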
ai fraud detection unit economics

Per-decision and monthly cost math

| line item | $ / decision | $ / month (≈ 28k LLM-routed decisions) | note |
|---|---|---|---|
| Claude Sonnet 4.6 · input | $0.0096 | $269 | 3,200 tokens × $3.00 / 1M |
| Claude Sonnet 4.6 · output | $0.0063 | $176 | 420 tokens × $15.00 / 1M |
| Claude Haiku 4.5 · case-note | $0.0022 | $63 | 1,100 in × $0.80 / 1M + 340 out × $4.00 / 1M |
| voyage-3-large embeddings | $0.0006 | $17 | ≈ 5,000 tokens × $0.12 / 1M |
| pgvector + RDS db.r6i.xlarge | | $612 | BAA-scoped Postgres · pgvector + tsvector |
| g5.xlarge reranker (24/7) | | $378 | BAAI bge-reranker-large self-host |
| AWS PrivateLink + endpoints | | $96 | Anthropic in-VPC inference |
| Langfuse self-hosted (t3.large) | | $104 | trace store · 90d hot / 7yr cold |
| All-in monthly | ≈ $0.061 | ≈ $1,715 | vs. ≈ $14 × 6k cases/mo = $84k legacy review-prep |

Token costs use Anthropic's public Sonnet 4.6 + Haiku 4.5 pricing as of May 2026: $3 / 1M input, $15 / 1M output on Sonnet; $0.80 / 1M input, $4 / 1M output on Haiku. Infra costs are AWS us-east-2 list price; the bank paid less under an EDP. The legacy comparison line is the bank's own per-case review-prep cost × the routed-cases volume. The agent doesn't replace analyst time at the decision boundary; it compresses the review-prep half of the case workload.
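The token line items can be recomputed from the stated token counts and per-million prices. The sketch below does only that arithmetic; `tokenCost` and the constant names are illustrative, the prices are the ones quoted in this section rather than a lookup against any pricing API, and the 28,000 monthly volume is the rounded figure from the table header.

```typescript
// Per-decision token cost from the token counts and prices quoted in this section.
// Names are illustrative; prices are copied from the text, not fetched.
interface Pricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

const SONNET_PRICING: Pricing = { inputPerMTok: 3.0, outputPerMTok: 15.0 };
const HAIKU_PRICING: Pricing = { inputPerMTok: 0.8, outputPerMTok: 4.0 };

function tokenCost(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1_000_000;
}

const sonnetPerDecision = tokenCost(3200, 420, SONNET_PRICING); // ≈ $0.0159
const haikuPerDecision = tokenCost(1100, 340, HAIKU_PRICING);   // ≈ $0.0022

const DECISIONS_PER_MONTH = 28_000;
const monthlyModelSpend = (sonnetPerDecision + haikuPerDecision) * DECISIONS_PER_MONTH;
```

Fixed infrastructure is deliberately left out of the function: it does not scale per decision, so folding it in would hide how cost moves with LLM-routed volume.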

eval composition

What's in the frozen 412-item set

| category | items | what it checks | CI-gate threshold |
|---|---|---|---|
| Severity-decision golds | 100 | labelled disposition + severity band on real (de-id) past cases | ≥ 0.95 precision @ 1% FPR |
| Evidence groundedness | 120 | every rationale claim points to a retrieved chunk id that supports it | ≥ 0.93 groundedness |
| Retrieval recall | 80 | correct case + policy chunks in top-5 after RRF + rerank | ≥ 0.90 recall@5 |
| Refusal / adversarial | 60 | structured-pattern hits, jailbreak attempts, OOD merchant categories | 100% refusal on must-refuse |
| Calibration golds | 52 | confidence-vs-correctness on held-out cases · ECE check | ECE ≤ 0.04 |

Eval set is frozen: items are only added, never edited. The senior analyst lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. The Black Friday holiday-window slice (added at week 7) became a 5th fold and is now permanent.
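The regression rule can be sketched as a comparison between two score cuts. This is an illustration under stated assumptions: "1 point" is read as one percentage point on a 0 to 100 scale, every score is treated as higher-is-better (a lower-is-better metric such as ECE would need inverting first), and `regressionGate` is a hypothetical name.

```typescript
// Hypothetical release gate: fail if any category drops by more than
// maxDropPoints versus the prior cut. Scores are on a 0-100 scale and
// assumed to be higher-is-better.
type ScoreCut = Record<string, number>;

function regressionGate(prior: ScoreCut, current: ScoreCut, maxDropPoints = 1): string[] {
  const failures: string[] = [];
  for (const category of Object.keys(prior)) {
    const now = current[category];
    if (now === undefined) {
      // A category that disappears is treated as a failure, not a pass.
      failures.push(`${category}: missing from current cut`);
      continue;
    }
    const drop = prior[category] - now;
    if (drop > maxDropPoints) {
      failures.push(`${category}: dropped ${drop.toFixed(1)} points`);
    }
  }
  return failures; // an empty array means the release may proceed
}

const failures = regressionGate(
  { precision: 96.2, groundedness: 96.0, recall_at_5: 93.0 },
  { precision: 95.0, groundedness: 95.5, recall_at_5: 92.0 },
);
// Only "precision" fails: it fell 1.2 points; the others fell 0.5 and exactly 1.0.
```

A signed override would sit outside this function, so the gate's verdict stays a pure comparison that can be replayed later.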

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. We run four controls that keep our calibration honest after cutover, and our on-call rotation owns each one.

Override-review meeting

Every agent-vs-analyst diff opened. Systematic drift (>3 same pattern/wk) becomes a JIRA ticket against the eval set and a candidate fine-tune slice.

Trace retention

Langfuse self-hosted in the customer VPC, with 7-year retention. That exceeds the BSA's five-year record-retention minimum and matches the bank's internal SAR documentation policy.

On-call rotation

Two engineers per week. 99.5% pipeline-availability SLO + p95 ≤ 3.5s decision SLO on the LLM-routed path.

Model-risk audit sample

50 rows monthly: velocity score, retrieval candidates, reranker scores, raw model output, parsed JSON, policy verdict, senior-analyst approval.
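The drift rule in the override review (more than three overrides with the same pattern in a week) is easy to express as a count. The sketch below assumes each override has already been tagged with a pattern label and an ISO week, which the source does not describe; `driftCandidates` and the `Override` shape are hypothetical.

```typescript
// Hypothetical drift check: flag any (week, pattern) pair that analysts
// overrode more than `threshold` times. Pattern tagging is assumed upstream.
interface Override {
  pattern: string; // e.g. "gift-card top-up"
  isoWeek: string; // e.g. "2026-W14"
}

function driftCandidates(overrides: Override[], threshold = 3): string[] {
  const counts = new Map<string, number>();
  for (const o of overrides) {
    const key = `${o.isoWeek}|${o.pattern}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts)
    .filter(([, n]) => n > threshold) // strictly more than the threshold
    .map(([key]) => key)
    .sort();
}
```

Each returned key is a candidate for a ticket against the eval set; deciding whether it also becomes a fine-tune slice stays a human call in the weekly meeting.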

interactive · drag the threshold

Precision vs FPR,
where you stand the line.

Where the team stands the threshold is a policy choice, not a model property. Moving it shifts precision, recall, and per-month false-positive volume together. We anchor production at 1% FPR, the ops team's documented ceiling for analyst review-prep load.

curve fitted from the frozen 412-item eval set · production op-point at 1% FPR
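Choosing the operating point can be sketched as a sweep over score thresholds that keeps the lowest threshold whose false-positive rate stays within the ceiling. This is a simplified illustration on labelled scores, not the fitted curve described here; `operatingPointAtFpr` and the `Scored` shape are assumptions.

```typescript
// Pick the lowest score threshold whose FPR stays within maxFpr, then report
// precision and recall there. Names and data shape are illustrative.
interface Scored {
  score: number;
  isFraud: boolean;
}

interface OpPoint {
  threshold: number;
  precision: number;
  recall: number;
  fpr: number;
}

function operatingPointAtFpr(items: Scored[], maxFpr: number): OpPoint | null {
  const positives = items.filter((i) => i.isFraud).length;
  const negatives = items.length - positives;
  if (positives === 0 || negatives === 0) return null; // rates are undefined

  // Candidate thresholds are the distinct scores, highest first.
  const thresholds = Array.from(new Set(items.map((i) => i.score))).sort((a, b) => b - a);
  let best: OpPoint | null = null;
  for (const threshold of thresholds) {
    const flagged = items.filter((i) => i.score >= threshold);
    const tp = flagged.filter((i) => i.isFraud).length;
    const fp = flagged.length - tp;
    const fpr = fp / negatives;
    if (fpr > maxFpr) break; // FPR can only grow as the threshold falls
    best = { threshold, precision: tp / flagged.length, recall: tp / positives, fpr };
  }
  return best;
}
```

Lowering the threshold as far as the FPR ceiling allows maximises recall at that ceiling, which is why the ceiling itself is the policy decision.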

ai fraud detection build · 11 weeks · honest version

The timeline,
including the week we halted.

Five stages, milestone-billed. The week-7 Black Friday shadow surfaced a calibration drift on a holiday-shopping velocity pattern the eval set hadn't seen. We halted cutover, ingested the holiday-window slice as a new eval fold, re-fit the calibration head, and only then promoted the AI fraud detection agent to primary screen. The honest version of `11 weeks` includes the week we ran the sweep.

  1. Weeks 1–2

    Discovery + frozen eval set

    Two weeks shadowing the fraud-ops team. The senior analyst lead labelled 412 frozen eval items drawn from 18 months of (de-identified) past cases — each carrying a labelled correct disposition + the rationale + the evidence chunks that should ground the call. We wrote the harness; the ops team wrote the answers. Scoping decision: the deliverable is a structured-output agent, not a chatbot, and the eval gate is non-negotiable.

    412-item frozen eval + severity-band rubric · scope memo signed
  2. Weeks 3–4

    Corpus + velocity score + retrieval

    Ingested four years of KYC artifacts and historical case notes into pgvector 0.7 inside the customer VPC. BM25 sidecar over the same chunks. XGBoost 2.0 velocity model trained against the labelled fraud history with 142 features; calibrated with isotonic regression on a held-out slice. Reciprocal-rank fusion tuned on the eval slice; cross-encoder rerank wired in when top-1 recall plateaued.

    Hybrid retrieval at 0.93 top-5 recall · velocity ECE 0.041
  3. Weeks 5–6

    Agent skeleton + policy-as-code

    LangGraph 0.2.x agent with three read-only tools (case lookup, KYC pull, structured-pattern check) and two write tools (case-note write, escalation dispatch). Every tool carries a policy file in `triage/tools/`. Forced-JSON disposition via a schema-constrained Anthropic tool call (forced `tool_choice`). Two-eye gate baked into the runtime; senior-analyst approval is what unblocks the regulatory-severity branch.

    End-to-end pipeline behind a feature flag · BAA + PrivateLink wired
  4. Week 7

    Black Friday shadow — calibration drift caught

    Three weeks of silent shadow against the live rules engine. Day 9 was Black Friday, and a holiday-shopping pattern surfaced that nobody had labelled in the eval set — a structurally novel velocity pattern from gift-card top-ups that the model was over-confidently clearing as legit. We halted cutover, ingested the holiday-window slice as a fresh-data fold, re-fit the calibration head, and re-ran the full eval. The honest version of `shipped on time` includes the week we sat on our hands and ran the calibration sweep.

    ECE recalibrated from 0.067 → 0.028 on the Black Friday-augmented eval slice
    Walk-away point
  5. Weeks 8–11

    Cutover + SAR-track integration

    Promoted to primary screen with the rules engine in active-standby. Compliance reviewed the audit-log packet end-to-end; FinCEN SAR-track integration tested against the bank's e-filing path. Four ops-team training sessions on the case-note acceptance flow + the two-eye approval surface. PagerDuty wired to the regulatory-severity lane. Old rules engine stays in active-standby for 60 days post-cutover; every diff between agent + rules logged for the model-risk lead's weekly review.

    Production cutover · SAR-track audited · model-risk committee sign-off
ai fraud detection eval results · 412 frozen items

How we know
it works.

The AI fraud detection eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 412. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

| metric | baseline (wk 2) | v1 (wk 5) | v2 (wk 6) | current (live) | target |
|---|---|---|---|---|---|
| Precision @ 1% FPR | 0.812 | 0.928 | 0.952 | 0.962 | ≥ 0.95 |
| Recall on labelled fraud | 0.681 | 0.812 | 0.864 | 0.881 | ≥ 0.85 |
| Calibration (ECE) | 0.094 | 0.067 | 0.039 | 0.028 | ≤ 0.04 |
| Case-note groundedness | | 0.88 | 0.94 | 0.96 | ≥ 0.93 |
| Refusal rate | | 12.4% | 10.1% | 8.6% | 8–12% |
| P95 time-to-disposition | | 3.4s | 2.9s | 2.6s | ≤ 3.0s |

Sample size for the ≥ 0.96 precision figure is n=412 frozen eval items + the production confirmation slice n ≈ 1,840 cases reviewed by the senior analyst over the first 30 days post-cutover. Confidence interval is ±0.012 on the precision at the 1% FPR op-point. ECE is expected calibration error on the labelled set. P95 is end-to-end on the LLM-routed path (auto-clear path is under 50ms total). Refusal rate is the share of inputs where the agent legally cannot decide and routes straight to a senior analyst — by design, not by failure. Note: refusal rate v1 → v2 → current is not monotone-improving by design; we tuned the refusal-threshold up in v2 after the senior analyst lead flagged that v1 was clearing borderline cases that should have routed for review.
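For readers who want the calibration metric spelled out, expected calibration error can be sketched with equal-width confidence bins. The source does not state the binning scheme, so the ten-bin default and the names `expectedCalibrationError` and `Prediction` are assumptions.

```typescript
// Expected calibration error with equal-width bins: the weighted average gap
// between mean confidence and observed accuracy per bin. Binning is an assumption.
interface Prediction {
  confidence: number; // model confidence in [0, 1]
  correct: boolean;   // whether the disposition matched the label
}

function expectedCalibrationError(preds: Prediction[], bins = 10): number {
  if (preds.length === 0) throw new Error("no predictions");
  const sumConf: number[] = [];
  const sumCorrect: number[] = [];
  const count: number[] = [];
  for (let b = 0; b < bins; b += 1) {
    sumConf.push(0);
    sumCorrect.push(0);
    count.push(0);
  }
  for (const p of preds) {
    // Confidence 1.0 belongs to the last bin rather than an extra one.
    const b = Math.min(bins - 1, Math.floor(p.confidence * bins));
    sumConf[b] += p.confidence;
    sumCorrect[b] += p.correct ? 1 : 0;
    count[b] += 1;
  }
  let ece = 0;
  for (let b = 0; b < bins; b += 1) {
    if (count[b] === 0) continue;
    const gap = Math.abs(sumCorrect[b] / count[b] - sumConf[b] / count[b]);
    ece += (count[b] / preds.length) * gap;
  }
  return ece;
}
```

An over-confident model shows up as bins where mean confidence sits above accuracy, which is the pattern the week-7 shadow exposed.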

artefact diff · synthetic case

What the ops team reads
when a case routes for review.

The 8.0 → 0.8 minute delta isn't a tooling story. It's an artefact story. On the left: what an analyst produced manually — narrative-heavy, hard to skim, citations buried in the prose. On the right: the agent's structured packet — same evidence, surfaced as fields the regulator-audit reviewer reads in seconds.

before · manual write-up ≈ 8 min / case

Case # FA-2026-04-11-0917
Reviewer: J. Reyes · 11 Apr 2026 14:22 UTC

Customer (PAN ending 8801) initiated a wire transfer of $9,250.00 USD to a beneficiary account first seen on the platform on 09 Apr 2026. Reviewing prior 90-day activity for this customer shows wire activity concentrated to two prior beneficiaries (relatives, KYC-verified, historical pattern stable). The new beneficiary is an LLC registered in a jurisdiction with elevated SAR-correlation per our internal scoring (referenced in policy doc PL-WIRE-2024 §4.2).

Velocity score on this transaction was 0.92 (model output, see ML-FRAUD-3.4 dashboard). Cross-checking against our case-history corpus, three structurally similar cases have been reviewed in the past 18 months; two were confirmed-fraud, one cleared after additional context. The originating IP geo (Newark, NJ) is consistent with the cardholder's historical pattern.

Recommendation: escalate for regulatory review. Senior analyst sign-off required per the two-eye policy on structured-pattern hits. Note: cardholder not yet contacted — pending compliance lead approval for outbound.

after · agent-generated ≈ 0.8 min / case
{
  "case_id":   "FA-2026-04-11-0917",
  "severity":  "regulatory",
  "decision":  "escalate",
  "confidence": 0.91,

  "evidence": [
    {
      "claim": "Beneficiary first-seen 09 Apr 2026; not in cardholder's 90d graph.",
      "evidence_id": "chunk_a4f0c12b9e44",
      "source": "ledger.beneficiary_first_seen"
    },
    {
      "claim": "LLC jurisdiction matches PL-WIRE-2024 §4.2 elevated-SAR list.",
      "evidence_id": "chunk_71d33e0a4c8b",
      "source": "policy.PL-WIRE-2024"
    },
    {
      "claim": "Velocity score 0.92; 3 structurally similar cases in the corpus.",
      "evidence_id": "chunk_e8b290745f01",
      "source": "ml.velocity + case-history"
    }
  ],

  "two_eye_required": true,
  "approver_role":    "role:senior-analyst",
  "sar_track":        true,
  "audit_retain_yrs": 7
}

both artefacts are synthetic · case-id, beneficiary, and PAN-last-4 are illustrative · the agent packet is what the regulator-audit reviewer reads, not the prose

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

An AI fraud detection agent built on these patterns will fail in ways that cannot be defended in a regulator audit in any of the following situations. We turn down the engagement before a pilot is scoped.

Autonomous SAR filing on the scope sheet

The 30-day FinCEN clock starts on human detection. Any pitch that includes "the agent files the SAR" is the pitch we walk away from. Drafting the packet is fine; filing it is a human signature obligation. We've turned this down twice.

No weekly override review

If the senior analyst lead is not going to review agent-vs-analyst diffs weekly for the first six months, the calibration head drifts and nobody catches it. The eval set is necessary, not sufficient. Week-7 drift was caught by the shadow against a live analyst floor, not by the eval.

BAA + PrivateLink gaps in the deployment plan

No BAA from the model vendor, no in-VPC inference, no WORM-equivalent audit retention. The regulatory posture is not a post-launch fix. The bank's compliance team had to commit the legal posture at week 1; otherwise the pilot would not have been signed.

Buyer wants a fraud-score API, not a disposition packet

If procurement is "we want a fraud-score API that returns 0 to 1 per transaction," the buyer is shopping for a model, not an agent. Our shape is structured disposition + evidence + audit packet. The score-API shape gets a better outcome from a feature store + an off-the-shelf score vendor.

frequently asked — ai fraud detection · regulator audit

What buyers ask first.
Real answers, no hedging.

What is AI fraud detection?
AI fraud detection is a layered system where an ML velocity short-circuit handles the auto-clear band silently and a forced-JSON LLM agent dispositions the remaining transactions with chunk-cited evidence from the KYC and case-note corpus. The schema is the contract: every disposition cites a policy and an evidence chunk, or the audit packet fails closed.
How is AI fraud detection different from a rules engine?
A rules engine fires deterministic flags from hand-tuned thresholds. AI fraud detection adds two layers: an ML model (XGBoost in this case study) that learns velocity patterns the rules can't enumerate, and an LLM-as-judge over case-note + KYC evidence that produces a regulator-defensible disposition note with citations. Rules stay in the stack as a fallback and a stat backstop on the high-severity SAR-track lane.
Will an AI fraud detection agent hold up in a regulator audit?
Yes, if it's built for audit posture from week 1. This case study's audit packet reconstructs every disposition from the Langfuse trace: the model version, the retrieved candidates with chunk_ids, the reranker scores, the LLM's raw output, the parsed JSON, the policy-as-code verdict, and the analyst override (if any). Every claim cites a policy and an evidence chunk; nothing reasons in the model's head without a trace.
How does AI fraud detection align with FFIEC / BSA / SR 11-7 guidance?
Scoped at week 1: model governance under SR 11-7, transaction-monitoring expectations under FFIEC BSA/AML, and adverse-action notice handling under ECOA where applicable. The bank's model-risk lead reviewed the policy-as-code layer and the eval methodology before any production cutover. SAR-track dispositions never run autonomously; senior analyst two-eye signature is required on every regulatory-severity dispatch.
How accurate is the AI fraud detection agent?
Precision ≥ 0.96 at 1% FPR on the frozen 412-item eval set labelled by the senior analyst lead. False-positive rate dropped from 18% (legacy rules engine) to 4.7% on the live shadow slice. Median review-prep time dropped from 8 minutes to 0.8 minutes on flagged cases. Case-note groundedness 0.96 (every claim cites an evidence chunk that supports it).
What does AI fraud detection cost to run?
About $0.06 all-in per transaction that actually hits the LLM (Sonnet 4.6 decision plus Haiku 4.5 case-note, roughly 3,200 input + 420 output Sonnet tokens per decision). 92% of traffic short-circuits the LLM via the XGBoost velocity gate. Across ~28,000 LLM-touched decisions/month, that's roughly $500/month in model and embedding spend plus about $1,200/month for in-VPC infra (Postgres with pgvector, reranker GPU, PrivateLink endpoints, Langfuse). Total under $2k/month at observed volume.
How long does it take to build?
11 weeks for this engagement: 2 weeks discovery + 412-item eval-set freeze, 2 weeks corpus ingestion + XGBoost velocity model + hybrid retrieval tuning, 2 weeks agent build + forced-JSON contract + policy-as-code, 1 week for the week-7 shadow and kill-point pause (we re-fit the calibration head after the Black Friday shadow surfaced an over-clear pattern), 4 weeks cutover + SAR-track integration.
When should we NOT ship an AI fraud detection agent?
Four cases: the legacy rules engine isn't documented (the agent inherits undocumented edge-case behaviour); the SAR-track team won't operate the weekly disposition-review meeting for the first six months; the BAA scope hasn't been negotiated at week 1 (PII / KYC data scope is decided early or the build doesn't start); the model-risk team isn't in the engagement (an agent shipped without SR 11-7 sign-off is indefensible). We turn down engagements that fail any of these.

Ready to ship

Want AI fraud detection like this
for your fraud-ops floor?

Book a fixed-fee AI fraud detection audit. We'll review the existing fraud workflow, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether AI-powered fraud detection is the right shape for the workload — and whether the regulatory posture is ready to support a build. About one audit in five ends with `the legal posture isn't ready yet, here's the 90-day prep plan.`

Updated May 20, 2026 · By Navin Sharma