Fintech · Mid-market US bank AI fraud detection · agentic · forced JSON · BAA-scoped

Claude Sonnet 4.6 · XGBoost 2.0 · pgvector 0.7 · LangGraph 0.2 · AWS PrivateLink

ai fraud detection case study · ai agent case study · 2026 · anonymized

AI fraud detection that holds up
in a regulator audit.

A US mid-market bank's fraud-ops team needed AI-powered fraud detection that could clear the auto-pass band silently, produce a regulator-defensible case note on every queue entry, and escalate regulatory-severity cases to a senior analyst with a two-eye signature on the dispatch. We built the AI fraud detection agent on Claude Sonnet 4.6 and Haiku 4.5, with XGBoost on the velocity short-circuit, hybrid retrieval over four years of KYC and case-note corpus, and a policy-as-code layer that gates every write tool. Eleven weeks, BAA-scoped over AWS PrivateLink, with a kill point at week 7 that we used. It runs as the bank's AI transaction monitoring layer, with the legacy rules engine on active-standby for the first 60 days.

≥ 0.96

precision at 1% FPR · n=412 frozen eval · ±0.012 CI · (2026-Q1)

8.0 → 0.8 min

case-note review-prep per case · n=204 timed sessions · baseline (2025-Q4) → agent (2026-Q1)

~92%

of routed cases include the right evidence on first read · n=1,840

11 weeks

discovery to shadow-mode go-live · 1 calibration halt at wk 7

shipped

11 weeks · 5 engineers · 1 senior analyst lead · 1 model-risk lead

Summary

What this case study shows

A US mid-market bank shipped a Claude Sonnet 4.6 fraud-disposition agent across card, wire, ACH, and RTP rails with a regulator-defensible audit trail. Across n=412 frozen eval items plus 1,840 production decisions, the agent holds precision at or above 0.96 at a 1 percent false-positive rate (CI ±0.012). Stack: Claude Sonnet 4.6 plus Haiku 4.5 router, pgvector 0.7, XGBoost 2.0 velocity scorer, LangGraph 0.2, AWS PrivateLink, Langfuse. Compliance: SR 11-7, FFIEC IT Examination Handbook, OCC Bulletin 2011-12, BSA/AML. Multi-quarter ongoing engagement.

1.2B / yr

transactions across card · wire · ACH · RTP

18%

false-positive rate on the legacy rules engine

$14 / case

fully-loaded analyst review-prep cost on flagged cases

8 min / case

median analyst review-prep before the agent shipped

the problem

A rules engine
under load.

The client is a US mid-market bank with a 50-seat fraud-ops floor running a hybrid rules + ML overlay last tuned in 2023: too small to keep a fully custom model fresh, too large to outsource the SAR-track decision to a vendor library. The buying decision was for AI fraud detection that could defend every cleared case in a regulator audit, not for a higher accuracy score on its own.

today vs · with the agent

today: Auth-boundary stream → Rules engine (300+ static rules · last tuned 2023) → Analyst queue (≈ 8 min/case manual write-up) → SAR triage
outcome: 18% FPR · analyst burnout · audit prep fragile

with the agent: Auth-boundary stream → XGBoost velocity score (skip-LLM band for score < 0.18) → Claude Sonnet 4.6 · forced JSON (evidence-cited disposition) → Policy + 2-eye + audit log
outcome: Clear · silent · audit row
outcome: Case-note · queue · 0.8 min
outcome: Escalate · 2-eye · SAR-track

The legacy rules engine cleared roughly 92% of auth-boundary traffic silently and flagged the rest into an analyst queue with an 18% false-positive rate. Fully-loaded analyst cost per flagged case (review-prep + second-look + audit-packet write-up) averaged $14, median 8 minutes. Generic AI fraud detection vendors had pitched higher-accuracy scores with opaque rationale and no in-VPC option; all of them were turned down on the same operator objections: no autonomous regulatory dispositions, no PAN egress, no explainability without chunk-cited evidence, no metric not measurable on a senior-analyst-labelled eval set.

legacy rules engine · baseline

false-positive rate: 18%

analyst cost / case: $14

median review-prep: 8 min

audit defensibility: manual

Pre-build cut from the bank's own dashboard, the slice this engagement displaced.

The thing that scares us is not the missed fraud; there are well-trodden processes for that. What scares us is a confident-sounding agent producing a case note that we can't defend in a regulator audit because the chain of evidence isn't reproducible. Show us how every disposition reconstructs from the trace, or we're not signing.

Head of Compliance US mid-market bank · Fintech

the approach · ai-driven fraud detection pipeline

AI fraud detection pipeline — seven stages,
three outcome lanes.

An XGBoost velocity short-circuit on the auto-clear band, hybrid retrieval over four years of KYC and case-note corpus, a forced-JSON Claude Sonnet 4.6 disposition, and a policy-as-code gate before any tool dispatch. Diagram below.

three decisions that shaped the ai fraud detection build

design decision · 01

Skip the LLM on the auto-clear band

we rejected: Run Claude on every transaction
because: 92% of auth-boundary traffic is below the velocity-score threshold. Burning Sonnet tokens on cases the rules engine already cleared is the indefensible line in the cost math; the XGBoost short-circuit is what makes the unit economics work.
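A minimal sketch of the short-circuit, assuming the 0.18 auto-clear threshold quoted elsewhere in this case study. The names `routeByVelocity` and `Lane` are hypothetical, and routing malformed scores to review is an assumption about fail-closed behaviour, not something the build is confirmed to do.

```typescript
// Hypothetical sketch: decide whether a transaction skips the LLM.
// The 0.18 threshold is the auto-clear band quoted in this case study;
// the function and type names are illustrative assumptions.
type Lane = "auto_clear" | "llm_review";

const AUTO_CLEAR_THRESHOLD = 0.18;

function routeByVelocity(vScore: number, threshold: number = AUTO_CLEAR_THRESHOLD): Lane {
  if (!Number.isFinite(vScore) || vScore < 0 || vScore > 1) {
    // Assumed fail-closed rule: a malformed score never lands in the silent lane.
    return "llm_review";
  }
  return vScore < threshold ? "auto_clear" : "llm_review";
}

// Scores borrowed from the synthetic replay rows in this case study.
const lanes = [0.09, 0.71, 0.14, 0.92].map((s) => routeByVelocity(s));
```

Only the rows routed to `llm_review` incur retrieval, rerank, and Sonnet cost, which is the point of the design decision.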

design decision · 02

Forced JSON with a severity enum

we rejected: Free-text disposition + downstream parser
because: The regulator audit needs the disposition packet to be reproducible from the trace. A schema-bounded severity enum (low | med | high | regulatory) is what makes the SAR-track decision deterministic; the model can't smuggle a fifth severity into the output.
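To illustrate why a schema-bounded enum blocks a fifth severity, here is a dependency-free validator sketch. The latency table names Zod for the production schema; this hand-rolled version is a stand-in, and the `Disposition` shape and `parseDisposition` name are assumptions based on the packet fields shown in this case study.

```typescript
// Hypothetical sketch of a schema-bounded disposition parser.
// Field names mirror the synthetic packet in this case study; the
// validator itself is an assumption, not the bank's runtime.
const SEVERITIES = ["low", "med", "high", "regulatory"] as const;
type Severity = (typeof SEVERITIES)[number];

interface Disposition {
  case_id: string;
  severity: Severity;
  confidence: number;
}

function isSeverity(value: unknown): value is Severity {
  return typeof value === "string" && (SEVERITIES as readonly string[]).indexOf(value) !== -1;
}

// Returns a typed disposition, or throws so the caller can fail closed.
function parseDisposition(raw: string): Disposition {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("disposition must be a JSON object");
  }
  const rec = data as Record<string, unknown>;
  const caseId = rec["case_id"];
  const severity = rec["severity"];
  const confidence = rec["confidence"];
  if (typeof caseId !== "string" || caseId.length === 0) {
    throw new Error("case_id missing");
  }
  if (!isSeverity(severity)) {
    throw new Error(`severity outside enum: ${String(severity)}`);
  }
  if (typeof confidence !== "number" || !Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number in [0,1]");
  }
  return { case_id: caseId, severity, confidence };
}
```

A model output carrying `"severity": "critical"` throws here instead of flowing downstream, so the SAR-track branch only ever sees one of the four declared values.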

design decision · 03

Two-eye gate on regulatory severity

we rejected: Auto-route regulatory severity to SAR queue
because: Anything that touches the FinCEN clock starts on a human signature, not a model output. The senior-analyst approval row is checked into the policy file — the runtime refuses to dispatch the escalation tool without it. We accepted a slower escalate path for an audit-defensible one.

why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage to look at. No single end-to-end number that hides which subsystem broke.

Decision model

Labelled severity-correctness + groundedness on the frozen 412-item eval. Forced-JSON schema enforces evidence-chunk citation on every claim — the model cannot reason its way past the validator.

Retrieval

Top-k recall on the frozen eval. RRF + reranker tuned against this number, not end-to-end accuracy.

Reranker

Top-1 precision on the held-out slice. Margin gate catches over-confidence before Sonnet ever sees it.

Velocity model

ROC-AUC + ECE on the auto-clear band.

Case-note generator

Regulator-audit acceptance · 10% weekly sample.

Guardrails

Policy-rejection vs analyst-override rate.

under the hood

The fraud agent,
auth to audit.

Every transaction enters at the top. The XGBoost velocity score skips the LLM on the auto-clear band; anything above the threshold runs hybrid retrieval over four years of KYC and case-note corpus, reranks the evidence, and lets Claude Sonnet 4.6 produce a forced-JSON disposition.

outcome · ~93.4% Clear (silent) auto-pass band · audit row written · no analyst touch

outcome · ~5.1% Case-note · queue structured case file · evidence cited · 0.8 min review

outcome · ~1.5% Escalate · regulatory senior-analyst 2-eye gate · SAR-track on confirm

latency budgets are p50/p95 from a 30-day production window · end-to-end p95 inside the 2.6s decision SLA

BAA-scoped

Anthropic over AWS PrivateLink · no PAN leaves the customer VPC

0

autonomous regulatory escalations · senior-analyst signs every one

7-year

audit retention · WORM-equivalent S3 object lock · per BSA / SAR rules

shadow-first

three weeks in silent shadow against the rules engine before any cutover

deterministic replay · synthetic data

A 0.32-second window
at the auth boundary.

Eight rows from a synthetic replay tape — the same shape the production stream sees at ~38k transactions/sec peak. The agent fans out into three lanes per row: silent clear, queued case-note, or senior-analyst escalation. No real PAN, no real merchant; this is a replay viewer, not a live feed.

time	card	merchant	amount	v-score	decision	reason
14:02:18.041	•••• 4019	Grocery · POS	$42.18	0.09	clear	low-risk merchant · habitual
14:02:18.092	•••• 7124	Online retail	$1,840.00	0.71	case-note	amount p99 · novel beneficiary
14:02:18.137	•••• 3055	Fuel · CRIND	$58.40	0.14	clear	in-pattern · velocity normal
14:02:18.184	•••• 8801	Wire · cross-border	$9,250.00	0.92	escalate	structured-pattern hit · senior-analyst 2-eye
14:02:18.226	•••• 2236	Streaming · sub	$14.99	0.04	clear	recurring · pre-allow
14:02:18.271	•••• 6498	Electronics	$612.00	0.48	case-note	ip-geo drift · low-confidence
14:02:18.318	•••• 5712	Restaurant	$78.25	0.11	clear	habitual · merchant in cohort
14:02:18.366	•••• 9043	Crypto on-ramp	$4,500.00	0.86	escalate	first-seen on-ramp · regulatory routing

replay clock advances 42–51 ms per row · in production roughly 7 of every 8 rows fall in the auto-allow band (v-score < 0.18); the replay over-samples flagged rows for legibility

the stack · ai fraud detection software

AI fraud detection stack — named tools,
named versions.

Everything in the build is a thing the model-risk committee can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, policies, and feature definitions are all checked into the bank's repo — not ours.

Claude Sonnet 4.6 Anthropic API · forced JSON

Claude Haiku 4.5

XGBoost 2.0

pgvector 0.7

BM25 (Postgres tsvector)

BAAI bge-reranker-large

LangGraph 0.2.x

Langfuse

AWS PrivateLink

how it actually runs

Production shape,
under the hood.

Numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 412-item set the CI gates on.

latency budget

Per-stage P50 / P95 (ms)

stage	p50	p95	tooling
Kafka consumer + parse	8	18	Confluent · ISO 8583 superset · per-tenant partition key
XGBoost velocity score	4	9	XGBoost 2.0 · 142 features · auto-clear band short-circuit
Hybrid retrieval	38	92	pgvector cosine top-40 ∥ tsvector BM25 top-40 → RRF k=60
Cross-encoder rerank	62	138	BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-12
Claude Sonnet 4.6 decision	1,420	2,080	Anthropic API over AWS PrivateLink · ~3,200 in / ~420 out tokens
Claude Haiku 4.5 case-note	780	1,180	narrates from Sonnet's evidence ids · ~1,100 in / ~340 out
Policy + 2-eye + audit log	11	22	TypeScript runtime · Zod schema · WORM-equivalent audit row
Total (LLM-routed path)	2,323	3,537	agent boundary · ~8% of traffic; auto-clear path < 50ms total

p50/p95 from a 30-day rolling window over n ≈ 28,400 LLM-routed decisions / mo (~92% of traffic short-circuits before the LLM call). SLO is p95 ≤ 3,500 ms on the LLM-routed path. Current burn ≈ 101%: we're in active tuning on the reranker timeout to bring the tail in, and the table above reports the number as measured rather than one we haven't shipped yet.

triage/tools/escalate_case.policy.ts typescript

// triage/tools/escalate_case.policy.ts
//
// Every write tool the agent can reach for has a policy file.
// The runtime imports these at startup and refuses to dispatch
// any tool call that doesn't pass. Regulatory severity needs a
// senior-analyst signature before the SAR-track is touched.

import { Policy } from "@gw/agent-runtime";

export const escalate_case: Policy = {
  description: "Send a flagged transaction to the human-review queue.",
  inputs: {
    case_id:     "uuid, exists in cases table, has not been escalated",
    severity:    "enum: low | med | high | regulatory",
    evidence:    "array, min 1, items have {claim, evidence_id, source}",
    confidence:  "number in [0,1]; required, no default",
    reasoning:   "string, 80–600 chars, grounded in evidence",
  },
  preconditions: [
    "agent.confidence_calibrated === true",
    "transaction.amount > 0",
    "no_pending_escalation_for(case_id)",
    "every(evidence, e => retrieval.contains(e.evidence_id))",
  ],

  rate_limits: { perAgent: "30/min", perCase: "1" },

  audit: {
    redact:    ["pan", "ssn", "iban", "routing"],
    retain:    "7y",
    store:     "s3:bsa-audit-log/worm",  // WORM-equivalent object lock
    log_shape: ["case_id", "severity", "evidence", "model_version",
                "retrieval_chunks", "policy_verdict", "approver"],
  },

  // Two-eye rule. Regulatory severity needs a senior-analyst sign-off
  // before the runtime dispatches the SAR-track integration.
  approval: {
    required: ({ severity }) => severity === "regulatory",
    approver: "role:senior-analyst",
    deadline_mins: 30, // ages back to the queue with an "aged out" tag
  },
};

The policy file for the regulatory-escalation tool inside the AI fraud detection runtime. The agent imports it at startup and refuses to dispatch a tool call that doesn't pass. Two-eye rule is enforced in code, not a config flag, and the same pattern ships on every write tool.
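As a rough illustration of how a runtime could evaluate the `approval` block in the policy file above, the sketch below checks the two-eye rule before dispatch. The `EscalationRequest` and `Approval` shapes and the `canDispatch` name are hypothetical; only the rule itself (regulatory severity needs a senior-analyst sign-off within 30 minutes) comes from the policy file.

```typescript
// Hypothetical sketch of two-eye enforcement; not the @gw/agent-runtime API.
type Severity = "low" | "med" | "high" | "regulatory";

interface Approval {
  approverRole: string;
  signedAtMs: number; // epoch millis when the analyst signed
}

interface EscalationRequest {
  caseId: string;
  severity: Severity;
  requestedAtMs: number; // epoch millis when the agent requested dispatch
  approval?: Approval;
}

const DEADLINE_MS = 30 * 60 * 1000; // mirrors deadline_mins: 30

function canDispatch(req: EscalationRequest): { ok: boolean; reason: string } {
  if (req.severity !== "regulatory") {
    return { ok: true, reason: "approval not required" };
  }
  if (!req.approval) {
    return { ok: false, reason: "missing senior-analyst approval" };
  }
  if (req.approval.approverRole !== "role:senior-analyst") {
    return { ok: false, reason: "approver lacks required role" };
  }
  if (req.approval.signedAtMs - req.requestedAtMs > DEADLINE_MS) {
    // Late signatures send the case back to the queue with an "aged out" tag.
    return { ok: false, reason: "aged out" };
  }
  return { ok: true, reason: "approved" };
}
```

Because the check runs in code before the tool call, a regulatory-severity disposition with no approval row has no path to the SAR-track integration.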

ai fraud detection unit economics

Per-decision and monthly cost math

line item	$ / decision	$ / month (≈ 28k LLM-routed decisions)	note
Claude Sonnet 4.6 · input	$0.0096	$269	3,200 tokens × $3.00 / 1M
Claude Sonnet 4.6 · output	$0.0063	$176	420 tokens × $15.00 / 1M
Claude Haiku 4.5 · case-note	$0.0022	$63	1,100 in × $0.80 / 1M + 340 out × $4.00 / 1M
voyage-3-large embeddings	$0.0006	$17	≈ 5,000 tokens × $0.12 / 1M
pgvector + RDS db.r6i.xlarge	n/a	$612	BAA-scoped Postgres · pgvector + tsvector
g5.xlarge reranker (24/7)	n/a	$378	BAAI bge-reranker-large self-host
AWS PrivateLink + endpoints	n/a	$96	Anthropic in-VPC inference
Langfuse self-hosted (t3.large)	n/a	$104	trace store · 90d hot / 7yr cold
All-in monthly	≈ $0.061	≈ $1,715	vs. ≈ $14 × 6k cases/mo = $84k legacy review-prep

Token costs use Anthropic's public Sonnet 4.6 + Haiku 4.5 pricing as of May 2026: $3 / 1M input, $15 / 1M output on Sonnet; $0.80 / 1M input, $4 / 1M output on Haiku. Infra costs are AWS us-east-2 list price; the bank paid less under an EDP. The legacy comparison line is the bank's own per-case review-prep cost × the routed-cases volume. The agent doesn't replace analyst time at the decision boundary; it compresses the review-prep half of the case workload.
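The Sonnet rows of the cost table can be reproduced with a few lines of arithmetic. This is a worked sketch using the prices and token counts quoted above; the `tokenCost` helper and `ModelPrice` type are hypothetical names.

```typescript
// Worked token-cost arithmetic. Prices and token counts are the ones
// quoted in this case study; the helper name is hypothetical.
interface ModelPrice {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

function tokenCost(price: ModelPrice, inputTokens: number, outputTokens: number) {
  const input = (inputTokens * price.inputPerMillion) / 1_000_000;
  const output = (outputTokens * price.outputPerMillion) / 1_000_000;
  return { input, output, total: input + output };
}

const sonnet: ModelPrice = { inputPerMillion: 3.0, outputPerMillion: 15.0 };

// ~3,200 input and ~420 output tokens per Sonnet decision.
const perDecision = tokenCost(sonnet, 3200, 420);

// ~28k LLM-routed decisions per month.
const monthlySonnet = perDecision.total * 28_000;
```

Here `perDecision.input` comes to about $0.0096 and `perDecision.output` to about $0.0063, and `monthlySonnet` to roughly $445, matching the sum of the two Sonnet rows.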

eval composition

What's in the frozen 412-item set

category	items	what it checks	ci-gate threshold
Severity-decision golds	100	labelled disposition + severity band on real (de-id) past cases	≥ 0.95 precision @ 1% FPR
Evidence groundedness	120	every rationale claim points to a retrieved chunk id that supports it	≥ 0.93 groundedness
Retrieval recall	80	correct case + policy chunks in top-5 after RRF + rerank	≥ 0.90 recall@5
Refusal / adversarial	60	structured-pattern hits, jailbreak attempts, OOD merchant categories	100% refusal on must-refuse
Calibration golds	52	confidence-vs-correctness on held-out cases · ECE check	ECE ≤ 0.04

Eval set is frozen: items are only added, never edited. The senior analyst lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. The Black Friday holiday-window slice (added at week 7) became a 5th fold and is now permanent.
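The release rule described here can be sketched as a small gate function. Everything named below (`evalGate`, `Scores`, the category keys) is hypothetical, and the sketch assumes higher-is-better metrics; a lower-is-better metric such as ECE would need its comparison inverted.

```typescript
// Hypothetical CI-gate sketch: fail when any eval category drops more than
// one point (0.01) against the prior cut, unless a signed override exists.
type Scores = Record<string, number>;

interface GateResult {
  pass: boolean;
  regressions: string[];
}

function evalGate(prior: Scores, current: Scores, signedOverride = false, maxDrop = 0.01): GateResult {
  const regressions: string[] = [];
  for (const category of Object.keys(prior)) {
    // A category missing from the current run counts as a regression.
    if (!(category in current)) {
      regressions.push(category);
      continue;
    }
    // Small epsilon so a drop of exactly one point does not trip on float error.
    if (prior[category] - current[category] > maxDrop + 1e-12) {
      regressions.push(category);
    }
  }
  return { pass: regressions.length === 0 || signedOverride, regressions };
}
```

The override flag changes `pass` but leaves `regressions` populated, so a signed override still leaves a record of what regressed.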

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. We run four controls that keep our calibration honest after cutover, and our on-call rotation owns each one.

Override-review meeting

Every agent-vs-analyst diff opened. Systematic drift (>3 same pattern/wk) becomes a JIRA ticket against the eval set and a candidate fine-tune slice.

Trace retention

Langfuse self-hosted in the customer VPC. Matches BSA's 7-year record-retention requirement and the bank's internal SAR documentation policy.

On-call rotation

Two engineers per week. 99.5% pipeline-availability SLO + p95 ≤ 3.5s decision SLO on the LLM-routed path.

Model-risk audit sample

50 rows monthly: velocity score, retrieval candidates, reranker scores, raw model output, parsed JSON, policy verdict, senior-analyst approval.
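The override-review rule (more than three overrides sharing a pattern in one week) is simple enough to sketch. The `Override` shape and the pattern tags are assumptions for illustration; the case study does not say how patterns are tagged.

```typescript
// Hypothetical sketch: flag override patterns that recur more than three
// times in one review week. Only the ">3 same pattern per week" rule is
// taken from the text; the data shape is assumed.
interface Override {
  caseId: string;
  pattern: string; // analyst-assigned tag for why the agent was overridden
}

function systematicDrift(weekOverrides: Override[], limit: number = 3): string[] {
  const counts = new Map<string, number>();
  for (const o of weekOverrides) {
    counts.set(o.pattern, (counts.get(o.pattern) ?? 0) + 1);
  }
  const flagged: string[] = [];
  counts.forEach((count, pattern) => {
    if (count > limit) flagged.push(pattern);
  });
  // Sorted so the weekly report is stable across runs.
  return flagged.sort();
}
```

Each returned pattern would become a ticket against the eval set, per the cadence described above.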

Precision vs FPR,
where you stand the line.

Where the team stands the threshold is a policy choice, not a model property: precision, recall, and per-month false-positive volume all move together as the threshold shifts. We anchor production at 1% FPR, the ops team's documented ceiling for analyst review-prep load.

[Chart: precision and recall vs. FPR threshold]

curve fitted from the frozen 412-item eval set · production op-point at 1% FPR
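One way to read "precision at 1% FPR" is: find the lowest score threshold whose false-positive rate stays at or under the ceiling, then measure precision and recall there. The sketch below does that on labelled scores. The names and the flag-if-score-at-or-above-threshold convention are assumptions, not the team's evaluation harness.

```typescript
// Sketch: operating point at a false-positive-rate ceiling.
interface Scored {
  score: number;    // model fraud score in [0,1]
  isFraud: boolean; // analyst label
}

interface OpPoint {
  threshold: number;
  precision: number;
  recall: number;
  fpr: number;
}

function opPointAtFpr(items: Scored[], maxFpr: number): OpPoint | null {
  const positives = items.filter((i) => i.isFraud).length;
  const negatives = items.length - positives;
  // Candidate thresholds in ascending order; FPR only falls as they rise,
  // so the first one under the ceiling keeps the most recall.
  const thresholds = Array.from(new Set(items.map((i) => i.score))).sort((a, b) => a - b);
  for (const t of thresholds) {
    let tp = 0;
    let fp = 0;
    for (const i of items) {
      if (i.score >= t) {
        if (i.isFraud) tp++;
        else fp++;
      }
    }
    const fpr = negatives === 0 ? 0 : fp / negatives;
    if (fpr <= maxFpr && tp + fp > 0) {
      const recall = positives === 0 ? 0 : tp / positives;
      return { threshold: t, precision: tp / (tp + fp), recall, fpr };
    }
  }
  return null; // no threshold satisfies the ceiling
}
```

Calling `opPointAtFpr(evalItems, 0.01)` on a labelled set would give the production-style op-point; raising the ceiling trades precision for recall, which is why the choice is a policy decision.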

ai fraud detection build · 11 weeks · honest version

The timeline,
including the week we halted.

Five stages, milestone-billed. The week-7 Black Friday shadow surfaced a calibration drift on a holiday-shopping velocity pattern the eval set hadn't seen. We halted cutover, ingested the holiday-window slice as a new eval fold, re-fit the calibration head, and only then promoted the AI fraud detection agent to primary screen. The honest version of `11 weeks` includes the week we ran the sweep.

Weeks 1–2

Discovery + frozen eval set

Two weeks shadowing the fraud-ops team. The senior analyst lead labelled 412 frozen eval items drawn from 18 months of (de-identified) past cases — each carrying a labelled correct disposition + the rationale + the evidence chunks that should ground the call. We wrote the harness; the ops team wrote the answers. Scoping decision: the deliverable is a structured-output agent, not a chatbot, and the eval gate is non-negotiable.

412-item frozen eval + severity-band rubric · scope memo signed
Weeks 3–4

Corpus + velocity score + retrieval

Ingested four years of KYC artifacts and historical case notes into pgvector 0.7 inside the customer VPC. BM25 sidecar over the same chunks. XGBoost 2.0 velocity model trained against the labelled fraud history with 142 features; calibrated with isotonic regression on a held-out slice. Reciprocal-rank fusion tuned on the eval slice; cross-encoder rerank wired in when top-1 recall plateaued.

Hybrid retrieval at 0.93 top-5 recall · velocity ECE 0.041
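For readers unfamiliar with reciprocal-rank fusion, a minimal sketch follows. It uses k = 60, the value quoted in the latency table; the function name and the chunk ids are illustrative, and the tie-break by id is an assumption added for determinism.

```typescript
// Reciprocal-rank fusion: each list contributes 1 / (k + rank) per item,
// with 1-based ranks, and contributions are summed across lists.
function reciprocalRankFusion(rankings: string[][], k: number = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first; ties broken by id (assumed rule).
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
}

// Illustrative chunk ids only: one list from vector search, one from BM25.
const vectorHits = ["chunk_a", "chunk_b", "chunk_c"];
const bm25Hits = ["chunk_b", "chunk_d", "chunk_a"];
const fused = reciprocalRankFusion([vectorHits, bm25Hits]);
```

In this toy example `chunk_b` ranks first because it places second in one list and first in the other, ahead of `chunk_a` (first and third). The fused list is what a cross-encoder reranker would then reorder.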
Weeks 5–6

Agent skeleton + policy-as-code

LangGraph 0.2.x agent with three read-only tools (case lookup, KYC pull, structured-pattern check) and two write tools (case-note write, escalation dispatch). Every tool carries a policy file in `triage/tools/`. Forced-JSON disposition via a JSON schema enforced at the Anthropic API (for example, forced tool use with `tool_choice`). Two-eye gate baked into the runtime; senior-analyst approval is what unblocks the regulatory-severity branch.

End-to-end pipeline behind a feature flag · BAA + PrivateLink wired
Week 7

Black Friday shadow — calibration drift caught

Three weeks of silent shadow against the live rules engine. Day 9 was Black Friday, and a holiday-shopping pattern surfaced that nobody had labelled in the eval set — a structurally novel velocity pattern from gift-card top-ups that the model was over-confidently clearing as legit. We halted cutover, ingested the holiday-window slice as a fresh-data fold, re-fit the calibration head, and re-ran the full eval. The honest version of `shipped on time` includes the week we sat on our hands and ran the calibration sweep.

ECE recalibrated from 0.067 → 0.028 on the Black Friday-augmented eval slice

Walk-away point
Weeks 8–11

Cutover + SAR-track integration

Promoted to primary screen with the rules engine in active-standby. Compliance reviewed the audit-log packet end-to-end; FinCEN SAR-track integration tested against the bank's e-filing path. Four ops-team training sessions on the case-note acceptance flow + the two-eye approval surface. PagerDuty wired to the regulatory-severity lane. Old rules engine stays in active-standby for 60 days post-cutover; every diff between agent + rules logged for the model-risk lead's weekly review.

Production cutover · SAR-track audited · model-risk committee sign-off

ai fraud detection eval results · 412 frozen items

How we know
it works.

The AI fraud detection eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 412. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric	baseline (wk 2)	v1 (wk 5)	v2 (wk 6)	current (live)	target
Precision @ 1% FPR	0.812	0.928	0.952	0.962	≥ 0.95
Recall on labelled fraud	0.681	0.812	0.864	0.881	≥ 0.85
Calibration (ECE)	0.094	0.067	0.039	0.028	≤ 0.04
Case-note groundedness	n/a	0.88	0.94	0.96	≥ 0.93
Refusal rate	n/a	12.4%	10.1%	8.6%	8–12%
P95 time-to-disposition	n/a	3.4s	2.9s	2.6s	≤ 3.0s

Sample size for the ≥ 0.96 precision figure is n=412 frozen eval items + the production confirmation slice n ≈ 1,840 cases reviewed by the senior analyst over the first 30 days post-cutover. Confidence interval is ±0.012 on the precision at the 1% FPR op-point. ECE is expected calibration error on the labelled set. P95 is end-to-end on the LLM-routed path (auto-clear path is under 50ms total). Refusal rate is the share of inputs where the agent legally cannot decide and routes straight to a senior analyst — by design, not by failure. Note: refusal rate v1 → v2 → current is not monotone-improving by design; we tuned the refusal-threshold up in v2 after the senior analyst lead flagged that v1 was clearing borderline cases that should have routed for review.
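Since ECE gates several of the numbers above, a short sketch of how expected calibration error is commonly computed may help. It uses equal-width confidence bins; the 10-bin default is an assumption, as the case study does not state how its ECE figures were binned.

```typescript
// Expected calibration error: weighted average, over confidence bins, of
// |accuracy in bin - mean confidence in bin|.
interface Prediction {
  confidence: number; // model confidence in [0,1]
  correct: boolean;   // whether the disposition matched the label
}

function expectedCalibrationError(preds: Prediction[], bins: number = 10): number {
  if (preds.length === 0) return 0;
  const sumConf = new Array<number>(bins).fill(0);
  const sumCorrect = new Array<number>(bins).fill(0);
  const counts = new Array<number>(bins).fill(0);
  for (const p of preds) {
    // Clamp so confidence === 1 lands in the last bin.
    const b = Math.min(bins - 1, Math.floor(p.confidence * bins));
    sumConf[b] += p.confidence;
    sumCorrect[b] += p.correct ? 1 : 0;
    counts[b] += 1;
  }
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    if (counts[b] === 0) continue;
    const gap = Math.abs(sumCorrect[b] / counts[b] - sumConf[b] / counts[b]);
    ece += (counts[b] / preds.length) * gap;
  }
  return ece;
}
```

A model that says 0.9 and is right half the time in that bin contributes a 0.4 gap, which is the kind of over-confidence the week-7 recalibration was chasing.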

artefact diff · synthetic case

What the ops team reads
when a case routes for review.

The 8.0 → 0.8 minute delta isn't a tooling story. It's an artefact story. On the left: what an analyst produced manually — narrative-heavy, hard to skim, citations buried in the prose. On the right: the agent's structured packet — same evidence, surfaced as fields the regulator-audit reviewer reads in seconds.

before · manual write-up ≈ 8 min / case

Case # FA-2026-04-11-0917
Reviewer: J. Reyes · 11 Apr 2026 14:22 UTC

Customer (PAN ending 8801) initiated a wire transfer of $9,250.00 USD to a beneficiary account first seen on the platform on 09 Apr 2026. Reviewing prior 90-day activity for this customer shows wire activity concentrated to two prior beneficiaries (relatives, KYC-verified, historical pattern stable). The new beneficiary is an LLC registered in a jurisdiction with elevated SAR-correlation per our internal scoring (referenced in policy doc PL-WIRE-2024 §4.2).

Velocity score on this transaction was 0.92 (model output, see ML-FRAUD-3.4 dashboard). Cross-checking against our case-history corpus, three structurally similar cases have been reviewed in the past 18 months; two were confirmed-fraud, one cleared after additional context. The originating IP geo (Newark, NJ) is consistent with the cardholder's historical pattern.

Recommendation: escalate for regulatory review. Senior analyst sign-off required per the two-eye policy on structured-pattern hits. Note: cardholder not yet contacted — pending compliance lead approval for outbound.

after · agent-generated ≈ 0.8 min / case

{
  "case_id":   "FA-2026-04-11-0917",
  "severity":  "regulatory",
  "decision":  "escalate",
  "confidence": 0.91,

  "evidence": [
    {
      "claim": "Beneficiary first-seen 09 Apr 2026; not in cardholder's 90d graph.",
      "evidence_id": "chunk_a4f0c12b9e44",
      "source": "ledger.beneficiary_first_seen"
    },
    {
      "claim": "LLC jurisdiction matches PL-WIRE-2024 §4.2 elevated-SAR list.",
      "evidence_id": "chunk_71d33e0a4c8b",
      "source": "policy.PL-WIRE-2024"
    },
    {
      "claim": "Velocity score 0.92; 3 structurally similar cases in the corpus.",
      "evidence_id": "chunk_e8b290745f01",
      "source": "ml.velocity + case-history"
    }
  ],

  "two_eye_required": true,
  "approver_role":    "role:senior-analyst",
  "sar_track":        true,
  "audit_retain_yrs": 7
}

both artefacts are synthetic · case-id, beneficiary, and PAN-last-4 are illustrative · the agent packet is what the regulator-audit reviewer reads, not the prose
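The packet's `evidence_id` fields are what make it checkable by machine. A minimal fail-closed check is sketched below, assuming the runtime has the set of chunk ids that retrieval actually returned. The function names are hypothetical, and this only verifies that each cited chunk exists in the retrieved set, not that the chunk semantically supports the claim.

```typescript
// Hypothetical sketch: every evidence_id must be a chunk retrieval returned.
// Field names mirror the synthetic packet above.
interface EvidenceItem {
  claim: string;
  evidence_id: string;
  source: string;
}

function ungroundedClaims(evidence: EvidenceItem[], retrievedIds: ReadonlySet<string>): EvidenceItem[] {
  return evidence.filter((e) => !retrievedIds.has(e.evidence_id));
}

function packetIsGrounded(evidence: EvidenceItem[], retrievedIds: ReadonlySet<string>): boolean {
  // An empty evidence array also fails: the policy requires at least one item.
  return evidence.length > 0 && ungroundedClaims(evidence, retrievedIds).length === 0;
}
```

This mirrors the `every(evidence, e => retrieval.contains(e.evidence_id))` precondition in the policy file: a packet citing a chunk id that retrieval never produced is rejected before any write tool runs.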

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

An AI fraud detection agent built on these patterns will produce regulator-audit-defensible failures in any of the following situations. We turn down the engagement before a pilot is scoped.

Autonomous SAR filing on the scope sheet

The 30-day FinCEN clock starts on human detection. Any pitch that includes "the agent files the SAR" is the pitch we walk away from. Drafting the packet is fine; filing it is a human signature obligation. We've turned this down twice.

No weekly override review

If the senior analyst lead is not going to review agent-vs-analyst diffs weekly for the first six months, the calibration head drifts and nobody catches it. The eval set is necessary, not sufficient. Week-7 drift was caught by the shadow against a live analyst floor, not by the eval.

BAA + PrivateLink gaps in the deployment plan

No BAA from the model vendor, no in-VPC inference, no WORM-equivalent audit retention. The regulatory posture is not a post-launch fix. The bank's compliance team had the legal posture committed at week 1, or the pilot did not get signed.

Buyer wants a fraud-score API, not a disposition packet

If procurement is "we want a fraud-score API that returns 0 to 1 per transaction," the buyer is shopping for a model, not an agent. Our shape is structured disposition + evidence + audit packet. The score-API shape gets a better outcome from a feature store + an off-the-shelf score vendor.

frequently asked — ai fraud detection · regulator audit

What buyers ask first.
Real answers, no hedging.

What is AI fraud detection?

AI fraud detection is a layered system where an ML velocity short-circuit handles the auto-clear band silently and a forced-JSON LLM agent dispositions the remaining transactions with chunk-cited evidence from KYC and case-note corpus. The schema is the contract: every disposition cites a policy and an evidence chunk, or the audit packet fails closed.

How is AI fraud detection different from a rules engine?

A rules engine fires deterministic flags from hand-tuned thresholds. AI fraud detection adds two layers: an ML model (XGBoost in this case study) that learns velocity patterns the rules can't enumerate, and an LLM-as-judge over case-note + KYC evidence that produces a regulator-defensible disposition note with citations. Rules stay in the stack as a fallback and a stat backstop on the high-severity SAR-track lane.

Will an AI fraud detection agent hold up in a regulator audit?

Yes, if it's built for audit posture from week 1. This case study's audit packet reconstructs every disposition from the Langfuse trace: the model version, the retrieved candidates with chunk_ids, the reranker scores, the LLM's raw output, the parsed JSON, the policy-as-code verdict, and the analyst override (if any). Every claim cites a policy and an evidence chunk; nothing reasons in the model's head without a trace.

How does AI fraud detection align with FFIEC / BSA / SR 11-7 guidance?

Scoped at week 1: model governance under SR 11-7, transaction-monitoring expectations under FFIEC BSA/AML, and adverse-action notice handling under ECOA where applicable. The bank's model-risk lead reviewed the policy-as-code layer and the eval methodology before any production cutover. SAR-track dispositions never run autonomously; senior analyst two-eye signature is required on every regulatory-severity dispatch.
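The human gate on regulatory-severity dispatch can be expressed as a small policy check. The sketch below is illustrative only: the `may_dispatch` name and the `"regulatory"` severity value are assumptions, and it reads "two-eye signature" as one senior analyst sign-off. If the bank's policy requires two distinct signers, the check would need to count them instead.

```python
from typing import Optional

REGULATORY = "regulatory"  # hypothetical severity value for the SAR-track lane


def may_dispatch(severity: str, senior_analyst_signature: Optional[str] = None) -> bool:
    """Allow a dispatch unless it is regulatory-severity and unsigned."""
    if severity != REGULATORY:
        # Lower-severity lanes are not covered by this gate in the sketch.
        return True
    # Regulatory-severity cases never dispatch without a non-empty signature.
    return bool(senior_analyst_signature and senior_analyst_signature.strip())
```

Placing the check in front of the write tool, rather than inside the prompt, keeps the gate enforceable regardless of what the model outputs.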

How accurate is the AI fraud detection agent?

Precision ≥ 0.96 at 1% FPR on the frozen 412-item eval set labelled by the senior analyst lead. False-positive rate dropped from 18% (legacy rules engine) to 4.7% on the live shadow slice. Median review-prep time dropped from 8 minutes to 1.4 minutes on flagged cases. Case-note groundedness is 0.94, measured as the share of claims that cite an evidence chunk supporting them.
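For readers who want to reproduce a "precision at a fixed FPR" number on their own labelled set, one common way to compute it is sketched below. This is not the engagement's eval harness; the function name and the threshold-sweep approach are assumptions. It picks the most permissive score threshold whose false-positive rate stays at or under the cap, then reports precision and recall at that threshold.

```python
def precision_at_fpr(scores, labels, max_fpr=0.01):
    """Return (threshold, precision, recall) at the lowest threshold with FPR <= max_fpr.

    labels are 1 for fraud and 0 for legitimate. Returns None when no threshold qualifies.
    """
    negatives = sum(1 for label in labels if label == 0)
    positives = len(labels) - negatives
    if negatives == 0 or positives == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    best = None
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score so the threshold is well defined.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        if false_pos / negatives > max_fpr:
            break  # lowering the threshold further only adds false positives
        best = (threshold, true_pos / (true_pos + false_pos), true_pos / positives)
    return best
```

On a frozen eval set the labels stay fixed between runs, so a change in the returned precision reflects a change in the model or threshold rather than in the data.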

What does AI fraud detection cost to run?

About $0.062 per transaction that actually hits the LLM (Sonnet 4.6 + Haiku 4.5 routing split, median ~3,800 input + ~480 output tokens). 92% of traffic short-circuits the LLM via the XGBoost velocity gate. Across ~28,000 LLM-touched decisions/month, that's roughly $1,740/month in model spend plus $2,100/month for in-VPC infra (pgvector + reranker GPU + Langfuse). Total under $4k/month at observed volume.
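The run-cost figures above can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the answer; the implied total traffic and the blended per-transaction cost are derived values, and they assume the 92% short-circuit share and the ~28,000 LLM-touched decisions describe the same monthly traffic.

```python
def monthly_run_cost(
    cost_per_llm_decision=0.062,    # USD, from the text
    llm_decisions_per_month=28_000,  # from the text
    infra_per_month=2_100.0,         # USD, from the text
    short_circuit_share=0.92,        # share of traffic that skips the LLM, from the text
):
    """Return model spend, total spend, implied traffic, and blended cost per transaction."""
    model_spend = cost_per_llm_decision * llm_decisions_per_month
    total = model_spend + infra_per_month
    # Derived, not stated in the source: LLM-touched decisions are the remaining 8% of traffic.
    implied_traffic = llm_decisions_per_month / (1.0 - short_circuit_share)
    return {
        "model_spend": round(model_spend, 2),
        "total": round(total, 2),
        "implied_traffic": round(implied_traffic),
        "blended_cost_per_txn": round(total / implied_traffic, 4),
    }


if __name__ == "__main__":
    for name, value in monthly_run_cost().items():
        print(f"{name}: {value}")
```

The model spend comes out at $1,736, which the answer rounds to roughly $1,740, and the total lands under the $4k/month stated above.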

How long does it take to build?

11 weeks for this engagement: 2 weeks discovery + 412-item eval-set freeze, 1 week XGBoost short-circuit + Kafka integration, 2 weeks retrieval build + hybrid fusion tuning, 2 weeks agent build + forced-JSON contract + policy-as-code, 1 week kill-point pause (we re-scoped the velocity threshold after the week-7 shadow surfaced an over-clear pattern), 2 weeks shadow cutover, 1 week launch + tuning.

When should we NOT ship an AI fraud detection agent?

Four cases: the legacy rules engine isn't documented (the agent inherits undocumented edge-case behaviour); the SAR-track team won't operate the weekly disposition-review meeting for the first six months; the BAA scope hasn't been negotiated at week 1 (PII / KYC data scope is decided early or the build doesn't start); the model-risk team isn't in the engagement (an agent shipped without SR 11-7 sign-off is indefensible). We turn down engagements that fail any of these.

keep reading

Where this case study
points back to.

Each link below covers a pillar that fed into this AI fraud detection build, or that a similar build on your stack would draw from.

01 Service

AI Agent Development

The agent pillar — ReAct, plan-and-execute, hierarchical multi-agent recipes. Same eval-first loop used on this AI fraud detection build.

02 Industry

Fintech AI

The fintech pillar — KYC ladders, AML/BSA posture, ECOA Reg B, model-risk inventory aligned to SR 11-7. The regulatory context AI fraud detection in banking lives inside.

03 Service

Claude Development

Sonnet 4.6 + Haiku 4.5 integration patterns this case study uses end-to-end. Forced JSON, response_format schema, BAA + AWS PrivateLink deployment.

04 Service

AI Governance

Policy-as-code, audit-log scaffolding, model-risk inventory templates. The plumbing that made this AI fraud prevention pilot pass the model-risk committee.

05 Case study

All AI Case Studies

Six AI case studies — AI fraud detection, AI triage, RAG, voice bot, AI legal assistant, mobile AI assistant. Same operator detail across every page.

06 Service

AI Consulting

Fixed-fee AI fraud detection audit. We map the workflow, scope the eval, and tell you whether it's case-study-shaped.

07 Service

AI Development Services

How a fraud-agent build slots into a broader AI software development company engagement — ML risk model + LLM gray-band reviewer + monitoring.

08 Service

AI Automation Agency

Payment-fraud agent shipped on the auth boundary. The kill-point and human-gate pattern this pillar describes, in production.

Ready to ship

Want AI fraud detection like this
for your fraud-ops floor?

Book a fixed-fee AI fraud detection audit. We'll review the existing fraud workflow, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether AI-powered fraud detection is the right shape for the workload, and whether the regulatory posture is ready to support a build. About one audit in five ends with "the legal posture isn't ready yet, here's the 90-day prep plan."

Read the fintech pillar

30 min, async or live · Eval-first scoping · Walk-away point in the pilot

Updated May 20, 2026 · By Navin Sharma

