The thing that scares us is not the missed fraud; there are well-trodden processes for that. What scares us is a confident-sounding agent producing a case note that we can't defend in a regulator audit because the chain of evidence isn't reproducible. Show us how every disposition reconstructs from the trace, or we're not signing.
AI fraud detection that holds up
in a regulator audit.
A US mid-market bank's fraud-ops team needed AI-powered fraud detection that could clear the auto-pass band silently, produce a regulator-defensible case note on every queue entry, and escalate regulatory-severity cases to a senior analyst with a two-eye signature on the dispatch. We built the AI fraud detection agent on Claude Sonnet 4.6 and Haiku 4.5, with XGBoost on the velocity short-circuit, hybrid retrieval over four years of KYC and case-note corpus, and a policy-as-code layer that gates every write tool. Eleven weeks, BAA-scoped over AWS PrivateLink, with a kill point at week 7 that we used. It runs as the bank's AI transaction monitoring layer, with the legacy rules engine on active-standby for the first 60 days.
What this case study shows
A US mid-market bank shipped a Claude Sonnet 4.6 fraud-disposition agent across card, wire, ACH, and RTP rails with a regulator-defensible audit trail. Across n=412 frozen eval items plus 1,840 production decisions, the agent holds precision at or above 0.96 at a 1 percent false-positive rate (confidence interval ±0.012). Stack: Claude Sonnet 4.6 plus Haiku 4.5 router, pgvector 0.7, XGBoost 2.0 velocity scorer, LangGraph 0.2, AWS PrivateLink, Langfuse. Compliance: SR 11-7, FFIEC IT Examination Handbook, OCC Bulletin 2011-12, BSA/AML. Multi-quarter ongoing engagement.
A rules engine
under load.
A US mid-market bank's 50-seat fraud-ops floor running a hybrid rules + ML overlay last tuned in 2023. Too small to keep a fully-custom model fresh, too large to outsource the SAR-track decision to a vendor library. The buying decision was for AI fraud detection that could defend every cleared case in a regulator audit, not a higher-accuracy score on its own.
The legacy rules engine cleared roughly 92% of auth-boundary traffic silently and flagged the rest into an analyst queue with an 18% false-positive rate. Fully-loaded analyst cost per flagged case (review-prep + second-look + audit-packet write-up) averaged $14, median 8 minutes. Generic AI fraud detection vendors had pitched higher-accuracy scores with opaque rationale and no in-VPC option; all of them were turned down on the same operator objections: no autonomous regulatory dispositions, no PAN egress, no explainability without chunk-cited evidence, no metric not measurable on a senior-analyst-labelled eval set.
Pre-build cut from the bank's own dashboard, the slice this engagement displaced.
AI fraud detection pipeline — seven stages,
three outcome lanes.
An XGBoost velocity short-circuit on the auto-clear band, hybrid retrieval over four years of KYC and case-note corpus, a forced-JSON Claude Sonnet 4.6 disposition, and a policy-as-code gate before any tool dispatch.
Skip the LLM on the auto-clear band
- We rejected: Run Claude on every transaction
- Because: 92% of auth-boundary traffic is below the velocity-score threshold. Burning Sonnet tokens on cases the rules engine already cleared is the indefensible line in the cost math; the XGBoost short-circuit is what makes the unit economics work.
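The short-circuit reduces to one threshold check ahead of any model call. A minimal sketch, assuming the `vscore < 0.18` auto-allow band this page quotes for the replay viewer; `routeTransaction`, `ScoredTransaction`, and `AUTO_CLEAR_THRESHOLD` are illustrative names, not the bank's code.

```typescript
// Hypothetical names throughout. The 0.18 cutoff is the auto-allow band
// quoted for the replay viewer, used here as an assumed threshold.
const AUTO_CLEAR_THRESHOLD = 0.18;

type Lane = "auto_clear" | "llm_review";

interface ScoredTransaction {
  id: string;
  vscore: number; // calibrated velocity score, expected in [0, 1]
}

function routeTransaction(tx: ScoredTransaction): Lane {
  // A malformed score must never clear silently, so it goes to review.
  if (!isFinite(tx.vscore) || tx.vscore < 0 || tx.vscore > 1) {
    return "llm_review";
  }
  return tx.vscore < AUTO_CLEAR_THRESHOLD ? "auto_clear" : "llm_review";
}
```

Failing closed on an out-of-range score is a design choice in this sketch: an unscored transaction costs one LLM review rather than one silent clear.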
Forced JSON with a severity enum
- We rejected: Free-text disposition + downstream parser
- Because: The regulator audit needs the disposition packet to be reproducible from the trace. A schema-bounded severity enum (low | med | high | regulatory) is what makes the SAR-track decision deterministic; the model can't smuggle a fifth severity into the output.
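The enum constraint can be sketched as a validator that rejects any packet whose severity falls outside the four allowed values. This is a dependency-free illustration with assumed field names (`case_id`, `severity`, `confidence`, `evidence`) borrowed from the sample packet on this page; the production runtime is described as using a Zod schema, which this sketch does not reproduce.

```typescript
// Illustrative validator; function and type names are assumptions.
const SEVERITIES = ["low", "med", "high", "regulatory"] as const;
type Severity = (typeof SEVERITIES)[number];

interface Evidence { claim: string; evidence_id: string; source: string }
interface Disposition {
  case_id: string;
  severity: Severity;
  confidence: number;
  evidence: Evidence[];
}

function isSeverity(value: unknown): value is Severity {
  return typeof value === "string" &&
    (SEVERITIES as readonly string[]).indexOf(value) !== -1;
}

function isEvidence(value: unknown): value is Evidence {
  if (typeof value !== "object" || value === null) return false;
  const e = value as { [key: string]: unknown };
  return typeof e["claim"] === "string" &&
    typeof e["evidence_id"] === "string" &&
    typeof e["source"] === "string";
}

function parseDisposition(raw: string): Disposition {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("disposition must be a JSON object");
  }
  const d = data as { [key: string]: unknown };
  const caseId = d["case_id"];
  const severity = d["severity"];
  const confidence = d["confidence"];
  const evidence = d["evidence"];
  if (typeof caseId !== "string") throw new Error("case_id must be a string");
  // A fifth severity value is rejected here, before any tool dispatch.
  if (!isSeverity(severity)) throw new Error("unknown severity: " + String(severity));
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    throw new Error("confidence must be a number in [0, 1]");
  }
  if (!Array.isArray(evidence) || evidence.length === 0) {
    throw new Error("evidence must be a non-empty array");
  }
  const items: Evidence[] = [];
  for (const item of evidence) {
    if (!isEvidence(item)) throw new Error("malformed evidence item");
    items.push(item);
  }
  return { case_id: caseId, severity: severity, confidence: confidence, evidence: items };
}
```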
Two-eye gate on regulatory severity
- We rejected: Auto-route regulatory severity to SAR queue
- Because: Anything that touches the FinCEN clock starts on a human signature, not a model output. The senior-analyst approval row is checked into the policy file, and the runtime refuses to dispatch the escalation tool without it. We accepted a slower escalate path for an audit-defensible one.
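The gate itself is a small pure function. The sketch below assumes the `role:senior-analyst` approver and the 30-minute deadline that appear in the policy file on this page; `twoEyeGate`, `EscalationRequest`, and the millisecond timestamps are hypothetical and stand in for whatever the runtime actually records.

```typescript
// Hypothetical shapes; role and deadline values mirror the policy file.
type GateSeverity = "low" | "med" | "high" | "regulatory";

interface Approval { approverRole: string; signedAtMs: number }

interface EscalationRequest {
  caseId: string;
  severity: GateSeverity;
  requestedAtMs: number;
  approval?: Approval;
}

type GateVerdict = { dispatch: true } | { dispatch: false; reason: string };

const REQUIRED_ROLE = "role:senior-analyst";
const DEADLINE_MS = 30 * 60 * 1000; // 30 minutes

function twoEyeGate(req: EscalationRequest): GateVerdict {
  // Only regulatory severity needs the second pair of eyes.
  if (req.severity !== "regulatory") return { dispatch: true };
  const approval = req.approval;
  if (!approval) {
    return { dispatch: false, reason: "missing senior-analyst approval" };
  }
  if (approval.approverRole !== REQUIRED_ROLE) {
    return { dispatch: false, reason: "approver lacks the required role" };
  }
  if (approval.signedAtMs - req.requestedAtMs > DEADLINE_MS) {
    return { dispatch: false, reason: "approval aged out" };
  }
  return { dispatch: true };
}
```

Returning a reason string alongside the refusal keeps the policy verdict loggable, which matters more here than the boolean.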
Every component has a
separately measurable contract.
When something regresses, the per-component metric tells us which stage to look at. No single end-to-end number that hides which subsystem broke.
Decision model
Labelled severity-correctness + groundedness on the frozen 412-item eval. Forced-JSON schema enforces evidence-chunk citation on every claim — the model cannot reason its way past the validator.
Retrieval
Top-k recall on the frozen eval. RRF + reranker tuned against this number, not end-to-end accuracy.
Reranker
Top-1 precision on the held-out slice. Margin gate catches over-confidence before Sonnet ever sees it.
Velocity model
ROC-AUC + ECE on the auto-clear band.
Case-note generator
Regulator-audit acceptance · 10% weekly sample.
Guardrails
Policy-rejection vs analyst-override rate.
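ECE appears in several of these contracts, so a worked definition helps. The sketch below computes expected calibration error with equal-width confidence bins; the bin count of 10 and the `Prediction` shape are assumptions, since the page does not say how the team bins its scores.

```typescript
// Illustrative ECE with equal-width bins. The binning scheme is assumed.
interface Prediction {
  confidence: number; // model confidence in [0, 1]
  correct: boolean;   // whether the labelled outcome matched
}

function expectedCalibrationError(preds: Prediction[], bins: number = 10): number {
  if (preds.length === 0) throw new Error("need at least one prediction");
  const count: number[] = [];
  const confSum: number[] = [];
  const hitSum: number[] = [];
  for (let b = 0; b < bins; b++) {
    count.push(0);
    confSum.push(0);
    hitSum.push(0);
  }
  for (const p of preds) {
    // Clamp so that confidence 1.0 lands in the last bin.
    const b = Math.max(0, Math.min(bins - 1, Math.floor(p.confidence * bins)));
    count[b] += 1;
    confSum[b] += p.confidence;
    hitSum[b] += p.correct ? 1 : 0;
  }
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    if (count[b] === 0) continue;
    const accuracy = hitSum[b] / count[b];
    const meanConfidence = confSum[b] / count[b];
    // Weight each bin's gap by its share of all predictions.
    ece += (count[b] / preds.length) * Math.abs(accuracy - meanConfidence);
  }
  return ece;
}
```

A model that says 0.75 and is right three times in four scores zero in that bin; the metric only penalises the gap between stated confidence and observed accuracy.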
The fraud agent,
auth to audit.
Every transaction enters at the top. The XGBoost velocity score skips the LLM on the auto-clear band; anything above the threshold runs hybrid retrieval over four years of KYC and case-note corpus, reranks the evidence, and lets Claude Sonnet 4.6 produce a forced-JSON disposition.
latency budgets are p50/p95 from a 30-day production window · end-to-end p95 is tracked against the 3.5s decision SLO on the LLM-routed path
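The hybrid-retrieval stage merges a dense ranking and a lexical ranking with reciprocal-rank fusion. A minimal sketch of standard RRF, using the k=60 constant listed in the stack table; `reciprocalRankFusion` and the chunk ids are illustrative, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```typescript
// Standard reciprocal-rank fusion: score(id) = sum over lists of 1 / (k + rank),
// with rank starting at 1. Names and tie-break rule are illustrative.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores: { [id: string]: number } = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores[id] = (scores[id] || 0) + 1 / (k + rank);
    });
  }
  return Object.keys(scores).sort((a, b) => {
    const diff = scores[b] - scores[a];
    if (diff !== 0) return diff;
    // Deterministic tie-break so the same inputs always replay identically.
    return a < b ? -1 : a > b ? 1 : 0;
  });
}

// Example: one dense ranking and one lexical ranking over the same corpus.
const denseRanking = ["chunk_a", "chunk_b", "chunk_c"];
const lexicalRanking = ["chunk_b", "chunk_d", "chunk_a"];
const fusedRanking = reciprocalRankFusion([denseRanking, lexicalRanking]);
// fusedRanking is ["chunk_b", "chunk_a", "chunk_d", "chunk_c"]
```

RRF uses only rank positions, so the cosine and BM25 scores never need to be put on a common scale before fusing.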
A 0.32-second window
at the auth boundary.
Eight rows from a synthetic replay tape — the same shape the production stream sees at ~38k transactions/sec peak. The agent fans out into three lanes per row: silent clear, queued case-note, or senior-analyst escalation. No real PAN, no real merchant; this is a replay viewer, not a live feed.
replay clock advances 41 ms per row · 7 of 8 rows shown are auto-allow band (vscore < 0.18) in production; replay over-samples flagged rows for legibility
AI fraud detection stack — named tools,
named versions.
Everything in the build is a thing the model-risk committee can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, policies, and feature definitions are all checked into the bank's repo — not ours.
Production shape,
under the hood.
Numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 412-item set the CI gates on.
Per-stage P50 / P95 (ms)
| stage | p50 | p95 | tooling |
|---|---|---|---|
| Kafka consumer + parse | 8 | 18 | Confluent · ISO 8583 superset · per-tenant partition key |
| XGBoost velocity score | 4 | 9 | XGBoost 2.0 · 142 features · auto-clear band short-circuit |
| Hybrid retrieval | 38 | 92 | pgvector cosine top-40 ∥ tsvector BM25 top-40 → RRF k=60 |
| Cross-encoder rerank | 62 | 138 | BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-12 |
| Claude Sonnet 4.6 decision | 1,420 | 2,080 | Anthropic API over AWS PrivateLink · ~3,200 in / ~420 out tokens |
| Claude Haiku 4.5 case-note | 780 | 1,180 | narrates from Sonnet's evidence ids · ~1,100 in / ~340 out |
| Policy + 2-eye + audit log | 11 | 22 | TypeScript runtime · Zod schema · WORM-equivalent audit row |
| Total (LLM-routed path) | 2,323 | 3,537 | agent boundary · ~8% of traffic; auto-clear path < 50ms total |
p50/p95 from a 30-day rolling window over n ≈ 28,400 LLM-routed decisions / mo (~92% of traffic short-circuits before the LLM call). SLO is p95 ≤ 3,500 ms on the LLM-routed path. Current burn ≈ 101% — we're in active tuning on the reranker timeout to bring the tail in; the SpecGrid above doesn't lie about a number we haven't shipped yet.
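The burn figure can be reproduced from raw latencies. The sketch below uses the nearest-rank percentile and reads "burn" as observed p95 divided by the SLO target; both choices are assumptions, since the page does not state which percentile method or burn definition the dashboard uses.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of the
// data at or below it. Function names are illustrative.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-indexed
  return sorted[rank - 1];
}

// Assumed definition: observed p95 as a percentage of the SLO target.
function sloBurnPercent(observedP95Ms: number, sloP95Ms: number): number {
  if (sloP95Ms <= 0) throw new Error("SLO target must be positive");
  return (observedP95Ms / sloP95Ms) * 100;
}

// With the table's 3,537 ms p95 against the 3,500 ms SLO:
const currentBurn = sloBurnPercent(3537, 3500); // about 101.06
```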
// triage/tools/escalate_case.policy.ts
//
// Every write tool the agent can reach for has a policy file.
// The runtime imports these at startup and refuses to dispatch
// any tool call that doesn't pass. Regulatory severity needs a
// senior-analyst signature before the SAR-track is touched.
import { Policy } from "@gw/agent-runtime";

export const escalate_case: Policy = {
  description: "Send a flagged transaction to the human-review queue.",
  inputs: {
    case_id: "uuid, exists in cases table, has not been escalated",
    severity: "enum: low | med | high | regulatory",
    evidence: "array, min 1, items have {claim, evidence_id, source}",
    confidence: "number in [0,1]; required, no default",
    reasoning: "string, 80–600 chars, grounded in evidence",
  },
  preconditions: [
    "agent.confidence_calibrated === true",
    "transaction.amount > 0",
    "no_pending_escalation_for(case_id)",
    "every(evidence, e => retrieval.contains(e.evidence_id))",
  ],
  rate_limits: { perAgent: "30/min", perCase: "1" },
  audit: {
    redact: ["pan", "ssn", "iban", "routing"],
    retain: "7y",
    store: "s3:bsa-audit-log/worm", // WORM-equivalent object lock
    log_shape: ["case_id", "severity", "evidence", "model_version",
                "retrieval_chunks", "policy_verdict", "approver"],
  },
  // Two-eye rule. Regulatory severity needs a senior-analyst sign-off
  // before the runtime dispatches the SAR-track integration.
  approval: {
    required: ({ severity }) => severity === "regulatory",
    approver: "role:senior-analyst",
    deadline_mins: 30, // ages back to the queue with an "aged out" tag
  },
};
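Two lines in that policy are easy to check mechanically: the precondition that every cited `evidence_id` was actually retrieved, and the `redact` list applied to the audit row. The sketch below shows one way a runtime could evaluate them; `ungroundedEvidence` and `redactFields` are hypothetical helpers, not part of `@gw/agent-runtime`.

```typescript
// Hypothetical helpers illustrating two policy checks.
interface EvidenceItem { claim: string; evidence_id: string; source: string }

// Returns the evidence ids that are NOT in the retrieved set.
// An empty result means the precondition holds.
function ungroundedEvidence(evidence: EvidenceItem[], retrievedChunkIds: string[]): string[] {
  const missing: string[] = [];
  for (const item of evidence) {
    if (retrievedChunkIds.indexOf(item.evidence_id) === -1) {
      missing.push(item.evidence_id);
    }
  }
  return missing;
}

// Returns a copy of the row with the listed fields masked.
function redactFields(
  row: { [key: string]: unknown },
  redact: string[],
): { [key: string]: unknown } {
  const out: { [key: string]: unknown } = {};
  for (const key of Object.keys(row)) {
    out[key] = redact.indexOf(key) === -1 ? row[key] : "[REDACTED]";
  }
  return out;
}
```

Returning the list of missing ids, rather than a boolean, lets the policy verdict name the exact claim that failed grounding.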
Per-decision and monthly cost math
| line item | $ / decision | $ / month (≈ 28k LLM-routed decisions) | note |
|---|---|---|---|
| Claude Sonnet 4.6 — input | $0.0096 | $269 | 3,200 tokens × $3.00 / 1M |
| Claude Sonnet 4.6 — output | $0.0063 | $176 | 420 tokens × $15.00 / 1M |
| Claude Haiku 4.5 — case-note | $0.0022 | $63 | 1,100 in × $0.80 / 1M + 340 out × $4.00 / 1M |
| voyage-3-large embeddings | $0.0006 | $17 | ≈ 5,000 tokens × $0.12 / 1M |
| pgvector + RDS db.r6i.xlarge | — | $612 | BAA-scoped Postgres · pgvector + tsvector |
| g5.xlarge reranker (24/7) | — | $378 | BAAI bge-reranker-large self-host |
| AWS PrivateLink + endpoints | — | $96 | Anthropic in-VPC inference |
| Langfuse self-hosted (t3.large) | — | $104 | trace store · 90d hot / 7yr cold |
| All-in monthly | ≈ $0.061 | ≈ $1,715 | vs. ≈ $14 × 6k cases/mo = $84k legacy review-prep |
Token costs use Anthropic's public Sonnet 4.6 + Haiku 4.5 pricing as of May 2026: $3 / 1M input, $15 / 1M output on Sonnet; $0.80 / 1M input, $4 / 1M output on Haiku. Infra costs are AWS us-east-2 list price; the bank paid less under an EDP. The legacy comparison line is the bank's own per-case review-prep cost × the routed-cases volume. The agent doesn't replace analyst time at the decision boundary; it compresses the review-prep half of the case workload.
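The per-decision token lines follow from one formula: tokens multiplied by the per-million rate, summed over input and output. A small sketch using the Sonnet figures quoted above; `costPerDecision` and `TokenPricing` are illustrative names, and the 28,000 monthly volume is the rounded figure from the table header.

```typescript
// Illustrative cost helper. Rates are USD per one million tokens.
interface TokenPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

function costPerDecision(inputTokens: number, outputTokens: number, pricing: TokenPricing): number {
  return (inputTokens * pricing.inputPerMillion + outputTokens * pricing.outputPerMillion) / 1e6;
}

// Sonnet figures quoted on this page: ~3,200 in / ~420 out at $3 / $15 per 1M.
const sonnetPricing: TokenPricing = { inputPerMillion: 3.0, outputPerMillion: 15.0 };
const sonnetPerDecision = costPerDecision(3200, 420, sonnetPricing); // 0.0096 + 0.0063 = 0.0159
const sonnetPerMonth = sonnetPerDecision * 28000; // about 445
```

The same function takes any other model's token counts and rates, which keeps the cost table auditable line by line.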
What's in the frozen 412-item set
| category | items | what it checks | ci-gate threshold |
|---|---|---|---|
| Severity-decision golds | 100 | labelled disposition + severity band on real (de-id) past cases | ≥ 0.95 precision @ 1% FPR |
| Evidence groundedness | 120 | every rationale claim points to a retrieved chunk id that supports it | ≥ 0.93 groundedness |
| Retrieval recall | 80 | correct case + policy chunks in top-5 after RRF + rerank | ≥ 0.90 recall@5 |
| Refusal / adversarial | 60 | structured-pattern hits, jailbreak attempts, OOD merchant categories | 100% refusal on must-refuse |
| Calibration golds | 52 | confidence-vs-correctness on held-out cases · ECE check | ECE ≤ 0.04 |
The eval set is frozen: items are only added, never edited. The senior analyst lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. The Black Friday holiday-window slice (added at week 7) became a 5th fold and is now permanent.
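The regression rule can be sketched as a comparison between two score cuts. This assumes "1 point" means one percentage point on a 0 to 100 scale and treats a category missing from the current cut as a regression; `regressedCategories` and `releaseAllowed` are hypothetical names, and the absolute thresholds in the table would be a separate check.

```typescript
// Illustrative CI gate. Scores are in points on a 0-100 scale (assumed).
interface CategoryScores { [category: string]: number }

function regressedCategories(
  prior: CategoryScores,
  current: CategoryScores,
  maxDropPoints: number = 1,
): string[] {
  const regressed: string[] = [];
  for (const category of Object.keys(prior)) {
    const now = current[category];
    // A category that vanished from the current cut counts as regressed.
    if (now === undefined || prior[category] - now > maxDropPoints) {
      regressed.push(category);
    }
  }
  return regressed;
}

function releaseAllowed(
  prior: CategoryScores,
  current: CategoryScores,
  signedOverride: boolean,
): boolean {
  // The override models the signed CHANGELOG entry.
  return signedOverride || regressedCategories(prior, current).length === 0;
}
```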
What runs every week,
and who owns it.
Production ops is part of the build, not an afterthought. We run four controls that keep our calibration honest after cutover, and our on-call rotation owns each one.
Override-review meeting
Every agent-vs-analyst diff opened. Systematic drift (>3 same pattern/wk) becomes a JIRA ticket against the eval set and a candidate fine-tune slice.
Trace retention
Langfuse self-hosted in the customer VPC. Matches BSA's 7-year record-retention requirement and the bank's internal SAR documentation policy.
On-call rotation
Two engineers per week. 99.5% pipeline-availability SLO + p95 ≤ 3.5s decision SLO on the LLM-routed path.
Model-risk audit sample
50 rows monthly: velocity score, retrieval candidates, reranker scores, raw model output, parsed JSON, policy verdict, senior-analyst approval.
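The drift rule from the override review (more than three diffs with the same pattern in a week) is a simple group-and-count. In the sketch below the `pattern` label is assumed to be assigned by the reviewing analyst; `OverrideDiff` and `systematicDriftPatterns` are illustrative names.

```typescript
// Illustrative drift check over one week of agent-vs-analyst diffs.
interface OverrideDiff {
  caseId: string;
  pattern: string; // assumed to be an analyst-assigned label
}

function systematicDriftPatterns(weekDiffs: OverrideDiff[], threshold: number = 3): string[] {
  const counts: { [pattern: string]: number } = {};
  for (const diff of weekDiffs) {
    counts[diff.pattern] = (counts[diff.pattern] || 0) + 1;
  }
  // Strictly more than the threshold, matching ">3 same pattern/wk".
  return Object.keys(counts)
    .filter((pattern) => counts[pattern] > threshold)
    .sort();
}
```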
Precision vs FPR,
where you stand the line.
Where the team stands the threshold is a policy choice, not a model property. Precision, recall, and per-month false-positive volume move together as the threshold shifts. We anchor production at 1% FPR, the ops team's documented ceiling for analyst review-prep load.
curve fitted from the frozen 412-item eval set · production op-point at 1% FPR
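The trade-off is easiest to see by computing the three quantities at a single threshold. A minimal sketch over labelled scores; `operatingPoint` and the `Scored` shape are illustrative, and returning precision 1 when nothing is flagged is a convention chosen here, not something the page specifies.

```typescript
// Illustrative operating-point calculation from labelled scores.
interface Scored {
  score: number;    // model score, higher means more suspicious
  isFraud: boolean; // ground-truth label
}

interface OpPoint { precision: number; recall: number; fpr: number }

function operatingPoint(items: Scored[], threshold: number): OpPoint {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  let tn = 0;
  for (const item of items) {
    const flagged = item.score >= threshold;
    if (flagged && item.isFraud) tp++;
    else if (flagged) fp++;
    else if (item.isFraud) fn++;
    else tn++;
  }
  return {
    // Convention in this sketch: no flags means no false alarms, so precision 1.
    precision: tp + fp === 0 ? 1 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
    fpr: fp + tn === 0 ? 0 : fp / (fp + tn),
  };
}
```

Sweeping `threshold` over the eval set and keeping the point where `fpr` is closest to 0.01 would give the kind of operating point the page anchors on.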
The timeline,
including the week we halted.
Five stages, milestone-billed. The week-7 Black Friday shadow surfaced a calibration drift on a holiday-shopping velocity pattern the eval set hadn't seen. We halted cutover, ingested the holiday-window slice as a new eval fold, re-fit the calibration head, and only then promoted the AI fraud detection agent to primary screen. The honest version of `11 weeks` includes the week we ran the sweep.
- Weeks 1–2
Discovery + frozen eval set
Two weeks shadowing the fraud-ops team. The senior analyst lead labelled 412 frozen eval items drawn from 18 months of (de-identified) past cases — each carrying a labelled correct disposition + the rationale + the evidence chunks that should ground the call. We wrote the harness; the ops team wrote the answers. Scoping decision: the deliverable is a structured-output agent, not a chatbot, and the eval gate is non-negotiable.
412-item frozen eval + severity-band rubric · scope memo signed
- Weeks 3–4
Corpus + velocity score + retrieval
Ingested four years of KYC artifacts and historical case notes into pgvector 0.7 inside the customer VPC. BM25 sidecar over the same chunks. XGBoost 2.0 velocity model trained against the labelled fraud history with 142 features; calibrated with isotonic regression on a held-out slice. Reciprocal-rank fusion tuned on the eval slice; cross-encoder rerank wired in when top-1 recall plateaued.
Hybrid retrieval at 0.93 top-5 recall · velocity ECE 0.041
- Weeks 5–6
Agent skeleton + policy-as-code
LangGraph 0.2.x agent with three read-only tools (case lookup, KYC pull, structured-pattern check) and two write tools (case-note write, escalation dispatch). Every tool carries a policy file in `triage/tools/`. Forced-JSON disposition via Anthropic's `response_format`. Two-eye gate baked into the runtime; senior-analyst approval is what unblocks the regulatory-severity branch.
End-to-end pipeline behind a feature flag · BAA + PrivateLink wired
- Week 7
Black Friday shadow — calibration drift caught
Three weeks of silent shadow against the live rules engine. Day 9 was Black Friday, and a holiday-shopping pattern surfaced that nobody had labelled in the eval set — a structurally novel velocity pattern from gift-card top-ups that the model was over-confidently clearing as legit. We halted cutover, ingested the holiday-window slice as a fresh-data fold, re-fit the calibration head, and re-ran the full eval. The honest version of `shipped on time` includes the week we sat on our hands and ran the calibration sweep.
ECE recalibrated from 0.067 → 0.028 on the Black Friday-augmented eval slice
Walk-away point
- Weeks 8–11
Cutover + SAR-track integration
Promoted to primary screen with the rules engine in active-standby. Compliance reviewed the audit-log packet end-to-end; FinCEN SAR-track integration tested against the bank's e-filing path. Four ops-team training sessions on the case-note acceptance flow + the two-eye approval surface. PagerDuty wired to the regulatory-severity lane. Old rules engine stays in active-standby for 60 days post-cutover; every diff between agent + rules logged for the model-risk lead's weekly review.
Production cutover · SAR-track audited · model-risk committee sign-off
How we know
it works.
The AI fraud detection eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 412. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.
Sample size for the ≥ 0.96 precision figure is n=412 frozen eval items + the production confirmation slice n ≈ 1,840 cases reviewed by the senior analyst over the first 30 days post-cutover. Confidence interval is ±0.012 on the precision at the 1% FPR op-point. ECE is expected calibration error on the labelled set. P95 is end-to-end on the LLM-routed path (auto-clear path is under 50ms total). Refusal rate is the share of inputs where the agent legally cannot decide and routes straight to a senior analyst — by design, not by failure. Note: refusal rate v1 → v2 → current is not monotone-improving by design; we tuned the refusal-threshold up in v2 after the senior analyst lead flagged that v1 was clearing borderline cases that should have routed for review.
What the ops team reads
when a case routes for review.
The 8.0 → 0.8 minute delta isn't a tooling story. It's an artefact story. On the left: what an analyst produced manually — narrative-heavy, hard to skim, citations buried in the prose. On the right: the agent's structured packet — same evidence, surfaced as fields the regulator-audit reviewer reads in seconds.
Case # FA-2026-04-11-0917
Reviewer: J. Reyes · 11 Apr 2026 14:22 UTC
Customer (PAN ending 8801) initiated a wire transfer of $9,250.00 USD to a beneficiary account first seen on the platform on 09 Apr 2026. Reviewing prior 90-day activity for this customer shows wire activity concentrated to two prior beneficiaries (relatives, KYC-verified, historical pattern stable). The new beneficiary is an LLC registered in a jurisdiction with elevated SAR-correlation per our internal scoring (referenced in policy doc PL-WIRE-2024 §4.2).
Velocity score on this transaction was 0.92 (model output, see ML-FRAUD-3.4 dashboard). Cross-checking against our case-history corpus, three structurally similar cases have been reviewed in the past 18 months; two were confirmed-fraud, one cleared after additional context. The originating IP geo (Newark, NJ) is consistent with the cardholder's historical pattern.
Recommendation: escalate for regulatory review. Senior analyst sign-off required per the two-eye policy on structured-pattern hits. Note: cardholder not yet contacted — pending compliance lead approval for outbound.
{
  "case_id": "FA-2026-04-11-0917",
  "severity": "regulatory",
  "decision": "escalate",
  "confidence": 0.91,
  "evidence": [
    {
      "claim": "Beneficiary first-seen 09 Apr 2026; not in cardholder's 90d graph.",
      "evidence_id": "chunk_a4f0c12b9e44",
      "source": "ledger.beneficiary_first_seen"
    },
    {
      "claim": "LLC jurisdiction matches PL-WIRE-2024 §4.2 elevated-SAR list.",
      "evidence_id": "chunk_71d33e0a4c8b",
      "source": "policy.PL-WIRE-2024"
    },
    {
      "claim": "Velocity score 0.92; 3 structurally similar cases in the corpus.",
      "evidence_id": "chunk_e8b290745f01",
      "source": "ml.velocity + case-history"
    }
  ],
  "two_eye_required": true,
  "approver_role": "role:senior-analyst",
  "sar_track": true,
  "audit_retain_yrs": 7
}
both artefacts are synthetic · case-id, beneficiary, and PAN-last-4 are illustrative · the agent packet is what the regulator-audit reviewer reads, not the prose
The four shapes we turn down
before scoping a pilot.
An AI fraud detection agent built on these patterns will produce regulator-audit-defensible failures in any of the following situations. We turn down the engagement before a pilot is scoped.
Autonomous SAR filing on the scope sheet
The 30-day FinCEN clock starts on human detection. Any pitch that includes "the agent files the SAR" is the pitch we walk away from. Drafting the packet is fine; filing it is a human signature obligation. We've turned this down twice.
No weekly override review
If the senior analyst lead is not going to review agent-vs-analyst diffs weekly for the first six months, the calibration head drifts and nobody catches it. The eval set is necessary, not sufficient. Week-7 drift was caught by the shadow against a live analyst floor, not by the eval.
BAA + PrivateLink gaps in the deployment plan
No BAA from the model vendor, no in-VPC inference, no WORM-equivalent audit retention. The regulatory posture is not a post-launch fix. The bank's compliance team had the legal posture committed at week 1, or the pilot did not get signed.
Buyer wants a fraud-score API, not a disposition packet
If procurement is "we want a fraud-score API that returns 0 to 1 per transaction," the buyer is shopping for a model, not an agent. Our shape is structured disposition + evidence + audit packet. The score-API shape gets a better outcome from a feature store + an off-the-shelf score vendor.
What buyers ask first.
Real answers, no hedging.
What is AI fraud detection?
How is AI fraud detection different from a rules engine?
Will an AI fraud detection agent hold up in a regulator audit?
How does AI fraud detection align with FFIEC / BSA / SR 11-7 guidance?
How accurate is the AI fraud detection agent?
What does AI fraud detection cost to run?
How long does it take to build?
When should we NOT ship an AI fraud detection agent?
Where this case study
points back to.
Each link below covers a pillar that fed into this AI fraud detection build, or that a similar build on your stack would draw from.
AI Agent Development
The agent pillar — ReAct, plan-and-execute, hierarchical multi-agent recipes. Same eval-first loop used on this AI fraud detection build.
Fintech AI
The fintech pillar — KYC ladders, AML/BSA posture, ECOA Reg B, model-risk inventory aligned to SR 11-7. The regulatory context AI fraud detection in banking lives inside.
Claude Development
Sonnet 4.6 + Haiku 4.5 integration patterns the Anthropic case study above uses end-to-end. Forced JSON, response_format schema, BAA + AWS PrivateLink deployment.
AI Governance
Policy-as-code, audit-log scaffolding, model-risk inventory templates. The plumbing that made this AI fraud prevention pilot pass the model-risk committee.
All AI Case Studies
Six AI case studies — AI fraud detection, AI triage, RAG, voice bot, AI legal assistant, mobile AI assistant. Same operator detail across every page.
AI Consulting
Fixed-fee AI fraud detection audit. We map the workflow, scope the eval, and tell you whether it's case-study-shaped.
AI Development Services
How a fraud-agent build slots into a broader AI software development company engagement — ML risk model + LLM gray-band reviewer + monitoring.
AI Automation Agency
Payment-fraud agent shipped on the auth boundary. The kill-point and human-gate pattern this pillar describes, in production.
Want AI fraud detection like this
for your fraud-ops floor?
Book a fixed-fee AI fraud detection audit. We'll review the existing fraud workflow, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether AI-powered fraud detection is the right shape for the workload — and whether the regulatory posture is ready to support a build. About one audit in five ends with `the legal posture isn't ready yet, here's the 90-day prep plan.`