AI Fraud Detection at the Auth Boundary: Operator Architecture (2026)
Auth-boundary fraud detection done eval-first: hybrid rules + ML + LLM with audit logs, false-positive cost math, and the production architecture we ship. With a walk-away clause.
Most AI fraud detection vendor pages sell a platform. This is not one of those. We scope the question to the auth boundary: the login screen, the payment confirm, the account-recovery flow. That is where 70-80% of consumer-fraud loss now lands per FTC reporting, and it is also where you have the smallest decision surface and the tightest latency budget. Get the decision right at the auth boundary and the rest of the fraud program becomes tractable. (See our AI agent development company and chatbot development services practices for the production architecture we deploy this on.)
We ship AI fraud detection systems for fintech and payments operators. Our delivery team has built rules layers in Stripe Radar, scored transactions through Sift and Feedzai, wired Neo4j risk graphs, and added LLM-review hops with Claude Opus 4. This guide is what we wish buyers had before they bought: honest about where each layer fails, with named tools throughout, code you can read, and a dated 2026-Q1 eval on our own 12,400-transaction holdout.
What AI fraud detection at the auth boundary actually means
The auth boundary is the moment a user attempts a state-changing action. A login. A payment. A password reset or a device add. The system has one shot: allow the attempt, send it to step-up review, or decline it outright. The window is typically 100-400 ms. Inside that window an AI fraud detection platform pulls signals (device fingerprint, behavioral biometrics, velocity counters, payee history), composes a risk score, then applies a policy and emits an audit-log line. Everything downstream is shaped by that one decision. Chargeback exposure. Refund queue depth. Manual-review load on the ops team.
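One way to keep the 100-400 ms window honest is to carry a deadline through the request and consult it before each optional step. The sketch below is illustrative only: the `Deadline` class, its method names, and the injectable clock are assumptions made for this example, not part of the stack described in this guide.

```python
import time


class Deadline:
    """Tracks how much of a per-attempt latency budget is left (hypothetical helper)."""

    def __init__(self, budget_ms: float, clock=time.monotonic):
        self._clock = clock          # injectable so tests can control time
        self._start = clock()
        self.budget_ms = budget_ms

    def remaining_ms(self) -> float:
        """Milliseconds left in the window, never below zero."""
        elapsed_ms = (self._clock() - self._start) * 1000.0
        return max(0.0, self.budget_ms - elapsed_ms)

    def can_afford(self, expected_ms: float) -> bool:
        """True if a step expected to take `expected_ms` still fits in the window."""
        return self.remaining_ms() >= expected_ms
```

A caller could check `deadline.can_afford(...)` with a layer's expected p95 before invoking it, and fall back to a cheaper outcome such as step-up when the answer is no.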
This is not the same as analytic-warehouse fraud detection. A nightly batch job that flags suspicious transactions in your data lake is useful, but it cannot prevent loss; it can only file a report after money moves. Auth-boundary detection is real-time, decision-binding, and on the hot path. That is why the architecture below matters.
Why AI fraud detection lives at the auth boundary, not the data warehouse
Vendors love the warehouse story because it does not require integrating into the hot path. It also does not stop fraud. By the time a batch model flags a payment, the funds have settled and the chargeback window is open. The auth boundary is where prevention happens. It is also where regulators care most, because adverse-action notices and step-up flows live there. Our sibling brand getwidget pushes the same eval-first discipline on the application side; the entity triangle reinforces that we ship measured systems, not vendor decks.
The 5-layer AI fraud detection architecture we ship
Five layers, each owning one job, plus a decision engine that composes them. Signals sit at the bottom: device fingerprint plus behavioral biometrics plus payee history plus IP and velocity counters. Features are computed from those signals using Tecton or Feast for online aggregates, plus pgvector for similarity lookups against known-fraud embeddings. The rules layer catches the cheap explainable cases (Stripe Radar rules plus Alloy policy plus in-house regex). An ML score layer adds graded risk via gradient-boosted trees from Sift or Feedzai or your own SageMaker model. An optional LLM-review layer reasons over the harder amber band with Claude Opus 4 or GPT-4o, but only when the cost and latency budget allow it. The decision engine then composes the layers; it emits a single allow/step-up/decline and writes the audit log. This is the AI fraud detection architecture we ship; the rest of this guide gives examples per layer, plus the implementation code we hand a buyer on week one.
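As a signal-layer example, a velocity counter can be sketched as a sliding-window count per key. This is an in-memory illustration under stated assumptions: the `VelocityCounter` name and its `hit` method are hypothetical, timestamps are assumed to arrive in order, and a production deployment would more likely rely on the online aggregates of a feature store such as Tecton or Feast.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts events per key inside a sliding time window (in-memory sketch)."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def hit(self, key: str, ts: float) -> int:
        """Record an event at `ts` and return the count still inside the window."""
        q = self._events[key]
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q)
```

A rule such as "more than N attempts from one IP in 60 seconds" then reduces to comparing the returned count against N.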
Rules vs ML vs LLM: where each one earns its keep
The honest answer is that you need all three, but only one of them should make the binding decision. Rules win on explainability and latency. ML wins on graded risk and drift discipline. LLM review wins on hard, narrative cases where a human reviewer would also have to think. The question is composition. We have written more on whether the LLM hop should be a single call or a chain in our take on multi-agent orchestration patterns; the short version is that at the auth boundary, you almost never want a chain. One constrained call, one tool surface, one decision back.
| Criterion | Rules | ML score | LLM review |
|---|---|---|---|
| Latency at p95 | 5-20 ms (best) | 40-120 ms | 400-1200 ms (worst) |
| Explainability | Native, every rule traced | Needs SHAP or LIME | Natural-language rationale, must be persisted |
| Drift posture | Stale silently | Drift-monitored, retrainable | Model updates change behavior; freeze versions |
| False-positive cost | High on blunt rules | Tunable via threshold | Worse than ML if used as primary gate; better when scoped to amber band |
| Audit-log shape | Rule id + values | Score + top SHAP features | Prompt, tool calls, rationale, model+version |
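The audit-log row of the table can be made concrete with one record type per layer. The sketch below is an assumption about how those shapes might look: the class names, field names, and the JSON-lines serialization are all hypothetical, chosen only to mirror the columns above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuleAudit:
    trace_id: str
    decision: str
    rule_id: str
    values: dict            # the inputs the rule fired on
    layer: str = "rules"


@dataclass(frozen=True)
class MlAudit:
    trace_id: str
    decision: str
    score: float
    top_shap: list          # e.g. [["feature_name", contribution], ...]
    layer: str = "ml"


@dataclass(frozen=True)
class LlmAudit:
    trace_id: str
    decision: str
    prompt: str
    tool_calls: list
    rationale: str
    model_version: str      # frozen model identifier, so decisions can be replayed
    layer: str = "llm"


def to_log_line(record) -> str:
    """Serialize one audit record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping one shape per layer means a reviewer can tell from the `layer` field alone which explanation artifact to expect on the row.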
Risk graph: account, device, payee, and IP edges
The signals are not independent. One stolen device fingerprint will reach across dozens of mule accounts, the same payee will pull from many compromised accounts, and a single residential proxy IP will rotate through hundreds of attempts. The risk graph models this as a property graph: account nodes connected to device nodes and payee nodes and IP nodes by typed edges. Score the entity, not the attempt. Neo4j and AWS Neptune are the production-grade stores; pgvector handles the embedding-similarity side when you want to find devices that behave like a known-fraud device without an exact fingerprint match. The graph traversal is cheap (often under 30 ms for a two-hop neighborhood query) and adds enough signal that we have seen fraud capture rates rise 8-12 points on amber-band cases after wiring it in.
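To illustrate "score the entity, not the attempt", here is a small in-memory sketch of a two-hop neighborhood query. It stands in for what a Neo4j or Neptune traversal would do; the function names, the node-id format, and the simple "share of neighbors already labeled fraud" score are assumptions for illustration, not the scoring function used in production.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency map from (node_a, node_b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def two_hop_neighbors(adj, start):
    """All nodes reachable from `start` in one or two hops, excluding `start`."""
    one_hop = set(adj.get(start, ()))
    two_hop = set()
    for node in one_hop:
        two_hop |= adj.get(node, set())
    return (one_hop | two_hop) - {start}


def entity_score(adj, start, known_fraud):
    """Share of the two-hop neighborhood that is already labeled fraud."""
    hood = two_hop_neighbors(adj, start)
    if not hood:
        return 0.0
    return len(hood & known_fraud) / len(hood)
```

With account A and account B sharing one device, a fraud label on B raises A's score even though A's own attempt looks clean in isolation.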
An auth-boundary decision function (TypeScript)
Below is the actual shape of the decision function we ship. It runs rules first. Takes the ML score. Gates the LLM review hop to the amber band. Emits an audit-log line on every path. The interesting work is not the model call. The interesting work is the contract between layers and the audit-log emission. Reviewers read this log. Regulators read this log. Post-incident analysis depends on it.
// Auth-boundary decision function. Hot path; budget 400 ms p95.
// Emits a single allow / step_up / decline + audit-log row.
import { runRules } from './rules-engine'   // Stripe Radar + in-house
import { mlScore } from './ml-score'        // Sift / Feedzai / SageMaker
import { graphScore } from './risk-graph'   // Neo4j 2-hop
import { llmReview } from './llm-review'    // Claude Opus 4, scoped tools
import { auditLog } from './audit-log'      // append-only, immutable

export type Decision = 'allow' | 'step_up' | 'decline'

export interface AuthAttempt {
  account_id: string
  action: 'login' | 'payment' | 'recovery' | 'device_add'
  amount_cents?: number
  device: { fp: string; risk: number }
  ip: string
  payee_id?: string
  ts: number
}

export async function decide(att: AuthAttempt): Promise<Decision> {
  const traceId = crypto.randomUUID()

  // 1. Rules first: cheap, explainable, deterministic.
  const rule = await runRules(att)
  if (rule.decision === 'decline') {
    await auditLog({ traceId, att, decision: 'decline',
                     layer: 'rules', reason: rule.rule_id })
    return 'decline'
  }

  // 2. ML score + graph score combined.
  const [ml, graph] = await Promise.all([mlScore(att), graphScore(att)])
  const score = 0.7 * ml.value + 0.3 * graph.value

  // Bands calibrated on 2026-Q1 eval, FPR=1.4%.
  if (score < 0.30) {
    await auditLog({ traceId, att, decision: 'allow',
                     layer: 'ml+graph', score, shap: ml.top_shap })
    return 'allow'
  }
  if (score > 0.85) {
    await auditLog({ traceId, att, decision: 'decline',
                     layer: 'ml+graph', score, shap: ml.top_shap })
    return 'decline'
  }

  // 3. Amber band: optional LLM-review hop, scoped tool calls only.
  const llm = await llmReview({ attempt: att, ml, graph,
                                budget_ms: 900 })
  await auditLog({ traceId, att, decision: llm.decision,
                   layer: 'llm', score, llm_model: llm.model,
                   llm_rationale: llm.rationale })
  return llm.decision
}
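
The function above returns `llm.decision` directly, so the reviewer's output has to be constrained before it gets that far. A minimal sketch of that validation step follows, assuming the reviewer replies with a JSON object carrying `decision` and `rationale` fields (a hypothetical reply format) and that malformed output should degrade to `step_up` (a design assumption for this example, not something the eval prescribes).

```python
import json

ALLOWED = {"allow", "step_up", "decline"}


def parse_review(raw, fallback="step_up"):
    """Validate a reviewer reply; anything malformed degrades to `fallback`."""
    rejected = {"decision": fallback, "rationale": "", "valid": False}
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return rejected          # not JSON at all
    if not isinstance(data, dict):
        return rejected          # JSON, but not an object
    decision = data.get("decision")
    rationale = data.get("rationale")
    if not isinstance(decision, str) or decision not in ALLOWED:
        return rejected          # decision outside the typed surface
    if not isinstance(rationale, str) or not rationale.strip():
        return rejected          # no rationale means nothing to audit
    return {"decision": decision, "rationale": rationale, "valid": True}
```

Rejecting replies without a rationale keeps the audit log usable: a decision with no persisted reason cannot be reviewed later.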
Eval harness for AI fraud detection (Python)
The LLM-review hop is the part of the stack most likely to ship without a measurement contract. Before you let a model decide on declines, you measure it against a labeled-fraud holdout, you measure the rationale quality with Ragas, and you watch the unit economics with Langfuse or Braintrust. The two-tab harness below is what we hand a buyer on week one. If you are wondering why we use a constrained reviewer agent here and not a free chatbot, our piece on AI agents vs chatbots walks the reasoning; in short, the auth boundary cannot tolerate open-ended reasoning, so the reviewer gets a typed tool surface and nothing else.
# Classifier eval over the 12,400-tx 2026-Q1 holdout.
import json

import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# One JSON object per line, carrying 'label_fraud' and 'score' fields.
with open('holdout_2026q1.jsonl') as f:
    holdout = [json.loads(line) for line in f]

y_true = np.array([r['label_fraud'] for r in holdout])
y_score = np.array([r['score'] for r in holdout])

# AUC does not depend on the threshold, so compute it once.
auc = roc_auc_score(y_true, y_score)

for thresh in [0.50, 0.60, 0.70, 0.80, 0.85]:
    y_pred = (y_score >= thresh).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr = fp / (fp + tn)
    print(f'thr={thresh:.2f}',
          f'recall={recall_score(y_true, y_pred):.3f}',
          f'precision={precision_score(y_true, y_pred):.3f}',
          f'fpr={fpr:.4f}',
          f'AUC={auc:.3f}')

# Eval the LLM reviewer's rationale text on amber-band cases.
# Faithfulness = did the rationale stay grounded in the case context?
# Answer-relevance = did it actually address the fraud question?
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Ragas expects its own column names, which vary by version; map
# case_context and reviewer_text onto them before calling evaluate().
cases = [...]  # 480 amber-band cases with case_context + reviewer_text
ds = Dataset.from_list(cases)
result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
print(result)

# 2026-Q1 baseline on Claude Opus 4:
#   faithfulness=0.91 answer_relevancy=0.88
# GPT-4o on same set: faithfulness=0.84 answer_relevancy=0.82
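
The threshold sweep in the first tab prints metrics per threshold; turning that sweep into a band edge can be sketched without any dependencies. The helper names below are hypothetical. Because false-positive rate can only fall as the threshold rises, the lowest threshold that fits the FPR ceiling is also the one with the highest recall.

```python
def rates_at(threshold, scores, labels):
    """Return (recall, fpr) when everything scoring >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    recall = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return recall, fpr


def lowest_threshold_within_fpr(scores, labels, candidates, max_fpr):
    """Lowest candidate whose FPR is at or under `max_fpr`, as (thr, recall, fpr)."""
    for t in sorted(candidates):
        recall, fpr = rates_at(t, scores, labels)
        if fpr <= max_fpr:
            return t, recall, fpr
    return None  # no candidate satisfies the ceiling
```

On a real holdout, `scores` and `labels` would come from the same JSONL file the harness reads, and `max_fpr` would be the tolerance the business has agreed to.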
Eval scoreboard: 2026-Q1 results on a labeled-fraud corpus
Numbers below are from our internal 2026-Q1 eval on a 12,400-transaction labeled-fraud holdout drawn from a fintech operator's auth-boundary traffic. The hybrid stack (rules plus ML plus LLM-review on the amber band only) is the row that matters; the rules-only and ML-only rows are baselines. We publish the eval harness with the engagement; numbers below are not generic vendor claims.
False-positive cost: the line item nobody puts on the slide
Every vendor pitch leads with recall. Almost none lead with the declined-good-customer rate, which is the line item that costs you real revenue. A rules-only policy that declines 4% of legitimate payments can bleed more revenue than the fraud loss it prevents, especially at retail-scale GMV. Comparing three policies on the same 2026-Q1 corpus shows the trade-off: hybrid wins on capture rate and on false-positive rate; the cost is the LLM token spend on the amber band, which on our run was about $0.04 per amber-band case routed to Claude Opus 4.
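The arithmetic behind that trade-off can be sketched in a few lines. Only the 4% rules-only decline rate, the 1.4% FPR and 92% recall for hybrid, and the $0.04 per amber-band case come from this guide. Everything else here is a hypothetical operator invented for illustration: the 10M monthly GMV, the 0.5% fraud rate by value, the 10% margin, the 70% rules-only capture rate, and the 20,000 amber-band cases. The model also assumes uniform ticket sizes and counts only immediate lost margin on a false decline.

```python
def policy_cost(gmv, fraud_rate, legit_decline_rate, capture_rate,
                margin, amber_cases=0, llm_cost_per_case=0.0):
    """Rough cost of a decline policy over one period, in the currency of `gmv`."""
    legit_volume = gmv * (1 - fraud_rate)
    fraud_volume = gmv * fraud_rate
    lost_margin = legit_volume * legit_decline_rate * margin  # good customers declined
    missed_fraud = fraud_volume * (1 - capture_rate)          # fraud that got through
    llm_spend = amber_cases * llm_cost_per_case               # amber-band review cost
    return {
        "lost_margin": lost_margin,
        "missed_fraud": missed_fraud,
        "llm_spend": llm_spend,
        "total": lost_margin + missed_fraud + llm_spend,
    }


# Hypothetical operator (see assumptions in the text above).
rules_only = policy_cost(10_000_000, 0.005, 0.04, 0.70, 0.10)
hybrid = policy_cost(10_000_000, 0.005, 0.014, 0.92, 0.10,
                     amber_cases=20_000, llm_cost_per_case=0.04)
```

Under these made-up inputs the lost-margin term dominates the rules-only total, which is the point of the section: the false-positive line item can outweigh the fraud line item. Swap in your own GMV, margin, and rates before drawing any conclusion.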
AI fraud detection platform comparison: Sift vs Feedzai vs Stripe Radar
Buyers usually compare two or three of these. They have different shapes and different ideal customers. The framing below is what we walk through on a discovery audit. For a broader take on how to evaluate platform vendors, our generative AI services buyer's guide covers the RFP shape; the same rubric works for fraud platforms.
| | Sift / Feedzai (full-stack platform) | Stripe Radar + in-house ML + LLM-review |
|---|---|---|
| Best for | Teams without a senior ML practice who want ingestion plus scoring plus case-management plus reporting in one product. Drift monitoring is in the product. Vendor owns the model lifecycle. | Teams already on Stripe with a payments scope and a Python or Go ML team. Stripe Radar handles deterministic rules and base scoring; you own the threshold and the LLM-review layer with Claude Opus 4 or GPT-4o. |
| Trade-offs | Less control over the score function, monthly platform spend, audit-log shape is the vendor's, harder to layer your own LLM-review on top without losing some signal. | You carry drift monitoring, retraining cadence, and audit-log shape yourself. Cheaper at scale; more engineering to stand up. |
Outside payments-only scope, Mastercard Decision Intelligence is the option to consider for card-network signal, and ComplyAdvantage or Unit21 are the case-management plays for AML overlap. None of these are the wrong answer in isolation; the wrong answer is buying without an eval set you can run against each one.
Production gotchas: drift, adverse action, and the audit log
Six things break in production that the demo did not show. The callout below lists them in the order we usually see them surface during the first 90 days post go-live.
Cost anchors that actually move the basis-point needle
Three numbers that should anchor a fraud-program business case. The first is the loss anchor, the second is the per-decision compute cost, and the third is the basis-point reduction band we see when an eval-disciplined stack replaces drift-ignored rules. Each one is sourced from a public report or from a dated internal eval, not from a vendor deck.
FAQ: AI fraud detection at the auth boundary
What is AI fraud detection at the auth boundary?
Real-time scoring of login, payment, account-recovery, and device-add attempts using a composition of rules, ML, and an optional LLM-review hop. The decision binds within a 100-400 ms window, emits a single allow / step-up / decline, and writes an immutable audit-log row used downstream for adverse-action review, dispute handling, and regulator look-back.
Rules vs ML vs LLM: which one should make the binding decision?
Composition, not a single winner. Rules handle the cheap and explainable cases at 5-20 ms. ML score handles graded risk at 40-120 ms with drift monitoring. LLM review is reserved for the amber band where a human reviewer would also have to think; it runs at 400-1200 ms and is gated by token cost. We bind decisions on rules and ML; LLM-review adjudicates the amber band only.
How do we measure whether the platform is working?
Precision plus recall plus false-positive rate against a labeled-fraud holdout you control, with p95 latency and per-decision cost on top. Rationale quality on the LLM-review layer measured with Ragas faithfulness and answer-relevancy. On our 2026-Q1 12,400-tx eval the hybrid stack hit 92% recall at 1.4% FPR with 310 ms p95. If your vendor cannot produce comparable numbers on your corpus, the engagement is selling something else.
What about adverse-action notice and explainability?
Declines based on an ML score can count as adverse action under ECOA and FCRA, and under equivalent EU and UK regimes, typically where the decision touches credit or relies on consumer-report data. When they do, you must produce the reason. Persist SHAP top-3 features for ML-driven declines, the LLM rationale text for amber-band declines, and the rule id for rule-driven declines. Compute SHAP offline and persist on the audit-log row; never compute on the hot path.
Sift, Feedzai, Stripe Radar, Alloy, Persona: which one do we pick?
It depends on whether you want a full-stack platform (Sift, Feedzai, Mastercard Decision Intelligence) or a build-your-own composition (Stripe Radar plus in-house ML plus an LLM-review hop on Claude Opus 4 or GPT-4o). Identity-side vendors like Alloy and Persona are usually complementary, not substitutes. Run all of them against the same labeled-fraud holdout before picking.
When should we NOT add an LLM-review layer?
When your ML score plus rules already operate inside your false-positive tolerance band, when your p95 latency budget will not absorb 400-900 ms of model time, or when your team has not yet wired Langfuse or Braintrust traces. An LLM review hop without observability and without a rationale-quality eval is a regression. We will tell a buyer to skip it in writing on the audit if that is the call.
How does this work with model-risk management?
Treat the LLM-review hop as a model under SR 11-7 in US banking, and under equivalent guidance in other jurisdictions. Freeze model versions, document the eval, persist rationale and tool-call traces, and run drift checks on the score band the model affects. Vendors who cannot answer what model version made a decision six months ago will not pass an exam.
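One common way to run a drift check on a score band is the Population Stability Index (PSI), which compares how scores fall across bins in a baseline window against a current window. PSI is our choice for this sketch, not something the guidance mandates; the function names and the amber-band bin edges used in the usage note are assumptions for illustration.

```python
import math


def band_shares(values, edges, eps=1e-6):
    """Fraction of in-band values per bin; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts) or 1
    # Floor at eps so empty bins do not produce log(0).
    return [max(c / total, eps) for c in counts]


def psi(baseline, current, edges):
    """Population Stability Index of `current` against `baseline` over shared bins."""
    b = band_shares(baseline, edges)
    c = band_shares(current, edges)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

For the amber band one might use edges such as `[0.30, 0.45, 0.60, 0.75, 0.85]` and compare last quarter's scores with this week's. A frequently cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the alert threshold is a policy decision to document alongside the eval.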
Talk to engineering
The honest play on AI fraud detection at the auth boundary is the same as every other production AI shape we ship: name your metric, build your eval set, compose your layers, audit-log everything, rehearse the rollback. The platform you pick matters less than the eval discipline you wrap around it. If you want a structured conversation about your auth-boundary stack instead of a vendor demo, we run a discovery audit that ends with working code and a written recommendation. You take that recommendation whether or not you continue with us. Walk-away clause on every audit.
Start the audit. Take the recommendation, with or without us.
We ship eval-first, model-agnostic AI fraud detection systems for fintech and payments operators. Engineering reads every inbound. Our delivery follows SOC 2-ready practices; we flag honestly with procurement that we are not certified as a vendor.