
AI Fraud Detection at the Auth Boundary: Operator Architecture (2026)

Auth-boundary fraud detection done eval-first: hybrid rules + ML + LLM with audit logs, false-positive cost math, and the production architecture we ship. With a walk-away clause.

Fraud-detection decision flow

Most AI fraud detection vendor pages sell a platform. This is not one of those. We scope the question to the auth boundary: the login screen, the payment confirmation, the account-recovery flow. That is where 70-80% of consumer-fraud loss now lands per FTC reporting, and it is also where you have the smallest decision surface and the tightest latency budget. Get the decision right at the auth boundary and the rest of the fraud program becomes tractable. (See our AI agent development company and chatbot development services practices for the production architecture we deploy this on.)

We ship AI fraud detection systems for fintech and payments operators. Our delivery team has built rules layers in Stripe Radar, scored transactions through Sift and Feedzai, wired Neo4j risk graphs, and added LLM-review hops with Claude Opus 4. This guide is what we wish buyers had before they bought: honest about where each layer fails, named tools throughout, code you can read, and a dated 2026-Q1 eval on our own 12,400-transaction holdout.

What AI fraud detection at the auth boundary actually means

The auth boundary is the moment a user attempts a state-changing action. A login. A payment. A password reset or a device add. The system has one shot: allow the attempt, send it to step-up review, or decline it outright. The window is typically 100-400 ms. Inside that window an AI fraud detection platform pulls signals (device fingerprint, behavioral biometrics, velocity counters, payee history), composes a risk score, then applies a policy and emits an audit-log line. Everything downstream is shaped by that one decision. Chargeback exposure. Refund queue depth. Manual-review load on the ops team.

This is not the same as analytic-warehouse fraud detection. A nightly batch job that flags suspicious transactions in your data lake is useful, but it cannot prevent loss; it can only file a report after money moves. Auth-boundary detection is real-time, decision-binding, and on the hot path. That is why the architecture below matters.

Auth-boundary decision path
1. Auth attempt (login / pay / reset)
2. Signal collection (device + behavior + graph)
3. Rules + ML score (100-200 ms budget)
4. LLM review hop (amber band only)
5. Allow / step-up / decline (audit log emitted)

Why AI fraud detection lives at the auth boundary, not the data warehouse

Vendors love the warehouse story because it does not require integrating into the hot path. It also does not stop fraud. By the time a batch model flags a payment, the funds have settled and the chargeback window is open. The auth boundary is where prevention happens. It is also where regulators care most, because adverse-action notices and step-up flows live there. Our sibling brand getwidget applies the same eval-first discipline on the application side: measured systems, not vendor decks.

The 5-layer AI fraud detection architecture we ship

Five layers, each owning one job. Signals sit at the bottom: device fingerprint, behavioral biometrics, payee history, IP, and velocity counters. Features are computed from those signals using Tecton or Feast for online aggregates, plus pgvector for similarity lookups against known-fraud embeddings. The rules layer catches the cheap, explainable cases (Stripe Radar rules plus Alloy policy plus in-house regex). The ML score layer adds graded risk via gradient-boosted trees from Sift, Feedzai, or your own SageMaker model. An optional LLM-review layer reasons over the harder amber band with Claude Opus 4 or GPT-4o, but only when the cost and latency budget allow it. The decision engine then composes the layers, emits a single allow / step-up / decline, and writes the audit log. The rest of this guide gives examples per layer, plus the implementation code we hand a buyer on week one.

5-LAYER AI FRAUD DETECTION STACK
Layer | Job | Named tools | Latency
1. Signals (device, IP, behavior, payee, velocity) | Collect raw evidence on every auth attempt. PII-aware, consent-logged. | Persona, Plaid, Sardine; in-house device SDK; Cloudflare bot signals | 10-40 ms
2. Features (online + offline, embedding lookup) | Materialize aggregates; nearest-neighbor lookup vs known-fraud vectors. | Tecton, Feast; pgvector / Redis; DynamoDB counters | 20-60 ms
3. Rules (cheap, explainable, deterministic catches) | Decline known bad actors; enforce velocity caps; explain every decline. | Stripe Radar rules; Alloy, Unit21, ComplyAdvantage; in-house policy engine | 5-20 ms
4. ML score (graded risk band: red / amber / green) | Score every attempt; calibrate thresholds against eval set; drift-monitored weekly. | Sift, Feedzai; Mastercard Decision Intelligence; SageMaker XGBoost in-house | 40-120 ms
5. LLM review (amber band only, explainable rationale) | Reason over case context with constrained tool calls; emit decision + reason text. | Claude Opus 4, GPT-4o; LangGraph state, MCP tools; Langfuse traces | 400-1200 ms
Decision engine composes layers 1-5, emits allow / step-up / decline, writes audit-log row. SHAP values + LLM rationale persisted on every decline for adverse-action review.
Figure 1: Auth-boundary fraud stack with named tools per layer. The LLM-review hop is optional and gated to the amber band only.
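
The latency column in Figure 1 is worth adding up before you commit to a budget. The sketch below does that arithmetic, assuming the layers run strictly one after another (real deployments overlap some of them, so treat the totals as an upper bound). The ranges are copied from the figure; the constant and function names are illustrative, not from the shipped code.

```python
# Per-layer latency ranges in milliseconds, copied from Figure 1.
LAYER_LATENCY_MS = {
    "signals": (10, 40),
    "features": (20, 60),
    "rules": (5, 20),
    "ml_score": (40, 120),
    "llm_review": (400, 1200),
}

FAST_PATH = ["signals", "features", "rules", "ml_score"]
AMBER_PATH = FAST_PATH + ["llm_review"]


def path_budget(layers):
    """Sum the low and high ends of each layer's range (sequential assumption)."""
    low = sum(LAYER_LATENCY_MS[name][0] for name in layers)
    high = sum(LAYER_LATENCY_MS[name][1] for name in layers)
    return low, high


fast_low, fast_high = path_budget(FAST_PATH)
amber_low, amber_high = path_budget(AMBER_PATH)
print(f"fast path:  {fast_low}-{fast_high} ms")
print(f"amber path: {amber_low}-{amber_high} ms")
```

Under that assumption the rules-plus-ML path stays inside the 100-400 ms window even at the high end, while any attempt routed through the LLM hop does not. That is the arithmetic behind gating the hop to the amber band.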

Rules vs ML vs LLM: where each one earns its keep

The honest answer is that you need all three, but only one of them should make the binding decision. Rules win on explainability and latency. ML wins on graded risk and drift discipline. LLM review wins on hard, narrative cases where a human reviewer would also have to think. The question is composition. We have written more on whether the LLM hop should be a single call or a chain in our take on multi-agent orchestration patterns; the short version is that at the auth boundary, you almost never want a chain. One constrained call, one tool surface, one decision back.

Criterion | Rules | ML score | LLM review
Latency at p95 | 5-20 ms (best) | 40-120 ms | 400-1200 ms (worst)
Explainability | Native, every rule traced | Needs SHAP or LIME | Natural-language rationale, must be persisted
Drift posture | Stale silently | Drift-monitored, retrainable | Model updates change behavior; freeze versions
False-positive cost | High on blunt rules | Tunable via threshold | Worse than ML if used as primary gate; better when scoped to amber band
Audit-log shape | Rule id + values | Score + top SHAP features | Prompt, tool calls, rationale, model+version
Each layer fails honestly somewhere. Pick composition, not a single winner.

Risk graph: account, device, payee, and IP edges

The signals are not independent. One stolen device fingerprint will reach across dozens of mule accounts, the same payee will pull from many compromised accounts, and a single residential proxy IP will rotate through hundreds of attempts. The risk graph models this as a property graph: account nodes connected to device nodes and payee nodes and IP nodes by typed edges. Score the entity, not the attempt. Neo4j and AWS Neptune are the production-grade stores; pgvector handles the embedding-similarity side when you want to find devices that behave like a known-fraud device without an exact fingerprint match. The graph traversal is cheap (often under 30 ms for a two-hop neighborhood query) and adds enough signal that we have seen fraud capture rates rise 8-12 points on amber-band cases after wiring it in.

Risk graph: 2-hop neighborhood around the auth attempt
Center node: Account a_42910.
1-hop edges (solid, direct relations on this auth attempt): USES_DEVICE to Device d_FP_88e2; PAYS_TO to Payee p_77310; FROM_IP to IP 81.4.x.x; RECOVERY_EMAIL to Email e_burner.
2-hop neighbors (dashed gold): Acct b (on d_FP_88e2), Acct c, Acct d (pays p_77310), Acct e.
Escalation rule: if >= 3 of the 2-hop neighbors carry a known-fraud flag, the entity score escalates to amber regardless of the per-attempt ML score.
Stores: Neo4j (graph) + pgvector (device-similarity embedding). Two-hop query budget: 25-35 ms typical, 60 ms p95.
Figure 2: Account-device-payee-IP graph view. One stolen device touches many accounts; one mule payee pulls from many. Score the entity, not the transaction.
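
The escalation rule in Figure 2 (three or more flagged 2-hop neighbors pushes the entity to amber) can be sketched without a graph database. This is a minimal in-memory version, not the Neo4j query we run in production. The edge list loosely mirrors the figure, but which entity `acct_c` and `acct_e` hang off is an assumption, and all function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def two_hop_neighbors(adj, start):
    """Nodes exactly two hops from start, excluding start and its 1-hop set."""
    one_hop = adj.get(start, set())
    reached = set()
    for node in one_hop:
        reached |= adj.get(node, set())
    return reached - one_hop - {start}


def escalate_to_amber(adj, account, known_fraud, min_flagged=3):
    """True when enough 2-hop neighbors carry a known-fraud flag."""
    flagged = two_hop_neighbors(adj, account) & known_fraud
    return len(flagged) >= min_flagged


# Edges loosely following Figure 2; acct_c and acct_e placement is assumed.
EDGES = [
    ("a_42910", "d_FP_88e2"), ("a_42910", "p_77310"),
    ("a_42910", "ip_81.4.x.x"), ("a_42910", "e_burner"),
    ("acct_b", "d_FP_88e2"), ("acct_c", "d_FP_88e2"),
    ("acct_d", "p_77310"), ("acct_e", "ip_81.4.x.x"),
]
graph = build_graph(EDGES)
print(escalate_to_amber(graph, "a_42910", {"acct_b", "acct_c", "acct_d"}))
```

The point of the sketch is the shape of the check: the attempt itself can look clean and still escalate because of who shares its device, payee, or IP.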

An auth-boundary decision function (TypeScript)

Below is the actual shape of the decision function we ship. It runs rules first. Takes the ML score. Gates the LLM review hop to the amber band. Emits an audit-log line on every path. The interesting work is not the model call. The interesting work is the contract between layers and the audit-log emission. Reviewers read this log. Regulators read this log. Post-incident analysis depends on it.

auth-boundary-decide.ts typescript
// Auth-boundary decision function. Hot path; budget 400 ms p95.
// Emits a single allow / step_up / decline + audit-log row.
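import * as crypto from 'node:crypto'            // explicit import; crypto is not a global on older Node versions
import { runRules } from './rules-engine'        // Stripe Radar + in-house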
import { mlScore }  from './ml-score'             // Sift / Feedzai / SageMaker
import { graphScore } from './risk-graph'         // Neo4j 2-hop
import { llmReview }  from './llm-review'         // Claude Opus 4, scoped tools
import { auditLog }   from './audit-log'          // append-only, immutable

export type Decision = 'allow' | 'step_up' | 'decline'

export interface AuthAttempt {
  account_id: string
  action: 'login' | 'payment' | 'recovery' | 'device_add'
  amount_cents?: number
  device: { fp: string; risk: number }
  ip: string
  payee_id?: string
  ts: number
}

export async function decide(att: AuthAttempt): Promise<Decision> {
  const traceId = crypto.randomUUID()

  // 1. Rules first: cheap, explainable, deterministic.
  const rule = await runRules(att)
  if (rule.decision === 'decline') {
    await auditLog({ traceId, att, decision: 'decline',
      layer: 'rules', reason: rule.rule_id })
    return 'decline'
  }

  // 2. ML score + graph score combined.
  const [ml, graph] = await Promise.all([mlScore(att), graphScore(att)])
  const score = 0.7 * ml.value + 0.3 * graph.value

  // Bands calibrated on the 2026-Q1 eval (FPR=1.4%).
  if (score < 0.30) {
    await auditLog({ traceId, att, decision: 'allow',
      layer: 'ml+graph', score, shap: ml.top_shap })
    return 'allow'
  }
  if (score > 0.85) {
    await auditLog({ traceId, att, decision: 'decline',
      layer: 'ml+graph', score, shap: ml.top_shap })
    return 'decline'
  }

  // 3. Amber band: optional LLM-review hop, scoped tool calls only.
  const llm = await llmReview({ attempt: att, ml, graph,
    budget_ms: 900 })
  await auditLog({ traceId, att, decision: llm.decision,
    layer: 'llm', score, llm_model: llm.model,
    llm_rationale: llm.rationale })
  return llm.decision
}
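
One thing the function above does not show is what happens when the LLM hop overruns its `budget_ms` or errors out. The sketch below is one way to handle it, written in Python to match the eval harness. It assumes that a timeout, an exception, or an out-of-vocabulary answer should degrade to `step_up` rather than to allow or decline. That fallback choice, the reviewer stubs, and the function names are all assumptions for illustration, not the shipped behavior.

```python
import asyncio

VALID_DECISIONS = ("allow", "step_up", "decline")


async def review_with_budget(reviewer, case, budget_ms=900, fallback="step_up"):
    """Run the reviewer under a hard deadline; fall back if it overruns or fails."""
    try:
        decision = await asyncio.wait_for(reviewer(case), timeout=budget_ms / 1000)
    except asyncio.TimeoutError:
        return {"decision": fallback, "reason": "llm_timeout"}
    except Exception as exc:  # reviewer raised; never let that leak to the caller
        return {"decision": fallback, "reason": f"llm_error:{type(exc).__name__}"}
    if decision not in VALID_DECISIONS:
        return {"decision": fallback, "reason": "llm_invalid_output"}
    return {"decision": decision, "reason": "llm"}


# Hypothetical stand-ins for the real reviewer call.
async def fast_reviewer(case):
    return "allow"


async def slow_reviewer(case):
    await asyncio.sleep(0.2)
    return "allow"


print(asyncio.run(review_with_budget(fast_reviewer, {})))
print(asyncio.run(review_with_budget(slow_reviewer, {}, budget_ms=20)))
```

Whatever fallback you choose, write the `reason` to the audit log so a timeout-driven step-up is distinguishable from a model-driven one.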

Eval harness for AI fraud detection (Python)

The LLM-review hop is the part of the stack most likely to ship without a measurement contract. Before you let a model decide on declines, you measure it against a labeled-fraud holdout, you measure the rationale quality with Ragas, and you watch the unit economics with Langfuse or Braintrust. The two-tab harness below is what we hand a buyer on week one. If you are wondering why we use a constrained reviewer agent here and not a free chatbot, our piece on ai agents vs chatbots walks the reasoning; in short, the auth boundary cannot tolerate open-ended reasoning, so the reviewer gets a typed tool surface and nothing else.

eval_classifier.py python
# Classifier eval over the 12,400-tx 2026-Q1 holdout.
import json
import numpy as np
from sklearn.metrics import (precision_score, recall_score,
                             roc_auc_score, confusion_matrix)

# Load one JSON record per line; the context manager closes the file.
with open('holdout_2026q1.jsonl') as f:
    holdout = [json.loads(line) for line in f]
y_true  = np.array([r['label_fraud'] for r in holdout])
y_score = np.array([r['score'] for r in holdout])

# AUC does not depend on the threshold, so compute it once.
print(f'AUC={roc_auc_score(y_true, y_score):.3f}')

for thresh in [0.50, 0.60, 0.70, 0.80, 0.85]:
    y_pred = (y_score >= thresh).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn)
    print(f'thr={thresh:.2f}',
          f'recall={recall_score(y_true, y_pred):.3f}',
          f'precision={precision_score(y_true, y_pred):.3f}',
          f'fpr={fpr:.4f}')

eval_llm_rationale.py python
# Eval the LLM reviewer's rationale text on amber-band cases.
# Faithfulness = did the rationale stay grounded in the case context?
# Answer-relevance = did it actually address the fraud question?
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

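cases = [...]  # 480 amber-band cases with case_context + reviewer_text
# Ragas reads fixed column names (question / answer / contexts in 0.1.x;
# later releases rename them), so map case_context and reviewer_text to
# whatever your installed version expects before calling evaluate().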
ds = Dataset.from_list(cases)
result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
print(result)
# 2026-Q1 baseline on Claude Opus 4:
#   faithfulness=0.91  answer_relevancy=0.88
# GPT-4o on same set: faithfulness=0.84  answer_relevancy=0.82
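
The threshold sweep in eval_classifier.py prints a table; the decision it feeds is "highest recall that still fits the false-positive budget". A minimal, dependency-free sketch of that selection step follows. The function names and the toy labels and scores are made up for illustration and are not drawn from the holdout.

```python
def confusion_at(y_true, y_score, thresh):
    """Count tp, fp, tn, fn when score >= thresh is treated as 'fraud'."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted_fraud = score >= thresh
        if predicted_fraud and label:
            tp += 1
        elif predicted_fraud:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def pick_threshold(y_true, y_score, thresholds, max_fpr):
    """Return (threshold, recall, fpr) with the best recall under max_fpr, or None."""
    best = None
    for t in thresholds:
        tp, fp, tn, fn = confusion_at(y_true, y_score, t)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        if fpr <= max_fpr and (best is None or recall > best[1]):
            best = (t, recall, fpr)
    return best


# Toy data only: 3 fraud cases, 5 legitimate ones.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
print(pick_threshold(toy_labels, toy_scores, [0.50, 0.60, 0.70, 0.80, 0.85], max_fpr=0.10))
```

Fix `max_fpr` from the business side first (the declined-good-customer rate you can tolerate), then let recall fall out of the data, not the other way round.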

Eval scoreboard: 2026-Q1 results on a labeled-fraud corpus

Numbers below are from our internal 2026-Q1 eval on a 12,400-transaction labeled-fraud holdout drawn from a fintech operator's auth-boundary traffic. The hybrid stack (rules plus ML plus LLM-review on the amber band only) is the row that matters; the rules-only and ML-only rows are baselines. We publish the eval harness with the engagement; numbers below are not generic vendor claims.

Internal eval, 2026-Q1, 12,400-tx labeled-fraud holdout
Recall (hybrid): 92%, vs 71% rules-only baseline at FPR 0.9%
FPR (hybrid): 1.4%, declined-good-customer rate inside loss-tolerance band
p95 latency: 310 ms, end-to-decision incl. LLM hop on amber band
Rationale faithfulness: 0.91, Ragas score on Claude Opus 4 reviewer

False-positive cost: the line item nobody puts on the slide

Every vendor pitch leads with recall. Almost none lead with the declined-good-customer rate, which is the line item that costs you real revenue. A rules-only policy that declines 4% of legitimate payments is bleeding more than the fraud it catches, especially at retail-scale GMV. The comparison below shows the trade-off across three policies on the same 2026-Q1 corpus. Hybrid wins on capture rate and on false-positive rate; the cost is the LLM token spend on the amber band, which on our run was about $0.04 per amber-band case routed to Claude Opus 4.

Declined-good-customer rate by policy (2026-Q1, lower is better)
Rules-only baseline: 4.1%. High explainability, blunt; declines too many legitimate customers in spike traffic.
ML-only (Sift / Feedzai): 2.3%. Tunable; threshold calibration is the active variable; needs drift monitoring.
Hybrid (rules + ML + LLM amber): 1.4%. Lowest in this eval. LLM-review reduces FP on amber band where ML alone declines borderline-good customers.
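
To turn those rates into a line item, multiply through by your own volume. The sketch below uses the three declined-good-customer rates and the $0.04 per amber-band case from this eval. Everything else (one million legitimate attempts a month, a $60 average order, 5% of traffic landing in the amber band, every false decline counted as a fully lost order) is a hypothetical input chosen only to make the arithmetic concrete.

```python
# Rates from the 2026-Q1 eval above.
FALSE_DECLINE_RATE = {"rules_only": 0.041, "ml_only": 0.023, "hybrid": 0.014}
LLM_COST_PER_AMBER_CASE = 0.04  # USD, from the same eval run


def monthly_false_decline_cost(policy, legit_attempts, avg_order_value,
                               amber_share=0.0):
    """Lost revenue from declined-good customers, plus LLM spend for hybrid.

    Assumes each false decline loses the whole order (no retry), which
    overstates the loss if some customers come back.
    """
    lost_revenue = FALSE_DECLINE_RATE[policy] * legit_attempts * avg_order_value
    llm_spend = 0.0
    if policy == "hybrid":
        llm_spend = amber_share * legit_attempts * LLM_COST_PER_AMBER_CASE
    return lost_revenue + llm_spend


# Hypothetical operator profile; replace with your own numbers.
ATTEMPTS, AOV, AMBER_SHARE = 1_000_000, 60.0, 0.05
for policy in FALSE_DECLINE_RATE:
    cost = monthly_false_decline_cost(policy, ATTEMPTS, AOV, AMBER_SHARE)
    print(f"{policy:>10}: ${cost:,.0f} / month")
```

With those inputs the rules-only policy loses roughly $2.46M a month to false declines against roughly $0.84M for hybrid, and the LLM spend is about $2,000 of the hybrid figure. The ratio, not the absolute numbers, is what carries over to other volumes.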

AI fraud detection platform comparison: Sift vs Feedzai vs Stripe Radar

Buyers usually compare two or three of these. They have different shapes and different ideal customers. The framing below is what we walk through on a discovery audit. For a broader take on how to evaluate platform vendors, our generative ai services buyers guide covers the RFP shape; the same rubric works for fraud platforms.

Criterion | Sift / Feedzai (full-stack platform) | Stripe Radar + in-house ML + LLM-review
Best for | Teams without a senior ML practice who want ingestion plus scoring plus case-management plus reporting in one product. Drift monitoring is in the product. Vendor owns the model lifecycle. | Teams already on Stripe with a payments scope and a Python or Go ML team. Stripe Radar handles deterministic rules and base scoring; you own the threshold and the LLM-review layer with Claude Opus 4 or GPT-4o.
Trade-offs | Less control over the score function, monthly platform spend, audit-log shape is the vendor's, harder to layer your own LLM-review on top without losing some signal. | You carry drift monitoring, retraining cadence, and audit-log shape yourself. Cheaper at scale; more engineering to stand up.

Outside payments-only scope, Mastercard Decision Intelligence is the option to consider for card-network signal, and ComplyAdvantage or Unit21 are the case-management plays for AML overlap. None of these are the wrong answer in isolation; the wrong answer is buying without an eval set you can run against each one.

Production gotchas — drift, adverse action, and the audit log

Six things break in production that the demo did not show. The callout below lists them in the order we usually see them surface during the first 90 days post go-live.

Cost anchors that actually move the basis-point needle

Three numbers that should anchor a fraud-program business case. The first is the loss anchor, the second is the per-decision compute cost, and the third is the basis-point reduction band we see when an eval-disciplined stack replaces drift-ignored rules. Each one is sourced from a public report or from a dated internal eval, not from a vendor deck.

FAQ: AI fraud detection at the auth boundary

What is AI fraud detection at the auth boundary?

Real-time scoring of login, payment, account-recovery, and device-add attempts using a composition of rules, ML, and an optional LLM-review hop. The decision binds within a 100-400 ms window, emits a single allow / step-up / decline, and writes an immutable audit-log row used downstream for adverse-action review, dispute handling, and regulator look-back.

Rules vs ML vs LLM — which one should make the binding decision?

Composition, not a single winner. Rules handle the cheap and explainable cases at 5-20 ms. ML score handles graded risk at 40-120 ms with drift monitoring. LLM review is reserved for the amber band where a human reviewer would also have to think; it runs at 400-1200 ms and is gated by token cost. We bind decisions on rules and ML; LLM-review adjudicates the amber band only.

How do we measure whether the platform is working?

Precision plus recall plus false-positive rate against a labeled-fraud holdout you control, with p95 latency and per-decision cost on top. Rationale quality on the LLM-review layer measured with Ragas faithfulness and answer-relevancy. On our 2026-Q1 12,400-tx eval the hybrid stack hit 92% recall at 1.4% FPR with 310 ms p95. If your vendor cannot produce comparable numbers on your corpus, the engagement is selling something else.

What about adverse-action notice and explainability?

Declines based on an ML score can be adverse action under ECOA and FCRA in the US, and under equivalent EU and UK regimes, where the decision touches credit or relies on consumer-report data. Either way, you must be able to produce the reason. Persist SHAP top-3 features for ML-driven declines, the LLM rationale text for amber-band declines, and the rule id for rule-driven declines. Compute SHAP offline and persist on the audit-log row; never compute on the hot path.
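
One way to enforce "every decline carries its reason" is to make the audit-log writer refuse rows that lack the reason fields for the deciding layer. The sketch below reuses the layer names from the decision function (`rules`, `ml+graph`, `llm`). The field requirements, the function name, and the feature names in the example are illustrative assumptions, not our production schema.

```python
import json
import time

# Reason fields each deciding layer must supply (assumed mapping).
REQUIRED_BY_LAYER = {
    "rules": ("rule_id",),
    "ml+graph": ("score", "top_shap"),
    "llm": ("llm_model", "llm_rationale"),
}


def audit_row(trace_id, decision, layer, ts=None, **reason):
    """Serialize one audit-log row, rejecting rows with missing reason fields."""
    if layer not in REQUIRED_BY_LAYER:
        raise ValueError(f"unknown layer: {layer}")
    missing = [key for key in REQUIRED_BY_LAYER[layer]
               if reason.get(key) in (None, "", [])]
    if missing:
        raise ValueError(f"{layer} decision missing reason fields: {missing}")
    if "top_shap" in reason:
        # Keep only the top-3 features, as the adverse-action answer describes.
        reason["top_shap"] = list(reason["top_shap"])[:3]
    row = {
        "trace_id": trace_id,
        "decision": decision,
        "layer": layer,
        "ts": time.time() if ts is None else ts,
        "reason": reason,
    }
    return json.dumps(row, sort_keys=True)


# Hypothetical feature names and values, for shape only.
print(audit_row("t-1", "decline", "ml+graph", ts=0, score=0.91,
                top_shap=[["device_age_days", 0.31], ["velocity_1h", 0.22],
                          ["payee_new", 0.12], ["ip_asn", 0.05]]))
```

Failing loudly at write time is deliberate: a decline with no recorded reason is much cheaper to catch in staging than in a dispute.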

Sift, Feedzai, Stripe Radar, Alloy, Persona — which one do we pick?

It depends on whether you want a full-stack platform (Sift, Feedzai, Mastercard Decision Intelligence) or a build-your-own composition (Stripe Radar plus in-house ML plus an LLM-review hop on Claude Opus 4 or GPT-4o). Identity-side vendors like Alloy and Persona are usually complementary, not substitutes. Run all of them against the same labeled-fraud holdout before picking.

When should we NOT add an LLM-review layer?

When your ML score plus rules already operate inside your false-positive tolerance band, when your p95 latency budget will not absorb 400-900 ms of model time, or when your team has not yet wired Langfuse or Braintrust traces. An LLM review hop without observability and without a rationale-quality eval is a regression. We will tell a buyer to skip it in writing on the audit if that is the call.

How does this work with model-risk management?

Treat the LLM-review hop as a model under SR 11-7 in US banking, and under equivalent guidance in other jurisdictions. Freeze model versions, document the eval, persist rationale and tool-call traces, and run drift checks on the score band the model affects. Vendors who cannot answer what model version made a decision six months ago will not pass an exam.
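
For the drift check itself, this guide does not prescribe a statistic. One common choice is the Population Stability Index (PSI) between the score distribution at eval time and the live distribution. The sketch below is a plain-Python PSI over scores in [0, 1]; the bin count, the epsilon floor, and the sample data are assumptions, and both samples are assumed non-empty.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two non-empty samples of scores in [0, 1]."""
    def bucket_shares(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        # Floor empty bins at eps so the log term stays finite.
        return [max(c / len(scores), eps) for c in counts]

    e_shares = bucket_shares(expected)
    a_shares = bucket_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Toy samples: a uniform baseline and the same scores shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [min(0.99, s + 0.2) for s in baseline]
print(f"PSI, no shift: {psi(baseline, baseline):.3f}")
print(f"PSI, shifted:  {psi(baseline, shifted):.3f}")
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a shift worth investigating, but calibrate the alert level on your own traffic and run the check specifically on the amber band the reviewer touches.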

Talk to engineering

The honest play on AI fraud detection at the auth boundary is the same as every other production AI shape we ship: name your metric, build your eval set, compose your layers, audit-log everything, rehearse the rollback. The platform you pick matters less than the eval discipline you wrap around it. If you want a structured conversation about your auth-boundary stack instead of a vendor demo, we run a discovery audit that ends with working code and a written recommendation. You take that recommendation whether or not you continue with us. Walk-away clause on every audit.

Working on fraud at the auth boundary?

Start the audit. Take the recommendation, with or without us.

We ship eval-first, model-agnostic AI fraud detection systems for fintech and payments operators. Engineering reads every inbound. SOC 2-ready practices in our delivery; we flag with procurement honestly that we are not certified as a vendor.
