AI Fraud Detection at the Auth Boundary (2026)

Most ai fraud detection vendor pages sell a platform. This is not one of those. We scope the question to the auth boundary: the login screen, the payment confirm, the account-recovery flow. That is where 70-80% of consumer-fraud loss now lands per FTC reporting, and it is also where you have the smallest decision surface and the tightest latency budget. Get the decision right at the auth boundary and the rest of the fraud program becomes tractable. (See our AI agent development company and chatbot development services practices for the production architecture we deploy this on.)

We ship ai fraud detection systems for fintech and payments operators. Our delivery team has built rules layers in Stripe Radar, scored transactions through Sift and Feedzai, wired Neo4j risk graphs, and added LLM-review hops with Claude Opus 4. This ai fraud detection guide is what we wish buyers had before they bought. Honest where each layer fails, named tools throughout, code you can read, and a dated 2026-Q1 eval on our own 12,400-transaction holdout.

What ai fraud detection at the auth boundary actually means

The auth boundary is the moment a user attempts a state-changing action. A login. A payment. A password reset or a device add. The system has one shot. Allow the attempt or send it to step-up review or decline it outright. The window is typically 100-400 ms. Inside that window an ai fraud detection platform pulls signals (device fingerprint, behavioral biometrics, velocity counters, payee history), composes a risk score, then applies a policy and emits an audit-log line. Everything downstream is shaped by that one decision. Chargeback exposure. Refund queue depth. Manual-review load on the ops team.

This is not the same as analytic-warehouse fraud detection. A nightly batch job that flags suspicious transactions in your data lake is useful, but it cannot prevent loss; it can only file a report after money moves. Auth-boundary detection is real-time, decision-binding, and on the hot path. That is why the architecture below matters.

Auth-boundary decision path (diagram): Auth attempt → Signal collection (device + behavior + graph) → Rules + ML score (100-200 ms budget) → LLM review hop (amber band only) → Allow / Step-up / Decline (audit log emitted)

Why ai fraud detection lives at the auth boundary, not the data warehouse

Vendors love the warehouse story because it does not require integrating into the hot path. It also does not stop fraud. By the time a batch model flags a payment, the funds have settled and the chargeback window is open. The auth boundary is where prevention happens. It is also where regulators care most, because adverse-action notices and step-up flows live there. Our sibling brand getwidget pushes the same eval-first discipline on the application side; the point in both places is that we ship measured systems, not vendor decks.

The 5-layer ai fraud detection architecture we ship

Five layers, each owning one job. Signals at the bottom: device fingerprint plus behavioral biometrics plus payee history plus IP and velocity counters. Features computed from those signals using Tecton or Feast for online aggregates plus pgvector for similarity lookups against known-fraud embeddings. The rules layer catches the cheap explainable cases (Stripe Radar rules plus Alloy policy plus in-house regex). An ML score layer adds graded risk via gradient-boosted trees from Sift or Feedzai or your own SageMaker model. An optional LLM-review layer reasons over the harder amber band with Claude Opus 4 or GPT-4o, but only when the cost and latency budget allow it. The decision engine then composes the layers; it emits a single allow/step-up/decline and writes the audit log. This is the ai fraud detection architecture we ship; the rest of this guide gives you ai fraud detection examples per layer, plus the ai fraud detection implementation code we hand a buyer on week one.

5-LAYER AI FRAUD DETECTION STACK

Figure 1: Auth-boundary fraud stack with named tools per layer. The LLM-review hop is optional and gated to the amber band only.
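As one concrete instance of the signal layer, a velocity counter can be sketched as a sliding-window count per key. This is a minimal in-memory illustration, not the Tecton or Feast API: the class name `VelocityCounter`, the 60-second window, and the example key are assumptions made for the example, and it assumes timestamps arrive in non-decreasing order. In production the same aggregate would normally live in an online feature store.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Sliding-window attempt counter per key (hypothetical helper, in-memory only)."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, key: str, ts: float) -> int:
        """Record an attempt at time ts and return the count inside the window.

        Assumes ts values for a given key are non-decreasing.
        """
        q = self._events[key]
        q.append(ts)
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q)


counter = VelocityCounter(window_s=60.0)
counts = [counter.record("ip:203.0.113.7", t) for t in (0.0, 10.0, 20.0, 75.0)]
print(counts)  # [1, 2, 3, 2]
```

The last count drops back to 2 because the attempts at 0 s and 10 s are more than 60 seconds old by the time the attempt at 75 s arrives.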

Rules vs ML vs LLM: where each one earns its keep

The honest answer is that you need all three, but only one of them should make the binding decision. Rules win on explainability and latency. ML wins on graded risk and drift discipline. LLM review wins on hard, narrative cases where a human reviewer would also have to think. The question is composition. We have written more on whether the LLM hop should be a single call or a chain in our take on multi-agent orchestration patterns; the short version is that at the auth boundary, you almost never want a chain. One constrained call, one tool surface, one decision back.

Criterion	Rules	ML score	LLM review
Latency at p95	5-20 ms (best)	40-120 ms	400-1200 ms (worst)
Explainability	Native, every rule traced	Needs SHAP or LIME	Natural-language rationale, must be persisted
Drift posture	Stale silently	Drift-monitored, retrainable	Model updates change behavior; freeze versions
False-positive cost	High on blunt rules	Tunable via threshold	Worse than ML if used as primary gate; better when scoped to amber band
Audit-log shape	Rule id + values	Score + top SHAP features	Prompt, tool calls, rationale, model+version

Each layer fails honestly somewhere. Pick composition, not a single winner.
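To make the "every rule traced" cell concrete, here is a minimal sketch of a rules layer that returns the id of the rule that fired. The rule ids, thresholds, and field names are invented for illustration; this is not Stripe Radar or Alloy syntax.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    rule_id: str
    predicate: Callable[[dict], bool]  # True means the rule fires (decline)


# Hypothetical rules; ids and thresholds are placeholders.
RULES = [
    Rule("R001_velocity", lambda a: a.get("attempts_last_min", 0) > 10),
    Rule(
        "R002_new_device_large_amount",
        lambda a: a.get("device_age_days", 0) < 1 and a.get("amount_cents", 0) > 50_000,
    ),
]


def run_rules(attempt: dict) -> Optional[str]:
    """Return the id of the first rule that fires, or None if none do."""
    for rule in RULES:
        if rule.predicate(attempt):
            return rule.rule_id
    return None


fired = run_rules({"attempts_last_min": 2, "device_age_days": 0, "amount_cents": 90_000})
clean = run_rules({"attempts_last_min": 1, "device_age_days": 400, "amount_cents": 1_200})
print(fired, clean)  # R002_new_device_large_amount None
```

Because the returned value is the rule id itself, the audit-log entry for a rules decline needs no extra explanation step.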

Risk graph: account, device, payee, and IP edges

The signals are not independent. One stolen device fingerprint will reach across dozens of mule accounts, the same payee will pull from many compromised accounts, and a single residential proxy IP will rotate through hundreds of attempts. The risk graph models this as a property graph: account nodes connected to device nodes and payee nodes and IP nodes by typed edges. Score the entity, not the attempt. Neo4j and AWS Neptune are the production-grade stores; pgvector handles the embedding-similarity side when you want to find devices that behave like a known-fraud device without an exact fingerprint match. The graph traversal is cheap (often under 30 ms for a two-hop neighborhood query) and adds enough signal that we have seen fraud capture rates rise 8-12 points on amber-band cases after wiring it in.

RISK GRAPH: 4-NODE ENTITY VIEW

Figure 2: Account-device-payee-IP graph view. One stolen device touches many accounts; one mule payee pulls from many. Score the entity, not the transaction.
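Scoring the entity rather than the attempt can be illustrated with a two-hop neighborhood walk. The sketch below uses a plain adjacency dict in place of Neo4j or Neptune, and its scoring rule (the share of known-fraud nodes within two hops) is an assumption chosen for simplicity, not the production score function. All node names are made up.

```python
from collections import deque

# Toy undirected graph: account, device, payee, and IP nodes (all hypothetical).
EDGES = [
    ("acct:A", "dev:1"), ("acct:B", "dev:1"), ("acct:B", "payee:X"),
    ("acct:C", "payee:X"), ("acct:D", "ip:9"),
]
KNOWN_FRAUD = {"acct:B"}


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def neighborhood(adj, start, max_hops=2):
    """Breadth-first search; returns nodes within max_hops of start (excluding start)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}


def graph_score(adj, start, known_fraud, max_hops=2):
    """Share of the neighborhood that is known fraud; 0.0 for an isolated node."""
    hood = neighborhood(adj, start, max_hops)
    return len(hood & known_fraud) / len(hood) if hood else 0.0


adj = build_adjacency(EDGES)
print(graph_score(adj, "acct:A", KNOWN_FRAUD))  # 0.5: acct:A shares dev:1 with acct:B
print(graph_score(adj, "acct:D", KNOWN_FRAUD))  # 0.0: no fraud within two hops
```

Here acct:A has never been flagged itself, but it inherits risk from the device it shares with a known-fraud account.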

An auth-boundary decision function (TypeScript)

Below is the actual shape of the decision function we ship. It runs rules first. Takes the ML score. Gates the LLM review hop to the amber band. Emits an audit-log line on every path. The interesting work is not the model call. The interesting work is the contract between layers and the audit-log emission. Reviewers read this log. Regulators read this log. Post-incident analysis depends on it.

auth-boundary-decide.ts typescript

// Auth-boundary decision function. Hot path; budget 400 ms p95.
// Emits a single allow / step_up / decline + audit-log row.
import { runRules } from './rules-engine'        // Stripe Radar + in-house
import { mlScore }  from './ml-score'             // Sift / Feedzai / SageMaker
import { graphScore } from './risk-graph'         // Neo4j 2-hop
import { llmReview }  from './llm-review'         // Claude Opus 4, scoped tools
import { auditLog }   from './audit-log'          // append-only, immutable

export type Decision = 'allow' | 'step_up' | 'decline'

export interface AuthAttempt {
  account_id: string
  action: 'login' | 'payment' | 'recovery' | 'device_add'
  amount_cents?: number
  device: { fp: string; risk: number }
  ip: string
  payee_id?: string
  ts: number
}

export async function decide(att: AuthAttempt): Promise<Decision> {
  const traceId = crypto.randomUUID()

  // 1. Rules first: cheap, explainable, deterministic.
  const rule = await runRules(att)
  if (rule.decision === 'decline') {
    await auditLog({ traceId, att, decision: 'decline',
      layer: 'rules', reason: rule.rule_id })
    return 'decline'
  }

  // 2. ML score + graph score combined.
  const [ml, graph] = await Promise.all([mlScore(att), graphScore(att)])
  const score = 0.7 * ml.value + 0.3 * graph.value

  // Bands calibrated on the 2026-Q1 eval; hybrid FPR = 1.4%.
  if (score < 0.30) {
    await auditLog({ traceId, att, decision: 'allow',
      layer: 'ml+graph', score, shap: ml.top_shap })
    return 'allow'
  }
  if (score > 0.85) {
    await auditLog({ traceId, att, decision: 'decline',
      layer: 'ml+graph', score, shap: ml.top_shap })
    return 'decline'
  }

  // 3. Amber band: optional LLM-review hop, scoped tool calls only.
  const llm = await llmReview({ attempt: att, ml, graph,
    budget_ms: 900 })
  await auditLog({ traceId, att, decision: llm.decision,
    layer: 'llm', score, llm_model: llm.model,
    llm_rationale: llm.rationale })
  return llm.decision
}
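One thing the function above leaves implicit is what happens when the LLM hop exceeds its budget. A common choice, and an assumption here rather than something the code above does, is to fall back to step-up. The sketch below shows that pattern in Python (to match the eval harness) using `asyncio.wait_for`; `slow_review` is a stand-in for the real reviewer call, and the `llm_timeout` layer label is hypothetical.

```python
import asyncio


async def slow_review(delay_s: float) -> str:
    """Stand-in for the LLM reviewer call (hypothetical); sleeps, then decides."""
    await asyncio.sleep(delay_s)
    return "allow"


async def review_with_budget(delay_s: float, budget_ms: int) -> tuple[str, str]:
    """Return (decision, layer). Falls back to step_up if the budget is exceeded."""
    try:
        decision = await asyncio.wait_for(slow_review(delay_s), timeout=budget_ms / 1000)
        return decision, "llm"
    except asyncio.TimeoutError:
        # Assumed policy: never block the user on a slow model; ask for step-up instead.
        return "step_up", "llm_timeout"


fast = asyncio.run(review_with_budget(delay_s=0.01, budget_ms=200))
slow = asyncio.run(review_with_budget(delay_s=0.50, budget_ms=50))
print(fast, slow)
```

Returning the layer alongside the decision keeps the timeout path visible in the audit log, so a spike in `llm_timeout` rows shows up as its own signal.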

Eval harness for ai fraud detection (Python)

The LLM-review hop is the part of the stack most likely to ship without a measurement contract. Before you let a model decide on declines, you measure it against a labeled-fraud holdout, you measure the rationale quality with Ragas, and you watch the unit economics with Langfuse or Braintrust. The two-tab harness below is what we hand a buyer on week one. If you are wondering why we use a constrained reviewer agent here and not a free chatbot, our piece on ai agents vs chatbots walks the reasoning; in short, the auth boundary cannot tolerate open-ended reasoning, so the reviewer gets a typed tool surface and nothing else.

Classifier eval (sklearn) | LLM rationale eval (Ragas)

eval_classifier.py python

# Classifier eval over the 12,400-tx 2026-Q1 holdout.
import json
import numpy as np
from sklearn.metrics import (precision_score, recall_score,
                             roc_auc_score, confusion_matrix)

# Load the labeled holdout; each row carries label_fraud (0/1) and a model score.
with open('holdout_2026q1.jsonl') as f:
    holdout = [json.loads(line) for line in f]
y_true  = np.array([r['label_fraud'] for r in holdout])
y_score = np.array([r['score'] for r in holdout])

# AUC is threshold-independent, so compute it once outside the sweep.
print(f'AUC={roc_auc_score(y_true, y_score):.3f}')

for thresh in [0.50, 0.60, 0.70, 0.80, 0.85]:
    y_pred = (y_score >= thresh).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr = fp / (fp + tn)
    print(f'thr={thresh:.2f}',
          f'recall={recall_score(y_true, y_pred):.3f}',
          f'precision={precision_score(y_true, y_pred):.3f}',
          f'fpr={fpr:.4f}')
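The loop above sweeps fixed thresholds. If the constraint is instead a false-positive ceiling, the threshold can be derived from the holdout. The following is a minimal pure-Python sketch: the function name and the toy data are illustrative, and it uses the same "predict fraud when score >= threshold" convention as the script above. It is quadratic in the holdout size, which is fine for a sketch but not for a large corpus.

```python
def lowest_threshold_at_fpr(y_true, y_score, max_fpr):
    """Lowest score threshold whose false-positive rate is <= max_fpr.

    Predicts fraud when score >= threshold. Returns None if no candidate qualifies.
    """
    negatives = sum(1 for y in y_true if y == 0)
    if negatives == 0:
        raise ValueError("holdout has no legitimate (label 0) cases")
    best = None
    # Candidate thresholds are the observed scores, checked from high to low.
    for thr in sorted(set(y_score), reverse=True):
        fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= thr)
        if fp / negatives <= max_fpr:
            best = thr  # still within the ceiling; keep lowering
        else:
            break  # FPR only grows as the threshold drops
    return best


# Toy holdout: 1 = fraud, 0 = legitimate.
labels = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.40, 0.55, 0.70, 0.80, 0.95]
thr = lowest_threshold_at_fpr(labels, scores, max_fpr=0.20)
print(thr)  # 0.55
```

On the toy data, one of five legitimate cases scores above 0.55, which sits exactly at the 20% ceiling; dropping the threshold to 0.40 would admit a second false positive.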

eval_llm_rationale.py python

# Eval the LLM reviewer's rationale text on amber-band cases.
# Faithfulness = did the rationale stay grounded in the case context?
# Answer-relevance = did it actually address the fraud question?
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

cases = [...]  # 480 amber-band cases with case_context + reviewer_text
ds = Dataset.from_list(cases)
result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
print(result)
# 2026-Q1 baseline on Claude Opus 4:
#   faithfulness=0.91  answer_relevancy=0.88
# GPT-4o on same set: faithfulness=0.84  answer_relevancy=0.82

Eval scoreboard: 2026-Q1 results on a labeled-fraud corpus

Numbers below are from our internal 2026-Q1 eval on a 12,400-transaction labeled-fraud holdout drawn from a fintech operator's auth-boundary traffic. The hybrid stack (rules plus ML plus LLM-review on the amber band only) is the row that matters; the rules-only and ML-only rows are baselines. We publish the eval harness with the engagement; numbers below are not generic vendor claims.

Internal eval, 2026-Q1, 12,400-tx labeled-fraud holdout

Metric	Value	Note
Recall (hybrid)	92%	vs 71% rules-only baseline at FPR 0.9%
FPR (hybrid)	1.4%	declined-good-customer rate inside loss-tolerance band
p95 latency	310 ms	end-to-decision incl. LLM hop on amber band
Rationale faithfulness	0.91	Ragas score on Claude Opus 4 reviewer

False-positive cost: the line item nobody puts on the slide

Every vendor pitch leads with recall. Almost none lead with the declined-good-customer rate, which is the line item that costs you real revenue. A rules-only policy that declines 4% of legitimate payments is bleeding more than the fraud it catches, especially at retail-scale GMV. The comparison below shows the trade-off across three policies on the same 2026-Q1 corpus. Hybrid wins on capture rate and on false-positive rate; the cost is the LLM token spend on the amber band, which on our run was about $0.04 per amber-band case routed to Claude Opus 4.

Declined-good-customer rate by policy (2026-Q1, lower is better)

Policy	Declined-good-customer rate	Note
Rules-only baseline	4.1%	High explainability, blunt; declines too many legitimate customers in spike traffic.
ML-only (Sift / Feedzai)	2.3%	Tunable; threshold calibration is the active variable; needs drift monitoring.
Hybrid (rules + ML + LLM amber)	1.4%	Lowest in this eval. LLM-review reduces FP on amber band where ML alone declines borderline-good customers.
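The token-spend side of the trade-off is simple arithmetic once you know what share of traffic lands in the amber band. The sketch below uses the $0.04 per amber-band case from our run; the daily volume and the 6% amber share are made-up inputs for illustration, not measured figures.

```python
from decimal import Decimal


def llm_review_cost(attempts: int, amber_share: Decimal, cost_per_case: Decimal) -> Decimal:
    """Estimated LLM-review spend: attempts routed to the amber band times unit cost."""
    amber_cases = Decimal(attempts) * amber_share
    return (amber_cases * cost_per_case).quantize(Decimal("0.01"))


# Hypothetical inputs: 1,000,000 attempts per day, 6% routed to the amber band.
daily = llm_review_cost(1_000_000, Decimal("0.06"), Decimal("0.04"))
print(daily)  # 2400.00
```

Set that figure against the revenue recovered by the lower declined-good-customer rate on your own traffic before deciding whether the hop pays for itself.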

ai fraud detection platform comparison: Sift vs Feedzai vs Stripe Radar

Buyers usually compare two or three of these. They have different shapes and different ideal customers. The framing below is what we walk through on a discovery audit. For a broader take on how to evaluate platform vendors, our generative ai services buyers guide covers the RFP shape; the same rubric works for fraud platforms.

	Sift / Feedzai (full-stack platform)	Stripe Radar + in-house ML + LLM-review
Best for	teams without a senior ML practice who want ingestion plus scoring plus case-management plus reporting in one product. Drift monitoring is in the product. Vendor owns the model lifecycle	teams already on Stripe with a payments scope and a Python or Go ML team. Stripe Radar handles deterministic rules and base scoring; you own the threshold and the LLM-review layer with Claude Opus 4 or GPT-4o
Trade-offs	less control over the score function, monthly platform spend, audit-log shape is the vendor's, harder to layer your own LLM-review on top without losing some signal	you carry drift monitoring, retraining cadence, and audit-log shape yourself. Cheaper at scale; more engineering to stand up

Outside payments-only scope, Mastercard Decision Intelligence is the option to consider for card-network signal, and ComplyAdvantage or Unit21 are the case-management plays for AML overlap. None of these are the wrong answer in isolation; the wrong answer is buying without an eval set you can run against each one.

Production gotchas — drift, adverse action, and the audit log

Six things break in production that the demo did not show. The callout below lists them in the order we usually see them surface during the first 90 days post go-live.

Six production gotchas in ai fraud detection

1. Drift detection. ML scores drift weekly as fraudsters rotate tactics. If your score distribution shifts and you have not retrained, you are silently raising your false-positive rate. Wire a weekly KL-divergence check on the score histogram and alert when it crosses a threshold.

2. Adverse-action notice. Declining a customer because of an ML score can trigger adverse-action obligations under ECOA and FCRA in the US when the decision touches credit or relies on a consumer report, and similar regimes apply in the EU and UK. You must be able to produce the reason: the SHAP top-3 features on the ML score, the LLM rationale string when the LLM hop fired, and the rule id when rules fired. Persist whichever of the three apply on every decline, for the full retention window.

3. SHAP / LIME explanations. Run these offline, persist on the audit-log row, never compute on the hot path. SHAP at decision time will blow your latency budget.

4. Audit-log shape. Append-only, immutable, time-ordered. Include trace_id, account_id (hashed), decision, layer that decided, score, top SHAP features, LLM model+version, LLM rationale, rule_id. Retention should match your longest dispute window plus regulator look-back, not the default S3 lifecycle.

5. Vendor lock-in. Every platform has a proprietary scoring function. If the contract does not give you raw signal export, your only fallback when the vendor changes pricing is to start over. Negotiate signal-level export on day one or assume you cannot leave.

6. p95 latency budget. The auth boundary is on the hot path. Every 100 ms you add to decision time is conversion loss at the login screen and at checkout. Budget by layer up front and treat the LLM-review hop as the only layer allowed to spend > 300 ms.
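The weekly drift check from item 1 can be sketched as a KL divergence between a baseline score histogram and the current week's. This is a minimal pure-Python illustration: the 10-bin layout, the smoothing constant, and the 0.1 alert threshold are assumptions you would tune on your own score history, and scores are assumed to lie in [0, 1].

```python
import math


def histogram(scores, bins=10):
    """Counts of scores in equal-width bins over [0, 1]; a score of 1.0 goes in the last bin."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)
        counts[idx] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) in nats, with additive smoothing so empty bins do not divide by zero."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def drifted(current_scores, baseline_scores, threshold=0.1):
    """True when this week's score histogram has moved past the alert threshold."""
    return kl_divergence(histogram(current_scores), histogram(baseline_scores)) > threshold


baseline = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.60, 0.90]
same_week = list(baseline)
shifted_week = [0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]
print(drifted(same_week, baseline), drifted(shifted_week, baseline))  # False True
```

KL divergence is asymmetric, so fix which distribution is the baseline and keep it fixed; otherwise week-over-week values are not comparable.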

Cost anchors that actually move the basis-point needle

Three numbers that should anchor a fraud-program business case. The first is the loss anchor, the second is the per-decision compute cost, and the third is the basis-point reduction band we see when an eval-disciplined stack replaces drift-ignored rules. Each one is sourced from a public report or from a dated internal eval, not from a vendor deck.

FAQ — ai fraud detection at the auth boundary

What is ai fraud detection at the auth boundary?

Real-time scoring of login, payment, account-recovery, and device-add attempts using a composition of rules, ML, and an optional LLM-review hop. The decision binds within a 100-400 ms window, emits a single allow / step-up / decline, and writes an immutable audit-log row used downstream for adverse-action review, dispute handling, and regulator look-back.

Rules vs ML vs LLM — which one should make the binding decision?

Composition, not a single winner. Rules handle the cheap and explainable cases at 5-20 ms. ML score handles graded risk at 40-120 ms with drift monitoring. LLM review is reserved for the amber band where a human reviewer would also have to think; it runs at 400-1200 ms and is gated by token cost. We bind decisions on rules and ML; LLM-review adjudicates the amber band only.

How do we measure whether the platform is working?

Precision plus recall plus false-positive rate against a labeled-fraud holdout you control, with p95 latency and per-decision cost on top. Rationale quality on the LLM-review layer measured with Ragas faithfulness and answer-relevancy. On our 2026-Q1 12,400-tx eval the hybrid stack hit 92% recall at 1.4% FPR with 310 ms p95. If your vendor cannot produce comparable numbers on your corpus, the engagement is selling something else.

What about adverse-action notice and explainability?

Declines based on an ML score can count as adverse action under ECOA, FCRA, and equivalent EU and UK regimes, depending on the product and the data behind the decision. You must be able to produce the reason. Persist SHAP top-3 features for ML-driven declines, the LLM rationale text for amber-band declines, and the rule id for rule-driven declines. Compute SHAP offline and persist on the audit-log row; never compute on the hot path.
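The three reason fields above map naturally onto one audit-log row. The sketch below is illustrative only: the field names follow the list in the production gotchas, while the `salt`, the SHA-256 hashing of `account_id`, and the JSON-lines serialization are assumptions rather than a description of a production schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)  # frozen: a row cannot be mutated after it is built
class AuditRow:
    trace_id: str
    account_hash: str
    decision: str                      # 'allow' | 'step_up' | 'decline'
    layer: str                         # which layer made the call
    score: Optional[float] = None
    top_shap: tuple = field(default_factory=tuple)
    llm_model: Optional[str] = None    # model + version when the LLM hop fired
    llm_rationale: Optional[str] = None
    rule_id: Optional[str] = None


def hash_account(account_id: str, salt: str) -> str:
    """Salted SHA-256 so the raw account id never lands in the log (assumed scheme)."""
    return hashlib.sha256((salt + account_id).encode("utf-8")).hexdigest()


def to_json_line(row: AuditRow) -> str:
    """Serialize one row as a single JSON line for an append-only sink."""
    return json.dumps(asdict(row), sort_keys=True)


row = AuditRow(
    trace_id="t-001",
    account_hash=hash_account("acct_123", salt="demo-salt"),
    decision="decline",
    layer="rules",
    rule_id="R001_velocity",
)
line = to_json_line(row)
print(line)
```

A frozen dataclass only guards the in-memory object; immutability of the stored log still depends on the sink (for example, write-once storage).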

Sift, Feedzai, Stripe Radar, Alloy, Persona — which one do we pick?

It depends on whether you want a full-stack platform (Sift, Feedzai, Mastercard Decision Intelligence) or a build-your-own composition (Stripe Radar plus in-house ML plus an LLM-review hop on Claude Opus 4 or GPT-4o). Identity-side vendors like Alloy and Persona are usually complementary, not substitutes. Run all of them against the same labeled-fraud holdout before picking.

When should we NOT add an LLM-review layer?

When your ML score plus rules already operate inside your false-positive tolerance band, when your p95 latency budget will not absorb 400-900 ms of model time, or when your team has not yet wired Langfuse or Braintrust traces. An LLM review hop without observability and without a rationale-quality eval is a regression. We will tell a buyer to skip it in writing on the audit if that is the call.

How does this work with model-risk management?

Treat the LLM-review hop as a model under SR 11-7 in US banking, and under the equivalent guidance in other jurisdictions. Freeze model versions, document the eval, persist rationale and tool-call traces, and run drift checks on the score band the model affects. Vendors who cannot answer what model version made a decision six months ago will not pass an exam.

Talk to engineering

The honest play on ai fraud detection at the auth boundary is the same as every other production AI shape we ship: name your metric, build your eval set, compose your layers, audit-log everything, rehearse the rollback. The platform you pick matters less than the eval discipline you wrap around it. If you want a structured conversation about your auth-boundary stack instead of a vendor demo, we run a discovery audit that ends with working code and a written recommendation. You take that recommendation whether or not you continue with us. Walk-away clause on every audit.

Working on fraud at the auth boundary?

Start the audit. Take the recommendation, with or without us.

We ship eval-first, model-agnostic ai fraud detection systems for fintech and payments operators. Engineering reads every inbound. SOC 2-ready practices in our delivery; we flag with procurement honestly that we are not certified as a vendor.
