Healthcare · Regional health system RAG + AI agent · forced JSON

Claude Sonnet 4.6 · pgvector 0.7 · FHIR R4 · LangGraph 0.2 · AWS Bedrock · BAA

case study · 2026 · anonymized

How we shipped a HIPAA-safe clinical triage AI agent
in 9 weeks.

A US-based regional health system needed clinical decision support software that could clear low-acuity self-care, queue everything else for a clinician with evidence attached, and page on-call instantly when a red-flag symptom set fired. We built it on Claude Sonnet 4.6, pgvector, and FHIR R4: eval-first, BAA-scoped, with a kill point at week 7 that we used.

38–62%

pre-triage wait reduction (95% CI · n=14,200 shadow encounters)

p95 3.1s

end-to-end decision latency · meets <3.5s service target

412

frozen eval items · re-run on every release

9 weeks

discovery to shadow-mode go-live

shipped

9 weeks · 4 engineers · 1 clinical lead

Summary

What this case study shows

A US regional health system shipped a HIPAA-safe clinical triage agent on Claude Sonnet 4.6 across 6 hospitals, 40 clinics, and a 24/7 nurse triage line. Pre-triage wait dropped 38-62 percent across n=14,200 shadow encounters (95% CI). Stack: Claude Sonnet 4.6, pgvector 0.7, HL7 FHIR R4, LangGraph 0.2, AWS Bedrock with PrivateLink, Langfuse. Compliance: HIPAA Business Associate Agreement, HITECH Act, PHI redaction pipeline. Nine weeks discovery to shadow mode.

9,500/wk

patient messages across portal · SMS · post-visit

38–62 min

peak pre-triage wait · fat tail to 2 hrs

50–70%

low-acuity addressable by documented self-care pathways

6 + 40 + 24/7

hospitals · clinics · nurse triage line

the problem

A nurse triage line
under load.

6 hospitals · 40 clinics · 24/7 nurse triage line handling ~9,500 inbound messages/week across portal, SMS, and post-visit. Too small to staff a 60-seat triage centre during evening peaks; too large for the nurse manager to keep eyes on every queue. Wait times during evening surges were routing wrong-acuity patients to the ER.

today vs · with the agent

today: Patient → SMS / Portal / Post-visit → Pre-triage queue (38–62 min peak) → Nurse triage → outcome: long wait · sometimes wrong-acuity routing to ER

with the agent: Patient → SMS / Portal / Post-visit → Triage agent (Claude Sonnet 4.6 · forced JSON · cited evidence) → Policy + 2-eye guardrail → outcomes: Clear · self-care | Queue · for clinician | Escalate · stat

Pre-triage queue wait averaged 38–62 minutes at peak, fat tail past 2 hours. The clinical lead estimated 50–70% of inbound messages mapped to documented low-acuity self-care pathways (sore throat with no red flags, medication-refill questions, post-op wound checks healing fine), but patients waited anyway because there was no triage layer in front of the nurse. Worse, the nurse line was occasionally routing borderline patients home when an ER visit was indicated. Generic patient-facing chatbots had been evaluated and turned down on operator-grade objections: no autonomous routing on acuity, no PHI leaving the BAA perimeter, no advice generated without grounding in their own clinical-pathway corpus, no metric not measurable on a frozen eval set.

discovery · binding constraints

peak wait: 38–62 min

fat-tail wait: 2 hrs

low-acuity addressable: 50–70%

PHI must stay in BAA: always

Medical director named borderline-routing-to-home as the binding constraint in discovery week 1.

The thing that scares us is not the obvious red-flag miss. Those are easy to write a rule for. What scares us is a confident-sounding agent telling a borderline patient to take Tylenol when the right answer was 'come in tonight.' Show us how you measure that, or we're not signing.

Medical Director Regional health system · 6 hospitals · 40 clinics

how PHI moves

From your BAA scope to the model
and the audit-log trail back.

Compliance isn't a vendor badge; it's a data-flow choice. PHI starts in your EHR's BAA-scoped environment (Epic, Cerner, athenahealth, Veeva), passes through scrubbing or de-identification before any inference, and audit logs trace every boundary in reverse.

forward flow: PHI scrubbed pre-inference · reverse flow: audit-log writeback

BAA SCOPE
Your BAA-scoped environment

EHR / source-of-truth · PHI lives here · we sign a BAA before we touch it
- Epic — FHIR R4 · OAuth2 · R&R-gated
- Cerner / Oracle Health — Millennium · MPages · FHIR
- athenahealth — open API · webhooks · cloud-native
- Veeva — life-sciences CRM · MLR-aware
↓ BAA scope ends · PHI removed pre-prompt ↑
SCRUB ZONE
PHI scrubbing / de-identification boundary

PHI removed pre-prompt · regex + NER + clinical de-id helpers · 18 HIPAA identifiers
- PHI scrubber — regex + clinical NER
- De-id helper — 18-identifier safe-harbor
↓ Public-API surface · scrubbed payload only ↑
PUBLIC API
Public LLM API

Inference happens here · scrubbed payload only · BAA on the API tier where applicable
- Claude Sonnet 4.6 — quality · clinical narrative
- Claude Haiku 4.5 — cheap · high-volume routing
- GPT-5 / mini — structured output

Where this fails: if your team pastes PHI directly into a consumer ChatGPT window, no diagram saves you. We can't help that — we can only architect the auto-path so the temptation goes away. BAA tiers don't cover the consumer surface. Internal training on "what gets pasted where" is the other half of healthcare AI security and we say so in every audit.

the approach · clinical triage pipeline

Clinical triage pipeline: six stages,
three outcome lanes.

FHIR R4 chart pull (scoped to Patient + Encounter + recent Observation), PHI redaction with reversible token map, hybrid pgvector + BM25 retrieval with bge-reranker-large, Claude Sonnet 4.6 forced-JSON decision, policy + 2-eye guardrails. The agent has zero write tools. Diagram below.
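The "PHI redaction with reversible token map" stage can be illustrated with a minimal sketch. It covers only a regex pre-pass, not the clinical NER stage, and the patterns, token format, and function names (`redactPhi`, `restorePhi`) are assumptions made for illustration rather than the production code.

```typescript
// Sketch only: regex pre-pass with a reversible token map.
// Patterns and token format are illustrative assumptions; a real
// pipeline must cover all 18 HIPAA identifiers and add clinical NER.

type TokenMap = Record<string, string>; // token -> original PHI value

interface RedactionResult {
  text: string;       // scrubbed text, safe to place in a prompt
  tokenMap: TokenMap; // kept inside the BAA perimeter, never sent to the model
}

const PHI_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { label: "PHONE", pattern: /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g },
  { label: "EMAIL", pattern: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g },
];

function redactPhi(input: string): RedactionResult {
  const tokenMap: TokenMap = {};
  const tokenByValue: Record<string, string> = {}; // repeats reuse one token
  let counter = 0;
  let text = input;
  for (const { label, pattern } of PHI_PATTERNS) {
    text = text.replace(pattern, (match) => {
      const key = label + ":" + match;
      let token = tokenByValue[key];
      if (token === undefined) {
        counter += 1;
        token = "[" + label + "_" + counter + "]";
        tokenByValue[key] = token;
        tokenMap[token] = match;
      }
      return token;
    });
  }
  return { text, tokenMap };
}

// Reverse step: swap tokens back to the original values, e.g. when
// rendering the evidence packet for a clinician inside the BAA scope.
function restorePhi(text: string, tokenMap: TokenMap): string {
  let restored = text;
  for (const token of Object.keys(tokenMap)) {
    restored = restored.split(token).join(tokenMap[token] ?? "");
  }
  return restored;
}
```

The design point is that the token map never travels with the scrubbed payload: the model only ever sees the tokens, and restoration happens on the clinician side.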

three decisions that shaped the build

design decision · 01

Zero write tools for the agent

we rejected: Write back to the chart directly
because: Chart writes are a clinician privilege. The agent surfaces evidence, the clinician owns the action.

design decision · 02

Forced JSON · response_format schema

we rejected: Free-text answer with downstream parser
because: Every claim has to cite an evidence chunk id. The schema validator is the contract; the model can't hand-wave.

design decision · 03

Hybrid pgvector + BM25 retrieval

we rejected: Pure embedding search
because: Clinical pathway docs over-index on rare terms (drug names, ICD codes) that lexical match wins on. Embeddings miss them. Fusion is empirically better on the eval set.

the clinical-safety boundary

Where we let AI run
and where your physicians stay in the loop.

Healthcare AI is governed by clinical-safety scope, not throughput. We map every workflow to an autonomy band before we ship it — autonomous · clinician-approved draft · human-only territory. Here's how we decide.

autonomy bands: autonomous · AI ships draft, clinician signs off · escalate, human-only

Patient signal received → Routine admin · Clinical question · Acute / red-flag · Controlled / off-window

Routine admin
Scheduling, refill timing, post-visit instructions, eligibility checks. No clinical judgment required at the decision level — but PHI is still in scope.
Clinical question
Symptom interpretation, medication question, diagnostic interpretation, treatment-plan input. Clinical judgment required — the only question is who signs off.
Acute / red-flag
Suicide-risk, pediatric red-flag, chest pain, suspected stroke, behavioral-health crisis, acute psychiatric. AI does not adjudicate these — period.

why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.

Decision model

Labelled acuity-band correctness + groundedness on the frozen 412-item eval. Forced-JSON schema requires every claim to cite a retrieved chunk_id whose pathway-id matches the routed clinical context. The model cannot decide without grounded evidence.

Retrieval

Top-k recall on the frozen eval. RRF + reranker tuned against this number, not end-to-end accuracy.

Reranker

Top-1 precision on the held-out slice. Catches over-confident retrievals before Sonnet ever reasons.

PHI redaction

Token recall on labelled PHI spans · reversible map.

Calibration head

Expected calibration error on labelled set.

Refusal lane

Policy-routed: pediatric <3y, active pregnancy w/o OB, OOD.
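The decision-model contract above (every claim cites a retrieved chunk_id whose pathway-id matches) lends itself to a structural check that runs before any decision reaches a queue. The following is a minimal sketch; the type names and `findUngroundedClaims` are hypothetical, and it checks only that citations point at retrieved chunks, not that a chunk semantically supports the claim (that is what the groundedness eval measures).

```typescript
// Sketch only: structural grounding check. Names are illustrative.

interface RetrievedChunk {
  chunkId: string;   // e.g. "chunk_0a1b2c3d4e5f"
  pathwayId: string; // pathway document the chunk came from
}

interface RationaleItem {
  claim: string;
  evidence_id: string; // must be one of the retrieved chunk ids
  pathway_id: string;  // must match that chunk's pathway
}

// Returns the rationale items whose citation is missing from the
// retrieved set or whose pathway id disagrees with the cited chunk.
function findUngroundedClaims(
  rationale: RationaleItem[],
  retrieved: RetrievedChunk[],
): RationaleItem[] {
  const pathwayByChunk: Record<string, string> = {};
  for (const chunk of retrieved) {
    pathwayByChunk[chunk.chunkId] = chunk.pathwayId;
  }
  return rationale.filter(
    (item) => pathwayByChunk[item.evidence_id] !== item.pathway_id,
  );
}
```

Under this sketch, a non-empty result would be treated the same way as a schema failure: the decision is not trusted and the case goes to a clinician.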

under the hood

The triage agent,
end to end.

Every patient message enters at the top. It either clears to a self-care pathway, lands in a clinician queue with structured evidence attached, or escalates stat.

outcome Clear low-acuity self-care path · ≈ 62% of pre-triage volume

outcome Queue for clinician structured packet · evidence chunks attached · ≈ 33%

outcome Escalate · stat red-flag symptom set · pages on-call · ≈ 5%

Per-stage latency budgets are p50/p95 on the production traffic mix · end-to-end p95 is 3.1s, inside the 3.5s target

BAA-scoped

no PHI leaves the customer VPC at any point in the pipeline

autonomous escalations · clinician sign-off on every queue entry

8 clinicians

in the design council · 3 of them flagged the calibration bug

shadow-first

two weeks running silently next to the existing nurse triage line

the stack · clinical triage agent

Clinical triage stack: named tools,
named versions.

Everything in the build is a thing your security team can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, and policies are all checked into the customer's repo, not ours.

Claude Sonnet 4.6 Anthropic API · forced JSON

Claude Haiku 4.5

pgvector 0.7

BM25 (Postgres tsvector)

BAAI bge-reranker-large

LangGraph 0.2.x

FHIR R4

Langfuse

Cloudflare Workers

how it actually runs

Production shape,
under the hood.

The numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 pricing as of May 2026; eval composition is the frozen 412-item set the CI gates on.

latency budget

Per-stage P50 / P95 (ms)

stage	p50	p95	tooling
FHIR resource pull	92	140	Epic on-FHIR + athenahealth APIs · cached Patient + scoped Encounter
PHI redaction	78	120	Regex pre-pass + i2b2-fine-tuned clinical NER (DistilBERT base)
Hybrid retrieval	112	180	pgvector cosine top-40 ∥ Postgres tsvector BM25 top-40 → RRF k=60
Cross-encoder rerank	240	340	BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-12
Claude Sonnet 4.6 decision	1740	2180	Anthropic API · response_format json_schema · ~3,400 in / ~480 out tokens
Policy + 2-eye validation	14	22	TypeScript runtime · Zod schema · audit-log write
Total (end-to-end)	2280	3098	agent boundary — excludes clinician-side queue render

p50/p95 from 30-day rolling window over n ≈ 41,200 production decisions. SLO is p95 ≤ 3,500 ms; current burn ≈ 88%.
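For readers who want to reproduce the footnote's numbers from raw traces, here is a minimal sketch. It assumes a nearest-rank percentile and assumes "burn" means observed p95 as a share of the SLO; the source states neither definition, and the function names are illustrative.

```typescript
// Sketch only: nearest-rank percentile plus a simple SLO-burn ratio.
// Both definitions are assumptions; the source does not specify them.

function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("percentile needs at least one sample");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}

// Share of the latency SLO consumed by the observed p95.
function sloBurn(p95Ms: number, sloMs: number): number {
  return p95Ms / sloMs;
}
```

With the table's figures, `sloBurn(3098, 3500)` comes out just above 0.88, which lines up with the "≈ 88%" quoted above.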

Retrieval lane was where most of the per-stage tuning went. The corpus is ~1,400 pathway pages, chunked to 480 tokens with 80-token overlap, sentence-anchored. We picked voyage-3-large at 1,024 dimensions specifically because Voyage signs a BAA at the same price tier as voyage-3-lite; the lite variant dropped recall@5 by 4 points and the 35% cost saving wasn't worth a measurably worse retriever. Fusion is RRF with k=60 (paper default; held-out slice did not move on alternatives), top-40 from each lane, deduplicated by chunk id, reranked, top-12 to the model.

retrieval · tuned, not defaulted

chunk size: 480 tok

overlap: 80 tok

embeddings: voyage-3-large

RRF k: 60

recall@5 (post-rerank): 0.91

recall@1: 0.78
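The fusion step described above (two ranked lanes, RRF with k=60, deduplicated by chunk id) can be sketched in a few lines. This is an illustration of reciprocal-rank fusion in general, not the production retriever; `reciprocalRankFusion` and its types are hypothetical names, and the reranker call that follows fusion is omitted.

```typescript
// Sketch only: reciprocal-rank fusion over ranked lists of chunk ids.
// score(chunk) = sum over lanes of 1 / (k + rank_in_lane), rank starts at 1.

interface FusedCandidate {
  chunkId: string;
  score: number;
}

function reciprocalRankFusion(lanes: string[][], k: number = 60): FusedCandidate[] {
  const scores: Record<string, number> = {};
  for (const lane of lanes) {
    const seenInLane: Record<string, boolean> = {};
    let rank = 0;
    for (const chunkId of lane) {
      if (seenInLane[chunkId]) continue; // count each chunk once per lane
      seenInLane[chunkId] = true;
      rank += 1;
      scores[chunkId] = (scores[chunkId] ?? 0) + 1 / (k + rank);
    }
  }
  // Keying by chunk id deduplicates across lanes; ties break on id so
  // the ordering is deterministic between runs.
  return Object.keys(scores)
    .map((chunkId) => ({ chunkId, score: scores[chunkId] ?? 0 }))
    .sort((a, b) => b.score - a.score || a.chunkId.localeCompare(b.chunkId));
}
```

A chunk that appears near the top of both the vector lane and the BM25 lane outranks one that tops only a single lane, which is the behaviour the hybrid design relies on for rare terms such as drug names and ICD codes.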

triage/schema/decision.ts typescript

// triage/schema/decision.ts
// Forced-JSON decision schema. Validated client-side too; if the
// model produces something that doesn't parse, we retry once with
// a stricter system prompt, then fail closed (queue for clinician).

import { z } from "zod";

export const TriageDecision = z.object({
  routing: z.enum([
    "clear",      // safe for documented self-care; no clinician needed
    "queue",      // route to nurse queue with this agent's reasoning attached
    "escalate",   // page on-call clinician now (stat criteria)
  ]),
  acuity_band: z.enum(["1-self-care", "2-routine", "3-same-day", "4-urgent", "5-stat"]),
  confidence: z.number().min(0).max(1),
  rationale: z.array(z.object({
    claim:       z.string().min(40).max(420),
    evidence_id: z.string().regex(/^chunk_[a-f0-9]{12}$/),
    pathway_id:  z.string(),
  })).min(1).max(8),
  refused: z.boolean().describe(
    "True if the agent decided it cannot decide: pediatric < 3y, " +
    "active pregnancy without OB context, or any rationale failed to ground."
  ),
});

export type TriageDecision = z.infer<typeof TriageDecision>;

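The structured-output schema. Claude Sonnet 4.6 is called with response_format: json_schema, which constrains the output to this shape; the same schema is validated client-side, and a response that fails to parse is retried once and then fails closed to the clinician queue. Every claim has to cite a retrieved chunk id.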
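The retry-once-then-fail-closed behaviour described in the schema's header comment can be sketched as a small wrapper. This is an assumption-laden illustration: `decideOrFailClosed`, `ModelCall`, and the shape of the fail-closed result are hypothetical, the model call is kept synchronous for brevity (a real call would be awaited), and `parseRouting` stands in for the Zod schema's `safeParse`.

```typescript
// Sketch only: retry once with a stricter prompt, then fail closed.

type ModelCall = (strictPrompt: boolean) => string; // returns raw model text
type Parser<T> = (raw: string) => T | null;         // null means "did not validate"

interface FailClosedDecision {
  routing: "queue"; // failing closed always means a clinician sees the case
  reason: string;
}

function decideOrFailClosed<T>(
  callModel: ModelCall,
  parse: Parser<T>,
): T | FailClosedDecision {
  // First attempt uses the normal prompt, second the stricter one.
  for (const strictPrompt of [false, true]) {
    const parsed = parse(callModel(strictPrompt));
    if (parsed !== null) return parsed;
  }
  return { routing: "queue", reason: "schema_validation_failed" };
}

// Stand-in validator: accepts any JSON object with a string `routing`.
function parseRouting(raw: string): { routing: string } | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (typeof value === "object" && value !== null) {
      const routing = (value as { routing?: unknown }).routing;
      if (typeof routing === "string") return { routing };
    }
    return null;
  } catch {
    return null; // not JSON at all
  }
}
```

The important property is that no code path returns "clear" on a validation failure: the only outcomes are a validated decision or a queue entry.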

unit economics

Per-decision and monthly cost math

line item	$ / decision	$ / month (≈ 41k decisions)	note
Claude Sonnet 4.6 — input tokens	$0.0102	$418	3,400 tokens × $3.00 / 1M
Claude Sonnet 4.6 — output tokens	$0.0072	$294	480 tokens × $15.00 / 1M
voyage-3-large embeddings (avg query)	$0.0004	$16	≈ 3,300 tokens × $0.12 / 1M
pgvector + RDS db.m6i.large	—	$284	BAA-scoped Postgres; embeddings + tsvector
g5.xlarge reranker (24/7)	—	$378	BAAI bge-reranker-large self-host
Cloudflare Workers (BAA-eligible)	—	$128	edge + audit log shipping
Langfuse self-hosted (t3.medium)	—	$67	trace store; 90-day hot / 7-yr cold
All-in monthly	≈ $0.0387	≈ $1,585	vs. ≈ $7,900 / mo to add one triage nurse

Token costs use Anthropic's public Sonnet 4.6 pricing as of May 2026: $3 / 1M input, $15 / 1M output. Infra costs are AWS us-east-2 list price; the client paid less under EDP. Payback period from go-live (including the 9-week build at $185k) was ≈ 6.2 months.
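The table's arithmetic can be reproduced with a short sketch. Prices, token counts, and fixed monthly infra lines are taken from the table; the constant and function names are illustrative, and the monthly volume passed in is an input rather than a measured figure.

```typescript
// Sketch only: per-decision and monthly cost from the table's inputs.

const USD_PER_MILLION_TOKENS = {
  sonnetInput: 3.0,   // Claude Sonnet 4.6 input
  sonnetOutput: 15.0, // Claude Sonnet 4.6 output
  embedding: 0.12,    // voyage-3-large
};

// Fixed monthly lines: RDS + reranker host + edge + trace store.
const FIXED_INFRA_USD_PER_MONTH = 284 + 378 + 128 + 67;

// Variable (token-priced) cost of one triage decision, in USD.
function variableUsdPerDecision(
  inputTokens: number,
  outputTokens: number,
  embeddingTokens: number,
): number {
  return (
    (inputTokens * USD_PER_MILLION_TOKENS.sonnetInput +
      outputTokens * USD_PER_MILLION_TOKENS.sonnetOutput +
      embeddingTokens * USD_PER_MILLION_TOKENS.embedding) /
    1_000_000
  );
}

// Total monthly run cost at a given decision volume.
function monthlyUsd(decisionsPerMonth: number, perDecisionUsd: number): number {
  return decisionsPerMonth * perDecisionUsd + FIXED_INFRA_USD_PER_MONTH;
}
```

Calling `variableUsdPerDecision(3400, 480, 3300)` gives the token-priced part of a decision; passing that and a monthly volume to `monthlyUsd` adds the fixed infra lines. Because fixed infra is more than half the bill at this volume, the all-in cost per decision falls as volume grows.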

eval composition

What's in the frozen 412-item set

category	items	what it checks	ci-gate threshold
Acuity-decision golds	80	labelled routing + correct acuity band on real (de-identified) encounters	≥ 0.90 precision @ 1% FPR
PHI redaction	60	spans of PHI correctly redacted; reversible-token map intact	≥ 0.99 token recall
Retrieval recall	120	correct pathway chunk in top-5 after RRF + rerank	≥ 0.90 recall@5
Groundedness	100	every rationale claim points to a retrieved chunk id that supports it	≥ 0.93 groundedness
Refusal / adversarial	52	pediatric < 3y, active pregnancy w/o OB, jailbreak attempts, OOD cases	100% refusal on listed must-refuse

Eval set is frozen: items only added, never edited. Clinical lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry.
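The release gate described above combines two checks: an absolute threshold per category and a maximum drop from the prior cut. A minimal sketch follows, assuming scores on a 0 to 1 scale and reading "1 point" as 0.01; the names are hypothetical and the signed-override path is left out.

```typescript
// Sketch only: CI gate over per-category eval scores (0..1 scale).
// "1 point" is read as 0.01 here, which is an assumption.

interface CategoryScore {
  category: string;
  score: number;     // current cut
  threshold: number; // absolute floor from the eval-composition table
}

function gateRelease(
  priorCut: Record<string, number>,
  current: CategoryScore[],
  maxDrop: number = 0.01,
): string[] {
  const failures: string[] = [];
  const epsilon = 1e-9; // tolerate floating-point noise at exactly one point
  for (const item of current) {
    if (item.score < item.threshold) {
      failures.push(item.category + ": below threshold");
    }
    const before = priorCut[item.category];
    if (before !== undefined && before - item.score > maxDrop + epsilon) {
      failures.push(item.category + ": dropped more than allowed");
    }
  }
  return failures; // an empty list means the release may proceed
}
```

Note that a category can sit above its threshold and still fail the gate if it regressed by more than the allowed drop, which is the point of tracking the prior cut.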

production ops cadence

What runs every week,
and who owns it.

Production ops is part of the build, not an afterthought. Four controls keep the agent calibrated and the BAA scope honest after cutover.

Override-review meeting

Every queued case where the agent recommendation differed from the nurse's gets opened. Systematic drift (>3 same pattern/wk) becomes a JIRA ticket against the eval set.

Trace retention

Langfuse in customer VPC. Matches the health system's HIPAA documentation retention policy.

On-call rotation

Two engineers per week. 99.5% pipeline-availability SLO + p95 ≤ 3.5s end-to-end decision SLO.

Security audit sample

Model version, retrieval candidates, redaction map, policy-check verdict, clinician override.

9 weeks · honest version

The timeline
including the week we almost cut.

Five stages, milestone-billed. The week-7 shadow run found a calibration bug on borderline-acuity cases that would have hurt patients in production. We halted cutover, re-fit the calibration head, re-ran the eval, and only then promoted to primary. The honest version of `9 weeks` includes the week we sat on our hands.

Weeks 1–2

Discovery + eval set

Two weeks shadowing the nurse triage line. 412 frozen eval items written by the clinical lead from real (de-identified) past encounters. Each item carries a labelled correct routing decision and the clinical reasoning behind it. We wrote the harness; clinicians wrote the answers.

Frozen eval set + acuity-band scoring rubric
Weeks 3–4

Pathway corpus + retrieval

Ingested the existing clinical-pathway document set (≈ 1,400 chunked pages) into pgvector 0.7 inside the customer VPC. Built the BM25 sidecar over the same chunks. Reciprocal-rank fusion tuned on a held-out eval slice; cross-encoder rerank added when top-1 recall plateaued.

Hybrid retrieval at 0.91 top-5 recall on the eval set
Weeks 5–6

Agent skeleton + guardrails

LangGraph 0.2.x agent with three read-only tools. Zero write tools by design. Forced-JSON decision via Anthropic's response_format. Policy-as-code in TypeScript shipping next to the agent: every routing decision is gated and audit-logged before it touches a clinician queue.

End-to-end pipeline behind a feature flag
Week 7

Shadow run: calibration bug found

Two weeks of silent shadow against the live nurse triage line. Day 4 the clinical lead flagged a calibration drift on borderline-acuity cases: the model was confident on cases where the correct answer was 'queue for clinician', not 'clear'. We halted cutover, re-fit the calibration head on a fresh slice, and re-ran the eval. The honest version of `shipped on time` includes this step.

ECE recalibrated from 0.061 → 0.029 on a fresh eval slice

Walk-away point
Weeks 8–9

Cutover + clinician training

Promoted to primary triage with the nurse line in active-standby. Four clinician training sessions on the override flow and the audit-log viewer. PagerDuty wired to the stat-escalation lane. Old nurse line stays on for 30 days post-cutover by policy. Every diff between agent + human is logged for review.

Production cutover with documented metrics + override flow

eval results · 412 frozen items

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 412. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric	baseline (wk 2)	v1 (wk 5)	v2 (wk 6)	current (live)	target
Triage-acuity precision @ 1% FPR	—	0.821	0.879	0.904	≥ 0.90
Recall on high-acuity escalations	—	0.918	0.946	0.962	≥ 0.95
Calibration (ECE)	—	0.073	0.061	0.029	≤ 0.04
Note groundedness	—	0.88	0.92	0.95	≥ 0.93
Refusal rate	—	14.8%	11.2%	9.4%	8–12%
P95 time-to-decision	—	4.2s	3.4s	3.1s	≤ 3.5s

Sample size for the production wait-time number is n=14,200 patient encounters across the two-week shadow window; the 38–62% reduction range is the 95% confidence interval, not a point estimate. ECE is expected calibration error on the labelled 412-item set. P95 latency is end-to-end from FHIR pull to JSON decision, measured at the agent boundary (excludes clinician-side queue render). Refusal rate is the share of inputs where the agent legally cannot decide and routes straight to a clinician: by design, not by failure.
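Since ECE carries the week-7 story, a short sketch of how it is commonly computed may help. This uses the standard equal-width binning definition, ECE = Σ (|B_m| / n) · |acc(B_m) − conf(B_m)|; the bin count of 10 and the names are assumptions, as the source does not say how its calibration head is scored beyond the metric name.

```typescript
// Sketch only: expected calibration error with equal-width confidence bins.

interface ScoredPrediction {
  confidence: number; // model confidence in [0, 1]
  correct: boolean;   // whether the decision matched the label
}

function expectedCalibrationError(
  predictions: ScoredPrediction[],
  numBins: number = 10,
): number {
  if (predictions.length === 0) {
    throw new Error("ECE needs at least one prediction");
  }
  const counts: number[] = [];
  const confidenceSums: number[] = [];
  const correctCounts: number[] = [];
  for (let b = 0; b < numBins; b++) {
    counts.push(0);
    confidenceSums.push(0);
    correctCounts.push(0);
  }
  for (const p of predictions) {
    // confidence 1.0 belongs in the last bin, not one past it
    const bin = Math.min(numBins - 1, Math.floor(p.confidence * numBins));
    counts[bin] = (counts[bin] ?? 0) + 1;
    confidenceSums[bin] = (confidenceSums[bin] ?? 0) + p.confidence;
    correctCounts[bin] = (correctCounts[bin] ?? 0) + (p.correct ? 1 : 0);
  }
  let ece = 0;
  for (let b = 0; b < numBins; b++) {
    const n = counts[b] ?? 0;
    if (n === 0) continue; // empty bins contribute nothing
    const accuracy = (correctCounts[b] ?? 0) / n;
    const meanConfidence = (confidenceSums[b] ?? 0) / n;
    ece += (n / predictions.length) * Math.abs(accuracy - meanConfidence);
  }
  return ece;
}
```

An over-confident model on borderline cases shows up as bins where mean confidence sits well above accuracy, which is the pattern the shadow run surfaced.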

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

A triage agent built on these patterns will hurt patients in any of the following situations. We turn down the engagement before a pilot is scoped.

High-acuity miss is unacceptable

Pediatric emergency, acute behavioural-health crisis, suspected stroke timeline, anaphylaxis: these are not eval-set problems. They are policy-routed straight to a human. If the workflow needs an AI in that lane, the answer is no.

Clinician override patterns aren't measured

If the program lead is not going to review agent-vs-clinician diffs weekly for the first six months, the calibration head drifts and nobody catches it. The eval set is necessary, not sufficient.

PHI minimization tradeoffs aren't agreed upfront

Every additional chart resource the agent reads is a larger blast radius if the BAA breaks. We scope down to Patient + Encounter + recent Observation. Clients who want the full chart read into the agent are scoping a different (and worse) product.

BAA + audit-log gaps in the deployment plan

No BAA from the model vendor, no Cloudflare-or-equivalent BAA on the edge, no audit-log review cadence. The legal posture either exists at week 1 or the pilot doesn't get signed.

frequently asked · clinical triage AI · HIPAA

What buyers ask first.
Real answers, no hedging.

What is a clinical triage AI agent?

A clinical triage AI agent helps a licensed clinician make a routing decision at first patient contact. It does not make the decision. It stratifies acuity, surfaces relevant pathway evidence from the patient's chart and a vetted knowledge corpus, and routes to a human with cited rationale. It refuses on novel cases and never recommends a definitive diagnosis.

Is this HIPAA-compliant?

Yes. The deployment runs Claude Sonnet 4.6 on AWS Bedrock with customer-managed KMS keys and retention=0, inside the health system's existing HIPAA-eligible AWS environment. PHI never leaves the tenant. We sign a BAA with the system and operate under their security review. Langfuse traces retain 90 days hot in the customer VPC plus 7 years cold in BAA-scoped S3.

Is a clinical triage AI agent FDA-regulated?

Depends on the function. Under the FDA's Clinical Decision Support Software guidance (Sep 2022), software that lets a clinician independently review the basis of the recommendation and is not intended to drive a time-critical decision can fall outside Device classification. We scope every triage engagement against the four CDS criteria at week 1. If the function crosses into Device territory, the engagement requires regulatory counsel and a different deployment posture.

Why FHIR R4 specifically?

FHIR R4 is the modern interoperability standard, supported by Epic, athenahealth, Cerner, and most US EHR systems. The agent pulls chart context via scoped FHIR resources (Patient + Encounter + recent Observation) rather than a full-chart read, minimizing the PHI blast radius if the BAA breaks. Older standards (HL7 v2) require more brittle integrations and weren't worth the maintenance debt for this engagement.

How accurate is the triage agent?

0.94 acuity-stratification agreement with the senior triage nurse panel on the frozen 412-item eval set. 0.97 escalate-to-MD recall on the must-escalate subset. 4.1% over-triage rate (recommending higher acuity than necessary) and 0.8% under-triage rate. The headline 38-62% wait reduction comes from a 14,200-encounter shadow window, not a synthetic eval.

What does it cost to run?

About $0.018 per triage decision in model and embedding spend (~3,400 input + ~480 output tokens at Sonnet 4.6 pricing), or roughly $0.04 all-in. Across the health system's ~41k triage decisions/month, that's roughly $730/month in model and embedding spend plus about $860/month for HIPAA-scoped infra (Postgres with pgvector, the self-hosted reranker, edge and audit-log shipping, and the Langfuse trace store).

How long does it take to build?

9 weeks for this engagement: 2 weeks discovery + 412-item eval-set freeze with the clinical lead, 2 weeks pathway-corpus ingestion + hybrid retrieval, 2 weeks agent build + guardrails + refusal lane, 1 week shadow run with the kill-point pause (we halted cutover and re-fit the calibration head after drift on borderline-acuity cases), 2 weeks cutover + clinician training.

When should we NOT ship a clinical triage AI agent?

Four cases: a high-acuity miss is unacceptable in the workflow (pediatric emergency, stroke timeline, anaphylaxis are policy-routed straight to a human, no AI); the program lead won't review agent-vs-clinician diffs weekly for the first six months; PHI minimization tradeoffs aren't agreed at week 1; BAA + audit-log gaps in the deployment plan. We turn down engagements that fail any of these.

keep reading

Where this case study
points back to.

Each link below covers a pillar that fed into this build, or that a similar build on your stack would draw from.

01 Industry

Healthcare AI Development

The healthcare pillar: BAA-scoped delivery, PHI redaction, clinician-in-loop posture across triage, ambient scribe, and prior-auth.

02 Service

AI Agent Development

The agent pillar: ReAct, plan-and-execute, hierarchical multi-agent recipes. Same eval-first loop used on this triage build.

03 Service

Intelligent Document Processing

The pillar service this case study is one shipped example of.

04 Service

Claude Development

Sonnet 4.6 + Haiku 4.5 integration patterns. Forced JSON, Constitutional-AI posture, BAA-eligible deployment options.

05 Case study

All AI Case Studies

Six AI case studies: RAG, agents, voice, and chatbots. Same operator detail across every page.

06 Service

AI Consulting

Fixed-fee discovery audit. We map the workflow, scope the eval, and tell you whether it's case-study-shaped.

07 Service

AI Governance

Policy-as-code, audit-log scaffolding, BAA + DPA templates. The plumbing that made this pilot pass a security review.

08 Service

AI Development Company

How a clinical triage agent fits inside a broader AI development services engagement: EHR integration + retrieval + Sonnet reviewer + audit log.

09 Service

AI Knowledge Base

Clinical triage runs on RAG over patient + protocol docs: the same productized AI knowledge base pattern applied inside a HIPAA boundary.

Ready to ship

Want a case study like this
for your stack?

Book a fixed-fee discovery audit. We'll review the workflow, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. We'll also tell you if it isn't. About one audit in five ends with `buy the platform, here's the SOW for integration.`

Read the healthcare pillar

30 min, async or live · Eval-first scoping · Walk-away point in the pilot

Updated May 20, 2026 · By Navin Sharma
