Legal · Mid-market US law firm AI legal assistant · AI contract review software · forced-JSON clause risk
Claude Sonnet 4.6 · LangChain 0.3 · LangGraph 0.2 · pgvector 0.7 · iManage Work
ai legal assistant · ai contract review case study · rag case study · 2026 · anonymized

An AI legal assistant for
first-pass contract review.

A US-based mid-market law firm with four practice groups needed an AI legal assistant that could do first-pass MSA review at partner-trust quality: split a contract into clauses, retrieve matching policies and precedents from a reconciled clause library, flag clause risk against the firm's playbook with cited policy IDs, and refuse out loud on novel patterns. We built the AI contract review software on Claude Sonnet 4.6, LangChain 0.3, LangGraph 0.2, a hybrid pgvector and BM25 index, and a forced-JSON clause-risk schema. The result is AI-powered contract review the general counsel could defend on an ABA Op. 512 audit. The build took nine weeks, partner-shadow-first, with a clause-library drift kill point at week 4 that we paused the build for.

≈ 71%
first-pass MSA review time saved · partner-signed-off (95% CI · n=180 MSAs across 6 months)
p95 62s
MSA wall-clock to full clause-risk report · meets <90s service target
740
frozen clause-eval items · re-run on every release
9 weeks
discovery to production cutover
shipped
9 weeks · 4 engineers · 4 senior counsel (one per practice group)
Summary

What this case study shows

A US mid-market law firm shipped a first-pass MSA review agent on Claude Sonnet 4.6 plus a LangChain orchestrator over a reconciled clause library of 1,420 master-service-agreement clauses. Across n=180 MSAs (95% CI), partner time per first-pass review dropped 71% with partner sign-off on every output. Stack: Claude Sonnet 4.6, LangChain 0.3, LangGraph 0.2, pgvector 0.7, BAAI bge-reranker-large, Langfuse, iManage Work. Compliance: attorney-client privilege, ABA Model Rules of Professional Conduct, ABA Formal Opinion 512 on generative AI tools, FRE 502. Multi-month ongoing engagement. The clause-extraction layer underneath is the same engagement shape we ship as a standalone document processing pipeline when the extraction itself is the primary surface.

6–9 hrs
partner first-pass review per MSA · pre-build
180 / yr
MSAs flowing through the 4 practice groups
4
practice groups · M&A · employment · real estate · IP
11%
post-execution disputes traced to inconsistent first-pass clause calls
the problem

Four practice groups,
four contradictory playbooks.

A US-based mid-market law firm — ~70 attorneys, 4 practice groups (M&A, employment, real estate, IP), ~180 MSAs a year. Too small to fund Ironclad/Spellbook seats firm-wide; too large for the managing partner to hand-review every contract. Partners spent 6–9 hours per MSA on first-pass review. The pre-build process was partner-by-partner, playbook-by-playbook, and the playbooks had quietly drifted apart.

today vs. with the agent

today
MSA arrives → Partner reads cover-to-cover → Cross-checks against 4 playbooks (drift across practice groups) → Marks up redline by hand
outcome: 6–9 hr first-pass per MSA · inconsistent calls across partners · 11% downstream dispute rate

with the agent
MSA upload → Semantic clause split → Hybrid clause-RAG + policy lookup (reconciled clause library) → Sonnet 4.6 clause-risk JSON (policy_id + precedent_ids enforced)
outcomes: Acceptable · partner skim | Negotiate · redline drafted | Block / manual review

Two failure modes. Wall-clock: partners burning 6–9 hours per MSA on first-pass review, cross-checking each clause against their practice-group playbook by hand. Quality cost of the drift: a general counsel audit traced 11% of post-execution disputes back to inconsistent first-pass clause calls — same fact pattern, opposite playbook calls between practice groups. Ironclad, Spellbook, and three smaller LegalTech vendors had all been evaluated and turned down. The conversation we walked into was not "should we ship LangChain" — it was "show us how a clause-risk agent could miss a hard-no, and tell us how you'd catch it before a partner sends the redline to opposing counsel."

pre-build · the binding constraints
partner time / MSA: 6–9 hrs
MSAs / year: ~180
dispute rate from drift: 11%
vendor tools accepted: 0 / 5

Vendor turn-downs detailed below. The four objections decided the engagement shape.

why the firm rejected every ai contract review tool vendor it evaluated
design decision · 01

No reconciled clause library

we rejected
Vendor's pre-trained legal model
because
Vendor tools train on a generic legal corpus. The firm's four practice groups had drifted apart on standard clauses — an agent trained on the drift would inherit it. We required reconciliation as a first-class deliverable, not a pre-trained black box.
design decision · 02

No policy_id citation contract

we rejected
Free-text rationale per clause
because
Every flagged clause must cite a policy_id (regex-validated) that resolves to a live policy document. Partners verify the policy, not the model. No vendor we evaluated surfaced this in the output schema.
design decision · 03

No published eval methodology

we rejected
Vendor's headline accuracy numbers
because
We require an eval set the firm owns, scored by senior counsel, frozen between releases. Vendors guard their eval shapes; the firm can't verify accuracy claims against the firm's own clause distribution.
design decision · 04

No first-class refusal lane

we rejected
Default-to-confident output on every clause
because
On novel patterns the agent must refuse, not guess. Manual-review is a routing lane, not an error state. Vendors framed novelty as failure; we framed it as design.
discovery · week 1

The thing that scares us is not the obvious miss — we can write a rule for an uncapped indemnity. What scares us is a confident negotiate-band call on a clause where the right answer was block, because the agent didn't know the M&A and real estate playbooks contradict each other on that fact pattern. Show us how you measure that, or we're not signing.

General Counsel · Mid-market US law firm · 4 practice groups
the approach · ai legal assistant pipeline · legal rag architecture

AI contract review pipeline — six stages,
four risk bands.

iManage matter-scoped pull, semantic clause splitter with §N.M boundaries, two parallel retrieval lanes (clause-RAG over the reconciled 1,420-clause library + practice-group-aware policy lookup with regex-validated IDs), bge-reranker-large (A/B winner against Cohere Rerank v3), Claude Sonnet 4.6 with forced-JSON ClauseRisk schema. Zero write tools: the agent produces redlines; partners send them.
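The splitter's §N.M boundary rule can be illustrated with a small sketch. This is not the firm's splitter: it assumes every clause starts on its own line with a heading such as `§ 7.2 Indemnification`, and the `Clause` shape and `splitClauses` name are hypothetical. The sub-clause merge step mentioned in the latency table is not modelled here.

```typescript
// Hypothetical sketch: split contract text on "§ N.M" headings.
interface Clause {
  anchor: string;  // section number, e.g. "7.2"
  heading: string; // text after the section number on the heading line
  body: string;    // non-empty lines up to the next heading
}

// Assumption: a clause heading is a line of the form "§ <n>.<m> <title>".
const HEADING = /^§\s*(\d+(?:\.\d+)+)\s+(.*)$/;

function splitClauses(text: string): Clause[] {
  const clauses: Clause[] = [];
  let current: Clause | null = null;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    const match = HEADING.exec(line);
    if (match) {
      // A new heading closes the clause we were collecting.
      if (current) clauses.push(current);
      current = { anchor: match[1], heading: match[2], body: "" };
    } else if (current && line !== "") {
      current.body += (current.body ? "\n" : "") + line;
    }
    // Text before the first heading (title, recitals) is ignored in this sketch.
  }
  if (current) clauses.push(current);
  return clauses;
}
```

A production splitter would also need to handle headings without the § sign, nested numbering, and schedules, which is presumably where the heading-aware logic earns its latency budget.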

three decisions that shaped the ai legal assistant build
design decision · 01

Forced-JSON clause-risk schema with policy-id regex

we rejected
Free-text redline summary
because
Every flagged clause has to cite a policy_id matching the regex policy_(practice-group)-(NNN). The Zod validator is the contract; the model can't suggest a redline without naming the policy it traces to. Partners check the policy, not the model.
design decision · 02

Four risk bands · acceptable / negotiate / block / manual-review

we rejected
Continuous risk score 0–1
because
Partners read bands, not scores. A 0.71 risk score is harder to defend in front of a client than `negotiate per policy IP-014`. The band-based output also gates queue routing: block clauses page the supervising partner; manual-review clauses leave a human-only marker on the redline.
design decision · 03

Reconcile the 4 clause libraries before any agent build

we rejected
Train per-practice-group models on each library as-is
because
Week 4 of the build found that M&A and real estate had contradictory standard indemnification clauses (same fact pattern, opposite playbook calls). An agent trained on the drift would inherit it. We paused the build for a 2-week reconciliation pass with senior counsel from each group, then resumed.
why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.

Clause-risk decision model

Labelled risk-band correctness + policy-citation accuracy on the frozen 740-item eval. Forced-JSON schema enforces a regex-validated policy_id on every rationale claim — the model cannot suggest a redline without naming the policy it traces to.

Clause-RAG retrieval

Recall@5 on the reconciled clause library. RRF fusion + bge-reranker-large tuned against this number, not end-to-end accuracy.

Policy-lookup index

Citation-accuracy against senior-counsel ground truth. Practice-group-aware: IP clauses retrieve IP policies first.

Reranker bake-off

bge-reranker-large self-host vs Cohere Rerank v3.

Manual-review lane

Refusal rate by design, not a failure mode.

Partner-override audit

Every disagreement opened in the weekly override-review meeting.
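The clause-RAG contract above is scored as recall@5. The sketch below shows one common way to compute that number, assuming each eval item lists the reference-clause ids that counsel marked relevant. The `RetrievalCase` shape and the function names are illustrative and not taken from the firm's repo.

```typescript
// Hypothetical eval-item shape for the retrieval stage.
interface RetrievalCase {
  relevantIds: string[];  // gold reference clauses for this query
  retrievedIds: string[]; // ranked ids returned by the retriever, best first
}

// Recall@k for one query: share of relevant ids that appear in the top k results.
function recallAtK(c: RetrievalCase, k: number): number {
  if (c.relevantIds.length === 0) return 1; // nothing to find counts as a hit
  const topK = new Set(c.retrievedIds.slice(0, k));
  const hits = c.relevantIds.filter((id) => topK.has(id)).length;
  return hits / c.relevantIds.length;
}

// Mean recall@k over the eval slice (k = 5 matches the metric named above).
function meanRecallAtK(cases: RetrievalCase[], k = 5): number {
  if (cases.length === 0) throw new Error("no eval cases");
  const total = cases.reduce((sum, c) => sum + recallAtK(c, k), 0);
  return total / cases.length;
}
```

Scoring retrieval on its own number is what lets a regression be pinned on the retriever rather than on the decision model.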

under the hood

The contract-review chain,
clause by clause.

Every clause runs the same six-stage chain. Two retrieval lanes fan out at stage 4 (house policy and the reconciled clause library), then converge on a forced-JSON clause-risk model. The four risk bands below are the only valid outputs.

risk · band 1 Acceptable matches house playbook · partner skim only · ≈ 41% of clauses
risk · band 2 Negotiate redline drafted with policy citation · ≈ 38%
risk · band 3 Block violates a hard policy · partner notified before send · ≈ 9%
risk · band 4 Manual review novel pattern · agent refuses · ≈ 12%
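Because the band gates queue routing, the mapping from band to handling can be written as a single exhaustive switch. This is a sketch under assumptions: the queue names and the `Routing` shape are hypothetical, and only the two behaviours the case study states (block pages the supervising partner, manual review leaves a human-only marker) are taken from the text.

```typescript
type Band = "acceptable" | "negotiate" | "block" | "manual-review";

interface Routing {
  queue: "partner-skim" | "redline-review" | "supervising-partner" | "human-only"; // hypothetical names
  pagePartner: boolean;     // block clauses notify the supervising partner before send
  humanOnlyMarker: boolean; // manual-review clauses are marked human-only on the redline
}

function routeClause(band: Band): Routing {
  switch (band) {
    case "acceptable":
      return { queue: "partner-skim", pagePartner: false, humanOnlyMarker: false };
    case "negotiate":
      return { queue: "redline-review", pagePartner: false, humanOnlyMarker: false };
    case "block":
      return { queue: "supervising-partner", pagePartner: true, humanOnlyMarker: false };
    case "manual-review":
      return { queue: "human-only", pagePartner: false, humanOnlyMarker: true };
  }
}
```

With a closed union type, adding a fifth band without a matching case becomes a compile error, which fits the idea that the four bands are the only valid outputs.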

stage latencies are per-clause p50 / p95 · a typical MSA runs ≈ 80–140 clauses in parallel batches of 12 · end-to-end MSA wall-clock 38–62s

policy-cited
every flagged clause carries a policy_id (regex-enforced)
0
autonomous redlines · partner approves every send to the counterparty
4 senior counsel
in the reconciliation council · one per practice group
shadow-first
MSAs reviewed in parallel by agent + partner for the first 6 weeks post-cutover
the stack

AI contract review software stack — named tools,
named versions.

Everything in the build is a thing your IT director can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, schemas, and policies are all checked into the firm's repo. Cohere Rerank stays wired as a fallback so the rerank stage has a documented swap-out path.

Claude Sonnet 4.6 · Anthropic API · forced JSON · role: clause-risk decision model
LangChain 0.3 · Python · role: orchestrator + retrieval glue
LangGraph 0.2.x · Python · role: clause-by-clause chain · shared scratchpad
voyage-3-large · 1,024 dim · role: embeddings · clause library + policy index
pgvector 0.7 · Postgres 16 · role: embedding retrieval
BM25 (Postgres tsvector) · Postgres 16 · role: lexical retrieval · RRF fusion
BAAI bge-reranker-large · self-hosted g5.xlarge · role: cross-encoder rerank · shipped
Cohere Rerank v3 · A/B alternative · role: rerank · loser on legal corpus (kept as fallback)
Langfuse · v3 · self-hosted · role: per-decision trace · partner-override review
iManage Work · Cloud · matter-scoped · role: DMS · OAuth-on-behalf access
how it actually runs

Production shape,
under the hood.

Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 pricing as of May 2026; eval composition is the frozen 740-item clause-eval set the CI gates on.

ai contract review latency budget

Per-clause P50 / P95 (ms)

| stage | p50 (ms) | p95 (ms) | tooling |
|---|---|---|---|
| MSA intake + iManage pull | 280 | 720 | iManage Work · matter-scoped OAuth · document hashed + version-pinned |
| Semantic clause split | 410 | 880 | Heading-aware splitter · §N.M boundary + sub-clause merge |
| LangChain orchestrator step | 36 | 88 | LangGraph 0.2 chain · shared scratchpad · clause-by-clause |
| Clause-RAG retrieval + rerank | 372 | 620 | pgvector + tsvector RRF k=60 → bge-reranker-large top-6 |
| Policy lookup (parallel) | 22 | 58 | Practice-group-aware regex-validated policy index |
| Sonnet 4.6 clause-risk decision | 1600 | 2400 | Anthropic API · response_format json_schema · ~2,800 in / ~520 out tokens |
| Validator + audit log | 18 | 32 | Zod schema · policy_id regex check · Langfuse trace write |
| Total (per-clause end-to-end) | 2738 | 4798 | agent boundary; MSAs batch 12 clauses in parallel for ≈ 62s wall-clock |

p50/p95 from a 6-month rolling window over n ≈ 18,400 per-clause decisions (180 MSAs × ≈ 102 clauses average). SLO is p95 ≤ 90s wall-clock per MSA; current burn ≈ 69%.
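The totals and the SLO burn quoted here can be checked with a few lines of arithmetic. The sketch below copies the per-stage numbers from the latency table and, like the table's total row, simply sums the stages, so the parallel policy lookup is counted as if it ran sequentially. Variable names are illustrative.

```typescript
// Per-clause stage latencies in milliseconds, copied from the latency table.
const stages = [
  { stage: "MSA intake + iManage pull", p50: 280, p95: 720 },
  { stage: "Semantic clause split", p50: 410, p95: 880 },
  { stage: "LangChain orchestrator step", p50: 36, p95: 88 },
  { stage: "Clause-RAG retrieval + rerank", p50: 372, p95: 620 },
  { stage: "Policy lookup (parallel)", p50: 22, p95: 58 },
  { stage: "Sonnet 4.6 clause-risk decision", p50: 1600, p95: 2400 },
  { stage: "Validator + audit log", p50: 18, p95: 32 },
];

// Sum one percentile column across all stages.
function totalMs(key: "p50" | "p95"): number {
  return stages.reduce((sum, s) => sum + s[key], 0);
}

const clausesPerMsa = 102; // average clause count quoted in the note above
const batchSize = 12;      // clauses processed in parallel per batch
const batchesPerMsa = Math.ceil(clausesPerMsa / batchSize);

const p95WallClockSeconds = 62; // observed MSA-level p95
const sloSeconds = 90;          // service target
const sloBurn = p95WallClockSeconds / sloSeconds;

console.log(totalMs("p50"), totalMs("p95"), batchesPerMsa, Math.round(sloBurn * 100));
```

The sums come to 2,738 ms and 4,798 ms, the average MSA needs 9 batches, and 62 s against a 90 s target is a burn of about 69%.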

embedding model bake-off · held-out eval slice

Four-way Pareto check before locking the retrieval stack

| model | recall@5 | price tier | pick reason |
|---|---|---|---|
| voyage-3-large (1,024d) | 0.92 | $$ | Pareto-best on recall, matched price tier (shipped) |
| OpenAI text-embedding-3-large | 0.89 | $$ | Lost 3pts recall@5 on the legal corpus |
| Cohere embed-multilingual-v3 | 0.87 | $$ | Lost on legal-domain recall, tied on price |
| bge-large (self-hosted, fine-tuned) | 0.90 | $ (infra only) | Close on recall, lost on 24/7 GPU operational cost |

Bake-off ran on a 60-clause held-out slice from the 740-item eval set. Same retrieval pipeline (pgvector + BM25 + RRF k=60) for every variant. We chose voyage-3-large after the bake-off; Cohere stayed wired as a runtime fallback.
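Every variant in the bake-off shared the same fusion step, RRF with k=60. As a minimal sketch of reciprocal rank fusion, assuming each lane returns a best-first list of clause ids: the `rrfFuse` name and the id format are illustrative, and the candidate count handed to the reranker is left to the caller because the source does not state it.

```typescript
// Reciprocal rank fusion: score(id) = sum over lanes of 1 / (k + rank),
// where rank starts at 1 for the best result in each lane.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id so the order is deterministic.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([id]) => id);
}

// Usage: fuse a vector lane and a lexical lane.
const vectorLane = ["c1", "c2", "c3"];
const lexicalLane = ["c3", "c1", "c4"];
const fused = rrfFuse([vectorLane, lexicalLane]);
console.log(fused);
```

RRF only uses ranks, not raw scores, so cosine similarities from pgvector and lexical scores from tsvector never need to be put on a common scale.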

```typescript
// contracts/schema/clause-risk.ts
// Forced-JSON clause-risk schema. Validated on every model output;
// if Sonnet produces something that doesn't parse, we retry once
// with a stricter system prompt, then fail closed (manual-review).

import { z } from "zod";

export const ClauseRisk = z.object({
  clause_id: z.string(),
  risk: z.enum(["acceptable", "negotiate", "block", "manual-review"]),
  rationale: z.array(z.object({
    claim: z.string().min(20).max(420),
    policy_citation: z.string().regex(/^policy_[A-Z]{2,4}-\d{3,5}$/),
    precedent_ids: z.array(z.string()).min(0).max(8),
  })).min(1),
  suggested_redline: z.string().optional(),
});

export type ClauseRisk = z.infer<typeof ClauseRisk>;
```
The clause-risk schema. Claude Sonnet 4.6 with response_format: json_schema can't return anything that doesn't conform. Every flagged clause has to name a policy_id matching the regex, or the validator rejects and the agent retries once with a stricter prompt, then routes the clause to manual review.
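The retry-once-then-fail-closed behaviour described in this caption can be sketched without any dependencies. The block below is an illustration, not the firm's code: it re-implements the schema's checks by hand instead of using Zod so that it runs standalone, `callModel` is a hypothetical stand-in for the Sonnet call, and it is synchronous for brevity where a real model call would be asynchronous.

```typescript
type Band = "acceptable" | "negotiate" | "block" | "manual-review";
interface Rationale { claim: string; policy_citation: string; precedent_ids: string[] }
interface ClauseRiskOutput { clause_id: string; risk: Band; rationale: Rationale[]; suggested_redline?: string }

const BANDS = ["acceptable", "negotiate", "block", "manual-review"];
const POLICY_ID = /^policy_[A-Z]{2,4}-\d{3,5}$/; // same pattern as the schema

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isRationale(v: unknown): v is Rationale {
  return (
    isRecord(v) &&
    typeof v.claim === "string" && v.claim.length >= 20 && v.claim.length <= 420 &&
    typeof v.policy_citation === "string" && POLICY_ID.test(v.policy_citation) &&
    Array.isArray(v.precedent_ids) && v.precedent_ids.length <= 8 &&
    v.precedent_ids.every((p) => typeof p === "string")
  );
}

function isClauseRisk(v: unknown): v is ClauseRiskOutput {
  return (
    isRecord(v) &&
    typeof v.clause_id === "string" &&
    typeof v.risk === "string" && BANDS.indexOf(v.risk) !== -1 &&
    Array.isArray(v.rationale) && v.rationale.length >= 1 &&
    v.rationale.every(isRationale) &&
    (v.suggested_redline === undefined || typeof v.suggested_redline === "string")
  );
}

type Outcome =
  | { status: "validated"; output: ClauseRiskOutput; attempts: number }
  | { status: "manual-review"; clause_id: string; attempts: number };

// `strict` tells the hypothetical model wrapper to use the stricter retry prompt.
function reviewClause(clauseId: string, callModel: (strict: boolean) => unknown): Outcome {
  let attempts = 0;
  for (const strict of [false, true]) { // first attempt, then exactly one retry
    attempts += 1;
    const candidate = callModel(strict);
    if (isClauseRisk(candidate)) {
      return { status: "validated", output: candidate, attempts };
    }
  }
  // Fail closed: no band is guessed, the clause goes to a partner.
  return { status: "manual-review", clause_id: clauseId, attempts };
}
```

The point of the shape is that an invalid output can never reach a partner as a band: it either validates within two attempts or becomes a manual-review item.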
clause-risk output · sample MSA

What the partner sees

Six clauses from a sample MSA, shown with the risk band, rationale, policy citation, precedent ids, and (where present) the agent's suggested redline that partners see in the firm's tooling.

sample MSA · vendor agreement · 6 of 102 clauses shown
  • 2 acceptable
  • 2 negotiate
  • 1 block
  • 1 manual
  • 102 total clauses
  1. § 3.1 Term & renewal: auto-renewal with 60-day notice · band: acceptable

    Auto-renewal with 60-day opt-out notice matches the firm's standard MSA playbook for the vendor practice. No deviation flagged.

    policy citation
    policy_MA-104
    precedents
    prec_msa-2024-118 · prec_msa-2024-217
  2. § 7.2 Indemnification: IP infringement (third-party claims) · band: negotiate

    Counterparty's draft caps IP-infringement indemnity at 1× annual fees and excludes injunctive relief. House playbook requires uncapped IP-infringement indemnity plus injunctive coverage for the licensed product.

    policy citation
    policy_IP-014
    precedents
    prec_msa-2024-088 · prec_msa-2025-031 · prec_msa-2025-104
    suggested redline
    Replace cap with: `Vendor shall indemnify, defend and hold harmless Customer from any third-party IP-infringement claim arising from the Service, without cap. Injunctive relief is included within the scope of this indemnity.`
  3. § 9.4 Limitation of liability: consequential-damages waiver · band: block

    Mutual consequential-damages waiver is acceptable as a category; however, this draft also waives consequential damages for breach of confidentiality and data-protection obligations. The firm's policy carves both out of any waiver. Partner notification required before send.

    policy citation
    policy_MA-203
    precedents
    prec_msa-2024-141 · prec_msa-2025-052
    suggested redline
    Add carve-out: `Nothing in this Section limits liability for breach of Sections 11 (Confidentiality) or 12 (Data Protection), or for indemnification obligations under Section 7.`
  4. § 14.5 Governing law: Delaware (with international arbitration) · band: acceptable

    Delaware governing law + ICC international arbitration seat in Singapore matches the firm's cross-border SaaS playbook. No deviation flagged.

    policy citation
    policy_MA-307
    precedents
    prec_msa-2024-202
  5. § 16.2 Data residency: multi-region with carve-out for AI training · band: manual review

    Clause contains a sub-paragraph permitting vendor to use customer data for `model improvement` outside the named regions. This is a novel clause shape not present in any of the 1,420 reconciled-library reference clauses; agent refuses to issue a band and routes to manual review.

    policy citation
    policy_IP-021
    precedents
  6. § 19.1 Assignment: change-of-control consent · band: negotiate

    Draft permits assignment to any affiliate without consent. House playbook requires written consent for any assignment outside the same parent group, plus a 30-day cure period for change-of-control events.

    policy citation
    policy_MA-118
    precedents
    prec_msa-2025-019 · prec_msa-2025-077
    suggested redline
    Replace with: `Neither party may assign this Agreement without the other party's prior written consent, which shall not be unreasonably withheld. Change of control triggers a 30-day cure window.`

sample only — anchor_ids, policy_ids, and precedent_ids are illustrative and follow the shape the schema enforces · live surface renders all clauses, not just the 6 shown

ai contract review unit economics

Per-MSA and monthly cost math

| line item | $ / MSA | $ / month (≈ 30 MSAs) | note |
|---|---|---|---|
| Claude Sonnet 4.6 input tokens | $0.857 | $25.71 | 102 clauses × 2,800 tokens × $3.00 / 1M |
| Claude Sonnet 4.6 output tokens | $0.795 | $23.86 | 102 clauses × 520 tokens × $15.00 / 1M |
| voyage-3-large embeddings (per clause) | $0.037 | $1.10 | 102 clauses × ≈ 3,000 tokens × $0.12 / 1M |
| pgvector + RDS db.m6i.large | fixed | $284 | Postgres 16 in firm tenant · clause library + policy index |
| g5.xlarge reranker (24/7) | fixed | $378 | BAAI bge-reranker-large self-host · Cohere fallback wired |
| LangChain · LangGraph runtime | fixed | $94 | Python on Fargate · 2 vCPU · per-clause parallelism = 12 |
| Langfuse self-hosted (t3.medium) | fixed | $67 | trace store · 90-day hot / 7-yr cold |
| iManage Work connector | fixed | $0 | uses firm's existing iManage Cloud seat |
| All-in monthly (≈ 30 MSAs) | ≈ $1.69 | ≈ $874 | vs. ≈ 200 partner hours saved at firm rates |

Token costs use Anthropic's published May-2026 Sonnet 4.6 pricing ($3 / 1M input, $15 / 1M output). Infra costs are AWS us-east-2 list price (firm's tenant). Per-MSA token cost assumes the median 102-clause MSA observed in the eval set; range across the 180 production MSAs in the 6-month sample is $0.94 (62 clauses) to $2.83 (164 clauses). Payback period from go-live, including the 9-week build at $215k, is ≈ 4.4 months at the firm's published blended partner rate against the ≈ 71% time saved on partner-signed-off MSAs.
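The per-MSA and monthly figures follow from the token counts and prices in the table. A minimal sketch of that arithmetic, using only the table's numbers (constant and function names are illustrative):

```typescript
// USD per 1M tokens, from the cost table.
const PRICE_PER_MILLION = { sonnetInput: 3.0, sonnetOutput: 15.0, embedding: 0.12 };

// Average tokens per clause, from the cost table.
const TOKENS_PER_CLAUSE = { sonnetInput: 2800, sonnetOutput: 520, embedding: 3000 };

// Fixed monthly infrastructure in USD: RDS, reranker GPU, runtime, Langfuse, iManage connector.
const FIXED_MONTHLY_USD = 284 + 378 + 94 + 67 + 0;

// Variable (token) cost for one MSA with the given number of clauses.
function tokenCostPerMsa(clauses: number): number {
  const perClause =
    (TOKENS_PER_CLAUSE.sonnetInput * PRICE_PER_MILLION.sonnetInput +
      TOKENS_PER_CLAUSE.sonnetOutput * PRICE_PER_MILLION.sonnetOutput +
      TOKENS_PER_CLAUSE.embedding * PRICE_PER_MILLION.embedding) / 1_000_000;
  return clauses * perClause;
}

// All-in monthly cost: fixed infrastructure plus token cost for every MSA.
function monthlyCost(msasPerMonth: number, clausesPerMsa: number): number {
  return FIXED_MONTHLY_USD + msasPerMonth * tokenCostPerMsa(clausesPerMsa);
}

console.log(tokenCostPerMsa(102).toFixed(2), Math.round(monthlyCost(30, 102)));
```

For 102 clauses this gives about $1.69 per MSA and about $874 per month at 30 MSAs. The sketch scales average token counts linearly with clause count, so it will not reproduce the quoted $0.94 to $2.83 range exactly; that range is reported from the production sample.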

ai contract review eval composition · contract analysis ai rubric

What's in the frozen 740-item clause-eval set

| category | items | what it checks | ci-gate threshold |
|---|---|---|---|
| M&A clauses (golds) | 320 | labelled risk band + policy_id + suggested redline · senior-counsel signed | ≥ 0.88 band precision · ≥ 0.90 policy accuracy |
| Employment clauses (golds) | 180 | labelled risk band + policy_id · employment senior-counsel signed | ≥ 0.88 band precision |
| Real estate clauses (golds) | 140 | labelled risk band + policy_id · real estate senior-counsel signed | ≥ 0.88 band precision |
| IP clauses (golds) | 100 | labelled risk band + policy_id · IP senior-counsel signed | ≥ 0.88 band precision |
| Block-clause must-catch | see note | subset across all 4 groups · catch every hard-no clause (must) | ≥ 0.95 block recall |
| Manual-review (novel patterns) | see note | deliberately novel clauses · agent must refuse, not guess | 100% refusal on listed must-refuse |

Eval set is frozen (items only added, never edited). Senior counsel from the relevant practice group signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Block-catch and manual-review subsets are sub-categories overlapping the 740-item count.
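The release gate described here combines two checks: an absolute threshold per metric and a maximum drop of 1 point against the prior cut. A sketch of that logic follows, with hypothetical metric keys and without the signed-CHANGELOG override, which is a human step.

```typescript
interface Gate {
  metric: string;    // hypothetical key, e.g. "band_precision"
  threshold: number; // minimum acceptable score on a 0..1 scale
}

type Scores = Record<string, number>;

// Returns human-readable failures; an empty list means the release may ship.
function gateRelease(gates: Gate[], prior: Scores, current: Scores, maxDrop = 0.01): string[] {
  const EPSILON = 1e-9; // guards against float noise on an exact 1-point drop
  const failures: string[] = [];
  for (const gate of gates) {
    const now = current[gate.metric];
    if (now === undefined) {
      failures.push(gate.metric + ": missing from current run");
      continue;
    }
    if (now < gate.threshold - EPSILON) {
      failures.push(gate.metric + ": " + now + " is below threshold " + gate.threshold);
    }
    const before = prior[gate.metric];
    if (before !== undefined && before - now > maxDrop + EPSILON) {
      failures.push(gate.metric + ": dropped more than 1 point from " + before + " to " + now);
    }
  }
  return failures;
}
```

Note that a metric can fail the gate while still above its threshold: block recall falling from 0.97 to 0.95 meets the ≥ 0.95 target but is a 2-point drop.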

production ops cadence · ai contract review software in the firm tenant

What stays calibrated and how

| cadence | what runs | who owns | compliance fit |
|---|---|---|---|
| Weekly | Override-review meeting: every clause the partner overrode the agent on | Senior counsel (one per practice group) + on-call engineer | Patterns showing 3× become eval-set additions |
| Monthly | Audit-log sample: model version, retrieved candidates from both lanes, reranker scores, policy_id, partner override | Firm's IT director | ABA Op. 512 reasonableness · FRE 502(b) inadvertent-disclosure |
| Per release | Full 740-item clause-eval re-run against the frozen set | Engineering on-call | CI fails the release if any metric drops > 1pt |
| Retention | Langfuse traces: 90-day hot in firm tenant + 7-year cold in tenant-scoped S3 | Firm IT (we do not hold the keys) | Matches the firm's privilege retention policy |
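The weekly rule that a pattern seen three times becomes an eval-set addition is easy to support with a tally. The sketch below assumes each override has already been tagged with a `patternKey` by the reviewing counsel; that tagging, the key format, and the function name are all hypothetical, and the decision to add an item still rests with senior counsel.

```typescript
interface OverrideEvent {
  clauseId: string;
  patternKey: string; // hypothetical label assigned in the override-review meeting
}

// Returns the pattern keys that have reached the promotion count, sorted for stable output.
function patternsToPromote(events: OverrideEvent[], minCount = 3): string[] {
  const counts = new Map<string, number>();
  for (const event of events) {
    counts.set(event.patternKey, (counts.get(event.patternKey) ?? 0) + 1);
  }
  const promoted: string[] = [];
  counts.forEach((count, key) => {
    if (count >= minCount) promoted.push(key);
  });
  return promoted.sort();
}
```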

None of this is published anywhere else by anyone shipping legal agents. That's the bar.

ai legal assistant build · 9 weeks · honest version

The timeline
including the two weeks we paused.

Five stages, milestone-billed. The week-4 build turned up a clause-library drift problem: the M&A and real estate playbooks contradicted each other on the same fact pattern, and an AI contract review agent trained on the drift would inherit it. We halted the build for a 2-week reconciliation pass with senior counsel from each practice group, then resumed. The honest version of 9 weeks is 7 weeks of build plus 2 weeks of pause.

  1. Weeks 1–2

    Discovery + eval set

    Two weeks shadowing partners across the four practice groups. The managing partner of each group sat in the design council. We sampled 60 MSAs from the prior 18 months, anonymized them, and the four senior partners labelled each clause with the correct risk band + policy citation + suggested redline. That sample became the frozen 740-item clause-eval set: 320 from M&A, 180 from employment, 140 from real estate, and 100 from IP.

    Frozen 740-item eval set + per-practice-group rubric
  2. Week 3

    Clause library + dual-index build

    Ingested each practice group's existing clause library (1,840 reference clauses across the four groups) into pgvector 0.7 with embedding via voyage-3-large at 1,024 dimensions. Built the Postgres tsvector BM25 sidecar over the same corpus. RRF fusion tuned on a held-out eval slice; cross-encoder rerank A/B-tested between bge-reranker-large and Cohere Rerank v3. bge won on the legal corpus by ≈ 3 points top-1 precision; Cohere stayed wired as a fallback.

    Hybrid retrieval at 0.92 recall@5 across all four practice corpora
  3. Week 4

    Clause-library drift — paused for reconciliation

    Building the per-clause review chain in LangGraph turned up a structural problem: M&A's standard indemnification language and real estate's contradicted each other on the same fact pattern (joint-and-several vs several-only for sub-tenancy indemnities). Employment and IP had two more such contradictions. An agent trained on the drift would inherit it. We halted the build, convened a 2-week reconciliation pass with senior counsel from each practice group, and produced a single reconciled clause library: 1,420 clauses after dedupe and reconciliation, down from 1,840. Cost two weeks of wall-clock; bought eighteen months of build-on-firm-ground.

    Reconciled clause library · 1,420 unique reference clauses · sign-off from all 4 practice groups
    Walk-away point
  4. Weeks 5–7

    LangChain agent + forced-JSON clause-risk model

    LangChain 0.3 orchestrator wraps the LangGraph clause-by-clause chain. Claude Sonnet 4.6 with `response_format: json_schema` set to the ClauseRisk shape. The schema is the contract: every flagged clause has to name a policy_id matching the regex; precedent_ids are bounded 0–8; suggested_redline is optional and only renders if the partner expands the clause card. Confidence < 0.8 routes to manual-review; the agent never produces an autonomous redline.

    End-to-end review pipeline behind a partner-only beta flag
  5. Weeks 8–9

    Shadow cutover + partner-override review

    Promoted to first-pass review with partners running in parallel for the first 6 weeks (every MSA reviewed both by the agent and by the partner; outputs compared in Langfuse). After week 6 of shadow the metric held: partner-override rate fell to 9.2% from a baseline 14% in week 1, and the firm cut over to agent-first first-pass with partner final-pass. The override-review meeting runs weekly with senior counsel from each group; patterns that show up three times become eval-set additions.

    Production cutover · partner-override-review cadence locked
ai contract review eval results · 740 frozen clause items

AI contract review eval — how we know
it works.

The AI contract review eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 740. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live partner-shadow numbers are within ±1.5 points across all rows over the last 30 days.

| metric | baseline (pre-build) | v1 (wk 5) | v2 (wk 7 post-reconciliation) | current (live, wk 36) | target |
|---|---|---|---|---|---|
| Clause-risk band precision (4-class) | n/a | 0.84 | 0.86 | 0.91 | ≥ 0.88 |
| Block-clause recall (catch all hard-no) | n/a | 0.92 | 0.95 | 0.97 | ≥ 0.95 |
| Policy-citation accuracy (cited the right policy) | n/a | 0.79 | 0.88 | 0.93 | ≥ 0.90 |
| Partner-override rate (live shadow) | 14.0% | 12.4% | 10.1% | 9.2% | ≤ 12% |
| Manual-review refusal rate (by design) | n/a | 8.4% | 11.6% | 12.0% | 10–14% |
| P95 wall-clock per MSA (full report) | n/a | 78s | 68s | 62s | ≤ 90s |

Sample size for the headline time-saved number (≈ 71% first-pass MSA review time saved) is n=180 partner-signed-off MSAs across a 6-month rolling window; the figure is a 95% confidence interval, not a point estimate. Partner-override rate is the share of clauses where the partner overrode the agent's risk band on the live shadow slice (by design, not by failure). Manual-review refusal rate is the share of clauses the agent declines to band (novel patterns, score-margin failures, off-corpus clauses) and routes straight to a partner, also by design. Latency is end-to-end MSA wall-clock from upload to full clause-risk report, measured at the agent boundary.
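Two of the metrics defined here reduce to simple counts over labelled items. The sketch below shows block-clause recall and the manual-review refusal rate, assuming each eval item pairs a gold band with the agent's predicted band; the `ScoredClause` shape and function names are illustrative.

```typescript
type Band = "acceptable" | "negotiate" | "block" | "manual-review";

interface ScoredClause {
  gold: Band;      // band signed off by senior counsel
  predicted: Band; // band the agent produced
}

// Block recall: of the clauses counsel labelled "block", the share the agent also called "block".
function blockRecall(items: ScoredClause[]): number {
  const goldBlocks = items.filter((i) => i.gold === "block");
  if (goldBlocks.length === 0) throw new Error("no gold block clauses in this slice");
  const caught = goldBlocks.filter((i) => i.predicted === "block").length;
  return caught / goldBlocks.length;
}

// Refusal rate: share of all clauses the agent routed to manual review.
function refusalRate(items: ScoredClause[]): number {
  if (items.length === 0) throw new Error("empty slice");
  return items.filter((i) => i.predicted === "manual-review").length / items.length;
}

// The refusal rate is a two-sided target (10% to 14%), so both bounds are checked.
function refusalRateInBand(rate: number, low = 0.10, high = 0.14): boolean {
  return rate >= low && rate <= high;
}
```

Treating the refusal rate as a band rather than a ceiling matters: a rate that falls too low suggests the agent has started guessing on novel patterns.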

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

An AI contract review agent built on these patterns will mislead partners or counterparties in any of the following situations. We turn down the engagement before a pilot is scoped.

The clause library hasn't been reconciled

If practice groups maintain contradictory playbooks and senior counsel won't sit down for a reconciliation pass, the agent inherits the drift and produces confidently inconsistent calls. We will pause the build, as we did at week 4, until the reconciliation happens.

Partner sign-off cadence isn't real

If partners aren't going to review every redline before it ships to counterparties for the first six months, the agent's wrong-but-confident clause call lands in opposing counsel's inbox unedited. We require a partner-final-pass workflow at week 1 or the pilot doesn't get signed.

Privilege deployment isn't agreed at week 1

Vendor-side data retention, training opt-out, audit-log retention, model tier. These are not post-launch decisions. ABA Op. 512 and FRE 502(b) scope is decided at week 1 with the firm's compliance lead, or the engagement doesn't start.

Override-review meeting isn't on the calendar

The agent stays honest because senior counsel from each practice group walk through the disagreements weekly. If that meeting won't happen, calibration drifts within months and nobody catches it. The eval set is necessary, not sufficient.

frequently asked — ai contract review

What buyers ask first.
Real answers, no hedging.

What does an AI legal assistant actually do?
An AI legal assistant performs structured-output legal work under partner supervision: clause-by-clause contract review, policy citation against the firm's playbook, precedent retrieval, and refusal on novel patterns. It never sends anything to counterparties without partner sign-off.
How is AI contract review different from a CLM platform?
AI contract review is the decision layer (what risk band, which policy cited, suggested redline). A CLM platform is the workflow layer (intake, approval, signing). They are complementary; the build in this case study replaces the partner-time-intensive first-pass review step inside an existing iManage workflow.
How accurate is the AI contract review pipeline?
0.91 clause-risk band precision (4-class), 0.97 block-clause recall, 0.93 policy-citation accuracy, on the frozen 740-item clause-eval set. 9.2% partner-override rate on the live shadow slice, down from a 14% baseline in week 1.
What does AI contract review cost to run?
About $1.69 per MSA in token + infra cost (Claude Sonnet 4.6 + voyage-3-large + pgvector + reranker + Langfuse), or roughly $874 per month at 30 MSAs. Token costs use Anthropic's published Sonnet 4.6 pricing. Payback including the 9-week build is about 4.4 months at the firm's blended partner rate.
How long does it take to build an AI legal assistant?
9 weeks for this engagement: 2 weeks discovery + eval-set freeze, 1 week retrieval build, 2 weeks paused for clause-library reconciliation, 3 weeks agent build + forced-JSON contract, 2 weeks shadow cutover. The 2-week pause was the load-bearing decision.
Is AI contract review ABA Op. 512 compliant?
AI contract review pipelines can be ABA Op. 512 compliant when the firm controls the audit log, retains traces in its own tenant, agrees model + retention scope at week 1, and runs partner-final-pass on every output. This case study's deployment passed the firm's general counsel review against ABA Op. 512 reasonableness duties.
What is the difference between AI contract review software and an AI contract review tool?
A tool is something you buy (Ironclad, Spellbook, Harvey) — its eval set, refusal lane, and policy schema are the vendor's. Software is something built around your firm — eval set, policies, refusal lane, and stack are yours and stay in your repo. This case study is the second shape.
When should a firm NOT ship AI contract review?
Four cases: clause library is not reconciled across practice groups; partners will not sign off every redline for the first six months; privilege deployment (retention, BAA-eligible model tier) is not agreed at week 1; weekly override-review meeting is not on the calendar. We turn down engagements that fail any of these.
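
The structured-output contract mentioned in the first answer (a risk band with cited policy IDs, or an explicit refusal) can be sketched as a small validator. This is an illustration under stated assumptions, not the firm's schema: the field names, the band names other than "block", and the policy ID format are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical 4-class band names; only "block" (hard-no) is implied by the eval table.
RISK_BANDS = {"low", "medium", "high", "block"}


@dataclass(frozen=True)
class ClauseRisk:
    clause_id: str
    risk_band: Optional[str]        # None means the agent refused to band
    policy_ids: Tuple[str, ...]     # cited playbook policies
    refusal_reason: Optional[str]   # required when risk_band is None


def parse_clause_risk(raw: str) -> ClauseRisk:
    """Parse one model output and reject anything outside the contract."""
    data = json.loads(raw)
    if not isinstance(data, dict) or "clause_id" not in data:
        raise ValueError("record must be a JSON object with a clause_id")
    band = data.get("risk_band")
    policy_ids = tuple(data.get("policy_ids") or ())
    reason = data.get("refusal_reason")
    if band is None:
        if not reason:
            raise ValueError("a refusal must state a reason")
    elif band not in RISK_BANDS:
        raise ValueError(f"unknown risk band: {band!r}")
    elif not policy_ids:
        raise ValueError("a banded clause must cite at least one policy ID")
    return ClauseRisk(str(data["clause_id"]), band, policy_ids, reason)


banded = parse_clause_risk(
    '{"clause_id": "c1", "risk_band": "block", "policy_ids": ["POL-017"]}'
)
refused = parse_clause_risk(
    '{"clause_id": "c2", "risk_band": null, "refusal_reason": "novel pattern"}'
)
print(banded)
print(refused)
```

The point of the two-branch check is that a record is either a band backed by at least one citation or a refusal with a stated reason; an uncited band is rejected rather than passed to a partner.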
keep reading

Where this case study
points back to.

Each link below covers a pillar that fed into this AI legal assistant / AI contract review software build, or that a similar legal AI software build on your stack would draw from.

01 Industry

Legal AI Development

The legal pillar — privilege-aware AI for law firms, FRE 502 audit-log scaffolding, ABA Op. 512 citation-chain logging across practice groups.

02 Service

Intelligent Document Processing

The IDP pillar — multi-modal extraction, schema-validated outputs, confidence routing across document types. The plumbing AI contract review software sits on top of.

03 Service

AI Agent Development

The agent pillar — ReAct, plan-and-execute, hierarchical multi-agent recipes. Same eval-first loop used on this AI legal assistant build.

04 Service

Claude Development

Sonnet 4.6 + Haiku 4.5 integration patterns the AI-powered contract review above uses. Forced JSON, Constitutional-AI posture, BAA-eligible deployment options.

05 Case study

All AI Case Studies

Six AI case studies — AI legal assistant, AI fraud detection, AI knowledge base, AI triage, voice bot, ecommerce chatbot. Same operator detail across every page.

06 Service

AI Governance

Policy-as-code, audit-log scaffolding, privilege-aware deployment. The plumbing that made this contract review software pilot pass the firm's risk review.

07 Service

Custom AI Development

How a contract-review RAG fits inside a broader AI software development company engagement — retrieval + clause classifier + Sonnet reviewer + matter integration.

08 Service

AI Knowledge Base

Contract-review RAG is a domain-specific AI knowledge base over a matter-precedent corpus, with the same chunking, reranking, and eval harness as our product-docs and clinical RAG builds.

09 Service

AI Consulting

The contract-review build started with a fixed-fee audit — clause-library inventory, privilege-ring mapping, eval-set design on real MSAs, and a per-document cost projection before pilot.

Ready to ship

Want an AI legal assistant like this
for your firm's contract review?

Book a fixed-fee AI contract review audit. We'll review the clause library, scope the eval set, recommend an AI-powered contract review recipe (model + retrieval + refusal contract), project run-cost, and tell you honestly whether contract review software built around your playbook is the right shape, or whether you should buy a vendor AI contract review tool. About one audit in five ends with "you need a reconciliation pass before any agent build; here's the SOW for that."

Read the legal pillar
30 min, async or live · Eval-first scoping · Walk-away point in the pilot
Updated May 20, 2026 · By Navin Sharma