Legal · Mid-market US law firm AI legal assistant · AI contract review software · forced-JSON clause risk

Claude Sonnet 4.6 · LangChain 0.3 · LangGraph 0.2 · pgvector 0.7 · iManage Work

ai legal assistant · ai contract review case study · rag case study · 2026 · anonymized

An AI legal assistant for
first-pass contract review.

Q: What does an AI legal assistant actually do?

An AI legal assistant performs structured-output legal work under partner supervision: clause-by-clause contract review, policy citation against the firm's playbook, precedent retrieval, and refusal on novel patterns. It never sends to counterparties without partner sign-off.

Q: How is AI contract review different from a CLM platform?

AI contract review is the decision layer (what risk band, which policy cited, suggested redline). A CLM platform is the workflow layer (intake, approval, signing). They are complementary; this case study replaces the partner-time-intensive review step inside an existing iManage workflow.

Q: How accurate is the AI contract review pipeline?

0.91 clause-risk band precision (4-class), 0.97 block-clause recall, 0.93 policy-citation accuracy, on the frozen 740-item clause-eval set. 9.2% partner-override rate on the live shadow slice, down from a 14% baseline in week 1.

Q: What does AI contract review cost to run?

About $1.69 per MSA in variable token and embedding cost (Claude Sonnet 4.6 + voyage-3-large), or roughly $874 per month all-in at 30 MSAs once the fixed infra (pgvector, reranker, LangChain/LangGraph runtime, Langfuse) is included. Token costs use Anthropic's published Sonnet 4.6 pricing. Payback including the 9-week build is about 4.4 months at the firm's blended partner rate.

Q: How long does it take to build an AI legal assistant?

9 weeks for this engagement: 2 weeks discovery + eval-set freeze, 1 week retrieval build, 2 weeks paused for clause-library reconciliation, 3 weeks agent build + forced-JSON contract, 2 weeks shadow cutover. The 2-week pause was the load-bearing decision.

Q: Is AI contract review ABA Op. 512 compliant?

AI contract review pipelines can be ABA Op. 512 compliant when the firm controls the audit log, retains traces in its own tenant, agrees model + retention scope at week 1, and runs partner-final-pass on every output. This case study's deployment passed the firm's general counsel review against ABA Op. 512 reasonableness duties.

Q: What is the difference between AI contract review software and an AI contract review tool?

A tool is something you buy (Ironclad, Spellbook, Harvey) — its eval set, refusal lane, and policy schema are the vendor's. Software is something built around your firm — eval set, policies, refusal lane, and stack are yours and stay in your repo. This case study is the second shape.

Q: When should a firm NOT ship AI contract review?

Four cases: clause library is not reconciled across practice groups; partners will not sign off every redline for the first six months; privilege deployment (retention, BAA-eligible model tier) is not agreed at week 1; weekly override-review meeting is not on the calendar. We turn down engagements that fail any of these.

A US-based mid-market law firm with four practice groups needed an AI legal assistant that could do first-pass MSA review at partner-trust quality: split a contract into clauses, retrieve matching policies and precedents from a reconciled clause library, flag clause risk against the firm's playbook with cited policy IDs, and refuse out loud on novel patterns. We built the AI contract review software on Claude Sonnet 4.6, LangChain 0.3, LangGraph 0.2, a hybrid pgvector and BM25 index, and a forced-JSON clause-risk schema. The result is AI-powered contract review the general counsel could defend on an ABA Op. 512 audit. The build took nine weeks, partner-shadow-first, with a clause-library drift kill point at week 4 that we paused the build for.

≈ 71%

first-pass MSA review time saved · partner-signed-off (95% CI · n=180 MSAs across 6 months)

p95 62s

MSA wall-clock to full clause-risk report · meets <90s service target

740

frozen clause-eval items · re-run on every release

9 weeks

discovery to production cutover

shipped

9 weeks · 4 engineers · 4 senior counsel (one per practice group)

Summary

What this case study shows

A US mid-market law firm shipped a first-pass MSA review agent on Claude Sonnet 4.6 plus a LangChain orchestrator over a reconciled clause library of 1,420 master-service-agreement clauses. Across n=180 MSAs (95% CI), partner time per first-pass review dropped 71% with partner sign-off on every output. Stack: Claude Sonnet 4.6, LangChain 0.3, LangGraph 0.2, pgvector 0.7, BAAI bge-reranker-large, Langfuse, iManage Work. Compliance: attorney-client privilege, ABA Model Rules of Professional Conduct, ABA Formal Opinion 512 on generative AI tools, FRE 502. Multi-month ongoing engagement. The clause-extraction layer underneath is the same engagement shape we ship as a standalone document processing pipeline when the extraction itself is the primary surface.

6–9 hrs

partner first-pass review per MSA · pre-build

180 / yr

MSAs flowing through the 4 practice groups

4 practice groups · M&A · employment · real estate · IP

11%

post-execution disputes traced to inconsistent first-pass clause calls

the problem

Four practice groups,
four contradictory playbooks.

A US-based mid-market law firm — ~70 attorneys, 4 practice groups (M&A, employment, real estate, IP), ~180 MSAs a year. Too small to fund Ironclad/Spellbook seats firm-wide; too large for the managing partner to hand-review every contract. Partners spent 6–9 hours per MSA on first-pass review. The pre-build process was partner-by-partner, playbook-by-playbook, and the playbooks had quietly drifted apart.

today vs. with the agent

today: MSA arrives → Partner reads cover-to-cover → Cross-checks against 4 playbooks (drift across practice groups) → Marks up redline by hand

outcome: 6–9 hr first-pass per MSA · inconsistent calls across partners · 11% downstream dispute rate

with the agent: MSA upload → Semantic clause split → Hybrid clause-RAG + policy lookup (reconciled clause library) → Sonnet 4.6 clause-risk JSON (policy_id + precedent_ids enforced)

outcomes: Acceptable · partner skim | Negotiate · redline drafted | Block / manual review

Two failure modes. Wall-clock: partners burning 6–9 hours per MSA on first-pass review, cross-checking each clause against their practice-group playbook by hand. Quality cost of the drift: a general counsel audit traced 11% of post-execution disputes back to inconsistent first-pass clause calls — same fact pattern, opposite playbook calls between practice groups. Ironclad, Spellbook, and three smaller LegalTech vendors had all been evaluated and turned down. The conversation we walked into was not "should we ship LangChain" — it was "show us how a clause-risk agent could miss a hard-no, and tell us how you'd catch it before a partner sends the redline to opposing counsel."

pre-build · the binding constraints

partner time / MSA: 6–9 hrs

MSAs / year: ~180

dispute rate from drift: 11%

vendor tools accepted: 0 / 5

Vendor turn-downs detailed below. The four objections decided the engagement shape.

why the firm rejected every ai contract review tool vendor it evaluated

design decision · 01

No reconciled clause library

we rejected: Vendor's pre-trained legal model
because: Vendor tools train on a generic legal corpus. The firm's four practice groups had drifted apart on standard clauses — an agent trained on the drift would inherit it. We required reconciliation as a first-class deliverable, not a pre-trained black box.

design decision · 02

No policy_id citation contract

we rejected: Free-text rationale per clause
because: Every flagged clause must cite a policy_id (regex-validated) that resolves to a live policy document. Partners verify the policy, not the model. No vendor we evaluated surfaced this in the output schema.

design decision · 03

No published eval methodology

we rejected: Vendor's headline accuracy numbers
because: We require an eval set the firm owns, scored by senior counsel, frozen between releases. Vendors guard their eval shapes; the firm can't verify accuracy claims against the firm's own clause distribution.

design decision · 04

No first-class refusal lane

we rejected: Default-to-confident output on every clause
because: On novel patterns the agent must refuse, not guess. Manual-review is a routing lane, not an error state. Vendors framed novelty as failure; we framed it as design.

The thing that scares us is not the obvious miss — we can write a rule for an uncapped indemnity. What scares us is a confident negotiate-band call on a clause where the right answer was block, because the agent didn't know the M&A and real estate playbooks contradict each other on that fact pattern. Show us how you measure that, or we're not signing.

General Counsel · Mid-market US law firm · 4 practice groups

the approach · ai legal assistant pipeline · legal rag architecture

AI contract review pipeline — six stages,
four risk bands.

iManage matter-scoped pull, semantic clause splitter with §N.M boundaries, two parallel retrieval lanes (clause-RAG over the reconciled 1,420-clause library + practice-group-aware policy lookup with regex-validated IDs), bge-reranker-large (A/B winner against Cohere Rerank v3), Claude Sonnet 4.6 with forced-JSON ClauseRisk schema. Zero write tools — the agent produces redlines; partners send them. Diagram below.
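As a rough illustration of the clause-splitting stage, the sketch below cuts a contract at lines that open with an N.M section number (optionally prefixed with §) and merges the following lines into that clause. The boundary regex, the `SplitClause` shape, and the function name are assumptions made for this example; the production splitter is heading-aware and also merges sub-clauses, which this sketch does not attempt.

```typescript
// Illustrative clause splitter. The regex and types are assumptions,
// not the production implementation.
interface SplitClause {
  id: string;   // section number, e.g. "7.2"
  text: string; // heading line plus any body lines that follow it
}

// A clause starts at a line that opens with an optional "§" and an N.M number.
const CLAUSE_BOUNDARY = /^\s*§?\s*(\d+\.\d+)\b/;

function splitClauses(contract: string): SplitClause[] {
  const result: SplitClause[] = [];
  let current: SplitClause | null = null;
  for (const line of contract.split("\n")) {
    const match = CLAUSE_BOUNDARY.exec(line);
    if (match) {
      // A new boundary closes the clause that was being collected.
      if (current) result.push(current);
      current = { id: String(match[1]), text: line.trim() };
    } else if (current && line.trim() !== "") {
      // Non-heading, non-empty lines are merged into the open clause.
      current.text += "\n" + line.trim();
    }
  }
  if (current) result.push(current);
  return result;
}

const sampleContract = [
  "§ 3.1 Term & renewal",
  "Auto-renewal with 60-day notice.",
  "§ 7.2 Indemnification",
  "Vendor shall indemnify Customer.",
].join("\n");

const splitResult = splitClauses(sampleContract);
```

A line-anchored pattern like this would mis-split body text that happens to start with a decimal number, which is one reason a production splitter needs heading awareness.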

three decisions that shaped the ai legal assistant build

design decision · 01

Forced-JSON clause-risk schema with policy-id regex

we rejected: Free-text redline summary
because: Every flagged clause has to cite a policy_id matching the regex policy_(practice-group)-(NNN). The Zod validator is the contract; the model can't suggest a redline without naming the policy it traces to. Partners check the policy, not the model.

design decision · 02

Four risk bands · acceptable / negotiate / block / manual-review

we rejected: Continuous risk score 0–1
because: Partners read bands, not scores. A 0.71 risk score is harder to defend in front of a client than `negotiate per policy IP-014`. The band-based output also gates queue routing: block clauses page the supervising partner; manual-review clauses leave a human-only marker on the redline.

design decision · 03

Reconcile the 4 clause libraries before any agent build

we rejected: Train per-practice-group models on each library as-is
because: Week 4 of the build found that M&A and real estate had contradictory standard indemnification clauses (same fact pattern, opposite playbook calls). An agent trained on the drift would inherit it. We paused the build for a 2-week reconciliation pass with senior counsel from each group, then resumed.
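The band-based routing in decision 02 can be sketched as a small lookup. The queue names and the `BandRouting` shape are hypothetical. Only the behaviour comes from the text: block clauses page the supervising partner, and manual-review clauses carry a human-only marker.

```typescript
// Sketch of band-to-queue routing. Queue names are hypothetical.
type RiskBand = "acceptable" | "negotiate" | "block" | "manual-review";

interface BandRouting {
  queue: "partner-skim" | "redline-review" | "partner-page" | "human-only";
  pagePartner: boolean;      // block clauses page the supervising partner
  humanOnlyMarker: boolean;  // manual-review clauses are marked human-only
}

function routeByBand(band: RiskBand): BandRouting {
  switch (band) {
    case "acceptable":
      return { queue: "partner-skim", pagePartner: false, humanOnlyMarker: false };
    case "negotiate":
      return { queue: "redline-review", pagePartner: false, humanOnlyMarker: false };
    case "block":
      return { queue: "partner-page", pagePartner: true, humanOnlyMarker: false };
    case "manual-review":
      return { queue: "human-only", pagePartner: false, humanOnlyMarker: true };
  }
}

const blockRouting = routeByBand("block");
const manualRouting = routeByBand("manual-review");
```

Because the band is a closed union rather than a score, the compiler can check that every band has a routing rule.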

why this shape works

Every component has a
separately measurable contract.

When something regresses, the per-component metric tells us which stage broke. No single end-to-end number that hides which subsystem moved.

Clause-risk decision model

Labelled risk-band correctness + policy-citation accuracy on the frozen 740-item eval. Forced-JSON schema enforces a regex-validated policy_id on every rationale claim — the model cannot suggest a redline without naming the policy it traces to.

Clause-RAG retrieval

Recall@5 on the reconciled clause library. RRF fusion + bge-reranker-large tuned against this number, not end-to-end accuracy.

Policy-lookup index

Citation-accuracy against senior-counsel ground truth. Practice-group-aware: IP clauses retrieve IP policies first.

Reranker bake-off

bge-reranker-large self-host vs Cohere Rerank v3, scored on top-1 precision over the legal corpus.

Manual-review lane

Refusal rate tracked against a 10–14% target band. Refusal is by design, not a failure mode.

Partner-override audit

Every disagreement opened in the weekly override-review meeting.
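To make the retrieval metric concrete, here is one way Recall@5 could be computed. The source does not spell out its exact definition, so this sketch assumes the common form: for each eval case, the fraction of counsel-labelled relevant clause ids that appear in the top k results, averaged over cases. The `RetrievalCase` shape and the ids are illustrative.

```typescript
// Recall@k sketch. Definition and data shapes are assumptions for illustration.
interface RetrievalCase {
  retrieved: string[]; // ranked clause ids returned by the retriever
  relevant: string[];  // ids labelled as correct by senior counsel
}

function recallAtK(cases: RetrievalCase[], k: number): number {
  let scored = 0;
  let sum = 0;
  for (const c of cases) {
    if (c.relevant.length === 0) continue; // nothing to recall, skip the case
    const topK = c.retrieved.slice(0, k);
    const hits = c.relevant.filter((id) => topK.indexOf(id) !== -1).length;
    sum += hits / c.relevant.length;
    scored += 1;
  }
  return scored === 0 ? 0 : sum / scored;
}

const retrievalCases: RetrievalCase[] = [
  // One of two relevant ids is inside the top 5, so this case scores 0.5.
  { retrieved: ["c1", "c2", "c3", "c4", "c5", "c6"], relevant: ["c1", "c6"] },
  // The single relevant id is inside the top 5, so this case scores 1.
  { retrieved: ["c9", "c3"], relevant: ["c3"] },
];

const recallAt5 = recallAtK(retrievalCases, 5);
```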

under the hood

The contract-review chain,
clause by clause.

Every clause runs the same six-stage chain. Two retrieval lanes fan out at stage 4 (house policy and the reconciled clause library), then converge on a forced-JSON clause-risk model. The four risk bands at the bottom are the only permitted outputs.

risk · band 1 · Acceptable: matches house playbook · partner skim only · ≈ 41% of clauses

risk · band 2 · Negotiate: redline drafted with policy citation · ≈ 38%

risk · band 3 · Block: violates a hard policy · partner notified before send · ≈ 9%

risk · band 4 · Manual review: novel pattern · agent refuses · ≈ 12%

stage latencies above are per-clause p50 / p95 · a typical MSA runs ≈ 80–140 clauses in parallel batches of 12 · end-to-end MSA wall-clock 38–62s

policy-cited

every flagged clause carries a policy_id (regex-enforced)

zero autonomous redlines · partner approves every send to the counterparty

4 senior counsel

in the reconciliation council · one per practice group

shadow-first

MSAs reviewed in parallel by agent + partner for the first 6 weeks post-cutover

the stack

AI contract review software stack — named tools,
named versions.

Everything in the build is a thing your IT director can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, schemas, and policies are all checked into the firm's repo. Cohere Rerank stays wired as a fallback so the rerank stage has a documented swap-out path.

Claude Sonnet 4.6 · Anthropic API · forced JSON

LangChain 0.3 · Python

LangGraph 0.2.x · Python

voyage-3-large · 1,024 dim

pgvector 0.7 · Postgres 16

BM25 (Postgres tsvector) · Postgres 16

BAAI bge-reranker-large · self-hosted g5.xlarge

Cohere Rerank v3 · A/B alternative

Langfuse v3 · self-hosted

iManage Work · Cloud · matter-scoped

how it actually runs

Production shape,
under the hood.

Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 pricing as of May 2026; eval composition is the frozen 740-item clause-eval set the CI gates on.

ai contract review latency budget

Per-clause P50 / P95 (ms)

stage	p50	p95	tooling
MSA intake + iManage pull	280	720	iManage Work · matter-scoped OAuth · document hashed + version-pinned
Semantic clause split	410	880	Heading-aware splitter · §N.M boundary + sub-clause merge
LangChain orchestrator step	36	88	LangGraph 0.2 chain · shared scratchpad · clause-by-clause
Clause-RAG retrieval + rerank	372	620	pgvector + tsvector RRF k=60 → bge-reranker-large top-6
Policy lookup (parallel)	22	58	Practice-group-aware regex-validated policy index
Sonnet 4.6 clause-risk decision	1600	2400	Anthropic API · response_format json_schema · ~2,800 in / ~520 out tokens
Validator + audit log	18	32	Zod schema · policy_id regex check · Langfuse trace write
Total (per-clause end-to-end)	2738	4798	agent boundary; MSAs batch 12 clauses in parallel for ≈ 62s wall-clock

p50/p95 from a 6-month rolling window over n ≈ 18,400 per-clause decisions (180 MSAs × ≈ 102 clauses average). SLO is p95 ≤ 90s wall-clock per MSA; current burn ≈ 69%.
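The batching and SLO arithmetic can be checked with a few lines. The helper names are hypothetical. The inputs (102 clauses, batches of 12, 62s observed p95 against a 90s target) are the figures quoted in this section.

```typescript
// Sketch of the batch split and SLO burn arithmetic. Helper names are illustrative.
function toBatches<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Share of the latency budget consumed: observed p95 divided by the SLO target.
function sloBurn(observedSeconds: number, targetSeconds: number): number {
  return observedSeconds / targetSeconds;
}

const medianMsaClauses = Array.from({ length: 102 }, (_, i) => `clause-${i + 1}`);
const clauseBatches = toBatches(medianMsaClauses, 12); // 8 full batches plus one of 6
const currentBurn = sloBurn(62, 90);                   // about 0.689
```

A median 102-clause MSA therefore runs as 9 batches, and the 62s p95 uses roughly 69% of the 90s budget.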

embedding model bake-off · held-out eval slice

Four-way Pareto check before locking the retrieval stack

model	recall@5	price tier	pick reason
voyage-3-large (1,024d)	0.92	$$	Pareto-best on recall, matched price tier — SHIPPED
OpenAI text-embedding-3-large	0.89	$$	Lost 3pts recall@5 on the legal corpus
Cohere embed-multilingual-v3	0.87	$$	Lost on legal-domain recall, tied on price
bge-large (self-hosted, fine-tuned)	0.90	$ (infra only)	Close on recall, lost on 24/7 GPU operational cost

Bake-off ran on a 60-clause held-out slice from the 740-item eval set. Same retrieval pipeline (pgvector + BM25 + RRF k=60) for every variant. We chose voyage-3-large after the bake-off. On the rerank stage, Cohere Rerank v3 stayed wired as a runtime fallback.
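For readers unfamiliar with the RRF step in that pipeline, the sketch below shows standard reciprocal rank fusion with k=60: each document scores the sum of 1 / (k + rank) across the ranked lists it appears in. The function name and the clause ids are made up for the example. In the real pipeline the two lists would come from the pgvector and tsvector queries.

```typescript
// Reciprocal rank fusion sketch. Ranks are 1-based; ids are illustrative.
function rrfFuse(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

const vectorHits = ["c7", "c2", "c9"]; // stand-in for the pgvector lane
const bm25Hits = ["c2", "c4", "c7"];   // stand-in for the BM25 lane
const fusedOrder = rrfFuse([vectorHits, bm25Hits]);
// c2 (1/62 + 1/61) edges out c7 (1/61 + 1/63); c4 (1/62) beats c9 (1/63).
```

RRF only needs ranks, not comparable scores, which is why it suits fusing a cosine-similarity lane with a BM25 lane before the cross-encoder rerank.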

contracts/schema/clause-risk.ts typescript

// contracts/schema/clause-risk.ts
// Forced-JSON clause-risk schema. Validated on every model output;
// if Sonnet produces something that doesn't parse, we retry once
// with a stricter system prompt, then fail closed (manual-review).

import { z } from "zod";

export const ClauseRisk = z.object({
  clause_id: z.string(),
  risk: z.enum(["acceptable", "negotiate", "block", "manual-review"]),
  rationale: z.array(z.object({
    claim: z.string().min(20).max(420),
    policy_citation: z.string().regex(/^policy_[A-Z]{2,4}-\d{3,5}$/),
    precedent_ids: z.array(z.string()).min(0).max(8),
  })).min(1),
  suggested_redline: z.string().optional(),
});

export type ClauseRisk = z.infer<typeof ClauseRisk>;

The clause-risk schema. Claude Sonnet 4.6 is called with response_format: json_schema, and the Zod validator runs on every output as a second check. Every flagged clause has to name a policy_id (the policy_citation field) matching the regex. If validation fails, the agent retries once with a stricter prompt, then routes the clause to manual review.
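The retry-once-then-fail-closed behaviour can be sketched without any dependencies. The hand-rolled checks below stand in for the Zod schema, and `generate` is a hypothetical stand-in for the model call (`strict` is true on the single retry). A real call would be asynchronous, but the sketch keeps it synchronous for brevity.

```typescript
// Dependency-free stand-in for the Zod ClauseRisk check plus retry logic.
const POLICY_ID_PATTERN = /^policy_[A-Z]{2,4}-\d{3,5}$/;
const ALLOWED_BANDS = ["acceptable", "negotiate", "block", "manual-review"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isValidRationale(value: unknown): boolean {
  if (!isRecord(value)) return false;
  const claim = value["claim"];
  const citation = value["policy_citation"];
  const precedents = value["precedent_ids"];
  return (
    typeof claim === "string" && claim.length >= 20 && claim.length <= 420 &&
    typeof citation === "string" && POLICY_ID_PATTERN.test(citation) &&
    Array.isArray(precedents) && precedents.length <= 8 &&
    precedents.every((p) => typeof p === "string")
  );
}

function isValidClauseRisk(value: unknown): boolean {
  if (!isRecord(value)) return false;
  const risk = value["risk"];
  const rationale = value["rationale"];
  const redline = value["suggested_redline"];
  return (
    typeof value["clause_id"] === "string" &&
    typeof risk === "string" && ALLOWED_BANDS.indexOf(risk) !== -1 &&
    Array.isArray(rationale) && rationale.length >= 1 &&
    rationale.every((r) => isValidRationale(r)) &&
    (redline === undefined || typeof redline === "string")
  );
}

type ReviewOutcome =
  | { status: "validated"; output: unknown }
  | { status: "manual-review"; clauseId: string };

function reviewWithRetry(
  clauseId: string,
  generate: (strict: boolean) => unknown, // hypothetical model call
): ReviewOutcome {
  for (const strict of [false, true]) {
    try {
      const candidate = generate(strict);
      if (isValidClauseRisk(candidate)) return { status: "validated", output: candidate };
    } catch {
      // A thrown error (for example a JSON parse failure) counts as a failed attempt.
    }
  }
  return { status: "manual-review", clauseId }; // fail closed after one retry
}

const validOutput = {
  clause_id: "7.2",
  risk: "negotiate",
  rationale: [{
    claim: "Counterparty caps IP-infringement indemnity at 1x annual fees.",
    policy_citation: "policy_IP-014",
    precedent_ids: ["prec_msa-2024-088"],
  }],
};

const firstTryOutcome = reviewWithRetry("7.2", () => validOutput);
const failedOutcome = reviewWithRetry("16.2", () => ({ clause_id: "16.2", risk: "negotiate", rationale: [] }));
```

Returning a separate manual-review outcome, rather than a synthetic ClauseRisk record, avoids inventing a policy citation for a clause the model never validly reviewed.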

clause-risk output · sample MSA

What the partner sees

Six clauses from a sample MSA, shown with the risk band, policy citation, precedent ids, and (where present) the agent's suggested redline that partners see in the firm's tooling.

sample MSA · vendor agreement · 6 of 102 clauses shown

2 acceptable · 2 negotiate · 1 block · 1 manual · 102 total clauses

§ 3.1 Term & renewal: auto-renewal with 60-day notice · acceptable

Auto-renewal with 60-day opt-out notice matches the firm's standard MSA playbook for the vendor practice. No deviation flagged.

policy citation

policy_MA-104

precedents

prec_msa-2024-118 · prec_msa-2024-217

§ 7.2 Indemnification: IP infringement (third-party claims) · negotiate

Counterparty's draft caps IP-infringement indemnity at 1× annual fees and excludes injunctive relief. House playbook requires uncapped IP-infringement indemnity plus injunctive coverage for the licensed product.

policy citation

policy_IP-014

precedents

prec_msa-2024-088 · prec_msa-2025-031 · prec_msa-2025-104

suggested redline

Replace cap with: `Vendor shall indemnify, defend and hold harmless Customer from any third-party IP-infringement claim arising from the Service, without cap. Injunctive relief is included within the scope of this indemnity.`

§ 9.4 Limitation of liability: consequential-damages waiver · block

A mutual consequential-damages waiver is acceptable as a category. However, this draft also waives consequential damages for breach of confidentiality and data-protection obligations, both of which the firm's policy carves out of any waiver. Partner notification required before send.

policy citation

policy_MA-203

precedents

prec_msa-2024-141 · prec_msa-2025-052

suggested redline

Add carve-out: `Nothing in this Section limits liability for breach of Sections 11 (Confidentiality) or 12 (Data Protection), or for indemnification obligations under Section 7.`

§ 14.5 Governing law: Delaware (with international arbitration) · acceptable

Delaware governing law + ICC international arbitration seat in Singapore matches the firm's cross-border SaaS playbook. No deviation flagged.

policy citation

policy_MA-307

precedents

prec_msa-2024-202

§ 16.2 Data residency: multi-region with carve-out for AI training · manual review

Clause contains a sub-paragraph permitting vendor to use customer data for `model improvement` outside the named regions. This is a novel clause shape not present in any of the 1,420 reconciled-library reference clauses; agent refuses to issue a band and routes to manual review.

policy citation

policy_IP-021

precedents

(none)

§ 19.1 Assignment: change-of-control consent · negotiate

Draft permits assignment to any affiliate without consent. House playbook requires written consent for any assignment outside the same parent group, plus a 30-day cure period for change-of-control events.

policy citation

policy_MA-118

precedents

prec_msa-2025-019 · prec_msa-2025-077

suggested redline

Replace with: `Neither party may assign this Agreement without the other party's prior written consent, which shall not be unreasonably withheld. Change of control triggers a 30-day cure window.`

sample only — anchor_ids, policy_ids, and precedent_ids are illustrative and follow the shape the schema enforces · live surface renders all clauses, not just the 6 shown

ai contract review unit economics

Per-MSA and monthly cost math

line item	$ / MSA	$ / month (≈ 30 MSAs)	note
Claude Sonnet 4.6 input tokens	$0.857	$25.71	102 clauses × 2,800 tokens × $3.00 / 1M
Claude Sonnet 4.6 output tokens	$0.795	$23.86	102 clauses × 520 tokens × $15.00 / 1M
voyage-3-large embeddings (per clause)	$0.037	$1.10	102 clauses × ≈ 3,000 tokens × $0.12 / 1M
pgvector + RDS db.m6i.large	fixed	$284	Postgres 16 in firm tenant · clause library + policy index
g5.xlarge reranker (24/7)	fixed	$378	BAAI bge-reranker-large self-host · Cohere fallback wired
LangChain · LangGraph runtime	fixed	$94	Python on Fargate · 2 vCPU · per-clause parallelism = 12
Langfuse self-hosted (t3.medium)	fixed	$67	trace store · 90-day hot / 7-yr cold
iManage Work connector	fixed	$0	uses firm's existing iManage Cloud seat
All-in monthly (≈ 30 MSAs)	≈ $1.69 (variable only)	≈ $874	vs. ≈ 200 partner hours saved at firm rates

Token costs use Anthropic's published May-2026 Sonnet 4.6 pricing ($3 / 1M input, $15 / 1M output). Infra costs are AWS us-east-2 list price (firm's tenant). Per-MSA token cost assumes the median 102-clause MSA observed in the eval set; range across the 180 production MSAs in the 6-month sample is $0.94 (62 clauses) to $2.83 (164 clauses). Payback period from go-live, including the 9-week build at $215k, is ≈ 4.4 months at the firm's published blended partner rate against the ≈ 71% time saved on partner-signed-off MSAs.
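The cost table can be reproduced with a short calculation. The constant and function names are illustrative. The prices, token counts, and fixed monthly line items are the ones listed in the table above.

```typescript
// Unit-economics sketch using the figures from the cost table.
const PRICE_PER_MILLION = { input: 3.0, output: 15.0, embedding: 0.12 }; // USD
const TOKENS_PER_CLAUSE = { input: 2800, output: 520, embedding: 3000 };
const FIXED_MONTHLY_USD = { postgres: 284, reranker: 378, runtime: 94, langfuse: 67, imanage: 0 };

// Variable cost for one MSA: token and embedding spend only, no infra.
function variableCostPerMsa(clauseCount: number): number {
  const perClause =
    (TOKENS_PER_CLAUSE.input * PRICE_PER_MILLION.input +
      TOKENS_PER_CLAUSE.output * PRICE_PER_MILLION.output +
      TOKENS_PER_CLAUSE.embedding * PRICE_PER_MILLION.embedding) / 1_000_000;
  return clauseCount * perClause;
}

// All-in monthly cost: variable spend for the month plus the fixed infra lines.
function monthlyAllIn(msasPerMonth: number, clausesPerMsa: number): number {
  const fixed =
    FIXED_MONTHLY_USD.postgres + FIXED_MONTHLY_USD.reranker +
    FIXED_MONTHLY_USD.runtime + FIXED_MONTHLY_USD.langfuse + FIXED_MONTHLY_USD.imanage;
  return msasPerMonth * variableCostPerMsa(clausesPerMsa) + fixed;
}

const perMsaVariable = variableCostPerMsa(102); // about 1.69
const monthlyTotal = monthlyAllIn(30, 102);     // about 874
```

The fixed lines sum to $823 per month, so at this volume infra dominates: the all-in figure works out to roughly $29 per MSA, while the $1.69 figure is the marginal cost of one more contract.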

ai contract review eval composition · contract analysis ai rubric

What's in the frozen 740-item clause-eval set

category	items	what it checks	ci-gate threshold
M&A clauses (golds)	320	labelled risk band + policy_id + suggested redline · senior-counsel signed	≥ 0.88 band precision · ≥ 0.90 policy accuracy
Employment clauses (golds)	180	labelled risk band + policy_id · employment senior-counsel signed	≥ 0.88 band precision
Real estate clauses (golds)	140	labelled risk band + policy_id · real estate senior-counsel signed	≥ 0.88 band precision
IP clauses (golds)	100	labelled risk band + policy_id · IP senior-counsel signed	≥ 0.88 band precision
Block-clause must-catch	see note	subset across all 4 groups · catch every hard-no clause (must)	≥ 0.95 block recall
Manual-review (novel patterns)	see note	deliberately novel clauses · agent must refuse, not guess	100% refusal on listed must-refuse

Eval set is frozen (items only added, never edited). Senior counsel from the relevant practice group signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; release engineer can override with a signed CHANGELOG entry. Block-catch and manual-review subsets are sub-categories overlapping the 740-item count.
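A minimal sketch of the release gate described above, under two assumptions: metrics are on a 0 to 1 scale, so "1 point" is read as 0.01, and the signed CHANGELOG override is modelled as a plain boolean. All names are hypothetical.

```typescript
// CI gate sketch. Names and the 0.01 reading of "1 point" are assumptions.
interface GateMetric {
  name: string;
  current: number;   // score on this release
  previous: number;  // score on the prior cut
  threshold: number; // absolute floor from the eval table
}

function gateFailures(metrics: GateMetric[], maxDrop: number = 0.01): string[] {
  const failures: string[] = [];
  const epsilon = 1e-9; // guards against float noise at exactly one point
  for (const m of metrics) {
    if (m.current < m.threshold) {
      failures.push(`${m.name}: ${m.current} is below threshold ${m.threshold}`);
    } else if (m.previous - m.current > maxDrop + epsilon) {
      failures.push(`${m.name}: dropped more than 1 point from ${m.previous}`);
    }
  }
  return failures;
}

// A release ships when nothing failed, or when a signed override is recorded.
function releaseAllowed(failures: string[], signedOverride: boolean): boolean {
  return failures.length === 0 || signedOverride;
}

const passingCut: GateMetric[] = [
  { name: "band precision", current: 0.91, previous: 0.86, threshold: 0.88 },
  { name: "block recall", current: 0.97, previous: 0.95, threshold: 0.95 },
];
// Hypothetical regression: still above the floor, but down 2 points.
const regressedCut: GateMetric[] = [
  { name: "policy accuracy", current: 0.91, previous: 0.93, threshold: 0.90 },
];

const passingFailures = gateFailures(passingCut);
const regressedFailures = gateFailures(regressedCut);
```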

production ops cadence · ai contract review software in the firm tenant

What stays calibrated and how

cadence	what runs	who owns	compliance fit
Weekly	Override-review meeting — every clause the partner overrode the agent on	Senior counsel (one per practice group) + on-call engineer	Patterns showing 3× become eval-set additions
Monthly	Audit-log sample — model version, retrieved candidates from both lanes, reranker scores, policy_id, partner override	Firm's IT director	ABA Op. 512 reasonableness · FRE 502(b) inadvertent-disclosure
Per release	Full 740-item clause-eval re-run against the frozen set	Engineering on-call	CI fails the release if any metric drops > 1pt
Retention	Langfuse traces — 90-day hot in firm tenant + 7-year cold in tenant-scoped S3	Firm IT (we do not hold the keys)	Matches the firm's privilege retention policy

We have not found this level of detail published by other teams shipping legal agents. That's the bar.

ai legal assistant build · 9 weeks · honest version

The timeline
including the two weeks we paused.

Five stages, milestone-billed. The week-4 build turned up a clause-library drift problem: the M&A and real estate playbooks contradicted each other on the same fact pattern, and AI contract review software built on the drift would inherit it. We halted the build for a 2-week reconciliation pass with senior counsel from each practice group, then resumed. The honest version of 9 weeks is 7 weeks of build plus 2 weeks of pause.

Weeks 1–2

Discovery + eval set

Two weeks shadowing partners across the four practice groups. The managing partner of each group sat in the design council. We sampled 60 MSAs from the prior 18 months, anonymized them, and the four senior partners labelled each clause with the correct risk band + policy citation + suggested redline. That sample became the frozen 740-item clause-eval set: 320 from M&A, 180 from employment, 140 from real estate, and 100 from IP.

Frozen 740-item eval set + per-practice-group rubric

Week 3

Clause library + dual-index build

Ingested each practice group's existing clause library (1,840 reference clauses across the four groups) into pgvector 0.7 with embedding via voyage-3-large at 1,024 dimensions. Built the Postgres tsvector BM25 sidecar over the same corpus. RRF fusion tuned on a held-out eval slice; cross-encoder rerank A/B-tested between bge-reranker-large and Cohere Rerank v3. bge won on the legal corpus by ≈ 3 points top-1 precision; Cohere stayed wired as a fallback.

Hybrid retrieval at 0.92 recall@5 across all four practice corpora

Week 4

Clause-library drift — paused for reconciliation

Building the per-clause review chain in LangGraph turned up a structural problem: M&A's standard indemnification language and real estate's contradicted each other on the same fact pattern (joint-and-several vs several-only for sub-tenancy indemnities). Employment and IP had two more such contradictions. An agent trained on the drift would inherit it. We halted the build, convened a 2-week reconciliation pass with senior counsel from each practice group, and produced a single reconciled clause library: 1,420 clauses after dedupe and reconciliation, down from 1,840. Cost two weeks of wall-clock; bought eighteen months of build-on-firm-ground.

Reconciled clause library · 1,420 unique reference clauses · sign-off from all 4 practice groups

Walk-away point

Weeks 5–7

LangChain agent + forced-JSON clause-risk model

LangChain 0.3 orchestrator wraps the LangGraph clause-by-clause chain. Claude Sonnet 4.6 runs with `response_format: json_schema` set to the ClauseRisk shape. The schema is the contract: every flagged clause has to name a policy_id matching the regex; precedent_ids are bounded 0–8; suggested_redline is optional and only renders if the partner expands the clause card. Confidence < 0.8 routes to manual-review; the agent never produces an autonomous redline.

End-to-end review pipeline behind a partner-only beta flag

Weeks 8–9

Shadow cutover + partner-override review

Promoted to first-pass review with partners running in parallel for the first 6 weeks (every MSA reviewed both by the agent and by the partner; outputs compared in Langfuse). After week 6 of shadow the metric held: partner-override rate fell to 9.2% from a baseline 14% in week 1, and the firm cut over to agent-first first-pass with partner final-pass. The override-review meeting runs weekly with senior counsel from each group; patterns that show up three times become eval-set additions.

Production cutover · partner-override-review cadence locked
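
The override-review loop (weekly walk-through, three occurrences promote a pattern to the eval set) might be tracked along these lines. The record fields `agent_band`, `partner_band`, and `pattern`, the pattern labels, and the sample counts are hypothetical; only the threshold of three comes from the text above.

```python
from collections import Counter

PROMOTION_THRESHOLD = 3  # from the text: three occurrences

def is_override(review):
    """True when the partner changed the agent's risk band."""
    return review["agent_band"] != review["partner_band"]

def override_rate(reviews):
    """Share of reviewed clauses where the partner overrode the agent."""
    if not reviews:
        return 0.0
    return sum(1 for r in reviews if is_override(r)) / len(reviews)

def patterns_to_promote(reviews, threshold=PROMOTION_THRESHOLD):
    """Override patterns seen at least `threshold` times, sorted by label."""
    counts = Counter(r["pattern"] for r in reviews if is_override(r))
    return sorted(p for p, n in counts.items() if n >= threshold)

# Hypothetical shadow-slice records: 10 clauses, 4 overrides.
reviews = (
    [{"agent_band": "acceptable", "partner_band": "acceptable",
      "pattern": "none"}] * 6
    + [{"agent_band": "negotiate", "partner_band": "block",
        "pattern": "sub-tenancy-indemnity"}] * 3
    + [{"agent_band": "block", "partner_band": "negotiate",
        "pattern": "ip-assignment"}]
)

rate = override_rate(reviews)
promote = patterns_to_promote(reviews)
```

Here only the pattern seen three times qualifies for promotion; the single `ip-assignment` disagreement stays on the meeting agenda without changing the eval set.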

ai contract review eval results · 740 frozen clause items

AI contract review eval — how we know
it works.

The AI contract review eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 740. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live partner-shadow numbers are within ±1.5 points across all rows over the last 30 days.

| metric | baseline (pre-build) | v1 (wk 5) | v2 (wk 7 post-reconciliation) | current (live, wk 36) | target |
| --- | --- | --- | --- | --- | --- |
| Clause-risk band precision (4-class) | n/a | 0.84 | 0.86 | 0.91 | ≥ 0.88 |
| Block-clause recall (catch all hard-no) | n/a | 0.92 | 0.95 | 0.97 | ≥ 0.95 |
| Policy-citation accuracy (cited the right policy) | n/a | 0.79 | 0.88 | 0.93 | ≥ 0.90 |
| Partner-override rate (live shadow) | 14.0% | 12.4% | 10.1% | 9.2% | ≤ 12% |
| Manual-review refusal rate (by design) | n/a | 8.4% | 11.6% | 12.0% | 10–14% |
| P95 wall-clock per MSA (full report) | n/a | 78s | 68s | 62s | ≤ 90s |

Sample size for the headline time-saved number (≈ 71% first-pass MSA review time saved) is n=180 partner-signed-off MSAs across a 6-month rolling window; the figure is reported from a 95% confidence interval, not as a point estimate. Partner-override rate is the share of clauses on the live shadow slice where the partner overrode the agent's risk band; overrides are expected by design and are not counted as failures. Manual-review refusal rate is the share of clauses the agent declines to band (novel patterns, score-margin failures, off-corpus clauses) and routes straight to a partner, also by design. Latency is end-to-end MSA wall-clock from upload to full clause-risk report, measured at the agent boundary.
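
The ship rule stated above (nothing ships if any metric misses its target) reduces to a small gate. The sketch below is an assumption-laden illustration: the metric key names and the function names are hypothetical, while the target ranges and the `v1` and `current` values are copied from the eval table.

```python
INF = float("inf")

# Inclusive (low, high) target ranges, copied from the eval table.
TARGETS = {
    "band_precision": (0.88, INF),
    "block_recall": (0.95, INF),
    "policy_citation_accuracy": (0.90, INF),
    "partner_override_pct": (-INF, 12.0),
    "manual_review_pct": (10.0, 14.0),
    "p95_seconds": (-INF, 90.0),
}

def failing_metrics(run):
    """Names of metrics that are missing or outside their target range."""
    failures = []
    for name, (low, high) in TARGETS.items():
        value = run.get(name)
        if value is None or not (low <= value <= high):
            failures.append(name)
    return failures

def can_ship(run):
    """A run ships only when every metric is inside its target range."""
    return not failing_metrics(run)

v1 = {"band_precision": 0.84, "block_recall": 0.92,
      "policy_citation_accuracy": 0.79, "partner_override_pct": 12.4,
      "manual_review_pct": 8.4, "p95_seconds": 78}
current = {"band_precision": 0.91, "block_recall": 0.97,
           "policy_citation_accuracy": 0.93, "partner_override_pct": 9.2,
           "manual_review_pct": 12.0, "p95_seconds": 62}

v1_failures = failing_metrics(v1)
```

Treating the manual-review rate as a two-sided range matters: under this gate the week-5 run fails on refusal rate for being too low (8.4% against a 10–14% band), not only on the accuracy metrics.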

when NOT to ship this · kill points

The four shapes we turn down
before scoping a pilot.

An AI contract review agent built on these patterns will mislead partners or counterparties in any of the following situations. We turn down the engagement before a pilot is scoped.

The clause library hasn't been reconciled

If practice groups maintain contradictory playbooks and senior counsel won't sit down for a reconciliation pass, the agent inherits the drift and produces confidently inconsistent calls. We will pause the build, as we did at week 4, until it happens.

Partner sign-off cadence isn't real

If partners aren't going to review every redline before it ships to counterparties for the first six months, the agent's wrong-but-confident clause call lands in opposing counsel's inbox unedited. We require a partner-final-pass workflow at week 1 or the pilot doesn't get signed.

Privilege deployment isn't agreed at week 1

Vendor-side data retention, training opt-out, audit-log retention, model tier. These are not post-launch decisions. ABA Op. 512 and FRE 502(b) scope is decided at week 1 with the firm's compliance lead, or the engagement doesn't start.

Override-review meeting isn't on the calendar

The agent stays honest because senior counsel from each practice group walks the disagreements weekly. If that meeting won't happen, calibration drifts within months and nobody catches it. The eval set is necessary, not sufficient.

frequently asked — ai contract review

What buyers ask first.
Real answers, no hedging.

What does an AI legal assistant actually do?

An AI legal assistant performs structured-output legal work under partner supervision: clause-by-clause contract review, policy citation against the firm's playbook, precedent retrieval, and refusal on novel patterns. It never sends to counterparties without partner sign-off.

How is AI contract review different from a CLM platform?

AI contract review is the decision layer (what risk band, which policy cited, suggested redline). A CLM platform is the workflow layer (intake, approval, signing). They are complementary; this case study replaces the partner-time-intensive review step inside an existing iManage workflow.

How accurate is the AI contract review pipeline?

0.91 clause-risk band precision (4-class), 0.97 block-clause recall, 0.93 policy-citation accuracy, on the frozen 740-item clause-eval set. 9.2% partner-override rate on the live shadow slice, down from a 14% baseline in week 1.

What does AI contract review cost to run?

About $1.69 per MSA in token + infra cost (Claude Sonnet 4.6 + voyage-3-large + pgvector + reranker + Langfuse), or roughly $874 per month at 30 MSAs. Token costs use Anthropic's published Sonnet 4.6 pricing. Payback including the 9-week build is about 4.4 months at the firm's blended partner rate.

How long does it take to build an AI legal assistant?

9 weeks for this engagement: 2 weeks discovery + eval-set freeze, 1 week retrieval build, 2 weeks paused for clause-library reconciliation, 3 weeks agent build + forced-JSON contract, 2 weeks shadow cutover. The 2-week pause was the load-bearing decision.

Is AI contract review ABA Op. 512 compliant?

AI contract review pipelines can be ABA Op. 512 compliant when the firm controls the audit log, retains traces in its own tenant, agrees model + retention scope at week 1, and runs partner-final-pass on every output. This case study's deployment passed the firm's general counsel review against ABA Op. 512 reasonableness duties.

What is the difference between AI contract review software and an AI contract review tool?

A tool is something you buy (Ironclad, Spellbook, Harvey) — its eval set, refusal lane, and policy schema are the vendor's. Software is something built around your firm — eval set, policies, refusal lane, and stack are yours and stay in your repo. This case study is the second shape.

When should a firm NOT ship AI contract review?

Four cases: clause library is not reconciled across practice groups; partners will not sign off every redline for the first six months; privilege deployment (retention, BAA-eligible model tier) is not agreed at week 1; weekly override-review meeting is not on the calendar. We turn down engagements that fail any of these.

keep reading

Where this case study
points back to.

Each link below covers a pillar that fed into this AI legal assistant / AI contract review software build, or that a similar legal AI software build on your stack would draw from.

01 Industry

Legal AI Development

The legal pillar — privilege-aware AI for law firms, FRE 502 audit-log scaffolding, ABA Op. 512 citation-chain logging across practice groups.

02 Service

Intelligent Document Processing

The IDP pillar — multi-modal extraction, schema-validated outputs, confidence routing across document types. The plumbing AI contract review software sits on top of.

03 Service

AI Agent Development

The agent pillar — ReAct, plan-and-execute, hierarchical multi-agent recipes. Same eval-first loop used on this AI legal assistant build.

04 Service

Claude Development

Sonnet 4.6 + Haiku 4.5 integration patterns the AI-powered contract review above uses. Forced JSON, Constitutional-AI posture, BAA-eligible deployment options.

05 Case study

All AI Case Studies

Six AI case studies — AI legal assistant, AI fraud detection, AI knowledge base, AI triage, voice bot, ecommerce chatbot. Same operator detail across every page.

06 Service

AI Governance

Policy-as-code, audit-log scaffolding, privilege-aware deployment. The plumbing that made this contract review software pilot pass the firm's risk review.

07 Service

Custom AI Development

How a contract-review RAG fits inside a broader AI software development company engagement — retrieval + clause classifier + Sonnet reviewer + matter integration.

08 Service

AI Knowledge Base

Contract-review RAG is a domain-specific AI knowledge base over a matter-precedent corpus, with the same chunking, reranking, and eval harness as our product-docs and clinical RAG builds.

09 Service

AI Consulting

The contract-review build started with a fixed-fee audit — clause-library inventory, privilege-ring mapping, eval-set design on real MSAs, and a per-document cost projection before pilot.

Ready to ship

Want an AI legal assistant like this
for your firm's contract review?

Book a fixed-fee AI contract review audit. We'll review the clause library, scope the eval set, recommend an AI-powered contract review recipe (model + retrieval + refusal contract), project run-cost, and tell you honestly whether contract review software built around your playbook is the right shape — or whether you should buy a vendor AI contract review tool. About one audit in five ends with `you need a reconciliation pass before any agent build — here's the SOW for that.`

Read the legal pillar

30 min, async or live · Eval-first scoping · Walk-away point in the pilot

Updated May 20, 2026 · By Navin Sharma