When LIBOR was retired, every financial institution on earth had the same problem: find every contract that references the benchmark, work out whether fallback language exists, and amend what needs amending — before a regulatory deadline that does not move. This platform automated the finding, extraction, and triage at portfolio scale, while keeping humans in charge of every uncertain call.
The problem
A remediation engagement means 5,000–20,000 contracts in inconsistent shapes — roughly 60% born-digital PDFs, 25% DOCX, 15% scanned images — full of varied benchmark terminology, indirect references, and clauses that run for pages. The outputs had to be explainable and auditable, because a regulator may ask exactly why a contract was or wasn't flagged.
Two obvious approaches fail. Manual legal review alone is economically impractical and too slow against a fixed deadline. A pure rules engine misses contextual benchmark language — references that are paraphrased, indirect, or buried in fallback provisions. The answer was automation for the bulk, humans for the doubt, and a full audit trail for everything.
Architecture
A seven-stage asynchronous pipeline: documents flow through format-aware extraction, clause segmentation, transformer-based classification, and a confidence router that decides — per extraction — whether a human needs to look.
Everything is async by design. Remediation programs process large, variable-sized backlogs over weeks — Kafka decouples every stage, gives partition-based parallelism for scale-out, and provides replay and backpressure that synchronous request-response patterns simply don't have.
Getting text out of anything
The pipeline is parser-first, OCR-fallback: each document is routed to the highest-fidelity extraction available for its format. A text-layer check (sampled pages) sends born-digital PDFs to PDFMiner with layout positions preserved, DOCX to Apache Tika, and true scans to Tesseract's LSTM engine with deskew preprocessing. Running OCR on born-digital documents wastes compute and can actually degrade quality — routing avoided 40–60% of OCR compute outright.
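The routing decision above can be sketched in a few lines. This is a minimal illustration, not the platform's code: the `Document` type, the two thresholds, and the idea of representing the text-layer check as per-page character counts are all assumptions, and the returned strings are just labels for the three routes named above.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical thresholds, chosen only for illustration.
TEXT_LAYER_MIN_RATIO = 0.8   # share of sampled pages that must carry text
MIN_CHARS_PER_PAGE = 50      # below this a page is treated as image-only


@dataclass(frozen=True)
class Document:
    filename: str
    # Embedded-text character counts for a few sampled pages
    # (left empty when the check does not apply to the format).
    sampled_page_chars: Sequence[int] = ()


def has_text_layer(doc: Document) -> bool:
    """True when most sampled pages contain a meaningful amount of embedded text."""
    if not doc.sampled_page_chars:
        return False
    texty = sum(1 for n in doc.sampled_page_chars if n >= MIN_CHARS_PER_PAGE)
    return texty / len(doc.sampled_page_chars) >= TEXT_LAYER_MIN_RATIO


def choose_extractor(doc: Document) -> str:
    """Pick the extraction route: parser first, OCR only as the fallback."""
    suffix = doc.filename.lower().rsplit(".", 1)[-1]
    if suffix == "docx":
        return "tika"
    if suffix == "pdf" and has_text_layer(doc):
        return "pdfminer"
    return "tesseract"
```

A PDF whose sampled pages are nearly empty falls through to OCR, the same as a TIFF scan would; everything else never touches Tesseract.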
OCR gets its own quality gate: per-page character confidence is scored, and pages below threshold skip the NLP entirely and go straight to the exception queue. Garbage in, garbage routed out.
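A rough sketch of that gate, assuming character confidences arrive as floats on a 0 to 1 scale and that a page's score is their mean. The threshold value and the function name are hypothetical; the source does not state either.

```python
from statistics import mean
from typing import Dict, List, Tuple

OCR_MIN_PAGE_CONFIDENCE = 0.80  # hypothetical threshold


def gate_pages(
    page_char_conf: Dict[int, List[float]],
    threshold: float = OCR_MIN_PAGE_CONFIDENCE,
) -> Tuple[List[int], List[int]]:
    """Split page numbers into (continue to NLP, send to exception queue)."""
    passed: List[int] = []
    exceptions: List[int] = []
    for page, confs in sorted(page_char_conf.items()):
        # A page with no recognised characters scores zero and is rejected.
        score = mean(confs) if confs else 0.0
        (passed if score >= threshold else exceptions).append(page)
    return passed, exceptions
```

Scoring per page rather than per document means one bad fax page does not drag an otherwise clean contract into the exception queue.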
Clause-level intelligence
Document-level classification is useless here — knowing a contract "mentions LIBOR" tells you nothing about which clause is impacted and what fallback exists. Remediation decisions are clause-specific, so the intelligence had to be too.
Contracts of 50–200 pages are segmented into 40–120 clause-level blocks (section-header and paragraph-boundary detection, with a 50-token overlap so references spanning boundaries aren't lost), which also solved the transformer's hard 512-token input limit. A fine-tuned RoBERTa-base classifies benchmark-related clauses and detects fallback language, served via TensorFlow Serving over gRPC in batches of 32. Alongside it, deliberately boring scikit-learn models (TF-IDF + structural features) handle document-type classification and routing in-process at sub-100ms: a transformer round-trip for a job that logistic regression does at 92% accuracy is pure waste.
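The overlap idea is easiest to see on a flat token list. The sketch below covers only the windowing step, not the header and paragraph detection, and it treats the input as already tokenised; the function name is hypothetical. In a real encoder the 512 limit also has to cover special tokens, so the usable budget would be slightly smaller.

```python
from typing import List


def window_tokens(
    tokens: List[str], max_len: int = 512, overlap: int = 50
) -> List[List[str]]:
    """Cut a token sequence into windows of at most max_len tokens,
    where each window repeats the last `overlap` tokens of the previous one."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if not tokens:
        return []
    step = max_len - overlap
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows
```

A benchmark reference that straddles a cut point appears whole in at least one window, as long as it is shorter than the overlap.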
Training data: 2,500–3,000 clauses per task, annotated by legal subject-matter experts with inter-rater agreement above 0.85 Cohen's kappa before anything was trusted.
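For readers who haven't met the metric: Cohen's kappa is observed agreement corrected for the agreement two annotators would reach by chance. A small self-contained version for two raters follows; the function name is ours, and the handling of the degenerate single-label case is a convention chosen for this sketch.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Share of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Two annotators who agree on three clauses out of four, with balanced labels, land at a kappa of 0.5, well short of the 0.85 bar.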
Confidence-gated automation
Never trust model outputs uniformly: every extraction gets a composite confidence score — model softmax blended with extraction coverage and OCR quality — and lands in one of three tiers: high (>0.85) persists automatically, medium (0.60–0.85) goes to a legal reviewer queue, low (<0.60) enters the exception queue.
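A minimal sketch of the scoring and routing. The tier thresholds come from the text above; the blend itself is an assumption, since the source only says the three signals are combined. Here it is a weighted average with made-up weights, and both function names are hypothetical.

```python
from typing import Tuple

HIGH_THRESHOLD = 0.85
MEDIUM_THRESHOLD = 0.60
# Hypothetical weights for (model softmax, extraction coverage, OCR quality).
DEFAULT_WEIGHTS: Tuple[float, float, float] = (0.6, 0.2, 0.2)


def composite_confidence(
    softmax: float,
    coverage: float,
    ocr_quality: float,
    weights: Tuple[float, float, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average of the three signals, each expected in [0, 1]."""
    signals = (softmax, coverage, ocr_quality)
    if any(not 0.0 <= s <= 1.0 for s in signals):
        raise ValueError("all signals must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)


def route(score: float) -> str:
    """Map a composite score to one of the three tiers."""
    if score > HIGH_THRESHOLD:
        return "high"      # persisted automatically
    if score >= MEDIUM_THRESHOLD:
        return "medium"    # legal reviewer queue
    return "low"           # exception queue
```

Note the boundary behaviour: a score of exactly 0.85 goes to a reviewer, not to auto-persist, which is the conservative reading of "high (>0.85)".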
Binary routing fails in both directions: auto-accepting too much is a compliance risk, reviewing everything defeats the ROI. Three tiers put human effort exactly where the model is uncertain. Reviewers see the highlighted source clause, the extracted attributes, per-attribute confidence, and the model version — and can approve, correct, reject, or escalate. Corrections are aggregated monthly into retraining annotation sets, closing the loop; a correction rate above 15% triggers a model investigation.
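The 15% trigger could be computed along these lines. The denominator is an assumption: this sketch counts corrections as a share of all reviewer decisions in the period, which the source does not spell out. The decision labels mirror the four reviewer actions above; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

CORRECTION_RATE_ALERT = 0.15  # from the text: above 15% triggers investigation


def correction_rate(decisions: Iterable[str]) -> float:
    """Share of reviewer decisions that were corrections."""
    counts = Counter(decisions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts["correct"] / total


def needs_investigation(
    decisions: Iterable[str], threshold: float = CORRECTION_RATE_ALERT
) -> bool:
    """True when the correction rate exceeds the alert threshold."""
    return correction_rate(decisions) > threshold
```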
Built for audit
Every extracted attribute carries an unbroken chain back to its origin:
{
  "document_id": "…",
  "source": { "page": 47, "char_offset": 1212, "clause_ref": "14.2(b)" },
  "extraction_run_id": "…",
  "model_version": "clause_classifier_roberta_v2.1",
  "confidence": 0.91,
  "tier": "high",
  "reviewer_decision": null
}
Pipeline version and model version are pinned per run; reviewer actions are recorded immutably with before/after values and override reasons. When the question is "why did the system flag this clause in March," the answer is reconstructable: which model, which confidence, which human, which decision.
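One way to picture "recorded immutably" in code is a frozen record plus an append-only log. This is an in-memory illustration only; a real system would rely on storage-level guarantees. The class and field names are hypothetical, apart from the four decision values, which come from the reviewer actions described earlier.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional, Tuple

ALLOWED_DECISIONS = frozenset({"approve", "correct", "reject", "escalate"})


@dataclass(frozen=True)
class ReviewAction:
    reviewer_id: str
    decision: str
    attribute: str
    before: Any
    after: Any
    override_reason: Optional[str] = None
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unknown decision: {self.decision!r}")


class AuditLog:
    """Append-only: entries can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._entries: Tuple[ReviewAction, ...] = ()

    def append(self, action: ReviewAction) -> None:
        self._entries = self._entries + (action,)

    @property
    def entries(self) -> Tuple[ReviewAction, ...]:
        return self._entries
```

A correction then shows up as a new entry carrying both values, rather than as an overwrite of the extracted attribute.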
Model evolution in the BERT era
This was 2019–2020, the BERT era of enterprise NLP, and the architecture's best decision was refusing to marry a model. TensorFlow Serving decoupled inference from the application, so the upgrade from BERT-base to RoBERTa-base (benchmarked at +3–4% F1 on the held-out legal clause set) shipped as a model version promotion: re-fine-tune, validate on 100 contracts in staging, promote. Zero application code changes and no meaningful inference cost increase: the same base-size encoder architecture (RoBERTa-base adds only a larger vocabulary embedding table), with better pre-training.
Legal-BERT was piloted later in the project (+1–2% further on a 200-clause sample) and slotted into the same promotion path. Promotion was gated, not vibes-based: F1 must improve >1% with no regression on critical classes, then shadow mode on a 10% sample for a week before going live.
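The offline half of that gate reduces to a small predicate. This sketch reads ">1%" as more than 0.01 absolute F1, which is an assumption, and it leaves the week of shadow traffic out entirely. The function name and the dictionary layout (an `"overall"` key plus one key per class) are hypothetical.

```python
from typing import Dict, Iterable


def passes_promotion_gate(
    baseline_f1: Dict[str, float],
    candidate_f1: Dict[str, float],
    critical_classes: Iterable[str],
    min_gain: float = 0.01,
    overall_key: str = "overall",
) -> bool:
    """True when overall F1 improves by more than min_gain and no
    critical class scores lower than it did under the baseline."""
    gain = candidate_f1[overall_key] - baseline_f1[overall_key]
    if gain <= min_gain:
        return False
    return all(candidate_f1[c] >= baseline_f1[c] for c in critical_classes)
```

A candidate that lifts the headline number while losing ground on a critical class fails the gate, which is the point of checking classes separately.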
Representative results
The hero figures are design targets from the architecture work: 87%+ F1 on benchmark detection (up from 83% with the initial BERT deployment), 82%+ on fallback-language identification, 200–500 contracts/day through the full pipeline, p95 under 120 seconds per contract on CPU, 55–65% straight-through processing, 99.5% business-hours availability — targets, not audited production claims. Rebuilt today, the encoder stack would give way to LLM extraction with guided decoding — but the confidence tiers, the reviewer loop, and the audit chain would survive unchanged, because regulators haven't changed.