Architecture

Cleareye · Finance · 2019–2020 · Public

Public · Representative · synthetic data
Live diagram: intake → route → extract-text → segment → analyze → extract-fields → score → triage → record. Only extract-text (vision, on scans) and analyze (text) are metered.

The 9-stage pipeline

One pipeline runs end to end: intake → route → extract-text → segment → analyze → extract-fields → score → triage → record. Only two stages are metered: the vision OCR legibility read in extract-text (scans only) and the clause text analysis in analyze. The other seven stages cost nothing in either mode. A deterministic validation floor (benchmark + fallback detection) is authoritative regardless of mode.
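As a rough illustration of the stage list and its two metered stages, here is a minimal Python sketch. The names `Stage`, `PIPELINE`, and `metered_stages` are hypothetical and not taken from the project; only the stage names and the metering rule come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    metered: bool = False  # True only for stages that call a model


# Stage order as described in the text; the data structure itself is an assumption.
PIPELINE = (
    Stage("intake"),
    Stage("route"),
    Stage("extract-text", metered=True),  # vision OCR read, scans only
    Stage("segment"),
    Stage("analyze", metered=True),       # clause text analysis
    Stage("extract-fields"),
    Stage("score"),
    Stage("triage"),
    Stage("record"),
)


def metered_stages(is_scan: bool) -> list[str]:
    """Return the names of the stages that incur model spend for one document."""
    billed = []
    for stage in PIPELINE:
        if not stage.metered:
            continue
        # extract-text only costs money when a scan must be read by the vision model
        if stage.name == "extract-text" and not is_scan:
            continue
        billed.append(stage.name)
    return billed
```

Under these assumptions, a scanned document is billed for both metered stages, while a digital document is billed only for `analyze`.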

The format gate — the cost lever

Routing is the cost lever. A digital PDF or Word document is parsed directly at no model cost and never touches the vision model; only a scanned image goes through the vision read. Vision spend therefore accrues per scanned document, not per user action, which makes this an honest scaled-down version of an OCR + extraction stack.
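The gate can be sketched as a small routing function. This is an illustration only: the suffix sets, the `has_text_layer` flag (to treat an image-only PDF as a scan), and the function names `route` and `vision_calls` are assumptions, not the project's actual code.

```python
from pathlib import Path

# Assumed file types; the real gate may inspect content rather than suffixes.
DIGITAL_SUFFIXES = {".pdf", ".doc", ".docx"}
SCAN_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def route(filename: str, has_text_layer: bool = True) -> str:
    """Return "parse" for the free direct-parse path, "vision" for the metered path."""
    suffix = Path(filename).suffix.lower()
    if suffix in SCAN_SUFFIXES:
        return "vision"
    if suffix == ".pdf" and not has_text_layer:
        # Assumption: a PDF without a text layer is a scan in disguise
        return "vision"
    if suffix in DIGITAL_SUFFIXES:
        return "parse"
    raise ValueError(f"unsupported document type: {filename}")


def vision_calls(filenames: list[str]) -> int:
    """Count the documents that would reach the vision model."""
    return sum(1 for name in filenames if route(name) == "vision")
```

Because the decision is made once per document at the gate, the number of vision calls depends only on how many scans arrive, not on how often a user opens or re-views a result.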

Dual approach — cloud vs on-prem OSS

The two metered stages can run on the cloud model (claude-haiku-4-5 for both vision OCR and text analysis, cost-capped and fail-closed) or on self-hosted open models (qwen2.5vl:7b for vision, Qwen3-8B for text). Because the host has no GPU, the open-model outputs were recorded on local M4 hardware. The honest finding: the small OSS vision model reads degraded scans imperfectly. It under-reads a skewed or noisy page and reports lower confidence, so the extraction is routed to a human. Cloud is the more capable read, and the deterministic field-validation + triage floor is authoritative either way. The models are swappable behind the toggle.
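A minimal sketch of how the toggle, the fail-closed cost cap, and the human-routing rule might fit together. The model names come from the text; everything else (the `MODELS` mapping, the `CONFIDENCE_FLOOR` value of 0.80, and the `Read`, `CostCap`, `BudgetExceeded`, and `triage` names) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

# Model names as stated in the text; the mapping structure is an assumption.
MODELS = {
    "cloud": {"vision": "claude-haiku-4-5", "text": "claude-haiku-4-5"},
    "oss": {"vision": "qwen2.5vl:7b", "text": "Qwen3-8B"},
}

CONFIDENCE_FLOOR = 0.80  # hypothetical threshold, not from the source


@dataclass
class Read:
    text: str
    confidence: float  # model-reported confidence in [0, 1]


class BudgetExceeded(RuntimeError):
    """Raised instead of making a call that would exceed the cap."""


class CostCap:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Fail closed: refuse the call rather than overspend
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded("cost cap reached; call refused")
        self.spent_usd += cost_usd


def triage(read: Read, fields_valid: bool) -> str:
    """Return "human" or "auto"; the deterministic check wins in either mode."""
    if not fields_valid:
        return "human"  # validation floor is authoritative
    if read.confidence < CONFIDENCE_FLOOR:
        return "human"  # low-confidence read, e.g. a skewed or noisy scan
    return "auto"
```

In this sketch the model's confidence can only send more work to a human; it can never override a failed deterministic validation, which is one way to read "authoritative either way".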

Out of scope (deliberately)

Four things are simulated and labelled as such rather than built: real document ingestion (represented by the synthetic corpus), a production OCR engine (represented by the metered stages running on rasterized synthetic pages), ticketing / amendment-export integrations (represented by the simulated export), and model retraining (represented by the wrong-read corrections + owner overlay). No real or copyrighted documents are used.

Architecture · Contract Intelligence for the LIBOR Transition · Abhishek Saxena