Data Quality & Lineage Monitor · Documentation

Architecture

This page covers the Data Quality & Lineage Monitor's pipeline, the data it owns, the events it emits and consumes, and what is out of scope.

[Diagram: Data Quality & Lineage Monitor. Stages: run-rules (metadata) → score (vs threshold) → rollup (+ trend) → explain (metered). Legend: metered · LLM; deterministic · $0.]
Live diagram: rules, scores, rollups and the lineage graph are deterministic; only the root-cause explanation is metered.
[Diagram: cost by stage. run-rules $0 → score $0 → rollup $0 → explain metered. Cost lever: explain a failing rule.]
Live diagram: spend accrues only when explaining a failing rule; the checks and lineage are free.

Pipeline

Run rules over entity metadata → score + classify (pass/warn/fail vs threshold) → roll up (health counts + 14-day trend) → explain a failing rule (metered AI-assist). Detection and rollups are pure functions; the lineage graph is a static derivation of the coupling spine.
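As an illustration of the score, classify and roll-up stages as pure functions, here is a minimal Python sketch. All names (Thresholds, classify, rollup, trend) and the two-cut-off scheme are assumptions made for the example; the source only states "pass/warn/fail vs threshold" and "health counts + 14-day trend".

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


# Hypothetical cut-offs: the source does not say how thresholds are shaped.
@dataclass(frozen=True)
class Thresholds:
    warn_below: float  # scores below this are at least "warn"
    fail_below: float  # scores below this are "fail"


def classify(score: float, t: Thresholds) -> str:
    """Pure function: map a score to pass/warn/fail."""
    if score < t.fail_below:
        return "fail"
    if score < t.warn_below:
        return "warn"
    return "pass"


def rollup(statuses: list[str]) -> dict[str, int]:
    """Health counts per status, always reporting all three keys."""
    counts = Counter(statuses)
    return {key: counts.get(key, 0) for key in ("pass", "warn", "fail")}


def trend(daily_scores: dict[date, float], today: date, days: int = 14) -> list[float | None]:
    """Scores for the last `days` days, oldest first; None marks a missing day."""
    start = today - timedelta(days=days - 1)
    return [daily_scores.get(start + timedelta(days=i)) for i in range(days)]
```

Because none of these functions touch I/O or shared state, the same inputs always produce the same counts and trend, which is what keeps these stages at $0.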

Reads + the data invariant

Reads metadata and scores across the shared data layer — the entities other clusters own — with no row-level access. It holds no mutable config of its own beyond the shared C8 surfaces and performs no canonical writes; thresholds are owner-governed and the whole surface is read-only to viewers.

Events + metering, dual-mode

Each Explain emits cost.logged into the shared ledger that #27 reads. The explain stage is dual-mode (Cloud `claude-haiku-4-5` cost-capped/fail-closed · OSS recorded $0); everything else is deterministic and $0. The lineage view documents the same event spine the rest of C8 rides.
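The dual-mode behaviour described above might look roughly like the following Python sketch. It makes no model call; the Ledger class, the explain signature, the event fields and the per-call cost estimate are all hypothetical stand-ins, and only the event name cost.logged and the model name come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Stand-in for the shared ledger; collects cost.logged events in memory."""
    events: list = field(default_factory=list)

    def emit(self, event: dict) -> None:
        self.events.append(event)

    def total(self) -> float:
        return sum(e["cost_usd"] for e in self.events)


class CostCapExceeded(RuntimeError):
    """Raised when a metered explain would push spend past the cap."""


def explain(rule_id: str, mode: str, ledger: Ledger,
            cap_usd: float, est_cost_usd: float = 0.002) -> str:
    """Explain a failing rule; mode is "cloud" (metered) or "oss" (recorded, $0)."""
    if mode == "cloud":
        # Fail closed: refuse before spending if the cap would be exceeded.
        if ledger.total() + est_cost_usd > cap_usd:
            raise CostCapExceeded(f"cap of ${cap_usd} would be exceeded")
        cost = est_cost_usd  # assumed estimate, not a real price
        narrative = f"[simulated] root-cause narrative for {rule_id}"
    elif mode == "oss":
        cost = 0.0
        narrative = f"[recorded] root-cause narrative for {rule_id}"
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Every successful Explain logs its cost, including the $0 OSS case.
    ledger.emit({"type": "cost.logged", "rule_id": rule_id,
                 "mode": mode, "cost_usd": cost})
    return narrative
```

Checking the cap before doing any work means a refused call leaves no event in the ledger, which is one plausible reading of "fail-closed".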

Out of scope (simulated + labelled)

Checks are computed over the synthetic dataset rather than executed against a live warehouse, and the root-cause step is a simulated metered narrative. No real DQ engine, no alerting integration, no PII. In Stage-2 the same method surface maps to real check jobs and a real lineage store with no UI change.
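One way to picture "the same method surface" is a single interface with a synthetic implementation today and a warehouse-backed one later. The Python sketch below is an assumption about how that could be arranged; CheckBackend, SyntheticBackend, WarehouseBackend and failing_rules are hypothetical names, not the product's API.

```python
from typing import Protocol


class CheckBackend(Protocol):
    """Hypothetical method surface shared by the simulated and real backends."""

    def run_rules(self, entity: str) -> dict:
        """Return a mapping of rule name to score for one entity."""
        ...


class SyntheticBackend:
    """Current stage: scores come from a synthetic in-memory dataset."""

    def __init__(self, dataset: dict) -> None:
        self._dataset = dataset

    def run_rules(self, entity: str) -> dict:
        # Copy so callers cannot mutate the underlying dataset.
        return dict(self._dataset.get(entity, {}))


class WarehouseBackend:
    """Stage-2 placeholder: would dispatch real check jobs."""

    def run_rules(self, entity: str) -> dict:
        raise NotImplementedError("real check jobs are out of scope here")


def failing_rules(backend: CheckBackend, entity: str, threshold: float) -> list:
    """Caller code depends only on the surface, not on which backend is used."""
    scores = backend.run_rules(entity)
    return sorted(rule for rule, score in scores.items() if score < threshold)
```

Since failing_rules only sees the CheckBackend surface, swapping SyntheticBackend for a real implementation would not require changes in the calling layer, which is consistent with the "no UI change" claim.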

Architecture · Data Quality & Lineage Monitor · Abhishek Saxena