Enterprise Data Trust — Chapter 1: Trusted Source Intake
Built by Anthony Johnson | EthereaLogic LLC
If this pattern is useful to your team, consider starring the repo — it helps others in the Databricks community find it.
Source systems silently send duplicate batches, drop required fields, or change schemas without notice — and downstream consumers start using the data before anyone has validated it.
This chapter demonstrates a governed intake gate that catches schema drift, blocks duplicate replays, quarantines invalid records with explicit reasons, and certifies what is safe for downstream consumption.
Every decision made from the platform depends on the assumption that ingested data is complete, unique, and contract-compliant. This chapter proves that assumption can be enforced — and shows what happens when it is not.
| Leadership question | Answer |
|---|---|
| What business risk does this address? | Source systems silently send duplicate batches, drop required fields, or change schemas — and downstream consumers start using the data before anyone has validated it. |
| What does this chapter prove? | A four-component intake control that catches schema drift, blocks duplicate replays, quarantines invalid records with explicit reasons, and provides a single operational view of batch health. |
| Why does it matter? | Every downstream decision depends on trusting that ingested data is complete, unique, and contract-compliant. This pattern enforces that assumption before anyone queries Bronze. |
The operational handoff summary shows the full pipeline outcome in one view — 33 rows landed, 10 ready, 23 quarantined, 10 replay duplicates blocked, 8 rescued from schema drift.
The Databricks pipeline refresh completed all five tables with zero warnings and zero failures — proving the control pattern runs cleanly on the platform.
Every blocked row carries an array of named check failures. Nothing is silently dropped — 23 rows quarantined with explicit reason arrays showing exactly which contract checks failed.
Enterprise ingestion pipelines face four problems that consistently erode trust in the data platform:
- Schema drift goes unnoticed. A source system changes a field. The pipeline absorbs the change silently. Downstream reports break — or worse, they produce numbers that look right but aren't.
- Duplicate batches slip through. Operations resends yesterday's file under a new name. Without business-level deduplication, every downstream aggregate is inflated.
- Required fields arrive empty. A partial extract produces rows where key business fields are null. These rows pass basic schema checks because the columns exist.
- Consumers query before validation. Dashboards and models start reading from Bronze the moment data lands. There is no gate between "data arrived" and "data is ready."
These are not edge cases. They are the daily operational reality of enterprise data engineering.
| Verified outcome | Evidence from this repository |
|---|---|
| Schema drift is captured, not absorbed | Auto Loader rescued data mode isolates drifted columns; 8 rows rescued and quarantined |
| Duplicate replays are blocked | Batch registry detects business-level replays; 10 rows flagged across redelivered batches |
| Invalid records are quarantined with reasons | 7 named contract checks produce explicit violation arrays; 23 rows quarantined |
| Handoff gate provides a single operational view | One materialized view aggregates landed, ready, quarantined, rescued, and replay counts |
Business decision: should downstream users trust today's landed data?
| KPI | Meaning |
|---|---|
| `total_landed` | Rows ingested from source files |
| `total_ready` | Rows that passed all 7 contract checks and replay detection |
| `total_quarantined` | Rows blocked with explicit reason arrays |
| `replay_duplicates` | Rows from redelivered batches, blocked at the registry level |
| `rescued_rows` | Rows with schema drift captured in `_rescued_data` |
Control rule: only `bronze_ready` is downstream-safe. All other tables are operational surfaces for triage, audit, and replay investigation.
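The handoff summary can be sketched in plain Python. In the repository this is a materialized view over the Bronze tables; the row fields used below (`status`, `is_replay`, `rescued`) are illustrative assumptions, not the actual column names.

```python
# Hypothetical sketch of the operational handoff summary aggregation.
# Field names (status, is_replay, rescued) are assumptions for illustration.
from collections import Counter


def handoff_summary(rows):
    """Aggregate landed rows into the single operational KPI view."""
    s = Counter()
    for r in rows:
        s["total_landed"] += 1
        if r["status"] == "ready":
            s["total_ready"] += 1
        else:
            s["total_quarantined"] += 1
        if r.get("is_replay"):
            s["replay_duplicates"] += 1
        if r.get("rescued"):
            s["rescued_rows"] += 1
    # Ratio of downstream-safe rows, guarded against an empty batch.
    s["ready_ratio"] = (
        round(s["total_ready"] / s["total_landed"], 3) if s["total_landed"] else 0.0
    )
    return dict(s)
```

Against the counts reported in this chapter (33 landed, 10 ready, 23 quarantined), this view is the one table an operations lead needs to answer the trust question.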
- Gap 1. Raw ingest alone is not certification. A pipeline that absorbs every file without checking completeness, uniqueness, or schema conformance provides no basis for downstream trust.
- Gap 2. Replay detection must be business-level, not file-level. The platform deduplicates files, but business teams resend the same batch under new filenames. The batch registry catches this.
- Gap 3. Quarantine must use explicit reasons, not silent filtering. Every blocked row carries an array of named check failures so operations teams can triage without guessing.
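The reason-array pattern from Gap 3 can be sketched in plain Python. The check names below are illustrative assumptions, not the repository's actual seven contract checks:

```python
# Illustrative named contract checks. Each check has a stable name so that
# quarantined rows carry an explicit, triageable list of failures.
# These three checks are hypothetical stand-ins for the chapter's seven.
CHECKS = {
    "order_id_present": lambda r: bool(r.get("order_id")),
    "amount_non_negative": lambda r: r.get("amount") is not None and r["amount"] >= 0,
    "event_ts_present": lambda r: bool(r.get("event_ts")),
}


def evaluate(row):
    """Return (is_ready, violations) for a single landed row.

    A row is ready only when the violations array is empty; otherwise it is
    quarantined together with the names of every check it failed.
    """
    violations = [name for name, check in CHECKS.items() if not check(row)]
    return (not violations, violations)
```

Because every blocked row keeps its full violation array, operations teams can group the quarantine table by reason instead of re-deriving the failure by hand.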
- Raw ingest with rescued data. Auto Loader captures everything, including schema changes, into a streaming table. Drift is rescued into a dedicated column rather than silently absorbed.
- Batch registry. A materialized view groups ingested rows by business batch identifier and tracks first-seen timestamps, file counts, and replay flags.
- Seven named contract checks. Every row must pass all seven validation rules before reaching the ready output. Failed rows are quarantined with explicit reason arrays.
- Operational handoff summary. A single materialized view aggregates the full pipeline into actionable metrics: landed, ready, quarantined, rescued, and duplicate counts with ratios.
- Downstream gate. Only `bronze_ready` is published as downstream-safe. All other tables exist for operational review.
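The batch-registry component can be sketched in plain Python. In the repository it is a materialized view; the input shape `(batch_id, filename, ingested_at)` and the output fields are illustrative assumptions:

```python
# Hypothetical batch-registry sketch: group arrivals by business batch id,
# track first-seen time and file counts, and flag replays. A second file
# for the same business batch is a replay even under a brand-new filename,
# which is exactly the case file-level deduplication misses.
def build_registry(files):
    """files: iterable of (batch_id, filename, ingested_at), in arrival order."""
    registry = {}
    for batch_id, filename, ingested_at in files:
        entry = registry.setdefault(
            batch_id,
            {"first_seen": ingested_at, "file_count": 0, "is_replay": False},
        )
        entry["file_count"] += 1
        entry["is_replay"] = entry["file_count"] > 1
    return registry
```

Keying on the business batch identifier rather than the filename is the design choice that lets the registry catch the resend-under-a-new-name scenario.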
For the full architecture diagram and design rationale, see docs/architecture.md.
- Lakeflow Declarative Pipelines (formerly DLT) for streaming ingestion and materialized views.
- Auto Loader with `cloudFiles` for incremental file processing with rescued data mode.
- Unity Catalog for governed table publication and volume-based landing zones.
- Databricks Asset Bundles for source-controlled deployment and pipeline refresh.
- The intake pattern is source-agnostic and applies to any enterprise ingestion pipeline regardless of source count.
Install Chapter One first. It is the portfolio entry point and contains:
- `Open Executive Demo.command`
- `scripts/executive_mode.py`
- `docs/executive_mode.md`
When Chapters 2 and 3 are cloned as sibling folders beside `trusted-source-intake`, the launcher can bootstrap them, run one guided demo per chapter, and write the executive report into `docs/`.
The shortest path is:
```shell
python3 scripts/executive_mode.py all --open-report
```

You can also double-click `Open Executive Demo.command` on macOS.
Use Python 3.10 or newer.
```shell
git clone https://github.com/Org-EthereaLogic/trusted-source-intake.git
cd trusted-source-intake
python3 -m venv .venv && source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -e ".[dev]"
pytest tests/ -q  # Expected: 56 passed
```
```shell
intake-demo
```

| Evidence item | What it shows |
|---|---|
| Landing zone screenshot | 4 batch directories landed in the Unity Catalog volume |
| Batch registry screenshot | B-001 replay detected: `file_count=2`, `is_replay=true` |
| Ready output screenshot | B-001 = 10 ready rows — only the clean batch passes |
This validates the intake control pattern on one order-events source across four sample batches in a Databricks Free Edition workspace. It does not constitute production-scale proof or multi-source verification. The demonstration models a daily order-events source designed to trigger specific failure modes. The pattern itself is source-agnostic.
- GitHub Actions workflow: `ci.yml`
This is Chapter 1 of the Enterprise Data Trust portfolio — a three-part body of work addressing the full lifecycle of data reliability in enterprise Databricks platforms.
Install this chapter first if you want the portfolio launcher. It is the canonical home of `Open Executive Demo.command` and `scripts/executive_mode.py`.
| Chapter | Focus | Repository |
|---|---|---|
| 1. Trusted Source Intake | Validate and certify data before downstream consumption | ← You are here |
| 2. Silent Failure Prevention | Detect distribution drift before it reaches executive dashboards | silent-failure-prevention |
| 3. Measurable Control Effectiveness | Prove that your data controls work against known failure scenarios | measurable-control-effectiveness |
MIT License. See LICENSE for details.





