
Before Anyone Queries Bronze, How Do You Know the Data Is Ready?

Enterprise Data Trust — Chapter 1: Trusted Source Intake

Built by Anthony Johnson | EthereaLogic LLC


If this pattern is useful to your team, consider starring the repo — it helps others in the Databricks community find it.


Source systems silently send duplicate batches, drop required fields, or change schemas without notice — and downstream consumers start using the data before anyone has validated it.

This chapter demonstrates a governed intake gate that catches schema drift, blocks duplicate replays, quarantines invalid records with explicit reasons, and certifies what is safe for downstream consumption.

Every decision made from the platform depends on the assumption that ingested data is complete, unique, and contract-compliant. This chapter proves that assumption can be enforced — and shows what happens when it is not.

Executive Summary

| Leadership question | Answer |
| --- | --- |
| What business risk does this address? | Source systems silently send duplicate batches, drop required fields, or change schemas — and downstream consumers start using the data before anyone has validated it. |
| What does this chapter prove? | A four-component intake control that catches schema drift, blocks duplicate replays, quarantines invalid records with explicit reasons, and provides a single operational view of batch health. |
| Why does it matter? | Every downstream decision depends on trusting that ingested data is complete, unique, and contract-compliant. This pattern enforces that assumption before anyone queries Bronze. |

Key Exhibits

Exhibit 1: Single-Pane Intake Verdict

The operational handoff summary shows the full pipeline outcome in one view — 33 rows landed, 10 ready, 23 quarantined, 10 replay duplicates blocked, 8 rescued from schema drift.

[Screenshot: handoff summary showing 33 landed, 10 ready, 23 quarantined, 10 replay, 8 rescued]

Exhibit 2: Pipeline Completion with Zero Failures

The Databricks pipeline refresh completed all five tables with zero warnings and zero failures — proving the control pattern runs cleanly on the platform.

[Screenshot: pipeline refresh showing 5 tables completed, 0 warnings, 0 failures]

Exhibit 3: Explicit Quarantine with Named Violations

Every blocked row carries an array of named check failures. Nothing is silently dropped — 23 rows quarantined with explicit reason arrays showing exactly which contract checks failed.

[Screenshot: 23 rows quarantined with explicit reason arrays]

The Business Problem

Enterprise ingestion pipelines face four problems that consistently erode trust in the data platform:

  • Schema drift goes unnoticed. A source system changes a field. The pipeline absorbs the change silently. Downstream reports break — or worse, they produce numbers that look right but aren't.
  • Duplicate batches slip through. Operations resends yesterday's file under a new name. Without business-level deduplication, every downstream aggregate is inflated.
  • Required fields arrive empty. A partial extract produces rows where key business fields are null. These rows pass basic schema checks because the columns exist.
  • Consumers query before validation. Dashboards and models start reading from Bronze the moment data lands. There is no gate between "data arrived" and "data is ready."

These are not edge cases. They are the daily operational reality of enterprise data engineering.

What This Repository Proves

| Verified outcome | Evidence from this repository |
| --- | --- |
| Schema drift is captured, not absorbed | Auto Loader rescued data mode isolates drifted columns; 8 rows rescued and quarantined |
| Duplicate replays are blocked | Batch registry detects business-level replays; 10 rows flagged across redelivered batches |
| Invalid records are quarantined with reasons | 7 named contract checks produce explicit violation arrays; 23 rows quarantined |
| Handoff gate provides a single operational view | One materialized view aggregates landed, ready, quarantined, rescued, and replay counts |

Decision / KPI Contract

Business decision: should downstream users trust today's landed data?

| KPI | Meaning |
| --- | --- |
| total_landed | Rows ingested from source files |
| total_ready | Rows that passed all 7 contract checks and replay detection |
| total_quarantined | Rows blocked with explicit reason arrays |
| replay_duplicates | Rows from redelivered batches, blocked at the registry level |
| rescued_rows | Rows with schema drift captured in _rescued_data |

Control rule: only bronze_ready is downstream-safe. All other tables are operational surfaces for triage, audit, and replay investigation.
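
The sketch below shows how this contract might be applied before downstream reads, assuming the handoff summary is published as a queryable view and the code runs in a Databricks notebook where `spark` is predefined. The catalog, schema, view names, and the run_date column are illustrative placeholders, not the repository's actual identifiers; the KPI columns follow the table above.

```python
# Hedged sketch: read the latest intake verdict, then consume only the certified table.
# Names like main.intake.handoff_summary and run_date are placeholders.
from pyspark.sql import functions as F

latest = (
    spark.table("main.intake.handoff_summary")   # hypothetical handoff materialized view
    .orderBy(F.col("run_date").desc())
    .limit(1)
    .collect()[0]
)

ready_ratio = latest["total_ready"] / max(latest["total_landed"], 1)
print(
    f"landed={latest['total_landed']} ready={latest['total_ready']} "
    f"quarantined={latest['total_quarantined']} replays={latest['replay_duplicates']} "
    f"rescued={latest['rescued_rows']} ready_ratio={ready_ratio:.0%}"
)

# Control rule: downstream jobs read only the certified table, never the raw or quarantine surfaces.
orders = spark.table("main.intake.bronze_ready")
```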

Why This Pattern

  • Gap 1. Raw ingest alone is not certification. A pipeline that absorbs every file without checking completeness, uniqueness, or schema conformance provides no basis for downstream trust.
  • Gap 2. Replay detection must be business-level, not file-level. The platform deduplicates files, but business teams resend the same batch under new filenames. The batch registry catches this.
  • Gap 3. Quarantine must use explicit reasons, not silent filtering. Every blocked row carries an array of named check failures so operations teams can triage without guessing; a minimal sketch follows this list.
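
The sketch illustrates Gap 3: collecting the names of failed checks into a violation array so quarantined rows carry their reasons. The check names and columns here are examples only, not the seven checks the pipeline actually enforces.

```python
# Illustrative reason-array quarantine; check names and columns are examples only.
from pyspark.sql import DataFrame, functions as F

def split_ready_quarantine(landed: DataFrame):
    """Tag each row with the named checks it fails, then split ready vs. quarantined."""
    checks = {
        "order_id_not_null": F.col("order_id").isNotNull(),
        "quantity_positive": F.col("quantity") > 0,
        "no_rescued_data":   F.col("_rescued_data").isNull(),  # Auto Loader drift marker
    }
    # A null check result (e.g. quantity IS NULL) counts as a failure.
    failed = [
        F.when(~F.coalesce(cond, F.lit(False)), F.lit(name))
        for name, cond in checks.items()
    ]
    tagged = landed.withColumn(
        "violations", F.filter(F.array(*failed), lambda v: v.isNotNull())
    )
    ready = tagged.where(F.size("violations") == 0).drop("violations")
    quarantined = tagged.where(F.size("violations") > 0)  # reasons travel with each row
    return ready, quarantined
```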

How It Works

  1. Raw ingest with rescued data. Auto Loader captures everything, including schema changes, into a streaming table. Drift is rescued into a dedicated column rather than silently absorbed.
  2. Batch registry. A materialized view groups ingested rows by business batch identifier and tracks first-seen timestamps, file counts, and replay flags.
  3. Seven named contract checks. Every row must pass all seven validation rules before reaching the ready output. Failed rows are quarantined with explicit reason arrays.
  4. Operational handoff summary. A single materialized view aggregates the full pipeline into actionable metrics: landed, ready, quarantined, rescued, and duplicate counts with ratios.
  5. Downstream gate. Only bronze_ready is published as downstream-safe. All other tables exist for operational review.

For the full architecture diagram and design rationale, see docs/architecture.md.
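
A condensed sketch of steps 1 and 2 using the Lakeflow Declarative Pipelines Python API, where `spark` is the pipeline's implicit session. The volume path, table names, and batch_id column are placeholders for illustration, not the repository's actual definitions.

```python
# Hedged sketch of raw ingest with rescued data and a business-level batch registry.
# Paths, table names, and the batch_id column are placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(name="bronze_raw", comment="Raw landing; schema drift is rescued, not absorbed.")
def bronze_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaEvolutionMode", "rescue")   # drift lands in _rescued_data
        .load("/Volumes/main/intake/landing/")                # Unity Catalog volume (placeholder)
        .withColumn("source_file", F.col("_metadata.file_path"))
        .withColumn("ingested_at", F.current_timestamp())
    )

@dlt.table(name="batch_registry", comment="Replay detection keyed on the business batch id.")
def batch_registry():
    return (
        dlt.read("bronze_raw")
        .groupBy("batch_id")                                  # business batch identifier
        .agg(
            F.min("ingested_at").alias("first_seen"),
            F.countDistinct("source_file").alias("file_count"),
        )
        .withColumn("is_replay", F.col("file_count") > 1)     # same batch redelivered as a new file
    )
```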

Databricks Fit

  • Lakeflow Declarative Pipelines (formerly DLT) for streaming ingestion and materialized views.
  • Auto Loader with cloudFiles for incremental file processing with rescued data mode.
  • Unity Catalog for governed table publication and volume-based landing zones.
  • Databricks Asset Bundles for source-controlled deployment and pipeline refresh.
  • The intake pattern is source-agnostic and applies to any enterprise ingestion pipeline regardless of source count.

Executive Mode

Install Chapter One first. It is the portfolio entry point and contains:

  • Open Executive Demo.command
  • scripts/executive_mode.py
  • docs/executive_mode.md

When Chapters 2 and 3 are cloned as sibling folders beside trusted-source-intake, the launcher can bootstrap them, run one guided demo per chapter, and write the executive report into docs/.

The shortest path is:

python3 scripts/executive_mode.py all --open-report

You can also double-click Open Executive Demo.command on macOS.

Reproducibility

Use Python 3.10 or newer.

git clone https://github.com/Org-EthereaLogic/trusted-source-intake.git
cd trusted-source-intake

python3 -m venv .venv && source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -e ".[dev]"
pytest tests/ -q     # Expected: 56 passed
intake-demo

Evidence Appendix

| Evidence item | What it shows |
| --- | --- |
| Seeded landing volume | 4 batch directories landed in the Unity Catalog volume |
| Batch registry | B-001 replay detected: file_count=2, is_replay=true |
| Ready rows | B-001 = 10 ready rows — only the clean batch passes |

Scope Boundary

This validates the intake control pattern on one order-events source across four sample batches in a Databricks Free Edition workspace. It does not constitute production-scale proof or multi-source verification. The demonstration models a daily order-events source designed to trigger specific failure modes. The pattern itself is source-agnostic.

Engineering Signals

  • GitHub Actions workflow: ci.yml

Additional Documentation

Design rationale and demo walkthroughs live in docs/; see docs/architecture.md and docs/executive_mode.md.

Part of a Series

This is Chapter 1 of the Enterprise Data Trust portfolio — a three-part body of work addressing the full lifecycle of data reliability in enterprise Databricks platforms.

Install this chapter first if you want the portfolio launcher. It is the canonical home of Open Executive Demo.command and scripts/executive_mode.py.

| Chapter | Focus | Repository |
| --- | --- | --- |
| 1. Trusted Source Intake | Validate and certify data before downstream consumption | ← You are here |
| 2. Silent Failure Prevention | Detect distribution drift before it reaches executive dashboards | silent-failure-prevention |
| 3. Measurable Control Effectiveness | Prove that your data controls work against known failure scenarios | measurable-control-effectiveness |

MIT License. See LICENSE for details.

About

A Databricks control pattern that certifies every record before downstream consumption. 7 contract checks, replay detection, schema drift handling, and quarantine with explicit reasons. 56 passing tests. Databricks Free Edition validated. Enterprise Data Trust, Chapter 1.
