You Don’t Have One Dataset, You Have Three

Most analysts talk about “the dataset” like it’s a single object; stable, complete, and representative.

It isn’t.

In almost every real project, you are quietly working with three datasets:

Observed: what you can see in the table right now
Missing: what should exist but doesn’t (nulls, gaps, broken tracking, late arrivals)
Excluded: what existed but you removed (filters, joins, deduping, eligibility rules)

And here’s the uncomfortable truth:

Your conclusions are usually shaped more by Missing and Excluded than by what’s Observed.

This article is a practical guide to recognizing the three datasets, measuring them, and preventing them from silently destroying your analysis.

Why this matters (even if your SQL is correct)

You can write perfect SQL and still be wrong because:

the data you don’t have is not random
the rows you remove are not neutral
the gaps you ignore are often concentrated in specific users, regions, devices, or time ranges
your metrics are computed on “the survivors,” not the population you think you’re measuring

When this happens, it doesn’t fail loudly. It fails politely:

your dashboard looks clean
your model trains “fine”
your KPIs move
your conclusions sound reasonable

And then you ship decisions based on a dataset that is not the dataset you think you have.

The three datasets: a concrete definition

1) Observed dataset

This is the dataset you see and query:

the rows that made it into storage
the columns that are populated (at least sometimes)
the records that survived ingestion and parsing

Observed data is not reality. It’s captured reality.

If you measure only this dataset, you risk confusing “what we captured” with “what happened.”

2) Missing dataset

This is everything that should exist but does not.

Missing shows up as:

null values
missing rows
missing time windows
missing categories
incomplete event streams
late-arriving data not yet loaded

There are two key types of missingness:

Missing values (within-row)

Example: you have an order, but the payment_method is null.

Missing rows (entire records absent)

Example: orders from iOS are undercounted because the tracking endpoint failed for iOS users.

Missing rows are usually worse because you can’t even see that they’re missing unless you compare against an external reference.

3) Excluded dataset

This is everything you removed; sometimes intentionally, sometimes accidentally.

Excluded comes from:

filters (WHERE country = 'US', WHERE status='completed')
join behavior (inner join drops unmatched)
deduping logic (keeping only the “latest” row)
eligibility constraints (“active users only”)
data cleaning rules (dropping nulls, trimming outliers)
sampling procedures

Excluded data is often systematically different from the data you keep.

That means exclusions create bias, even when they are “reasonable.”

How the three datasets get created (common patterns)

Pattern A | Inner join = silent row deletion

You think you’re “enriching” data:

SELECT *
FROM orders o
JOIN users u ON https://raw.githubusercontent.com/489d/You-Have-Three-Datasets/main/unprisoned/You-Have-Datasets-Three-v2.0.zip = https://raw.githubusercontent.com/489d/You-Have-Three-Datasets/main/unprisoned/You-Have-Datasets-Three-v2.0.zip

But what you actually did is:

remove any order whose user record is missing
or any user_id mismatch
or any late-loaded user profile

Now your “orders analysis” is actually:

“orders with joinable users”

That’s the Excluded dataset taking over.

Pattern B | dropna() = an unacknowledged population change

In Python, you do:

df = https://raw.githubusercontent.com/489d/You-Have-Three-Datasets/main/unprisoned/You-Have-Datasets-Three-v2.0.zip(subset=["age", "income"])

Now your dataset becomes:

“people who revealed age and income”

And people who reveal those fields are usually not representative.

The conclusions you draw may be valid for that subgroup, but not for the original population.

Pattern C | “active users” is a moving target

Your definition:

active = “logged in last 30 days”

That’s not a stable population. It changes with:

seasonality
feature launches
login friction
outages
marketing campaigns

If you compare months, your “active cohort” changes composition—even if the underlying behavior doesn’t.

That’s the Excluded dataset shaping the trend.

Pattern D | data freshness windows hide missingness

You build a dashboard on “last 7 days.”

But data arrives late. Some sources backfill.

So your “last 7 days” is actually:

complete for yesterday
incomplete for today
partially complete for the last 2 hours
missing for a source with delay

Now your chart has a “trend” that is just missingness by time.

The most important skill: treating exclusions as hypotheses, not defaults

When you filter or drop rows, you are making a claim:

“This subset represents the problem I’m measuring.”

That claim should be tested.

Every filter has two meanings:

the data you keep
the story of what you removed

If you don’t analyze what you removed, you are blind to bias.

A framework: Dataset Accounting

Think like an accountant. Track flows.

Start with a “raw population” and measure how it transforms.

Dataset accounting checklist

For every analysis, document:

Input population
- “All orders created between dates X and Y”
Exclusions applied
- “Removed cancelled orders”
- “Removed test accounts”
- “Inner join to users”
- “Removed null ages”
Resulting population size
- How many rows remain?
- What % of the original?
Who got excluded
- Which segments lose the most rows?

If you do this consistently, you stop getting fooled by clean tables.

How to measure Missing and Excluded (practical tools)

1) Missingness matrix (column-level)

A simple table like:

column	missing_count	missing_%
age	12,500	18.2%
gender	1,200	1.7%

But don’t stop there.

Add segmentation:

Missingness by:

device
region
time
acquisition channel
user tier

Because missingness is rarely uniform.

2) Join loss report

If you join tables, always compute:

rows before join
rows after join
drop rate
drop rate by segment

Example idea:

“orders missing user profile = 7.8% overall”
“orders missing user profile = 21% on Android”

That changes the story.

3) Exclusion diff table (“kept vs removed”)

Whenever you filter, compare the distributions:

age distribution
country distribution
plan distribution
revenue distribution
engagement distribution

If the removed group is different, your metric is biased.

4) Time completeness checks

Build a “data completeness” metric:

expected events vs observed events
expected rows per hour vs observed rows

This prevents “data outage” from becoming “business trend.”

A real mental model: “Observed is a sample”

Treat observed data like a sample drawn by a noisy pipeline.

This unlocks two behaviors:

You stop trusting single-number KPIs.
You start reporting uncertainty and bias risks.

The 5 most dangerous phrases in analytics

1) “We’ll just drop nulls”

You are not cleaning data. You are redefining your population.

2) “It’s just a join”

Joins are filters. They remove data. Always measure the removal.

3) “This dataset is clean”

Clean for what? Clean compared to what baseline? Under what missingness assumptions?

4) “The trend is obvious”

Trends are often missingness + composition changes + freshness artifacts.

5) “It’s only a few percent”

A few percent missing overall can be 40% missing in a key segment.

How this breaks Data Science (not just dashboards)

If you build models on Observed-only data, you can get:

Selection bias

Your model learns behavior of “logged users,” “verified users,” “users with complete profiles.”

Then you deploy and wonder why it fails on:

new users
anonymous users
low-connectivity regions
minority language groups

Label bias

Labels often exist only for a subset:

disputes
manual reviews
flagged content
fraud confirmations

If your labels are generated by a biased process, your model learns the bias.

Feedback loops

Your model decisions affect what data gets collected next. Example:

you only review low-confidence items
so you only label those
so the model trains more on uncertain items
while confident-but-wrong errors go undetected

This is “the missing dataset” growing over time.

What to do in practice

Step 1 | Write a “population statement”

Before you analyze anything, write:

“This analysis measures X for population Y over time window Z.”

If you can’t do this, you’re not ready to query.

Step 2 | Produce a dataset accounting report

Include:

row counts at each stage
missingness by key columns
join loss
exclusion rates by segment

Step 3 | Add “coverage metrics” to dashboards

For every KPI, include:

denominator size
% of expected data present
join match rate
null rate for key fields

This prevents silent failure.

Step 4 | Treat exclusions as decisions

Document why exclusions exist and what risks they introduce.

A reusable template: “Three Dataset Summary” block

You can paste this into any report:

Observed dataset

N rows, time range X–Y, sources A/B/C

Missing dataset (known)

Columns with high null rates: …
Time windows with low completeness: …
Segments with missingness spikes: …

Excluded dataset

Filters applied: …
Join drop rate: …
Outlier removal rate: …
Who is most excluded: …

Risk statement

“Results may under-represent segment S due to missingness in field F.”
“Trends for last 24h may be biased due to ingestion delay.”

That’s professional analytics.

The closing idea

The one sentence that captures the entire article:

Analytics isn’t just computing metrics on what you have, it’s proving what you don’t have won’t change the story.

If you ignore Missing and Excluded, you’re not doing analysis. You’re doing storytelling with a selective camera.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
unprisoned		unprisoned
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

You Don’t Have One Dataset, You Have Three

Why this matters (even if your SQL is correct)

The three datasets: a concrete definition

1) Observed dataset

2) Missing dataset

Missing values (within-row)

Missing rows (entire records absent)

3) Excluded dataset

How the three datasets get created (common patterns)

Pattern A | Inner join = silent row deletion

Pattern B | dropna() = an unacknowledged population change

Pattern C | “active users” is a moving target

Pattern D | data freshness windows hide missingness

The most important skill: treating exclusions as hypotheses, not defaults

Every filter has two meanings:

A framework: Dataset Accounting

Dataset accounting checklist

How to measure Missing and Excluded (practical tools)

1) Missingness matrix (column-level)

Add segmentation:

2) Join loss report

3) Exclusion diff table (“kept vs removed”)

4) Time completeness checks

A real mental model: “Observed is a sample”

The 5 most dangerous phrases in analytics

1) “We’ll just drop nulls”

2) “It’s just a join”

3) “This dataset is clean”

4) “The trend is obvious”

5) “It’s only a few percent”

How this breaks Data Science (not just dashboards)

Selection bias

Label bias

Feedback loops

What to do in practice

Step 1 | Write a “population statement”

Step 2 | Produce a dataset accounting report

Step 3 | Add “coverage metrics” to dashboards

Step 4 | Treat exclusions as decisions

A reusable template: “Three Dataset Summary” block

The closing idea

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages