Most analysts talk about “the dataset” as if it were a single object: stable, complete, and representative.
It isn’t.
In almost every real project, you are quietly working with three datasets:
- Observed: what you can see in the table right now
- Missing: what should exist but doesn’t (nulls, gaps, broken tracking, late arrivals)
- Excluded: what existed but you removed (filters, joins, deduping, eligibility rules)
And here’s the uncomfortable truth:
Your conclusions are usually shaped more by Missing and Excluded than by what’s Observed.
This article is a practical guide to recognizing the three datasets, measuring them, and preventing them from silently destroying your analysis.
You can write perfect SQL and still be wrong because:
- the data you don’t have is not random
- the rows you remove are not neutral
- the gaps you ignore are often concentrated in specific users, regions, devices, or time ranges
- your metrics are computed on “the survivors,” not the population you think you’re measuring
When this happens, it doesn’t fail loudly. It fails politely:
- your dashboard looks clean
- your model trains “fine”
- your KPIs move
- your conclusions sound reasonable
And then you ship decisions based on a dataset that is not the dataset you think you have.
The Observed dataset is the one you see and query:
- the rows that made it into storage
- the columns that are populated (at least sometimes)
- the records that survived ingestion and parsing
Observed data is not reality. It’s captured reality.
If you measure only this dataset, you risk confusing “what we captured” with “what happened.”
The Missing dataset is everything that should exist but does not.
Missing shows up as:
- null values
- missing rows
- missing time windows
- missing categories
- incomplete event streams
- late-arriving data not yet loaded
There are two key types of missingness:
- Missing values: the row exists, but a field is empty. Example: you have an order, but the payment_method is null.
- Missing rows: entire records never arrived. Example: orders from iOS are undercounted because the tracking endpoint failed for iOS users.
Missing rows are usually worse because you can’t even see that they’re missing unless you compare against an external reference.
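The difference matters in practice: null values are visible with one line of pandas, while missing rows only surface when you compare against an external reference. A minimal sketch (the processor counts and table contents are illustrative assumptions):

```python
import pandas as pd

# Toy orders table: payment_method has missing VALUES, and iOS is
# missing ROWS (tracking failed), which nulls alone can't reveal.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "platform": ["web", "web", "android", "ios"],
    "payment_method": ["card", None, "card", "card"],
})

# Missing values: directly visible in the table.
null_rate = orders["payment_method"].isna().mean()

# Missing rows: only visible against an external reference
# (e.g., payment-processor counts per platform -- assumed numbers here).
expected = pd.Series({"web": 2, "android": 1, "ios": 3})
observed = orders["platform"].value_counts()
row_gap = (expected - observed.reindex(expected.index).fillna(0)).clip(lower=0)
```

Here `null_rate` is 25%, but the real damage is in `row_gap`: two iOS orders that never arrived and would be invisible without the reference counts.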
The Excluded dataset is everything you removed, sometimes intentionally, sometimes accidentally.
Excluded comes from:
- filters (WHERE country = 'US', WHERE status = 'completed')
- join behavior (inner join drops unmatched)
- deduping logic (keeping only the “latest” row)
- eligibility constraints (“active users only”)
- data cleaning rules (dropping nulls, trimming outliers)
- sampling procedures
Excluded data is often systematically different from the data you keep.
That means exclusions create bias, even when they are “reasonable.”
You think you’re “enriching” data:
SELECT *
FROM orders o
JOIN users u ON o.user_id = u.user_id
But what you actually did is:
- remove any order whose user record is missing
- or any user_id mismatch
- or any late-loaded user profile
Now your “orders analysis” is actually:
“orders with joinable users”
That’s the Excluded dataset taking over.
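One way to make that takeover visible is to run the same join as a left join with pandas' `indicator=True`, which marks exactly the rows an inner join would silently drop (table and column names here are illustrative):

```python
import pandas as pd

# The join trap: an inner join silently drops orders whose user record
# is missing. A left join with indicator=True keeps the loss visible.
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 11, 99]})
users = pd.DataFrame({"user_id": [10, 11], "country": ["US", "DE"]})

inner = orders.merge(users, on="user_id", how="inner")
left = orders.merge(users, on="user_id", how="left", indicator=True)

# Orders an inner join loses: rows present only on the left side.
dropped = (left["_merge"] == "left_only").sum()
```

With these toy rows, one of three orders has no joinable user, so the "orders analysis" would quietly become a two-thirds sample.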
In Python, you do:
df = df.dropna(subset=["age", "income"])
Now your dataset becomes:
“people who revealed age and income”
And people who reveal those fields are usually not representative.
The conclusions you draw may be valid for that subgroup, but not for the original population.
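A quick way to see the population shift is to compute the same metric before and after the dropna, on a toy table (all numbers are illustrative):

```python
import pandas as pd

# dropna() doesn't just clean -- it redefines the population.
# Compare a metric on the full table vs the "complete rows only" subset.
df = pd.DataFrame({
    "age":    [25, None, 40, None, 35],
    "income": [30_000, 50_000, None, 80_000, 40_000],
    "spend":  [100, 400, 250, 700, 150],
})

complete = df.dropna(subset=["age", "income"])

spend_all = df["spend"].mean()             # population you think you measure
spend_complete = complete["spend"].mean()  # population you actually measure
```

In this toy case the "clean" subset keeps only two of five rows, and average spend falls from 320 to 125. Nothing about behavior changed; only the population did.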
Your definition:
active = “logged in last 30 days”
That’s not a stable population. It changes with:
- seasonality
- feature launches
- login friction
- outages
- marketing campaigns
If you compare months, your “active cohort” changes composition—even if the underlying behavior doesn’t.
That’s the Excluded dataset shaping the trend.
You build a dashboard on “last 7 days.”
But data arrives late. Some sources backfill.
So your “last 7 days” is actually:
- complete for yesterday
- incomplete for today
- partially complete for the last 2 hours
- missing for a source with delay
Now your chart has a “trend” that is just missingness by time.
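One guard is to annotate each day with a completeness score before charting it. A minimal sketch, where "expected volume" is a naive baseline from days believed complete (an assumption, not a standard):

```python
import pandas as pd

# A "last 7 days" dip can be ingestion lag, not behavior.
# Flag days whose volume is far below an expected baseline.
observed = pd.Series(
    {"2024-01-01": 1000, "2024-01-02": 990, "2024-01-03": 620},
    name="rows",
)
expected_per_day = 1000  # e.g., trailing average from complete days

completeness = observed / expected_per_day
suspect_days = completeness[completeness < 0.9].index.tolist()
```

Any chart built on `suspect_days` should be labeled incomplete rather than read as a trend.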
When you filter or drop rows, you are making a claim:
- “This subset represents the problem I’m measuring.”
That claim should be tested.
Every transformation produces two things:
- the data you keep
- the story of what you removed
If you don’t analyze what you removed, you are blind to bias.
Think like an accountant. Track flows.
Start with a “raw population” and measure how it transforms.
For every analysis, document:
- Input population
  - “All orders created between dates X and Y”
- Exclusions applied
  - “Removed cancelled orders”
  - “Removed test accounts”
  - “Inner join to users”
  - “Removed null ages”
- Resulting population size
  - How many rows remain?
  - What % of the original?
- Who got excluded
  - Which segments lose the most rows?
If you do this consistently, you stop getting fooled by clean tables.
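The documentation steps above can be sketched as an exclusion "waterfall" that records the row count at every stage, so no filter disappears silently (data and filter names are illustrative):

```python
import pandas as pd

# Track row counts at each stage of the pipeline, accountant-style.
raw = pd.DataFrame({
    "order_id": range(10),
    "status": ["completed"] * 7 + ["cancelled"] * 3,
    "is_test": [True] + [False] * 9,
})

stages = [("raw", len(raw))]
df = raw[raw["status"] == "completed"]
stages.append(("drop cancelled", len(df)))
df = df[~df["is_test"]]
stages.append(("drop test accounts", len(df)))

waterfall = pd.DataFrame(stages, columns=["stage", "rows"])
waterfall["pct_of_raw"] = waterfall["rows"] / len(raw) * 100
```

The resulting table is the "story of what you removed" in one object: 10 rows in, 6 rows out, with each exclusion accounted for.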
Start with a simple table like this:
| column | missing_count | missing_% |
|---|---|---|
| age | 12,500 | 18.2% |
| gender | 1,200 | 1.7% |
But don’t stop there.
Break missingness down by:
- device
- region
- time
- acquisition channel
- user tier
Because missingness is rarely uniform.
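In pandas, the segment breakdown is one groupby away. A sketch with illustrative field and segment names:

```python
import pandas as pd

# An acceptable overall null rate can hide a bad segment.
df = pd.DataFrame({
    "device": ["ios", "ios", "ios", "android", "android", "web"],
    "age":    [None, None, 30, 25, 40, 35],
})

overall_null = df["age"].isna().mean()
null_by_device = df.groupby("device")["age"].apply(lambda s: s.isna().mean())
```

Here the overall rate is a tolerable-looking 33%, but iOS is missing two-thirds of its ages while the other segments are complete.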
If you join tables, always compute:
- rows before join
- rows after join
- drop rate
- drop rate by segment
For example:
- “orders missing user profile = 7.8% overall”
- “orders missing user profile = 21% on Android”
That changes the story.
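Computing that segment-level drop rate takes a few lines with pandas' merge indicator (numbers and column names are illustrative):

```python
import pandas as pd

# Join loss by segment: an acceptable overall drop rate can hide a
# segment that loses almost everything.
orders = pd.DataFrame({
    "order_id": range(6),
    "user_id": [1, 2, 3, 9, 9, 4],   # user 9 has no profile row
    "platform": ["web", "web", "web", "android", "android", "web"],
})
users = pd.DataFrame({"user_id": [1, 2, 3, 4]})

merged = orders.merge(users, on="user_id", how="left", indicator=True)
merged["dropped"] = merged["_merge"] == "left_only"

overall_drop = merged["dropped"].mean()
drop_by_platform = merged.groupby("platform")["dropped"].mean()
```

In this toy case the overall drop rate is a third, which might pass review, while the Android segment loses 100% of its orders.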
Whenever you filter, compare the distributions:
- age distribution
- country distribution
- plan distribution
- revenue distribution
- engagement distribution
If the removed group is different, your metric is biased.
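A sketch of that comparison, using an outlier-trim rule on toy data (the threshold and column names are assumptions):

```python
import pandas as pd

# Compare the kept vs removed groups on a key dimension.
# If their compositions differ, the filter biased your metric.
df = pd.DataFrame({
    "plan":    ["free", "free", "free", "pro", "pro", "pro"],
    "revenue": [0, 0, 0, 500, 900, 5000],
})

kept = df[df["revenue"] < 1000]      # "trim outliers"
removed = df[df["revenue"] >= 1000]

kept_plans = kept["plan"].value_counts(normalize=True)
removed_plans = removed["plan"].value_counts(normalize=True)
```

Here the trim removed only pro-plan users, so any revenue-per-user metric on the kept rows now under-represents exactly the segment that pays the most.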
Build a “data completeness” metric:
- expected events vs observed events
- expected rows per hour vs observed rows
This prevents “data outage” from becoming “business trend.”
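A minimal version of that metric, with "expected" taken from a naive trailing baseline of hours believed complete (an assumption; real pipelines would use a smarter forecast):

```python
import pandas as pd

# Completeness metric: observed events per hour vs an expected baseline.
observed = pd.Series([100, 98, 102, 55], index=[0, 1, 2, 3], name="events")

baseline = observed.iloc[:3].mean()   # hours believed complete
completeness = observed / baseline
alert = completeness < 0.8            # flag hours likely missing data
```

Hour 3 trips the alert at 55% completeness, so a dashboard can label it "incomplete" instead of letting it masquerade as a drop in activity.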
Treat observed data like a sample drawn by a noisy pipeline.
This unlocks two behaviors:
- You stop trusting single-number KPIs.
- You start reporting uncertainty and bias risks.
You are not cleaning data. You are redefining your population.
Joins are filters. They remove data. Always measure the removal.
When someone says the data is “clean,” ask: clean for what? Compared to what baseline? Under what missingness assumptions?
Trends are often missingness + composition changes + freshness artifacts.
A few percent missing overall can be 40% missing in a key segment.
If you build models on Observed-only data, you get selection bias: your model learns the behavior of “logged-in users,” “verified users,” “users with complete profiles.”
Then you deploy and wonder why it fails on:
- new users
- anonymous users
- low-connectivity regions
- minority language groups
Labels often exist only for a subset:
- disputes
- manual reviews
- flagged content
- fraud confirmations
If your labels are generated by a biased process, your model learns the bias.
Your model decisions affect what data gets collected next. Example:
- you only review low-confidence items
- so you only label those
- so the model trains more on uncertain items
- while confident-but-wrong errors go undetected
This is “the missing dataset” growing over time.
Before you analyze anything, write:
“This analysis measures X for population Y over time window Z.”
If you can’t do this, you’re not ready to query.
In every pipeline summary, include:
- row counts at each stage
- missingness by key columns
- join loss
- exclusion rates by segment
For every KPI, include:
- denominator size
- % of expected data present
- join match rate
- null rate for key fields
This prevents silent failure.
Document why exclusions exist and what risks they introduce.
You can paste this into any report:
Observed dataset
- N rows, time range X–Y, sources A/B/C
Missing dataset (known)
- Columns with high null rates: …
- Time windows with low completeness: …
- Segments with missingness spikes: …
Excluded dataset
- Filters applied: …
- Join drop rate: …
- Outlier removal rate: …
- Who is most excluded: …
Risk statement
- “Results may under-represent segment S due to missingness in field F.”
- “Trends for last 24h may be biased due to ingestion delay.”
That’s professional analytics.
The one sentence that captures the entire article:
Analytics isn’t just computing metrics on what you have; it’s proving that what you don’t have won’t change the story.
If you ignore Missing and Excluded, you’re not doing analysis. You’re doing storytelling with a selective camera.