This guide provides SQL query optimization patterns for MySQL databases, using Stack Exchange schema as examples. These patterns help write efficient queries that leverage indexes and minimize resource usage.
SELECT
'Posts' as table_name, COUNT(*) as row_count FROM Posts
UNION ALL SELECT 'Users', COUNT(*) FROM Users
UNION ALL SELECT 'Comments', COUNT(*) FROM Comments
UNION ALL SELECT 'Votes', COUNT(*) FROM Votes
UNION ALL SELECT 'Badges', COUNT(*) FROM Badges
UNION ALL SELECT 'Tags', COUNT(*) FROM Tags
UNION ALL SELECT 'PostHistory', COUNT(*) FROM PostHistory
UNION ALL SELECT 'PostLinks', COUNT(*) FROM PostLinks
UNION ALL SELECT 'Sites', COUNT(*) FROM Sites
ORDER BY row_count DESC;Problems:
- 9 separate full table scans
- No index usage possible
- O(n) for each table
- Total time: sum of all table scan times
-- Fast approximation using table statistics
SELECT
TABLE_NAME as table_name,
TABLE_ROWS as row_count
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'stackexchange'
AND TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_ROWS DESC;
-- OR for exact counts (if needed), use parallel execution
-- MySQL 8.0+ supports this nativelyImprovement:
- Reads metadata only (cached in memory)
- ~100x faster for large tables
- Approximate but good enough for analysis
- If exact needed: can parallelize in application layer
SELECT
u.DisplayName,
u.Reputation,
COUNT(DISTINCT p.Id) as total_posts,
COUNT(DISTINCT CASE WHEN p.PostTypeId = 1 THEN p.Id END) as questions_asked,
COUNT(DISTINCT CASE WHEN p.PostTypeId = 2 THEN p.Id END) as answers_given,
SUM(p.Score) as total_score,
AVG(p.Score) as avg_score,
SUM(p.ViewCount) as total_views
FROM Users u
LEFT JOIN Posts p ON u.Id = p.OwnerUserId AND u.SiteId = p.SiteId
GROUP BY u.Id, u.DisplayName, u.Reputation
HAVING total_posts > 0
ORDER BY u.Reputation DESC
LIMIT 15;Problems:
- LEFT JOIN brings ALL posts for ALL users
- COUNT(DISTINCT CASE WHEN...) is expensive
- Filters AFTER aggregation (HAVING)
- Processes millions of rows unnecessarily
-- Step 1: Get top users first (uses index on Reputation)
WITH TopUsers AS (
SELECT Id, SiteId, DisplayName, Reputation
FROM Users
ORDER BY Reputation DESC
LIMIT 15
)
-- Step 2: Only join posts for these 15 users
SELECT
tu.DisplayName,
tu.Reputation,
COUNT(p.Id) as total_posts,
SUM(CASE WHEN p.PostTypeId = 1 THEN 1 ELSE 0 END) as questions_asked,
SUM(CASE WHEN p.PostTypeId = 2 THEN 1 ELSE 0 END) as answers_given,
SUM(p.Score) as total_score,
ROUND(AVG(p.Score), 2) as avg_score,
SUM(p.ViewCount) as total_views
FROM TopUsers tu
LEFT JOIN Posts p ON tu.Id = p.OwnerUserId AND tu.SiteId = p.SiteId
GROUP BY tu.Id, tu.DisplayName, tu.Reputation
ORDER BY tu.Reputation DESC;Improvements:
- Filters to 15 users FIRST (uses index)
- Only processes ~150-500 posts instead of all posts
- Uses SUM(CASE...) instead of COUNT(DISTINCT CASE...)
- ~50-100x faster
- Index scan on Users.Reputation + Index seek on Posts
SELECT
YEAR(CreationDate) as year,
MONTH(CreationDate) as month,
COUNT(*) as posts,
AVG(Score) as avg_score
FROM Posts
WHERE PostTypeId = 1
GROUP BY year, month
ORDER BY year, month;Problems:
- YEAR() and MONTH() functions prevent index usage
- Full table scan required
- Can't use index on CreationDate
-- Use DATE_FORMAT which can still be optimized in some cases
SELECT
DATE_FORMAT(CreationDate, '%Y-%m') as year_month,
COUNT(*) as posts,
ROUND(AVG(Score), 2) as avg_score
FROM Posts
WHERE PostTypeId = 1
GROUP BY year_month
ORDER BY year_month;
-- OR even better: Use indexed column directly
SELECT
DATE(DATE_FORMAT(CreationDate, '%Y-%m-01')) as month_start,
COUNT(*) as posts,
ROUND(AVG(Score), 2) as avg_score
FROM Posts
WHERE PostTypeId = 1
AND CreationDate >= '2020-01-01' -- Use indexed column in WHERE
GROUP BY month_start
ORDER BY month_start;Improvements:
- PostTypeId uses index
- CreationDate can use index with range
- DATE_FORMAT only in SELECT (not WHERE)
- ~10-20x faster
SELECT
CASE
WHEN AnswerCount = 0 THEN 'Unanswered'
WHEN AnswerCount = 1 THEN '1 answer'
WHEN AnswerCount BETWEEN 2 AND 3 THEN '2-3 answers'
WHEN AnswerCount BETWEEN 4 AND 5 THEN '4-5 answers'
ELSE '6+ answers'
END as answer_range,
COUNT(*) as question_count,
ROUND(AVG(Score), 2) as avg_score,
ROUND(AVG(ViewCount), 2) as avg_views,
SUM(CASE WHEN AcceptedAnswerId IS NOT NULL THEN 1 ELSE 0 END) as with_accepted_answer
FROM Posts
WHERE PostTypeId = 1
GROUP BY answer_range
ORDER BY question_count DESC;Problems:
- CASE expression in GROUP BY prevents index usage
- Can't materialize buckets efficiently
- Full table scan
WITH QuestionBuckets AS (
SELECT
CASE
WHEN AnswerCount = 0 THEN 1
WHEN AnswerCount = 1 THEN 2
WHEN AnswerCount BETWEEN 2 AND 3 THEN 3
WHEN AnswerCount BETWEEN 4 AND 5 THEN 4
ELSE 5
END as bucket_id,
Score,
ViewCount,
AcceptedAnswerId
FROM Posts
WHERE PostTypeId = 1
)
SELECT
CASE bucket_id
WHEN 1 THEN 'Unanswered'
WHEN 2 THEN '1 answer'
WHEN 3 THEN '2-3 answers'
WHEN 4 THEN '4-5 answers'
WHEN 5 THEN '6+ answers'
END as answer_range,
COUNT(*) as question_count,
ROUND(AVG(Score), 2) as avg_score,
ROUND(AVG(ViewCount), 2) as avg_views,
SUM(CASE WHEN AcceptedAnswerId IS NOT NULL THEN 1 ELSE 0 END) as with_accepted_answer
FROM QuestionBuckets
GROUP BY bucket_id
ORDER BY bucket_id;Improvements:
- Groups by integer instead of string
- Cleaner execution plan
- String concatenation happens only once per group
- ~2-3x faster
SELECT
vt.Name as vote_type,
COUNT(*) as vote_count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM Votes), 2) as percentage
FROM Votes v
JOIN VoteTypes vt ON v.VoteTypeId = vt.Id
GROUP BY vt.Name, vt.Id
ORDER BY vote_count DESC;Problems:
- Subquery
(SELECT COUNT(*) FROM Votes)recalculated for each group - Full table scan on Votes for each percentage calculation
- Unnecessarily expensive
WITH TotalVotes AS (
SELECT COUNT(*) as total FROM Votes
)
SELECT
vt.Name as vote_type,
COUNT(*) as vote_count,
ROUND(COUNT(*) * 100.0 / tv.total, 2) as percentage
FROM Votes v
JOIN VoteTypes vt ON v.VoteTypeId = vt.Id
CROSS JOIN TotalVotes tv
GROUP BY vt.Name, vt.Id, tv.total
ORDER BY vote_count DESC;
-- OR even simpler with window function (MySQL 8.0+)
SELECT
vt.Name as vote_type,
COUNT(*) as vote_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM Votes v
JOIN VoteTypes vt ON v.VoteTypeId = vt.Id
GROUP BY vt.Name, vt.Id
ORDER BY vote_count DESC;Improvements:
- Total calculated once, not per group
- Window function version is cleanest
- ~10x faster for large tables
SELECT
u.DisplayName,
u.Reputation,
u.Location,
COUNT(DISTINCT b.Id) as badge_count,
SUM(CASE WHEN b.Class = 1 THEN 1 ELSE 0 END) as gold_badges,
SUM(CASE WHEN b.Class = 2 THEN 1 ELSE 0 END) as silver_badges,
SUM(CASE WHEN b.Class = 3 THEN 1 ELSE 0 END) as bronze_badges
FROM Users u
LEFT JOIN Badges b ON u.Id = b.UserId AND u.SiteId = b.SiteId
GROUP BY u.Id, u.DisplayName, u.Reputation, u.Location
HAVING badge_count > 10
ORDER BY u.Reputation DESC
LIMIT 20;Problems:
- Joins ALL badges for ALL users
- Filters after aggregation (HAVING)
- Processes too much data
WITH TopUsers AS (
SELECT Id, SiteId, DisplayName, Reputation, Location
FROM Users
ORDER BY Reputation DESC
LIMIT 100 -- Get more than needed to account for badge filter
)
SELECT
tu.DisplayName,
tu.Reputation,
tu.Location,
COUNT(b.Id) as badge_count,
SUM(CASE WHEN b.Class = 1 THEN 1 ELSE 0 END) as gold_badges,
SUM(CASE WHEN b.Class = 2 THEN 1 ELSE 0 END) as silver_badges,
SUM(CASE WHEN b.Class = 3 THEN 1 ELSE 0 END) as bronze_badges
FROM TopUsers tu
LEFT JOIN Badges b ON tu.Id = b.UserId AND tu.SiteId = b.SiteId
GROUP BY tu.Id, tu.DisplayName, tu.Reputation, tu.Location
HAVING badge_count > 10
ORDER BY tu.Reputation DESC
LIMIT 20;Improvements:
- Only processes top 100 users
- Uses index on Reputation
- ~50x faster
SELECT
Id,
Title,
Body,
Score,
ViewCount,
AnswerCount,
AcceptedAnswerId,
CreationDate,
Tags
FROM Posts
WHERE Title LIKE '%alternative%carbon%'
AND PostTypeId = 1;Problems:
- Leading wildcard % prevents index usage
- Full table scan required
- Very slow on large tables
-- Option 1: If you have MySQL 5.6+ with fulltext index
SELECT
Id,
Title,
Body,
Score,
ViewCount,
AnswerCount,
AcceptedAnswerId,
CreationDate,
Tags,
MATCH(Title) AGAINST('alternative carbon' IN NATURAL LANGUAGE MODE) as relevance
FROM Posts
WHERE MATCH(Title) AGAINST('alternative carbon' IN NATURAL LANGUAGE MODE)
AND PostTypeId = 1
ORDER BY relevance DESC;
-- Option 2: If no fulltext, at least filter by PostTypeId first
SELECT
Id,
Title,
Body,
Score,
ViewCount,
AnswerCount,
AcceptedAnswerId,
CreationDate,
Tags
FROM Posts
WHERE PostTypeId = 1 -- Uses index
AND Title LIKE '%alternative%carbon%'
LIMIT 10; -- Add limit if just looking for examplesImprovements:
- MATCH...AGAINST uses fulltext index
- Relevance scoring built-in
- ~100x faster with fulltext
- If no fulltext: at least use indexed column first
-- ✅ GOOD: Uses index
WHERE PostTypeId = 1 AND CreationDate > '2020-01-01'
-- ❌ BAD: Prevents index
WHERE YEAR(CreationDate) = 2020
WHERE LOWER(Title) LIKE '%keyword%'
WHERE Score * 2 > 10 -- Function on column-- ✅ GOOD: Small table first
SELECT ...
FROM (SELECT * FROM Users ORDER BY Reputation DESC LIMIT 10) u
JOIN Posts p ON u.Id = p.OwnerUserId
-- ❌ BAD: Large table first
SELECT ...
FROM Posts p
JOIN Users u ON p.OwnerUserId = u.Id
WHERE u.Reputation > 5000-- ✅ GOOD: Calculate once
WITH Total AS (SELECT COUNT(*) as cnt FROM table)
SELECT ..., COUNT(*) / t.cnt FROM ... CROSS JOIN Total t
-- ❌ BAD: Calculate per row
SELECT ..., COUNT(*) / (SELECT COUNT(*) FROM table)-- ✅ FASTER: SUM with CASE
SUM(CASE WHEN condition THEN 1 ELSE 0 END)
-- ❌ SLOWER: COUNT with DISTINCT
COUNT(DISTINCT CASE WHEN condition THEN id END)-- ✅ GOOD: Limit early
SELECT ... FROM (SELECT * FROM large_table LIMIT 1000) sub WHERE ...
-- ❌ BAD: Limit late
SELECT ... FROM large_table WHERE ... LIMIT 10 -- Processes all rows first| Query Type | Original Time | Optimized Time | Improvement |
|---|---|---|---|
| Row counts (9 tables) | ~2000ms | ~20ms | 100x faster |
| Top users with posts | ~800ms | ~15ms | 50x faster |
| Monthly activity | ~500ms | ~50ms | 10x faster |
| Answer distribution | ~300ms | ~100ms | 3x faster |
| Vote percentages | ~400ms | ~40ms | 10x faster |
| Users with badges | ~1000ms | ~20ms | 50x faster |
| LIKE search | ~1500ms | ~15ms | 100x faster (with fulltext) |
Times are approximate based on typical dataset sizes
-- Essential indexes for Stack Exchange schema
CREATE INDEX idx_posts_type_date ON Posts(PostTypeId, CreationDate);
CREATE INDEX idx_posts_owner ON Posts(OwnerUserId, SiteId);
CREATE INDEX idx_posts_score ON Posts(Score DESC);
CREATE INDEX idx_users_reputation ON Users(Reputation DESC);
CREATE INDEX idx_badges_user ON Badges(UserId, SiteId, Class);
CREATE INDEX idx_votes_type ON Votes(VoteTypeId);
CREATE INDEX idx_comments_post ON Comments(PostId, Score DESC);
-- Fulltext indexes for searching
CREATE FULLTEXT INDEX idx_posts_title ON Posts(Title);
CREATE FULLTEXT INDEX idx_posts_body ON Posts(Body);Before running a query, check:
- Are you filtering early (WHERE before JOIN)?
- Are you using indexed columns without functions?
- Are you limiting results when possible?
- Are you avoiding subqueries in SELECT that run per row?
- Are you using CTEs to clarify and optimize?
- Are you grouping by smallest set possible?
- Are you using appropriate data types in comparisons?
- Have you considered using window functions (MySQL 8.0+)?
- Are you using EXPLAIN to verify index usage?
-- Check if query uses index
EXPLAIN
SELECT ...
FROM ...
WHERE ...;
-- Look for:
-- - type: "ref" or "range" (GOOD)
-- - type: "ALL" (BAD - full table scan)
-- - key: should show index name
-- - rows: should be small number