A data-driven analysis of pull request templates across 3,747 high-quality open source repositories.
I maintain a small Python library and wanted to improve my PR template. The problem: I have no intuition for what a good template looks like. Rather than copy a single popular project and hope for the best, I wanted to understand the patterns — what do well-maintained projects consistently ask for, and what's optional?
This project collects PR templates from popular open source repositories, extracts structural features using LLM-based analysis, and synthesizes the findings into actionable guidance.
| Stage | Count |
|---|---|
| Repos discovered (BigQuery) | 36,324 |
| Repos passing quality filter | 5,636 |
| PR template files collected | 5,621 |
| After dedup + cleaning | 3,747 |
| Features extracted | 3,746 |
Quality filter: ≥100 stars AND (≥10 watchers OR ≥20 forks). Median repository in the final dataset has 971 stars and 234 forks.
Discovery: BigQuery public dataset (github_repos.files) identified repositories containing PR template files (paths matching pull_request_template, markdown only).
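A minimal sketch of the discovery query, assuming the standard public-dataset layout (the exact SQL and path filters live in src/fetch_repos.py and may differ):

```python
# Stage 1 sketch: find repos that ship a markdown PR template.
# Requires google-cloud-bigquery and a billing-enabled project in BQ_PROJECT.
import os

from google.cloud import bigquery

DISCOVERY_SQL = """
SELECT DISTINCT repo_name, path
FROM `bigquery-public-data.github_repos.files`
WHERE LOWER(path) LIKE '%pull_request_template%'
  AND LOWER(path) LIKE '%.md'
"""

def discover_repos() -> list[dict]:
    client = bigquery.Client(project=os.environ["BQ_PROJECT"])
    rows = client.query(DISCOVERY_SQL).result()
    return [{"repo_name": row.repo_name, "path": row.path} for row in rows]
```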
Collection: GitHub GraphQL API in batches of 40 with 1-second delays. Each qualifying repository's metadata (stars, forks, watchers) and template file content were retrieved.
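Roughly, each batch becomes one aliased GraphQL query (illustrative sketch; the batching, retries, and rate-limit handling in src/collect.py will differ):

```python
# Stage 2 sketch: fetch metadata + template contents for a batch of repos in one request.
import os
import time

import requests

GQL_ENDPOINT = "https://api.github.com/graphql"

def fetch_batch(batch: list[tuple[str, str, str]]) -> dict:
    """batch: (owner, name, template_path) triples, up to 40 per request."""
    fields = []
    for i, (owner, name, path) in enumerate(batch):
        fields.append(
            f'r{i}: repository(owner: "{owner}", name: "{name}") {{'
            f' stargazerCount forkCount watchers {{ totalCount }}'
            f' object(expression: "HEAD:{path}") {{ ... on Blob {{ text }} }}'
            f' }}'
        )
    query = "query {" + " ".join(fields) + "}"
    resp = requests.post(
        GQL_ENDPOINT,
        json={"query": query},
        headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    time.sleep(1)  # 1-second delay between batches
    return resp.json()["data"]
```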
- Size filtering: Removed files <200 bytes (trivial/placeholder templates)
- Deduplication: SHA-256 hash of file contents removed 1,046 fork duplicates
- Encoding validation: Removed files failing UTF-8 decode (see the sketch after this list)
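A minimal sketch of these cleaning steps (field names are hypothetical; the real logic is in src/clean.py):

```python
# Stage 3 sketch: size filter, UTF-8 validation, and SHA-256 dedup of fork copies.
import hashlib

def clean(templates: list[dict]) -> list[dict]:
    seen: set[str] = set()
    kept = []
    for t in templates:
        raw: bytes = t["content_bytes"]  # hypothetical field name
        if len(raw) < 200:               # trivial/placeholder templates
            continue
        try:
            raw.decode("utf-8")          # drop files that fail UTF-8 decode
        except UnicodeDecodeError:
            continue
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen:               # fork duplicate
            continue
        seen.add(digest)
        kept.append(t)
    return kept
```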
Structured extraction using gemini-2.5-flash-lite-preview-09-2025 with constrained JSON output (40 async workers, temperature=0). Each template was analyzed for:
- Structural metadata: section count, headings, word count, checklist presence
- Content categories: 11 boolean fields for what information the template requests (description, testing, related issues, etc.)
- Checklist composition: free-form topic labels, later normalized to a 36-category taxonomy
- Subjective assessments: tone (formal/neutral/casual), friction level (low/medium/high), specificity (generic/project-specific)
Extraction succeeded on 3,746 of 3,747 templates (1 failure due to a model output-length bug). Full schema in src/schema.py.
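A hedged sketch of the extraction call using the google-genai SDK (the fields shown are a truncated illustration of the real schema in src/schema.py):

```python
# Stage 4 sketch: constrained JSON output validated against a pydantic schema.
import asyncio

from google import genai
from google.genai import types
from pydantic import BaseModel

class TemplateFeatures(BaseModel):  # truncated; see src/schema.py for the full schema
    section_count: int
    word_count: int
    has_checklist: bool
    asks_for_description: bool
    asks_for_testing: bool
    tone: str  # "formal" | "neutral" | "casual"

client = genai.Client()      # reads GEMINI_API_KEY from the environment
SEM = asyncio.Semaphore(40)  # 40 concurrent workers

async def extract(template_text: str) -> TemplateFeatures:
    async with SEM:
        response = await client.aio.models.generate_content(
            model="gemini-2.5-flash-lite-preview-09-2025",
            contents="Extract structural features from this PR template:\n\n" + template_text,
            config=types.GenerateContentConfig(
                temperature=0,
                response_mime_type="application/json",
                response_schema=TemplateFeatures,  # constrains decoding to the schema
            ),
        )
        return response.parsed
```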
Free-form checklist topics from the LLM were mapped to 36 canonical categories via regex patterns. Coverage: 73.7% of 14,173 raw topic mentions matched a category. The remaining 26.3% are predominantly singletons (project-specific language not capturable by a generic taxonomy).
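The mapping is a first-match lookup over compiled patterns, roughly like this (categories and regexes here are illustrative, not the actual taxonomy in src/normalize.py):

```python
# Stage 5 sketch: map free-form checklist topics to canonical categories.
import re

CATEGORY_PATTERNS: dict[str, re.Pattern[str]] = {
    "tests_added": re.compile(r"\b(add|added|wrote|new)\b.*\btests?\b|\btests?\b.*\badded\b", re.I),
    "tests_pass": re.compile(r"\btests?\b.*\b(pass|passing|green)\b", re.I),
    "docs_updated": re.compile(r"\b(docs?|documentation)\b.*\b(update[ds]?|added)\b", re.I),
    "changelog_updated": re.compile(r"\bchangelog\b", re.I),
}

def normalize_topic(raw_topic: str) -> str | None:
    """Return the first matching canonical category, or None for the uncaptured long tail."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(raw_topic):
            return category
    return None  # counts toward the ~26% of mentions left unmatched
```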
- Univariate prevalence for all boolean, categorical, and numeric features
- Fork-tier stratification (tier 1: 10–49 forks, tier 2: 50–199, tier 3: 200–999, tier 4: 1000+) as a proxy for project maturity and contribution volume
- Correlation analysis: phi coefficients (boolean pairs) and point-biserial correlations (numeric × boolean) with Benjamini-Hochberg FDR correction (see the sketch after this list)
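A sketch of the boolean-pair correlation step (assumed implementation, not the code in src/analyze.py; the point-biserial case is analogous via scipy.stats.pointbiserialr):

```python
# Stage 6 sketch: phi coefficients for all boolean feature pairs with BH-FDR correction.
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

def boolean_correlations(df: pd.DataFrame, bool_cols: list[str]) -> pd.DataFrame:
    """Phi equals Pearson's r computed on 0/1-coded columns."""
    rows = []
    for a, b in combinations(bool_cols, 2):
        r, p = pearsonr(df[a].astype(int), df[b].astype(int))
        rows.append({"pair": f"{a} × {b}", "phi": r, "p": p})
    out = pd.DataFrame(rows)
    out["p_fdr"] = multipletests(out["p"], method="fdr_bh")[1]  # Benjamini-Hochberg
    return out.sort_values("phi", ascending=False)
```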
| Feature | Overall | Tier 4 (1000+ forks) | Recommendation |
|---|---|---|---|
| Description of changes | 78% | 78% | Must have |
| Checklist | 75% | 74% | Must have |
| Related issues | 69% | 72% | Must have |
| Testing evidence | 69% | 68% | Strongly recommended |
| Placeholder text | 61% | 66% | Strongly recommended |
| Documentation reminder | 47% | 44% | Context dependent |
| Motivation / context | 43% | 44% | Context dependent |
| Change type classification | 26% | 24% | Optional |
| Breaking changes | 20% | 16% | Optional |
| Screenshots | 14% | 14% | Optional |
| Reviewer notes | 13% | 13% | Optional |
The "must have" threshold is ≥70% prevalence among tier 4 projects. "Strongly recommended" is 50–70%. Features below 50% tier 4 prevalence are context-dependent or optional.
A few observations:
- The core four are very stable across tiers. Description, checklist, related issues, and testing all exceed 64% in every tier. Projects converge on these regardless of size.
- Placeholder text increases with maturity. 56% in tier 1 → 66% in tier 4. Larger projects invest more in guiding contributors through the template.
- Breaking changes decreases with maturity. 23% in tier 1 → 16% in tier 4. This likely reflects that mature projects handle breaking changes through separate processes (RFCs, changelogs) rather than PR templates.
| Metric | Mean | Median | IQR |
|---|---|---|---|
| Sections | 3.0 | 3 | 1–4 |
| Word count | 100 | 75 | 50–130 |
| Checklist items | 4.0 | 4 | 1–6 |
The typical PR template is short: ~75 words, 3 sections, 4 checklist items. Two-thirds are medium friction (2–5 minutes to complete), 30% are low friction. Only 4% are high friction.
84% use a neutral tone. 56% are project-specific (referencing particular tools, conventions, or workflows).
Among templates that include checklists, the most common items are:
| Topic | % of all templates | Category |
|---|---|---|
| Documentation updated | 37% | Essential |
| Tests added | 36% | Essential |
| Tests pass | 26% | Recommended |
| Code style followed | 19% | Recommended |
| Issue linked | 16% | Recommended |
| Contributor guidelines read | 16% | Recommended |
| PR formatting / commit hygiene | 14% | Recommended |
| Changelog updated | 13% | Recommended |
| Change type labeled | 9% | Optional |
| Target branch correct | 7% | Optional |
| Legal sign-off (CLA/DCO) | 7% | Optional |
| Breaking changes noted | 7% | Optional |
The long tail includes 24 additional categories at <5% prevalence each (build passes, self-review, code comments, screenshots provided, etc.). Full taxonomy: 36 categories, documented in src/normalize.py.
No surprising co-occurrence patterns emerged. The strongest non-trivial correlations:
- `asks_for_breaking_changes` × `asks_for_change_type` (phi = 0.52) — these tend to appear together as part of a "change classification" cluster
- `has_checklist` × `asks_for_documentation` (phi = 0.42) and `has_checklist` × `asks_for_testing` (phi = 0.42) — checklists tend to appear in templates that also ask for testing and docs, reflecting a "thoroughness" cluster
- `has_placeholder_text` × `asks_for_description` (phi = 0.37) — templates that guide contributors with placeholders tend to be the same ones asking for descriptions
Using these findings, I generated an improved PR template for Pollux, a Python library for multimodal LLM orchestration that I maintain solo.
Approach: A structured prompt containing the aggregate findings, 6 exemplar templates from tier 4 repos (vscode, react, bootstrap, docker-mailserver, nestjs, deepchem), the existing template, and project-specific constraints was provided to Claude Opus 4.6. The full prompt, original template, and design rationale are in synthesis/.
```markdown
## Summary
<!-- What does this PR do and why? One or two sentences on the change,
plus motivation if not obvious from the linked issue. -->
## Related issue
<!-- Link the issue this PR addresses. Use closing keywords if applicable:
"Closes #123", "Fixes #456". Write "None" for unprompted changes. -->
## Test plan
<!-- How did you verify this works? Describe what you tested, not just that
tests pass. Examples: "Added unit tests for X edge case", "Manually
tested against Y model provider", "N/A — docs-only change". -->
## Notes
<!-- Optional: anything reviewers should know, design trade-offs, follow-up
work, or context for your future self. -->
---
- [ ] PR title follows conventional commits
- [ ] `make check` passes
- [ ] Tests cover the meaningful cases, not just the happy path
- [ ] Docs updated (if this changes public API or user-facing behavior)
```

| Change | Data support |
|---|---|
| Added Related issue section | 72% of tier 4 projects ask for linked issues |
| Added Test plan section | 69% ask for testing evidence; a free-text section is more useful than a checkbox |
| Expanded Summary guidance to include "and why?" | Folds motivation (43%) into the existing section without a new heading |
| Added docs updated checklist item | Most common checklist topic (37% of all templates) |
| Feature | Prevalence | Why excluded for this project |
|---|---|---|
| Change type classification | 26% | Conventional commits already encode this (`fix:`, `feat:`, `feat!:`) |
| Breaking changes section | 20% | Covered by the `!` suffix; pre-1.0, breaking changes are expected |
| Code style checklist items | 19% | `make check` runs ruff + mypy; CI is authoritative |
| Contributor guidelines reminder | 16% | Contributing guide is linked from the conventional commits item |
| Changelog reminder | 13% | python-semantic-release generates the changelog automatically |
The revised template has 4 sections and 4 checklist items (dataset median: 3 sections, 4 items) and stays in the low-friction band (~1–2 minutes to fill out vs. the dataset median of 2–5 minutes). Full design rationale in synthesis/rationale.md.
- Descriptive, not causal. This study identifies what popular projects do, not what works. There is no outcome linkage (merge time, review quality, rework rates). The implicit argument is revealed preference: if 78% of high-fork projects independently converge on the same structure, there is signal in that convergence.
- Survivor bias. Only repositories with ≥100 stars are included. Patterns may not generalize to smaller or newer projects.
- Snapshot timing. BigQuery's `github_repos` dataset reports a last update of September 2025. Templates were fetched live from GitHub, but repository discovery reflects that snapshot.
- LLM extraction. Feature extraction via Gemini may misclassify edge cases. Spot-checking was performed but no formal gold-set evaluation. One template (of 3,747) failed extraction due to a model output-length bug.
- Normalization coverage. The regex-based taxonomy captures 73.7% of raw checklist topics. The remaining 26.3% is dominated by singletons (project-specific language). This means topic prevalence numbers are lower bounds — the true prevalence of concepts like "tests added" is likely higher than reported, since some variant phrasings go uncaptured.
- Template ≠ practice. A template requesting testing evidence does not mean tests are actually written. Templates were analyzed in isolation, without knowledge of enforcement or compliance.
```bash
# Install dependencies
uv sync
# Run the full pipeline (stages must run in order for first-time data generation)
uv run python -m src.fetch_repos # Stage 1: BigQuery repo discovery
uv run python -m src.collect # Stage 2: GitHub GraphQL collection
uv run python -m src.clean # Stage 3: Deduplication & filtering
uv run python -m src.extract # Stage 4: LLM feature extraction
uv run python -m src.normalize # Stage 5: Checklist topic normalization
uv run python -m src.analyze # Stage 6: Statistical analysis
uv run python -m src.synthesize   # Stage 7: Recommendations & exemplars
```

Required environment variables (see .env.example):
| Variable | Required for |
|---|---|
| `BQ_PROJECT` | Stage 1 (BigQuery discovery) |
| `GITHUB_TOKEN` | Stage 2 (GraphQL collection) |
| `GEMINI_API_KEY` | Stage 4 (feature extraction) |
Stages 5–7 require no API keys and run in under 5 seconds total on the pre-generated data.