
PR Template Patterns in Popular Open Source Projects

A data-driven analysis of pull request templates across 3,747 high-quality open source repositories.

Motivation

I maintain a small Python library and wanted to improve my PR template. The problem: I have no intuition for what a good template looks like. Rather than copy a single popular project and hope for the best, I wanted to understand the patterns — what do well-maintained projects consistently ask for, and what's optional?

This project collects PR templates from popular open source repositories, extracts structural features using LLM-based analysis, and synthesizes the findings into actionable guidance.

Dataset

| Stage | Count |
|---|---|
| Repos discovered (BigQuery) | 36,324 |
| Repos passing quality filter | 5,636 |
| PR template files collected | 5,621 |
| After dedup + cleaning | 3,747 |
| Features extracted | 3,746 |

Quality filter: ≥100 stars AND (≥10 watchers OR ≥20 forks). Median repository in the final dataset has 971 stars and 234 forks.
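
As a minimal illustration, assuming the repository metadata is loaded into a pandas DataFrame with stars, watchers, and forks columns (the column names here are illustrative, not the pipeline's actual ones), the quality filter amounts to:

```python
import pandas as pd

def quality_mask(repos: pd.DataFrame) -> pd.Series:
    """True where a repo has >=100 stars AND (>=10 watchers OR >=20 forks)."""
    return (repos["stars"] >= 100) & (
        (repos["watchers"] >= 10) | (repos["forks"] >= 20)
    )

# filtered = repos[quality_mask(repos)]
```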

Methodology

Data Collection

Discovery: BigQuery public dataset (github_repos.files) identified repositories containing PR template files (paths matching pull_request_template, markdown only).
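
A minimal sketch of this discovery step, assuming the google-cloud-bigquery client; the exact path filter and projection in src.fetch_repos may differ:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-bq-project")  # BQ_PROJECT in .env

# Repositories whose file tree contains a markdown PR template.
QUERY = """
SELECT DISTINCT repo_name
FROM `bigquery-public-data.github_repos.files`
WHERE LOWER(path) LIKE '%pull_request_template%'
  AND (LOWER(path) LIKE '%.md' OR LOWER(path) LIKE '%.markdown')
"""
repo_names = [row.repo_name for row in client.query(QUERY).result()]
```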

Collection: GitHub GraphQL API in batches of 40 with 1-second delays. Each qualifying repository's metadata (stars, forks, watchers) and template file content were retrieved.
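
A sketch of the batched retrieval, assuming the requests library and a single candidate template path (the real src.collect stage presumably checks several candidate locations and handles errors and pagination):

```python
import os
import time
import requests

API = "https://api.github.com/graphql"
HEADERS = {"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"}

def batch_query(full_names: list[str]) -> str:
    """One GraphQL document with an alias per repository in the batch."""
    parts = []
    for i, full_name in enumerate(full_names):
        owner, name = full_name.split("/")
        parts.append(
            f'r{i}: repository(owner: "{owner}", name: "{name}") {{\n'
            f'  stargazerCount forkCount watchers {{ totalCount }}\n'
            f'  template: object(expression: "HEAD:.github/PULL_REQUEST_TEMPLATE.md") {{\n'
            f'    ... on Blob {{ text }}\n'
            f'  }}\n'
            f'}}'
        )
    return "query {\n" + "\n".join(parts) + "\n}"

def collect(repos: list[str], batch_size: int = 40):
    """Yield GraphQL responses, 40 repos per request with a 1-second pause."""
    for start in range(0, len(repos), batch_size):
        batch = repos[start:start + batch_size]
        resp = requests.post(API, json={"query": batch_query(batch)},
                             headers=HEADERS, timeout=30)
        resp.raise_for_status()
        yield resp.json()["data"]
        time.sleep(1)
```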

Cleaning

  1. Size filtering: Removed files <200 bytes (trivial/placeholder templates)
  2. Deduplication: SHA-256 hash of file contents removed 1,046 fork duplicates
  3. Encoding validation: Removed files failing UTF-8 decode
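
A minimal sketch of these three steps, assuming each collected record carries the raw file bytes in a content field (the actual logic lives in the src.clean stage):

```python
import hashlib

def clean(records: list[dict]) -> list[dict]:
    """Size filter, SHA-256 dedup, and UTF-8 validation over raw template bytes."""
    kept, seen = [], set()
    for rec in records:
        raw: bytes = rec["content"]
        if len(raw) < 200:                        # 1. trivial/placeholder templates
            continue
        digest = hashlib.sha256(raw).hexdigest()
        if digest in seen:                        # 2. fork duplicates
            continue
        seen.add(digest)
        try:
            rec["text"] = raw.decode("utf-8")     # 3. encoding validation
        except UnicodeDecodeError:
            continue
        kept.append(rec)
    return kept
```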

Feature Extraction

Structured extraction using gemini-2.5-flash-lite-preview-09-2025 with constrained JSON output (40 async workers, temperature=0). Each template was analyzed for:

  • Structural metadata: section count, headings, word count, checklist presence
  • Content categories: 11 boolean fields for what information the template requests (description, testing, related issues, etc.)
  • Checklist composition: free-form topic labels, later normalized to a 36-category taxonomy
  • Subjective assessments: tone (formal/neutral/casual), friction level (low/medium/high), specificity (generic/project-specific)

Extraction succeeded on 3,746 of 3,747 templates (1 failure due to a model output-length bug). Full schema in src/schema.py.
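
A trimmed sketch of the extraction call, assuming the google-genai SDK and a small Pydantic model standing in for the real schema in src/schema.py (the actual pipeline fans this out across 40 async workers):

```python
from pydantic import BaseModel
from google import genai

class TemplateFeatures(BaseModel):          # trimmed stand-in for src/schema.py
    section_count: int
    word_count: int
    has_checklist: bool
    asks_for_description: bool
    asks_for_testing: bool
    checklist_topics: list[str]
    tone: str                               # formal / neutral / casual
    friction: str                           # low / medium / high

client = genai.Client()                     # reads GEMINI_API_KEY

def extract(template_text: str) -> TemplateFeatures:
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite-preview-09-2025",
        contents=f"Extract PR-template features from:\n\n{template_text}",
        config={
            "response_mime_type": "application/json",
            "response_schema": TemplateFeatures,
            "temperature": 0,
        },
    )
    return TemplateFeatures.model_validate_json(response.text)
```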

Normalization

Free-form checklist topics from the LLM were mapped to 36 canonical categories via regex patterns. Coverage: 73.7% of 14,173 raw topic mentions matched a category. The remaining 26.3% are predominantly singletons (project-specific language not capturable by a generic taxonomy).
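
A minimal sketch of the mapping, with two illustrative patterns (the full 36-category taxonomy is documented in src/normalize.py):

```python
import re

# Illustrative subset; the real taxonomy has 36 categories.
CATEGORY_PATTERNS = {
    "tests_added": re.compile(r"\btests?\b.*\badd(ed)?\b|\badd(ed)?\b.*\btests?\b", re.I),
    "docs_updated": re.compile(r"\bdoc(s|umentation)?\b.*\bupdat", re.I),
}

def normalize_topic(raw_topic: str) -> str | None:
    """Map a free-form checklist topic to a canonical category, or None if unmatched."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(raw_topic):
            return category
    return None  # counted toward the ~26% unmatched long tail
```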

Analysis

  • Univariate prevalence for all boolean, categorical, and numeric features
  • Fork-tier stratification (tier 1: 10–49 forks, tier 2: 50–199, tier 3: 200–999, tier 4: 1000+) as a proxy for project maturity and contribution volume
  • Correlation analysis: phi coefficients (boolean pairs) and point-biserial correlations (numeric × boolean) with Benjamini-Hochberg FDR correction
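
A sketch of the correlation and tiering steps described above, assuming the feature table is a pandas DataFrame and using scipy and statsmodels (the real implementation is the src.analyze stage):

```python
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

def boolean_correlations(df: pd.DataFrame, bool_cols: list[str]) -> pd.DataFrame:
    """Phi coefficient for every boolean pair, with Benjamini-Hochberg FDR correction."""
    rows = []
    for a, b in combinations(bool_cols, 2):
        # For two 0/1 variables, Pearson's r is the phi coefficient.
        phi, p = pearsonr(df[a].astype(int), df[b].astype(int))
        rows.append({"feature_a": a, "feature_b": b, "phi": phi, "p": p})
    result = pd.DataFrame(rows)
    result["p_fdr"] = multipletests(result["p"], method="fdr_bh")[1]
    return result

# Fork tiers as a maturity proxy:
# df["fork_tier"] = pd.cut(df["forks"], bins=[10, 50, 200, 1000, float("inf")],
#                          labels=[1, 2, 3, 4], right=False)
```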

Findings

What Templates Ask For

| Feature | Overall | Tier 4 (1000+ forks) | Recommendation |
|---|---|---|---|
| Description of changes | 78% | 78% | Must have |
| Checklist | 75% | 74% | Must have |
| Related issues | 69% | 72% | Must have |
| Testing evidence | 69% | 68% | Strongly recommended |
| Placeholder text | 61% | 66% | Strongly recommended |
| Documentation reminder | 47% | 44% | Context dependent |
| Motivation / context | 43% | 44% | Context dependent |
| Change type classification | 26% | 24% | Optional |
| Breaking changes | 20% | 16% | Optional |
| Screenshots | 14% | 14% | Optional |
| Reviewer notes | 13% | 13% | Optional |

The "must have" threshold is ≥70% prevalence among tier 4 projects. "Strongly recommended" is 50–70%. Features below 50% tier 4 prevalence are context-dependent or optional.

A few observations:

  • The core four are very stable across tiers. Description, checklist, related issues, and testing all exceed 64% in every tier. Projects converge on these regardless of size.
  • Placeholder text increases with maturity. 56% in tier 1 → 66% in tier 4. Larger projects invest more in guiding contributors through the template.
  • Breaking changes decreases with maturity. 23% in tier 1 → 16% in tier 4. This likely reflects that mature projects handle breaking changes through separate processes (RFCs, changelogs) rather than PR templates.

Template Shape

| Metric | Mean | Median | IQR |
|---|---|---|---|
| Sections | 3.0 | 3 | 1–4 |
| Word count | 100 | 75 | 50–130 |
| Checklist items | 4.0 | 4 | 1–6 |

The typical PR template is short: ~75 words, 3 sections, 4 checklist items. Two-thirds are medium friction (2–5 minutes to complete), 30% are low friction, and only 4% are high friction.

84% use a neutral tone. 56% are project-specific (referencing particular tools, conventions, or workflows).

Checklist Topics

Among templates that include checklists, the most common items are:

| Topic | % of all templates | Category |
|---|---|---|
| Documentation updated | 37% | Essential |
| Tests added | 36% | Essential |
| Tests pass | 26% | Recommended |
| Code style followed | 19% | Recommended |
| Issue linked | 16% | Recommended |
| Contributor guidelines read | 16% | Recommended |
| PR formatting / commit hygiene | 14% | Recommended |
| Changelog updated | 13% | Recommended |
| Change type labeled | 9% | Optional |
| Target branch correct | 7% | Optional |
| Legal sign-off (CLA/DCO) | 7% | Optional |
| Breaking changes noted | 7% | Optional |

The long tail includes 24 additional categories at <5% prevalence each (build passes, self-review, code comments, screenshots provided, etc.). Full taxonomy: 36 categories, documented in src/normalize.py.

Correlations

No surprising co-occurrence patterns emerged. The strongest non-trivial correlations:

  • asks_for_breaking_changes × asks_for_change_type (phi = 0.52) — these tend to appear together as part of a "change classification" cluster
  • has_checklist × asks_for_documentation (phi = 0.42) and has_checklist × asks_for_testing (phi = 0.42) — checklists tend to appear in templates that also ask for testing and docs, reflecting a "thoroughness" cluster
  • has_placeholder_text × asks_for_description (phi = 0.37) — templates that guide contributors with placeholders tend to be the same ones asking for descriptions

Application

Using these findings, I generated an improved PR template for Pollux, a Python library for multimodal LLM orchestration that I maintain solo.

Approach: Claude Opus 4.6 was given a structured prompt containing the aggregate findings, 6 exemplar templates from tier 4 repos (vscode, react, bootstrap, docker-mailserver, nestjs, deepchem), the existing template, and project-specific constraints. The full prompt, original template, and design rationale are in synthesis/.

Result

## Summary

<!-- What does this PR do and why? One or two sentences on the change,
     plus motivation if not obvious from the linked issue. -->

## Related issue

<!-- Link the issue this PR addresses. Use closing keywords if applicable:
     "Closes #123", "Fixes #456". Write "None" for unprompted changes. -->

## Test plan

<!-- How did you verify this works? Describe what you tested, not just that
     tests pass. Examples: "Added unit tests for X edge case", "Manually
     tested against Y model provider", "N/A — docs-only change". -->

## Notes

<!-- Optional: anything reviewers should know, design trade-offs, follow-up
     work, or context for your future self. -->

---

- [ ] PR title follows conventional commits
- [ ] `make check` passes
- [ ] Tests cover the meaningful cases, not just the happy path
- [ ] Docs updated (if this changes public API or user-facing behavior)

What changed

| Change | Data support |
|---|---|
| Added Related issue section | 72% of tier 4 projects ask for linked issues |
| Added Test plan section | 69% ask for testing evidence; a free-text section is more useful than a checkbox |
| Expanded Summary guidance to include "and why?" | Folds motivation (43%) into the existing section without a new heading |
| Added docs updated checklist item | Most common checklist topic (37% of all templates) |

What was deliberately excluded

| Feature | Prevalence | Why excluded for this project |
|---|---|---|
| Change type classification | 26% | Conventional commits already encode this (fix:, feat:, feat!:) |
| Breaking changes section | 20% | Covered by the ! suffix; pre-1.0, breaking changes are expected |
| Code style checklist items | 19% | make check runs ruff + mypy; CI is authoritative |
| Contributor guidelines reminder | 16% | Contributing guide is linked from the conventional commits item |
| Changelog reminder | 13% | python-semantic-release generates the changelog automatically |

The revised template has 4 sections and 4 checklist items (dataset median: 3 sections, 4 items) and stays in the low-friction band (~1–2 minutes to fill out, versus the medium-friction 2–5 minutes typical of the dataset). Full design rationale in synthesis/rationale.md.

Limitations

  • Descriptive, not causal. This study identifies what popular projects do, not what works. There is no outcome linkage (merge time, review quality, rework rates). The implicit argument is revealed preference: if 78% of high-fork projects independently converge on the same structure, there is signal in that convergence.
  • Survivor bias. Only repositories with ≥100 stars are included. Patterns may not generalize to smaller or newer projects.
  • Snapshot timing. BigQuery's github_repos dataset reports a last update of September 2025. Templates were fetched live from GitHub, but repository discovery reflects that snapshot.
  • LLM extraction. Feature extraction via Gemini may misclassify edge cases. Spot-checking was performed but no formal gold-set evaluation. One template (of 3,747) failed extraction due to a model output-length bug.
  • Normalization coverage. The regex-based taxonomy captures 73.7% of raw checklist topics. The remaining 26.3% is dominated by singletons (project-specific language). This means topic prevalence numbers are lower bounds — the true prevalence of concepts like "tests added" is likely higher than reported, since some variant phrasings go uncaptured.
  • Template ≠ practice. A template requesting testing evidence does not mean tests are actually written. Templates were analyzed in isolation, without knowledge of enforcement or compliance.

Reproduction

# Install dependencies
uv sync

# Run the full pipeline (stages must run in order for first-time data generation)
uv run python -m src.fetch_repos       # Stage 1: BigQuery repo discovery
uv run python -m src.collect           # Stage 2: GitHub GraphQL collection
uv run python -m src.clean             # Stage 3: Deduplication & filtering
uv run python -m src.extract           # Stage 4: LLM feature extraction
uv run python -m src.normalize         # Stage 5: Checklist topic normalization
uv run python -m src.analyze           # Stage 6: Statistical analysis
uv run python -m src.synthesize        # Stage 7: Recommendations & exemplars

Required environment variables (see .env.example):

| Variable | Required for |
|---|---|
| BQ_PROJECT | Stage 1 (BigQuery discovery) |
| GITHUB_TOKEN | Stage 2 (GraphQL collection) |
| GEMINI_API_KEY | Stage 4 (feature extraction) |

Stages 5–7 require no API keys and run in under 5 seconds total on the pre-generated data.

License

MIT
