# Code Change Characteristics and Description Alignment: A Comparative Study of Agentic versus Human Pull Requests
Replication Package for MSR 2026
This repository contains the replication package for the paper "Code Change Characteristics and Description Alignment: A Comparative Study of Agentic versus Human Pull Requests" accepted for publication at the MSR 2026 Conference.
## Overview

This study investigates how AI coding agents' pull requests (APRs) differ from human pull requests (HPRs) in terms of code change characteristics and description quality. We analyze 33,596 agent-generated PRs and 6,618 human PRs to answer two research questions:
- RQ1: How do APRs and HPRs differ in code change characteristics (files changed, code churn, lines added/removed, and change purposes)?
- RQ2: How well do APR descriptions and commit messages align with code changes?
## Repository Structure

```
.
├── notebooks/
│   ├── RQ1.ipynb                            # RQ1: Code change characteristics analysis
│   └── RQ2.ipynb                            # RQ2: Description alignment analysis
├── scripts/
│   ├── build_human_pr_commit_details_df.py  # Build human PR commit details
│   └── gen_commit_message.py                # Generate commit messages
├── data/                                    # Data directory (see Data Requirements below)
├── plots/                                   # Generated plots and visualizations
├── prompts/                                 # LLM prompts (e.g., LLM-as-judge)
├── pyproject.toml                           # Project dependencies
├── uv.lock                                  # Dependency lock file
└── README.md                                # This file
```
## Requirements

- Python 3.12 or higher
- uv (Python package manager)
- GitHub API tokens (for building human PR commit details)
## Setup

Install uv. On macOS:

```shell
brew install uv
```

On Linux/Windows, see: https://github.com/astral-sh/uv

Create a virtual environment and install dependencies:

```shell
uv venv --python 3.12
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync
```

Create a `.env` file in the project root:

```
GITHUB_TOKEN_1=your_github_token_1
GITHUB_TOKEN_2=your_github_token_2
```

Note: GitHub tokens are required to build the `human_pr_commit_details_df.parquet` file using the build script (see Additional Scripts section).
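Multiple tokens are configured because GitHub's API rate limit applies per token. A minimal stdlib sketch of how the numbered `GITHUB_TOKEN_*` variables can be collected and rotated (the build script's actual loading logic may differ; the function name here is illustrative):

```python
import os
from itertools import cycle

def load_github_tokens(prefix: str = "GITHUB_TOKEN_") -> list[str]:
    """Collect GITHUB_TOKEN_1, GITHUB_TOKEN_2, ... from the environment."""
    tokens = []
    i = 1
    while (token := os.environ.get(f"{prefix}{i}")) is not None:
        tokens.append(token)
        i += 1
    return tokens

# Rotate tokens round-robin to spread requests across rate limits:
#   token_pool = cycle(load_github_tokens())
#   next(token_pool)  # token for the next API request
```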
## Data Requirements

The following data files are required to run the replication notebooks:

- `pull_request.parquet` - Agent-generated PRs
- `pr_commits.parquet` - PR commits data
- `pr_commit_details.parquet` - Commit-level file change details for APRs
- `human_pull_request.parquet` - Human PRs
- `human_pr_commit_details_df.parquet` - Human PR commit details (must be generated; see Additional Scripts section)
- `pr_task_type.parquet` - Agent PR task type classifications
- `human_pr_task_type.parquet` - Human PR task type classifications
- `related_issue.parquet` - Related issue data
- `cleaned_train.csv` - PR description and commit message similarity benchmark (train set)
- `commitbench_test.csv` - Commit message similarity benchmark (test set)
These data files are NOT included in this repository. They must be downloaded from the following sources:
- AIDev dataset: https://huggingface.co/datasets/hao-li/AIDev
- PR-Description benchmark: https://figshare.com/s/58ee9c2a4e9d951305d7?file=46126455
- CommitBench dataset: https://huggingface.co/datasets/Maxscha/commitbench
After downloading, place all `.parquet` and `.csv` files in the `data/` directory:

```
data/
├── pull_request.parquet
├── pr_commits.parquet
├── pr_commit_details.parquet
├── human_pull_request.parquet
├── human_pr_commit_details_df.parquet  # (must be generated)
├── pr_task_type.parquet
├── human_pr_task_type.parquet
├── related_issue.parquet
├── cleaned_train.csv
└── commitbench_test.csv
```
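Before running the notebooks, you can confirm the data directory is complete with a small stdlib check (the file list mirrors the one above; the script itself is not part of the replication package):

```python
from pathlib import Path

REQUIRED_FILES = [
    "pull_request.parquet",
    "pr_commits.parquet",
    "pr_commit_details.parquet",
    "human_pull_request.parquet",
    "human_pr_commit_details_df.parquet",
    "pr_task_type.parquet",
    "human_pr_task_type.parquet",
    "related_issue.parquet",
    "cleaned_train.csv",
    "commitbench_test.csv",
]

def missing_data_files(data_dir: str = "data") -> list[str]:
    """Return the required files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All required data files are present.")
```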
## RQ1: Code Change Characteristics (`notebooks/RQ1.ipynb`)

This notebook analyzes how APRs and HPRs differ in:
- Merge rates and change footprints (commits, files, directories, lines)
- Symbol churn and symbol lifetime
- Change purposes (feature, bug fix, documentation, etc.)
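To illustrate the change-footprint metrics, per-file commit records can be aggregated into per-PR counts with pandas. The column names below (`pr_id`, `filename`, `additions`, `deletions`) are assumptions for this sketch, not necessarily the dataset's actual schema:

```python
import pandas as pd

def change_footprint(details: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-file change records into per-PR footprint metrics.

    Assumed columns: pr_id, filename, additions, deletions.
    """
    df = details.assign(
        # Parent directory of each file; top-level files keep their own name.
        directory=details["filename"].str.rsplit("/", n=1).str[0],
        churn=details["additions"] + details["deletions"],
    )
    return df.groupby("pr_id").agg(
        files_changed=("filename", "nunique"),
        directories_touched=("directory", "nunique"),
        lines_added=("additions", "sum"),
        lines_removed=("deletions", "sum"),
        total_churn=("churn", "sum"),
    )
```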
Open and run the notebook:

```shell
jupyter notebook notebooks/RQ1.ipynb
```

Or using JupyterLab:

```shell
jupyter lab notebooks/RQ1.ipynb
```

## RQ2: Description Alignment (`notebooks/RQ2.ipynb`)

This notebook examines the quality of commit messages and PR descriptions using:
- PR-Commit Similarity (semantic alignment between PR description and commit messages)
- Patch-Commit Similarity (alignment between diff and messages)
- LLM-based Consistency Score (GPT-4o quality rating)
- Classification models to identify factors predicting strong descriptions
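The similarity metrics reduce to a cosine similarity between text embeddings. A minimal sketch of the scoring step, using toy vectors in place of real encoder output (the model name in the comment is illustrative, not necessarily the one used in the paper):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In the real pipeline the vectors come from a sentence encoder, e.g.:
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model name
#   desc_vec, msg_vec = model.encode([pr_description, commit_message])
desc_vec = np.array([0.2, 0.8, 0.1])
msg_vec = np.array([0.25, 0.75, 0.05])
score = cosine_similarity(desc_vec, msg_vec)
```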
Open and run the notebook:

```shell
jupyter notebook notebooks/RQ2.ipynb
```

Or using JupyterLab:

```shell
jupyter lab notebooks/RQ2.ipynb
```

## Additional Scripts

### build_human_pr_commit_details_df.py

The `human_pr_commit_details_df.parquet` file must be generated by running this script. The file is required by the analysis notebooks.
First-time run:

```shell
python scripts/build_human_pr_commit_details_df.py -o data/human_pr_commit_details_df.parquet
```

Resume from a previous run (if interrupted):

```shell
python scripts/build_human_pr_commit_details_df.py -o data/human_pr_commit_details_df.parquet --resume
```

Note: This script requires GitHub API tokens (see Setup section). It fetches commit details for all human PRs from the AIDev dataset.
### gen_commit_message.py

Generate commit messages using the CodeT5 model:

```shell
python scripts/gen_commit_message.py -i data/input.parquet -o data/output.parquet
```

Note: GPU support is recommended for faster processing. The input parquet file must contain a `patch` column.
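A minimal way to prepare an input file for the script; the only schema requirement stated above is a `patch` column (writing parquet requires `pyarrow`, and the diff content is a made-up example):

```python
import pandas as pd

# Build a minimal input table; only the 'patch' column is required.
diffs = [
    "diff --git a/app.py b/app.py\n-print('hi')\n+print('hello')",
]
input_df = pd.DataFrame({"patch": diffs})

# input_df.to_parquet("data/input.parquet")  # requires pyarrow
```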
## Results

All analysis results are embedded directly in the Jupyter notebooks (`RQ1.ipynb` and `RQ2.ipynb`). Run the notebooks to reproduce all findings from the paper.
## Dependencies

Key dependencies (managed via `pyproject.toml` and `uv.lock`):

- `pandas` - Data manipulation and analysis
- `numpy` - Numerical computations
- `scipy` - Statistical tests
- `scikit-learn` - Machine learning models
- `shap` - Model interpretability
- `matplotlib`, `seaborn` - Visualizations
- `sentence-transformers` - Text embeddings
- `transformers` - LLM fine-tuning and inference
- `pyarrow` - Parquet file support
## Notes

- The notebooks include extensive documentation and markdown cells explaining each analysis step.
- Some analyses (e.g., embedding generation, model inference) are computationally intensive. GPU support is recommended for faster processing but not required.
- The LLM-as-judge prompt used for RQ2 is available in `prompts/lllm_as_judge_prompt.md`.
## Citation

If you use this replication package, please cite the paper:

```bibtex
@inproceedings{pham2026agentic_codechange,
  title={Code Change Characteristics and Description Alignment: A Comparative Study of Agentic versus Human Pull Requests},
  author={Dung Pham and Taher A. Ghaleb},
  booktitle={Proceedings of the 23rd IEEE/ACM International Conference on Mining Software Repositories (MSR)},
  year={2026}
}
```

## Contact

For questions about this replication package, please contact:

- Dung Pham: dungpham290198@gmail.com
- Taher A. Ghaleb: taherghaleb@trentu.ca
## Acknowledgments

This work uses the AIDev dataset by Hao Li et al., available at https://huggingface.co/datasets/hao-li/AIDev.
We also use benchmark datasets from:
- Tire et al. (PR-Description benchmark)
- Schall et al. (CommitBench dataset)
This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC): RGPIN-2025-05897.