An analysis and report of how pull requests work on GitHub
You only need the following if you want to regenerate (or generate more of)
the data files used for the analysis. The data files used in the various
papers can be found in data/*.csv.
Make sure that Ruby 2.2 is installed on your machine; if it is not, you can install it with RVM. Then, it should suffice to do:
apt-get install libicu-dev cmake libmysqlclient-dev parallel
rvm install 2.2.1
rvm use 2.2.1
gem install bundler
bundle install
gem install mysql2 bson_ext
The executable commands in this project inherit functionality from the GHTorrent libraries. To work, they need the GHTorrent MongoDB data and a recent version of the GHTorrent MySQL database. For that, you may use the data from ghtorrent.org.
In addition to command-specific arguments, the commands share the same
config.yaml file for connection details to external systems. You
can find a template config.yaml file here.
The analysis scripts are only interested in the connection details for
MySQL and MongoDB, and the location of a temporary directory
(the cache_dir directory).
- Create intermediate tables to do the querying
create view project_languages_totals as
select project_id, language, bytes, max(created_at) as last_update
from project_languages
group by project_id, language, bytes;
create table project_language_perc_last_update as
select a.project_id as project_id, a.language as language,
a.bytes / (select sum(b.bytes)
from project_languages_totals b
where b.project_id = a.project_id
group by b.project_id) as ratio
from project_languages_totals a;
- For each one of the languages java, javascript, scala, ruby, python, run the following query on the GHTorrent database:
select u.login, p.name, count(*)
from projects p, users u, pull_requests pr
where p.owner_id = u.id
and pr.base_repo_id = p.id
and p.deleted is false
and p.forked_from is null
and p.language = 'Javascript'
and (
select ratio
from project_language_perc_last_update lu
where lu.project_id = p.id
and lu.language = 'javascript' limit 1) >= 0.75
and exists (select *
from project_commits pc, commits c
where c.id = pc.commit_id
and pc.project_id = p.id
and c.created_at > DATE_SUB(DATE('2015-11-01'), INTERVAL 6 MONTH) limit 1)
group by p.id
having count(*) > 50
order by count(*) desc;
This will return all projects, excluding deleted projects and forks, that have more than 50 pull requests, whose main language is the indicated one (main means that at least 75% of the code is in this language) and which have received at least one commit in the six months before Nov 1, 2015 (that is, after May 1, 2015).
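Because the same query is run once per language, the substitution can be scripted. The following is a minimal sketch, not part of the repository: `project_selection_query` is a hypothetical helper name, and the sketch assumes that the language comparison is case-insensitive (as with MySQL's default collations), since the query above spells the value both as 'Javascript' and 'javascript'.

```ruby
# Languages named in the text above.
LANGUAGES = %w[java javascript scala ruby python].freeze

# Returns the project-selection query for one language.
# project_selection_query is a hypothetical helper name; the thresholds
# (0.75, 50, 2015-11-01, 6 months) are copied from the query above.
def project_selection_query(language)
  unless LANGUAGES.include?(language)
    raise ArgumentError, "unsupported language: #{language}"
  end

  # Interpolation is acceptable here only because the value was just
  # checked against a fixed whitelist.
  # ASSUMPTION: string comparison is case-insensitive, so 'javascript'
  # also matches a stored value such as 'Javascript'.
  <<-SQL
select u.login, p.name, count(*)
from projects p, users u, pull_requests pr
where p.owner_id = u.id
  and pr.base_repo_id = p.id
  and p.deleted is false
  and p.forked_from is null
  and p.language = '#{language}'
  and (select ratio
       from project_language_perc_last_update lu
       where lu.project_id = p.id
         and lu.language = '#{language}' limit 1) >= 0.75
  and exists (select *
              from project_commits pc, commits c
              where c.id = pc.commit_id
                and pc.project_id = p.id
                and c.created_at > DATE_SUB(DATE('2015-11-01'), INTERVAL 6 MONTH) limit 1)
group by p.id
having count(*) > 50
order by count(*) desc;
  SQL
end

if __FILE__ == $PROGRAM_NAME
  # Print one query per language, ready to paste into a MySQL client.
  LANGUAGES.each { |lang| puts project_selection_query(lang) }
end
```

The whitelist check keeps the string interpolation safe; with arbitrary input, a parameterized query would be the better choice.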
The data analysis consists of two steps:
- Generating intermediate data files
- Analyzing data files with R
####Generating intermediate files
To produce the required data files, first run the bin/pull_req_data_extraction.rb script like so:
ruby -Ibin bin/pull_req_data_extraction.rb -c config.yaml owner repo lang
where:
- owner is the project owner
- repo is the name of the repository
- lang is the main repository language as reported by GitHub. At the moment, only ruby, java, python, scala and javascript are supported.
The projects we analyzed in each paper are listed in the projects.txt
file found in each paper directory.
The projects that are commented out were excluded for reasons identified in each paper.
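Running the extraction for every project in a projects.txt file can be scripted. The sketch below is illustrative only: the file format is not described here, so it assumes one project per line written as "owner repo lang", with excluded projects commented out by a leading '#'. The helper names `parse_projects` and `extraction_command` are hypothetical, and the sample entries are made up.

```ruby
# Parse the text of a projects.txt file into [owner, repo, lang] triples.
# ASSUMPTION: one project per line as "owner repo lang", separated by
# whitespace; lines starting with '#' are excluded projects.
def parse_projects(text)
  text.each_line.map(&:strip)
      .reject { |line| line.empty? || line.start_with?('#') }
      .map { |line| line.split(/\s+/, 3) }
end

# Build the argument vector for one extraction run, mirroring the
# command line shown above.
def extraction_command(owner, repo, lang, config = 'config.yaml')
  ['ruby', '-Ibin', 'bin/pull_req_data_extraction.rb', '-c', config,
   owner, repo, lang]
end

if __FILE__ == $PROGRAM_NAME
  # Made-up sample content, only to show the expected shape.
  sample = "someowner somerepo ruby\n# excluded project java\n"
  parse_projects(sample).each do |owner, repo, lang|
    puts extraction_command(owner, repo, lang).join(' ')
  end
end
```

Since the script prints to STDOUT, each run's output would typically be redirected to its own CSV file.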
The data extraction script extracts several variables for each pull request and prints to STDOUT a comma-separated line for each pull request using the following fields:
- pull_req_id: The database id for the pull request
- project_name: The name of the project (same for all lines)
- github_id: The GitHub id for the pull request. Can be used to see the actual pull request on GitHub using the following URL: https://github.com/#{owner}/#{repo}/pull/#{github_id}
- created_at: The epoch timestamp of the creation date of the pull request
- merged_at: The epoch timestamp of the merge date of the pull request
- closed_at: The epoch timestamp of the closing date of the pull request
- lifetime_minutes: Number of minutes between the creation and the close of the pull request
- mergetime_minutes: Number of minutes between the creation and the merge of the pull request
- merged_using: The heuristic used to identify the merge action. The field can have the following values:
  - github: The merge button was used for merging
  - commits_in_master: One of the pull request commits appears in the project's master branch
  - fixes_in_commit: The PR was closed by a commit and the commit SHA is in the project's master branch
  - commit_sha_in_comments: The PR's discussion includes a commit SHA and matches the following regexp: merg(?:ing|ed)|appl(?:ying|ied)|pull[?:ing|ed]|push[?:ing|ed]|integrat[?:ing|ed]
  - merged_in_comments: One of the last 3 PR comments matches the above regular expression
  - unknown: The pull request cannot be identified as merged
- conflict: Boolean, true if the pull request comments include the word conflict
- forward_links: Boolean, true if the pull request comments include a link to a newer pull request
- team_size: The number of people that had committed to the repository directly (not through pull requests) in the period (merged_at - 3 months, merged_at)
- num_commits: Number of commits included in the pull request
- num_commit_comments: Number of code review comments
- num_issue_comments: Number of discussion comments
- num_comments: Total number of comments (num_commit_comments + num_issue_comments)
- num_participants: Number of people participating in pull request discussions
- files_added: Files added by the pull request
- files_deleted: Files deleted by the pull request
- files_modified: Files modified by the pull request
- files_changed: Total number of files changed (added, modified, deleted) by the pull request
- src_files: Number of source files touched by the pull request
- doc_files: Number of documentation files touched by the pull request
- other_files: Number of other (non src/doc) files touched by the pull request
- perc_external_contribs: % of commits coming from pull requests up to one month before the start of this pull request
- total_commits_last_month: Number of commits to the repository during the last month
- main_team_commits_last_month: Number of commits to the repository during the last month, excluding the commits coming from this and other pull requests
- sloc: Number of executable lines of code in the main project repo
- src_churn: Number of source code lines changed by the pull request
- test_churn: Number of test lines changed by the pull request
- commits_on_files_touched: Number of commits on the files touched by the pull request during the last month
- test_lines_per_kloc: Number of test (executable) lines per 1000 executable lines
- test_cases_per_kloc: Number of tests per 1000 executable lines
- asserts_per_kloc: Number of assert statements per 1000 executable lines
- watchers: Number of watchers (stars) of the repo at the time the pull request was made
- requester: The developer that performed the pull request
- prev_pullreqs: Number of pull requests by the developer up to the specific pull request
- requester_succ_rate: % of merged vs unmerged pull requests for the developer
- followers: Number of followers of the pull requester at the time the pull request was made
- intra_branch: Whether the pull request is between branches of the same repository
- main_team_member: Boolean, true if the pull requester is part of the project's main team at the time the pull request was opened
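To make the comment-based merge heuristics concrete, here is a small sketch that applies the regexp listed under commit_sha_in_comments exactly as written. Note that in the last three alternatives `[?:ing|ed]` is a character class rather than a group, so they match the stem followed by any single one of those characters (for example "integrate" matches). The helper names are hypothetical, and case-insensitive matching is an assumption; how the extraction script applies the pattern is not stated here.

```ruby
# The regexp as listed above. ASSUMPTION: matched case-insensitively.
MERGE_REGEXP = /merg(?:ing|ed)|appl(?:ying|ied)|pull[?:ing|ed]|push[?:ing|ed]|integrat[?:ing|ed]/i

# True if a single comment body matches the merge regexp.
# (merge_comment? is a hypothetical helper name.)
def merge_comment?(body)
  !(body =~ MERGE_REGEXP).nil?
end

# Sketch of the merged_in_comments heuristic: does one of the last
# three comments match the regexp?
def merged_in_comments?(comments)
  comments.last(3).any? { |body| merge_comment?(body) }
end

if __FILE__ == $PROGRAM_NAME
  puts merge_comment?('Merged in abc1234')            # matches "merged"
  puts merge_comment?('please integrate this')        # matches via the character class
  puts merge_comment?('this pull request looks fine') # "pull " is not followed by a class character
end
```

The second call shows the effect of the character class: "integrate" is accepted although it ends in neither "ing" nor "ed".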
The following features have been disabled from output: num_commit_comments, num_issue_comments, files_added, files_deleted, files_modified, src_files, doc_files, other_files, commits_last_month, main_team_commits_last_month. In addition, the following features are not used in further analysis even if they are part of the data files: test_cases_per_kloc, asserts_per_kloc, watchers, followers, requester.
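If you post-process the data files yourself, the features that are not used in further analysis can be dropped up front. This is a minimal sketch under the assumption that a data file starts with a header row naming the fields; `drop_unused_features` is a hypothetical helper, not part of the repository.

```ruby
require 'csv'

# Features present in the data files but not used in further analysis,
# as listed above.
UNUSED_FEATURES = %w[test_cases_per_kloc asserts_per_kloc watchers
                     followers requester].freeze

# Returns one Hash per pull request, without the unused features.
# ASSUMPTION: the first CSV line is a header row with the field names.
def drop_unused_features(csv_text)
  CSV.parse(csv_text, headers: true).map do |row|
    row.to_h.reject { |name, _value| UNUSED_FEATURES.include?(name) }
  end
end

if __FILE__ == $PROGRAM_NAME
  # Made-up two-line sample, only to show the shape of the result.
  sample = "pull_req_id,watchers,num_commits\n1,10,3\n"
  p drop_unused_features(sample)
end
```

All values stay strings here; converting them to numbers is left to the analysis step.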
Lines reported are always executable lines of code (comments and whitespace have been stripped out). To count testing related data, the script exploits the fact that Java, Ruby and Python projects are organized using the Maven, Gem and Pythonic project conventions respectively. Test cases are recognized as follows:
- Java: Files in directories under a /test/ branch of the file tree are considered test files. JUnit 4 test cases are recognized using the @Test annotation. For JUnit 3, methods starting with test are considered test methods. Asserts are counted by "grepping" through the source code lines for assert* statements.
- Ruby: Files under the /test/ and /spec/ directories are considered test files. Test cases are recognized by "grepping" for test* (RUnit), should .* do (Shoulda) and it .* do (RSpec) in the source file lines.
- Python: Follows the pytest test discovery conventions, see http://pytest.org/latest/goodpractises.html#conventions-for-python-test-discovery
- Scala: Same as Java with the addition of specs2 matchers
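The Ruby rules above can be illustrated with a short sketch. It is not the extraction script's code: the regular expressions are one possible reading of the patterns named above (in particular, "test*" is interpreted as method definitions whose name starts with test), and all helper names are hypothetical.

```ruby
# One possible reading of the patterns described above (assumptions).
TEST_CASE_PATTERNS = [
  /^\s*def\s+test/,   # RUnit: methods whose name starts with "test"
  /^\s*should .* do/, # Shoulda
  /^\s*it .* do/      # RSpec
].freeze

# A file counts as a test file if it lives under /test/ or /spec/.
def test_file?(path)
  path.include?('/test/') || path.include?('/spec/')
end

# Count the source lines that look like the start of a test case.
def count_test_cases(source)
  source.each_line.count do |line|
    TEST_CASE_PATTERNS.any? { |pattern| line =~ pattern }
  end
end

# Normalize a count to "per 1000 executable lines", as used by the
# *_per_kloc fields.
def per_kloc(count, sloc)
  return 0.0 if sloc.zero?
  count * 1000.0 / sloc
end

if __FILE__ == $PROGRAM_NAME
  source = "def test_add\nend\nit \"adds\" do\nend\n"
  puts count_test_cases(source)
  puts per_kloc(count_test_cases(source), 2000)
end
```

Line-based matching of this kind is cheap but approximate; for instance, it cannot tell a test definition from the same text inside a string literal.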
####Processing data with R
The statistical analysis is done with R. Generally, it suffices to do
cd pullreqs
R --no-save < R/packages.R # install required packages
Rscript R/one_of_the_scripts.R --help
The following scripts can be run with the procedure described above:
- R/dataset-stats.R: Various statistics and plots that require access to the GHTorrent MySQL database. Use command line arguments to configure the MySQL connection.
- R/pullreq-stats.R: Pull request descriptive statistics (analysis of the data files).
- R/run-merge-decision-classifiers.R: Cross validation runs for the pull request merge decision classifiers.
- R/run-mergetime-classifiers.R: Cross validation runs for the pull request merge time classifiers.
- R/var-importance.R: Generate the variable importance plots for choosing important features.
####Citation information
If you find this work interesting and want to use it in your work, please cite it as follows:
@inproceedings{GZ14,
author = {Gousios, Georgios and Zaidman, Andy},
title = {A Dataset for Pull-based Development Research},
booktitle = {Proceedings of the 11th Working Conference on Mining Software Repositories},
series = {MSR 2014},
year = {2014},
isbn = {978-1-4503-2863-0},
location = {Hyderabad, India},
pages = {368--371},
numpages = {4},
doi = {10.1145/2597073.2597122},
acmid = {2597122},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {distributed software development, empirical software engineering, pull request, pull-based development},
}