This repository contains the code and data for the SANER 2020 paper The Silent Helper: The Impact of Continuous Integration on Code Reviews (PDF). Joint work between Nathan Cassee (TU/e), Bogdan Vasilescu (CMU), and Alexander Serebrenik (TU/e).
There are several parts to this repository: data collection and pre-processing were done using Python, the analysis and plots were generated using Python Jupyter notebooks, and the statistical models were run using Jupyter notebooks in R.
Several directories contain an archive folder; these folders contain files that were not used in the final analysis.
For each of these steps this readme points you to the files of note.
The main file used to scrape information is the Python script scrape_project_from_github.py. As input this script requires a .csv containing the slugs of the repositories that should be mined, and a connection to a running GHTorrent instance. For output the script requires a connection to a MongoDB instance in which the scraped items are stored.
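As a minimal illustration of the expected input, a slug file and how it can be read might look like the following; the file name and the column header "slug" are assumptions for illustration, not taken from the script itself:

```python
# Hypothetical slugs.csv (one repository slug per line); the header name
# "slug" is an assumption, not copied from scrape_project_from_github.py:
#
#   slug
#   rails/rails
#   numpy/numpy
import csv

with open("slugs.csv", newline="") as fh:
    slugs = [row["slug"] for row in csv.DictReader(fh)]
print(slugs)  # ['rails/rails', 'numpy/numpy']
```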
For each slug in the input .csv the script first queries the GHTorrent instance to determine whether that project has more than 1,000 general pull-request comments, and after doing this for all projects it saves the intermediate results to a .json file.
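To illustrate the kind of check involved, a query against a GHTorrent MySQL mirror counting the general pull-request comments of one project could look roughly like this. The table and column names follow the public GHTorrent schema, but the joins, connection details, and helper name are assumptions, not the query used by the script:

```python
# Sketch only: count general pull-request comments for one slug against a
# GHTorrent MySQL mirror. Schema details and credentials are illustrative.
import pymysql

QUERY = """
SELECT COUNT(*)
FROM pull_request_comments prc
JOIN pull_requests pr ON prc.pull_request_id = pr.id
JOIN projects p       ON pr.base_repo_id = p.id
JOIN users u          ON p.owner_id = u.id
WHERE u.login = %s AND p.name = %s
"""

def comment_count(conn, slug):
    owner, name = slug.split("/")
    with conn.cursor() as cur:
        cur.execute(QUERY, (owner, name))
        return cur.fetchone()[0]

conn = pymysql.connect(host="localhost", user="ghtorrent",
                       password="...", database="ghtorrent")
print(comment_count(conn, "rails/rails") > 1000)
```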
For each GitHub project with more than 1,000 general comments the scrape loop is executed. We first check whether the project has at least one Travis build, using the Travis API (this requires a Travis API key to be present in travis_token.py). If the project has a Travis build we continue scraping: using PyGitHub we scrape all closed pull-requests, their associated comments, review comments and commits, and all closed issues for that project. Note: commits pushed directly to the repository (outside of pull requests) are not scraped.
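As a rough sketch of the PyGitHub calls involved (the token handling, threading, and error handling in the real script are more elaborate, and the slug and variable names here are illustrative only):

```python
# Sketch: scrape closed pull requests and issues for one project with PyGitHub.
# Token loading and MongoDB insertion from the real script are omitted;
# "GITHUB_TOKEN" is a placeholder.
from github import Github

gh = Github("GITHUB_TOKEN")
repo = gh.get_repo("rails/rails")  # slug taken from the input .csv

for pr in repo.get_pulls(state="closed"):
    comments = list(pr.get_issue_comments())    # general PR comments
    reviews  = list(pr.get_review_comments())   # inline review comments
    commits  = list(pr.get_commits())           # commits on the PR branch
    # ... store pr, comments, reviews and commits ...

for issue in repo.get_issues(state="closed"):
    pass  # ... store issue data ...
```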
The resulting data is then inserted into a MongoDB instance, where the data of a project is split over 4 collections because of MongoDB limitations. These 4 collections are projects, issues, pull_requests, and commits.
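A minimal sketch of how such a split insert could look with pymongo; the database name and document fields are assumptions, only the four collection names come from the repository:

```python
# Sketch: store one scraped project across the four collections.
# The database name "ci_reviews" and the document fields are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["ci_reviews"]

slug = "rails/rails"
db.projects.insert_one({"slug": slug, "scraped": True})
db.pull_requests.insert_many([{"slug": slug, "number": 1, "comments": []}])
db.issues.insert_many([{"slug": slug, "number": 2, "state": "closed"}])
db.commits.insert_many([{"slug": slug, "sha": "abc123"}])
```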
To efficiently scrape data from GitHub the scrape script uses several threads and cycles through a set of tokens defined in gh_tokens.py (more tokens == more speed).
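A minimal sketch of the token-cycling idea, assuming gh_tokens.py simply exposes a list of personal access tokens (the variable name TOKENS is an assumption, not necessarily what the script uses):

```python
# Sketch: rotate through several GitHub tokens to spread API rate limits.
# Assumes gh_tokens.py defines something like TOKENS = ["ghp_...", "ghp_..."].
from itertools import cycle
from github import Github

from gh_tokens import TOKENS

clients = cycle(Github(token) for token in TOKENS)

def next_client():
    """Return the next authenticated client in round-robin order."""
    return next(clients)
```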
When it comes to processing the scraped data, there are several Python scripts and notebook cells that take data from the MongoDB instance and augment it by adding fields.
- analysis/first_travis_build.ipynb: This notebook contains a set of cells that use heuristics (including the GitHub commit statuses) to find the oldest Travis build associated with a pull-request, and to determine whether Travis was the first CI service used by the project. The result is written to the MongoDB database by setting the fields status_travis_date (date) and travis_is_oldest_ci (bool); see the sketch after this list.
- find_effective_comments.py: This Python script uses the script analysis/effective_comments/find_effective.py to process all pull requests in the MongoDB instance and find effective review comments as defined by Bosu et al. This information is needed to run the RDD model that models the impact of Continuous Integration on effective comments in code reviews.
- analysis/share_of_comments.ipynb: This Jupyter notebook contains the cells used to export data for the actual time-series models. It generates generated/metrics_for_time_series.csv, which contains the aggregated time-series data used for the RDD models; several cells in the notebook generate this file. Other cells generate other files that might not be relevant for the analysis.
- analysis/time_series_models.ipnyb: R Jupyter notebook that contains the actual RDD models. For each model there is a cell that builds the model and outputs the model information.
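As referenced in the first item above, a minimal sketch of writing those two fields back to MongoDB with pymongo; only the two field names come from the notebook description, while the database and collection names, the filter, and the values are assumptions for illustration:

```python
# Sketch: annotate a pull-request document with the date of its oldest
# Travis build and whether Travis was the project's first CI service.
# Database/collection names and the filter are illustrative; only the field
# names status_travis_date and travis_is_oldest_ci are from the readme.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["ci_reviews"]

db.pull_requests.update_one(
    {"slug": "rails/rails", "number": 1},
    {"$set": {
        "status_travis_date": datetime(2015, 3, 1),
        "travis_is_oldest_ci": True,
    }},
)
```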
Please use the following BibTeX snippet to cite this work:
@inproceedings{DBLP:conf/wcre/CasseeVS20,
author = {Nathan Cassee and
Bogdan Vasilescu and
Alexander Serebrenik},
editor = {Kostas Kontogiannis and
Foutse Khomh and
Alexander Chatzigeorgiou and
Marios{-}Eleftherios Fokaefs and
Minghui Zhou},
title = {The Silent Helper: The Impact of Continuous Integration on Code Reviews},
booktitle = {27th {IEEE} International Conference on Software Analysis, Evolution
and Reengineering, {SANER} 2020, London, ON, Canada, February 18-21,
2020},
pages = {423--434},
publisher = {{IEEE}},
year = {2020},
url = {https://doi.org/10.1109/SANER48275.2020.9054818},
doi = {10.1109/SANER48275.2020.9054818},
timestamp = {Thu, 16 Apr 2020 16:52:52 +0200},
biburl = {https://dblp.org/rec/conf/wcre/CasseeVS20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}