A utility to search and fetch code from GitHub. This tool was built to make it easy to create datasets for repository analysis.
The tool works in two phases: search finds repositories using the GitHub API and saves the result in a JSON file; download fetches all the repositories listed in that JSON file.
This tool can be installed by running
pip install bigcode-fetcher
or by fetching this repository and running
pip install .
in this directory.
By default, the utility searches for repositories fulfilling the following conditions:
- size between 1M and 100M
- stars count > 10
- non-viral license (MIT, Apache-2.0, MPL-2.0, BSD-2-Clause, BSD-3-Clause, BSD-4-Clause, MS-PL)
It retrieves the first 100 projects, ordered by number of stars.
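The default conditions above can be sketched as a local filter over repository metadata. This is only an illustration of the criteria, not the tool's own query logic: the function names are hypothetical, the field names (size, stargazers_count, license.spdx_id) follow GitHub REST API repository objects, and "1M to 100M" is assumed to mean megabytes, expressed here in kilobytes because that is the unit the API reports.

```python
# Illustrative only: the tool itself filters through the GitHub search API.
ALLOWED_LICENSES = {
    "MIT", "Apache-2.0", "MPL-2.0",
    "BSD-2-Clause", "BSD-3-Clause", "BSD-4-Clause", "MS-PL",
}


def matches_defaults(repo):
    """Return True if a repository dict satisfies the default conditions."""
    license_info = repo.get("license") or {}  # the API may report no license
    return (
        1_000 <= repo.get("size", 0) <= 100_000  # assumed 1M..100M, in KB
        and repo.get("stargazers_count", 0) > 10
        and license_info.get("spdx_id") in ALLOWED_LICENSES
    )


def select_defaults(repos, max_repos=100):
    """Keep matching repositories, most-starred first, capped at max_repos."""
    kept = [r for r in repos if matches_defaults(r)]
    kept.sort(key=lambda r: r["stargazers_count"], reverse=True)
    return kept[:max_repos]
```

Applying the filter locally like this is mostly useful for re-checking an existing result set against stricter or looser thresholds without querying the API again.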
To avoid API rate limiting, an access token can be provided either with the --token CLI argument or with the GITHUB_TOKEN environment variable.
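As a rough sketch of how the two token sources could be combined, the snippet below picks a token and builds request headers for the GitHub API. The function names are hypothetical, and the precedence (CLI argument first, then GITHUB_TOKEN) is an assumption; the text above does not say which one wins when both are set.

```python
import os


def resolve_token(cli_token=None, environ=None):
    """Pick the access token: CLI value first (assumed), then GITHUB_TOKEN."""
    environ = os.environ if environ is None else environ
    return cli_token or environ.get("GITHUB_TOKEN") or None


def auth_headers(token):
    """Build headers for a GitHub API request, authenticated if a token is set."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Without a token the request is simply sent unauthenticated, which is where the stricter rate limits apply.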
See the help for all the options:
bigcode-fetcher search -h
Search for all Apache Commons projects written in Java:
mkdir -p apache-common-projects
bigcode-fetcher search --language Java --user apache --stars '>0' --keyword commons --max-repos 500 -o apache-common-projects/apache-commons.json
This command simply runs git clone on every repository listed in the JSON file generated by the search command.
To reduce the download size, only the latest revision is fetched by default (i.e. git clone --depth 1). This can be disabled by passing the --full flag.
USERNAME/REPO will be fetched into OUTPUT_DIR/USERNAME/REPO, where OUTPUT_DIR is set by the --output option.
The command skips any project whose directory already exists, so running it multiple times is safe. Re-running is also recommended, to make sure all repositories have been fetched.
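The download behaviour described above (target path, shallow clone by default, skip when the directory exists) can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: plan_clone is a hypothetical helper, and the https://github.com/USERNAME/REPO.git clone URL is assumed.

```python
import os


def plan_clone(full_name, output_dir, full=False):
    """Return the git command for USERNAME/REPO, or None if already fetched."""
    # Target layout: OUTPUT_DIR/USERNAME/REPO
    target = os.path.join(output_dir, *full_name.split("/"))
    if os.path.isdir(target):
        return None  # directory exists: skip, so re-runs are safe

    cmd = ["git", "clone"]
    if not full:
        cmd += ["--depth", "1"]  # default: latest revision only
    cmd += [f"https://github.com/{full_name}.git", target]  # assumed URL form
    return cmd
```

The returned list can be passed to subprocess.run. Note that skipping on directory existence means a clone that was interrupted part-way may be left in place, so a directory that looks incomplete may need to be removed by hand before re-running.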
See the help for more information:
bigcode-fetcher download -h
Download all the Apache Commons projects found by the search above:
mkdir -p apache-common-projects/repositories
bigcode-fetcher download -i apache-common-projects/apache-commons.json -o apache-common-projects/repositories