sr-detection


The goal of the study is to create a model that, by looking at the README file and meta-information, can identify GitHub "sample repositories" (SR): repositories that mostly contain educational or demonstration materials intended to be copied rather than reused as a dependency.

Motivation. During our work on the CaM project, we needed to filter out repositories with samples. No readily available technique or tool could perform that function, so we conducted research on this very subject.

The repository is structured as follows:

  • sr-data, a module with a set of tasks that filter collected metadata about GitHub repositories.
  • sr-train, a module for training ML models.
  • sr-detector, a trained, reusable model for SR detection.
  • sr-paper, the LaTeX source of a paper on SR detection.

Experiment steps

Metadata collection

We collect two kinds of metadata for each GitHub repository: numerical and textual. The numerical metadata are:

  • releases, the number of releases.
  • pulls, the number of pull requests.
  • issues, the total number of issues (open + closed).
  • branches, the number of branches.
  • actions, the number of GitHub Actions workflow files.

The textual metadata is the README.md file.
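
The actual collection tasks live in the sr-data module; below is a minimal sketch of what collecting one repository's metadata could look like, assuming the PyGithub package (the library choice, function name, and token handling are assumptions, not the project's code). Note that GitHub's issues endpoint also counts pull requests, so the real pipeline may adjust for that.

# A sketch of per-repository metadata collection with PyGithub.
from github import Github

def collect(full_name: str, token: str) -> dict:
    repo = Github(token).get_repo(full_name)
    try:
        workflows = repo.get_contents(".github/workflows")
    except Exception:
        workflows = []  # no workflow directory in this repository
    return {
        "repo": full_name,
        "releases": repo.get_releases().totalCount,
        "pulls": repo.get_pulls(state="all").totalCount,
        "issues": repo.get_issues(state="all").totalCount,
        "branches": repo.get_branches().totalCount,
        "actions": len(workflows),
        "readme": repo.get_readme().decoded_content.decode("utf-8"),
    }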

To run this:

just collect

You should expect to have sr-data/experiment/repos.csv with the collected repositories and their metadata.

To collect a smaller set of repositories, you can run this:

just test-collect

You should expect to have sr-data/tmp/test-repos.csv, with the same structure as repos.csv, but smaller.

Filter

We filter the collected repositories. We remove repositories with an empty README file. Then we convert each README file to plain text and detect which languages it is written in. We remove repositories whose README file is not fully written in English.
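
A minimal sketch of this filter, assuming the langdetect package for language detection (the real pipeline may use a different detector and markdown-to-text converter):

# A sketch of the README language filter with langdetect.
import re
from langdetect import detect_langs

def plain_text(markdown: str) -> str:
    # Naive markdown stripping: drop fenced code, unwrap links, remove markup.
    text = re.sub(r"```.*?```", " ", markdown, flags=re.DOTALL)
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    return re.sub(r"[#*_`>]", " ", text)

def keep(readme: str) -> bool:
    text = plain_text(readme).strip()
    if not text:
        return False  # empty README
    langs = detect_langs(text)
    # Approximate "fully English": dominant language is English
    # with high probability.
    return langs[0].lang == "en" and langs[0].prob > 0.99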

To run this:

just filter repos.csv

You should expect to have sr-data/experiment/after-filter.csv.

Extract headings

From each README file we extract all of its headings (text after #). We remove English stop words from each heading. Then we apply lemmatization to each word, filter words with the ^[a-zA-Z]+$ regex, and calculate the up to 5 most common words across the README headings.

For instance, this README:

# Building web applications in Java with Spring Boot 3
...

## Agenda
...

## Who am I?
...

## Prerequisites
...

## Outcomes
...

## What is Spring?
...

## Resources
...

### Dan Vega
...

### Spring
... 

### Documentation
...

### Books
...

### Podcasts
...

### YouTube
...

Will be transformed to:

['spring', 'build', 'web', 'application', 'java']
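
A sketch of that transformation, assuming NLTK for stop words and lemmatization (the project's tokenization and part-of-speech handling may differ, e.g. in how "building" becomes "build"):

# A sketch of heading extraction; requires nltk.download("stopwords")
# and nltk.download("wordnet") beforehand.
import re
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def top_words(readme: str, limit: int = 5) -> list:
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    words = []
    for line in readme.splitlines():
        if line.lstrip().startswith("#"):  # a markdown heading
            for word in line.lstrip("# ").lower().split():
                if word not in stop and re.fullmatch(r"[a-zA-Z]+", word):
                    words.append(lemmatizer.lemmatize(word))
    # Up to `limit` most common words across all headings.
    return [word for word, _ in Counter(words).most_common(limit)]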

To run this:

just extract after-filter.csv

You should expect to have sr-data/experiment/after-extract.csv.

Generate embeddings

For each repo, we aggregate all top words from the README headings into a single string. We then convert each string into three variants of embeddings: S-BERT, E5, and Embedv3.
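
For instance, the S-BERT variant could be produced with the sentence-transformers package; all-MiniLM-L6-v2 is one model that yields 384-dimensional vectors, matching embeddings-s-bert-384.csv (the exact models used by sr-data are assumptions here):

# A sketch of the S-BERT embedding step with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# One aggregated string of top heading words per repository.
strings = ["spring build web application java"]
vectors = model.encode(strings)  # numpy array of shape (1, 384)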

To run this:

just embed after-extract.csv

You should expect to have three files:

  • sr-data/experiment/embeddings-s-bert-384.csv
  • sr-data/experiment/embeddings-e5-1024.csv
  • sr-data/experiment/embeddings-embedv3-1024.csv

Create datasets

We calculate an SR-score from the numerical metadata of each repository, and create seven datasets from the prepared data:

  • scores.csv, dataset with SR-scores;
  • sbert.csv, dataset from S-BERT-384 embeddings;
  • e5.csv, dataset from E5-1024 embeddings;
  • embedv3.csv, dataset from Embedv3-1024 embeddings;
  • scores+sbert.csv, combination of SR-scores and S-BERT-384 embeddings;
  • scores+e5.csv, combination of SR-scores and E5-1024 embeddings;
  • scores+embedv3.csv, combination of SR-scores and Embedv3-1024 embeddings.
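
The SR-score formula itself is defined in the sr-data tasks and is not reproduced here; the combined datasets can be pictured as a plain column-wise join, sketched below with pandas (the file locations and the repo column name are assumptions):

# A sketch of combining SR-scores with embeddings into one dataset.
import pandas as pd

scores = pd.read_csv("sr-data/experiment/scores.csv")  # repo + SR-score
sbert = pd.read_csv("sr-data/experiment/sbert.csv")    # repo + 384 dims
combined = scores.merge(sbert, on="repo")              # scores+sbert
combined.to_csv("sr-data/experiment/scores+sbert.csv", index=False)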

To run this:

just datasets

You should expect to have all seven files in sr-data/experiment directory.

Cluster

We apply clustering to the previously created datasets. We use the following algorithms:

  • K-Means
  • Agglomerative Clustering
  • DBSCAN
  • Gaussian Mixture Models (GMM)

Each algorithm generates a set of clusters for each dataset.
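
A sketch of one clustering pass with scikit-learn (the library and cluster counts are assumptions, and the dataset layout is assumed to be a repository column followed by feature columns):

# A sketch of clustering one dataset with all four algorithms.
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

frame = pd.read_csv("sr-data/experiment/sbert.csv")
features = frame.iloc[:, 1:].to_numpy()  # drop the repository column
labels = {
    "kmeans": KMeans(n_clusters=2).fit_predict(features),
    "agglomerative": AgglomerativeClustering(n_clusters=2).fit_predict(features),
    "dbscan": DBSCAN().fit_predict(features),
    "gmm": GaussianMixture(n_components=2).fit_predict(features),
}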

To run this:

just cluster

You should expect to have the following directories inside the experiment directory:

  • kmeans
  • agglomerative
  • dbscan
  • gmm

Each directory has subdirectories named after the datasets: e5, embedv3, scores+sbert, etc. In each subdirectory you should have a clusters directory with files containing the clustered repositories. Each file, for instance 0.txt, where 0 is the cluster identifier, hosts a list of repositories in OWNER/REPO format, separated by newlines:

Faceplugin-ltd/FaceRecognition-LivenessDetection-Android
LxxxSec/CTF-Java-Gadget
flutter-youni/flutter_youni_gromore
ax1sX/RouteCheck-Alpha
darksolopic/PasswordManagerGUI
borjavb/bq-lineage-tool
...

How to contribute

Make sure that you have Python 3.10+, just, and npm installed on your system, then fork this repository, make your changes, and send us a pull request. We will review your changes and apply them to the master branch shortly, provided they don't violate our quality standards. To avoid frustration, please run the full build before sending us your pull request:

just full
