Find out what kind of resource a CR linked to Throughput is: educational, misc, software dev, etc

throughput-ec/repo_classifier

lifecycle NSF-1928366

Repository Classifier

Pipeline to find out what kind of resource a Code Repository linked to Throughput is:

  • educational
  • miscellaneous
  • software development
  • storage

Contributors

This is an open project, and contributions are welcome from any individual. All contributors to this project are bound by a code of conduct. Please review and follow this code of conduct as part of your contribution.

Tips for Contributing

Issues and bug reports are always welcome. Code clean-up and feature additions can be done through branches.

All products of the Throughput Annotation Project are licensed under an MIT License unless otherwise noted.

How to use this repository

For now, the repository contains three notebooks that outline the process.

The first notebook retrieves the repositories' README files. It uses Neo4j to identify the repositories in Throughput and then queries GitHub's API with a developer key.

README content is returned base64-encoded, so it must be decoded before our NLP procedures can use it.
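The retrieval and decoding step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, and it assumes the GitHub REST endpoint `GET /repos/{owner}/{repo}/readme`, whose JSON response carries the file in a base64 `content` field.

```python
import base64
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_readme_request(owner: str, repo: str, token: str) -> urllib.request.Request:
    """Build an authenticated request for a repository's README (hypothetical helper)."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/readme"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, headers=headers)


def decode_readme(payload: dict) -> str:
    """Decode the base64 `content` field of a README API response."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # b64decode skips the line breaks that GitHub inserts into the content.
    raw = base64.b64decode(payload["content"])
    # Replace undecodable bytes rather than failing on a malformed README.
    return raw.decode("utf-8", errors="replace")


def fetch_readme(owner: str, repo: str, token: str) -> str:
    """Fetch and decode a README. This performs a network call."""
    request = build_readme_request(owner, repo, token)
    with urllib.request.urlopen(request) as response:
        return decode_readme(json.load(response))
```

Keeping `decode_readme` separate from the network call makes the decoding easy to check offline with a hand-built payload.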

Workflow Overview

This project takes its input from the Throughput Graph Database in Neo4j:

  • neotoma: TSV file

These files are the input used to help build a recommender system, which will:

  • Predict whether a code repository is educational, miscellaneous, etc.
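Before a classifier can predict a class, each README has to be turned into numeric features. The sketch below shows one common option, a bag-of-words count vector. It is an assumption for illustration only: the tokenization rule, the function names, and the vocabulary are hypothetical and are not taken from the notebooks.

```python
import re
from collections import Counter
from typing import List

# Assumption: tokens are runs of two or more ASCII letters, lower-cased.
TOKEN_RE = re.compile(r"[a-z]{2,}")


def tokenize(text: str) -> List[str]:
    """Split README text into lower-case word tokens."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that never appear.
    return [counts[word] for word in vocabulary]


# Illustrative vocabulary, not derived from the project's data.
vocabulary = ["install", "course"]
features = bag_of_words("Install the package. Install it again.", vocabulary)
```

The resulting vectors, paired with the class labels, are the kind of input a standard text classifier would train on.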

System Requirements

This project is developed using Python and Neo4j.
Neo4j must be installed. The project has been run on macOS. Continuous integration uses Travis CI.

Data Requirements

The project pulls data from the Throughput database and requires a GitHub API secret. Labels are currently provided by Morgan Wofford, but it should be possible to obtain them from annotations in the Throughput DB in the near future.

Key Outputs

This project will generate a structured dataset that provides the following information:

  • Whether a CR belongs to a certain class.

Currently, the model's accuracy is very poor because there is little labeled data. The best test performance so far is 66%, and the model is still overfitting badly.
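One simple way to see overfitting is to compare accuracy on the training data with accuracy on held-out test data. The sketch below is illustrative only: the helper names are hypothetical and the labels are made-up examples, not the project's results.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def overfitting_gap(train_accuracy: float, test_accuracy: float) -> float:
    """Train accuracy minus test accuracy; a large positive gap suggests overfitting."""
    return train_accuracy - test_accuracy


# Made-up labels for illustration only.
train_acc = accuracy(["educational", "storage"], ["educational", "storage"])
test_acc = accuracy(["educational", "storage"], ["educational", "miscellaneous"])
gap = overfitting_gap(train_acc, test_acc)
```

A model that scores far higher on the data it was trained on than on unseen data has memorized the training set rather than learned patterns that generalize.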

Pipeline

TODO: include workflow chart

Instructions

View the notebooks in the following order:
