The CASICS Collector is a repository crawler and scraper that extracts data about projects and stores it in the CASICS (Comprehensive and Automated Software Inventory Creation System) database.
Authors: Michael Hucka
Repository: https://github.com/casics/collector
License: Unless otherwise noted, this content is licensed under the GPLv3 license.
CASICS (the Comprehensive and Automated Software Inventory Creation System) is a proof-of-concept project that uses machine learning techniques to analyze source code in software repositories and classify those repositories. As part of this project, we need to obtain data about software project repositories on GitHub and (eventually) other hosting systems such as SourceForge. This module (the CASICS Collector) is designed to gather that data.
The Collector module queries hosting services via APIs (and for some purposes, also scrapes project web pages) and writes the data to the CASICS Database. It is designed as a separate module so that one or more instances can be started and run simultaneously. It does not download copies of repository files; that task is left to a separate module, the CASICS Downloader.
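To make the query-and-store flow described above concrete, here is a minimal sketch in Python. It is not the Collector's actual code: the function names (`fetch_github_page`, `to_record`, `collect`), the record fields, and the use of a plain dictionary in place of the CASICS Database are all assumptions made for illustration. The only external detail relied on is GitHub's public REST endpoint for listing public repositories, which pages by repository id through the `since` parameter.

```python
import json
import urllib.request

# GitHub's REST endpoint for listing public repositories, paged by id.
GITHUB_API = "https://api.github.com/repositories"


def fetch_github_page(since):
    """Fetch one page of public repositories with id greater than `since`.

    Hypothetical helper: unauthenticated requests are heavily rate limited,
    so a real crawler would add credentials and retry handling.
    """
    request = urllib.request.Request(
        f"{GITHUB_API}?since={since}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def to_record(repo):
    """Reduce an API result to the fields this sketch stores (assumed schema)."""
    return {
        "id": repo["id"],
        "name": repo["full_name"],
        "description": repo.get("description") or "",
        "fork": bool(repo.get("fork", False)),
    }


def collect(fetch_page, store, since=0, max_pages=1):
    """Pull up to `max_pages` pages and write each record into `store`.

    `store` is any dict-like object standing in for the database.
    Returns the id of the last repository seen, so a later run can resume.
    """
    for _ in range(max_pages):
        page = fetch_page(since)
        if not page:
            break  # no more repositories to list
        for repo in page:
            record = to_record(repo)
            store[record["id"]] = record
        since = page[-1]["id"]
    return since
```

Calling `collect(fetch_github_page, {}, since=0, max_pages=1)` would fetch one page from GitHub into an in-memory dictionary. Because the fetch function and the store are passed in as arguments, separate instances could in principle be started from different `since` values, and the loop can be exercised without network access by supplying a stub fetcher.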
The CASICS Collector is written in Python.
If you find an issue, please submit it in the GitHub issue tracker for this repository.
Much remains to be done on CASICS in many areas, and we would be happy to receive your help and participation. Please feel free to contact the developers either via GitHub or the mailing list casics-team@googlegroups.com.
Everyone is asked to read and respect the code of conduct when participating in this project.
This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.