Open Source Vulnerability Metrics

Description:

This project mines the Software Heritage Graph Dataset (SWHGD), a very large dataset containing the development history of publicly available software. The dataset links together source code files organized in directories, commits tracking their evolution over time, and full snapshots of version control system (VCS) repositories.

The dataset is available in full (1.2 TB) and as two "teaser" datasets: popular-4k (27 GB) and popular-3k-python (5.3 GB).
- The popular-4k teaser contains a subset of 4000 popular repositories from GitHub, GitLab, PyPI and Debian.
- The popular-3k-python teaser contains a subset of 3052 popular repositories tagged as being written in the Python language, from GitHub, GitLab, PyPI and Debian.

Our main goal is to cross-reference known software vulnerabilities in the National Vulnerability Database (NVD) with commits in the Software Heritage Graph Dataset. In doing so, we hope to answer the following questions:

How long does it take from the discovery of a software bug until it is patched?
How long does it take from a fix until a new software release?
Is there a relationship between project activity and vulnerability severity?
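
One simple way to start cross-referencing commits with the NVD is to look for CVE identifiers written directly in commit messages. The sketch below is only an illustration under that assumption: the commit IDs and messages are made up, and `find_cve_ids` is a hypothetical helper, not code from this repository.

```python
import re

# CVE identifiers have the form CVE-YYYY-NNNN, with four or more digits at the end.
CVE_PATTERN = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)


def find_cve_ids(message):
    """Return the distinct CVE IDs named in a commit message, upper-cased,
    in order of first appearance."""
    seen = []
    for match in CVE_PATTERN.finditer(message):
        cve_id = f"CVE-{match.group(1)}-{match.group(2)}"
        if cve_id not in seen:
            seen.append(cve_id)
    return seen


# Made-up commit IDs and messages, for illustration only.
commits = {
    "a1b2c3": "Fix buffer overflow (CVE-2019-12345)",
    "d4e5f6": "Bump version to 1.2.0",
    "0718ab": "Backport fixes for cve-2018-1000001 and CVE-2019-12345",
}

# Map each CVE ID to the commits whose messages mention it.
cve_to_commits = {}
for commit_id, message in commits.items():
    for cve_id in find_cve_ids(message):
        cve_to_commits.setdefault(cve_id, []).append(commit_id)
```

A message that mentions a CVE does not necessarily fix it, so matches found this way would still need manual or rule-based review before being joined to NVD records.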

Team Abracadata

Instructor: Dr. Somya Mohanty
Mentor: Dr. Steven Tate

Members and Tasks:

Seth Goodwin
- cleaning and understanding revision.csv from the popular-3k-python dataset (Notebook)
  - determine whether the data contains irregularities or anomalies
  - search commit messages for commits that fix CVEs
  - link the IDs of CVE-fixing commits to NVD entries
- determine how long it takes the NVD to report on software vulnerabilities (Notebook)
  - how responsive is the NVD?
  - basic statistical analysis (mean, standard deviation, etc.) of data
  - distribution modeling
  - hypothesis testing
- building a machine learning model to predict software vulnerability base scores (Notebook)
  - train a multiple linear regression model on CVSS v3 metrics (70/30 training/testing split)
  - evaluate model accuracy using root mean squared error (RMSE) and R-squared
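
The regression workflow in the last task (70/30 split, multiple linear regression, RMSE and R-squared) can be sketched with the standard library alone. This is not the notebook's code: the three features are synthetic stand-ins for numerically encoded CVSS v3 metrics, the target is generated from made-up coefficients, and the fit uses plain normal equations rather than whatever library the notebook relies on.

```python
import math
import random


def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, X):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in X]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic data (assumption): a linear target plus small Gaussian noise.
rng = random.Random(0)
TRUE_COEFS = [1.0, 2.0, 0.5, -1.5]  # made-up intercept and slopes
X = [[rng.random() for _ in range(3)] for _ in range(100)]
y = [
    TRUE_COEFS[0] + sum(c * v for c, v in zip(TRUE_COEFS[1:], row)) + rng.gauss(0, 0.1)
    for row in X
]

# 70/30 training/testing split on shuffled indices.
indices = list(range(len(X)))
rng.shuffle(indices)
cut = len(indices) * 7 // 10
train, test = indices[:cut], indices[cut:]

coefs = fit_linear([X[i] for i in train], [y[i] for i in train])
y_test = [y[i] for i in test]
y_pred = predict(coefs, [X[i] for i in test])
test_rmse = rmse(y_test, y_pred)
test_r2 = r_squared(y_test, y_pred)
```

Both metrics are computed on the held-out 30% only, so they describe how the fitted model generalizes rather than how closely it reproduces the training rows.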
Michael Follari
- exploring and cleaning release.csv from the popular-3k-python dataset
  - explore time dependence between release.csv and the NVD
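
One time-dependence measure between releases and vulnerability fixes is the delay from a fix to the next release. The sketch below assumes release and fix timestamps have already been parsed into `datetime` objects; the dates are made up, and `days_until_next_release` is a hypothetical helper rather than code from this repository.

```python
from bisect import bisect_left
from datetime import datetime


def days_until_next_release(fix_time, release_times):
    """Days from a fix to the first release at or after it.

    release_times must be sorted in ascending order. Returns None when no
    release follows the fix.
    """
    i = bisect_left(release_times, fix_time)
    if i == len(release_times):
        return None
    return (release_times[i] - fix_time).total_seconds() / 86400


# Made-up release dates for a single project, sorted ascending.
releases = [
    datetime(2019, 1, 10),
    datetime(2019, 3, 1),
    datetime(2019, 6, 15),
]

delay = days_until_next_release(datetime(2019, 2, 19), releases)
no_release_yet = days_until_next_release(datetime(2019, 7, 1), releases)
```

Binary search keeps each lookup logarithmic in the number of releases, which matters when the same question is asked for many fixes per project.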
Jaron Dunham
- parsing NVD JSON files into CSV files
  - extracting the nested information within NVD
- comparing frequency of commits (SWHGD) to base score (NVD)
  - checked whether increases or decreases in base score were associated with changes in commit frequency
- attempting to predict base score using variables from CVSS v2 metrics
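
Flattening the nested NVD records into CSV rows can be sketched as follows. The field names assume the layout of the legacy NVD JSON 1.1 data feeds (`CVE_Items`, `impact.baseMetricV3.cvssV3`, and so on) and should be checked against the files actually downloaded; the sample records and the `dig` helper are made up for illustration.

```python
import csv
import io
import json

# Made-up sample in the assumed NVD JSON 1.1 feed layout.
SAMPLE_FEED = """
{
  "CVE_Items": [
    {
      "cve": {"CVE_data_meta": {"ID": "CVE-0000-0001"}},
      "impact": {
        "baseMetricV2": {"cvssV2": {"baseScore": 5.0}},
        "baseMetricV3": {"cvssV3": {"baseScore": 7.5, "attackVector": "NETWORK"}}
      },
      "publishedDate": "2019-01-01T00:00Z"
    },
    {
      "cve": {"CVE_data_meta": {"ID": "CVE-0000-0002"}},
      "impact": {},
      "publishedDate": "2019-02-01T00:00Z"
    }
  ]
}
"""


def dig(obj, *keys):
    """Follow nested dictionary keys, returning None if any level is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def flatten_item(item):
    """Turn one nested CVE record into a flat dictionary (one CSV row)."""
    return {
        "cve_id": dig(item, "cve", "CVE_data_meta", "ID"),
        "published": dig(item, "publishedDate"),
        "v2_base_score": dig(item, "impact", "baseMetricV2", "cvssV2", "baseScore"),
        "v3_base_score": dig(item, "impact", "baseMetricV3", "cvssV3", "baseScore"),
        "v3_attack_vector": dig(item, "impact", "baseMetricV3", "cvssV3", "attackVector"),
    }


rows = [flatten_item(item) for item in json.loads(SAMPLE_FEED)["CVE_Items"]]

# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Records without a given score (the second sample item has no `impact` data) produce empty CSV cells instead of raising an error, which keeps unscored entries in the output.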
Gabe Wilmoth
- Trying to connect NVD with SWHGD
- Look into Relationship between Releases and New Vulnerabilities
- Predict the count of releases from the count and dates of revisions
  - Given the number of revisions per week, can we accurately predict the number of releases in the following week(s)?
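
Before any prediction, the revision and release timestamps need to be bucketed by week and lined up with a lag. The sketch below shows one way to build those (revisions this week, releases a later week) pairs; the dates are made up, weeks are assumed to start on Monday, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d):
    """Monday of the week containing d."""
    return d - timedelta(days=d.weekday())


def weekly_counts(dates):
    """Count events per week, keyed by the Monday that starts each week."""
    return Counter(week_start(d) for d in dates)


def lagged_pairs(revision_dates, release_dates, lag_weeks=1):
    """For each week with revisions, pair its revision count with the
    release count lag_weeks later."""
    revisions = weekly_counts(revision_dates)
    releases = weekly_counts(release_dates)
    return [
        (week, revisions[week], releases[week + timedelta(weeks=lag_weeks)])
        for week in sorted(revisions)
    ]


# Made-up dates for illustration.
revision_dates = [date(2019, 1, 7), date(2019, 1, 8), date(2019, 1, 10), date(2019, 1, 15)]
release_dates = [date(2019, 1, 16), date(2019, 1, 17)]

pairs = lagged_pairs(revision_dates, release_dates)
```

Each tuple is one training example for a model that predicts next-week releases from this-week revisions; changing `lag_weeks` covers the "following week(s)" variants of the question.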
Rohit Gade
- extracting GitHub hash id from links found on the NVD
- Create a dataset from the NVD and SWHGD to analyze how long it takes from the discovery of a software bug until it is patched.
- Develop a model to predict the patch duration.
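
Extracting the commit hash from a GitHub link and turning two timestamps into a patch duration can be sketched as below. The URL pattern assumes the usual `github.com/<owner>/<repo>/commit/<hash>` form, the example URLs and dates are made up, and which NVD timestamp counts as "discovered" is left to the caller.

```python
import re
from datetime import datetime

# Matches GitHub commit URLs with a 7 to 40 character hexadecimal hash.
COMMIT_URL = re.compile(
    r"https?://github\.com/([^/\s]+)/([^/\s]+)/commit/([0-9a-fA-F]{7,40})"
)


def parse_commit_url(url):
    """Return (owner, repo, commit_hash) for a GitHub commit URL, else None."""
    match = COMMIT_URL.search(url)
    if match is None:
        return None
    owner, repo, sha = match.groups()
    return owner, repo, sha.lower()


def patch_duration_days(discovered, patched):
    """Whole days between the chosen discovery timestamp and the patch commit."""
    return (patched - discovered).days


# Made-up reference links of the kind listed on an NVD entry.
commit_ref = parse_commit_url(
    "https://github.com/example-org/example-repo/commit/"
    "0123456789abcdef0123456789abcdef01234567"
)
issue_ref = parse_commit_url("https://github.com/example-org/example-repo/issues/12")

duration = patch_duration_days(datetime(2019, 1, 1), datetime(2019, 1, 15))
```

The extracted hash is the join key against revision IDs in the SWHGD; a negative duration would indicate that the patch commit predates the chosen discovery timestamp.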

UNCG-CSE/Open_Source_Vul_Metrics