Motivation: Gain insight into how open source software vulnerabilities arise, how we may be able to predict or prevent vulnerabilities, and general insight into open source software as a whole. Software Heritage Graph Dataset (SHGD): Largest existing public archive of software source code and history. National Vulnerability Database (NVD): Infor…

UNCG-CSE/Open_Source_Vul_Metrics

Open Source Vulnerability Metrics

Description:

This project mines the Software Heritage Graph Dataset, a very large dataset containing the development history of publicly available software. The dataset links together source code files organized in directories, commits tracking their evolution over time, and full snapshots of version control system (VCS) repositories.

  • The dataset is available in full at 1.2 TB and as two "teaser" datasets: popular-4k at 27 GB and popular-3k-python at 5.3 GB.
    • The popular-4k teaser contains a subset of 4000 popular repositories from GitHub, GitLab, PyPI and Debian.
    • The popular-3k-python teaser contains a subset of 3052 popular repositories tagged as being written in the Python language, from GitHub, GitLab, PyPI and Debian.

Our main goal is to cross-reference known software vulnerabilities listed in the National Vulnerability Database (NVD) with commits found in the Software Heritage Graph Dataset. In doing so, we hope to answer some of the following questions:

  • How much time passes between the discovery of a software bug and its patch?
  • How much time passes between a fix and a new software release?
  • Is there a relationship between project activity and vulnerability severity?
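
One way to start the cross-referencing described above is to look for CVE identifiers in commit messages. The following is a minimal sketch, not the project's actual code: it assumes commits are available as `(commit_id, message)` pairs, and the function names `find_cve_ids` and `link_commits_to_cves` are hypothetical. The column layout of revision.csv is not assumed here.

```python
import re

# CVE identifiers have the form CVE-<4-digit year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)


def find_cve_ids(message):
    """Return the sorted, de-duplicated CVE IDs mentioned in a commit message."""
    # Rebuild each ID in upper case so "cve-2017-5638" and "CVE-2017-5638" match.
    return sorted({f"CVE-{year}-{num}" for year, num in CVE_PATTERN.findall(message)})


def link_commits_to_cves(commits):
    """Map each CVE ID to the commit IDs whose message mentions it.

    `commits` is an assumed iterable of (commit_id, message) pairs.
    """
    links = {}
    for commit_id, message in commits:
        for cve_id in find_cve_ids(message):
            links.setdefault(cve_id, []).append(commit_id)
    return links


if __name__ == "__main__":
    sample = [
        ("a1", "Fix buffer overflow (CVE-2018-1000001)"),
        ("b2", "Update docs"),
        ("c3", "Backport fix for cve-2018-1000001 and CVE-2017-5638"),
    ]
    for cve_id, commit_ids in sorted(link_commits_to_cves(sample).items()):
        print(cve_id, commit_ids)
```

A commit that mentions a CVE does not necessarily fix it, so the resulting links would still need to be checked against the NVD entries before measuring any time differences.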

Team Abracadata

Instructor: Dr. Somya Mohanty
Mentor: Dr. Steven Tate

Members and Tasks:

  • Seth Goodwin
    • cleaning and understanding revision.csv, from the popular-3k-python dataset (Notebook)
      • determine whether there are any irregularities/anomalies in the data
      • looking into commit messages, trying to find commits that fix CVEs
      • link commit ids (the ones that fix CVEs) with the NVD
    • determine how long it takes the NVD to report on software vulnerabilities (Notebook)
      • how responsive is the NVD?
      • basic statistical analysis (mean, standard deviation, etc.) of data
      • distribution modeling
      • hypothesis testing
    • building a machine learning model to predict software vulnerability base scores (Notebook)
      • using CVSS v3 metrics, train a multiple linear regression model (70/30 training/testing split)
      • evaluate model accuracy using root mean squared error and R-squared value
  • Michael Follari
    • looking through and cleaning release.csv, from the popular-3k-python dataset
      • explore the time dependence between release.csv and the NVD
  • Jaron Dunham
    • parsing NVD JSON files into CSV files
      • extracting the nested information within NVD
    • Comparing Frequency of Commits (SHGD) to Base Score (NVD)
      • attempted to see whether a rise or fall in the base score affected the frequency of commits
    • Attempting to predict Base Score using variables from CVSS v2 metrics
  • Gabe Wilmoth
    • Trying to connect the NVD with the SHGD
    • Look into the Relationship between Releases and New Vulnerabilities
    • Predict the count of Releases from the count of Revisions and the Date of Revisions
      • Looking at the number of Revisions per week, can we accurately predict the number of Releases in the following week(s)?
  • Rohit Gade
    • extracting GitHub commit hash ids from links found on the NVD
    • Create a dataset from the NVD and the SHGD to analyze how much time passes between when a software bug is discovered and when it is patched
    • Develop a model to predict the patch duration
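
Several of the tasks above depend on pulling commit hashes out of GitHub links listed on the NVD. The sketch below shows one possible approach using only the Python standard library. It is an illustration under assumptions, not the team's implementation: the function name `extract_github_commit` is hypothetical, and it only handles URLs of the form `https://github.com/<owner>/<repo>/commit/<sha>`.

```python
import re
from urllib.parse import urlparse

# Path of a GitHub commit URL: /<owner>/<repo>/commit/<7 to 40 hex characters>,
# optionally followed by a suffix such as ".patch" or a trailing slash.
COMMIT_PATH = re.compile(r"^/([^/]+)/([^/]+)/commit/([0-9a-fA-F]{7,40})(?:[/.]|$)")


def extract_github_commit(url):
    """Return (owner, repo, sha) for a GitHub commit URL, or None otherwise."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    match = COMMIT_PATH.match(parsed.path)
    if match is None:
        return None
    owner, repo, sha = match.groups()
    # Lower-case the hash so it compares equal to hashes stored in lower case.
    return owner, repo, sha.lower()


if __name__ == "__main__":
    urls = [
        "https://github.com/example/project/commit/0123456789ABCDEF0123456789abcdef01234567",
        "https://github.com/example/project/issues/42",
        "https://example.org/advisory/123",
    ]
    for url in urls:
        print(url, "->", extract_github_commit(url))
```

The extracted hash can then be looked up among the revision ids in the dataset. Links to pull requests, issues, or other hosts return `None` and would need separate handling.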
