Presentation

k----n edited this page Dec 5, 2020 · 17 revisions

Project idea / goal / research question

  • How widespread are cloned files that contain known vulnerabilities?

Approach incl. usage of WoC

Obstacles you had to overcome on the way

  • The ob2b mapping did not exist; Dr. Mockus added it.
  • We were uncertain whether to use c2P or c2p, and determined that c2p was best.
  • The sheer volume of data makes it hard to get results in a reasonable time. We spent some time on performance issues; adding parallel processing gave the biggest improvement.
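
As an illustration of the parallel-processing point above, here is a minimal sketch of running many independent per-blob lookups concurrently. It is not the project's actual code: `lookup_projects`, `lookup_all`, and the placeholder index are hypothetical, and the sketch assumes the slow step is an I/O-bound lookup (for example, querying a mapping one blob at a time), which is where a thread pool tends to help.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_projects(blob_id):
    """Hypothetical stand-in for a slow per-blob lookup (blob -> projects)."""
    # Placeholder data; a real lookup would query the WoC mappings instead.
    fake_index = {
        "blob_a": ["proj1", "proj2"],
        "blob_b": ["proj2"],
    }
    return blob_id, fake_index.get(blob_id, [])


def lookup_all(blob_ids, workers=4):
    """Run the lookups concurrently and collect the results into a dict."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, one (blob_id, projects) pair each.
        return dict(pool.map(lookup_projects, blob_ids))


if __name__ == "__main__":
    print(lookup_all(["blob_a", "blob_b", "blob_c"]))
```

If the per-item work were CPU-bound rather than I/O-bound, swapping in `ProcessPoolExecutor` would be the usual alternative.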

Results / findings incl. artefacts you produced

  • Go over the output of the find_cloned_files script at https://woc-hack.github.io/hemlock/
  • Chris' observations about labapart/polymcu: the project has not been updated for a while, but since commits stopped, people have come along and expressed interest. This suggests that people are picking up and using the code without realizing it contains a vulnerability: https://github.com/labapart/polymcu/issues/6
  • There are 5000+ forks of QEMU; we identified 21 that were NOT vulnerable.
  • Some forks of QEMU are very old, such as http://github.com/jeffreymingyue/qemu, which is "30957 commits behind qemu:master". This suggests that forks with no commits beyond those of the QEMU project may be lower risk: they were forked once and ignored after that.
  • 572 projects copied QEMU code before this fix and redistribute it, for example in a subdirectory. Example: a Georgia Tech project, https://github.com/sslab-gatech/opensgx, "OpenSGX: An open platform for Intel SGX". It was last edited in 2016 and has this bug. A comment posted this fall (https://github.com/sslab-gatech/opensgx/issues/67) proposes restarting the project. Newer forks, such as gutjuri/opensgx (last edited in June 2020), contain the bug.
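
The copies described above can be recognized because git identifies file contents by a blob ID that depends only on the bytes of the file, not on its path or project. The sketch below shows how that ID is computed and how a directory's files could be checked against a set of known-vulnerable blob IDs. It is not the find_cloned_files script; `find_vulnerable_copies` and its inputs are hypothetical names used only for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 blob ID git assigns to this file content."""
    # Git hashes the header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def find_vulnerable_copies(files, vulnerable_ids):
    """Return the sorted paths whose content matches a known-vulnerable blob.

    files: dict mapping path -> file content (bytes)
    vulnerable_ids: set of blob IDs known to contain the vulnerability
    """
    return sorted(
        path for path, data in files.items()
        if git_blob_id(data) in vulnerable_ids
    )


if __name__ == "__main__":
    bad = b"int copy(char *dst, char *src) { return strcpy(dst, src) != 0; }\n"
    files = {
        "third_party/qemu/util.c": bad,   # verbatim copy in a subdirectory
        "src/main.c": b"int main(void) { return 0; }\n",
    }
    print(find_vulnerable_copies(files, {git_blob_id(bad)}))
```

Exact blob matching only finds byte-identical copies; a file that was edited after copying gets a different ID even if the vulnerable lines are unchanged.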

Future plans beyond submitting the project to the MSR track

  • Find more candidate projects
    • search commit logs for CVE
    • search nvd.nist.gov, cve.mitre.org, etc
  • Run the tool on many more projects
  • Collect more in-depth information about a few specific vulnerable projects
  • Try to find out when a vulnerability was introduced
    • maybe diff fixed version with previous version to find lines that changed
    • maybe use SZZ algorithm
  • Crowdsource dataset to determine which blobs contain vulnerable code, and which lines of code are vulnerable (https://github.com/doccano/doccano)
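
The idea of diffing the fixed version against the previous version can be sketched as follows. This is only an illustration under stated assumptions: `changed_lines` is a hypothetical helper, the two versions are assumed to be available as text, and the C snippet in the example is made up. Lines removed or modified by a fix are the usual starting point for SZZ-style tracing back to the introducing commit.

```python
import difflib


def changed_lines(previous: str, fixed: str):
    """Return (removed, added) lines between the pre-fix and fixed versions."""
    old_lines = previous.splitlines()
    new_lines = fixed.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" contributes to both lists; "equal" contributes to neither.
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added


if __name__ == "__main__":
    previous = "int n = len;\nmemcpy(buf, src, n);\n"
    fixed = "int n = len;\nif (n > sizeof(buf)) return -1;\nmemcpy(buf, src, n);\n"
    removed, added = changed_lines(previous, fixed)
    print("removed:", removed)
    print("added:", added)
```

A fix that only adds lines, as in the example, leaves nothing to trace back directly, which is one reason a line-level diff may need to be combined with other evidence.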

Feedback (from Dec 5)

  • Looking at blobs might be weird (Kalvin note: it might be useful for drilling down into files containing parts of the code, to figure out whether the vulnerable code still exists)
  • Consider other sources for CVEs (a CVE might not be referenced in the commit message; consider messages in GHTorrent)
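
Whatever the text source turns out to be (commit messages, or other messages such as those collected by GHTorrent), spotting CVE references comes down to matching the CVE ID format: "CVE", a four-digit year, and a sequence number of four or more digits. A minimal sketch, where `extract_cve_ids` is a hypothetical helper and the IDs in the example are made up:

```python
import re

# CVE IDs look like CVE-YYYY-NNNN, where the sequence number has 4+ digits.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(message: str):
    """Return the sorted, de-duplicated CVE IDs mentioned in a message."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(message)})


if __name__ == "__main__":
    # Made-up message and IDs, for illustration only.
    msg = "Fix buffer overflow (cve-2020-1234); related to CVE-2019-123456."
    print(extract_cve_ids(msg))
```

Messages that describe a security fix without naming a CVE ID would be missed by this approach, which is the gap the feedback above points at.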