---
title: "KJD Analysis"
author: "James Walden"
editor: visual
format:
  pdf:
    toc: true
    number-sections: true
    colorlinks: true
---

## KJD Analysis

We investigate the prevalence and survival time of cloned vulnerable files. We start with the CVEfixes dataset, then find projects containing such vulnerabilities in the World of Code (WoC) and search for the vulnerable files in other WoC projects.

```{r}
#| echo: false
#| output: false
library(tidyverse)
library(fs)
library(lubridate)
library(tidymodels)
set.seed(101)
```

## Data

We use two data sets: CVEfixes to find vulnerable files, and World of Code to find where those files have been copied to.

The CVEfixes dataset was generated by Kristiina in November 2022 using the code found in the CVEfixes repository at https://github.com/secureIT-project/CVEfixes. There is no missing data in this dataset, though not all projects containing CVEs from this dataset exist in the World of Code.

```{r}
data_path <- path('data')
woc_path <- path(data_path, '11.04')
```

```{r}
cve <- read_delim(path(data_path, 'cvefixes_new.csv'),
                  delim=';', 
                  col_names=c('CVE','commit','url','pathname','create_time','fix_time'),
                  col_types='ccccTT')
```

This dataset has 3615 vulnerabilities and 6 columns of data about them.

```{r}
#| echo: false
dim(cve)
```

There is no missing data in any column of the CVEfixes dataset.

```{r}
cve |> summarize(across(everything(), ~sum(is.na(.x))))
```

The World of Code dataset contains data collected on projects that have vulnerabilities listed in the CVEfixes dataset. It was constructed using David's scripts in the KJD repository.

We read in the UNIX timestamp format columns as doubles, since the default 32-bit integers are not long enough to store those values.
```{r}
woc_new <- read_csv(path(woc_path, 'final_filter0.csv.xz'),
                col_names = TRUE,
                col_types = 'ccccccdcdiciciciddciiiiiiciicc')
```

Check number of missing values in each column.
```{r rows.print=30}
woc_new_missing <- woc_new |> 
  summarize(across(everything(), ~sum(is.na(.x)))) |>
  pivot_longer(cols=everything(), names_to='metric', values_to='count')
woc_new_missing |> arrange(desc(count))
```

Several columns are missing the same large number of values, which amounts to more than 96% of all rows.
```{r}
woc_new_missing |> summarize(max = max(count), pct = max/nrow(woc_new))
```
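
As a minimal sketch of the missing-value tally above, here is the same idea on a made-up data frame. The column values are hypothetical, and base R is used so the sketch runs without extra packages.

```r
# Toy data frame (hypothetical values) with gaps in two columns.
toy <- data.frame(
  Project  = c("a", "b", "c", "d"),
  GHStars  = c(10, NA, NA, 3),
  FirstFix = c(NA, NA, NA, 1667520000)
)

# Count missing values per column, then express them as a share of all rows.
toy_missing <- data.frame(
  metric = names(toy),
  count  = unname(colSums(is.na(toy)))
)
toy_missing$pct <- 100 * toy_missing$count / nrow(toy)

# Sort so the columns with the most missing data come first.
toy_missing <- toy_missing[order(-toy_missing$count), ]
print(toy_missing)
```

Sorting by the count makes columns that share the same number of missing rows easy to spot, which is the pattern seen in the WoC data.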

Convert datetime and logical data types that read_csv doesn't handle.
```{r}
woc <- woc_new
woc$FirstBadTime <- as_datetime(woc$FirstBadTime)
woc$FirstGoodTime <- as_datetime(woc$FirstGoodTime)
woc$EarliestCommitDate <- as_datetime(woc$EarliestCommitDate)
woc$LatestCommitDate <- as_datetime(woc$LatestCommitDate)
woc$SECURITY.md <- ifelse(woc$SECURITY.md=="Yes", TRUE, FALSE)
woc$Corp <- ifelse(woc$Corp=="Yes", TRUE, FALSE)
```
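
The conversions above can be checked on a few made-up values. This sketch assumes the timestamps are seconds since the UNIX epoch in UTC, and uses base R's `as.POSIXct()` as a stand-in for `as_datetime()`.

```r
# Hypothetical UNIX timestamps (seconds since 1970-01-01 UTC), stored as doubles.
toy_times <- c(0, 1667520000, NA)

# An explicit origin and time zone keep the conversion unambiguous; NA stays NA.
toy_dt <- as.POSIXct(toy_times, origin = "1970-01-01", tz = "UTC")
print(format(toy_dt, "%Y-%m-%d %H:%M:%S"))

# Comparing a "Yes"/"No" column to "Yes" yields a logical vector directly,
# and missing values remain NA.
toy_flag <- c("Yes", "No", NA)
toy_is_yes <- toy_flag == "Yes"
print(toy_is_yes)
```

Missing values survive both conversions, so rows without timestamps or flags are still counted as missing afterwards.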

If we remove all rows with missing values in the columns below, we remove about
2% of the dataset. We keep rows with missing data in the star count (GHStars,
NumStars) and GitHub commits (GHCommits), along with the 8 time columns
beginning with First or Time in which over 96% of rows have missing data. 

We also remove files named CHANGES, Kconfig, README, and README.md, as well as
files with the suffix .svn-base, because these files are either not code or (in
the case of the svn-base files) probably not in use. While CHANGES and Kconfig
each appear only once in the CVEfixes dataset, they are copied many times. The
.svn-base suffix does not appear in CVEfixes, but the content of those files
appears in WoC in files that have the .svn-base suffix added.
```{r}
wocnona <- woc |> 
    drop_na(all_of(c('FirstBadBlob','FirstBadTime','CommunitySize','NumForks',
    'Path','FileInfo','NumCore','NumCommits','NumAuthors'))) |>
    filter(!str_detect(Path, 'CHANGES|Kconfig|svn-base')) |>
    filter(!str_detect(Path, 'README(\\.md)?$'))
nrow(woc) - nrow(wocnona)
```
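
To see which kinds of paths such a filter drops, here is a small sketch on made-up paths. The patterns are modeled on the filter above, and all paths are hypothetical.

```r
# Hypothetical file paths, for illustration only.
toy_paths <- c(
  "src/net/socket.c",
  "docs/CHANGES",
  "drivers/Kconfig",
  ".svn/text-base/util.c.svn-base",
  "README",
  "pkg/README.md",
  "lib/README_parser.py"
)

# Drop changelogs, Kconfig files, and svn-base copies anywhere in the path,
# plus README or README.md at the very end of the path.
toy_drop <- grepl("CHANGES|Kconfig|svn-base", toy_paths) |
  grepl("README(\\.md)?$", toy_paths)
toy_kept <- toy_paths[!toy_drop]
print(toy_kept)
```

Anchoring the README pattern with `$` keeps source files such as `README_parser.py`, whose names merely start with README.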

__NOTE:__ Use woc for the whole data set, wocnona for data without most missing data.


## RQ1: How prevalent are cloned files with vulnerabilities?

The total number of orphan vulnerabilities (copies of CVEfixes vulnerabilities):
```{r}
nrow(wocnona)
```

### Statistics of Original Vulnerabilities (CVEfixes)

There are 3,615 vulnerabilities in the CVEfixes dataset.
```{r}
cve |> count()
```

Let's see how many CVEs are in CVEfixes for each year.
```{r}
cve_years <- cve |>
  separate(CVE, c("Constant", "Year", "Number")) |>
  select(-Constant) |>
  group_by(Year) |>
  summarize(NumVulns = n())
cve_years
```
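
The year extraction relies on the `CVE-YEAR-NUMBER` identifier format. A minimal base R sketch of the same split and count, using three well-known identifiers purely as format examples:

```r
# Example identifiers; only their format matters here.
toy_cves <- c("CVE-2014-0160", "CVE-2014-6271", "CVE-2021-44228")

# Split on the hyphens and keep the second field, which is the year.
toy_parts <- strsplit(toy_cves, "-", fixed = TRUE)
toy_year  <- vapply(toy_parts, function(p) p[2], character(1))

# Tabulate the number of identifiers per year.
toy_counts <- as.data.frame(table(Year = toy_year),
                            responseName = "NumVulns",
                            stringsAsFactors = FALSE)
print(toy_counts)
```

The year in a CVE identifier is the year the identifier was assigned, which can differ from the year the vulnerable code was written.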

```{r}
ggplot(cve_years, aes(x=Year, y=NumVulns)) + 
  geom_col() +
  ggtitle("CVEfixes Vulnerabilities by Year") +
  ylab("Number of Vulnerabilities")
```

There are 1,114 projects with vulnerabilities in the CVEfixes dataset.
```{r}
cve |> distinct(url) |> count()
```

### Statistics of Cloned Vulnerabilities (WoC)

Percentages here are relative to the 3014 cloned vulnerabilities, not the total
number of vulnerabilities in the CVEfixes dataset.

Construct a dataset of CVEs from CVEfixes that were cloned by WoC projects.
```{r}
cloned_cve <- inner_join(distinct(wocnona, CVE), 
                         select(cve, where(is.character)), 
                         by="CVE")
```

There are 3014 cloned vulnerabilities out of 3615 total (83.3%) in the CVEfixes dataset.
```{r}
cloned_cve |> summarize(n=n(), pct=n()/nrow(cve))
```

Let's see how many CVEs are cloned from CVEfixes for each year.
```{r}
# Count only the CVEs that were cloned (cloned_cve), not all of CVEfixes.
cve_years <- cloned_cve |>
  separate(CVE, c("Constant", "Year", "Number")) |>
  select(-Constant) |>
  group_by(Year) |>
  summarize(NumVulns = n())
cve_years
```

This plot looks very much like the plot of all CVEs from CVEfixes, suggesting
that the year does not influence the chance of being cloned much. Even recent
years, where projects would have less time to make clones, show little change.
```{r}
ggplot(cve_years, aes(x=Year, y=NumVulns)) + 
  geom_col() +
  ggtitle("Cloned CVEs by Year") +
  ylab("Number of Vulnerabilities")
```

There are 1,114 projects with vulnerabilities in the CVEfixes dataset.
```{r}
cve |> distinct(url) |> count()
```

Only 800 of these projects have vulnerabilities that were cloned.
```{r}
cloned_cve |> distinct(url) |> count()
```

794 (26.3%) of cloned vulnerabilities came from the Linux kernel, while another
203 (6.7%) were from TensorFlow and 119 (3.9%) from ImageMagick.
```{r}
cloned_cve_by_project <- 
  cloned_cve |> 
  group_by(url) |> 
  summarize(nvulns = n(),
            percent = 100*nvulns/nrow(cloned_cve))
cloned_cve_by_project |> arrange(desc(nvulns))
```

546 (68.25%) of original projects had only a single cloned vulnerability.
```{r}
cloned_cve_by_project |> 
  filter(nvulns == 1) |>
  summarize(nvulns = n(),
            percent = 100*nvulns/nrow(cloned_cve_by_project))
```

A near majority of the original vulnerability files were written in C.
```{r}
cve_extensions <- cloned_cve |>
       mutate(Extension = ifelse(path_ext(pathname) == "",basename(pathname),path_ext(pathname))) |>
       group_by(Extension) |>
       count() |>
       mutate(percent = format(100 * n / nrow(cloned_cve), scientific=FALSE)) |>
       arrange(desc(n))
cve_extensions
```
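
The grouping key above is the file extension, falling back to the file name when there is no extension. A minimal sketch of that rule on made-up paths, using base R's `tools::file_ext()` as a stand-in for `path_ext()`:

```r
# Hypothetical paths: three with extensions, two without.
toy_files <- c("kernel/fork.c", "include/net.h", "Makefile",
               "docs/LICENSE", "lib/app.min.js")

# Use the extension when there is one, otherwise the bare file name.
toy_ext   <- tools::file_ext(toy_files)
toy_label <- ifelse(toy_ext == "", basename(toy_files), toy_ext)
print(table(toy_label))

# Combine related extensions into one language, e.g. C sources and headers.
toy_c_n   <- sum(toy_label %in% c("c", "h"))
toy_c_pct <- 100 * toy_c_n / length(toy_label)
print(c(n = toy_c_n, percent = toy_c_pct))
```

Only the last dot-separated component counts as the extension, so `app.min.js` is grouped with the other `.js` files.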

If we combine all C source and header files, we find 1789 (59.3%) of original
vulnerable files are written in C.
```{r}
cve_extensions |> 
  filter(Extension %in% c('c','h')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

If we combine all C++ suffixes, we find 307 (10.2%) of files are written in C++.
```{r}
cve_extensions |> 
  filter(Extension %in% c('cc','cpp','cxx','hh','hpp')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

If we combine PHP and template files with PHP code, we find 455 (15.1%) of
vulnerable files are written in PHP.
```{r}
cve_extensions |> 
  filter(Extension %in% c('php','ctp','phtml')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

If we combine JavaScript code (.js), configuration (.json), and Node (.node)
files, we find 157 (5.2%) of vulnerable files are written in JavaScript.
```{r}
cve_extensions |> 
  filter(Extension %in% c('js','json','node')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

If we combine Perl source code (.pl) and module (.pm) files, we find that 18
(0.6%) of vulnerable files are written in Perl.
```{r}
cve_extensions |> 
  filter(Extension %in% c('pl','pm')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

### Analyze which CVEs are cloned

Let us examine which CVEs are most often cloned.

```{r}
cve_rename_cols <- c('url', 'pathname')
cve_renamed <- cve |>
  select(CVE, all_of(cve_rename_cols)) |>
  rename_with(~ paste0("original_", .), all_of(cve_rename_cols))
```

```{r}
cve_clones <- wocnona |> left_join(cve_renamed, by="CVE")
nonjs_clones <- cve_clones |> filter(!str_detect(Path, '\\.(js|ts|json)$')) 
```


```{r}
cve_clone_count <-
  cve_clones |>
  group_by(CVE,original_url,original_pathname) |>
  summarize(nvulns = n(),
            percent = 100*nvulns/nrow(wocnona)) |>
  ungroup()
```

```{r}
cve_clone_count |> select(CVE,nvulns,percent) |> arrange(desc(nvulns))
```

On average, there are over 1000 orphan vulnerabilities (copies) for each
of the 3014 original vulnerabilities that were copied from CVEfixes. The
range is enormous, from 1 to 112,297, with a standard deviation (4581)
larger than the mean (1010).
```{r}
cve_clone_count |> select(nvulns) |> summary()
```

```{r}
cve_clone_count |> pull(nvulns) |> sd()
```

10.1% of vulnerabilities are cloned only once.
```{r}
cve_clone_count |> 
  filter(nvulns==1) |>
  summarize(nvulns = n(),
            percent = 100*nvulns/nrow(cve_clone_count))
```

380 (12.6%) of vulnerabilities are cloned one thousand or more times.
```{r}
cve_clone_count |> 
  filter(nvulns >= 1000) |>
  summarize(nvulns = n(),
            percent = 100*nvulns/nrow(cve_clone_count))
```

Visualize the number of clones for the top 20 cloned CVEs.
Note: we use the arrange and mutate combination to order CVEs by nvulns.
```{r}
cve_clone_count_plot <-
  cve_clone_count |>
  arrange(desc(nvulns)) |>
  head(n=20) |>
  arrange(nvulns) |>
  mutate(CVE = factor(CVE, levels=CVE)) |>
  ggplot(aes(x=nvulns, y=CVE)) +
  geom_point() +
  scale_x_continuous(breaks=seq(20000,120000,20000), labels=label_comma()) +
  xlab("Number of Clones") +
  theme_bw()
cve_clone_count_plot
```


### Statistics of Cloned Projects (WoC)

There are 719,212 unique projects containing cloned vulnerabilities.
```{r}
wocnona |> distinct(Project) |> count()
```

There are over 3 million cloned vulnerable files.
```{r}
wocnona |> count()
```

The median project has only a single cloned vulnerability and the third quartile starts at two vulnerabilities. However, the mean is 4.237 and the maximum is 807 cloned vulnerabilities in a single project.
```{r}
vulnsperproj <- wocnona |> 
    group_by(Project) |> 
    summarize(numvulns=n()) |> 
    arrange(desc(numvulns))
summary(vulnsperproj |> select(numvulns))
```

```{r}
vulnsperproj |> arrange(desc(numvulns))
```

A majority (58.3%) of projects have only a single cloned vulnerability.
```{r}
vulnsperproj |>
  filter(numvulns == 1) |>
  summarize(nvulns = n(),
            percent = 100*nvulns/nrow(vulnsperproj))
```

Several thousand (1.3%) projects have 100 or more cloned vulnerabilities.
```{r}
vulnsperproj |>
  filter(numvulns >= 100) |>
  summarize(nvulns = n(),
            percent = 100*nvulns/nrow(vulnsperproj))
```

We can visualize vulnerabilities per project as a boxplot. The box is squashed as the IQR is 1 while there is a long tail of high outliers going up to 807.
```{r, fig.height=2, fig.width=12}
ggplot(data=vulnsperproj, aes(x=numvulns, y=1)) + geom_boxplot()
```

Tabulating projects by their number of cloned vulnerabilities confirms that 58.3% of all projects have only a single cloned vulnerability.
```{r}
proj_counts <- 
  vulnsperproj |> 
  group_by(numvulns) |> 
  summarize(nprojects=n(), percent=format(100*n()/nrow(vulnsperproj), scientific=FALSE))
proj_counts
```

Over 90% (90.8%) of projects have 1-3 vulnerabilities, while 97.5% of projects have 10 or fewer cloned vulnerabilities.
```{r}
proj_counts_top3 <- proj_counts |> filter(numvulns <= 3) |> pull(nprojects) |> sum()
proj_counts_top10 <- proj_counts |> filter(numvulns <= 10) |> pull(nprojects) |> sum()
tibble(nvulns_1_3=proj_counts_top3, pct3=100*proj_counts_top3/nrow(vulnsperproj),
       nvulns_1_10=proj_counts_top10, pct10=100*proj_counts_top10/nrow(vulnsperproj))
```
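
The cumulative shares quoted above are simply the fraction of projects at or below a threshold. A minimal sketch on ten made-up projects; `share_at_most()` is a hypothetical helper and not part of the analysis code:

```r
# Hypothetical number of cloned vulnerabilities for ten projects.
toy_numvulns <- c(1, 1, 1, 1, 1, 1, 2, 2, 3, 12)

# Percentage of projects with at most k cloned vulnerabilities.
share_at_most <- function(x, k) 100 * mean(x <= k)

toy_shares <- c(
  at_most_1  = share_at_most(toy_numvulns, 1),
  at_most_3  = share_at_most(toy_numvulns, 3),
  at_most_10 = share_at_most(toy_numvulns, 10)
)
print(toy_shares)
```

Selecting by the value of the count, rather than by how common each count is, gives the intended answer even if some count above the threshold happens to be more frequent than one below it.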

### Programming Language of Vulnerable Clones (WoC)

A small number of vulnerabilities (dozens) are tied to text files (.txt, .html)
and configuration files (htaccess). There are a few XML, YAML, and HAML format
files too as well as template files with .erb and .rhtml extensions. 16 files
have a .old extension.

```{r}
woc_extensions <- wocnona |>
       mutate(Extension = ifelse(path_ext(Path) == "", basename(Path), path_ext(Path))) |>
       group_by(Extension) |>
       count() |>
       mutate(percent = format(100 * n / nrow(wocnona), scientific=FALSE)) |>
       arrange(desc(n))
woc_extensions
```

A majority (63.6%) of cloned vulnerable files are C source or header files.
```{r}
woc_extensions |> 
  filter(Extension %in% c('c','h')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

Only 38,533 (1.3%) of cloned vulnerable files are C++ source or header files.
```{r}
woc_extensions |> 
  filter(Extension %in% c('cc','cpp','cxx','hh','hpp')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

For comparison, count the C++ files among all CVEfixes entries.
```{r}
cve |> 
  filter(path_ext(pathname) %in% c('cc','cpp','cxx','hh','hpp')) |>
  count()
```


741,433 (24.3%) cloned vulnerable files are written in JavaScript.
```{r}
woc_extensions |> 
  filter(Extension %in% c('js','json','node')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

Let's create a dataset of vulnerabilities in JavaScript clone files.
```{r}
js_vulns <- 
  wocnona |>
  filter(str_detect(Path, '\\.(js|json|node)$'))
nrow(js_vulns)
```

We find the subset of the JavaScript vulnerabilities whose path indicates that
they were installed using npm.
```{r}
npm_path_vulns <-
  js_vulns |>
  filter(str_detect(Path, 'node_modules'))
nrow(npm_path_vulns)
```

Compute the percentage of JavaScript files in paths that look like they were
installed by npm.
```{r}
100 * nrow(npm_path_vulns) / nrow(js_vulns)
```

194,991 (6.4%) of cloned vulnerable files are written in PHP.
```{r}
woc_extensions |> 
  filter(Extension %in% c('php','ctp','phtml')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

102,986 (3.4%) of cloned vulnerable files are written in Ruby.
```{r}
woc_extensions |> 
  filter(Extension %in% c('rb','erb','rhtml')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

Only 1282 (0.04%) of cloned vulnerable files are written in Perl.
```{r}
woc_extensions |> 
  filter(Extension %in% c('pl','pm')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

The number and percent of vulnerable cloned files using the 5 most popular
languages (counting C and C++ as separate languages) are:
```{r}
woc_extensions |> 
  filter(Extension %in% c('js','php','rb','c','h','cc','cpp','cxx','hh','hpp')) |>
  ungroup() |>
  mutate(percent = as.numeric(percent)) |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
```

__Which CVEs are Cloned the Most__

Let us create a dataset of cloned CVEs with clone counts and the associated
project (as specified by a URL from the CVEfixes dataset).
```{r}
cve_clone_counts <- wocnona |> 
  group_by(CVE) |>
  summarize(nclones=n()) |>
  ungroup() |>
  arrange(desc(nclones))

cve_clones <- inner_join(cve_clone_counts, cve, by="CVE") |>
  select(CVE, nclones, url)

cve_clones
```

The project_clones dataset contains projects identified by CVEfixes URL and the
number of times CVEs from that project were cloned.
```{r}
project_clones <- cve_clones |> 
  group_by(url) |> 
  summarize(n = sum(nclones))
project_clones |> arrange(desc(n))
```


## RQ2: What are characteristics of projects containing cloned files with vulnerabilities?

Since the research question asks about projects, not vulnerabilities, we create a data frame with one row per WoC project. There are 720,128 projects in the dataset.

```{r}
numvulns <- wocnona |> 
  group_by(Project) |> 
  summarize(NumVulns=n(),
            NumFixed=sum(status=="fixed"),
            NumUnfixed=sum(status=="notfixed"),
            NumUnknown=sum(status=="unknown")
  )
```

```{r}
projects <- 
  left_join(numvulns, wocnona, by="Project") |>
  group_by(Project) |>
  slice(1) |>
  ungroup()
# NOTE: an unfinished filter(NumActi...) step followed here; its condition is
# unknown, so it is left out.
```
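
The per-project table is built by collapsing one row per cloned file into one row per project with counts by status. A minimal base R sketch of that collapse on made-up rows (project names and statuses are hypothetical):

```r
# Hypothetical per-file rows: one row per cloned vulnerable file.
toy_clones <- data.frame(
  Project = c("p1", "p1", "p1", "p2", "p3", "p3"),
  status  = c("fixed", "notfixed", "notfixed", "unknown", "fixed", "fixed")
)

# Fix the status levels so every column exists even if a status never occurs.
toy_status <- factor(toy_clones$status,
                     levels = c("fixed", "notfixed", "unknown"))
toy_tab <- table(toy_clones$Project, toy_status)

# One row per project, with the total and the count for each status.
toy_projects <- data.frame(
  Project    = rownames(toy_tab),
  NumVulns   = as.vector(rowSums(toy_tab)),
  NumFixed   = as.vector(toy_tab[, "fixed"]),
  NumUnfixed = as.vector(toy_tab[, "notfixed"]),
  NumUnknown = as.vector(toy_tab[, "unknown"])
)
print(toy_projects)
```

The three status counts add up to `NumVulns` for every project, which is a cheap sanity check to run on the real table as well.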


__Sources__

Separate the ProjectUrl field into Source and Path fields. This works on all rows except the one for CVE-2013-7223 in project "mmarkoul_demo_app", which has the string "githProjectUrl" in the ProjectUrl field.
```{r}
project_sources <- projects |>
  filter(ProjectUrl != "githProjectUrl") |>
  separate(ProjectUrl, c("Source","Path"), sep="/", extra="merge") |>
  mutate(Source = as.factor(Source))
```

The vast majority (about 98.5%) of projects are from GitHub, with only about 1.5% being from GitLab.
```{r}
project_sources |> select(Source) |> summary()
```

__Active Months__

We summarize the number of active months per project, excluding projects that report 600 or more active months (50 years), which are presumably data errors.
```{r}
projects |> filter(NumActiveMon < 600) |> select(NumActiveMon) |> summary()
```

```{r}
projects |> pull(NumActiveMon) |> sd()
```


__Commits__

The median number of commits is very low (7), while the mean is much higher (11,705), indicating that many projects have few commits.
```{r}
projects |> select(NumCommits) |> summary()
```

```{r}
projects |> pull(NumCommits) |> sd()
```

A boxplot shows that the distribution of commits per project has a huge number of outliers above the median, some of which are extremely far from the median.
```{r, fig.height=2, fig.width=12}
ggplot(data=projects, aes(x=NumCommits, y=1)) + geom_boxplot()
```

When viewing how many projects have each number of commits, we see hundreds of thousands of projects for numbers of commits <= 6 and tens of thousands of projects for numbers of commits <= 41.
```{r}
freq_commits <- projects |> 
    group_by(NumCommits) |> 
    select(NumCommits) |> 
    summarize(nprojects=n(), 
              percent=format(100*nprojects/nrow(projects), scientific=F))
freq_commits
```

```{r, fig.width=12}
freq100 <- freq_commits |> head(100)
ggplot(data=freq100, aes(x=NumCommits, y=nprojects)) + 
  geom_point() +
  scale_x_continuous(n.breaks=10) +
  scale_y_continuous(n.breaks=10) +
  ylab("Number of Projects")
```

Let's examine the data on a logarithmic scale.
```{r}
ggplot(data=freq_commits |> head(n=1e5), aes(x=NumCommits, y=nprojects)) + 
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  ylab("Number of Projects")
```

61.3% of projects have 10 or fewer commits.
```{r}
freq_commits |> 
  filter(NumCommits <= 10) |> 
  summarize(n=sum(nprojects),
            percent=format(100*n/nrow(projects), scientific=F))
```

94.3% of projects have 100 or fewer commits.
```{r}
freq_commits |> 
  filter(NumCommits <= 100) |> 
  summarize(n=sum(nprojects),
            percent=format(100*n/nrow(projects), scientific=F))
```

99.4% of projects have no more than 1000 commits.
```{r}
freq_commits |> 
  filter(NumCommits <= 1000) |> 
  summarize(n=sum(nprojects),
            percent=format(100*n/nrow(projects), scientific=F))
```

We can zoom in by plotting the first 100 commit counts without the first 10, which dominate the scale.
```{r, fig.width=12}
freq_no_top10 <- freq_commits |> head(100) |> tail(90)
ggplot(data=freq_no_top10, aes(x=NumCommits, y=nprojects)) + 
  geom_point() +
  scale_x_continuous(n.breaks=10) +
  scale_y_continuous(n.breaks=10) +
  ylab("Number of Projects")
```


__Authors__

The median number of authors is 1, while the mean is much higher (55.79), indicating that many projects have a single or few authors.
```{r}
projects |> select(NumAuthors) |> summary()
```

```{r}
projects |> pull(NumAuthors) |> sd()
```

When viewing how many projects have each number of authors, we find that 71% of projects have a single author and 19.2% have two authors, meaning 90% of projects have only one or two authors.
```{r}
freq_authors <- projects |> 
    group_by(NumAuthors) |> 
    select(NumAuthors) |> 
    summarize(nprojects=n(), 
              percent=format(100*nprojects/nrow(projects), scientific=F))
freq_authors
```

97.5% of projects have 5 or fewer authors.
```{r}
freq_authors |> 
  filter(NumAuthors <= 5) |> 
  summarize(n=sum(nprojects),
            percent=format(100*n/nrow(projects), scientific=F))
```

99% of projects have 10 or fewer authors.
```{r}
freq_authors |> 
  filter(NumAuthors <= 10) |> 
  summarize(n=sum(nprojects),
            percent=format(100*n/nrow(projects), scientific=F))
```

We take a random sample of 25,000 vulnerable clones (under 1% of the data) to
examine the relationship between the numbers of commits and authors, excluding
outliers with 1000 or more authors or 10,000 or more commits.
```{r}
scatter_sample <- wocnona |>
  filter(NumAuthors < 1e3 & NumCommits < 1e4) |>
  slice_sample(n=25000)
```

Plotting the sample, we see a linear relationship with larger numbers of authors
indicating larger numbers of commits, though there are a considerable number of
outliers.
```{r}
ggplot(scatter_sample, aes(x=NumAuthors, y=NumCommits)) + 
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, level = 0.99)
```

__Community Size__

The median community size is 1, while the mean is much higher (60.1), indicating that many projects have no or very small communities.
```{r}
projects |> select(CommunitySize) |> summary()
```

The large majority (85%) of projects have a community size of 1, while another 10% have a community size of 2.
```{r}
freq_community <- projects |> 
    group_by(CommunitySize) |> 
    select(CommunitySize) |> 
    summarize(nprojects=n(), 
              percent=format(100*nprojects/nrow(projects), scientific=F))
freq_community
```

A bit less than 1% of projects have a community size of more than 10.
```{r}
freq_community |>
  filter(CommunitySize > 10) |>
  summarize(n=sum(nprojects),
            percent=format(100*n/nrow(projects), scientific=F))
```

We take a random sample of 25,000 vulnerable clones (under 1% of the data) to
examine the relationship between the number of authors and community size,
excluding outliers with 10,000 or more authors or community members.
```{r}
scatter_sample <- wocnona |>
  filter(NumAuthors < 1e4 & CommunitySize < 1e4) |>
  slice_sample(n=25000)
```

Plotting the sample, we see a linear relationship with larger numbers of authors
indicating larger community sizes, though there are a considerable number of
outliers.
```{r}
ggplot(scatter_sample, aes(x=NumAuthors, y=CommunitySize)) + 
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, level = 0.99)
```

__GitHub Stars__

_Note:_ See paper titled _What's in a GitHub Star?_ for relevance here, though it focuses on only 5000 projects with the highest number of GitHub stars.

Eliminate rows without GitHub stars before analyzing the star data. There are 17,194 projects with missing GH star data out of over 720,000 projects total. Over 10,000 of those projects are GitLab projects, which obviously can't have GitHub stars.
```{r}
wocghstars <- projects |> drop_na(GHStars)
nrow(projects) - nrow(wocghstars)
```

The median number of stars is 0, while the mean is much higher (26.1), 
indicating that many projects have no or few stars.
```{r}
wocghstars |> select(GHStars) |> summary()
```

```{r}
wocghstars |> pull(GHStars) |> sd()
```


The large majority (83.3%) of projects have zero stars, while 10.4% have a 
single star, together making up >90% of all projects for which we have star data.
```{r}
freq_ghstars <- wocghstars |> 
    group_by(GHStars) |> 
    select(GHStars) |> 
    summarize(nprojects=n(), 
              percent=format(100*nprojects/nrow(projects), scientific=F))
freq_ghstars
```

There are slightly over 2000 projects that received 100 or more GH stars.
```{r}
freq_ghstars |> 
  filter(GHStars >= 100) |>
  ungroup() |>
  summarize(n=sum(nprojects),
            percent=format(100*n/nrow(projects), scientific=F))
```

__Language__

Most projects identify themselves as JavaScript (60%), according to the FileInfo
field, which I believe comes from GitHub's guess at a single project language.
As this does not correspond with the per-file data computed using file suffixes,
I think we should not attempt to analyze languages at the project level and
instead restrict our programming language analysis to the per-file level.
```{r}
freq_language <- projects |> 
    group_by(FileInfo) |> 
    select(FileInfo) |> 
    summarize(nprojects=n(), 
              percent=format(100*nprojects/nrow(projects), scientific=F))
freq_language |> arrange(desc(nprojects))
```

Let's look at the frequencies on a per-file basis, where we find that 64.2% of
vulnerable files are written in C/C++ and only 25.4% in JavaScript.
```{r}
freq_file_language <- wocnona |> 
    group_by(FileInfo) |> 
    select(FileInfo) |> 
    summarize(nfiles=n(), 
              percent=format(100*nfiles/nrow(wocnona), scientific=F))
freq_file_language |> arrange(desc(nfiles))
```

__Vulnerabilities__

Projects have a median of 1 and a mean of 4.2 vulnerabilities.
```{r}
projects |> select(NumVulns) |> summary()
```

```{r}
projects |> pull(NumVulns) |> sd()
```


Projects have a median of 0 fixed and a mean of 0.14 fixed vulnerabilities.
```{r}
projects |> select(NumFixed) |> summary()
```

Few projects (3.7%) fix vulnerable cloned files.
```{r}
projects |> 
    filter(NumFixed > 0) |>
    summarize(nfiles=n(), 
              percent=format(100*nfiles/nrow(projects), scientific=F))
```


__SECURITY.md__ 

GitHub supports the `SECURITY.md` standard for project security policies. It
describes supported project versions and how to report a vulnerability. More
about the standard can be found at
https://docs.github.com/en/code-security/getting-started/adding-a-security-policy-to-your-repository

Only 1.7% of projects in the dataset have a `SECURITY.md` policy file.
```{r}
projects |> 
  group_by(SECURITY.md) |>
  summarize(n=n(),
            percent=format(100*n/nrow(projects), scientific=F))
```

__Enterprise Projects__

No projects are enterprise projects.

FIXME: Is this correct or a data error?

```{r}
projects |> 
  group_by(Corp) |>
  summarize(n=n(),
            percent=format(100*n/nrow(projects), scientific=F))
```

### Correlations

Create subset containing the numeric non-datetime columns that do not have NAs.
We drop rows containing an NA in the GHStars column, so we can keep that column
in our correlation matrix. This means that we have dropped all GitLab projects.
We drop columns that contain >50% NAs like GHCommits and NumStars (stars as
measured by WoC), along with the time columns that have more than 90% NAs.
```{r}
wocnona_numeric <- wocnona |>
  drop_na(GHStars) |>
  select(where(is.numeric)) |>
  select(where(~ !any(is.na(.)))) |>
  select(!starts_with("Time"))
```

Print a correlation matrix of these columns.
```{r}
round(cor(wocnona_numeric), 2)
```
We see extremely strong (>0.90) correlations between NumAuthors, NumCommits, 
NumForks, and CommunitySize.
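
As a minimal sketch of the same steps on made-up numbers (all values below are hypothetical): keep rows with star data, keep the numeric columns, and compute a rounded Pearson correlation matrix. Base R is used so the sketch runs without extra packages.

```r
# Hypothetical per-project metrics; one project has no star data.
toy_proj <- data.frame(
  Project    = c("p1", "p2", "p3", "p4", "p5"),
  NumAuthors = c(1, 2, 3, 50, 8),
  NumCommits = c(5, 9, 16, 240, 30),
  GHStars    = c(0, NA, 1, 300, 4)
)

# Drop rows with missing stars, then keep only the numeric columns.
toy_keep_rows <- !is.na(toy_proj$GHStars)
toy_keep_cols <- vapply(toy_proj, is.numeric, logical(1))
toy_num <- toy_proj[toy_keep_rows, toy_keep_cols]

# Pearson correlation matrix, rounded to two decimals for readability.
toy_cor <- round(cor(toy_num), 2)
print(toy_cor)
```

Pearson correlation is sensitive to extreme values, so with distributions as skewed as the ones in this dataset a few very large projects can dominate the coefficients.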


### Active Projects

Due to the high correlations between activity metrics, we use only GHStars
as the cutoff for identifying active projects. We choose GHStars >= 100 as
our criterion. We compute the 5-number summary and standard deviation of
each of our metrics below.

While the minimum number of commits is surprisingly low at 3 for the active
projects subset, the mean (51,614) and median (2404) show high activity levels.
The maximum is 17.7 million commits.

```{r}
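# Active projects: those with at least 100 GitHub stars (rows with NA stars are dropped).
active_projects <- projects |> filter(GHStars >= 100)
active_projects |> 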
  select(NumActiveMon, NumAuthors, NumCommits, GHStars, NumVulns) |>
  summarize(across(1:5,summary))
```

```{r}
active_projects |> 
  select(NumActiveMon, NumAuthors, NumCommits, GHStars, NumVulns) |>
  summarize(across(1:5,sd))
```

The mean number of authors goes up tremendously to 560, while the median
rises to 58. The maximum of 93,072 is astounding.

Active projects have larger community sizes, with a mean of 1429 and a median
of 131.
```{r}
active_projects |> select(CommunitySize) |> summary()
```

Active projects are much more likely to fix vulnerabilities, with a median of 1 fixed vulnerability and a mean of 3.53, compared with 0 (median) and 0.14 (mean) for all projects.
```{r}
active_projects |> select(NumFixed) |> summary()
```

A majority (54.8%) of active projects have fixed at least one vulnerability.
```{r}
active_projects |> 
    filter(NumFixed > 0) |>
    summarize(nfiles=n(), 
              percent=format(100*nfiles/nrow(active_projects), scientific=F))
```

While only 1.7% of all projects have SECURITY.md policy files, 11.6% of active
projects have such files.
```{r}
active_projects |> 
  group_by(SECURITY.md) |>
  summarize(n=n(),
            percent=format(100*n/nrow(active_projects), scientific=F))
```

Active projects have a different distribution of programming languages, with
39.7% having C/C++ as a primary language and 25.7% having JavaScript, compared
with 8.8% C/C++ and 59.9% JavaScript in the entire WoC project dataset.
```{r}
active_freq_language <- active_projects |> 
    group_by(FileInfo) |> 
    select(FileInfo) |> 
    summarize(nprojects=n(), 
              percent=format(100*nprojects/nrow(active_projects), scientific=F))
active_freq_language |> arrange(desc(nprojects))
```


## RQ3: What percentage of cloned vulnerabilities are fixed?

Let us create a data frame containing only the fixed vulnerabilities. There are 
101,064 fixed vulnerabilities in the WoC dataset. These files make up only 3.3%
of cloned vulnerabilities in our dataset. 

There are about 69,000 or 2.3% of cloned vulnerabilities with status unknown. 
Such files have changed since the vulnerability was introduced, but are not 
identical to the fixed file from the original project. David's more detailed 
explanation of how files are classified as fixed, notfixed, or unknown is:

    I get the history of the vulnerable file in the original project that is
    listed in cve fixes. All blobs before the fixing commit are considered
    bad blobs. The blob in the fixing commit and all subsequent blobs are
    considered good blobs.
    
    If a file has a blob from the good blobs list, it is considered fixed. If
    the latest version of the file is in the bad_blobs list, it is considered
    not fixed. If the latest version of the file is not in the good or bad
    list, then we know the file was changed, but we don't know if the change
    fixed the vulnerability.

```{r}
fixedvulns <- wocnona |> filter(status == "fixed")
```

```{r}
wocnona |> 
  group_by(status) |>
  summarize( nvulns=n(),
             percent=100*nvulns/nrow(wocnona)
           )
```
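
The classification rule quoted above can be sketched as a small function. This is an illustration of the stated rule under assumptions, not David's actual script: `classify_status()` and the blob identifiers are hypothetical, and a file's blobs are assumed to be ordered oldest first.

```r
# Classify one cloned file from its blob history (oldest first), given the
# good and bad blob lists derived from the original project.
classify_status <- function(file_blobs, good_blobs, bad_blobs) {
  latest <- file_blobs[length(file_blobs)]
  if (any(file_blobs %in% good_blobs)) {
    "fixed"      # the file has a blob from the good list
  } else if (latest %in% bad_blobs) {
    "notfixed"   # the latest version is still a known-bad blob
  } else {
    "unknown"    # the file changed, but not to a known good or bad blob
  }
}

# Hypothetical blob identifiers.
toy_good <- c("g1", "g2")
toy_bad  <- c("b1", "b2")

toy_status <- c(
  classify_status(c("b1", "g1"), toy_good, toy_bad),
  classify_status(c("b1", "b2"), toy_good, toy_bad),
  classify_status(c("b1", "x9"), toy_good, toy_bad)
)
print(toy_status)
```

The third case shows why the unknown category exists: a locally modified file matches neither list, so the history alone cannot tell whether the change removed the vulnerability.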

Fixed vulnerabilities are found in 26,809 different projects. Projects with fixed
vulnerabilities represent only 3.7% of all projects.
```{r}
fixedvulns |> 
  distinct(Project) |> 
  summarize(nprojects=n(), percent=100*nprojects/nrow(projects))
```

How many projects have fixed all of their cloned vulnerabilities? 14,721 projects (2% of all projects, 55% of projects that have fixed at least one vulnerability).
```{r}
fixedprojects <- projects |> filter(NumFixed > 0 & NumUnfixed == 0)
fixedprojects |> 
  summarize(nprojects=n(), 
            pct_all=100*nprojects/nrow(projects),
            pct_fixed=100*nprojects/nrow(fixedvulns |> distinct(Project))
  )
```

76% of projects that fixed all vulnerabilities had only one vulnerability to fix.
```{r}
projects |>
  filter(NumFixed == 1 & NumUnfixed == 0) |>
  summarize(nprojects=n(), 
            pct_all=100*nprojects/nrow(projects),
            pct_fixed=100*nprojects/nrow(fixedvulns |> distinct(Project))
  )
```

```{r}
onefixed <-  projects |> filter(NumVulns == 1)
onefixed |> 
  filter(NumFixed == 1) |>
  summarize(nprojects=n(), 
            pct_one=100*nprojects/nrow(onefixed),
            pct_fixed_all=100*nprojects/nrow(fixedprojects |> filter(NumVulns==NumFixed)),
            pct_all=100*nprojects/nrow(projects)
  )
```

Let's look at how many vulnerabilities projects have fixed.
```{r}
fixedprojects |> arrange(desc(NumFixed))
```

We plot the distribution of the number of fixed vulnerabilities, removing
outliers by keeping only projects with more than 1 and fewer than 11 fixed
vulnerabilities.
```{r}
ggplot(fixedprojects |> filter(NumFixed > 1 & NumFixed <= 10), aes(x=NumFixed)) + 
  geom_density() +
  scale_x_continuous(name="Number of Fixed Vulnerabilities", breaks=1:10)
```

How many projects have both fixed and unfixed vulnerabilities? There are over 12,000 projects with both, making up 1.7% of all projects and 45% of projects that have fixed at least one vulnerability.
```{r}
mixedprojects <- projects |> filter(NumFixed > 0 & NumUnfixed > 0)
mixedprojects |> 
  summarize(nprojects=n(), 
            pct_all=100*nprojects/nrow(projects),
            pct_fixed=100*nprojects/nrow(fixedvulns |> distinct(Project))
  )
```

Let's look at the mixed projects.
```{r}
mixedprojects
```

Mixed projects have a median of 1 fixed and 5 unfixed vulnerabilities and a mean of 4.8 fixed and 47.7 unfixed vulnerabilities.
```{r}
summary(mixedprojects |> select(where(is.numeric)))
```

Let's plot the distribution of fixed and unfixed vulnerabilities for mixed projects.
```{r}
cutoff <- 10
ggplot(mixedprojects |> filter(NumFixed <= cutoff & NumUnfixed <= cutoff)) + 
  geom_density(aes(x=NumFixed), color="blue") +
  geom_density(aes(x=NumUnfixed), color="red") +
  scale_x_continuous(name="Number of Vulnerabilities (blue = fixed, red = unfixed)", breaks=1:10)
```

### How many vulnerabilities are fixed before they're vulnerable?

```{r}
validfixed <- fixedvulns |> filter(ValidDates == "OK")
nrow(validfixed)
```

We find that 11,794 fixed vulnerabilities (12.3% of those with valid dates) are
in cloned files that replicate the fixed file before they replicate the
vulnerable file.
```{r}
validfixed |> 
  filter(FirstGoodTime < FirstBadTime) |>
  summarize(nvulns = n(), percent=100*nvulns/nrow(validfixed))
```

### Time Distribution of Vulnerability Introduction and Fixing

To examine the time evolution of adding and fixing vulnerable cloned files, I
plotted several of the projects with both fixed and unfixed vulnerabilities over
time, but I found only 2 different dates in each of them:

```{r fig.height=2}
p <- validfixed |> filter(Project == '5l1v3r1_medusa-4')
ggplot(data=p) + geom_point(color="red", aes(x=FirstBadTime,y=1)) + geom_point(color="blue", aes(x=FirstGoodTime, y=2))
```


```{r fig.height=2}
p2 <- validfixed |> filter(Project == 'yllg_wasm-ffmpeg')
ggplot(data=p2) + geom_point(color="red", aes(x=FirstBadTime,y=1)) + geom_point(color="blue", aes(x=FirstGoodTime, y=2))
```

Let us see how many different good and bad times each project has.
```{r}
woctimes <- wocnona |>
  group_by(Project) |>
  summarize(nbadtimes=n_distinct(FirstBadTime),
            ngoodtimes=n_distinct(FirstGoodTime),
            nvulns=n()) |>
  arrange(desc(nbadtimes), desc(ngoodtimes))
woctimes
```

While there are projects that have many more than one or two of each time, 94% 
of projects only have one of each time.
```{r}
woctimes |> 
  filter(nbadtimes==1 & ngoodtimes == 1) |>
  summarize(nprojects=n(),
            percent=100*nprojects/nrow(woctimes))
```

36% of projects have more than one vulnerability but still only have one 
vulnerability introduction time and one vulnerability fix time.
```{r}
woctimes |> 
  filter(nbadtimes==1 & ngoodtimes == 1 & nvulns > 1) |>
  summarize(nprojects=n(),
            percent=100*nprojects/nrow(woctimes))
```

There is a power law distribution of projects with more than 1 vulnerability but
only a single bad time and a single fixed time. The distribution is of the same
form as the overall distributions of number of projects with n vulnerabilities.
```{r}
wt_freq <- woctimes |> filter(nbadtimes==1 & ngoodtimes == 1 & nvulns > 1) |>
  group_by(nvulns) |>
  summarize(nprojects=n(),
            percent=format(100*nprojects/nrow(woctimes), scientific=FALSE))
wt_freq
```
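
One rough way to check a power law claim like the one above is to see whether the frequency table is close to a straight line on log-log axes. The sketch below uses synthetic counts (`toy_freq`, generated with an assumed exponent of -2) rather than the WoC data. A linear fit on logged values is only a quick diagnostic, not a formal test of power law behavior.

```{r}
library(dplyr)

# Synthetic frequency table: nprojects falls off as nvulns^-2 (assumed).
toy_freq <- tibble(nvulns = 2:50,
                   nprojects = round(1e5 * nvulns^-2))

# For a power law, log(nprojects) is roughly linear in log(nvulns),
# and the slope estimates the exponent.
toy_fit <- lm(log(nprojects) ~ log(nvulns), data = toy_freq)
toy_slope <- coef(toy_fit)[["log(nvulns)"]]
toy_slope
```

The same fit could be applied to `wt_freq` after converting its formatted `percent` column back to numbers or using `nprojects` directly.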

## RQ4: How do project characteristics affect the percentage of fixed vulnerabilities?

Let's create a multiple regression model for the number of fixed vulnerabilities.

```{r}
fixed_model <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")
```

Let's use the projects dataset with valid dates and GHStars data. There are
over 600,000 such projects.
```{r}
valid_projects <- projects |> 
  filter(ValidDates == "OK") |>
  drop_na(GHStars)
```

We'll use numerical columns that are not highly correlated with each other as predictors.
```{r}
fit_fixed_model <- fixed_model |>
  fit(NumFixed ~ GHStars + NumCommits + NumActiveMon,  data=valid_projects)
```

While each predictor has a low-enough p-value to be significant, the fit is
poor with an adjusted R^2 of 0.01.
```{r}
fit_fixed_model |> extract_fit_engine() |> summary()
```

Since our predictors are non-normal power law data, it is recommended to log
transform them before using them in a regression model. Since GHStars contains
zeros, for which the logarithm is undefined, we use `log1p` (that is,
log(1 + x)) on it instead of a plain log. See the web page at
https://robjhyndman.com/hyndsight/transformations/ for more details on
transforming data that contains zeros.

All predictors are still significant, but the fit is somewhat worse,
with an adjusted R^2 of 0.005.
```{r}
log_fixed_model <- fixed_model |>
  fit(NumFixed ~ log1p(GHStars) + log(NumCommits) + log(NumActiveMon), 
      data=valid_projects)
log_fixed_model |> extract_fit_engine() |> summary()
```
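
As a small illustration of why GHStars is passed through `log1p` rather than `log` in the model above, the sketch below applies both to a toy vector of star counts (`toy_stars` is made-up data). A zero maps to `-Inf` under `log`, which `lm` cannot use, while `log1p` maps it to 0.

```{r}
# Toy star counts including a zero (not real project data).
toy_stars <- c(0, 1, 10, 100)

toy_log   <- log(toy_stars)     # log(0) is -Inf
toy_log1p <- log1p(toy_stars)   # log(1 + 0) is 0

tibble::tibble(stars = toy_stars, log = toy_log, log1p = toy_log1p)
```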


### Relationship between fixed vulnerabilities and number of commits

There are 3785 fixed files in projects with only a single commit. Projects with
a single commit fix <1% of their vulnerable files, but how can this be?

```{r}
files_1commit <- wocnona |> filter(NumCommits == 1)
tibble(onecommit=files_1commit |> count() |> pull(),
       fixed=files_1commit |> filter(status=="fixed") |> count() |> pull(),
       percent_all=100*fixed/onecommit,
       percent_fixed=100*fixed/nrow(fixedvulns))
```

Let's examine these fixed files in single-commit projects in more detail. If we
add a filter for ValidDates, we find that zero rows remain, so all of the fixed
files in single-commit projects result from invalid dates. This reinforces that
we need to be sure to exclude rows that do not have valid dates.
```{r}
files_1fix <- files_1commit |>
  filter(status == "fixed") |>
  filter(ValidDates == "OK")
files_1fix
```

Let us create a version of the WoC data restricted to rows with valid dates.
```{r}
woc_valid <- wocnona |> filter(ValidDates == "OK")
nrow(woc_valid)
```

The majority of fixed files (81.9%) are in projects with 10 or more commits.
Projects with 10 or more commits fix 6.0% of their vulnerable cloned files.
```{r}
files_10commits <- woc_valid |> filter(NumCommits >= 10)
tibble(tencommits=files_10commits |> count() |> pull(),
       fixed=files_10commits |> filter(!is.na(TimeVulnRemained)) |> count() |> pull(),
       percent_all=100*fixed/tencommits,
       percent_fixed=100*fixed/nrow(fixedvulns))
```

### Relationship between fixed vulnerabilities and number of authors

Projects with a single author fix only 2.3% of their vulnerable cloned files.
About 36.3% of all fixed vulnerable clones are in single author projects.
```{r}
files_1author <- woc_valid |> filter(NumAuthors == 1)
tibble(oneauthor=files_1author |> count() |> pull(),
       fixed=files_1author |> filter(!is.na(TimeVulnRemained)) |> count() |> pull(),
       percent_all=100*fixed/oneauthor,
       percent_fixed=100*fixed/nrow(fixedvulns))
```

Substantially more vulnerable clones are fixed (10.3%) in projects with 10 or
more authors. About 21.8% of all fixed vulnerable clones are in projects with
10 or more authors.
```{r}
files_10authors <- woc_valid |> filter(NumAuthors >= 10)
tibble(tenauthors=files_10authors |> count() |> pull(),
       fixed=files_10authors |> filter(!is.na(TimeVulnRemained)) |> count() |> pull(),
       percent_all=100*fixed/tenauthors,
       percent_fixed=100*fixed/nrow(fixedvulns))
```
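
The commit and author comparisons above all repeat the same four-column calculation. A minimal sketch of how it could be factored into one helper follows. The function name `fix_rate` and the toy data are hypothetical, and we assume, as in the chunks above, that a non-NA `TimeVulnRemained` marks a fixed file.

```{r}
library(dplyr)

# Hypothetical helper: fix rate within a subset of files, plus the share
# of all fixed files that fall into that subset.
fix_rate <- function(files, total_fixed) {
  files |>
    summarize(nfiles = n(),
              fixed = sum(!is.na(TimeVulnRemained)),
              percent_all = 100 * fixed / nfiles,
              percent_fixed = 100 * fixed / total_fixed)
}

# Toy data: three files in single-author projects, two in larger ones.
toy_files <- tibble(NumAuthors = c(1, 1, 1, 12, 12),
                    TimeVulnRemained = c(NA, NA, 3, 0, NA))
toy_rate <- fix_rate(toy_files |> filter(NumAuthors == 1),
                     total_fixed = sum(!is.na(toy_files$TimeVulnRemained)))
toy_rate
```

With the real data, the call would take a filtered subset of `woc_valid` and `nrow(fixedvulns)` as the total.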

## RQ5: How long does it take to fix a cloned vulnerability?

We analyze the TimeVulnRemained column, which measures the time in days between FirstGoodTime and max(FirstBadTime, Orig fix), according to the file `VCAnalyzer/README.md`. An NA
value indicates that the vulnerability was never fixed. We have some questions about this
data column:

  - What do negative values mean?
  - Are very large negative values (-18631 days, about -51 years) meaningful?
  - How are values rounded?
  - What do zero values mean? Could they be small negative values before rounding, indicating a quick fix?
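
To make the first question concrete, the sketch below computes a fix time under the definition stated above, showing how a negative value can arise when the clone picks up the fixed file before the original project's fix date. It rests on several assumptions: timestamps are Unix seconds, `OrigFixTime` is a hypothetical column name standing in for "Orig fix", and no rounding is applied. The actual VCAnalyzer computation may differ.

```{r}
library(dplyr)

# Toy timestamps in Unix seconds (assumed units; OrigFixTime is a
# hypothetical name, not a column confirmed to exist in the dataset).
toy_times <- tibble(
  FirstBadTime  = c(1600000000, 1600000000),
  OrigFixTime   = c(1500000000, 1650000000),
  FirstGoodTime = c(1608640000, 1641360000)
) |>
  # Days between the first good blob and the later of the two start points.
  mutate(days = (FirstGoodTime - pmax(FirstBadTime, OrigFixTime)) / 86400)

toy_times$days
```

In the second toy row the good blob appears 100 days before the original fix time, so the difference is negative.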

We only analyze rows that have valid dates, as determined by the value of ValidDates being set to the string "OK". This eliminates a few hundred thousand rows.
```{r}
woctime <- wocnona |> 
    filter(ValidDates == "OK")
nrow(wocnona) - nrow(woctime)
```

The large majority (96.7%) of cloned vulnerabilities are never fixed.
```{r}
unfixed <- woctime |> filter(is.na(TimeVulnRemained)) |> count() |> pull()
tibble(unfixed=unfixed, percent=100*unfixed/nrow(woctime))
```

Let us create a data frame containing only the fixed vulnerabilities. There are
103,196 fixed vulnerabilities in the woctime dataset.
```{r}
fixedvulns <- woctime |> filter(!is.na(TimeVulnRemained))
fixedvulns |> select(TimeVulnRemained) |> summary()
```

If vulnerabilities are fixed, many vulnerabilities (40.5% of those that are
ever fixed) are fixed in less than 1 day.  _FIXME: is that what 0 means?_

```{r}
zerofixed <- woctime |> filter(TimeVulnRemained == 0) |> count() |> pull()
tibble(zerofixed=zerofixed, 
       pct_total=100*zerofixed/nrow(woctime),
       pct_fixed=100*zerofixed/nrow(fixedvulns))
```

While 0 is the most common time to fix, 1, 2, and 3 days are also very common, as is 19 for some reason.
```{r}
daystofix <- fixedvulns |> 
    group_by(TimeVulnRemained) |> 
    select(TimeVulnRemained) |> 
    summarize(nvulns=n()) |> 
    arrange(desc(nvulns))
daystofix
```

Visualize vulnerabilities with fix times strictly between -30 and 30 days, excluding 0 to avoid shrinking the other bars to invisibility.
```{r fig.width=10}
daystofix_nooutliers <- daystofix |> 
  filter(TimeVulnRemained < 30 & TimeVulnRemained > -30) |> 
  filter(TimeVulnRemained != 0)
ggplot(data=daystofix_nooutliers, aes(x=TimeVulnRemained, y=nvulns)) + 
  geom_col() +
  ggtitle("Number of Vulnerabilities by Time to Fix")
```

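There are about twice as many distinct positive fix times as distinct negative fix times. Note that `daystofix` has one row per distinct fix time, so these counts are numbers of distinct values, not numbers of vulnerabilities.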

```{r}
daystofix_positive <- daystofix |> filter(TimeVulnRemained > 0) |> count() |> pull()
daystofix_negative <- daystofix |> filter(TimeVulnRemained < 0) |> count() |> pull()
tibble(negative = daystofix_negative, positive = daystofix_positive)
```
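
Because `daystofix` is already aggregated, counting its rows and counting vulnerabilities give different answers. The sketch below shows both counts side by side on a toy table with the same two columns. The values are made up, and the `direction` grouping column is introduced only for illustration.

```{r}
library(dplyr)

# Toy aggregated table: one row per distinct fix time.
toy_days <- tibble(TimeVulnRemained = c(-5, 1, 2),
                   nvulns = c(10, 3, 4))

toy_counts <- toy_days |>
  filter(TimeVulnRemained != 0) |>
  group_by(direction = if_else(TimeVulnRemained > 0, "positive", "negative")) |>
  summarize(distinct_times = n(),      # rows, i.e. distinct fix times
            vulns = sum(nvulns))       # vulnerabilities with those times
toy_counts
```

Here positive fix times have more distinct values, while negative fix times account for more vulnerabilities.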

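Let's compare the distribution of the negative and positive fix times. We see that the negative fix times extend much further from zero (a minimum of -18631 days compared to a maximum of 3530 days) and have a mean of much larger magnitude (-6717 days compared to 904 days). These summaries are taken over the distinct fix times in `daystofix` and are not weighted by the number of vulnerabilities.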
```{r}
daystofix |> select(TimeVulnRemained) |> filter(TimeVulnRemained < 0) |> summary()
```

```{r}
daystofix |> select(TimeVulnRemained) |> filter(TimeVulnRemained > 0) |> summary()
```

## Selecting a Subset of Active Projects

We could select active projects by GHStars (which will lose the 1.5% of projects that are from GitLab), NumCommits, or NumActiveMon. Since NumCommits has >90% correlation with NumAuthors, NumForks, and CommunitySize, and those metrics have 70-80% correlations with NumCore, we use only one metric from this highly correlated group.

Let's examine the sizes of the datasets of projects with many stars.
```{r}
gh10 <- projects |> filter(GHStars > 10) |> count() |> pull()
gh100 <- projects |> filter(GHStars > 100) |> count() |> pull()
gh1000 <- projects |> filter(GHStars > 1000) |> count() |> pull()
tibble(stars10=gh10, stars100=gh100, stars1000=gh1000)
```

Let's examine the sizes of the datasets of projects with many commits.
```{r}
nc10 <- projects |> filter(NumCommits > 10) |> count() |> pull()
nc100 <- projects |> filter(NumCommits > 100) |> count() |> pull()
nc1000 <- projects |> filter(NumCommits > 1000) |> count() |> pull()
tibble(commits10=nc10, commits100=nc100, commits1000=nc1000)
```

Let's filter on the same thresholds for both stars and commits and compare the
counts with those obtained from stars alone. The combined filter can never
return more projects than either single filter.
```{r}
both10 <- projects |> filter(GHStars > 10 & NumCommits > 10) |> count() |> pull()
both100 <- projects |> filter(GHStars > 100 & NumCommits > 100) |> count() |> pull()
both1000 <- projects |> filter(GHStars > 1000 & NumCommits > 1000) |> count() |> pull()
# Report the combined counts (not the stars-only gh* counts).
tibble(both10=both10, both100=both100, both1000=both1000)
```
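
The three threshold comparisons above could also be produced in one table, which makes it easier to see how much each filter shrinks the dataset. This is a minimal sketch on made-up data: `toy_projects` and `toy_thresholds` are hypothetical, and only the column names GHStars and NumCommits come from the analysis.

```{r}
library(dplyr)

# Toy projects (not WoC data).
toy_projects <- tibble(GHStars    = c(0, 5, 50, 500, 5000),
                       NumCommits = c(2000, 20, 5, 200, 1500))
toy_thresholds <- c(10, 100, 1000)

# One row per threshold: counts for stars, commits, and both together.
toy_sizes <- bind_rows(lapply(toy_thresholds, function(t) {
  tibble(threshold = t,
         stars   = sum(toy_projects$GHStars > t),
         commits = sum(toy_projects$NumCommits > t),
         both    = sum(toy_projects$GHStars > t & toy_projects$NumCommits > t))
}))
toy_sizes
```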