README.Rmd

---
title: "The Pandemic Penalty: COVID-19's Gendered Impact on Scientific Productivity"
author: "Megan Frederickson and Molly King"
date: "February 6, 2021"
output: github_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This repo contains the data and code for the analyses in:

King MM, Frederickson ME. The Pandemic Penalty: COVID-19's gendered impact on scientific productivity. Currently under review, preprint available on SocArXiv. The repo was updated in early February, 2021 after receiving reviewer comments.

In the paper, we quantified how the COVID-19 pandemic is affecting the gender breakdown of preprint submissions to [arXiv](https://arxiv.org/) and [bioRxiv](https://www.biorxiv.org/), two preprint servers that together cover many STEM fields. 

We used these packages:

```{r Load packages, message=FALSE, warning=FALSE}
#Load packages
library(tidyverse) #includes ggplot2, dplyr, readr, stringr
library(knitr)
library(cowplot)
library(gender)
library(aRxiv)
#install.packages("devtools") 
#devtools::install_github("nicholasmfraser/rbiorxiv") 
library(rbiorxiv) 
library(lubridate)
library(anytime)
library(car)
library(rcrossref)
library(lme4)
```

## arXiv submissions

We scraped submission data from arXiv, a preprint server for physics, math, computer science, statistics, and other quantitative disciplines. We used the aRxiv package to scrape the data, see:

Karthik, R. and K. Broman (2019). aRxiv: Interface to the arXiv API. R package version 0.5.19. https://CRAN.R-project.org/package=aRxiv
  
We began by scraping all records for March 15-April 15, 2020, during the COVID-19 pandemic, and for the same date range in 2019. Then, we expanded to scrape all the data for Jan 1, 2020 to June 30, 2020, inclusive. We scraped the data in batches, as recommended in the aRxiv package tutorial. For brevity, we are not reproducing the code here, but it is available in the [R markdown file](https://github.com/drfreder/king-and-frederickson/blob/master/README.Rmd) included in this repo.

```{r Scrape arXiv, include=FALSE, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
#Get all submissions between March 15, 2020 and April 15, 2020 (during the COVID-19 pandemic)
n.2020 <- arxiv_count(query = 'submittedDate:[202003150000 TO 202004152400]')
df.2020.1 <- arxiv_search(query = 'submittedDate:[202003150000 TO 202003212400]', limit=n.2020, batchsize=1000)
df.2020.2 <- arxiv_search(query = 'submittedDate:[202003220000 TO 202003282400]', limit=n.2020, batchsize=1000)
df.2020.3 <- arxiv_search(query = 'submittedDate:[202003290000 TO 202004032400]', limit=n.2020, batchsize=1000)
df.2020.4 <- arxiv_search(query = 'submittedDate:[202004040000 TO 202004092400]', limit=n.2020, batchsize=1000)
df.2020.5 <- arxiv_search(query = 'submittedDate:[202004100000 TO 202004121415]', limit=n.2020, batchsize=2000)
df.2020.6 <- arxiv_search(query = 'submittedDate:[202004121420 TO 202004152400]', limit=n.2020, batchsize=1000)
df.2020.full <- rbind(df.2020.1, df.2020.2, df.2020.3, df.2020.4, df.2020.5, df.2020.6)
n.2020-length(df.2020.full$id) #Check that the number of records matches
write.csv(df.2020.full, file="Data/arxiv_2020_data.csv")

#Get all submission between March 15, 2019 and April 15, 2019 (the same dates last year)
n.2019 <- arxiv_count(query = 'submittedDate:[201903150000 TO 201904152400]')
df.2019.1 <- arxiv_search(query = 'submittedDate:[201903150000 TO 201903222400]', limit=n.2019, batchsize=2000)
df.2019.2 <- arxiv_search(query = 'submittedDate:[201903230000 TO 201903292400]', limit=n.2019, batchsize=2000)
df.2019.3 <- arxiv_search(query = 'submittedDate:[201903300000 TO 201904052400]', limit=n.2019, batchsize=2000)
df.2019.4 <- arxiv_search(query = 'submittedDate:[201904060000 TO 201904122400]', limit=n.2019, batchsize=2000)
df.2019.5 <- arxiv_search(query = 'submittedDate:[201904130000 TO 201904152400]', limit=n.2019, batchsize=2000)
df.2019.full <- rbind(df.2019.1, df.2019.2, df.2019.3, df.2019.4, df.2019.5)
n.2019-length(df.2019.full$id)
write.csv(df.2019.full, file="Data/arxiv_2019_data.csv")

#Get all submissions between Jan. 1, 2020 and March 15, 2020
n.early2020 <- arxiv_count(query = 'submittedDate:[202001010000 TO 202003152400]')
df.early2020.1 <- arxiv_search(query = 'submittedDate:[202001010000 TO 202001152400]', limit=n.early2020, batchsize=2000)
df.early2020.2 <- arxiv_search(query = 'submittedDate:[202001160000 TO 202001312400]', limit=n.early2020, batchsize=2000)
df.early2020.3 <- arxiv_search(query = 'submittedDate:[202002010000 TO 202002152400]', limit=n.early2020, batchsize=2000)
df.early2020.4 <- arxiv_search(query = 'submittedDate:[202002160000 TO 202002292400]', limit=n.early2020, batchsize=2000)
df.early2020.5 <- arxiv_search(query = 'submittedDate:[202003010000 TO 202003152400]', limit=n.early2020, batchsize=2000)
df.early2020.full <- rbind(df.early2020.1, df.early2020.2, df.early2020.3, df.early2020.4, df.early2020.5)
n.early2020-length(df.early2020.full$id)
write.csv(df.early2020.full, file="Data/arxiv_early2020_data.csv")

#Get all submissions between Apr. 16, 2020 and April 30, 2020
n.update <- arxiv_count(query = 'submittedDate:[202004160000 TO 202004222400]')
df.update <- arxiv_search(query = 'submittedDate:[202004160000 TO 202004222400]', limit=n.update, batchsize=2000)
n.update - length(df.update$id)
write.csv(df.update, file="Data/arxiv_update2020_data.csv")

n.update.2 <- arxiv_count(query = 'submittedDate:[202004230000 TO 202004302400]')
df.update.2 <- arxiv_search(query = 'submittedDate:[202004230000 TO 202004302400]', limit=n.update.2, batchsize=2000)
n.update.2 - length(df.update.2$id)
write.csv(df.update.2, file="Data/arxiv_update2020_2_data.csv")

#May and June data
n.update.3 <- arxiv_count(query = 'submittedDate:[202005010000 TO 202006302400]')
df.update.3 <- arxiv_search(query = 'submittedDate:[202005010000 TO 202005152400]', limit=n.update.3, batchsize=2000)
df.update.4 <- arxiv_search(query = 'submittedDate:[202005160000 TO 202005312400]', limit=n.update.3, batchsize=2000)
df.update.5 <- arxiv_search(query = 'submittedDate:[202006010000 TO 202006152400]', limit=n.update.3, batchsize=2000)
df.update.6 <- arxiv_search(query = 'submittedDate:[202006160000 TO 202006302400]', limit=n.update.3, batchsize=2000)
df.update.7 <- rbind(df.update.3, df.update.4, df.update.5, df.update.6)
n.update.3 - length(df.update.7$id)

#n.update - length(df.update$id)
write.csv(df.update.7, file="Data/arxiv_updateMayJune2020_data.csv")
```
 
Next, we assigned gender to author names using the gender package, see:

Mullen, L. (2019). gender: Predict Gender from Names Using Historical Data. R package version 0.5.3, https://github.com/ropensci/gender. 

This package returns the probability that a first name belongs to a woman or a man by comparing the name to names in a database; we used the U.S. Social Security Administration baby names database. 

Please note: this is a brute force method of predicting gender, and it has many limitations, as discussed by the package authors on their GitHub repo and included links. By using this method, we are not assuming that individuals are correctly gendered in the resulting dataset, but merely that it provides insight into gender's effects in aggregate across the population of preprint authors. 

This code takes a while to run, so it is not run when rendering this markdown document.

```{r Predict arXiv author gender, eval=FALSE, message=FALSE, warning=FALSE}
#Not run
#First combine data for year-over-year comparison
df.2020 <- read.csv("Data/arxiv_2020_data.csv") #Read in data
df.2019 <- read.csv("Data/arxiv_2019_data.csv")
df.full <- rbind(df.2019, df.2020) #Combine in one dataframe

#Next combine data for early 2020 comparison
df.early2020 <- read.csv("Data/arxiv_early2020_data.csv")
df.update <- read.csv("Data/arxiv_update2020_data.csv")
df.update.2 <- read.csv("Data/arxiv_update2020_2_data.csv")
df.mayjune2020 <- read.csv("Data/arxiv_updateMayJune2020_data.csv")
df.all2020 <- rbind(df.2020, df.early2020, df.update, df.update.2, df.mayjune2020) #Full 2020 data

split.names <- function(x){strsplit(as.character(x), "|", fixed=TRUE)} #Function to split strings of author names
last.author <- function(x){gsub(".*\\|", "", as.character(x))} #Function to extract last author
first.author <- function(x){gsub("\\|.*", "", as.character(x))} #Function to extract first author

#For the year-over-year dataset
df.full$split.names <- lapply(df.full$authors, split.names) #Apply functions
df.full$first.author <- lapply(df.full$authors, first.author)
df.full$last.author <- lapply(df.full$authors, last.author)

all_first_names <- word(unlist(df.full$split.names),1) #Make a list of all author first names
#install.packages("genderdata", repos = "http://packages.ropensci.org") #In case you need the gender data package
gender <- gender(all_first_names, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)]) #Keep only unique names

#This loop is an inelegant way of counting the number of men and women authors for each paper
tmp <- NULL
for(i in 1:length(df.full$authors)){
  tmp <- as.data.frame(word(unlist(df.full$split.names[[i]]), 1))
  colnames(tmp) <- "name"
  tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
  df.full$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
  df.full$female.n[i] <-  sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}

#Predict first author gender (does not include sole authors)
df.full$author.n <- str_count(df.full$authors, pattern = "\\|")+1 #Count author number
df.full$first.author.first.name <- ifelse(df.full$author.n > 1, word(df.full$first.author, 1), NA)
gender <- gender(df.full$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.full$first.author.gender <- getgender[df.full$first.author.first.name]

#Predict last author gender (again, omits sole authors)
df.full$author.n <- str_count(df.full$authors, pattern = "\\|")+1 #Count author number
df.full$last.author.first.name <- ifelse(df.full$author.n > 1, word(df.full$last.author, 1), NA)
gender <- gender(df.full$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.full$last.author.gender <- getgender[df.full$last.author.first.name]

#Count middle authors for each gender
df.full$female.mid.authors.n <- ifelse(df.full$author.n > 1, (df.full$female.n - ifelse(df.full$first.author.gender %in% "female", 1, 0) - ifelse(df.full$last.author.gender %in% "female", 1, 0)), 0)
df.full$male.mid.authors.n <- ifelse(df.full$author.n > 1, (df.full$male.n - ifelse(df.full$first.author.gender %in% "male", 1, 0) - ifelse(df.full$last.author.gender %in% "male", 1, 0)), 0)

df.full.output <- as.data.frame(apply(df.full,2,as.character)) 
write.csv(df.full.output, "Data/arxiv_full_gender.csv") #Save data

#Same for the early 2020 dataset
df.all2020$split.names <- lapply(df.all2020$authors, split.names)
df.all2020$first.author <- lapply(df.all2020$authors, first.author)
df.all2020$last.author <- lapply(df.all2020$authors, last.author)

tmp <- NULL
all_first_names <- word(unlist(df.all2020$split.names),1)
gender <- gender(all_first_names, method = "ssa")
gender <- unique(gender[ , c(1,2,4)])

for(i in 1:length(df.all2020$authors)){
  tmp <- as.data.frame(word(unlist(df.all2020$split.names[[i]]), 1))
  colnames(tmp) <- "name"
  tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
  df.all2020$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
  df.all2020$female.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}

df.all2020$author.n <- str_count(df.all2020$authors, pattern = "\\|")+1 #Count author number
df.all2020$first.author.first.name <- ifelse(df.all2020$author.n > 1, word(df.all2020$first.author, 1), NA)
df.all2020$first.author.first.name <- word(df.all2020$first.author, 1)
gender <- gender(df.all2020$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.all2020$first.author.gender <- getgender[df.all2020$first.author.first.name]

df.all2020$last.author.first.name <- ifelse(df.all2020$author.n > 1, word(df.all2020$last.author, 1), NA)
gender <- gender(df.all2020$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.all2020$last.author.gender <- getgender[df.all2020$last.author.first.name]

df.all2020$female.mid.authors.n <- ifelse(df.all2020$author.n > 1, (df.all2020$female.n - ifelse(df.all2020$first.author.gender %in% "female", 1, 0) - ifelse(df.all2020$last.author.gender %in% "female", 1, 0)), 0)
df.all2020$male.mid.authors.n <- ifelse(df.all2020$author.n > 1, (df.all2020$male.n - ifelse(df.all2020$first.author.gender %in% "male", 1, 0) - ifelse(df.all2020$last.author.gender %in% "male", 1, 0)), 0)

df.all2020.output <- as.data.frame(apply(df.all2020,2,as.character))
write.csv(df.all2020.output, "Data/arxiv_all2020_gender.csv") #Save data
```

Next, we calculated some summary statistics for the arXiv dataset we assembled. We also determined which preprints list authors in alphabetical order in order to better understand author order conventions in this dataset. 

```{r Summarize arXiv dataset, message=FALSE, warning=FALSE}
df.full <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/arxiv_full_gender.csv") #Read in data
df.full <- df.full[!duplicated(df.full), ] #Remove duplicates, if any
df.full$year <- as.factor(year(as.Date(df.full$submitted))) #Extract year

df.all2020 <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/arxiv_all2020_gender.csv") #Read in data
df.all2020 <- df.all2020[!duplicated(df.all2020), ] #Remove duplicated rows, if any
df.all2020$year <- as.factor(year(as.Date(df.all2020$submitted))) #Extract year

all.arxiv <- rbind(df.full, df.all2020) #Combine datasets
all.arxiv <- all.arxiv[!duplicated(all.arxiv), ] #Remove duplicates
total.preprints <- length(all.arxiv$id) #Total number of preprints
total.authors <- sum(all.arxiv$author.n) #Total number of authors
total.authors.with.gender <- sum(all.arxiv$male.n+all.arxiv$female.n) #Total number of authors with gender inferred
per.gender <- round((total.authors.with.gender/total.authors)*100, 1) #Percent of authors with gender

year.arxiv.preprints <- length(df.full$"id") #Total number of preprints for year-over-year comparison
year.arxiv.authors <- sum(df.full[, "male.n"]+df.full[, "female.n"]) #Authors with gender inferred for year-over-year comparison

#preprints per year
yr.summary <- df.full %>% group_by(year) %>% summarize(n=n())

#Write a function to test whether author names are in alphabetical order
last.names.alpha <- function(x) {
      tmp <- as.data.frame(strsplit(as.character(x), "|", fixed=TRUE))
      colnames(tmp) <- c("names")
      tmp$names <- word(tmp$names, -1)
      tmp$names.alpha <- str_sort(tmp$names)
      tmp$match <- identical(tmp$names, tmp$names.alpha)
      ifelse(length(tmp$names) >= 2, ifelse(sum(tmp$match, na.rm=TRUE) > 0, "Y", "N"), NA)
}

#Apply alphabetical function
all.arxiv$authors.alpha <- lapply(all.arxiv$authors, last.names.alpha) #Apply functions
df.full$authors.alpha <- lapply(df.full$authors, last.names.alpha)
df.all2020$authors.alpha <- lapply(df.all2020$authors, last.names.alpha)

#Make 'big' categories
all.arxiv$big.category <- word(all.arxiv$primary_category,1,sep = "\\.")
```

There are `r total.preprints` preprints in the full arXiv dataset, with a total of `r format(total.authors, scientific=FALSE)` non-unique authors. We inferred the gender of `r total.authors.with.gender` authors, or `r per.gender`%, with the rest omitted from subsequent analyses. This a lower success rate for predicting author gender for the arXiv dataset than for the bioRxiv dataset (see below), reflecting the fact that arXiv preprints are more likely to list large consortia as authors (e.g., CMS Collaboration) or have authors who have names not in the U.S. Social Security names database. 

For just the comparison of March 15-April 15, 2020 with the same dates in 2019, there are `r year.arxiv.preprints` arXiv preprints with `r year.arxiv.authors` authors for whom we inferred gender.

We analyzed the data by authorship position (e.g., first, middle, last, or sole author). In many but not all STEM fields, it is convention to list the Principal Investigator (PI) as last author on the paper, and the person who led the study and wrote the first draft of the manuscript as first author on the paper, with other authors as middle authors. However, some fields list authors in alphabetical order. In the full arXiv dataset, among preprints with multiple (i.e., 2 or more) authors, there are `r length(all.arxiv[all.arxiv$authors.alpha == "N", "authors.alpha"])` preprints with authors not in alphabetical order, and `r length(all.arxiv[all.arxiv$authors.alpha == "Y", "authors.alpha"])` preprints with authors in alphabetical order. In other words, `r round((length(all.arxiv[all.arxiv$authors.alpha == "Y", "authors.alpha"])/(length(all.arxiv[all.arxiv$authors.alpha == "N", "authors.alpha"])+length(all.arxiv[all.arxiv$authors.alpha == "Y", "authors.alpha"])))*100, 1)`% of preprints with multiple authors list authors in alphabetical order.

### Exploring author order across fields

Reviewers also asked for some more information about the fields represented in the arXiv dataset. We used the category taxonomy available on the arXiv website itself (https://arxiv.org/category_taxonomy) to categorize preprints into 8 broad fields: physics, math, computer science, statistics, quantitative biology, economics, quantitative finance, and electrical engineering and systems science. We further combined economics and quantitative finance preprints into a single category, because each one had few preprints. How does the number of arXiv preprints compare among fields?

```{r Author order, message=FALSE, warning=FALSE, dpi=300}
#Bin all physics sub-disciplines into "physics"
all.arxiv$field <- ifelse(all.arxiv$big.category == "astro-ph" | all.arxiv$big.category == "cond-mat" | all.arxiv$big.category == "gr-qc" | all.arxiv$big.category == "math-ph" | all.arxiv$big.category == "quant-ph" | all.arxiv$big.category == "nucl-ex" | all.arxiv$big.category == "nucl-th" | all.arxiv$big.category == "hep-ex" | all.arxiv$big.category == "hep-ph" | all.arxiv$big.category == "hep-th" | all.arxiv$big.category == "hep-lat" | all.arxiv$big.category == "nlin", "physics", all.arxiv$big.category)

#Lump quantitative finance and economics together
all.arxiv$field <- ifelse(all.arxiv$field == "q-fin", "econ", all.arxiv$field)

#Count preprints per field
field.summary <- all.arxiv %>% group_by(field) %>% summarize(n=n())

#Make a figure
p25 <- ggplot(data=field.summary, aes(x=reorder(field, n), y=n))+xlab("Field")+ylab("Preprints (no.)")+geom_bar(stat="identity", alpha=0.5)+theme_cowplot()+theme(axis.text.x=element_text(angle=90))+labs(title="arXiv", subtitle="by field")
p25
```

The arXiv dataset is dominated by preprints in physics, math, and computer science. 

How many preprints in each field list authors in alphabetical order? 

```{r Fields in alphabetical order, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
#Count the number of preprints with authors in alphabetical order, and the total number of preprints in each field
author.order.summary <- subset(all.arxiv, !is.na(authors.alpha)) %>% group_by(field, authors.alpha) %>% summarize(n=n())
author.order.summary.wide <- spread(author.order.summary, key=authors.alpha, value=n)
author.order.summary.wide$per <- 100*(author.order.summary.wide$Y/(author.order.summary.wide$N+author.order.summary.wide$Y))
author.order.summary.wide$field <-factor(author.order.summary.wide$field, levels=c("econ", "q-bio", "stat", "eess", "math", "cs", "physics"))

#Plot percent of preprints with authors in alphabetical order
p26 <- ggplot(data=author.order.summary.wide, aes(x=field, y=per))+xlab("Field")+ylab("Authors in alphabetical order (%)")+geom_bar(stat="identity", alpha=0.5)+theme_cowplot()+theme(axis.text.x=element_text(angle=90))+labs(title="arXiv", subtitle="by field")
p26
p40 <- plot_grid(p25, p26, labels="auto")
save_plot("Figure S1.png", p40, base_width = 8, base_height =4, dpi=600)
```

Economics and quantitative finance (combined into "econ" in the figure) and math are the fields with most preprints listing authors in alphabetical order. These are the fields for which author position (e.g., first, middle, or last) is least likely to be meaningful. 

### Comparing arXiv preprint authorships between Mar/Apr 2019 and Mar/Apr 2020, by gender

How many men versus women authorships of preprints were there in Mar/Apr 2020, compared to the same dates last year? Note: this is not the number of unique authors; it includes authors who submitted multiple preprints. Thus, 1 "authorship" equals 1 author on 1 paper. 

```{r Visualize arXiv year-over-year data, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
all <- as.data.frame(ungroup(df.full %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019
yr.labels = c("Mar. 15 - Apr. 15, 2019", "Mar. 15 - Apr. 15, 2020") #Set legend labels
colours1 = c("#f4a582","#ca0020") #Set colours
fontsize = 10

#Make figure comparing 2020 to 2019
p1 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2),  y=c(9000,29000), label = paste0("+", round(all.t$per.dif.1920[1:2], 1), "%"))+labs(title="arXiv", subtitle="all authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p1 
```

arXiv preprint submissions are up overall, but the number of men authorships is currently growing faster than the number of women authorships. Comparing preprint submissions in late March and early April 2020 to the same dates in 2019, the number of men authorships has grown more than the number of women authorships, both as a percent change and in absolute terms. 

We further investigated this pattern in the fields with the greatest numbers of preprints in arXiv, namely physics, computer science, and math. 

```{r Visualize arXiv year-over-year data by field, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
#Bin all physics sub-disciplines into "physics"
df.full$big.category <- word(df.full$primary_category,1,sep = "\\.")
df.full$field <- ifelse(df.full$big.category == "astro-ph" | df.full$big.category == "cond-mat" | df.full$big.category == "gr-qc" | df.full$big.category == "math-ph" | df.full$big.category == "quant-ph" | df.full$big.category == "nucl-ex" | df.full$big.category == "nucl-th" | df.full$big.category == "hep-ex" | df.full$big.category == "hep-ph" | df.full$big.category == "hep-th" | df.full$big.category == "hep-lat" | df.full$big.category == "nlin", "physics", df.full$big.category)
df.full$field <- ifelse(df.full$field == "q-fin", "econ", df.full$field) #Lump quantitative finance and economics together

#Set labels and colours
yr.labels = c("Mar. 15 - Apr. 15, 2019", "Mar. 15 - Apr. 15, 2020") #Set legend labels
colours1 = c("#f4a582","#ca0020") #Set colours
fontsize = 10

#Physics
cat <- as.character(unique(df.full$field)[[1]])
all <- as.data.frame(ungroup(subset(df.full, field == cat) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019

#Make figure comparing 2020 to 2019
p27 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2),  y=c(4700,13800), label = paste0(round(all.t$per.dif.1920[1:2], 1), "%"))+labs(title="arXiv", subtitle=paste0(cat, ", all authorships"))+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))

#Math
cat <- as.character(unique(df.full$field)[[2]])
all <- as.data.frame(ungroup(subset(df.full, field == cat) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019

#Make figure comparing 2020 to 2019
p28 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+labs(title="arXiv", subtitle=paste0(cat, ", all authorships"))+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))+ggplot2::annotate("text", x=c(1, 2),  y=c(1200,4000), label = c(paste0(round(all.t$per.dif.1920[1], 1), "%"), paste0("+", round(all.t$per.dif.1920[2], 1), "%")))

#Computer Science
cat <- as.character(unique(df.full$field)[[3]])
all <- as.data.frame(ungroup(subset(df.full, field == cat) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019

#Make figure comparing 2020 to 2019
p29 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2),  y=c(3000,9500), label = paste0("+", round(all.t$per.dif.1920[1:2], 1), "%"))+labs(title="arXiv", subtitle="cs, all authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))

p30 <- plot_grid(p27, p28, p29, nrow=1)
p30
save_plot("all_authorships_by_field.png", p30, base_height=4, base_width=8, dpi=600)
```

There has been a greater decline in women authorships than men authorships in physics and math, but not in computer science. In computer science, women added more authorships as a percent change, but not in absolute terms, between Mar/Apr 2019 and Mar/Apr 2020.

### Comparing single-authored arXiv preprints between Mar/Apr 2019 and Mar/Apr 2020, by gender

How many arXiv preprints were authored by a single woman versus a single man in Mar/Apr, 2020, compared to the same dates last year? 

```{r Sole authors,  message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
sole.authors <- as.data.frame(ungroup(subset(df.full, author.n == 1) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
sole.long <- gather(sole.authors, Gender, number, Male:Female) #Make wide data long
sole.authors.t <- as.data.frame(t(sole.authors[,-1])) #Transpose
colnames(sole.authors.t) <- c("2019", "2020") #Fix column names
sole.authors.t$per.dif.1920 <- ((sole.authors.t$`2020`-sole.authors.t$`2019`)/(sole.authors.t$`2019`))*100 #Calculate percent change, 2020 over 2019

#Make figure for single-authored preprints
p2 <- ggplot(data=sole.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2),  y=c(270,1350), label = paste0("+", round(sole.authors.t$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="arXiv", subtitle = "sole authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p2
```

Single-authored arXiv submissions are also up overall, but again the number of men authorships is currently growing faster than the number of women authorships, both as a percent change and in absolute terms. 

### Comparing arXiv preprint submissions by authorship position between Mar/Apr 2019 and Mar/Apr 2020, by gender

What if we break it down further by author position, so first, middle, or last? First up, first authorships of multi-authored papers.

##### First authorships

```{r arXiv year-over-year first authors, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
first.authors <- subset(df.full, !is.na(first.author.gender)) %>% group_by(year,first.author.gender) %>% summarize(n=n()) #Summarize by year
first.authors$per.dif.1920 <- c(first.authors[3,3]/first.authors[1,3],first.authors[4,3]/first.authors[2,3] ,first.authors[3,3]/first.authors[1,3], first.authors[4,3]/first.authors[2,3])
first.authors$per.dif.1920 <- (as.numeric(first.authors$per.dif.1920)-1)*100 #Calculate percent change, 2020 over 2019

p3 <- ggplot(data=first.authors, aes(fill=as.factor(year), y=n, x=first.author.gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2),  y=c(2000,6550), label = paste0("+", round(first.authors$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="arXiv", subtitle = "first authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p3
```

The number of women first authorships has grown slightly faster than the number of men first authorships, as a percent change year-over-year (but not in absolute terms).

##### Last authorships

What about last, or "senior," authorships of multi-authored papers? 

```{r arXiv last authors year-over-year, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
last.authors <- subset(df.full, !is.na(last.author.gender)) %>% group_by(year,last.author.gender) %>% summarize(n=n()) #Summarize by year
last.authors$per.dif.1920 <- c(last.authors[3,3]/last.authors[1,3],last.authors[4,3]/last.authors[2,3] ,last.authors[3,3]/last.authors[1,3], last.authors[4,3]/last.authors[2,3])
last.authors$per.dif.1920 <- (as.numeric(last.authors$per.dif.1920)-1)*100 #Calculate percent change, 2020 over 2019

p4 <- ggplot(data=last.authors, aes(fill=as.factor(year), y=n, x=last.author.gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2),  y=c(1900,7020), label = paste0("+", round(last.authors$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="arXiv", subtitle = "last authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p4
```

The number of men last authorships has grown substantially year-over-year, but the number of women last authorships is almost unchanged from 2019. 

##### Middle authorships

And finally, middle authorships, or all authorships on multi-authored papers that are not in the first or last position. 

```{r arXiv middle authors year-over-year, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
middle <- as.data.frame(ungroup(df.full %>% group_by(year) %>% summarize(Female = sum(female.mid.authors.n, na.rm=TRUE), Male = sum(male.mid.authors.n, na.rm=TRUE)))) #Summarize by year
middle.long <- gather(middle, Gender, number, Female:Male) #Make wide data long
middle.t <- as.data.frame(t(middle[,-1])) #Transpose
colnames(middle.t) <- c("2019", "2020") #Fix column names
middle.t$per.dif.1920 <- ((middle.t$`2020`-middle.t$`2019`)/(middle.t$`2019`))*100 #Calculate percent change, 2020 over 2019
yr.labels = c("Mar. 15 - Apr. 15, 2019", "Mar. 15 - Apr. 15, 2020") #Set legend labels
colours1 = c("#f4a582","#ca0020") #Set colours
fontsize = 10

#Make figure comparing 2020 to 2019
p5 <- ggplot(data=middle.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2),  y=c(4770,14450), label = c(paste0("+", round(middle.t$per.dif.1920[1], 1), "%"), paste0("+", round(middle.t$per.dif.1920[2], 1), "%")))+labs(title="arXiv", subtitle="middle authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p5
```

Again, the number of men middle authorships increased more than the number of women middle authorships, year-over-year.

### Comparing arXiv preprint submissions in the months before and during the COVID-19 pandemic, by gender

Next, we looked back over the months leading up to widespread stay-at-home orders and school and childcare closures that North Americans experienced beginning in late March or early April, 2020. These measures were implemented to different degrees and on different dates in different parts of the world, but we assumed their effects would be most pronounced (globally) starting in March, 2020 and thereafter.

```{r Early 2020 arXiv all authors analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}

#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
who <- "2020-03-11" #Date the WHO declared COVID-19 a pandemic, for reference
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))
colours2 = c("#7fbf7b","#af8dc3") #Set colours

p6 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p6

#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm1 <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm1)
Anova(lm1, type=3)
#plot(lm1)
```

The number of male authorships is growing faster than the number of female authorships during the pandemic.

Again, we repeated the analysis for physics, math, and computer science separately. But first, let's check how many physics preprints are about COVID-19 in the full arXiv dataset. 

```{r Physics arXiv preprints about COVID-19}
#Check how many physics preprints are about COVID-19
physics.df <- subset(all.arxiv, field == "physics")
physics.df$COVID.in.abstract <- grepl('COVID-19|SARS-CoV-2|COVID|Covid|coronavirus', physics.df$abstract)
```

We calculated how many physics preprints mention COVID-19, SARS-CoV-2, or coronavirus in the abstract, and found that only `r sum(physics.df$COVID.in.abstract)` out of `r length(physics.df$COVID.in.abstract)`, or `r round(100*(sum(physics.df$COVID.in.abstract)/length(physics.df$COVID.in.abstract)),1)`%, mention one of these terms. 

Now for the linear model results for each of the 3 big fields in the arXiv dataset (physics, math, and computer science). For sake of completion, we also did the other 4 fields (economics and quantitative finance, electrical engineering and systems science, statistics, and quantitative biology). 

```{r Early 2020 arXiv all authors analysis by field, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
#Bin all physics sub-disciplines into "physics"
df.all2020$big.category <- word(df.all2020$primary_category,1,sep = "\\.")
df.all2020$field <- ifelse(df.all2020$big.category == "astro-ph" | df.all2020$big.category == "cond-mat" | df.all2020$big.category == "gr-qc" | df.all2020$big.category == "math-ph" | df.all2020$big.category == "quant-ph" | df.all2020$big.category == "nucl-ex" | df.all2020$big.category == "nucl-th" | df.all2020$big.category == "hep-ex" | df.all2020$big.category == "hep-ph" | df.all2020$big.category == "hep-th" | df.all2020$big.category == "hep-lat" | df.all2020$big.category == "nlin", "physics", df.all2020$big.category)

#Lump quantitative finance and economics together
df.all2020$field <- ifelse(df.all2020$field == "q-fin", "econ", df.all2020$field)

#Set colours and reference date
colours2 = c("#7fbf7b","#af8dc3") #Set colours
who <- "2020-03-11" #Date the WHO declared COVID-19 a pandemic, for reference

#Physics
#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "physics") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))

p31 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="physics, all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")

#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "physics") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm.physics <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.physics)
Anova(lm.physics, type=3)

#Math
#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "math") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))

p32 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="math, all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")

#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "math") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm.math <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.math)
Anova(lm.math, type=3)

#Computer science
#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "cs") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))

p33 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="cs, all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")

#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "cs") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm.cs <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.cs)
Anova(lm.cs, type=3)

p34 <- plot_grid(p31, p32, p33, nrow=1)
p34
save_plot("early2020_by_field.png", p34, base_height=4, base_width=10, dpi=600)

#Economics and quantitative finance
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "econ") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm.econ <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.econ)
Anova(lm.econ, type=3)

#Quantitative biology
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "q-bio") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm.qbio <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.qbio)
Anova(lm.qbio, type=3)

#Statistics
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "stat") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm.stat <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.stat)
Anova(lm.stat, type=3)

#Electrical engineering and systems science
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "eess") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)

#Fit mdoel
lm.eess <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.eess)
Anova(lm.eess, type=3)
```

The gender gap in authorships is growing in physics, math, and computer science. 

### Comparing single-authored arXiv preprint submissions in the months before and during the COVID-19 pandemic, by gender

Again, what about for sole authorships? How does early 2020 compare to during the pandemic?

```{r Early 2020 arXiv sole author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}

arxiv.sole <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & author.n==1) %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.sole.long <- gather(arxiv.sole, gender, n, female.n:male.n)
colnames(arxiv.sole.long) <- c("week", "gender", "n")
arxiv.sole.long$gender <- factor(arxiv.sole.long$gender, rev(levels(as.factor(arxiv.sole.long$gender))))

p7 <- ggplot(data=subset(arxiv.sole.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="sole authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p7

#Model
arxiv.sole <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & author.n == 1) %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE)))) #Summarize by month
arxiv.sole.long <- gather(arxiv.sole, gender, n, female.n:male.n)
arxiv.sole.long$day_of_week <- wday(arxiv.sole.long$`as.Date(submitted)`, label=TRUE)
arxiv.sole.long$date <- as.numeric(arxiv.sole.long$`as.Date(submitted)`)-18261
arxiv.sole.long$day_of_week <- factor(arxiv.sole.long$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm2 <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.sole.long)
summary(lm2)
Anova(lm2, type=3)
#plot(lm2)
```

Again, the number of preprints single-authored by men is growing faster than the number of preprints single-authored by women. 

### Comparing arXiv preprint submissions by authorship position in the months before and during the COVID-19 pandemic, by gender

Let's do the same for first, middle, and last authorships on multi-authored preprints.

##### First authorships

```{r Early 2020 arXiv first author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}

arxiv.first <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(first.author.gender)) %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1)), first.author.gender) %>% summarize(n=n())))
colnames(arxiv.first) <- c("week", "gender", "n")
arxiv.first$gender <- factor(arxiv.first$gender, rev(levels(as.factor(arxiv.first$gender))))

p8 <- ggplot(data=subset(arxiv.first, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date("2020-03-11")), linetype="dashed")+labs(title="arXiv", subtitle="first authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p8

#Model
arxiv.first <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(first.author.gender)) %>% group_by(as.Date(submitted), first.author.gender) %>% summarize(n=n())))
arxiv.first$day_of_week <- wday(arxiv.first$`as.Date(submitted)`, label=TRUE)
arxiv.first$date <- as.numeric(arxiv.first$`as.Date(submitted)`)-18261
arxiv.first$day_of_week <- factor(arxiv.first$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm3 <- lm(sqrt(n)~date*first.author.gender+day_of_week, data=arxiv.first)
summary(lm3)
Anova(lm3, type=3)
#plot(lm3)
```

Men first authorships are growing faster than women first authorships.

##### Last authors

```{r Early 2020 arXiv last author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
arxiv.last <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(last.author.gender)) %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1)), last.author.gender) %>% summarize(n=n())))
colnames(arxiv.last) <- c("week", "gender", "n")
arxiv.last$gender <- factor(arxiv.last$gender, rev(levels(as.factor(arxiv.last$gender))))

p9 <- ggplot(data=subset(arxiv.last, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date("2020-03-11")), linetype="dashed")+labs(title="arXiv", subtitle="last authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p9

#Model
arxiv.last <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(last.author.gender)) %>% group_by(as.Date(submitted), last.author.gender) %>% summarize(n=n())))
arxiv.last$day_of_week <- wday(arxiv.last$`as.Date(submitted)`, label=TRUE)
arxiv.last$date <- as.numeric(arxiv.last$`as.Date(submitted)`)-18261
arxiv.last$day_of_week <- factor(arxiv.last$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm4 <- lm(sqrt(n)~date*last.author.gender+day_of_week, data=arxiv.last)
summary(lm4)
Anova(lm4, type=3)
#plot(lm4)
```

Men last authorships are growing faster than women last authorships.

### Middle authors

```{r Early 2020 arXiv middle author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
arxiv.middle <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.mid.authors.n, na.rm=TRUE), male.n=sum(male.mid.authors.n, na.rm=TRUE))))
arxiv.middle.long <- gather(arxiv.middle, gender, n, female.n:male.n)
colnames(arxiv.middle.long) <- c("week", "gender", "n")
arxiv.middle.long$gender <- factor(arxiv.middle.long$gender, rev(levels(as.factor(arxiv.middle.long$gender))))

p10 <- ggplot(data=subset(arxiv.middle.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+labs(title="arXiv", subtitle="middle authorships")+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p10

#Model
arxiv.middle <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.mid.authors.n, na.rm=TRUE), male.n=sum(male.mid.authors.n, na.rm=TRUE))))
arxiv.middle.long <- gather(arxiv.middle, gender, n, female.n:male.n)
arxiv.middle.long$day_of_week <- wday(arxiv.middle.long$`as.Date(submitted)`, label=TRUE)
arxiv.middle.long$day <- as.numeric(arxiv.middle.long$`as.Date(submitted)`)-18261
arxiv.middle.long$day_of_week <- factor(arxiv.middle.long$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm5 <- lm(sqrt(n)~day*gender+day_of_week, data=arxiv.middle.long)
summary(lm5)
Anova(lm5, type=3)
#plot(lm5)
```

Men middle authorships are growing faster than women middle authorships.

### Omnibus figures

```{r Combine visualizations for omnibus figures, message=FALSE, warning=FALSE, fig.width=16, fig.height=16, dpi=300}
#Year over year
p11 <- plot_grid(p1, p2, p3, p4, p5, align = 'v', axis='l')
p11
save_plot("year-over-year_arxiv.png", p11, base_height=8, base_width=8, dpi=600)

#p11.v2 <- plot_grid(p1, p2, p3, p4, p5, align = 'v', axis='l', nrow=1)
#p11.v2
#save_plot("year-over-year_arxiv.v2.png", p11.v2, base_height=4, base_width=13.5, dpi=600)

p12 <- plot_grid(p6, p7, p8, p9, p10, align='v', axis='l')
p12
save_plot("early2020_arxiv.png", p12, base_height=8, base_width=10, dpi=600)

#p12.v2 <- plot_grid(p6, p7, p8, p9, p10, align='v', axis='l', nrow=1)
#p12.v2
#save_plot("early2020_arxiv.v2.png", p12.v2, base_height=4.5, base_width=16, dpi=600)
```

## bioRxiv submissions

Next, we scraped submission data from bioRxiv, the main preprint server for biology. We used the rbiorxiv package, see:

Fraser, N (2020). rbiorxiv. R package, https://github.com/nicholasmfraser/rbiorxiv

We scraped the same date ranges as for the arXiv analysis, above.

```{r Scrape biorxiv, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
#Get all submissions between Jan 1, 2020 and April 22, 2020
df.b.2020 <- biorxiv_content(from = "2020-01-01", to = "2020-04-22", limit = "*", format = "df")
#Get all submissions for March 15 to April 15, 2019
df.b.2019 <- biorxiv_content(from = "2019-03-15", to = "2019-04-15", limit = "*", format = "df")

#Update with April 22 to April 30, 2020 data
df.b.2020.update <- biorxiv_content(from = "2020-04-22", to = "2020-04-30", limit = "*", format = "df")

write.csv(df.b.2020, "Data/biorxiv_2020_data.csv")
write.csv(df.b.2019, "Data/biorxiv_2019_data.csv")
write.csv(df.b.2020.update, "Data/biorxiv_2020_update_data.csv")

#Update with May 1 to June 30, 2020 data
df.b.2020.may <- biorxiv_content(from = "2020-05-01", to = "2020-05-15", limit = "*", format = "df")
df.b.2020.may2 <- biorxiv_content(from = "2020-05-16", to = "2020-05-31", limit = "*", format = "df")
df.b.2020.june <- biorxiv_content(from = "2020-06-01", to = "2020-06-15", limit = "*", format = "df")
df.b.2020.june2 <- biorxiv_content(from = "2020-06-16", to = "2020-06-30", limit = "*", format = "df")
df.b.2020.mayjune <- rbind(df.b.2020.may, df.b.2020.may2, df.b.2020.june, df.b.2020.june2)
write.csv(df.b.2020.mayjune, "Data/biorxiv_2020_MayJune_data.csv")
```

Note that the bioRxiv API only returns first names for corresponding authors, and not for all authors, so we used a work-around to overcome this issue and get first names for all authors. Briefly, we used the package rcrossref to extract full citations (with first names of all authors) from all bioRxiv dois:

Scott Chamberlain, Hao Zhu, Najko Jahn, Carl Boettiger and Karthik Ram (2020). rcrossref:
  Client for Various 'CrossRef' 'APIs'. R package version 1.0.0.
  https://CRAN.R-project.org/package=rcrossref

(Thanks to @palolili23 for suggesting rcrossref as a work-around!)

```{r bioRxiv first names, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
df.b.2019 <- read.csv("Data/biorxiv_2019_data.csv")
df.b.all2020 <- read.csv("Data/biorxiv_2020_data.csv")
df.b.2020.update <- read.csv("Data/biorxiv_2020_update_data.csv")
df.b.2020.mayjune <- read.csv("Data/biorxiv_2020_MayJune_data.csv")

#Get first names for 2019 data
for (i in 1:length(df.b.2019$doi)){
 tmp_doi <- df.b.2019$doi[i] 
 tryCatch({
    if (is.na(tmp_doi)) next()
    df.b.2019$bib[i] <- cr_cn(dois=tmp_doi)
    tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.2019$bib[i])
    tmp_authors <- sub("\\},.*", "", tmp_authors)
    df.b.2019$authors_full[i] <- tmp_authors
 }, error=function(e){})
}
write.csv(df.b.2019, "Data/biorxiv_2019_data_first.names.csv")

#Get first names for updated 2020 data
for (i in 1:length(df.b.2020.update$doi)){
 tmp_doi <- df.b.2020.update$doi[i] 
 tryCatch({
    if (is.na(tmp_doi)) next()
    df.b.2020.update$bib[i] <- cr_cn(dois=tmp_doi)
    tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.2020.update$bib[i])
    tmp_authors <- sub("\\},.*", "", tmp_authors)
    df.b.2020.update$authors_full[i] <- tmp_authors
 }, error=function(e){})
}

write.csv(df.b.2020.update, "Data/biorxiv_2020_update_data_first.names.csv")

#Get first names for 2020 data
for (i in 1:5000){
 tmp_doi <- df.b.all2020$doi[i] 
 tryCatch({
    if (is.na(tmp_doi)) next()
    df.b.all2020$bib[i] <- cr_cn(dois=tmp_doi)
    tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.all2020$bib[i])
    tmp_authors <- sub("\\},.*", "", tmp_authors)
    df.b.all2020$authors_full[i] <- tmp_authors
 }, error=function(e){})
}

for (i in 5001:10000){
 tmp_doi <- df.b.all2020$doi[i] 
 tryCatch({
    if (is.na(tmp_doi)) next()
    df.b.all2020$bib[i] <- cr_cn(dois=tmp_doi)
    tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.all2020$bib[i])
    tmp_authors <- sub("\\},.*", "", tmp_authors)
    df.b.all2020$authors_full[i] <- tmp_authors
 }, error=function(e){})
}

for (i in 10000:length(df.b.all2020$doi)){
 tmp_doi <- df.b.all2020$doi[i] 
 tryCatch({
    if (is.na(tmp_doi)) next()
    df.b.all2020$bib[i] <- cr_cn(dois=tmp_doi)
    tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.all2020$bib[i])
    tmp_authors <- sub("\\},.*", "", tmp_authors)
    df.b.all2020$authors_full[i] <- tmp_authors
 }, error=function(e){})
}
write.csv(df.b.all2020, "Data/biorxiv_2020_data_first.names.csv")

#Get first names for May/June 2020 data    
for (i in 1:length(df.b.2020.mayjune$doi)){
 tmp_doi <- df.b.2020.mayjune$doi[i] 
 tryCatch({
    if (is.na(tmp_doi)) next()
    df.b.2020.mayjune$bib[i] <- cr_cn(dois=tmp_doi)
    tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.2020.mayjune$bib[i])
    tmp_authors <- sub("\\},.*", "", tmp_authors)
    df.b.2020.mayjune$authors_full[i] <- tmp_authors
 }, error=function(e){})
}

write.csv(df.b.2020.mayjune, "Data/biorxiv_2020_MayJune_data_first.names.csv")
```

We inferred the gender of bioRxiv preprint authors, as above. 
    
```{r bioRxiv gender, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
df.b.2019 <- read.csv("Data/biorxiv_2019_data_first.names.csv")
df.b.all2020 <- read.csv("Data/biorxiv_2020_data_first.names.csv")
df.b.2020.update <- read.csv("Data/biorxiv_2020_update_data_first.names.csv")
df.b.2020.mayjune <- read.csv("Data/biorxiv_2020_MayJune_data_first.names.csv")

df.b.full <- rbind(df.b.2019, subset(df.b.all2020, as.Date(date) >= "2020-03-15" & as.Date(date) <= "2020-04-15")) #Make year comparison, subsetting 2020 data to just March 15 to April 15
df.b.all2020 <- rbind(df.b.all2020, df.b.2020.update, df.b.2020.mayjune) #Make early 2020 dataset

df.b.full$year <- as.factor(year(as.Date(df.b.full$date))) #Extract year
df.b.all2020$year <- as.factor(year(as.Date(df.b.all2020$date)))

split.b.names <- function(x){strsplit(as.character(x), " and ", fixed=TRUE)} #Function to split strings of author names
last.b.author <- function(x){gsub(".*\\band\\b", "", as.character(x))} #Function to extract last author
first.b.author <- function(x){gsub("\\band\\b.*", "", as.character(x))} #Function to extract first author

#For the year-over-year dataset
df.b.full$split.names <- lapply(df.b.full$authors_full, split.b.names) #Apply functions
df.b.full$first.author <- lapply(df.b.full$authors_full, first.b.author)
df.b.full$last.author <- lapply(df.b.full$authors_full, last.b.author)

all_first_names <- word(unlist(df.b.full$split.names),1) #Make a list of all author first names
gender <- gender(all_first_names, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)]) #Keep only unique names

tmp <- NULL
for(i in 1:length(df.b.full$authors_full)){
  tmp <- as.data.frame(word(unlist(df.b.full$split.names[[i]]), 1))
  colnames(tmp) <- "name"
  tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
  df.b.full$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
  df.b.full$female.n[i] <-  sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}

#Predict first author gender (omits sole authors)
df.b.full$author.n <- str_count(df.b.full$authors, pattern = "\\;")+1 #Count author number
df.b.full$first.author.first.name <- ifelse(df.b.full$author.n > 1, word(trimws(as.character(df.b.full$first.author, 1))), NA)
gender <- gender(df.b.full$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.full$first.author.gender <- getgender[df.b.full$first.author.first.name]

#Predict last author gender (omits sole authors)
df.b.full$last.author.first.name <- ifelse(df.b.full$author.n > 1, word(trimws(as.character(df.b.full$last.author, 1))), NA)
gender <- gender(df.b.full$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.full$last.author.gender <- getgender[df.b.full$last.author.first.name]

#Count middle authors for each gender (omits sole authors)
df.b.full$female.mid.authors.n <- ifelse(df.b.full$author.n > 1, (df.b.full$female.n - ifelse(df.b.full$first.author.gender %in% "female", 1, 0) - ifelse(df.b.full$last.author.gender %in% "female", 1, 0)), 0)
df.b.full$male.mid.authors.n <- ifelse(df.b.full$author.n > 1, (df.b.full$male.n - ifelse(df.b.full$first.author.gender %in% "male", 1, 0) - ifelse(df.b.full$last.author.gender %in% "male", 1, 0)), 0)

df.b.full <- df.b.full[!duplicated(df.b.full),] #Remove duplicated rows, if any
df.b.full.output <- as.data.frame(apply(df.b.full,2,as.character)) 
write.csv(df.b.full.output, "Data/biorxiv_full_authors_gender.csv") #Save data

#For the 2020 dataset
df.b.all2020$split.names <- lapply(df.b.all2020$authors_full, split.b.names) #Apply function
df.b.all2020$first.author <- lapply(df.b.all2020$authors_full, first.b.author)
df.b.all2020$last.author <- lapply(df.b.all2020$authors_full, last.b.author)

all_first_names <- word(unlist(df.b.all2020$split.names),1) #Make a list of all author first names
gender <- gender(all_first_names, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)]) #Keep only unique names

tmp <- NULL
for(i in 1:length(df.b.all2020$authors_full)){
  tmp <- as.data.frame(word(unlist(df.b.all2020$split.names[[i]]), 1))
  colnames(tmp) <- "name"
  tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
  df.b.all2020$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
  df.b.all2020$female.n[i] <-  sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}

#Predict first author gender (omits sole authors)
df.b.all2020$author.n <- str_count(df.b.all2020$authors, pattern = "\\;")+1 #Count author number
df.b.all2020$first.author.first.name <- word(df.b.all2020$first.author, 1)
gender <- gender(df.b.all2020$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.all2020$first.author.gender <- getgender[df.b.all2020$first.author.first.name]

#Predict last author gender (omits sole authors)
df.b.all2020$last.author.first.name <- ifelse(df.b.all2020$author.n > 1, word(trimws(as.character(df.b.all2020$last.author, 1))), NA)
gender <- gender(df.b.all2020$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.all2020$last.author.gender <- getgender[df.b.all2020$last.author.first.name]

#Count middle authors for each gender
df.b.all2020$female.mid.authors.n <- ifelse(df.b.all2020$author.n > 1, (df.b.all2020$female.n - ifelse(df.b.all2020$first.author.gender %in% "female", 1, 0) - ifelse(df.b.all2020$last.author.gender %in% "female", 1, 0)),0)
df.b.all2020$male.mid.authors.n <- ifelse(df.b.all2020$author.n > 1, (df.b.all2020$male.n - ifelse(df.b.all2020$first.author.gender %in% "male", 1, 0) - ifelse(df.b.all2020$last.author.gender %in% "male", 1, 0)), 0)

df.b.all2020 <- df.b.all2020[!duplicated(df.b.all2020),] #Remove duplicated rows, if any
df.b.all2020.output <- as.data.frame(apply(df.b.all2020,2,as.character)) 
write.csv(df.b.all2020.output, "Data/biorxiv_full_authors_2020_gender.csv") #Save data
```

Next we calculated some summary statistics for the bioRxiv dataset. 

```{r Summary statistics for biorxiv, message=FALSE, warning=FALSE, fig.width=8, fig.height=4,dpi=600}

df.b.full <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/biorxiv_full_authors_gender.csv")
df.b.all2020 <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/biorxiv_full_authors_2020_gender.csv")

all.biorxiv <- rbind(df.b.all2020, df.b.full) #Combine datasets
all.biorxiv <- all.biorxiv[!duplicated(all.biorxiv), ] #Remove duplicates
total.b.preprints <- length(all.biorxiv$doi) #Total number of preprints
all.biorxiv$authors.with.gender <- all.biorxiv$female.n+all.biorxiv$male.n  
total.b.authors <- sum(all.biorxiv$author.n) #Total number of authors
total.b.authors.with.gender <- sum(all.biorxiv$male.n+all.biorxiv$female.n) #Total number of authors with gender inferred
per.b.gender <- round((total.b.authors.with.gender/total.b.authors)*100, 1) #Percent of authors with gender

year.biorxiv.preprints <- length(df.b.full[, "doi"]) #Preprints for just year-over-year comparison
year.biorxiv.authors <- sum(df.b.full[, "male.n"]+df.b.full[, "female.n"]) #Preprint authors with gender for year-over-year comparison

#preprints per year
yr.summary <- df.b.full %>% group_by(year) %>% summarize(n=n())

#How many corresponding authors are first versus last authors?
all.biorxiv$cor <- paste0(word(all.biorxiv$author_corresponding, 1), " ", word(all.biorxiv$author_corresponding, -1))
all.biorxiv$last <- paste0(word(trimws(all.biorxiv$last.author), 1), " ", word(all.biorxiv$last.author, -1))
all.biorxiv$cor_is_last <- ifelse(all.biorxiv$author.n > 1, ifelse(as.character(all.biorxiv$cor) == as.character(all.biorxiv$last), TRUE, FALSE), TRUE)
all.biorxiv$first <- paste0(word(trimws(as.character(all.biorxiv$first.author)), 1), " ", word(trimws(all.biorxiv$first.author), -1))
all.biorxiv$cor_is_first <- ifelse(all.biorxiv$author.n > 1, ifelse(as.character(all.biorxiv$cor) == as.character(all.biorxiv$first), TRUE, FALSE), TRUE)
all.biorxiv$cor_is_mid <- all.biorxiv$cor_is_last == FALSE & all.biorxiv$cor_is_first == FALSE
per.cor.last <- length(which(all.biorxiv$cor_is_last))/(length(which(all.biorxiv$cor_is_last))+length(which(all.biorxiv$cor_is_first)))
```

There are `r total.b.preprints` preprints in the full bioRxiv dataset, with a total of `r total.b.authors` authors. We inferred the gender of `r total.b.authors.with.gender` authors, or `r per.b.gender`%, with the rest omitted from subsequent analyses. 

For just the comparison of March 15-April 15, 2020 with the same dates in 2019, there are `r year.biorxiv.preprints` bioRxiv preprints with `r year.biorxiv.authors` authors for whom we inferred gender.

For the sake of interest, `r round(per.cor.last, 1)*100`% of bioRxiv corresponding authors are last authors. 

For both bioRxiv and arXiv (above) we want to understand how many preprints list authors in alphabetical order, and whether it is more than expected by chance alone. We calculated the probably that a preprint will have authors in alphabetical order by chance alone as 1/n!, where n is the number of authors on the preprint.  

```{r bioRxiv and arXiv alphabetical order, message=FALSE, warning=FALSE, fig.width=8, fig.height=4,dpi=600}
#bioRxiv alphabetical order analysis
#Write a function to test whether author names are alphabetical order
last.names.alpha.b <- function(x) {
      tmp <- as.data.frame(strsplit(as.character(x), ";", fixed=TRUE))
      colnames(tmp) <- c("names")
      tmp$names <- trimws(tmp$names)
      tmp$names <- word(tmp$names, 1)
      tmp$names.alpha <- str_sort(tmp$names)
      tmp$match <- identical(tmp$names, tmp$names.alpha)
      ifelse(length(tmp$names) >= 2, ifelse(sum(tmp$match, na.rm=TRUE) > 0, "Y", "N"), NA)
}

#Apply alphabetical function
all.biorxiv$authors.alpha <- lapply(all.biorxiv$authors, last.names.alpha.b) #Apply functions
df.b.full$authors.alpha <- lapply(df.b.full$authors, last.names.alpha.b)
df.b.all2020$authors.alpha <- lapply(df.b.all2020$authors, last.names.alpha.b)

#Summarize 
all.biorxiv$author.n.bin <- ifelse(all.biorxiv$author.n >= 8, 8, all.biorxiv$author.n)
all.biorxiv.summary.long <- all.biorxiv %>% group_by(author.n.bin, authors.alpha) %>% summarize(n=n())
all.biorxiv.summary.wide <- spread(all.biorxiv.summary.long, key=authors.alpha, value=n)
all.biorxiv.summary.wide$expected.y <- round(ifelse(all.biorxiv.summary.wide$author.n.bin < 8, 100*(1/factorial(all.biorxiv.summary.wide$author.n.bin)), 100*(1/factorial(8))), 1) #Expect percent alphabetical (if random)
all.biorxiv.summary.wide$per <- 100*(all.biorxiv.summary.wide$Y/(all.biorxiv.summary.wide$Y + all.biorxiv.summary.wide$N))
  
#Figure
p35 <- ggplot(data=subset(all.biorxiv.summary.wide, author.n.bin > 1))+geom_bar(aes(x=author.n.bin, y=per), stat="identity", alpha=0.5)+geom_line(aes(x=author.n.bin, y=expected.y), colour="red")+geom_point(aes(x=author.n.bin, y=expected.y), colour="red", size=3)+theme_cowplot()+xlab("Authors on a preprint (no.)")+ylab("Authors in alphabetical order (%)")+scale_x_continuous(breaks=c(2:8), labels=c("2", "3", "4", "5", "6", "7", "8+"))+labs(title="bioRxiv", subtitle="all fields")

#arXiv alphabetical order analysis
#Summarize 
all.arxiv$author.n.bin <- ifelse(all.arxiv$author.n >= 8, 8, all.arxiv$author.n)
all.arxiv.summary.long <- all.arxiv %>% group_by(author.n.bin, authors.alpha) %>% summarize(n=n())
all.arxiv.summary.wide <- spread(all.arxiv.summary.long, key=authors.alpha, value=n)
all.arxiv.summary.wide$expected.y <- round(ifelse(all.arxiv.summary.wide$author.n.bin < 8, 100*(1/factorial(all.arxiv.summary.wide$author.n.bin)), 100*(1/factorial(8))), 1) #Expect percent alphabetical (if random)
all.arxiv.summary.wide$per <- 100*(all.arxiv.summary.wide$Y/(all.arxiv.summary.wide$Y + all.arxiv.summary.wide$N))
  
#Figure
p36 <- ggplot(data=subset(all.arxiv.summary.wide, author.n.bin > 1))+geom_bar(aes(x=author.n.bin, y=per), stat="identity", alpha=0.5)+geom_line(aes(x=author.n.bin, y=expected.y), colour="red")+geom_point(aes(x=author.n.bin, y=expected.y), colour="red", size=3)+theme_cowplot()+xlab("Authors on a preprint (no.)")+ylab("Authors in alphabetical order (%)")+scale_x_continuous(breaks=c(2:8), labels=c("2", "3", "4", "5", "6", "7", "8+"))+labs(title="arXiv", subtitle="all fields")

p37 <- plot_grid(p36, p35, nrow=1, align='h')
p37
save_plot("alphabetical_order.png", p37, base_width=8, base_height=4, dpi=600)
```

The observed and expected percent of preprints with authors in alphabetical order are nearly identical in bioRxiv, but arXiv has an excess of preprints in alphabetical order.

### Comparing bioRxiv preprint authorships between Mar/Apr 2019 and Mar/Apr 2020, by gender

How many men and women authorships were there on bioRxiv preprints between Mar/Apr 2019 and 2020?

```{r Visualize bioRxiv 2019 versus 2020 data,  message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
all <- as.data.frame(ungroup(df.b.full %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019

#Make figure comparing 2020 to 2019
p13 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2),  y=c(10500,16500), label = paste0("+", round(all.t$per.dif.1920[1:2], 1), "%"))+labs(title="bioRxiv", subtitle="all authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p13
```

Overall, for bioRxiv preprints, women authorships actually increased more than men authorships year-over-year, as a percent change. (This is consistent with a long-standing trend in which women were narrowing the gender gap.) In absolute terms, there was nonetheless a bigger increase in men than women authorships: between Mar/Apr 2019 and Mar/Apr 2020, there was an increase of `r all$Female[2]-all$Female[1]` women authorships and `r all$Male[2]-all$Male[1]` men authorships.

### Comparing single-authored biorXiv preprints between Mar/Apr 2019 and Mar/Apr 2020, by gender

How many bioRxiv preprints were authored by a single woman versus a single man in Mar/Apr, 2020, compared to the same dates last year? 

```{r bioRxiv sole authors,  message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
sole.authors <- as.data.frame(ungroup(subset(df.b.full, author.n == 1) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
sole.long <- gather(sole.authors, Gender, number, Male:Female) #Make wide data long
sole.authors.t <- as.data.frame(t(sole.authors[,-1])) #Transpose
colnames(sole.authors.t) <- c("2019", "2020") #Fix column names
sole.authors.t$per.dif.1920 <- ((sole.authors.t$`2020`-sole.authors.t$`2019`)/(sole.authors.t$`2019`))*100 #Calculate percent change, 2020 over 2019

#Make figure for single-authored preprints
p14 <- ggplot(data=sole.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2),  y=c(28,90), label = paste0("+", round(sole.authors.t$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="bioRxiv", subtitle = "sole authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p14
```

The number of sole-authored preprints increased a little more for men than women. 

### Comparing bioRxiv preprint submissions by author position

What if we break it down further by author position, so first, middle, or last, as for arXiv preprints?

##### First authors

```{r bioRxiv year-over-year first authors, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
first.authors <- subset(df.b.full, !is.na(first.author.gender)) %>% group_by(year,first.author.gender) %>% summarize(n=n()) #Summarize by year
first.authors$per.dif.1920 <- c(first.authors[3,3]/first.authors[1,3],first.authors[4,3]/first.authors[2,3] ,first.authors[3,3]/first.authors[1,3], first.authors[4,3]/first.authors[2,3])
first.authors$per.dif.1920 <- (as.numeric(first.authors$per.dif.1920)-1)*100 #Calculate percent change, 2020 over 2019

p15 <- ggplot(data=first.authors, aes(fill=as.factor(year), y=n, x=first.author.gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2),  y=c(1450,1920), label = paste0("+", round(first.authors$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="bioRxiv", subtitle = "first authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p15
```

Both as a percent change and an absolute change year-over-year, there was a larger increase in the number of women first authorships than the number of men first authorships. In absolute terms, there were an additional `r first.authors$n[3]-first.authors$n[1]` women first authorships and `r first.authors$n[4]-first.authors$n[2]` men first authorships in Mar/Apr 2020, compared to Mar/Apr 2019.

##### Last authors

```{r bioRxiv last authors year-over-year, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
last.authors <- subset(df.b.full, !is.na(last.author.gender)) %>% group_by(year,last.author.gender) %>% summarize(n=n()) #Summarize by year
last.authors$per.dif.1920 <- c(last.authors[3,3]/last.authors[1,3],last.authors[4,3]/last.authors[2,3] ,last.authors[3,3]/last.authors[1,3], last.authors[4,3]/last.authors[2,3])
last.authors$per.dif.1920 <- (as.numeric(last.authors$per.dif.1920)-1)*100 #Calculate percent change, 2020 over 2019

p16 <- ggplot(data=last.authors, aes(fill=as.factor(year), y=n, x=last.author.gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2),  y=c(950,2520), label = paste0("+", round(last.authors$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="bioRxiv", subtitle = "last authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p16
```

In contrast to first authorships, growth in the number of men last authorships outpaced growth in the number of women first authorships between Mar/Apr 2019 and Mar/Apr 2020. 

#### Middle authors

```{r bioRxiv middle authors year-over-year, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
middle <- as.data.frame(ungroup(df.b.full %>% group_by(year) %>% summarize(Female = sum(female.mid.authors.n, na.rm=TRUE), Male = sum(male.mid.authors.n, na.rm=TRUE)))) #Summarize by year
middle.long <- gather(middle, Gender, number, Female:Male) #Make wide data long
middle.t <- as.data.frame(t(middle[,-1])) #Transpose
colnames(middle.t) <- c("2019", "2020") #Fix column names
middle.t$per.dif.1920 <- ((middle.t$`2020`-middle.t$`2019`)/(middle.t$`2019`))*100 #Calculate percent change, 2020 over 2019

#Make figure comparing 2020 to 2019
p17 <- ggplot(data=middle.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2),  y=c(8100,12000), label = c(paste0("+", round(middle.t$per.dif.1920[1], 1), "%"), paste0("+", round(middle.t$per.dif.1920[2], 1), "%")))+labs(title="bioRxiv", subtitle="middle authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p17
```

There was also slightly more growth in the number of men than women middle authorships in Mar/Apr 2020, compared to Mar/Apr 2019. 

### Comparing bioRxiv preprint submissions in the months before and during the COVID-19 pandemic, by gender

As for arXiv submissions, we also compared bioRxiv submissions across months for early 2020. 

```{r Visualize bioRxiv 2020 data,  message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
biorxiv <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30") %>% group_by(round_date(as.Date(date), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
biorxiv.long <- gather(biorxiv, gender, n, female.n:male.n)
colnames(biorxiv.long) <- c("week", "gender", "n")
biorxiv.long$gender <- factor(biorxiv.long$gender, rev(levels(as.factor(biorxiv.long$gender))))

p18 <- ggplot(data=subset(biorxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm",se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+labs(title="bioRxiv", subtitle="all authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p18

#Model
biorxiv <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30") %>% group_by(as.Date(date)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
biorxiv.long <- gather(biorxiv, gender, n, female.n:male.n)
biorxiv.long$day_of_week <- wday(biorxiv.long$`as.Date(date)`, label=TRUE)
biorxiv.long$date <- as.numeric(biorxiv.long$`as.Date(date)`)-18261
biorxiv.long$day_of_week <- factor(biorxiv.long$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm6 <- lm(sqrt(n)~date*gender+day_of_week, data=biorxiv.long)
summary(lm6)
Anova(lm6, type=3)
#plot(lm6)
```

The model results indicate no difference between men and women in the rate of increase of preprint submissions from January to June, 2020. 

### Comparing single-authored bioRxiv preprint submissions in the months before and during the COVID-19 pandemic, by gender

Again, what about for sole authorships? How does early 2020 compare to during the pandemic?

```{r Early 2020 bioRxiv sole author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
biorxiv.sole <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30" & author.n==1) %>% group_by(round_date(as.Date(date), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
biorxiv.sole.long <- gather(biorxiv.sole, gender, n, female.n:male.n)
colnames(biorxiv.sole.long) <- c("week", "gender", "n")
biorxiv.sole.long$gender <- factor(biorxiv.sole.long$gender, rev(levels(as.factor(biorxiv.sole.long$gender))))

p19 <- ggplot(data=subset(biorxiv.sole.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm",se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+labs(title="bioRxiv", subtitle="sole authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p19

#Model
biorxiv.sole <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30" & author.n == 1) %>% group_by(as.Date(date)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE)))) #Summarize by month
biorxiv.sole.long <- gather(biorxiv.sole, gender, n, female.n:male.n)
biorxiv.sole.long$day_of_week <- wday(biorxiv.sole.long$`as.Date(date)`, label=TRUE)
biorxiv.sole.long$date <- as.numeric(biorxiv.sole.long$`as.Date(date)`)-18261
biorxiv.sole.long$day_of_week <- factor(biorxiv.sole.long$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm7 <- lm(sqrt(n)~date*gender+day_of_week, data=biorxiv.sole.long)
summary(lm7)
Anova(lm7, type=3)
#plot(lm7)
```

Again, the model results indicate no difference between men and women in the rate of increase through time of sole-authored bioRxiv preprints.

##### First authorships

```{r Early 2020 bioRxiv first author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
biorxiv.first <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30" & !is.na(first.author.gender)) %>% group_by(round_date(as.Date(date), unit="weeks", week_start = getOption("lubridate.week.start", 1)), first.author.gender) %>% summarize(n=n())))
colnames(biorxiv.first) <- c("week", "gender", "n")
biorxiv.first$gender <- factor(biorxiv.first$gender, rev(levels(as.factor(biorxiv.first$gender))))

p20 <- ggplot(data=subset(biorxiv.first, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm",se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+labs(title="bioRxiv", subtitle="first authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p20

#Model
biorxiv.first <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30" & !is.na(first.author.gender)) %>% group_by(as.Date(date), first.author.gender) %>% summarize(n=n())))
biorxiv.first$day_of_week <- wday(biorxiv.first$`as.Date(date)`, label=TRUE)
biorxiv.first$date <- as.numeric(biorxiv.first$`as.Date(date)`)-18261
biorxiv.first$day_of_week <- factor(biorxiv.first$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm8 <- lm(sqrt(n)~date*first.author.gender+day_of_week, data=biorxiv.first)
summary(lm8)
Anova(lm8, type=3)
#plot(lm8)
```

Again, no gender difference in the rate of increase in first authorships of multi-authored bioRxiv preprints. 

##### Last authorships

```{r Early 2020 bioRxiv last author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
biorxiv.last <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30" & !is.na(last.author.gender)) %>% group_by(round_date(as.Date(date), unit="weeks", week_start = getOption("lubridate.week.start", 1)), last.author.gender) %>% summarize(n=n())))
colnames(biorxiv.last) <- c("week", "gender", "n")
biorxiv.last$gender <- factor(biorxiv.last$gender, rev(levels(as.factor(biorxiv.last$gender))))

p21 <- ggplot(data=subset(biorxiv.last, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm",se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+labs(title="bioRxiv", subtitle="last authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1),legend.position="top")
p21

#Model
biorxiv.last <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30" & !is.na(last.author.gender)) %>% group_by(as.Date(date), last.author.gender) %>% summarize(n=n())))
biorxiv.last$day_of_week <- wday(biorxiv.last$`as.Date(date)`, label=TRUE)
biorxiv.last$date <- as.numeric(biorxiv.last$`as.Date(date)`)-18261
biorxiv.last$day_of_week <- factor(biorxiv.last$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm9 <- lm(sqrt(n)~date*last.author.gender+day_of_week, data=biorxiv.last)
summary(lm9)
Anova(lm9, type=3)
#plot(lm9)
```

In contrast to sole and first authorships, the number of men last authorships has grown faster than the number of women last authorships of bioRxiv preprints from January to June, 2020. 

### Middle authorships

```{r Early 2020 biorxiv middle author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
biorxiv.middle <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30") %>% group_by(round_date(as.Date(date), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.mid.authors.n, na.rm=TRUE), male.n=sum(male.mid.authors.n, na.rm=TRUE))))
biorxiv.middle.long <- gather(biorxiv.middle, gender, n, female.n:male.n)
colnames(biorxiv.middle.long) <- c("week", "gender", "n")
biorxiv.middle.long$gender <- factor(biorxiv.middle.long$gender, rev(levels(as.factor(biorxiv.middle.long$gender))))

p22 <- ggplot(data=subset(biorxiv.middle.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week),y=n, color=gender, shape=gender))+geom_point(size=2)+theme_cowplot()+geom_smooth(method="lm",se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+labs(title="bioRxiv", subtitle="middle authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p22

#Model
biorxiv.middle <- as.data.frame(ungroup(subset(df.b.all2020, as.Date(date) >= "2020-01-01" & as.Date(date) <= "2020-06-30") %>% group_by(as.Date(date)) %>% summarize(female.n=sum(female.mid.authors.n, na.rm=TRUE), male.n=sum(male.mid.authors.n, na.rm=TRUE))))
biorxiv.middle.long <- gather(biorxiv.middle, gender, n, female.n:male.n)
biorxiv.middle.long$day_of_week <- wday(biorxiv.middle.long$`as.Date(date)`, label=TRUE)
biorxiv.middle.long$date <- as.numeric(biorxiv.middle.long$`as.Date(date)`)-18261
biorxiv.middle.long$day_of_week <- factor(biorxiv.middle.long$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)

lm10 <- lm(sqrt(n)~date*gender+day_of_week, data=biorxiv.middle.long)
summary(lm10)
Anova(lm10, type=3)
#plot(lm10)
```

Similar to first and sole authorships, there is no gender difference in the rate of increase in middle authorships from January to June, 2020. 

### Omnibus figures

```{r Combine biorxiv visualizations for omnibus figures, message=FALSE, warning=FALSE, fig.width=16, fig.height=16, dpi=300}
#Year over year
p23 <- plot_grid(p13,p14,p15,p16,p17, align = 'v', axis='l')
p23
save_plot("year-over-year_biorxiv.png", p23, base_height=8, base_width=8, dpi=600)

#p23.v2 <- plot_grid(p13,p14,p15,p16,p17, align = 'v', axis='l', nrow=1)
#p23.v2
#save_plot("year-over-year_biorxiv.v2.png", p23.v2, base_height=4, base_width=13.5, dpi=600)

p24 <- plot_grid(p18, p19, p20, p21, p22, align='v', axis='l')
p24
save_plot("early2020_biorxiv.png", p24, base_height=8, base_width=10, dpi=600)

#p24.v2 <- plot_grid(p18, p19, p20, p21, p22, align='v', axis='l', nrow=1)
#p24.v2
```

## Take-home message

The arXiv preprint dataset fairly consistently shows a widening gender gap across author positions, suggesting that women are falling behind men in submitting arXiv preprints during the COVID-19 pandemic. The picture is less clear-cut for bioRxiv, but women are still falling behind men in last authorships of multi-authored bioRxiv preprints.