-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
1310 lines (1019 loc) · 94.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "The Pandemic Penalty: COVID-19's Gendered Impact on Scientific Productivity"
author: "Megan Frederickson and Molly King"
date: "February 6, 2021"
output: github_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This repo contains the data and code for the analyses in:
King MM, Frederickson ME. The Pandemic Penalty: COVID-19's gendered impact on scientific productivity. Currently under review, preprint available on SocArXiv. The repo was updated in early February, 2021 after receiving reviewer comments.
In the paper, we quantified how the COVID-19 pandemic is affecting the gender breakdown of preprint submissions to [arXiv](https://arxiv.org/) and [bioRxiv](https://www.biorxiv.org/), two preprint servers that together cover many STEM fields.
We used these packages:
```{r Load packages, message=FALSE, warning=FALSE}
#Load packages
library(tidyverse) #includes ggplot2, dplyr, readr, stringr
library(knitr)
library(cowplot)
library(gender)
library(aRxiv)
#install.packages("devtools")
#devtools::install_github("nicholasmfraser/rbiorxiv")
library(rbiorxiv)
library(lubridate)
library(anytime)
library(car)
library(rcrossref)
library(lme4)
```
## arXiv submissions
We scraped submission data from arXiv, a preprint server for physics, math, computer science, statistics, and other quantitative disciplines. We used the aRxiv package to scrape the data, see:
Karthik, R. and K. Broman (2019). aRxiv: Interface to the arXiv API. R package version 0.5.19. https://CRAN.R-project.org/package=aRxiv
We began by scraping all records for March 15-April 15, 2020, during the COVID-19 pandemic, and for the same date range in 2019. Then, we expanded to scrape all the data for Jan 1, 2020 to June 30, 2020, inclusive. We scraped the data in batches, as recommended in the aRxiv package tutorial. For brevity, we are not reproducing the code here, but it is available in the [R markdown file](https://github.com/drfreder/king-and-frederickson/blob/master/README.Rmd) included in this repo.
```{r Scrape arXiv, include=FALSE, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
#Get all submissions between March 15, 2020 and April 15, 2020 (during the COVID-19 pandemic)
n.2020 <- arxiv_count(query = 'submittedDate:[202003150000 TO 202004152400]')
df.2020.1 <- arxiv_search(query = 'submittedDate:[202003150000 TO 202003212400]', limit=n.2020, batchsize=1000)
df.2020.2 <- arxiv_search(query = 'submittedDate:[202003220000 TO 202003282400]', limit=n.2020, batchsize=1000)
df.2020.3 <- arxiv_search(query = 'submittedDate:[202003290000 TO 202004032400]', limit=n.2020, batchsize=1000)
df.2020.4 <- arxiv_search(query = 'submittedDate:[202004040000 TO 202004092400]', limit=n.2020, batchsize=1000)
df.2020.5 <- arxiv_search(query = 'submittedDate:[202004100000 TO 202004121415]', limit=n.2020, batchsize=2000)
df.2020.6 <- arxiv_search(query = 'submittedDate:[202004121420 TO 202004152400]', limit=n.2020, batchsize=1000)
df.2020.full <- rbind(df.2020.1, df.2020.2, df.2020.3, df.2020.4, df.2020.5, df.2020.6)
n.2020-length(df.2020.full$id) #Check that the number of records matches
write.csv(df.2020.full, file="Data/arxiv_2020_data.csv")
#Get all submission between March 15, 2019 and April 15, 2019 (the same dates last year)
n.2019 <- arxiv_count(query = 'submittedDate:[201903150000 TO 201904152400]')
df.2019.1 <- arxiv_search(query = 'submittedDate:[201903150000 TO 201903222400]', limit=n.2019, batchsize=2000)
df.2019.2 <- arxiv_search(query = 'submittedDate:[201903230000 TO 201903292400]', limit=n.2019, batchsize=2000)
df.2019.3 <- arxiv_search(query = 'submittedDate:[201903300000 TO 201904052400]', limit=n.2019, batchsize=2000)
df.2019.4 <- arxiv_search(query = 'submittedDate:[201904060000 TO 201904122400]', limit=n.2019, batchsize=2000)
df.2019.5 <- arxiv_search(query = 'submittedDate:[201904130000 TO 201904152400]', limit=n.2019, batchsize=2000)
df.2019.full <- rbind(df.2019.1, df.2019.2, df.2019.3, df.2019.4, df.2019.5)
n.2019-length(df.2019.full$id)
write.csv(df.2019.full, file="Data/arxiv_2019_data.csv")
#Get all submissions between Jan. 1, 2020 and March 15, 2020
n.early2020 <- arxiv_count(query = 'submittedDate:[202001010000 TO 202003152400]')
df.early2020.1 <- arxiv_search(query = 'submittedDate:[202001010000 TO 202001152400]', limit=n.early2020, batchsize=2000)
df.early2020.2 <- arxiv_search(query = 'submittedDate:[202001160000 TO 202001312400]', limit=n.early2020, batchsize=2000)
df.early2020.3 <- arxiv_search(query = 'submittedDate:[202002010000 TO 202002152400]', limit=n.early2020, batchsize=2000)
df.early2020.4 <- arxiv_search(query = 'submittedDate:[202002160000 TO 202002292400]', limit=n.early2020, batchsize=2000)
df.early2020.5 <- arxiv_search(query = 'submittedDate:[202003010000 TO 202003152400]', limit=n.early2020, batchsize=2000)
df.early2020.full <- rbind(df.early2020.1, df.early2020.2, df.early2020.3, df.early2020.4, df.early2020.5)
n.early2020-length(df.early2020.full$id)
write.csv(df.early2020.full, file="Data/arxiv_early2020_data.csv")
#Get all submissions between Apr. 16, 2020 and April 30, 2020
n.update <- arxiv_count(query = 'submittedDate:[202004160000 TO 202004222400]')
df.update <- arxiv_search(query = 'submittedDate:[202004160000 TO 202004222400]', limit=n.update, batchsize=2000)
n.update - length(df.update$id)
write.csv(df.update, file="Data/arxiv_update2020_data.csv")
n.update.2 <- arxiv_count(query = 'submittedDate:[202004230000 TO 202004302400]')
df.update.2 <- arxiv_search(query = 'submittedDate:[202004230000 TO 202004302400]', limit=n.update.2, batchsize=2000)
n.update.2 - length(df.update.2$id)
write.csv(df.update.2, file="Data/arxiv_update2020_2_data.csv")
#May and June data
n.update.3 <- arxiv_count(query = 'submittedDate:[202005010000 TO 202006302400]')
df.update.3 <- arxiv_search(query = 'submittedDate:[202005010000 TO 202005152400]', limit=n.update.3, batchsize=2000)
df.update.4 <- arxiv_search(query = 'submittedDate:[202005160000 TO 202005312400]', limit=n.update.3, batchsize=2000)
df.update.5 <- arxiv_search(query = 'submittedDate:[202006010000 TO 202006152400]', limit=n.update.3, batchsize=2000)
df.update.6 <- arxiv_search(query = 'submittedDate:[202006160000 TO 202006302400]', limit=n.update.3, batchsize=2000)
df.update.7 <- rbind(df.update.3, df.update.4, df.update.5, df.update.6)
n.update.3 - length(df.update.7$id)
#n.update - length(df.update$id)
write.csv(df.update.7, file="Data/arxiv_updateMayJune2020_data.csv")
```
Next, we assigned gender to author names using the gender package, see:
Mullen, L. (2019). gender: Predict Gender from Names Using Historical Data. R package version 0.5.3, https://github.com/ropensci/gender.
This package returns the probability that a first name belongs to a woman or a man by comparing the name to names in a database; we used the U.S. Social Security Administration baby names database.
Please note: this is a brute force method of predicting gender, and it has many limitations, as discussed by the package authors on their GitHub repo and included links. By using this method, we are not assuming that individuals are correctly gendered in the resulting dataset, but merely that it provides insight into gender's effects in aggregate across the population of preprint authors.
This code takes a while to run, so it is not run when rendering this markdown document.
```{r Predict arXiv author gender, eval=FALSE, message=FALSE, warning=FALSE}
#Not run
#First combine data for year-over-year comparison
df.2020 <- read.csv("Data/arxiv_2020_data.csv") #Read in data
df.2019 <- read.csv("Data/arxiv_2019_data.csv")
df.full <- rbind(df.2019, df.2020) #Combine in one dataframe
#Next combine data for early 2020 comparison
df.early2020 <- read.csv("Data/arxiv_early2020_data.csv")
df.update <- read.csv("Data/arxiv_update2020_data.csv")
df.update.2 <- read.csv("Data/arxiv_update2020_2_data.csv")
df.mayjune2020 <- read.csv("Data/arxiv_updateMayJune2020_data.csv")
df.all2020 <- rbind(df.2020, df.early2020, df.update, df.update.2, df.mayjune2020) #Full 2020 data
split.names <- function(x){strsplit(as.character(x), "|", fixed=TRUE)} #Function to split strings of author names
last.author <- function(x){gsub(".*\\|", "", as.character(x))} #Function to extract last author
first.author <- function(x){gsub("\\|.*", "", as.character(x))} #Function to extract first author
#For the year-over-year dataset
df.full$split.names <- lapply(df.full$authors, split.names) #Apply functions
df.full$first.author <- lapply(df.full$authors, first.author)
df.full$last.author <- lapply(df.full$authors, last.author)
all_first_names <- word(unlist(df.full$split.names),1) #Make a list of all author first names
#install.packages("genderdata", repos = "http://packages.ropensci.org") #In case you need the gender data package
gender <- gender(all_first_names, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)]) #Keep only unique names
#This loop is an inelegant way of counting the number of men and women authors for each paper
tmp <- NULL
for(i in 1:length(df.full$authors)){
tmp <- as.data.frame(word(unlist(df.full$split.names[[i]]), 1))
colnames(tmp) <- "name"
tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
df.full$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
df.full$female.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}
#Predict first author gender (does not include sole authors)
df.full$author.n <- str_count(df.full$authors, pattern = "\\|")+1 #Count author number
df.full$first.author.first.name <- ifelse(df.full$author.n > 1, word(df.full$first.author, 1), NA)
gender <- gender(df.full$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.full$first.author.gender <- getgender[df.full$first.author.first.name]
#Predict last author gender (again, omits sole authors)
df.full$author.n <- str_count(df.full$authors, pattern = "\\|")+1 #Count author number
df.full$last.author.first.name <- ifelse(df.full$author.n > 1, word(df.full$last.author, 1), NA)
gender <- gender(df.full$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.full$last.author.gender <- getgender[df.full$last.author.first.name]
#Count middle authors for each gender
df.full$female.mid.authors.n <- ifelse(df.full$author.n > 1, (df.full$female.n - ifelse(df.full$first.author.gender %in% "female", 1, 0) - ifelse(df.full$last.author.gender %in% "female", 1, 0)), 0)
df.full$male.mid.authors.n <- ifelse(df.full$author.n > 1, (df.full$male.n - ifelse(df.full$first.author.gender %in% "male", 1, 0) - ifelse(df.full$last.author.gender %in% "male", 1, 0)), 0)
df.full.output <- as.data.frame(apply(df.full,2,as.character))
write.csv(df.full.output, "Data/arxiv_full_gender.csv") #Save data
#Same for the early 2020 dataset
df.all2020$split.names <- lapply(df.all2020$authors, split.names)
df.all2020$first.author <- lapply(df.all2020$authors, first.author)
df.all2020$last.author <- lapply(df.all2020$authors, last.author)
tmp <- NULL
all_first_names <- word(unlist(df.all2020$split.names),1)
gender <- gender(all_first_names, method = "ssa")
gender <- unique(gender[ , c(1,2,4)])
for(i in 1:length(df.all2020$authors)){
tmp <- as.data.frame(word(unlist(df.all2020$split.names[[i]]), 1))
colnames(tmp) <- "name"
tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
df.all2020$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
df.all2020$female.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}
df.all2020$author.n <- str_count(df.all2020$authors, pattern = "\\|")+1 #Count author number
df.all2020$first.author.first.name <- ifelse(df.all2020$author.n > 1, word(df.all2020$first.author, 1), NA)
df.all2020$first.author.first.name <- word(df.all2020$first.author, 1)
gender <- gender(df.all2020$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.all2020$first.author.gender <- getgender[df.all2020$first.author.first.name]
df.all2020$last.author.first.name <- ifelse(df.all2020$author.n > 1, word(df.all2020$last.author, 1), NA)
gender <- gender(df.all2020$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.all2020$last.author.gender <- getgender[df.all2020$last.author.first.name]
df.all2020$female.mid.authors.n <- ifelse(df.all2020$author.n > 1, (df.all2020$female.n - ifelse(df.all2020$first.author.gender %in% "female", 1, 0) - ifelse(df.all2020$last.author.gender %in% "female", 1, 0)), 0)
df.all2020$male.mid.authors.n <- ifelse(df.all2020$author.n > 1, (df.all2020$male.n - ifelse(df.all2020$first.author.gender %in% "male", 1, 0) - ifelse(df.all2020$last.author.gender %in% "male", 1, 0)), 0)
df.all2020.output <- as.data.frame(apply(df.all2020,2,as.character))
write.csv(df.all2020.output, "Data/arxiv_all2020_gender.csv") #Save data
```
Next, we calculated some summary statistics for the arXiv dataset we assembled. We also determined which preprints list authors in alphabetical order in order to better understand author order conventions in this dataset.
```{r Summarize arXiv dataset, message=FALSE, warning=FALSE}
df.full <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/arxiv_full_gender.csv") #Read in data
df.full <- df.full[!duplicated(df.full), ] #Remove duplicates, if any
df.full$year <- as.factor(year(as.Date(df.full$submitted))) #Extract year
df.all2020 <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/arxiv_all2020_gender.csv") #Read in data
df.all2020 <- df.all2020[!duplicated(df.all2020), ] #Remove duplicated rows, if any
df.all2020$year <- as.factor(year(as.Date(df.all2020$submitted))) #Extract year
all.arxiv <- rbind(df.full, df.all2020) #Combine datasets
all.arxiv <- all.arxiv[!duplicated(all.arxiv), ] #Remove duplicates
total.preprints <- length(all.arxiv$id) #Total number of preprints
total.authors <- sum(all.arxiv$author.n) #Total number of authors
total.authors.with.gender <- sum(all.arxiv$male.n+all.arxiv$female.n) #Total number of authors with gender inferred
per.gender <- round((total.authors.with.gender/total.authors)*100, 1) #Percent of authors with gender
year.arxiv.preprints <- length(df.full$"id") #Total number of preprints for year-over-year comparison
year.arxiv.authors <- sum(df.full[, "male.n"]+df.full[, "female.n"]) #Authors with gender inferred for year-over-year comparison
#preprints per year
yr.summary <- df.full %>% group_by(year) %>% summarize(n=n())
#Write a function to test whether author names are in alphabetical order
last.names.alpha <- function(x) {
tmp <- as.data.frame(strsplit(as.character(x), "|", fixed=TRUE))
colnames(tmp) <- c("names")
tmp$names <- word(tmp$names, -1)
tmp$names.alpha <- str_sort(tmp$names)
tmp$match <- identical(tmp$names, tmp$names.alpha)
ifelse(length(tmp$names) >= 2, ifelse(sum(tmp$match, na.rm=TRUE) > 0, "Y", "N"), NA)
}
#Apply alphabetical function
all.arxiv$authors.alpha <- lapply(all.arxiv$authors, last.names.alpha) #Apply functions
df.full$authors.alpha <- lapply(df.full$authors, last.names.alpha)
df.all2020$authors.alpha <- lapply(df.all2020$authors, last.names.alpha)
#Make 'big' categories
all.arxiv$big.category <- word(all.arxiv$primary_category,1,sep = "\\.")
```
There are `r total.preprints` preprints in the full arXiv dataset, with a total of `r format(total.authors, scientific=FALSE)` non-unique authors. We inferred the gender of `r total.authors.with.gender` authors, or `r per.gender`%, with the rest omitted from subsequent analyses. This a lower success rate for predicting author gender for the arXiv dataset than for the bioRxiv dataset (see below), reflecting the fact that arXiv preprints are more likely to list large consortia as authors (e.g., CMS Collaboration) or have authors who have names not in the U.S. Social Security names database.
For just the comparison of March 15-April 15, 2020 with the same dates in 2019, there are `r year.arxiv.preprints` arXiv preprints with `r year.arxiv.authors` authors for whom we inferred gender.
We analyzed the data by authorship position (e.g., first, middle, last, or sole author). In many but not all STEM fields, it is convention to list the Principal Investigator (PI) as last author on the paper, and the person who led the study and wrote the first draft of the manuscript as first author on the paper, with other authors as middle authors. However, some fields list authors in alphabetical order. In the full arXiv dataset, among preprints with multiple (i.e., 2 or more) authors, there are `r length(all.arxiv[all.arxiv$authors.alpha == "N", "authors.alpha"])` preprints with authors not in alphabetical order, and `r length(all.arxiv[all.arxiv$authors.alpha == "Y", "authors.alpha"])` preprints with authors in alphabetical order. In other words, `r round((length(all.arxiv[all.arxiv$authors.alpha == "Y", "authors.alpha"])/(length(all.arxiv[all.arxiv$authors.alpha == "N", "authors.alpha"])+length(all.arxiv[all.arxiv$authors.alpha == "Y", "authors.alpha"])))*100, 1)`% of preprints with multiple authors list authors in alphabetical order.
### Exploring author order across fields
Reviewers also asked for some more information about the fields represented in the arXiv dataset. We used the category taxonomy available on the arXiv website itself (https://arxiv.org/category_taxonomy) to categorize preprints into 8 broad fields: physics, math, computer science, statistics, quantitative biology, economics, quantitative finance, and electrical engineering and systems science. We further combined economics and quantitative finance preprints into a single category, because each one had few preprints. How does the number of arXiv preprints compare among fields?
```{r Author order, message=FALSE, warning=FALSE, dpi=300}
#Bin all physics sub-disciplines into "physics"
all.arxiv$field <- ifelse(all.arxiv$big.category == "astro-ph" | all.arxiv$big.category == "cond-mat" | all.arxiv$big.category == "gr-qc" | all.arxiv$big.category == "math-ph" | all.arxiv$big.category == "quant-ph" | all.arxiv$big.category == "nucl-ex" | all.arxiv$big.category == "nucl-th" | all.arxiv$big.category == "hep-ex" | all.arxiv$big.category == "hep-ph" | all.arxiv$big.category == "hep-th" | all.arxiv$big.category == "hep-lat" | all.arxiv$big.category == "nlin", "physics", all.arxiv$big.category)
#Lump quantitative finance and economics together
all.arxiv$field <- ifelse(all.arxiv$field == "q-fin", "econ", all.arxiv$field)
#Count preprints per field
field.summary <- all.arxiv %>% group_by(field) %>% summarize(n=n())
#Make a figure
p25 <- ggplot(data=field.summary, aes(x=reorder(field, n), y=n))+xlab("Field")+ylab("Preprints (no.)")+geom_bar(stat="identity", alpha=0.5)+theme_cowplot()+theme(axis.text.x=element_text(angle=90))+labs(title="arXiv", subtitle="by field")
p25
```
The arXiv dataset is dominated by preprints in physics, math, and computer science.
How many preprints in each field list authors in alphabetical order?
```{r Fields in alphabetical order, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
#Count the number of preprints with authors in alphabetical order, and the total number of preprints in each field
author.order.summary <- subset(all.arxiv, !is.na(authors.alpha)) %>% group_by(field, authors.alpha) %>% summarize(n=n())
author.order.summary.wide <- spread(author.order.summary, key=authors.alpha, value=n)
author.order.summary.wide$per <- 100*(author.order.summary.wide$Y/(author.order.summary.wide$N+author.order.summary.wide$Y))
author.order.summary.wide$field <-factor(author.order.summary.wide$field, levels=c("econ", "q-bio", "stat", "eess", "math", "cs", "physics"))
#Plot percent of preprints with authors in alphabetical order
p26 <- ggplot(data=author.order.summary.wide, aes(x=field, y=per))+xlab("Field")+ylab("Authors in alphabetical order (%)")+geom_bar(stat="identity", alpha=0.5)+theme_cowplot()+theme(axis.text.x=element_text(angle=90))+labs(title="arXiv", subtitle="by field")
p26
p40 <- plot_grid(p25, p26, labels="auto")
save_plot("Figure S1.png", p40, base_width = 8, base_height =4, dpi=600)
```
Economics and quantitative finance (combined into "econ" in the figure) and math are the fields with most preprints listing authors in alphabetical order. These are the fields for which author position (e.g., first, middle, or last) is least likely to be meaningful.
### Comparing arXiv preprint authorships between Mar/Apr 2019 and Mar/Apr 2020, by gender
How many men versus women authorships of preprints were there in Mar/Apr 2020, compared to the same dates last year? Note: this is not the number of unique authors; it includes authors who submitted multiple preprints. Thus, 1 "authorship" equals 1 author on 1 paper.
```{r Visualize arXiv year-over-year data, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
all <- as.data.frame(ungroup(df.full %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019
yr.labels = c("Mar. 15 - Apr. 15, 2019", "Mar. 15 - Apr. 15, 2020") #Set legend labels
colours1 = c("#f4a582","#ca0020") #Set colours
fontsize = 10
#Make figure comparing 2020 to 2019
p1 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2), y=c(9000,29000), label = paste0("+", round(all.t$per.dif.1920[1:2], 1), "%"))+labs(title="arXiv", subtitle="all authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p1
```
arXiv preprint submissions are up overall, but the number of men authorships is currently growing faster than the number of women authorships. Comparing preprint submissions in late March and early April 2020 to the same dates in 2019, the number of men authorships has grown more than the number of women authorships, both as a percent change and in absolute terms.
We further investigated this pattern in the fields with the greatest numbers of preprints in arXiv, namely physics, computer science, and math.
```{r Visualize arXiv year-over-year data by field, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
#Bin all physics sub-disciplines into "physics"
df.full$big.category <- word(df.full$primary_category,1,sep = "\\.")
df.full$field <- ifelse(df.full$big.category == "astro-ph" | df.full$big.category == "cond-mat" | df.full$big.category == "gr-qc" | df.full$big.category == "math-ph" | df.full$big.category == "quant-ph" | df.full$big.category == "nucl-ex" | df.full$big.category == "nucl-th" | df.full$big.category == "hep-ex" | df.full$big.category == "hep-ph" | df.full$big.category == "hep-th" | df.full$big.category == "hep-lat" | df.full$big.category == "nlin", "physics", df.full$big.category)
df.full$field <- ifelse(df.full$field == "q-fin", "econ", df.full$field) #Lump quantitative finance and economics together
#Set labels and colours
yr.labels = c("Mar. 15 - Apr. 15, 2019", "Mar. 15 - Apr. 15, 2020") #Set legend labels
colours1 = c("#f4a582","#ca0020") #Set colours
fontsize = 10
#Physics
cat <- as.character(unique(df.full$field)[[1]])
all <- as.data.frame(ungroup(subset(df.full, field == cat) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019
#Make figure comparing 2020 to 2019
p27 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2), y=c(4700,13800), label = paste0(round(all.t$per.dif.1920[1:2], 1), "%"))+labs(title="arXiv", subtitle=paste0(cat, ", all authorships"))+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
#Math
cat <- as.character(unique(df.full$field)[[2]])
all <- as.data.frame(ungroup(subset(df.full, field == cat) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019
#Make figure comparing 2020 to 2019
p28 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+labs(title="arXiv", subtitle=paste0(cat, ", all authorships"))+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))+ggplot2::annotate("text", x=c(1, 2), y=c(1200,4000), label = c(paste0(round(all.t$per.dif.1920[1], 1), "%"), paste0("+", round(all.t$per.dif.1920[2], 1), "%")))
#Computer Science
cat <- as.character(unique(df.full$field)[[3]])
all <- as.data.frame(ungroup(subset(df.full, field == cat) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
all.long <- gather(all, Gender, number, Female:Male) #Make wide data long
all.t <- as.data.frame(t(all[,-1])) #Transpose
colnames(all.t) <- c("2019", "2020") #Fix column names
all.t$per.dif.1920 <- ((all.t$`2020`-all.t$`2019`)/(all.t$`2019`))*100 #Calculate percent change, 2020 over 2019
#Make figure comparing 2020 to 2019
p29 <- ggplot(data=all.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2), y=c(3000,9500), label = paste0("+", round(all.t$per.dif.1920[1:2], 1), "%"))+labs(title="arXiv", subtitle="cs, all authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p30 <- plot_grid(p27, p28, p29, nrow=1)
p30
save_plot("all_authorships_by_field.png", p30, base_height=4, base_width=8, dpi=600)
```
There has been a greater decline in women authorships than men authorships in physics and math, but not in computer science. In computer science, women added more authorships as a percent change, but not in absolute terms, between Mar/Apr 2019 and Mar/Apr 2020.
### Comparing single-authored arXiv preprints between Mar/Apr 2019 and Mar/Apr 2020, by gender
How many arXiv preprints were authored by a single woman versus a single man in Mar/Apr, 2020, compared to the same dates last year?
```{r Sole authors, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
sole.authors <- as.data.frame(ungroup(subset(df.full, author.n == 1) %>% group_by(year) %>% summarize(Female = sum(female.n, na.rm=TRUE), Male = sum(male.n, na.rm=TRUE)))) #Summarize by year
sole.long <- gather(sole.authors, Gender, number, Male:Female) #Make wide data long
sole.authors.t <- as.data.frame(t(sole.authors[,-1])) #Transpose
colnames(sole.authors.t) <- c("2019", "2020") #Fix column names
sole.authors.t$per.dif.1920 <- ((sole.authors.t$`2020`-sole.authors.t$`2019`)/(sole.authors.t$`2019`))*100 #Calculate percent change, 2020 over 2019
#Make figure for single-authored preprints
p2 <- ggplot(data=sole.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2), y=c(270,1350), label = paste0("+", round(sole.authors.t$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="arXiv", subtitle = "sole authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p2
```
Single-authored arXiv submissions are also up overall, but again the number of men authorships is currently growing faster than the number of women authorships, both as a percent change and in absolute terms.
### Comparing arXiv preprint submissions by authorship position between Mar/Apr 2019 and Mar/Apr 2020, by gender
What if we break it down further by author position, so first, middle, or last? First up, first authorships of multi-authored papers.
##### First authorships
```{r arXiv year-over-year first authors, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
first.authors <- subset(df.full, !is.na(first.author.gender)) %>% group_by(year,first.author.gender) %>% summarize(n=n()) #Summarize by year
first.authors$per.dif.1920 <- c(first.authors[3,3]/first.authors[1,3],first.authors[4,3]/first.authors[2,3] ,first.authors[3,3]/first.authors[1,3], first.authors[4,3]/first.authors[2,3])
first.authors$per.dif.1920 <- (as.numeric(first.authors$per.dif.1920)-1)*100 #Calculate percent change, 2020 over 2019
p3 <- ggplot(data=first.authors, aes(fill=as.factor(year), y=n, x=first.author.gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2), y=c(2000,6550), label = paste0("+", round(first.authors$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="arXiv", subtitle = "first authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p3
```
The number of women first authorships has grown slightly faster than the number of men first authorships, as a percent change year-over-year (but not in absolute terms).
##### Last authorships
What about last, or "senior," authorships of multi-authored papers?
```{r arXiv last authors year-over-year, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
last.authors <- subset(df.full, !is.na(last.author.gender)) %>% group_by(year,last.author.gender) %>% summarize(n=n()) #Summarize by year
last.authors$per.dif.1920 <- c(last.authors[3,3]/last.authors[1,3],last.authors[4,3]/last.authors[2,3] ,last.authors[3,3]/last.authors[1,3], last.authors[4,3]/last.authors[2,3])
last.authors$per.dif.1920 <- (as.numeric(last.authors$per.dif.1920)-1)*100 #Calculate percent change, 2020 over 2019
p4 <- ggplot(data=last.authors, aes(fill=as.factor(year), y=n, x=last.author.gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values = colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank())+ggplot2::annotate("text", x=c(1, 2), y=c(1900,7020), label = paste0("+", round(last.authors$per.dif.1920[1:2], 1), "%"))+theme(legend.text=element_text(size=fontsize))+labs(title="arXiv", subtitle = "last authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p4
```
The number of men last authorships has grown substantially year-over-year, but the number of women last authorships is almost unchanged from 2019.
##### Middle authorships
And finally, middle authorships, or all authorships on multi-authored papers that are not in the first or last position.
```{r arXiv middle authors year-over-year, message=FALSE, warning=FALSE, fig.width=8, fig.height=4, dpi=300}
middle <- as.data.frame(ungroup(df.full %>% group_by(year) %>% summarize(Female = sum(female.mid.authors.n, na.rm=TRUE), Male = sum(male.mid.authors.n, na.rm=TRUE)))) #Summarize by year
middle.long <- gather(middle, Gender, number, Female:Male) #Make wide data long
middle.t <- as.data.frame(t(middle[,-1])) #Transpose
colnames(middle.t) <- c("2019", "2020") #Fix column names
middle.t$per.dif.1920 <- ((middle.t$`2020`-middle.t$`2019`)/(middle.t$`2019`))*100 #Calculate percent change, 2020 over 2019
yr.labels = c("Mar. 15 - Apr. 15, 2019", "Mar. 15 - Apr. 15, 2020") #Set legend labels
colours1 = c("#f4a582","#ca0020") #Set colours
fontsize = 10
#Make figure comparing 2020 to 2019
p5 <- ggplot(data=middle.long, aes(fill=as.factor(year), y=number, x=Gender))+geom_bar(position="dodge", stat="identity")+theme_cowplot()+xlab("Gender")+ylab("Authorships (no.)")+labs(fill="Year")+scale_fill_manual(values=colours1, labels=yr.labels)+theme(legend.position = "top", legend.justification="left", legend.title = element_blank(), legend.text = element_text(size=fontsize))+ggplot2::annotate("text", x=c(1, 2), y=c(4770,14450), label = c(paste0("+", round(middle.t$per.dif.1920[1], 1), "%"), paste0("+", round(middle.t$per.dif.1920[2], 1), "%")))+labs(title="arXiv", subtitle="middle authorships")+guides(fill=guide_legend(nrow=2))+scale_x_discrete(labels=c("Women", "Men"))
p5
```
Again, the number of men middle authorships increased more than the number of women middle authorships, year-over-year.
### Comparing arXiv preprint submissions in the months before and during the COVID-19 pandemic, by gender
Next, we looked back over the months leading up to widespread stay-at-home orders and school and childcare closures that North Americans experienced beginning in late March or early April, 2020. These measures were implemented to different degrees and on different dates in different parts of the world, but we assumed their effects would be most pronounced (globally) starting in March, 2020 and thereafter.
```{r Early 2020 arXiv all authors analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
who <- "2020-03-11" #Date the WHO declared COVID-19 a pandemic, for reference
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))
colours2 = c("#7fbf7b","#af8dc3") #Set colours
p6 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p6
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm1 <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm1)
Anova(lm1, type=3)
#plot(lm1)
```
The number of male authorships is growing faster than the number of female authorships during the pandemic.
Again, we repeated the analysis for physics, math, and computer science separately. But first, let's check how many physics preprints are about COVID-19 in the full arXiv dataset.
```{r Physics arXiv preprints about COVID-19}
#Check how many physics preprints are about COVID-19
physics.df <- subset(all.arxiv, field == "physics")
physics.df$COVID.in.abstract <- grepl('COVID-19|SARS-CoV-2|COVID|Covid|coronavirus', physics.df$abstract)
```
We calculated how many physics preprints mention COVID-19, SARS-CoV-2, or coronavirus in the abstract, and found that only `r sum(physics.df$COVID.in.abstract)` out of `r length(physics.df$COVID.in.abstract)`, or `r round(100*(sum(physics.df$COVID.in.abstract)/length(physics.df$COVID.in.abstract)),1)`%, mention one of these terms.
Now for the linear model results for each of the 3 big fields in the arXiv dataset (physics, math, and computer science). For sake of completion, we also did the other 4 fields (economics and quantitative finance, electrical engineering and systems science, statistics, and quantitative biology).
```{r Early 2020 arXiv all authors analysis by field, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
#Bin all physics sub-disciplines into "physics"
df.all2020$big.category <- word(df.all2020$primary_category,1,sep = "\\.")
df.all2020$field <- ifelse(df.all2020$big.category == "astro-ph" | df.all2020$big.category == "cond-mat" | df.all2020$big.category == "gr-qc" | df.all2020$big.category == "math-ph" | df.all2020$big.category == "quant-ph" | df.all2020$big.category == "nucl-ex" | df.all2020$big.category == "nucl-th" | df.all2020$big.category == "hep-ex" | df.all2020$big.category == "hep-ph" | df.all2020$big.category == "hep-th" | df.all2020$big.category == "hep-lat" | df.all2020$big.category == "nlin", "physics", df.all2020$big.category)
#Lump quantitative finance and economics together
df.all2020$field <- ifelse(df.all2020$field == "q-fin", "econ", df.all2020$field)
#Set colours and reference date
colours2 = c("#7fbf7b","#af8dc3") #Set colours
who <- "2020-03-11" #Date the WHO declared COVID-19 a pandemic, for reference
#Physics
#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "physics") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))
p31 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="physics, all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "physics") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm.physics <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.physics)
Anova(lm.physics, type=3)
#Math
#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "math") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))
p32 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="math, all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "math") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm.math <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.math)
Anova(lm.math, type=3)
#Computer science
#Aggregate data by week for figure
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "cs") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
colnames(arxiv.long) <- c("week", "gender", "n")
arxiv.long$gender <- factor(arxiv.long$gender, rev(levels(as.factor(arxiv.long$gender))))
p33 <- ggplot(data=subset(arxiv.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="cs, all authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "cs") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm.cs <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.cs)
Anova(lm.cs, type=3)
p34 <- plot_grid(p31, p32, p33, nrow=1)
p34
save_plot("early2020_by_field.png", p34, base_height=4, base_width=10, dpi=600)
#Economics and quantitative finance
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "econ") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm.econ <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.econ)
Anova(lm.econ, type=3)
#Quantitative biology
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "q-bio") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm.qbio <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.qbio)
Anova(lm.qbio, type=3)
#Statistics
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "stat") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm.stat <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.stat)
Anova(lm.stat, type=3)
#Electrical engineering and systems science
#Aggregate data by day for model
arxiv <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & field == "eess") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.long <- gather(arxiv, gender, n, female.n:male.n)
arxiv.long$day_of_week <- wday(arxiv.long$`as.Date(submitted)`, label=TRUE)
arxiv.long$date <- as.numeric(arxiv.long$`as.Date(submitted)`)-18261
arxiv.long$day_of_week <- factor(arxiv.long$day_of_week, ordered = FALSE)
#Fit mdoel
lm.eess <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.long)
summary(lm.eess)
Anova(lm.eess, type=3)
```
The gender gap in authorships is growing in physics, math, and computer science.
### Comparing single-authored arXiv preprint submissions in the months before and during the COVID-19 pandemic, by gender
Again, what about for sole authorships? How does early 2020 compare to during the pandemic?
```{r Early 2020 arXiv sole author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
arxiv.sole <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & author.n==1) %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE))))
arxiv.sole.long <- gather(arxiv.sole, gender, n, female.n:male.n)
colnames(arxiv.sole.long) <- c("week", "gender", "n")
arxiv.sole.long$gender <- factor(arxiv.sole.long$gender, rev(levels(as.factor(arxiv.sole.long$gender))))
p7 <- ggplot(data=subset(arxiv.sole.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+labs(title="arXiv", subtitle="sole authorships")+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p7
#Model
arxiv.sole <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & author.n == 1) %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.n, na.rm=TRUE), male.n=sum(male.n, na.rm=TRUE)))) #Summarize by month
arxiv.sole.long <- gather(arxiv.sole, gender, n, female.n:male.n)
arxiv.sole.long$day_of_week <- wday(arxiv.sole.long$`as.Date(submitted)`, label=TRUE)
arxiv.sole.long$date <- as.numeric(arxiv.sole.long$`as.Date(submitted)`)-18261
arxiv.sole.long$day_of_week <- factor(arxiv.sole.long$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)
lm2 <- lm(sqrt(n)~date*gender+day_of_week, data=arxiv.sole.long)
summary(lm2)
Anova(lm2, type=3)
#plot(lm2)
```
Again, the number of preprints single-authored by men is growing faster than the number of preprints single-authored by women.
### Comparing arXiv preprint submissions by authorship position in the months before and during the COVID-19 pandemic, by gender
Let's do the same for first, middle, and last authorships on multi-authored preprints.
##### First authorships
```{r Early 2020 arXiv first author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
arxiv.first <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(first.author.gender)) %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1)), first.author.gender) %>% summarize(n=n())))
colnames(arxiv.first) <- c("week", "gender", "n")
arxiv.first$gender <- factor(arxiv.first$gender, rev(levels(as.factor(arxiv.first$gender))))
p8 <- ggplot(data=subset(arxiv.first, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date("2020-03-11")), linetype="dashed")+labs(title="arXiv", subtitle="first authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p8
#Model
arxiv.first <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(first.author.gender)) %>% group_by(as.Date(submitted), first.author.gender) %>% summarize(n=n())))
arxiv.first$day_of_week <- wday(arxiv.first$`as.Date(submitted)`, label=TRUE)
arxiv.first$date <- as.numeric(arxiv.first$`as.Date(submitted)`)-18261
arxiv.first$day_of_week <- factor(arxiv.first$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)
lm3 <- lm(sqrt(n)~date*first.author.gender+day_of_week, data=arxiv.first)
summary(lm3)
Anova(lm3, type=3)
#plot(lm3)
```
Men first authorships are growing faster than women first authorships.
##### Last authors
```{r Early 2020 arXiv last author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
arxiv.last <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(last.author.gender)) %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1)), last.author.gender) %>% summarize(n=n())))
colnames(arxiv.last) <- c("week", "gender", "n")
arxiv.last$gender <- factor(arxiv.last$gender, rev(levels(as.factor(arxiv.last$gender))))
p9 <- ggplot(data=subset(arxiv.last, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+geom_vline(aes(xintercept=as.Date("2020-03-11")), linetype="dashed")+labs(title="arXiv", subtitle="last authorships")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p9
#Model
arxiv.last <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30" & !is.na(last.author.gender)) %>% group_by(as.Date(submitted), last.author.gender) %>% summarize(n=n())))
arxiv.last$day_of_week <- wday(arxiv.last$`as.Date(submitted)`, label=TRUE)
arxiv.last$date <- as.numeric(arxiv.last$`as.Date(submitted)`)-18261
arxiv.last$day_of_week <- factor(arxiv.last$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)
lm4 <- lm(sqrt(n)~date*last.author.gender+day_of_week, data=arxiv.last)
summary(lm4)
Anova(lm4, type=3)
#plot(lm4)
```
Men last authorships are growing faster than women last authorships.
### Middle authors
```{r Early 2020 arXiv middle author analysis, fig.height=4, fig.width=8, message=FALSE, warning=FALSE, dpi=300}
arxiv.middle <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(round_date(as.Date(submitted), unit="weeks", week_start = getOption("lubridate.week.start", 1))) %>% summarize(female.n=sum(female.mid.authors.n, na.rm=TRUE), male.n=sum(male.mid.authors.n, na.rm=TRUE))))
arxiv.middle.long <- gather(arxiv.middle, gender, n, female.n:male.n)
colnames(arxiv.middle.long) <- c("week", "gender", "n")
arxiv.middle.long$gender <- factor(arxiv.middle.long$gender, rev(levels(as.factor(arxiv.middle.long$gender))))
p10 <- ggplot(data=subset(arxiv.middle.long, as.Date(week) >= "2020-01-01" & as.Date(week) <= "2020-06-28"), aes(x=as.Date(week), y=n, color=gender, shape=gender))+geom_point(size=2)+geom_smooth(method="lm", se=FALSE)+ylab("Authorships (no.)")+xlab("Date")+theme_cowplot()+scale_color_manual(name="Gender", labels=c("Men", "Women"), values=colours2)+scale_shape_discrete(name="Gender", labels=c("Men", "Women"))+labs(title="arXiv", subtitle="middle authorships")+geom_vline(aes(xintercept=as.Date(who)), linetype="dashed")+scale_x_date(date_labels = "%b %Y")+theme(axis.text.x=element_text(angle=60, hjust=1), legend.position="top")
p10
#Model
arxiv.middle <- as.data.frame(ungroup(subset(df.all2020, as.Date(submitted) >= "2020-01-01" & as.Date(submitted) <= "2020-06-30") %>% group_by(as.Date(submitted)) %>% summarize(female.n=sum(female.mid.authors.n, na.rm=TRUE), male.n=sum(male.mid.authors.n, na.rm=TRUE))))
arxiv.middle.long <- gather(arxiv.middle, gender, n, female.n:male.n)
arxiv.middle.long$day_of_week <- wday(arxiv.middle.long$`as.Date(submitted)`, label=TRUE)
arxiv.middle.long$day <- as.numeric(arxiv.middle.long$`as.Date(submitted)`)-18261
arxiv.middle.long$day_of_week <- factor(arxiv.middle.long$day_of_week, levels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), ordered = FALSE)
lm5 <- lm(sqrt(n)~day*gender+day_of_week, data=arxiv.middle.long)
summary(lm5)
Anova(lm5, type=3)
#plot(lm5)
```
Men middle authorships are growing faster than women middle authorships.
### Omnibus figures
```{r Combine visualizations for omnibus figures, message=FALSE, warning=FALSE, fig.width=16, fig.height=16, dpi=300}
#Year over year
p11 <- plot_grid(p1, p2, p3, p4, p5, align = 'v', axis='l')
p11
save_plot("year-over-year_arxiv.png", p11, base_height=8, base_width=8, dpi=600)
#p11.v2 <- plot_grid(p1, p2, p3, p4, p5, align = 'v', axis='l', nrow=1)
#p11.v2
#save_plot("year-over-year_arxiv.v2.png", p11.v2, base_height=4, base_width=13.5, dpi=600)
p12 <- plot_grid(p6, p7, p8, p9, p10, align='v', axis='l')
p12
save_plot("early2020_arxiv.png", p12, base_height=8, base_width=10, dpi=600)
#p12.v2 <- plot_grid(p6, p7, p8, p9, p10, align='v', axis='l', nrow=1)
#p12.v2
#save_plot("early2020_arxiv.v2.png", p12.v2, base_height=4.5, base_width=16, dpi=600)
```
## bioRxiv submissions
Next, we scraped submission data from bioRxiv, the main preprint server for biology. We used the rbiorxiv package, see:
Fraser, N (2020). rbiorxiv. R package, https://github.com/nicholasmfraser/rbiorxiv
We scraped the same date ranges as for the arXiv analysis, above.
```{r Scrape biorxiv, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
#Get all submissions between Jan 1, 2020 and April 22, 2020
df.b.2020 <- biorxiv_content(from = "2020-01-01", to = "2020-04-22", limit = "*", format = "df")
#Get all submissions for March 15 to April 15, 2019
df.b.2019 <- biorxiv_content(from = "2019-03-15", to = "2019-04-15", limit = "*", format = "df")
#Update with April 22 to April 30, 2020 data
df.b.2020.update <- biorxiv_content(from = "2020-04-22", to = "2020-04-30", limit = "*", format = "df")
write.csv(df.b.2020, "Data/biorxiv_2020_data.csv")
write.csv(df.b.2019, "Data/biorxiv_2019_data.csv")
write.csv(df.b.2020.update, "Data/biorxiv_2020_update_data.csv")
#Update with May 1 to June 30, 2020 data
df.b.2020.may <- biorxiv_content(from = "2020-05-01", to = "2020-05-15", limit = "*", format = "df")
df.b.2020.may2 <- biorxiv_content(from = "2020-05-16", to = "2020-05-31", limit = "*", format = "df")
df.b.2020.june <- biorxiv_content(from = "2020-06-01", to = "2020-06-15", limit = "*", format = "df")
df.b.2020.june2 <- biorxiv_content(from = "2020-06-16", to = "2020-06-30", limit = "*", format = "df")
df.b.2020.mayjune <- rbind(df.b.2020.may, df.b.2020.may2, df.b.2020.june, df.b.2020.june2)
write.csv(df.b.2020.mayjune, "Data/biorxiv_2020_MayJune_data.csv")
```
Note that the bioRxiv API only returns first names for corresponding authors, and not for all authors, so we used a work-around to overcome this issue and get first names for all authors. Briefly, we used the package rcrossref to extract full citations (with first names of all authors) from all bioRxiv dois:
Scott Chamberlain, Hao Zhu, Najko Jahn, Carl Boettiger and Karthik Ram (2020). rcrossref:
Client for Various 'CrossRef' 'APIs'. R package version 1.0.0.
https://CRAN.R-project.org/package=rcrossref
(Thanks to @palolili23 for suggesting rcrossref as a work-around!)
```{r bioRxiv first names, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
df.b.2019 <- read.csv("Data/biorxiv_2019_data.csv")
df.b.all2020 <- read.csv("Data/biorxiv_2020_data.csv")
df.b.2020.update <- read.csv("Data/biorxiv_2020_update_data.csv")
df.b.2020.mayjune <- read.csv("Data/biorxiv_2020_MayJune_data.csv")
#Get first names for 2019 data
for (i in 1:length(df.b.2019$doi)){
tmp_doi <- df.b.2019$doi[i]
tryCatch({
if (is.na(tmp_doi)) next()
df.b.2019$bib[i] <- cr_cn(dois=tmp_doi)
tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.2019$bib[i])
tmp_authors <- sub("\\},.*", "", tmp_authors)
df.b.2019$authors_full[i] <- tmp_authors
}, error=function(e){})
}
write.csv(df.b.2019, "Data/biorxiv_2019_data_first.names.csv")
#Get first names for updated 2020 data
for (i in 1:length(df.b.2020.update$doi)){
tmp_doi <- df.b.2020.update$doi[i]
tryCatch({
if (is.na(tmp_doi)) next()
df.b.2020.update$bib[i] <- cr_cn(dois=tmp_doi)
tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.2020.update$bib[i])
tmp_authors <- sub("\\},.*", "", tmp_authors)
df.b.2020.update$authors_full[i] <- tmp_authors
}, error=function(e){})
}
write.csv(df.b.2020.update, "Data/biorxiv_2020_update_data_first.names.csv")
#Get first names for 2020 data
for (i in 1:5000){
tmp_doi <- df.b.all2020$doi[i]
tryCatch({
if (is.na(tmp_doi)) next()
df.b.all2020$bib[i] <- cr_cn(dois=tmp_doi)
tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.all2020$bib[i])
tmp_authors <- sub("\\},.*", "", tmp_authors)
df.b.all2020$authors_full[i] <- tmp_authors
}, error=function(e){})
}
for (i in 5001:10000){
tmp_doi <- df.b.all2020$doi[i]
tryCatch({
if (is.na(tmp_doi)) next()
df.b.all2020$bib[i] <- cr_cn(dois=tmp_doi)
tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.all2020$bib[i])
tmp_authors <- sub("\\},.*", "", tmp_authors)
df.b.all2020$authors_full[i] <- tmp_authors
}, error=function(e){})
}
for (i in 10000:length(df.b.all2020$doi)){
tmp_doi <- df.b.all2020$doi[i]
tryCatch({
if (is.na(tmp_doi)) next()
df.b.all2020$bib[i] <- cr_cn(dois=tmp_doi)
tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.all2020$bib[i])
tmp_authors <- sub("\\},.*", "", tmp_authors)
df.b.all2020$authors_full[i] <- tmp_authors
}, error=function(e){})
}
write.csv(df.b.all2020, "Data/biorxiv_2020_data_first.names.csv")
#Get first names for May/June 2020 data
for (i in 1:length(df.b.2020.mayjune$doi)){
tmp_doi <- df.b.2020.mayjune$doi[i]
tryCatch({
if (is.na(tmp_doi)) next()
df.b.2020.mayjune$bib[i] <- cr_cn(dois=tmp_doi)
tmp_authors <- sub(".*\n\tauthor = \\{", "", df.b.2020.mayjune$bib[i])
tmp_authors <- sub("\\},.*", "", tmp_authors)
df.b.2020.mayjune$authors_full[i] <- tmp_authors
}, error=function(e){})
}
write.csv(df.b.2020.mayjune, "Data/biorxiv_2020_MayJune_data_first.names.csv")
```
We inferred the gender of bioRxiv preprint authors, as above.
```{r bioRxiv gender, message=FALSE, warning=FALSE, eval=FALSE}
#Not run
df.b.2019 <- read.csv("Data/biorxiv_2019_data_first.names.csv")
df.b.all2020 <- read.csv("Data/biorxiv_2020_data_first.names.csv")
df.b.2020.update <- read.csv("Data/biorxiv_2020_update_data_first.names.csv")
df.b.2020.mayjune <- read.csv("Data/biorxiv_2020_MayJune_data_first.names.csv")
df.b.full <- rbind(df.b.2019, subset(df.b.all2020, as.Date(date) >= "2020-03-15" & as.Date(date) <= "2020-04-15")) #Make year comparison, subsetting 2020 data to just March 15 to April 15
df.b.all2020 <- rbind(df.b.all2020, df.b.2020.update, df.b.2020.mayjune) #Make early 2020 dataset
df.b.full$year <- as.factor(year(as.Date(df.b.full$date))) #Extract year
df.b.all2020$year <- as.factor(year(as.Date(df.b.all2020$date)))
split.b.names <- function(x){strsplit(as.character(x), " and ", fixed=TRUE)} #Function to split strings of author names
last.b.author <- function(x){gsub(".*\\band\\b", "", as.character(x))} #Function to extract last author
first.b.author <- function(x){gsub("\\band\\b.*", "", as.character(x))} #Function to extract first author
#For the year-over-year dataset
df.b.full$split.names <- lapply(df.b.full$authors_full, split.b.names) #Apply functions
df.b.full$first.author <- lapply(df.b.full$authors_full, first.b.author)
df.b.full$last.author <- lapply(df.b.full$authors_full, last.b.author)
all_first_names <- word(unlist(df.b.full$split.names),1) #Make a list of all author first names
gender <- gender(all_first_names, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)]) #Keep only unique names
tmp <- NULL
for(i in 1:length(df.b.full$authors_full)){
tmp <- as.data.frame(word(unlist(df.b.full$split.names[[i]]), 1))
colnames(tmp) <- "name"
tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
df.b.full$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
df.b.full$female.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}
#Predict first author gender (omits sole authors)
df.b.full$author.n <- str_count(df.b.full$authors, pattern = "\\;")+1 #Count author number
df.b.full$first.author.first.name <- ifelse(df.b.full$author.n > 1, word(trimws(as.character(df.b.full$first.author, 1))), NA)
gender <- gender(df.b.full$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.full$first.author.gender <- getgender[df.b.full$first.author.first.name]
#Predict last author gender (omits sole authors)
df.b.full$last.author.first.name <- ifelse(df.b.full$author.n > 1, word(trimws(as.character(df.b.full$last.author, 1))), NA)
gender <- gender(df.b.full$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.full$last.author.gender <- getgender[df.b.full$last.author.first.name]
#Count middle authors for each gender (omits sole authors)
df.b.full$female.mid.authors.n <- ifelse(df.b.full$author.n > 1, (df.b.full$female.n - ifelse(df.b.full$first.author.gender %in% "female", 1, 0) - ifelse(df.b.full$last.author.gender %in% "female", 1, 0)), 0)
df.b.full$male.mid.authors.n <- ifelse(df.b.full$author.n > 1, (df.b.full$male.n - ifelse(df.b.full$first.author.gender %in% "male", 1, 0) - ifelse(df.b.full$last.author.gender %in% "male", 1, 0)), 0)
df.b.full <- df.b.full[!duplicated(df.b.full),] #Remove duplicated rows, if any
df.b.full.output <- as.data.frame(apply(df.b.full,2,as.character))
write.csv(df.b.full.output, "Data/biorxiv_full_authors_gender.csv") #Save data
#For the 2020 dataset
df.b.all2020$split.names <- lapply(df.b.all2020$authors_full, split.b.names) #Apply function
df.b.all2020$first.author <- lapply(df.b.all2020$authors_full, first.b.author)
df.b.all2020$last.author <- lapply(df.b.all2020$authors_full, last.b.author)
all_first_names <- word(unlist(df.b.all2020$split.names),1) #Make a list of all author first names
gender <- gender(all_first_names, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)]) #Keep only unique names
tmp <- NULL
for(i in 1:length(df.b.all2020$authors_full)){
tmp <- as.data.frame(word(unlist(df.b.all2020$split.names[[i]]), 1))
colnames(tmp) <- "name"
tmp <- merge(tmp, gender, by="name", all.x=TRUE, all.y=FALSE)
df.b.all2020$male.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "male")))), na.rm=TRUE)
df.b.all2020$female.n[i] <- sum(as.numeric(str_count(as.character(tmp$gender), pattern = paste(sprintf("\\b%s\\b", "female")))), na.rm=TRUE)
}
#Predict first author gender (omits sole authors)
df.b.all2020$author.n <- str_count(df.b.all2020$authors, pattern = "\\;")+1 #Count author number
df.b.all2020$first.author.first.name <- word(df.b.all2020$first.author, 1)
gender <- gender(df.b.all2020$first.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.all2020$first.author.gender <- getgender[df.b.all2020$first.author.first.name]
#Predict last author gender (omits sole authors)
df.b.all2020$last.author.first.name <- ifelse(df.b.all2020$author.n > 1, word(trimws(as.character(df.b.all2020$last.author, 1))), NA)
gender <- gender(df.b.all2020$last.author.first.name, method = "ssa") #Predict gender
gender <- unique(gender[ , c(1,2,4)])
getgender <- gender$gender
names(getgender) <- gender$name
df.b.all2020$last.author.gender <- getgender[df.b.all2020$last.author.first.name]
#Count middle authors for each gender
df.b.all2020$female.mid.authors.n <- ifelse(df.b.all2020$author.n > 1, (df.b.all2020$female.n - ifelse(df.b.all2020$first.author.gender %in% "female", 1, 0) - ifelse(df.b.all2020$last.author.gender %in% "female", 1, 0)),0)
df.b.all2020$male.mid.authors.n <- ifelse(df.b.all2020$author.n > 1, (df.b.all2020$male.n - ifelse(df.b.all2020$first.author.gender %in% "male", 1, 0) - ifelse(df.b.all2020$last.author.gender %in% "male", 1, 0)), 0)
df.b.all2020 <- df.b.all2020[!duplicated(df.b.all2020),] #Remove duplicated rows, if any
df.b.all2020.output <- as.data.frame(apply(df.b.all2020,2,as.character))
write.csv(df.b.all2020.output, "Data/biorxiv_full_authors_2020_gender.csv") #Save data
```
Next we calculated some summary statistics for the bioRxiv dataset.
```{r Summary statistics for biorxiv, message=FALSE, warning=FALSE, fig.width=8, fig.height=4,dpi=600}
df.b.full <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/biorxiv_full_authors_gender.csv")
df.b.all2020 <- read.csv("~/Dropbox/Megan2020/Pandemic Penalty/biorxiv_full_authors_2020_gender.csv")
all.biorxiv <- rbind(df.b.all2020, df.b.full) #Combine datasets
all.biorxiv <- all.biorxiv[!duplicated(all.biorxiv), ] #Remove duplicates
total.b.preprints <- length(all.biorxiv$doi) #Total number of preprints