From 90477bc0f289e4b99688050d16e5f8ec289cdd9f Mon Sep 17 00:00:00 2001
From: Casey Greene
Date: Fri, 31 Jan 2020 11:23:49 -0500
Subject: [PATCH 1/7] nationality->etymology / describe limitations

---
 content/01.abstract.md     |  4 ++--
 content/02.introduction.md |  6 +++---
 content/10.methods.md      | 12 ++++++------
 content/20.results.md      | 18 +++++++++---------
 content/30.conclusions.md  |  8 +++++++-
 5 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/content/01.abstract.md b/content/01.abstract.md
index 1875b18..a5f6cec 100644
--- a/content/01.abstract.md
+++ b/content/01.abstract.md
@@ -5,6 +5,6 @@ Being invited to deliver a keynote at an international society meeting or named
 We sought to understand the extent to which such recognitions reflected the composition of their corresponding field.
 We collected keynote speaker invitations for the international meetings held by the International Society for Computational Biology as well as the names of Fellows, an honorary group within the society.
 We compared these honorees with last and corresponding author contributions in field-specific journals.
-We used multiple methods to estimate the race, ethnicity, gender, and nationality of authors and the recipients of these honors.
-To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people-nationality pairs from Wikipedia and trained long short-term memory neural networks to make predictions.
+We used multiple methods to estimate the race, ethnicity, gender, and name etymology of authors and the recipients of these honors.
+To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people name etymology pairs from Wikipedia and trained long short-term memory neural networks to make predictions.
 Every approach consistently shows that white scientists are overrepresented among speakers and honorees, while scientists of color are underrepresented.
diff --git a/content/02.introduction.md b/content/02.introduction.md index 162f0b4..02bb286 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -13,12 +13,12 @@ Gender bias appears to also influence funding decisions: an examination of scori Challenges extend beyond gender: an analysis of awards at the NIH found that proposals by Asian, black or African-American applicants were less likely to be funded than those by white applicants [@doi:10.1126/science.1196783]. There are also potential interaction effects between gender and race or ethnicity that may particularly affect women of color's efforts to gain NIH funding [@doi:10.1097/ACM.0000000000001278]. -We sought to understand the extent to which honors and high-profile speaking invitations were distributed equitably among gender, race/ethnicity, and nationality groups by an international society and its associated meetings. +We sought to understand the extent to which honors and high-profile speaking invitations were distributed equitably among gender, race/ethnicity, and name etymology groups by an international society and its associated meetings. As computational biologists, we focused on the [International Society for Computational Biology](https://www.iscb.org/) (ISCB), its honorary Fellows as well as its affiliated international meetings: [Intelligent Systems for Molecular Biology](https://en.wikipedia.org/wiki/Intelligent_Systems_for_Molecular_Biology) (ISMB), [Research in Computational Molecular Biology](https://en.wikipedia.org/wiki/Research_in_Computational_Molecular_Biology) (RECOMB), and [Pacific Symposium on Biocomputing](https://psb.stanford.edu/) (PSB). -We used multiple methods to predict the gender, race/ethnicity, and nationality of honorees. +We used multiple methods to predict the gender, race/ethnicity, and name etymology of honorees. Existing methods were relatively US-centric because most of the data was derived in whole or in part from the US Census. 
-We scraped more than 700,000 entries from English-language Wikipedia that contained nationality information to complement these existing methods and built multiple machine learning classifiers to predict nationality. +We scraped more than 700,000 entries from English-language Wikipedia that contained name etymology information to complement these existing methods and built multiple machine learning classifiers to predict name etymology. We also examined the last and corresponding authors for publications in ISCB partner journals to establish a field-specific baseline using the same metrics. The results were consistent across all approaches: we found a dearth of non-white speakers and honorees. The lack of Asian scientists among keynote speakers and Fellows was particularly pronounced when compared against the field-specific background. diff --git a/content/10.methods.md b/content/10.methods.md index fd949b4..da948fe 100644 --- a/content/10.methods.md +++ b/content/10.methods.md @@ -95,7 +95,7 @@ Of 411 ISCB honorees, wru fails to provide race/ethnicity predictions for 98 nam Of 34,050 corresponding authors, 40 were missing a last name in the paper metadata, and 8,770 had a last name for which wru did not provide predictions. One limitation of wru and other methods that infer race, ethnicity, or nationality from last names is the potentially inaccurate prediction for scientists who changed their last name during marriage, a practice more common among women than men. -### Estimation of Nationality +### Estimation of Name Etymology To complement wru's race and ethnicity estimation, we developed a model to predict geographical origins of names. The existing Python package ethnicolr [@arxiv:1805.02109] produces reasonable predictions, but its international representation in the data curated from Wikipedia in 2009 [@doi:10.1145/1557019.1557032] is still limited. 
@@ -107,15 +107,15 @@ We tested multiple character sequence lengths and, based on this comparison, sel We trained our prediction model on 80% of the Wiki2019 dataset and evaluated its performance using the remaining 20%. This model, which we term Wiki2019-LSTM, is available in the online file [`LSTM.h5`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/models/LSTM.h5). -To generate a training dataset for nationality prediction, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people), which contained approximately 930,000 pages at the time of processing in November 2019. +To generate a training dataset for name etymology prediction, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people), which contained approximately 930,000 pages at the time of processing in November 2019. This category reflects a modern naming landscape. It is regularly curated and allowed us to avoid pages related to non-persons. -For each Wikipedia page, we used two strategies to find a full birth name and nationality for that person. +For each Wikipedia page, we used two strategies to find a full birth name and name etymology for that person. First, we used information from the personal details sidebar; the information in this sidebar varied widely but often contained a full name and a place of birth. Second, in the body of the text of most English-language biographical Wikipedia pages, the first sentence usually begins with, for example, "John Edward Smith (born 1 January 1970) is an American novelist known for ..." We used regular expressions to parse out the person's name from this structure and checked that the expression after "is a" matched a list of possible nationalities. -We were able to define a name and nationality for 708,493 people by using the union of these strategies. 
-Our Wikipedia-based process returned a nationality or country of origin, which was more fine-grained than the broader regional patterns that we sought to examine among honorees and authors. +We were able to define a name and name etymology for 708,493 people by using the union of these strategies. +Our Wikipedia-based process returned a name etymology or country of origin, which was more fine-grained than the broader regional patterns that we sought to examine among honorees and authors. We initially grouped names by continent, but later decided to model our categorization after the hierarchical nationality taxonomy used by [NamePrism](http://www.name-prism.com/about) [@doi:10.1145/3132847.3133008]. Consequently, we used the following categories: Hispanic (including Latin America and Iberia), African, Israeli, Muslim, South Asian, East Asian, European (non-British, non-Iberian), and Celtic English (including US, Canada, and Australia). Table @tbl:example_names shows the size of the training set for each of these regions as well as a few examples of PubMed author names that had at least 90% prediction probability in that region. @@ -132,7 +132,7 @@ We refer to this dataset as Wiki2019 (available online in [`annotated_names.tsv` | African | 16,105 | Samuel A Assefa, Nyaradzo M. Mgodi, Stanley Kimbung Mbandi, Oyebode J Oyeyemi, Ezekiel Adebiyi | | Israeli | 4,549 | Tal Vider-Shalit, Itsik Pe'er, Michal Lavidor, Yoav Gothilf, Dvir Netanely | -Table: **Predicting nationality of names trained on Wikipedia's living people.** +Table: **Predicting name etymology of names trained on Wikipedia's living people.** The table lists the 8 grouped regions of countries and the number of living people for each region that the LSTM was trained on. Example names shows actual author names that received a high prediction for each region. 
Full information about which countries comprised each region can be found in the online dataset [`country_to_region.tsv`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/data/country_to_region.tsv). diff --git a/content/20.results.md b/content/20.results.md index 86df8d0..d4741a9 100644 --- a/content/20.results.md +++ b/content/20.results.md @@ -57,20 +57,20 @@ Separating honoree results by honor category did not reveal any clear difference We directly compared honoree and author results from 1997 to 2020 for the predicted proportion of white, Asian, and other categories (Fig. {@fig:racial_makeup}E). We found that white honorees have been significantly overrepresented and Asian honorees have been significantly underrepresented in most years. -### Predicting Nationality with LSTM Neural Networks and Wikipedia +### Predicting Name Etymology with LSTM Neural Networks and Wikipedia -We next aimed to predict the nationality of honorees and authors. -We constructed a training dataset with more than 700,000 name-nationality pairs by parsing the English-language Wikipedia. -We trained a LSTM neural network on n-grams to predict nationality. +We next aimed to predict the name etymology of honorees and authors. +We constructed a training dataset with more than 700,000 name-etymology pairs by parsing the English-language Wikipedia. +We trained a LSTM neural network on n-grams to predict name etymology. We found similar performance across 1, 2, and 3-grams; however, the classifier required fewer epochs to train with 3-grams so we used this length in the model that we term Wiki2019-LSTM. Our Wiki2019-LSTM returns, for each given name, a probability of that name originating from each of the specified eight regions. We observed a multiclass area under the receiver operating characteristic curve (AUC) score of 95.4% for the classifier, indicating that the classifier can recapitulate name origins with high sensitivity and specificity. 
For each individual region, the high AUC (above 94%, Fig. {@fig:wiki2019_lstm}A) suggests that our classifier was sufficient for use in a broad-scale examination of disparities. We also observed that the model was well calibrated (Fig. {@fig:wiki2019_lstm}B). -We also examined potential systematic errors between pairs of nationality groupings with a confusion heatmap and did not find off-diagonal enrichment for any pairing (Fig. {@fig:wiki2019_lstm}C). +We also examined potential systematic errors between pairs of name etymology groupings with a confusion heatmap and did not find off-diagonal enrichment for any pairing (Fig. {@fig:wiki2019_lstm}C). -![The Wiki2019-LSTM model ranks the true nationality of Wikipedia names highly on testing data. -The area under the ROC curve is above 94% for each category, showing strong performance regardless of nationality (A). +![The Wiki2019-LSTM model ranks the true name etymology of Wikipedia names highly on testing data. +The area under the ROC curve is above 94% for each category, showing strong performance regardless of name etymology (A). A calibration curve, computed with the caret R package, shows consistency between the predicted probabilities (midpoints of each fixed-width bin) and the observed fraction of names in each bin (B). Heatmap showing whether names from a given region (x-axis) received higher (purple) or lower (green) predictions for each region (y-axis) than would be expected by region prevalence alone (C). The values represent log~2~ fold change between the average predicted probability and the prevalence of the corresponding predicted region in the testing dataset (null). @@ -80,7 +80,7 @@ For off-diagonal cells, darker green indicates a lower mean prediction compared For example, the classifier does not often mistake Hispanic names as Israeli, but is more prone to mistaking Muslim names as South Asian. 
](https://raw.githubusercontent.com/greenelab/iscb-diversity/master/figs/fig_3.png){#fig:wiki2019_lstm width=100%} -### Assessing the Nationality Diversity of Authors and Honorees +### Assessing the Name Etymology Diversity of Authors and Honorees We applied our Wiki2019-LSTM model to both our computational biology honorees dataset and our dataset of corresponding authors. We found that the proportion of authors in the Celtic English categories had decreased (Fig. {@fig:region_breakdown}A, left), particularly for papers published in _Bioinformatics_ and _BMC Bioinformatics_ (see [notebook](https://greenelab.github.io/iscb-diversity/11.visualize-nationality.html#sup_fig_s4)). @@ -91,7 +91,7 @@ When we directly compared honoree composition with PubMed, we observed discrepan Outside of the primary range of our analyses, the two names of 2020 PSB keynote speakers were predicted to be of Celtic English origin (65% probability) and African origin (99% probability), respectively. -![Compared to the name collection of Pubmed authors, Celtic English honorees are overrepresented while East Asian honorees are underrepresented. Estimated composition of nationality prediction over the years of +![Compared to the name collection of Pubmed authors, Celtic English honorees are overrepresented while East Asian honorees are underrepresented. Estimated composition of name etymology prediction over the years of (A, left) all Pubmed computational biology and bioinformatics journal authors, and (A, right) all ISCB Fellows and keynote speakers was computed as the average of prediction probabilities of Pubmed articles or ISCB honorees each year. 
diff --git a/content/30.conclusions.md b/content/30.conclusions.md index de76559..5c957d1 100644 --- a/content/30.conclusions.md +++ b/content/30.conclusions.md @@ -9,7 +9,13 @@ In these cases, our analyses may substantially understate the extent to which mi Biases in authorship practices may also result in our underestimation of the composition of minoritized scientists within the field. We estimate the composition of the field using corresponding author status, but in neuroscience [@doi:10.1101/275362] and other disciplines [@doi:10.1371/journal.pbio.2004956] women are underrepresented among such authors. Such an effect would cause us to underestimate the number of women in the field. -Though this effect has been studied with respect to gender, we are not aware of similar work examining race, ethnicity, or nationality. +Though this effect has been studied with respect to gender, we are not aware of similar work examining race, ethnicity, or name etymology. + +We measured honor and authorship rates worldwide, as we focused on an international society and meetings. +In this setting, we observe disparities by name etymology. +A future goal of research should be to understand the basis of the disparities. +It is possible that invitation and honor patterns are driven by geographic factors, biases, or other factors. +Cross-referencing name etymology predictions with author affiliations could make it feasible to disentangle the relationship between how geographic region and name etymology relate to invitation probabilities. An important questions to ask when measuring representation is what the right level of representation is. We suggest that considering equity may be more appropriate than strictly diversity. 
From 1672d6b368740ee142297711f98ae4902248d30e Mon Sep 17 00:00:00 2001
From: lelaboratoire
Date: Fri, 31 Jan 2020 13:52:30 -0500
Subject: [PATCH 2/7] some rephrasing

---
 content/30.conclusions.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/30.conclusions.md b/content/30.conclusions.md
index 5c957d1..cc44285 100644
--- a/content/30.conclusions.md
+++ b/content/30.conclusions.md
@@ -11,11 +11,11 @@ We estimate the composition of the field using corresponding author status, but
 Such an effect would cause us to underestimate the number of women in the field.
 Though this effect has been studied with respect to gender, we are not aware of similar work examining race, ethnicity, or name etymology.
 
-We measured honor and authorship rates worldwide, as we focused on an international society and meetings.
+Focusing on an international society and meetings, we measured honor and authorship rates worldwide.
 In this setting, we observe disparities by name etymology.
-A future goal of research should be to understand the basis of the disparities.
-It is possible that invitation and honor patterns are driven by geographic factors, biases, or other factors.
-Cross-referencing name etymology predictions with author affiliations could make it feasible to disentangle the relationship between how geographic region and name etymology relate to invitation probabilities.
+A future studies are needed to unravel the basis of the disparities.
+It is possible that invitation and honor patterns are driven by not only biases but also geographic or other factors.
+Cross-referencing name etymology predictions with author affiliations could disentangle the relationship between geographic regions, name etymology and invitation probabilities.
 
 An important questions to ask when measuring representation is what the right level of representation is.
 We suggest that considering equity may be more appropriate than strictly diversity.
From f6e93ddbdc31666dd350b2ea93ad8b7d37a90cb6 Mon Sep 17 00:00:00 2001
From: lelaboratoire
Date: Fri, 31 Jan 2020 13:59:01 -0500
Subject: [PATCH 3/7] typo

---
 content/30.conclusions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/30.conclusions.md b/content/30.conclusions.md
index cc44285..5e5b6d9 100644
--- a/content/30.conclusions.md
+++ b/content/30.conclusions.md
@@ -13,7 +13,7 @@ Though this effect has been studied with respect to gender, we are not aware of
 
 Focusing on an international society and meetings, we measured honor and authorship rates worldwide.
 In this setting, we observe disparities by name etymology.
-A future studies are needed to unravel the basis of the disparities.
+Future studies are needed to unravel the basis of the disparities.
 It is possible that invitation and honor patterns are driven by not only biases but also geographic or other factors.
 Cross-referencing name etymology predictions with author affiliations could disentangle the relationship between geographic regions, name etymology and invitation probabilities.
 

From 7a780237737386ad500184654f6b00a1e4204f83 Mon Sep 17 00:00:00 2001
From: Casey Greene
Date: Fri, 31 Jan 2020 14:07:30 -0500
Subject: [PATCH 4/7] revert

---
 content/01.abstract.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/01.abstract.md b/content/01.abstract.md
index a5f6cec..29399d1 100644
--- a/content/01.abstract.md
+++ b/content/01.abstract.md
@@ -6,5 +6,5 @@ We sought to understand the extent to which such recognitions reflected the comp
 We collected keynote speaker invitations for the international meetings held by the International Society for Computational Biology as well as the names of Fellows, an honorary group within the society.
 We compared these honorees with last and corresponding author contributions in field-specific journals.
We used multiple methods to estimate the race, ethnicity, gender, and name etymology of authors and the recipients of these honors. -To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people name etymology pairs from Wikipedia and trained long short-term memory neural networks to make predictions. +To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people with name-nationality pairs from Wikipedia and trained long short-term memory neural networks to make predictions. Every approach consistently shows that white scientists are overrepresented among speakers and honorees, while scientists of color are underrepresented. From e687b5b6c9faef94ede2b9a68111ebff43786c43 Mon Sep 17 00:00:00 2001 From: Casey Greene Date: Fri, 31 Jan 2020 14:45:07 -0500 Subject: [PATCH 5/7] more precisely describe how wikipedia annotates the first sentence of biographies --- content/10.methods.md | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/content/10.methods.md b/content/10.methods.md index da948fe..b56a979 100644 --- a/content/10.methods.md +++ b/content/10.methods.md @@ -99,7 +99,7 @@ One limitation of wru and other methods that infer race, ethnicity, or nationali To complement wru's race and ethnicity estimation, we developed a model to predict geographical origins of names. The existing Python package ethnicolr [@arxiv:1805.02109] produces reasonable predictions, but its international representation in the data curated from Wikipedia in 2009 [@doi:10.1145/1557019.1557032] is still limited. -For instance, 76% of the names in ethnicolr's Wikipedia dataset are European in origin, and the dataset contains remarkably fewer Asian, African, and Middle Eastern names compared to that of wru. +For instance, 76% of the names in ethnicolr's Wikipedia dataset are European in origin, and the dataset contains remarkably fewer Asian, African, and Middle Eastern names than wru. 
To address the limitations of ethnicolr, we built a similar classifier, a Long Short-term Memory (LSTM) neural network, to infer the region of origin from patterns in the sequences of letters in full names. We applied this model on an updated, approximately 4.5 times larger training dataset called Wiki2019 (described below). @@ -107,15 +107,17 @@ We tested multiple character sequence lengths and, based on this comparison, sel We trained our prediction model on 80% of the Wiki2019 dataset and evaluated its performance using the remaining 20%. This model, which we term Wiki2019-LSTM, is available in the online file [`LSTM.h5`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/models/LSTM.h5). -To generate a training dataset for name etymology prediction, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people), which contained approximately 930,000 pages at the time of processing in November 2019. -This category reflects a modern naming landscape. -It is regularly curated and allowed us to avoid pages related to non-persons. -For each Wikipedia page, we used two strategies to find a full birth name and name etymology for that person. +To generate a training dataset for name etymology prediction that reflects a modern naming landscape, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people). +This category, which contained approximately 930,000 pages at the time of processing in November 2019, is regularly curated and allowed us to avoid pages related to non-persons. +For each Wikipedia page, we used two strategies to find a full birth name and location context for that person. First, we used information from the personal details sidebar; the information in this sidebar varied widely but often contained a full name and a place of birth. 
Second, in the body of the text of most English-language biographical Wikipedia pages, the first sentence usually begins with, for example, "John Edward Smith (born 1 January 1970) is an American novelist known for ..." -We used regular expressions to parse out the person's name from this structure and checked that the expression after "is a" matched a list of possible nationalities. -We were able to define a name and name etymology for 708,493 people by using the union of these strategies. -Our Wikipedia-based process returned a name etymology or country of origin, which was more fine-grained than the broader regional patterns that we sought to examine among honorees and authors. +This structure comes from editor [guidance on biography articles](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Biography#Context) and is designed to capture: +> ... the country of which the person is a citizen, national or permanent resident, or if the person is notable mainly for past events, the country where the person was a citizen, national or permanent resident when the person became notable. + +We used regular expressions to parse out the person's name from this structure and checked that the expression after "is a" matched a list of nationalities. +We were able to define a name and nationality for 708,493 people by using the union of these strategies. +This process produced country labels that were more fine-grained than the broader regional patterns that we sought to examine among honorees and authors. We initially grouped names by continent, but later decided to model our categorization after the hierarchical nationality taxonomy used by [NamePrism](http://www.name-prism.com/about) [@doi:10.1145/3132847.3133008]. Consequently, we used the following categories: Hispanic (including Latin America and Iberia), African, Israeli, Muslim, South Asian, East Asian, European (non-British, non-Iberian), and Celtic English (including US, Canada, and Australia). 
Table @tbl:example_names shows the size of the training set for each of these regions as well as a few examples of PubMed author names that had at least 90% prediction probability in that region. From 5a5c8210065d037e7854e277ea6ac9755351b2ef Mon Sep 17 00:00:00 2001 From: Casey Greene Date: Sat, 1 Feb 2020 10:20:17 -0500 Subject: [PATCH 6/7] Update content/20.results.md Co-Authored-By: Trang Le --- content/20.results.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/20.results.md b/content/20.results.md index d4741a9..0e8d9c3 100644 --- a/content/20.results.md +++ b/content/20.results.md @@ -60,7 +60,7 @@ We found that white honorees have been significantly overrepresented and Asian h ### Predicting Name Etymology with LSTM Neural Networks and Wikipedia We next aimed to predict the name etymology of honorees and authors. -We constructed a training dataset with more than 700,000 name-etymology pairs by parsing the English-language Wikipedia. +We constructed a training dataset with more than 700,000 name-nationality pairs by parsing the English-language Wikipedia. We trained a LSTM neural network on n-grams to predict name etymology. We found similar performance across 1, 2, and 3-grams; however, the classifier required fewer epochs to train with 3-grams so we used this length in the model that we term Wiki2019-LSTM. Our Wiki2019-LSTM returns, for each given name, a probability of that name originating from each of the specified eight regions. 
From 1a285f3e89e4fe4c31e4b04aec45edfbec025206 Mon Sep 17 00:00:00 2001
From: Casey Greene
Date: Sat, 1 Feb 2020 10:25:58 -0500
Subject: [PATCH 7/7] etymology -> origin

---
 content/01.abstract.md     |  2 +-
 content/02.introduction.md |  6 +++---
 content/10.methods.md      |  6 +++---
 content/20.results.md      | 16 ++++++++--------
 content/30.conclusions.md  |  6 +++---
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/content/01.abstract.md b/content/01.abstract.md
index 29399d1..a04a68e 100644
--- a/content/01.abstract.md
+++ b/content/01.abstract.md
@@ -5,6 +5,6 @@ Being invited to deliver a keynote at an international society meeting or named
 We sought to understand the extent to which such recognitions reflected the composition of their corresponding field.
 We collected keynote speaker invitations for the international meetings held by the International Society for Computational Biology as well as the names of Fellows, an honorary group within the society.
 We compared these honorees with last and corresponding author contributions in field-specific journals.
-We used multiple methods to estimate the race, ethnicity, gender, and name etymology of authors and the recipients of these honors.
+We used multiple methods to estimate the race, ethnicity, gender, and name origin of authors and the recipients of these honors.
 To address weaknesses in existing approaches, we built a new dataset of more than 700,000 people with name-nationality pairs from Wikipedia and trained long short-term memory neural networks to make predictions.
 Every approach consistently shows that white scientists are overrepresented among speakers and honorees, while scientists of color are underrepresented.
diff --git a/content/02.introduction.md b/content/02.introduction.md index 02bb286..1c9b44c 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -13,12 +13,12 @@ Gender bias appears to also influence funding decisions: an examination of scori Challenges extend beyond gender: an analysis of awards at the NIH found that proposals by Asian, black or African-American applicants were less likely to be funded than those by white applicants [@doi:10.1126/science.1196783]. There are also potential interaction effects between gender and race or ethnicity that may particularly affect women of color's efforts to gain NIH funding [@doi:10.1097/ACM.0000000000001278]. -We sought to understand the extent to which honors and high-profile speaking invitations were distributed equitably among gender, race/ethnicity, and name etymology groups by an international society and its associated meetings. +We sought to understand the extent to which honors and high-profile speaking invitations were distributed equitably among gender, race/ethnicity, and name origin groups by an international society and its associated meetings. As computational biologists, we focused on the [International Society for Computational Biology](https://www.iscb.org/) (ISCB), its honorary Fellows as well as its affiliated international meetings: [Intelligent Systems for Molecular Biology](https://en.wikipedia.org/wiki/Intelligent_Systems_for_Molecular_Biology) (ISMB), [Research in Computational Molecular Biology](https://en.wikipedia.org/wiki/Research_in_Computational_Molecular_Biology) (RECOMB), and [Pacific Symposium on Biocomputing](https://psb.stanford.edu/) (PSB). -We used multiple methods to predict the gender, race/ethnicity, and name etymology of honorees. +We used multiple methods to predict the gender, race/ethnicity, and name origins of honorees. Existing methods were relatively US-centric because most of the data was derived in whole or in part from the US Census. 
-We scraped more than 700,000 entries from English-language Wikipedia that contained name etymology information to complement these existing methods and built multiple machine learning classifiers to predict name etymology. +We scraped more than 700,000 entries from English-language Wikipedia that contained nationality information to complement these existing methods and built multiple machine learning classifiers to predict name origin. We also examined the last and corresponding authors for publications in ISCB partner journals to establish a field-specific baseline using the same metrics. The results were consistent across all approaches: we found a dearth of non-white speakers and honorees. The lack of Asian scientists among keynote speakers and Fellows was particularly pronounced when compared against the field-specific background. diff --git a/content/10.methods.md b/content/10.methods.md index b56a979..2188044 100644 --- a/content/10.methods.md +++ b/content/10.methods.md @@ -95,7 +95,7 @@ Of 411 ISCB honorees, wru fails to provide race/ethnicity predictions for 98 nam Of 34,050 corresponding authors, 40 were missing a last name in the paper metadata, and 8,770 had a last name for which wru did not provide predictions. One limitation of wru and other methods that infer race, ethnicity, or nationality from last names is the potentially inaccurate prediction for scientists who changed their last name during marriage, a practice more common among women than men. -### Estimation of Name Etymology +### Estimation of Name Origin To complement wru's race and ethnicity estimation, we developed a model to predict geographical origins of names. The existing Python package ethnicolr [@arxiv:1805.02109] produces reasonable predictions, but its international representation in the data curated from Wikipedia in 2009 [@doi:10.1145/1557019.1557032] is still limited. 
@@ -107,7 +107,7 @@ We tested multiple character sequence lengths and, based on this comparison, sel We trained our prediction model on 80% of the Wiki2019 dataset and evaluated its performance using the remaining 20%. This model, which we term Wiki2019-LSTM, is available in the online file [`LSTM.h5`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/models/LSTM.h5). -To generate a training dataset for name etymology prediction that reflects a modern naming landscape, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people). +To generate a training dataset for name origin prediction that reflects a modern naming landscape, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people). This category, which contained approximately 930,000 pages at the time of processing in November 2019, is regularly curated and allowed us to avoid pages related to non-persons. For each Wikipedia page, we used two strategies to find a full birth name and location context for that person. First, we used information from the personal details sidebar; the information in this sidebar varied widely but often contained a full name and a place of birth. @@ -134,7 +134,7 @@ We refer to this dataset as Wiki2019 (available online in [`annotated_names.tsv` | African | 16,105 | Samuel A Assefa, Nyaradzo M. Mgodi, Stanley Kimbung Mbandi, Oyebode J Oyeyemi, Ezekiel Adebiyi | | Israeli | 4,549 | Tal Vider-Shalit, Itsik Pe'er, Michal Lavidor, Yoav Gothilf, Dvir Netanely | -Table: **Predicting name etymology of names trained on Wikipedia's living people.** +Table: **Predicting name-origin regions of names trained on Wikipedia's living people.** The table lists the 8 grouped regions of countries and the number of living people for each region that the LSTM was trained on. Example names shows actual author names that received a high prediction for each region. 
Full information about which countries comprised each region can be found in the online dataset [`country_to_region.tsv`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/data/country_to_region.tsv). diff --git a/content/20.results.md b/content/20.results.md index 0e8d9c3..5904d8f 100644 --- a/content/20.results.md +++ b/content/20.results.md @@ -57,20 +57,20 @@ Separating honoree results by honor category did not reveal any clear difference We directly compared honoree and author results from 1997 to 2020 for the predicted proportion of white, Asian, and other categories (Fig. {@fig:racial_makeup}E). We found that white honorees have been significantly overrepresented and Asian honorees have been significantly underrepresented in most years. -### Predicting Name Etymology with LSTM Neural Networks and Wikipedia +### Predicting Name Origin Groups with LSTM Neural Networks and Wikipedia -We next aimed to predict the name etymology of honorees and authors. +We next aimed to predict the name origin groups of honorees and authors. We constructed a training dataset with more than 700,000 name-nationality pairs by parsing the English-language Wikipedia. -We trained a LSTM neural network on n-grams to predict name etymology. +We trained a LSTM neural network on n-grams to predict name origin regions. We found similar performance across 1, 2, and 3-grams; however, the classifier required fewer epochs to train with 3-grams so we used this length in the model that we term Wiki2019-LSTM. Our Wiki2019-LSTM returns, for each given name, a probability of that name originating from each of the specified eight regions. We observed a multiclass area under the receiver operating characteristic curve (AUC) score of 95.4% for the classifier, indicating that the classifier can recapitulate name origins with high sensitivity and specificity. For each individual region, the high AUC (above 94%, Fig. 
{@fig:wiki2019_lstm}A) suggests that our classifier was sufficient for use in a broad-scale examination of disparities. We also observed that the model was well calibrated (Fig. {@fig:wiki2019_lstm}B). -We also examined potential systematic errors between pairs of name etymology groupings with a confusion heatmap and did not find off-diagonal enrichment for any pairing (Fig. {@fig:wiki2019_lstm}C). +We also examined potential systematic errors between pairs of name origin groupings with a confusion heatmap and did not find off-diagonal enrichment for any pairing (Fig. {@fig:wiki2019_lstm}C). -![The Wiki2019-LSTM model ranks the true name etymology of Wikipedia names highly on testing data. -The area under the ROC curve is above 94% for each category, showing strong performance regardless of name etymology (A). +![The Wiki2019-LSTM model performs well on held-out test data. +The area under the ROC curve is above 94% for each category, showing strong performance across origin categories (A). A calibration curve, computed with the caret R package, shows consistency between the predicted probabilities (midpoints of each fixed-width bin) and the observed fraction of names in each bin (B). Heatmap showing whether names from a given region (x-axis) received higher (purple) or lower (green) predictions for each region (y-axis) than would be expected by region prevalence alone (C). The values represent log~2~ fold change between the average predicted probability and the prevalence of the corresponding predicted region in the testing dataset (null). @@ -80,7 +80,7 @@ For off-diagonal cells, darker green indicates a lower mean prediction compared For example, the classifier does not often mistake Hispanic names as Israeli, but is more prone to mistaking Muslim names as South Asian. 
](https://raw.githubusercontent.com/greenelab/iscb-diversity/master/figs/fig_3.png){#fig:wiki2019_lstm width=100%} -### Assessing the Name Etymology Diversity of Authors and Honorees +### Assessing the Name Origin Diversity of Authors and Honorees We applied our Wiki2019-LSTM model to both our computational biology honorees dataset and our dataset of corresponding authors. We found that the proportion of authors in the Celtic English categories had decreased (Fig. {@fig:region_breakdown}A, left), particularly for papers published in _Bioinformatics_ and _BMC Bioinformatics_ (see [notebook](https://greenelab.github.io/iscb-diversity/11.visualize-nationality.html#sup_fig_s4)). @@ -91,7 +91,7 @@ When we directly compared honoree composition with PubMed, we observed discrepan Outside of the primary range of our analyses, the two names of 2020 PSB keynote speakers were predicted to be of Celtic English origin (65% probability) and African origin (99% probability), respectively. -![Compared to the name collection of Pubmed authors, Celtic English honorees are overrepresented while East Asian honorees are underrepresented. Estimated composition of name etymology prediction over the years of +![Compared to the name collection of Pubmed authors, Celtic English honorees are overrepresented while East Asian honorees are underrepresented. Estimated composition of name origin prediction over the years of (A, left) all Pubmed computational biology and bioinformatics journal authors, and (A, right) all ISCB Fellows and keynote speakers was computed as the average of prediction probabilities of Pubmed articles or ISCB honorees each year. 
diff --git a/content/30.conclusions.md b/content/30.conclusions.md index 5e5b6d9..7f901fa 100644 --- a/content/30.conclusions.md +++ b/content/30.conclusions.md @@ -9,13 +9,13 @@ In these cases, our analyses may substantially understate the extent to which mi Biases in authorship practices may also result in our underestimation of the composition of minoritized scientists within the field. We estimate the composition of the field using corresponding author status, but in neuroscience [@doi:10.1101/275362] and other disciplines [@doi:10.1371/journal.pbio.2004956] women are underrepresented among such authors. Such an effect would cause us to underestimate the number of women in the field. -Though this effect has been studied with respect to gender, we are not aware of similar work examining race, ethnicity, or name etymology. +Though this effect has been studied with respect to gender, we are not aware of similar work examining race, ethnicity, or name origins. Focusing on an international society and meetings, we measured honor and authorship rates worldwide. -In this setting, we observe disparities by name etymology. +In this setting, we observe disparities by name origin groups. Future studies are needed to unravel the basis of the disparities. It is possible that invitation and honor patterns are driven by not only biases but also geographic or other factors. -Cross-referencing name etymology predictions with author affiliations could disentangle the relationship between geographic regions, name etymology and invitation probabilities. +Cross-referencing name origin group predictions with author affiliations could disentangle the relationship between geographic regions, name origins and invitation probabilities. An important questions to ask when measuring representation is what the right level of representation is. We suggest that considering equity may be more appropriate than strictly diversity.