nationality->etymology / describe potential future research directions #29
Merged
Changes from 5 commits (7 commits in total)

Commits:
90477bc nationality->etymology / describe limitations (cgreene)
1672d6b some rephrasing (trangdata)
f6e93dd typo (trangdata)
7a78023 revert (cgreene)
e687b5b more precisely describe how wikipedia annotates the first sentence of… (cgreene)
5a5c821 Update content/20.results.md (cgreene)
1a285f3 etymology -> origin (cgreene)
@@ -95,27 +95,29 @@ Of 411 ISCB honorees, wru fails to provide race/ethnicity predictions for 98 nam
Of 34,050 corresponding authors, 40 were missing a last name in the paper metadata, and 8,770 had a last name for which wru did not provide predictions.
One limitation of wru and other methods that infer race, ethnicity, or nationality from last names is the potentially inaccurate prediction for scientists who changed their last name during marriage, a practice more common among women than men.

-### Estimation of Nationality
+### Estimation of Name Etymology

To complement wru's race and ethnicity estimation, we developed a model to predict geographical origins of names.
The existing Python package ethnicolr [@arxiv:1805.02109] produces reasonable predictions, but its international representation in the data curated from Wikipedia in 2009 [@doi:10.1145/1557019.1557032] is still limited.
-For instance, 76% of the names in ethnicolr's Wikipedia dataset are European in origin, and the dataset contains remarkably fewer Asian, African, and Middle Eastern names compared to that of wru.
+For instance, 76% of the names in ethnicolr's Wikipedia dataset are European in origin, and the dataset contains remarkably fewer Asian, African, and Middle Eastern names than wru.

To address the limitations of ethnicolr, we built a similar classifier, a Long Short-term Memory (LSTM) neural network, to infer the region of origin from patterns in the sequences of letters in full names.
We applied this model on an updated, approximately 4.5 times larger training dataset called Wiki2019 (described below).
We tested multiple character sequence lengths and, based on this comparison, selected tri-characters for the primary results described in this work.
We trained our prediction model on 80% of the Wiki2019 dataset and evaluated its performance using the remaining 20%.
This model, which we term Wiki2019-LSTM, is available in the online file [`LSTM.h5`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/models/LSTM.h5).

-To generate a training dataset for nationality prediction, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people), which contained approximately 930,000 pages at the time of processing in November 2019.
-This category reflects a modern naming landscape.
-It is regularly curated and allowed us to avoid pages related to non-persons.
-For each Wikipedia page, we used two strategies to find a full birth name and nationality for that person.
+To generate a training dataset for name etymology prediction that reflects a modern naming landscape, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people).
+This category, which contained approximately 930,000 pages at the time of processing in November 2019, is regularly curated and allowed us to avoid pages related to non-persons.
+For each Wikipedia page, we used two strategies to find a full birth name and location context for that person.
First, we used information from the personal details sidebar; the information in this sidebar varied widely but often contained a full name and a place of birth.
Second, in the body of the text of most English-language biographical Wikipedia pages, the first sentence usually begins with, for example, "John Edward Smith (born 1 January 1970) is an American novelist known for ..."
-We used regular expressions to parse out the person's name from this structure and checked that the expression after "is a" matched a list of possible nationalities.
+This structure comes from editor [guidance on biography articles](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Biography#Context) and is designed to capture:
+> ... the country of which the person is a citizen, national or permanent resident, or if the person is notable mainly for past events, the country where the person was a citizen, national or permanent resident when the person became notable.
Review comment: This is helpful to have here!
+
+We used regular expressions to parse out the person's name from this structure and checked that the expression after "is a" matched a list of nationalities.
We were able to define a name and nationality for 708,493 people by using the union of these strategies.
-Our Wikipedia-based process returned a nationality or country of origin, which was more fine-grained than the broader regional patterns that we sought to examine among honorees and authors.
+This process produced country labels that were more fine-grained than the broader regional patterns that we sought to examine among honorees and authors.
We initially grouped names by continent, but later decided to model our categorization after the hierarchical nationality taxonomy used by [NamePrism](http://www.name-prism.com/about) [@doi:10.1145/3132847.3133008].
Consequently, we used the following categories: Hispanic (including Latin America and Iberia), African, Israeli, Muslim, South Asian, East Asian, European (non-British, non-Iberian), and Celtic English (including US, Canada, and Australia).
Table @tbl:example_names shows the size of the training set for each of these regions as well as a few examples of PubMed author names that had at least 90% prediction probability in that region.
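
Editorial sketch (not part of the pull request): the hunk above mentions tri-character sequences and an 80%/20% train/evaluation split. The snippet below illustrates one way those two steps could look. The function names, the use of overlapping windows, the lowercasing, and the fixed seed are all assumptions made for illustration; the diff does not show how the authors implemented either step.

```python
import random


def char_ngrams(name, n=3):
    """Split a name into overlapping character n-grams (tri-characters when n=3).

    Overlapping windows and lowercasing are assumptions, not the authors' stated method.
    """
    text = name.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle records reproducibly and hold out test_fraction of them for evaluation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    print(char_ngrams("Smith"))
    train, test = train_test_split(range(10))
    print(len(train), len(test))
```

In this sketch the n-grams would then be mapped to integer indices before being fed to a sequence model; that encoding step is omitted because the diff gives no detail on it.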
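
Editorial sketch (not part of the pull request): the second extraction strategy in the hunk above parses the first sentence of a biography and checks the word after "is a" against a list of nationalities. A minimal version of that idea is shown below. The regular expression, the `parse_first_sentence` helper, and the four-entry `NATIONALITIES` set are hypothetical; the authors' actual patterns and nationality list are not shown in the diff, and this version does not handle multi-word demonyms.

```python
import re

# Hypothetical, heavily abbreviated list; the real list is not shown in the diff.
NATIONALITIES = {"American", "British", "Japanese", "Spanish"}

FIRST_SENTENCE = re.compile(
    r"^(?P<name>[^()]+?)\s*"               # name, up to an optional parenthetical
    r"(?:\([^)]*\)\s*)?"                   # optional "(born 1 January 1970)"
    r"is an? (?P<demonym>[A-Z][\w-]*)"     # "is a"/"is an" followed by a capitalized word
)


def parse_first_sentence(sentence):
    """Return (name, nationality) if the sentence fits the pattern, else None."""
    match = FIRST_SENTENCE.match(sentence)
    if match is None:
        return None
    demonym = match.group("demonym")
    if demonym not in NATIONALITIES:
        return None  # the word after "is a(n)" is not a known nationality
    return match.group("name").strip(), demonym


if __name__ == "__main__":
    example = ('John Edward Smith (born 1 January 1970) is an American '
               'novelist known for ...')
    print(parse_first_sentence(example))
```

Requiring a match against a fixed list is what filters out sentences such as "X is a novelist", where the word after "is a" is an occupation rather than a nationality.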
@@ -132,7 +134,7 @@ We refer to this dataset as Wiki2019 (available online in [`annotated_names.tsv`
| African | 16,105 | Samuel A Assefa, Nyaradzo M. Mgodi, Stanley Kimbung Mbandi, Oyebode J Oyeyemi, Ezekiel Adebiyi |
| Israeli | 4,549 | Tal Vider-Shalit, Itsik Pe'er, Michal Lavidor, Yoav Gothilf, Dvir Netanely |

-Table: **Predicting nationality of names trained on Wikipedia's living people.**
+Table: **Predicting name etymology of names trained on Wikipedia's living people.**
The table lists the 8 grouped regions of countries and the number of living people for each region that the LSTM was trained on.
Example names shows actual author names that received a high prediction for each region.
Full information about which countries comprised each region can be found in the online dataset [`country_to_region.tsv`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/data/country_to_region.tsv).
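
Editorial sketch (not part of the pull request): the manuscript text groups country labels into eight broader regions and points to `country_to_region.tsv` for the full mapping. The snippet below shows how such a lookup could be applied. The column names `Country` and `Region` and the inline sample rows are assumptions for illustration only; the layout of the real file is not shown in the diff. The sample rows use only groupings stated in the manuscript text (US, Canada, and Australia under Celtic English; Latin America and Iberia under Hispanic).

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; the real file's column names and rows may differ.
SAMPLE_TSV = (
    "Country\tRegion\n"
    "United States\tCeltic English\n"
    "Canada\tCeltic English\n"
    "Australia\tCeltic English\n"
    "Mexico\tHispanic\n"
    "Spain\tHispanic\n"
)


def load_country_to_region(tsv_text):
    """Parse tab-separated text into a {country: region} dictionary."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Country"]: row["Region"] for row in reader}


def region_counts(countries, mapping):
    """Count how many country labels fall into each region, skipping unmapped ones."""
    return Counter(mapping[c] for c in countries if c in mapping)


if __name__ == "__main__":
    mapping = load_country_to_region(SAMPLE_TSV)
    print(region_counts(["Spain", "Mexico", "Canada", "Atlantis"], mapping))
```

Skipping unmapped labels silently is a simplification; a real pipeline would likely want to log or count them.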
Review comment: Same. It's misleading to say that the Wikipedia entries contained name etymology information. I'd stick with nationality in the first part of the sentence, and then say name etymology in the second.
Review comment: With that in mind, I think I honestly prefer "name origins" over "name etymology", since the Wiki data will be loosely based on country of origin but not really on linguistics.