Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nationality->etymology / describe potential future research directions #29

Merged
merged 7 commits into from
Feb 1, 2020
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 10 additions & 8 deletions content/10.methods.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,23 +99,25 @@ One limitation of wru and other methods that infer race, ethnicity, or nationali

To complement wru's race and ethnicity estimation, we developed a model to predict geographical origins of names.
The existing Python package ethnicolr [@arxiv:1805.02109] produces reasonable predictions, but its international representation in the data curated from Wikipedia in 2009 [@doi:10.1145/1557019.1557032] is still limited.
For instance, 76% of the names in ethnicolr's Wikipedia dataset are European in origin, and the dataset contains remarkably fewer Asian, African, and Middle Eastern names compared to that of wru.
For instance, 76% of the names in ethnicolr's Wikipedia dataset are European in origin, and the dataset contains remarkably fewer Asian, African, and Middle Eastern names than wru.

To address the limitations of ethnicolr, we built a similar classifier, a Long Short-term Memory (LSTM) neural network, to infer the region of origin from patterns in the sequences of letters in full names.
We applied this model on an updated, approximately 4.5 times larger training dataset called Wiki2019 (described below).
We tested multiple character sequence lengths and, based on this comparison, selected tri-characters for the primary results described in this work.
We trained our prediction model on 80% of the Wiki2019 dataset and evaluated its performance using the remaining 20%.
This model, which we term Wiki2019-LSTM, is available in the online file [`LSTM.h5`](https://github.com/greenelab/wiki-nationality-estimate/blob/master/models/LSTM.h5).

To generate a training dataset for name etymology prediction, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people), which contained approximately 930,000 pages at the time of processing in November 2019.
This category reflects a modern naming landscape.
It is regularly curated and allowed us to avoid pages related to non-persons.
For each Wikipedia page, we used two strategies to find a full birth name and name etymology for that person.
To generate a training dataset for name etymology prediction that reflects a modern naming landscape, we scraped the English Wikipedia's category of [Living People](https://en.wikipedia.org/wiki/Category:Living_people).
This category, which contained approximately 930,000 pages at the time of processing in November 2019, is regularly curated and allowed us to avoid pages related to non-persons.
For each Wikipedia page, we used two strategies to find a full birth name and location context for that person.
First, we used information from the personal details sidebar; the information in this sidebar varied widely but often contained a full name and a place of birth.
Second, in the body of the text of most English-language biographical Wikipedia pages, the first sentence usually begins with, for example, "John Edward Smith (born 1 January 1970) is an American novelist known for ..."
We used regular expressions to parse out the person's name from this structure and checked that the expression after "is a" matched a list of possible nationalities.
We were able to define a name and name etymology for 708,493 people by using the union of these strategies.
Our Wikipedia-based process returned a name etymology or country of origin, which was more fine-grained than the broader regional patterns that we sought to examine among honorees and authors.
This structure comes from editor [guidance on biography articles](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Biography#Context) and is designed to capture:
> ... the country of which the person is a citizen, national or permanent resident, or if the person is notable mainly for past events, the country where the person was a citizen, national or permanent resident when the person became notable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is helpful to have here!


We used regular expressions to parse out the person's name from this structure and checked that the expression after "is a" matched a list of nationalities.
We were able to define a name and nationality for 708,493 people by using the union of these strategies.
This process produced country labels that were more fine-grained than the broader regional patterns that we sought to examine among honorees and authors.
We initially grouped names by continent, but later decided to model our categorization after the hierarchical nationality taxonomy used by [NamePrism](http://www.name-prism.com/about) [@doi:10.1145/3132847.3133008].
Consequently, we used the following categories: Hispanic (including Latin America and Iberia), African, Israeli, Muslim, South Asian, East Asian, European (non-British, non-Iberian), and Celtic English (including US, Canada, and Australia).
Table @tbl:example_names shows the size of the training set for each of these regions as well as a few examples of PubMed author names that had at least 90% prediction probability in that region.
Expand Down