Census Name Gender

Repo for building and analyzing a mapping from US first names to gender. Source (Social Security Administration baby names data): https://www.ssa.gov/OACT/babynames/limits.html

As the data and analysis are built out, I will add to the description.

Description

I wanted to explore a challenge facing companies that use machine learning to assess creditworthiness or employment suitability based on non-gender factors. Some of these algorithms have been shown to be biased and discriminatory even though gender was never an input to the algorithm (see the Apple Card and Goldman Sachs, https://observer.com/2019/11/goldman-sachs-bias-detection-apple-card/).

Typically, companies do NOT gather gender information, but evaluating an algorithm for gender bias requires assigning a gender (Male vs. Female) to each record. One way of doing this is to use US name data such as the SSA baby names files, which count how many times each first name was recorded for each gender, by birth year and territory.

Many names are clearly gender typed and always appear as Male or always appear as Female. However, many other names are not clearly male or female.

The purpose of this repository is to develop a model that could be used to determine the gender of a name based on the characteristics of names.

Script Directory

001_mk_data.r - Creates name_gender.rda from the SSA file downloaded from https://www.ssa.gov/OACT/babynames/limits.html. The data is split into territory data (organized by state or US territory and year of birth) and year-of-birth data (organized by year of birth only). Each observation contains a first name and the count of individuals with that name for a given gender.
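To make the two input layouts concrete, here is a minimal parsing sketch. It is written in Python purely for illustration (the repository's scripts are in R), and the row layouts are assumptions about the SSA downloads rather than something this README specifies: territory rows are assumed to look like `AK,F,1910,Mary,14` and year-of-birth rows like `Mary,F,7065`, with the year taken from the file name. `NameRecord` and both function names are hypothetical.

```python
import csv
from io import StringIO
from typing import List, NamedTuple, Optional


class NameRecord(NamedTuple):
    name: str
    sex: str                  # "M" or "F"
    count: int                # number of individuals with this name and sex
    year: int                 # year of birth
    territory: Optional[str]  # None for the year-of-birth (national) data


def parse_territory_rows(text: str) -> List[NameRecord]:
    """Parse rows assumed to be: territory,sex,year,name,count."""
    records = []
    for row in csv.reader(StringIO(text)):
        if not row:  # skip blank lines
            continue
        territory, sex, year, name, count = row
        records.append(NameRecord(name, sex, int(count), int(year), territory))
    return records


def parse_year_rows(text: str, year: int) -> List[NameRecord]:
    """Parse rows assumed to be: name,sex,count (year supplied by the caller)."""
    records = []
    for row in csv.reader(StringIO(text)):
        if not row:  # skip blank lines
            continue
        name, sex, count = row
        records.append(NameRecord(name, sex, int(count), year, None))
    return records
```

Both parsers return the same record type, so later steps can treat territory and year-of-birth data uniformly.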

This script processes the data and creates a gender tag for each name found in the data, based on the following:

Male - All records for the first name were assigned to males in a given year / territory
Female - All records for the first name were assigned to females in a given year / territory
Unknown - Remaining names for a territory and year that were not assigned Male or Female (this is the target for prediction)
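The tagging rule above can be sketched as follows. This is an illustrative Python version, not the code in 001_mk_data.r; the function name `tag_names` and the tuple layout of the input records are assumptions made for the example.

```python
from collections import defaultdict


def tag_names(records):
    """Tag each (name, year, territory) group as Male, Female, or Unknown.

    records: iterable of (name, sex, count, year, territory) tuples,
    where sex is "M" or "F" (hypothetical layout for this sketch).
    """
    # Sum the male and female counts within each name / year / territory group.
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for name, sex, count, year, territory in records:
        counts[(name, year, territory)][sex] += count

    tags = {}
    for key, by_sex in counts.items():
        if by_sex["M"] > 0 and by_sex["F"] == 0:
            tags[key] = "Male"      # only ever recorded as male
        elif by_sex["F"] > 0 and by_sex["M"] == 0:
            tags[key] = "Female"    # only ever recorded as female
        else:
            tags[key] = "Unknown"   # recorded as both: prediction target
    return tags
```

Grouping by year and territory means the same name can be tagged Male in one group and Unknown in another.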

002_mk_features.r - Creates name_gender_features.rda. This takes the data created in 001 and builds out features for use in the model. The script also builds plots with interesting breakdowns of the data.

plot1 - Boxplot of Syllables by Gender
plot2 - Boxplot of Name Length By Gender
plot3 - Last Character in the name by Gender
plot4 - First Initial by Gender
plot5 - Start with a Vowel by Gender
plot6 - State by Gender
plot7 - Age by Gender (Age is another way of looking at Year but as a continuous variable)
plot8 - Age by Syllables (with coloring by gender)
plot9 - Age by Name length (with coloring by gender)

Features

name length - full string length of the name
syllable_count - number of syllables in the name (based on the syllable package, which contains NLP processes to determine the number of syllables in a word)
age - age of the person, computed as 2018 - year
year - year of birth
state - state or territory
first initial - first character of the name
start with a vowel - does the name start with a vowel (a, e, i, o, or u)
end in a vowel - does the name end with a vowel (a, e, i, o, or u)
name count - number of names (this feature does not work with this data: the SSA collapses all spaces in a name, e.g., mary elizabeth = maryelizabeth)
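As a rough illustration of the string-based features listed above, here is a Python sketch (the repository itself builds them in R in 002_mk_features.r). The function name `name_features` and the dictionary keys are hypothetical. The syllable count is left out because it relies on the R syllable package, and the last character is included because it appears in plot3.

```python
VOWELS = set("aeiou")


def name_features(name: str, year: int, state: str, reference_year: int = 2018) -> dict:
    """Build the simple per-record features for one name (illustrative only)."""
    clean = name.strip().lower()
    if not clean:
        raise ValueError("name must not be empty")
    return {
        "name_length": len(clean),             # full string length of the name
        "first_initial": clean[0],             # first character of the name
        "last_character": clean[-1],           # last character of the name
        "starts_with_vowel": clean[0] in VOWELS,
        "ends_in_vowel": clean[-1] in VOWELS,
        "year": year,                          # year of birth
        "age": reference_year - year,          # 2018 - year by default
        "state": state,                        # state or territory
    }
```

For example, `name_features("Olivia", 1990, "NY")` gives a name length of 6, an age of 28, and both vowel flags set to true.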

003_ml_model.r - Actual Modeling

A 60/20/20 training / test / holdout split is used to evaluate model fit.
Key features: year (factor), territory (factor), name length (continuous), number of syllables (continuous), end with a vowel (factor), first initial (factor), last character (factor), age (continuous)
Outcome: binary male/female label (with a probability of that label)
Initial evaluation: AUC with an ROC curve (ROCR)
The final model will be applied to the Unknown category. Evaluation will compare the gender probability assigned by the model against a simple binary male/female call based on the proportion of records assigned male vs female.
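The split and the AUC metric described above can be sketched in a few lines. This is a generic Python illustration, not the procedure in 003_ml_model.r: the function names, the fixed seed, and the pairwise AUC computation are all assumptions made for the example. The pairwise loop is quadratic and only meant to show what the number measures.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle rows and split them into training, test, and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 6 // 10                  # integer arithmetic avoids float rounding
    n_test = n * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]  # whatever remains (about 20%)
    return train, test, holdout


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted probability of the positive class. Ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to scores that carry no information about the label, and 1.0 to scores that rank every positive above every negative.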

004_ml_explain.r - Explainability of the features (TBD)

Simple variable importance of the features
LIME
SHAP

