This repository includes all software for reproducing the figures and tables of the paper "BISG: When inferring race or ethnicity, does it matter that people often live near their relatives?" by Philip Greengard and Andrew Gelman (pre-print).
The code in this repository constructs race/ethnicity (subpopulation) predictions for registered voters in several states.
The main output of the codes of this repository is a Pandas dataframe named df_agg
,
that is built in raking_bisg.py
. Each row of df_agg
contains race/ethnicity predictions corresponding to one (surname, county) pair in one state.
For example, the New York dataframe contains a row for (Smith, Queens) which includes
race/ethnicity predictions for a registered voter named "Smith" in Queens County, New York.
The codes in this repository construct these dataframes for a number of predictions and states. Dataframes are available to download here. The states and predictions included (as of 4/7/2023) are as follows.
Predictions: Census race given county, Census 18+ race given county, Census race given surname, Voter-adjusted race given county, Voter-adjusted race given surname, BISG (full US population), Voter BISG (registered voter population in state), Raking (registered voter population in state)
States: Florida (race/ethnicity labeled), North Carolina (race/ethnicity labeled), New York, Ohio, Oklahoma, Vermont, Washington State
After downloading the dataframes here in pickle form, the dataframes can be loaded into memory with the Python commands
> import pandas as pd
> df_agg = pd.read_pickle('path/to/df_agg_{state}{year}_dataverse.pkl')
The columns of the dataframe are explained in the file ./fields.py
.
A unit test can be run from the home directory of this repository via the command
$ python3 ./test.py
that checks that if data is generated according to the BISG assumption, then BISG is exact, and when data is not generated with the BISG assumption, raking predictions still have the correct race/ethnicity margin.
The figures and tables of the paper from North Carolina can be reproduced by running the script
$ python3 ./download_data_and_run.py
This script i) downloads available voter file, census, and cps data, ii) builds dataframes of predictions, iii) constructs figures and latex tables of the paper. Warning: voter files downloaded by this script can be very large. For example, the 2020 North Carolina voter file is nearly 15GB.
In order to generate figures and tables for Florida, the Florida voter file must be saved to a location specified in the above script. The Florida voter file is sent by disk to those who request it. The North Carolina voter files are available for download online and are downloaded using the above script.
{state}_vf.py
: reads in a voter file from a particular state
raking_bisg.py
: the main statistical codes, including building df_agg
which involves computing
raking and BISG predictions
fields.py
: defines column names (mainly predictions) of df_agg
process_census.py
: reads in census data
process_cps.py
: reads in data from the voter supplement of cps
plotting.py
: plot figures for paper and construct LaTeX tables
test.py
: unit test
download_data_and_run.py
: script that