The script takes a CSV file with a column 'Name' containing 'dirty names', that is, names in inconsistent formats: lastname firstname, firstname lastname, middlename lastname firstname, etc. (see sample input file). It produces a CSV file that has all the columns of the original file plus the following columns: 'uniqid', 'FirstName', 'MiddleInitial/Name', 'LastName', 'RomanNumeral', 'Title', 'Suffix'. By default, the script removes duplicate names (see sample output file).
The script was used to fix names in the CF-Scores from the Database on Ideology, Money in Politics, and Elections. The processed database with clean names is posted on Harvard DVN.
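The default duplicate removal described above can be pictured with a short sketch. This is only an illustration, not the script's actual code: the helper name `drop_duplicate_names` is hypothetical, the input rows are made up, and it assumes duplicates are exact matches on the 'Name' column after stripping whitespace (the real matching rule may differ).

```python
import csv
import io


def drop_duplicate_names(rows, column="Name"):
    """Keep the first row for each distinct value in `column`.

    Assumption: two names are duplicates only if they match exactly
    after surrounding whitespace is stripped.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row[column].strip()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up input, for illustration only.
sample = io.StringIO("Name\nSmith John\nJohn Smith\nSmith John\n")
rows = list(csv.DictReader(sample))
unique_rows = drop_duplicate_names(rows)
print([r["Name"] for r in unique_rows])  # ['Smith John', 'John Smith']
```

Under this assumption 'Smith John' and 'John Smith' count as different names, because the comparison happens on the raw string before any parsing.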
- Clone this repository
git clone https://github.com/soodoku/clean-names.git
- Navigate to clean-names
- Run
python setup.py install
Usage: process_names.py [options]
  -h, --help            show this help message and exit
  -o OUTFILE, --out=OUTFILE
                        Output file in CSV (default: sample_output.csv)
  -c COLUMN, --column=COLUMN
                        Column name in CSV that contains Names (default: Name)
  -a, --all             Export all names (do not take duplicate names out) (default: False)
python process_names.py -a sample_input.csv
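After a run, one quick sanity check is to confirm that the output file contains the added columns listed earlier. The sketch below is a hedged example, not part of the package: `missing_columns` is a hypothetical helper, and the demo header is made up, since the actual column order in the output file is not specified here.

```python
import csv
import io

# Columns the script is documented to add to the output file.
EXPECTED_COLUMNS = [
    "uniqid", "FirstName", "MiddleInitial/Name", "LastName",
    "RomanNumeral", "Title", "Suffix",
]


def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in expected if name not in header]


# Made-up header standing in for a real output file. With an actual run,
# open the file instead: open("sample_output.csv", newline="")
demo = io.StringIO(
    "Name,uniqid,FirstName,MiddleInitial/Name,LastName,RomanNumeral,Title,Suffix\n"
)
print(missing_columns(demo))  # []
```

An empty list means every documented column is present; any names returned point to a file that was not produced by the script or was produced with a different version.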
Scripts are released under the MIT License.