Mitigate machine learning bias to ensure data ethics in the U.S. national home mortgage dataset.
📝 Note: This document is still being written.
- The project addresses the overall area of "machine bias". It uses the U.S. national mortgage (HMDA) dataset to:
- explore machine bias (discrimination), where loan approvals benefit one group of people over another based on certain social attributes (legally known as protected classes, such as race, gender, and religion). Three categories [Gender, Ethnicity, and Race] are examined using the mean-difference method (see the sketch after this list).
- mitigate discrimination by implementing different methods (pre-processing, post-processing, naive fairness, etc.) together with machine learning algorithms (decision tree, random forest, and logistic regression).
- At the end, it aims to train models that perform well on both accuracy (utility) and fairness, ensuring the algorithms are categorically objective and diminish social disparities.
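As a rough illustration of the mean-difference method, the sketch below computes it with pandas only; themis-ml also ships a mean-difference metric. The column names (`approved`, `applicant_is_female`, `applicant_is_black`) and the file name are placeholders for illustration, not the repo's actual identifiers.

```python
import pandas as pd

def mean_difference(y, s):
    """Mean difference: P(y=1 | advantaged group) - P(y=1 | protected group).

    y : binary Series, 1 = loan approved
    s : binary Series, 1 = member of the protected group
    A positive value means the advantaged group is approved more often.
    """
    y = pd.Series(y).astype(float)
    s = pd.Series(s).astype(int)
    return y[s == 0].mean() - y[s == 1].mean()

# Hypothetical usage on the cleaned LAR table; column names are assumptions.
# df = pd.read_csv("hmda_sample.csv")
# print(mean_difference(df["approved"], df["applicant_is_female"]))
# print(mean_difference(df["approved"], df["applicant_is_black"]))
```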
- `clean.ipynb` includes the code to clean the data.
- `bias_indentification.ipynb` contains the code to identify machine bias in the data.
- `de-biasing.py` contains the code to mitigate the machine bias.
- `docs/final_presentations.ppt` presents the slide deck.
- `README.md` summarizes and introduces the project.

- Colaboratory is used to develop this project.
- PyDrive is used to import data from Google Drive into Colaboratory (a minimal import sketch follows this list).
- themis-ml is an open-source Python library for specifying, implementing, and evaluating fairness-aware machine learning methods. (Official documentation for this package can be found here.)
- Pandas and Numpy are used in data cleaning.
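For reference, a minimal sketch of the usual Colaboratory + PyDrive import pattern is shown below; the Drive file ID and the CSV filename are placeholders, not the project's actual values.

```python
# Minimal Colaboratory + PyDrive pattern for pulling a CSV from Google Drive.
# '<YOUR_DRIVE_FILE_ID>' and 'hmda_sample.csv' are placeholders.
import pandas as pd
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

auth.authenticate_user()                        # interactive OAuth prompt in Colab
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

data_file = drive.CreateFile({'id': '<YOUR_DRIVE_FILE_ID>'})
data_file.GetContentFile('hmda_sample.csv')     # download to the Colab filesystem
df = pd.read_csv('hmda_sample.csv')
```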
Sha Sundaram, a privacy engineer at Snap who focuses on bias in machine learning, said engineers must put themselves in the shoes of their users and try to think like them. She noted that biases in machine learning have the potential to harm users, but it's very difficult to identify those biases.
She shared a checklist she uses to help identify bias in machine learning:
- What training data is used?
- What is being put in place to improve data quality?
- How sensitive is a model's accuracy to changes in test datasets?
- What is the risk to the user if something gets mislabeled?
- In what scenarios can your model be applied?
- When should a model be retrained?
A complete set of references for the discrimination discovery and fairness-aware methods implemented in themis-ml can be found in this paper.
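themis-ml provides ready-made estimators for the pre- and post-processing methods mentioned above; since those APIs are not reproduced here, the sketch below illustrates only the simplest baseline ("naive fairness": dropping the protected attribute before training) with scikit-learn, comparing accuracy and the mean difference of the predictions. All column and file names are assumptions, and the features are assumed to be already numeric.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("hmda_sample.csv")           # cleaned 1% sample (placeholder name)
protected = "applicant_is_female"             # hypothetical binary protected attribute
target = "approved"                           # hypothetical binary target

X = df.drop(columns=[target])
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def pred_mean_difference(y_pred, s):
    """P(y_hat=1 | advantaged group) - P(y_hat=1 | protected group)."""
    y_pred = pd.Series(y_pred, index=s.index).astype(float)
    return y_pred[s == 0].mean() - y_pred[s == 1].mean()

# Baseline: protected attribute included in the feature set.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
base_pred = baseline.predict(X_test)

# "Naive fairness": the protected attribute is removed before training.
naive = LogisticRegression(max_iter=1000).fit(
    X_train.drop(columns=[protected]), y_train)
naive_pred = naive.predict(X_test.drop(columns=[protected]))

for name, pred in [("baseline", base_pred), ("naive fairness", naive_pred)]:
    print(name,
          "accuracy:", round(accuracy_score(y_test, pred), 3),
          "mean difference:", round(pred_mean_difference(pred, X_test[protected]), 3))
```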
HMDA (Home Mortgage Disclosure Act): data generated under HMDA provides information on lending practices. The dataset includes multiple files; the primary table is the Loan Application Register (LAR), which contains:
- demographic information about loan applicants, including race, gender, and income;
- the purpose of the loan (e.g., home purchase or improvement);
- whether the buyer intends to live in the home;
- the type of loan (e.g., conventional or FHA-insured);
- the outcome of the loan application (approved or declined);
- geographical information on applicants, such as Census tract, MA (metropolitan area), state and county, and the total population and percentage of minority population by Census tract.
A 1% sample of the data (CSV) is used in this project.
The section contains three parts (a sketch of these steps follows the list):
- Feature selection
- Attribute transformation of:
  - the target variable
  - the protected attributes
- Null value elimination
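A minimal sketch of these three steps is shown below, assuming the public HMDA LAR column names and codes (`action_taken`, `applicant_sex`, `applicant_race_1`, `applicant_ethnicity`, etc.); the actual notebook may select different columns, and the file name is a placeholder.

```python
import pandas as pd

df = pd.read_csv("hmda_sample.csv")  # 1% LAR sample (placeholder filename)

# 1. Feature selection: keep the loan, applicant, and geography columns used later.
#    Column names follow the public HMDA LAR schema but are assumptions here.
cols = ["action_taken", "loan_purpose", "loan_amount_000s", "applicant_income_000s",
        "applicant_sex", "applicant_race_1", "applicant_ethnicity", "state_code"]
df = df[cols]

# 2a. Target variable: map the action-taken codes to a binary "approved" flag
#     (1 = loan originated, 3 = application denied; other codes are dropped).
df = df[df["action_taken"].isin([1, 3])].copy()
df["approved"] = (df["action_taken"] == 1).astype(int)

# 2b. Protected attributes: encode binary protected-class indicators
#     (LAR codes: sex 2 = female; race 3 = Black or African American;
#      ethnicity 1 = Hispanic or Latino).
df["applicant_is_female"] = (df["applicant_sex"] == 2).astype(int)
df["applicant_is_black"] = (df["applicant_race_1"] == 3).astype(int)
df["applicant_is_hispanic"] = (df["applicant_ethnicity"] == 1).astype(int)

# 3. Null value elimination: drop rows with missing values in the kept columns.
df = df.dropna()
```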