- Ramachandra Gopal Posina
- Akhil Morampudi
- Mahanth Mukesh Dadisetty
- Bharadwaj Aryasomayajula
- Dinesh Deshi
Income of US Citizens, Census Bureau
The goal of this project is to predict whether an individual's income exceeds $50K using machine learning classification algorithms, and to find patterns in the dataset using association rules. This helps us determine various things, such as how lucrative it would be to set up a business in a city based on the average income of its residents, real-estate preferences, and a person's eligibility for a bank loan. In addition, we can also estimate what type of tourist destinations a particular stratum of people would like to visit and whether that person's children would prefer a public or a private college in the future.
https://www.kaggle.com/uciml/adult-census-income
This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The prediction task is to determine whether a person's income exceeds $50K per year based on the census data described above. The dataset consists of a list of records, each of which describes various features of a person along with their income per year (a minimal loading sketch follows the feature list).
A brief description of the features is as follows:
Target:
income: >50K, <=50K
Predictors:
age: continuous
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt: continuous
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education-num: continuous
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
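A minimal sketch for loading the Kaggle CSV and inspecting these features is shown below. The file name adult.csv is an assumption about the local download; the Kaggle export uses dot-separated column names such as marital.status and hours.per.week.

```python
import pandas as pd

# Hypothetical local path to the Kaggle download (uciml/adult-census-income).
df = pd.read_csv("adult.csv")

print(df.shape)               # number of records and number of features
print(df.dtypes)              # continuous vs. categorical predictors
print(df["income"].unique())  # target values: '<=50K' and '>50K'
```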
- Data Understanding Phase
- Data Preparation Phase
- Modeling Phase
- Evaluation Phase
- Deployment Phase
Data Pre-Processing and Exploratory Data Analysis have been performed on the data, and the observations are listed below.
AI Fairness 360 by IBM implements several pre-processing bias-mitigation algorithms. We choose the Optimized Preprocessing algorithm, implemented in the "OptimPreproc" class of the "aif360.algorithms.preprocessing" package. This algorithm transforms the dataset so that positive outcomes are distributed more equitably between the privileged and unprivileged groups of the protected attribute.
Bias in original dataset: Difference in mean outcomes between unprivileged and privileged groups = -0.104553
Bias after mitigation: Difference in mean outcomes between unprivileged and privileged groups = -0.051074
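The "difference in mean outcomes" numbers above correspond to AIF360's mean-difference metric. A minimal sketch of how such a number can be computed is shown below; it assumes 'sex' is the protected attribute (the report does not state which attribute was used) and that the raw UCI Adult files are available to aif360's AdultDataset loader.

```python
from aif360.datasets import AdultDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Assumption: 'sex' is the protected attribute, with Male (1) as the privileged group.
privileged_groups = [{'sex': 1}]
unprivileged_groups = [{'sex': 0}]

# AdultDataset expects the raw UCI adult.* files in aif360's data directory.
dataset_orig = AdultDataset()

metric_orig = BinaryLabelDatasetMetric(dataset_orig,
                                       unprivileged_groups=unprivileged_groups,
                                       privileged_groups=privileged_groups)
print("Difference in mean outcomes:", metric_orig.mean_difference())

# After fitting OptimPreproc (its constructor additionally needs an optimizer and
# distortion/epsilon options) and transforming the dataset, the same metric is
# recomputed on the transformed data to obtain the post-mitigation number.
```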
- The number of records with income below $50K is larger than the number of records with income above $50K, so the dataset needs to be balanced on the target values so that the models do not overfit the data.
- The capital.gain and capital.loss values are mostly zeroes, so these columns can be dropped.
- Scatter plots and bar plots were drawn to examine the distribution of the categorical variables.
- hours.per.week has a value of 40 in most of the records, so this field can be dropped.
- The fnlwgt values mostly fall in the range 0-40,000 and correspond to ages 20 to 40.
- Outliers are present in some of the continuous variables and need to be handled properly (a sketch of these pre-processing steps follows this list).
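A minimal pandas sketch of these pre-processing steps, under assumptions: the file name adult.csv, the 1st/99th-percentile clipping, and the exact columns treated are illustrative choices, not the report's actual settings.

```python
import pandas as pd

# Hypothetical file name for the Kaggle download.
df = pd.read_csv("adult.csv")

# Class balance of the target (more <=50K records than >50K records).
print(df["income"].value_counts(normalize=True))

# capital.gain / capital.loss are zero for most rows, and hours.per.week is
# 40 for most rows, so these columns carry little information.
print((df["capital.gain"] == 0).mean(), (df["capital.loss"] == 0).mean())
print(df["hours.per.week"].value_counts().head())

# Drop the low-information columns identified above.
df = df.drop(columns=["capital.gain", "capital.loss", "hours.per.week"])

# Clip outliers in the remaining continuous variables to the 1st/99th percentiles
# (one possible way to "handle them properly").
for col in ["age", "fnlwgt", "education.num"]:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)
```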
1. Performed logistic regression, calculated the model's performance metrics, and computed the scores (see the sketch after this list).
2. Association rules (for finding patterns in the dataset).
3. K-fold cross-validation.
4. Applied CART algorithms and chose the best algorithm based on the metrics obtained.
5. Implemented neural networks using ensemble learning.
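A hedged scikit-learn sketch of steps 1, 3, and 4 follows. The train/test split ratio, the 10 folds, the one-hot encoding via get_dummies, and the tree depth are illustrative assumptions rather than the report's exact settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("adult.csv")                      # hypothetical file name
X = pd.get_dummies(df.drop(columns=["income"]))    # one-hot encode categoricals
y = (df["income"] == ">50K").astype(int)           # 1 if income exceeds 50K

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 1: logistic regression with scaled features.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

# Step 3: k-fold cross-validation (10 folds assumed here).
cv_scores = cross_val_score(logreg, X_train, y_train, cv=10)
print("Logistic regression CV accuracy:", cv_scores.mean())

# Step 4: CART (decision tree) for comparison against the same folds.
cart = DecisionTreeClassifier(max_depth=8, random_state=42)
print("CART CV accuracy:", cross_val_score(cart, X_train, y_train, cv=10).mean())
```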
Model evaluation metrics are required to quantify model performance. The choice of evaluation metric depends on the machine learning task at hand (such as classification, regression, ranking, clustering, or topic modeling). Some metrics, such as precision and recall, are useful for multiple tasks.
Classification metrics include the following scores:
1. Classification Accuracy
2. Confusion Matrix
3. F1 Score
4. Precision
5. Specificity
6. Recall
1. Classification Accuracy: Accuracy is a common evaluation metric for classification problems. It is the number of correct predictions made as a ratio of all predictions made. We use the sklearn module to compute the accuracy of a classification task; a minimal example follows.
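For instance, with scikit-learn and toy labels chosen purely for illustration:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 0.8 -> 4 of 5 predictions are correct
```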
2. Confusion Matrix: A confusion matrix provides a more detailed breakdown of correct and incorrect classifications for each class. The diagonal elements represent the number of points for which the predicted label equals the true label, while anything off the diagonal was mislabeled by the classifier. Therefore, the higher the diagonal values of the confusion matrix, the better, indicating many correct predictions.
3. Area under Curve (AUC): The area under the ROC curve is a performance metric that measures the ability of a binary classifier to discriminate between the positive and negative classes.
4. F-Measure: The F-measure (also F-score) is a measure of a test's accuracy that considers both the precision and the recall of the test to compute the score. Precision is the number of correct positive results divided by the total number of predicted positive observations.
5. Recall: Recall is the number of correct positive results divided by the number of all relevant samples (total actual positives).
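A small self-contained sketch computing these metrics with scikit-learn; the labels and predicted scores below are toy values for illustration only.

```python
from sklearn.metrics import (confusion_matrix, roc_auc_score,
                             f1_score, precision_score, recall_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 0, 1, 0, 1, 0, 1, 1]
y_score = [0.2, 0.1, 0.9, 0.4, 0.8, 0.3, 0.7, 0.6]  # predicted P(income > 50K)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Confusion matrix (tn, fp, fn, tp):", tn, fp, fn, tp)
print("Precision  :", precision_score(y_true, y_pred))  # tp / (tp + fp)
print("Recall     :", recall_score(y_true, y_pred))     # tp / (tp + fn)
print("Specificity:", tn / (tn + fp))                   # true-negative rate
print("F1 score   :", f1_score(y_true, y_pred))
print("AUC        :", roc_auc_score(y_true, y_score))
```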
Evaluation scores and performance metrics obtained for our model:
Accuracy on test: 0.8019891500904159
F1 score on test: 0.49
Precision : 0.7720364741641338
Specificity : 0.9697824335213537
Recall : 0.580
- Design a machine learning pipeline and feed it into the ensemble learning model.
Finally, from the dataset we predict whether a person makes over $50K a year.
Find patterns in the dataset (see the association-rules sketch below).
K-Fold cross validation
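A hedged sketch of the association-rule mining step. It assumes the mlxtend library and a handful of illustrative categorical columns from the Kaggle CSV; the support and confidence thresholds are placeholders, not the project's actual settings.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("adult.csv")  # hypothetical file name

# One-hot encode a few categorical columns plus the target so that each row
# becomes a "transaction" of boolean items.
items = pd.get_dummies(df[["workclass", "education", "marital.status",
                           "occupation", "sex", "income"]].astype(str))

frequent = apriori(items.astype(bool), min_support=0.1, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules.sort_values("lift", ascending=False).head())
```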