This project analyzes the Adult Income Dataset to predict whether an individual's income exceeds $50K per year based on census data. The analysis focuses on addressing three key machine learning challenges: class imbalance, missing values, and outliers.
This project uses the Adult Income Dataset from the UCI Machine Learning Repository.
- Name: Adult Income Dataset
- Source: UCI Machine Learning Repository
- Original Name: Census Income Dataset
- Year: 1994
- Donor: Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics
The dataset was extracted from the 1994 Census Bureau database. The prediction task is to determine whether a person earns more than $50K a year. It contains the following features:
- Demographic attributes (age, gender, race, native country)
- Educational information (education level, educational-num)
- Employment details (workclass, occupation, hours-per-week)
- Financial indicators (capital-gain, capital-loss)
- Target variable: income >50K (binary classification)
Total records: 43,957
adult_income_data_project/
├── config/
│ └── __init__.py
├── constant/
│ └── constants.py
├── data/
│ └── train.csv
├── library/
│ ├── data_preprocessing.py
│ ├── evaluation.py
│ ├── models.py
│ └── visualization.py
├── src/
│ ├── challenge_1.py
│ ├── challenge_2.py
│ ├── challenge_3.py
│ └── main.py
├── dataset_profile.py
├── README.md
└── requirements.txt
Three strategies are implemented to address class imbalance in the income target:
- Baseline (no changes)
- Over-sampling using SMOTE
- Cost-sensitive learning using class weights
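As a rough illustration of the class-weight strategy, the sketch below computes "balanced" weights with the common heuristic `n_samples / (n_classes * class_count)`, which is also the formula behind scikit-learn's `class_weight="balanced"` option. The function name and the toy labels are hypothetical and are not taken from this project's `library/models.py`, which may compute or pass weights differently.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels for illustration only (not the real class ratio); 1 means income >50K.
toy_labels = [0, 0, 0, 1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The minority class receives the larger weight, so misclassifying a >50K record costs more during training. SMOTE takes the opposite route and synthesizes new minority-class samples instead of reweighting the existing ones.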
Addressing missing data in workclass, occupation, and native-country fields:
- Baseline (no changes)
- Dropping rows with missing values
- Imputation using most frequent values
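The two non-baseline strategies above can be sketched with pandas as follows. This is a minimal example on a made-up frame: the column names follow this README, but the values are invented, and it assumes missing entries are already encoded as NaN. If the CSV marks them with a placeholder such as `?`, they would need to be converted first, for example via the `na_values` argument of `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the README, values are made up.
df = pd.DataFrame({
    "workclass": ["Private", None, "Private", "State-gov"],
    "occupation": ["Sales", "Sales", None, "Tech-support"],
    "native-country": ["United-States", "United-States", "Mexico", None],
})
cols = ["workclass", "occupation", "native-country"]

# Strategy: drop every row that has a missing value in any of the three columns.
dropped = df.dropna(subset=cols)

# Strategy: fill each column with its most frequent (mode) value.
imputed = df.copy()
for col in cols:
    most_frequent = imputed[col].mode(dropna=True)[0]
    imputed[col] = imputed[col].fillna(most_frequent)

print(len(df), len(dropped))
print(imputed)
```

Dropping rows keeps only complete records at the cost of sample size, while mode imputation keeps every row but inflates the share of the most common category.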
Handling outliers in numerical features:
- Baseline (no changes)
- Winsorizing at specific percentiles:
  - capital-gain: 97th percentile
  - capital-loss: 97th percentile
  - hours-per-week: 95th percentile
  - age: 95th percentile
- Dropping identified outliers
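A minimal pandas sketch of the winsorizing step, using the percentile caps listed above. It assumes that only the upper tail is capped, which this README does not state explicitly. The helper name `winsorize_upper` and the toy data are hypothetical and may differ from what `library/data_preprocessing.py` does.

```python
import pandas as pd

# Upper percentile caps as listed in this README, expressed as quantiles.
UPPER_PERCENTILES = {
    "capital-gain": 0.97,
    "capital-loss": 0.97,
    "hours-per-week": 0.95,
    "age": 0.95,
}


def winsorize_upper(df, percentiles):
    """Cap each listed column at its upper quantile and return a new frame."""
    out = df.copy()
    for col, q in percentiles.items():
        cap = out[col].quantile(q)
        out[col] = out[col].clip(upper=cap)
    return out


# Toy data for illustration only: values 1.0 through 100.0 in a single column.
toy = pd.DataFrame({"hours-per-week": [float(x) for x in range(1, 101)]})
capped = winsorize_upper(toy, {"hours-per-week": UPPER_PERCENTILES["hours-per-week"]})
print(capped["hours-per-week"].max())
```

Unlike dropping outliers, winsorizing keeps every row and only pulls extreme values back to the cap, so the record count is unchanged.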
- Missing data analysis revealed:
  - Workclass: MCAR (Missing Completely at Random)
  - Occupation: NMAR (Not Missing at Random)
  - Native-country: MCAR
- Significant class imbalance with majority of records having income ≤50K
- Non-normal distribution in numerical features, particularly in capital-gain and capital-loss
- Clone the repository
- Install required dependencies:
pip install -r requirements.txt
- Run the dataset profile analysis:
python dataset_profile.py
- Execute each challenge:
python -m src.challenge_1 # Run class imbalance challenge (from the project root)
python -m src.challenge_2 # Run missing values challenge
python -m src.challenge_3 # Run outliers challenge