Project Overview
This project analyzes the Postpartum Depression (PPD) dataset (THP_clean.csv) to explore demographic, medical, and social factors linked to PPD and to build a machine learning model that predicts risk. A data schema was also employed because some column names were encrypted.
The dataset includes:
- Demographics: age, marital status, education, employment, parity (number of children)
- Medical history: previous depression, HAMD baseline score, birth complications
- Social support: MSPSS baseline score
- Target: postpartum depression (yes/no)
Methodology
- Data Loading: Loaded the dataset (THP_clean.csv) into a pandas DataFrame. Checked shape, column names, and data types.
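The loading step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's exact code; the helper name `load_dataset` is hypothetical, and the only assumption about the file is that it is a regular comma-separated CSV.

```python
import pandas as pd


def load_dataset(path="THP_clean.csv"):
    """Read the CSV into a DataFrame and print a quick structural summary.

    `load_dataset` is a hypothetical helper name used for illustration only.
    """
    df = pd.read_csv(path)
    print("Shape:", df.shape)             # (rows, columns)
    print("Columns:", list(df.columns))   # raw column names
    print(df.dtypes)                      # data type of each column
    return df
```

Calling `load_dataset("THP_clean.csv")` from the notebook's working directory returns the DataFrame used in the later steps.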
- Data Cleaning: Removed duplicate rows. Standardized column names (lowercase, underscores). Filled missing values: numeric columns with the median, categorical columns with the mode (most frequent value).
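A possible implementation of the cleaning step is sketched below. The function name `clean_dataset` is hypothetical, and the exact column-name rule (trim, lowercase, whitespace to underscores) is an assumption about what "standardized" means here.

```python
import pandas as pd


def clean_dataset(df):
    """Drop duplicates, standardize column names, and impute missing values.

    Hypothetical helper; the naming rule below is an assumed convention.
    """
    df = df.drop_duplicates().copy()

    # Trim, lowercase, and replace runs of whitespace with underscores.
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
    )

    for col in df.columns:
        if df[col].isna().all():
            continue  # no observed values to impute from
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns: fill gaps with the median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical columns: fill gaps with the most frequent value.
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for imputing skewed clinical scores.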
- Exploratory Data Analysis (EDA) & Visualizations: Twenty key research questions were asked and answered. Visualizations used: bar plots, boxplots, a pie chart, a histogram, and a heatmap. The questions were:
  - Does age group influence PPD?
  - Does marital status affect PPD?
  - Does social support reduce PPD risk?
  - What is the depression rate by parental education level?
  - What percentage of mothers experienced PPD?
  - Do birth complications increase PPD risk?
  - Does the number of children (parity) affect PPD?
  - How do social relationships influence PPD?
  - Does employment status affect PPD?
  - What is the effect of the mother's HAMD severity score on the WPPSI variables?
  - How does PPD affect children's social behaviour, as measured by SCAS and SDQ?
  - What is the ratio of depressed to non-depressed mothers among those with no health issues?
  - What is the influence of external family members on PPD?
  - Does the number of children born influence PPD?
  - Do living conditions influence PPD?
  - What is the financial influence on PPD?
  - How does BMI influence the mother's physical and mental health?
  - What is the effect of the Home on PPD?
  - Is PPD influenced by practicing birth spacing?
  - What is the mortality rate of children?
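Many of the questions above reduce to comparing the PPD rate across groups. The sketch below shows one way to do that. It assumes the target column is named `depressed` and coded 0/1; the helper name `ppd_rate_by_group` and any grouping column names passed to it are hypothetical.

```python
import pandas as pd


def ppd_rate_by_group(df, group_col, target_col="depressed", bins=None):
    """Return the share of PPD-positive mothers per category of `group_col`.

    Assumes `target_col` is coded 0/1 (an assumption, not confirmed by the
    dataset). If `bins` is given, a numeric column such as age is first cut
    into intervals so that it can be treated as groups.
    """
    groups = df[group_col] if bins is None else pd.cut(df[group_col], bins=bins)
    # The mean of a 0/1 column is the proportion of 1s in each group.
    return df.groupby(groups, observed=True)[target_col].mean()
```

The returned Series can be drawn as a bar plot with `rates.plot(kind="bar")`, which matches the bar-plot style of visualization listed above.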
- Predictive Modeling: Preprocessed categorical variables using label encoding. Defined the features (X) and target (y = depressed). Split the dataset into training and testing sets (80/20). Trained a Random Forest classifier. Evaluated it with accuracy, precision, recall, and F1-score (classification report), plus a confusion matrix.
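The modeling step might look roughly like the sketch below. It is not the notebook's exact code: the function name `train_ppd_model`, the fixed `seed`, `n_estimators=100`, and the stratified split are all assumptions added for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def train_ppd_model(df, target_col="depressed", seed=42):
    """Label-encode, split 80/20, fit a Random Forest, and evaluate it.

    Hypothetical helper. Stratifying the split on the target and the
    hyperparameters used here are assumptions, not taken from the notebook.
    """
    data = df.copy()

    # Label-encode every non-numeric column (including a yes/no target).
    for col in data.columns:
        if not pd.api.types.is_numeric_dtype(data[col]):
            data[col] = LabelEncoder().fit_transform(data[col].astype(str))

    X = data.drop(columns=[target_col])
    y = data[target_col]

    # 80/20 split; stratify keeps the class balance similar in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )

    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Text report with precision, recall, F1, and accuracy.
    report = classification_report(y_test, y_pred, zero_division=0)
    matrix = confusion_matrix(y_test, y_pred)
    return model, report, matrix
```

For a screening use case, recall on the depressed class is usually worth reading alongside accuracy, since a missed case tends to cost more than a false alarm.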
- Feature Importance: Extracted feature importances from the Random Forest model. Identified the top 10 predictors of PPD (e.g., social support score, HAMD score, marital status, birth complications).
Results & Insights
- Mothers with lower social support had a higher risk of PPD.
- A history of depression and higher HAMD scores were strong predictors.
- Marital status and birth complications were also linked to increased risk.
- The Random Forest achieved good predictive performance, showing potential for risk screening.
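The feature-importance step from the methodology can be sketched as below. The helper name `top_features` is hypothetical; it only assumes a fitted model exposing `feature_importances_`, as scikit-learn's Random Forest does.

```python
import pandas as pd


def top_features(model, feature_names, n=10):
    """Return the `n` largest feature importances as a labelled Series.

    Hypothetical helper; works with any fitted estimator that exposes a
    `feature_importances_` attribute.
    """
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(n)
```

Impurity-based importances from a Random Forest show which features the model relied on, not the direction or causal strength of an effect, so they are best read together with the EDA results.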
How to Run
- Open the Jupyter Notebook PPD_Analysis.ipynb.
- Run all cells step by step.
- Upload THP_clean.csv when prompted.
- Visualizations and model results will display inline.
Libraries Used
- pandas (data handling)
- numpy (numerical processing)
- matplotlib & seaborn (visualization)
- scikit-learn (machine learning)