This project explores the effects of Ridge (L2) regularization in a multiclass logistic regression model, examining how the regularization strength affects coefficients, cross-validation performance, and predictions. The techniques are applied to classifying ancestry groups from human genetic data.
The dataset includes 183 individuals sampled globally, with genetic data projected onto the top 10 principal components, which together explain 24.16% of the variance. These components serve as features in place of the raw genetic data, reducing computational cost while retaining the dominant axes of genetic variation. The goal is to develop a Ridge (L2) regularized logistic regression model that uses these principal components to classify an individual’s ancestry into one of five continental groups: African, European, East Asian, Oceanian, or Native American.
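
As a rough illustration of the setup described above, the sketch below fits an L2-penalized multiclass logistic regression on a 183 x 10 feature matrix. The data here is synthetic (the repository's actual file names and loading code are not assumed), the value of `lam` is hypothetical, and mapping λ to scikit-learn's `C` as `C = 1/λ` is an assumption about parameterization rather than the notebook's confirmed choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
classes = ["African", "European", "EastAsian", "Oceanian", "NativeAmerican"]

# Synthetic stand-in for the real data: 183 individuals x 10 principal components.
n_samples, n_components = 183, 10
y = np.arange(n_samples) % len(classes)  # every class is guaranteed to appear
centers = rng.normal(scale=3.0, size=(len(classes), n_components))
X = centers[y] + rng.normal(size=(n_samples, n_components))

# scikit-learn's default penalty is L2; C is the inverse regularization strength.
lam = 1.0  # hypothetical value, chosen only for illustration
model = LogisticRegression(C=1.0 / lam, max_iter=1000)
model.fit(X, y)

coef_shape = model.coef_.shape       # one row of coefficients per class
proba = model.predict_proba(X[:3])   # per-class probabilities for three samples
```

Each row of `proba` sums to 1 across the five classes, and `model.coef_` holds one coefficient vector per class over the 10 components.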
The notebook demonstrates:
- The impact of varying the regularization parameter (λ) on model coefficients.
- The relationship between λ and cross-validation performance (log-loss).
- Predictions for test samples using the optimal λ, including probabilities for each ancestry class.
- Visualizes the effect of λ on the coefficients of logistic regression for better interpretability.
- Identifies the optimal λ using cross-validation to minimize log-loss.
- Produces class probabilities for test samples and highlights ancestry patterns.
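
The points above can be sketched as a single sweep over λ. This is a minimal example on synthetic data, not the notebook's code: the λ grid, the 5-fold stratified split, and the `C = 1/λ` mapping are all assumptions made for illustration. It records the overall coefficient norm and the mean cross-validated log-loss for each λ, then picks the λ with the lowest log-loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_classes, n_features = 5, 10

# Synthetic stand-in for the 183 x 10 principal-component matrix.
y = np.arange(183) % n_classes
centers = rng.normal(scale=1.5, size=(n_classes, n_features))
X = centers[y] + rng.normal(size=(183, n_features))

lambdas = np.logspace(-3, 3, 13)  # hypothetical grid of regularization strengths
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

coef_norms, cv_log_loss = [], []
for lam in lambdas:
    model = LogisticRegression(C=1.0 / lam, max_iter=2000)
    # Coefficient magnitude on the full data: shrinks as lambda grows.
    coef_norms.append(np.linalg.norm(model.fit(X, y).coef_))
    # scikit-learn reports negated log-loss, so flip the sign back.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")
    cv_log_loss.append(-scores.mean())

best_lam = lambdas[int(np.argmin(cv_log_loss))]
```

Plotting `coef_norms` and `cv_log_loss` against `lambdas` on a logarithmic x-axis gives the two kinds of curves the notebook describes: coefficients shrinking toward zero, and a log-loss curve whose minimum marks the selected λ.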
The analysis relies on the following Python libraries:
- numpy: Numerical operations and array manipulations.
- pandas: Data handling and preprocessing.
- matplotlib: Visualization of results and trends.
- scikit-learn: Logistic regression modeling, cross-validation, and metrics.
- seaborn: Heatmaps for predicted probabilities.
- Mixed Ancestry Patterns: Mexican and African American samples exhibit mixed ancestry, reflecting historical events like colonization and the transatlantic slave trade.
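- Clear Assignments for Unknown Samples: Most unknown samples were assigned to a single ancestry group with a predicted probability above 0.5. One unknown sample in the table below has a top probability of only about 0.26 and is better treated as ambiguous.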
- Historical Insights: The results align with historical records, including indigenous populations' mixing with colonizers and the slave trade's genetic impacts.
- Speculative Observations: Possible ancient seafaring connections (e.g., Polynesians reaching the Americas or Indian Ocean trade influencing genetics) are hinted at by shared traits.
Below is a sample of predicted probabilities and classifications:
African | European | EastAsian | Oceanian | NativeAmerican | Predicted Class | Actual Class |
---|---|---|---|---|---|---|
0.131156 | 0.093878 | 0.099505 | 0.081108 | 0.594353 | NativeAmerican | Unknown |
0.102278 | 0.064486 | 0.109084 | 0.651919 | 0.072233 | Oceanian | Unknown |
0.099014 | 0.167203 | 0.520602 | 0.097951 | 0.115230 | EastAsian | Unknown |
0.181682 | 0.216551 | 0.126798 | 0.216200 | 0.258769 | NativeAmerican | Unknown |
0.132245 | 0.620827 | 0.065980 | 0.098260 | 0.082688 | European | Unknown |
0.135332 | 0.062889 | 0.191956 | 0.537468 | 0.072355 | Oceanian | Mexican |
0.153923 | 0.084007 | 0.253703 | 0.387418 | 0.120948 | Oceanian | Mexican |
0.143852 | 0.103410 | 0.157563 | 0.509435 | 0.085740 | Oceanian | Mexican |
0.166508 | 0.117424 | 0.261901 | 0.347607 | 0.106560 | Oceanian | Mexican |
0.186156 | 0.121892 | 0.256650 | 0.337394 | 0.097908 | Oceanian | Mexican |
0.186488 | 0.096595 | 0.335412 | 0.272778 | 0.108727 | EastAsian | Mexican |
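
The "Predicted Class" column is consistent with taking the class with the highest probability in each row. The sketch below reproduces that step with pandas for three of the rows above; the `Top Probability` column is an illustrative addition, not part of the original output, and it makes low-confidence assignments easy to spot.

```python
import pandas as pd

classes = ["African", "European", "EastAsian", "Oceanian", "NativeAmerican"]

# Three rows copied from the table above (rows 1, 2, and 4).
rows = [
    [0.131156, 0.093878, 0.099505, 0.081108, 0.594353],
    [0.102278, 0.064486, 0.109084, 0.651919, 0.072233],
    [0.181682, 0.216551, 0.126798, 0.216200, 0.258769],
]
probs = pd.DataFrame(rows, columns=classes)

# The predicted class is the column holding the largest probability.
probs["Predicted Class"] = probs[classes].idxmax(axis=1)
# Illustrative extra column: how confident the top assignment is.
probs["Top Probability"] = probs[classes].max(axis=1)
```

For the third row the top probability is 0.258769, only slightly above the next two classes, so that prediction carries much less weight than the first two.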
- data/: Training and test datasets.
- images/: Visualizations of results.
- notebooks/: Jupyter Notebook containing the analysis.
- requirements.txt: List of required Python packages.
- Clone this repository:
git clone https://github.com/DevDizzle/ridge-logistic-regression.git