This project compares machine learning models on gene expression data, a bioinformatics application. It uses Random Forest, Support Vector Regression (SVR), and other regression models to predict and analyze gene expression scores.
- Python: Used for data preprocessing, model building, and evaluation.
  - Key Libraries: pandas, numpy, sklearn, seaborn, matplotlib
- R: Employed for statistical analysis and visualization.
  - Key Libraries: tidyverse, caret, e1071, rpart, randomForest, ggplot2, readr, ggpubr
The project uses preprocessed gene expression data, including various features and a target variable (score). The data is analyzed to understand the relationships between different genes and their expression levels.
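As a rough illustration of that layout, the sketch below builds a small synthetic table with a few feature columns and a `score` target, then ranks the features by their correlation with the target. The column names (`gene_a` to `gene_d`), the sample size, and the relationship between features and score are all assumptions made for the example; they are not taken from the project's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the preprocessed data: rows are samples, columns are
# hypothetical gene features, plus the target column `score`.
rng = np.random.default_rng(0)
n_samples = 200
genes = pd.DataFrame(
    rng.normal(size=(n_samples, 4)),
    columns=["gene_a", "gene_b", "gene_c", "gene_d"],
)

# Assumed relationship, for illustration only: score depends on gene_a and gene_b.
noise = rng.normal(scale=0.5, size=n_samples)
data = genes.assign(score=2.0 * genes["gene_a"] - 1.0 * genes["gene_b"] + noise)

# Separate the features from the target variable.
X = data.drop(columns="score")
y = data["score"]

# Pearson correlation of each feature with the target, strongest first.
corr = X.corrwith(y)
corr_with_score = corr.loc[corr.abs().sort_values(ascending=False).index]
print(corr_with_score)
```

With real data, the synthetic frame would be replaced by the project's own preprocessed table, for example loaded with `pd.read_csv`.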
- Random Forest Regression (Python): Fitted after hyperparameter tuning.
- Support Vector Regression (SVR) (Python & R): Applied to model gene expression data with a linear kernel.
- Feature Selection and Analysis: Mutual Information, Recursive Feature Elimination (RFE), and Correlation Analysis.
- Model Evaluation: Using Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).
- Baseline Comparison: Comparison with a dummy regressor to establish baseline performance.
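
The steps listed above could be combined roughly as in the following scikit-learn sketch. It is a minimal example on synthetic data from `make_regression`, not the project's actual scripts: the hyperparameter grid, the number of features kept by RFE, the train/test split, and the feature scaling step before SVR are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, mutual_info_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the gene expression features and the score target.
X, y = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Feature analysis: mutual information between each feature and the target.
mi_scores = mutual_info_regression(X_train, y_train, random_state=0)

# Feature selection: RFE around a linear-kernel SVR keeps 4 features (assumed).
rfe = RFE(SVR(kernel="linear"), n_features_to_select=4).fit(X_train, y_train)
selected = np.flatnonzero(rfe.support_)

# Random Forest tuned over a small, illustrative hyperparameter grid.
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    scoring="neg_mean_squared_error",
    cv=3,
)

models = {
    "dummy_mean": DummyRegressor(strategy="mean"),  # baseline
    "random_forest": rf_search,
    "svr_linear": make_pipeline(StandardScaler(), SVR(kernel="linear")),
}

# Fit each model and report MSE and RMSE on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    results[name] = {"mse": mse, "rmse": float(np.sqrt(mse))}

for name, metrics in results.items():
    print(f"{name}: MSE={metrics['mse']:.2f}, RMSE={metrics['rmse']:.2f}")
```

The dummy regressor always predicts the training mean, so a model is only worth keeping if its MSE and RMSE fall clearly below that baseline.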
Key visualizations from the analysis are presented below:
Explanation of the scatter plot findings.
Details about the data shown in the bar chart.
Interpretation of the density plot.
Insights from the joint density plot.
Summary of key findings, including feature importance, model performance comparison, and visualization insights.
Instructions on setting up the environment and running the scripts.
Details on how to run the scripts and utilize the analysis.
Information on how others can contribute to the project.
For more information or inquiries, please contact motasem.youniss@gmail.com.