
# Student Performance Analysis and Prediction

This project analyzes a dataset of student academic performance and builds machine learning models to predict the Performance Index based on various features. It includes exploratory data analysis (EDA), feature engineering, modeling, evaluation, and visualization.

## 📂 Dataset

File: `Student_Performance.csv` — 10,000 student records with the following columns:

| Column | Type | Description |
| --- | --- | --- |
| Hours Studied | int | Number of hours a student studies |
| Previous Scores | int | Previous academic scores |
| Extracurricular Activities | str | Participation in extracurricular activities (`Yes`/`No`) |
| Sleep Hours | int | Hours of sleep per day |
| Sample Question Papers Practiced | int | Number of sample question papers practiced |
| Performance Index | float | Target variable representing overall performance |
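A quick way to load and sanity-check the file with pandas; the three rows below are illustrative stand-ins, not the real records:

```python
import pandas as pd

# Illustrative sample with the same schema as Student_Performance.csv.
sample = pd.DataFrame({
    "Hours Studied": [7, 4, 8],
    "Previous Scores": [99, 82, 51],
    "Extracurricular Activities": ["Yes", "No", "Yes"],
    "Sleep Hours": [9, 4, 7],
    "Sample Question Papers Practiced": [1, 2, 2],
    "Performance Index": [91.0, 65.0, 45.0],
})
# In the project itself: df = pd.read_csv("Student_Performance.csv")
print(sample.shape)               # rows x columns
print(sample.isna().sum().sum())  # total missing values
```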

## ⚙️ Libraries Used

pandas, numpy – data manipulation

matplotlib, seaborn – data visualization

scikit-learn – preprocessing, train-test split, models, metrics

joblib – saving the final model

warnings – to suppress unnecessary warnings

## 📝 Project Steps

### 1. Data Exploration (EDA)

Checked the dataset's shape, column info, missing values, and duplicates.

Visualized distributions using histograms, pairplots, boxplots, and countplots.

Explored categorical columns like Extracurricular Activities with pie charts.

Explored numeric columns like Hours Studied, Sleep Hours, and Previous Scores.

Conducted bivariate analysis to see relationships between features and Performance Index.

Found that Hours Studied has a strong positive correlation with the Performance Index, while Extracurricular Activities has only a minor impact.
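That finding can be illustrated on synthetic data shaped like it; the coefficients below are hypothetical, not fitted from the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic data mimicking the reported pattern: performance rises with
# hours studied, while extracurricular participation barely matters.
rng = np.random.default_rng(0)
n = 500
hours = rng.integers(1, 10, size=n)
extra = rng.integers(0, 2, size=n)
performance = 10 + 2.8 * hours + 0.3 * extra + rng.normal(0, 3, size=n)

df = pd.DataFrame({"Hours Studied": hours,
                   "Extracurricular Activities": extra,
                   "Performance Index": performance})
corr = df.corr()["Performance Index"]
print(corr.round(2))  # strong for Hours Studied, near zero for extracurriculars
```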

### 2. Feature Engineering

Converted Extracurricular Activities to numeric (Yes=1, No=0).

Created new features:

`Effort` = Hours Studied + Sample Question Papers Practiced

`Performance_Interaction` = Previous Scores * Hours Studied

Visualized correlation using a heatmap.
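The encoding and engineered features above can be sketched as follows; the two-row frame is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Hours Studied": [7, 4],
    "Previous Scores": [99, 82],
    "Extracurricular Activities": ["Yes", "No"],
    "Sample Question Papers Practiced": [1, 2],
})

# Encode the Yes/No column as 1/0.
df["Extracurricular Activities"] = df["Extracurricular Activities"].map({"Yes": 1, "No": 0})

# Engineered features described above.
df["Effort"] = df["Hours Studied"] + df["Sample Question Papers Practiced"]
df["Performance_Interaction"] = df["Previous Scores"] * df["Hours Studied"]

print(df[["Effort", "Performance_Interaction"]])
```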

### 3. Data Preprocessing

Separated features (X) and target (y = Performance Index).

Split dataset: 80% train, 20% test.

Applied MinMaxScaler to normalize numeric features.
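The split-then-scale step might look like this; the random matrix stands in for the real features, and the scaler is fitted on the training set only to avoid leakage:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in feature matrix and target (the project uses the real columns).
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(100, 5))
y = rng.uniform(10, 100, size=100)

# 80% train / 20% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit MinMaxScaler on the training set only, then transform both sets.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())  # 0.0 and 1.0 on the train set
```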

### 4. Modeling

Models used:

Linear Regression

Decision Tree Regressor

Random Forest Regressor

Hyperparameter tuning was performed using a custom function, `tune_multiple_models`.

Linear Regression gave the best results.
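A minimal sketch of fitting the three models side by side, on synthetic data with a linear signal (so Linear Regression naturally does well here, as it did on the real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with a linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 5 + 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # R² on the test set
```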

### 5. Evaluation

Metrics used:

R² Score

Mean Absolute Error (MAE)

Root Mean Squared Error (RMSE)

Residual analysis showed no obvious pattern, validating linear regression assumptions.

Visual comparison of actual vs predicted values using scatter plots.
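The three metrics can be computed with scikit-learn as follows; the actual/predicted values here are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative hold-out values, not the project's real predictions.
y_true = np.array([91.0, 65.0, 45.0, 36.0, 66.0])
y_pred = np.array([90.0, 66.5, 44.0, 38.0, 64.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
print(f"R2={r2:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```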

### 6. Model Deployment

Final model saved as `best_model.pkl` using `joblib`.

Ready to be loaded for future predictions:

```python
import joblib

model = joblib.load('best_model.pkl')
predictions = model.predict(new_data)
```

## 🔧 Hyperparameter Tuning Module

This project includes a custom module, `hyperparameter_tuning.py`, to simplify tuning multiple models with `GridSearchCV`.

### Functions

1. `grid_search_tuning` – tune a single model:

```python
best_model, best_params, best_score = grid_search_tuning(model, param_grid, scoring, X_train, y_train)
```

2. `tune_multiple_models` – tune multiple models and get results:

```python
results = tune_multiple_models(models, param_grids, X_train, y_train)
```

Returns a dictionary with each model's:

`'best score'` → best R² score

`'best params'` → best hyperparameters

`'best model'` → trained model pipeline

Works with Pipelines, including preprocessing steps.
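As a rough sketch, `tune_multiple_models` could be built on `GridSearchCV` like this; the repository's actual `hyperparameter_tuning.py` may differ in signature and defaults:

```python
from sklearn.model_selection import GridSearchCV

def tune_multiple_models(models, param_grids, X_train, y_train,
                         scoring="r2", cv=5):
    """Sketch of the helper: run GridSearchCV for each named model
    (or pipeline) and collect its best score, params, and estimator."""
    results = {}
    for name, model in models.items():
        search = GridSearchCV(model, param_grids[name],
                              scoring=scoring, cv=cv)
        search.fit(X_train, y_train)
        results[name] = {
            "best score": search.best_score_,
            "best params": search.best_params_,
            "best model": search.best_estimator_,
        }
    return results
```

Because `GridSearchCV` accepts any estimator, passing a `Pipeline` (with `step__param` keys in the grid) tunes preprocessing and model together.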

## 📊 Visualizations

Histograms and KDE plots for performance distribution.

Boxplots for numeric features and bivariate analysis.

Countplots for categorical variables.

Correlation heatmap to visualize feature relationships.

Scatter plots for actual vs predicted performance.

Residual plots to validate model assumptions.
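The last two plots can be produced together along these lines (illustrative values; the `Agg` backend keeps the script non-interactive):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Illustrative hold-out values, not the project's real predictions.
y_test = np.array([91.0, 65.0, 45.0, 36.0, 66.0])
y_pred = np.array([90.0, 66.5, 44.0, 38.0, 64.0])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs predicted: points near the diagonal mean accurate predictions.
ax1.scatter(y_test, y_pred)
ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "r--")
ax1.set(xlabel="Actual", ylabel="Predicted", title="Actual vs Predicted")

# Residual plot: a patternless cloud around zero supports the linear model.
residuals = y_test - y_pred
ax2.scatter(y_pred, residuals)
ax2.axhline(0, color="r", linestyle="--")
ax2.set(xlabel="Predicted", ylabel="Residual", title="Residuals")

# fig.savefig("evaluation_plots.png")  # persist to disk if needed
plt.close(fig)
```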

## 🔖 Notes

This project demonstrates end-to-end data analysis and predictive modeling.

It’s suitable for beginners to understand the workflow of a regression problem using real-world educational data.

The pipeline approach ensures preprocessing and modeling are combined efficiently.
