# Student Performance Analysis and Prediction
This project analyzes a dataset of student academic performance and builds machine learning models to predict the Performance Index based on various features. It includes exploratory data analysis (EDA), feature engineering, modeling, evaluation, and visualization.
## 📂 Dataset

File: `Student_Performance.csv`, containing 10,000 student records with the following columns:
| Column | Type | Description |
|---|---|---|
| Hours Studied | int | Number of hours a student studies |
| Previous Scores | int | Previous academic scores |
| Extracurricular Activities | str | Participation in extracurricular activities ('Yes'/'No') |
| Sleep Hours | int | Hours of sleep per day |
| Sample Question Papers Practiced | int | Number of sample question papers practiced |
| Performance Index | float | Target variable representing overall performance |
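The dataset can be loaded and sanity-checked with pandas. Since the real CSV isn't bundled here, the sketch below uses a tiny stand-in DataFrame with the same columns; in the project itself the first line would be `df = pd.read_csv('Student_Performance.csv')`:

```python
import pandas as pd

# Stand-in for pd.read_csv('Student_Performance.csv') — same schema, 3 rows.
df = pd.DataFrame({
    'Hours Studied': [7, 4, 8],
    'Previous Scores': [99, 82, 51],
    'Extracurricular Activities': ['Yes', 'No', 'Yes'],
    'Sleep Hours': [9, 4, 7],
    'Sample Question Papers Practiced': [1, 2, 2],
    'Performance Index': [91.0, 65.0, 45.0],
})

print(df.shape)                  # (rows, columns)
print(df.isnull().sum().sum())   # total missing values
print(df.duplicated().sum())     # number of duplicate rows
```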
## ⚙️ Libraries Used

- pandas, numpy – data manipulation
- matplotlib, seaborn – data visualization
- scikit-learn – preprocessing, train-test split, models, metrics
- joblib – saving the final model
- warnings – suppressing unnecessary warnings
## 📝 Project Steps

### Exploratory Data Analysis
- Checked the dataset's shape, column info, missing values, and duplicates.
- Visualized distributions using histograms, pair plots, box plots, and count plots.
- Explored categorical columns such as Extracurricular Activities with pie charts.
- Explored numeric columns such as Hours Studied, Sleep Hours, and Previous Scores.
- Conducted bivariate analysis of the relationships between each feature and the Performance Index.
- Found that Hours Studied has a strong positive correlation with performance, while extracurricular activities have only a minor impact.
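The correlation finding can be reproduced with a one-line pandas call; the five-row series below is an illustrative stand-in for the real columns:

```python
import pandas as pd

# Stand-in data; the project computes this on Student_Performance.csv.
df = pd.DataFrame({
    'Hours Studied': [1, 2, 3, 4, 5],
    'Performance Index': [20.0, 35.0, 50.0, 66.0, 80.0],
})

# Pearson correlation between study time and the target
r = df['Hours Studied'].corr(df['Performance Index'])
print(round(r, 3))
```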
### Feature Engineering
- Converted Extracurricular Activities to numeric (Yes = 1, No = 0).
- Created new features:
  - Effort = Hours Studied + Sample Question Papers Practiced
  - Performance_Interaction = Previous Scores × Hours Studied
- Visualized feature correlations with a heatmap.
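The encoding and the two engineered features can be sketched as follows; the two-row DataFrame is a stand-in for the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'Hours Studied': [7, 4],
    'Previous Scores': [99, 82],
    'Extracurricular Activities': ['Yes', 'No'],
    'Sample Question Papers Practiced': [1, 2],
})

# Yes/No -> 1/0
df['Extracurricular Activities'] = df['Extracurricular Activities'].map({'Yes': 1, 'No': 0})

# Engineered features described above
df['Effort'] = df['Hours Studied'] + df['Sample Question Papers Practiced']
df['Performance_Interaction'] = df['Previous Scores'] * df['Hours Studied']
print(df[['Effort', 'Performance_Interaction']])
```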
### Train-Test Split and Scaling
- Separated features (X) and target (y = Performance Index).
- Split the dataset: 80% train, 20% test.
- Applied MinMaxScaler to normalize the numeric features.
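A minimal sketch of the split-then-scale step, using a random stand-in feature matrix; the key point is that the scaler is fit on the training split only and then applied to both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.random((100, 5))   # stand-in feature matrix
y = rng.random(100)        # stand-in target

# 80/20 split, as in the project
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on train only to avoid leaking test-set statistics
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())  # train features now span [0, 1]
```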
### Modeling
- Models used:
  - Linear Regression
  - Decision Tree Regressor
  - Random Forest Regressor
- Performed hyperparameter tuning with the custom function `tune_multiple_models`.
- Linear Regression gave the best results.
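Fitting the three candidate models follows the standard scikit-learn pattern; the data below is a synthetic linear stand-in, so the scores are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0])   # noiseless linear stand-in target

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))   # R² on the training data
```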
### Evaluation
- Metrics used:
  - R² Score
  - Mean Absolute Error (MAE)
  - Root Mean Squared Error (RMSE)
- Residual analysis showed no obvious pattern, supporting the linear regression assumptions.
- Compared actual vs. predicted values visually with scatter plots.
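All three metrics come straight from scikit-learn; the two arrays here are illustrative stand-ins for the test-set targets and predictions:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([50.0, 60.0, 70.0, 80.0])   # stand-in actuals
y_pred = np.array([52.0, 58.0, 71.0, 79.0])   # stand-in predictions

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root of MSE
print(r2, mae, rmse)
```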
### Saving the Model
- Final model saved as `best_model.pkl` using joblib, ready to be loaded for future predictions:

```python
import joblib

model = joblib.load('best_model.pkl')
predictions = model.predict(new_data)  # new_data: features in the training format
```
## 🔧 Hyperparameter Tuning Module

This project includes a custom module, `hyperparameter_tuning.py`, to simplify tuning multiple models with GridSearchCV.
### Functions

- `grid_search_tuning(model, param_grid, scoring, X_train, y_train)` → returns `best_model, best_params, best_score`.
- `tune_multiple_models(models, param_grids, X_train, y_train)` → returns a dictionary with, for each model:
  - `'best score'` → best R² score
  - `'best params'` → best hyperparameters
  - `'best model'` → trained model pipeline
- Both functions work with Pipelines, including preprocessing steps.
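The module's internals aren't shown here, so the following is only a plausible sketch of `grid_search_tuning` that matches the documented signature (the real `hyperparameter_tuning.py` may differ, e.g. in its cross-validation settings), exercised on a small synthetic Ridge example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical implementation matching the documented signature.
def grid_search_tuning(model, param_grid, scoring, X_train, y_train):
    gs = GridSearchCV(model, param_grid, scoring=scoring, cv=5)
    gs.fit(X_train, y_train)
    return gs.best_estimator_, gs.best_params_, gs.best_score_

rng = np.random.default_rng(0)
X = rng.random((40, 3))
y = X @ np.array([1.0, 2.0, 3.0])   # noiseless linear stand-in target

best_model, best_params, best_score = grid_search_tuning(
    Ridge(), {'alpha': [0.001, 0.1, 1.0]}, 'r2', X, y)
print(best_params, round(best_score, 3))
```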
## 📊 Visualizations

- Histograms and KDE plots for the performance distribution.
- Box plots for numeric features and bivariate analysis.
- Count plots for categorical variables.
- Correlation heatmap to visualize feature relationships.
- Scatter plots of actual vs. predicted performance.
- Residual plots to validate model assumptions.
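A residual plot of the kind used here can be sketched in a few lines of matplotlib; the arrays are illustrative stand-ins, and the `Agg` backend plus `residuals.png` filename are assumptions so the script runs headless:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

y_true = np.array([50.0, 60.0, 70.0, 80.0, 90.0])   # stand-in actuals
y_pred = np.array([52.0, 58.0, 71.0, 79.0, 91.0])   # stand-in predictions
residuals = y_true - y_pred

# Residuals vs. predictions: a patternless cloud around zero supports
# the linear model's assumptions.
fig, ax = plt.subplots()
ax.scatter(y_pred, residuals)
ax.axhline(0, color='red', linestyle='--')
ax.set_xlabel('Predicted Performance Index')
ax.set_ylabel('Residual')
fig.savefig('residuals.png')
```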
## 🔖 Notes

- This project demonstrates end-to-end data analysis and predictive modeling.
- It is suitable for beginners learning the workflow of a regression problem on real-world educational data.
- The pipeline approach keeps preprocessing and modeling combined in a single estimator.