# Student Performance Analysis and Prediction
This project analyzes a dataset of student academic performance and builds machine learning models to predict the Performance Index based on various features. It includes exploratory data analysis (EDA), feature engineering, modeling, evaluation, and visualization.
## 📂 Dataset

File: `Student_Performance.csv`, containing 10,000 student records with the following columns:
| Column | Type | Description |
|---|---|---|
| Hours Studied | int | Number of hours a student studies |
| Previous Scores | int | Previous academic scores |
| Extracurricular Activities | str | Participation in extracurricular activities ('Yes'/'No') |
| Sleep Hours | int | Hours of sleep per day |
| Sample Question Papers Practiced | int | Number of sample question papers practiced |
| Performance Index | float | Target variable representing overall performance |
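The dataset can be loaded and sanity-checked with pandas. Since the real CSV isn't bundled here, the sketch below uses a tiny stand-in DataFrame with the same columns; in the project itself the first line would be `df = pd.read_csv('Student_Performance.csv')`:

```python
import pandas as pd

# Stand-in for pd.read_csv('Student_Performance.csv') — same schema, 3 rows.
df = pd.DataFrame({
    'Hours Studied': [7, 4, 8],
    'Previous Scores': [99, 82, 51],
    'Extracurricular Activities': ['Yes', 'No', 'Yes'],
    'Sleep Hours': [9, 4, 7],
    'Sample Question Papers Practiced': [1, 2, 2],
    'Performance Index': [91.0, 65.0, 45.0],
})

print(df.shape)                  # (rows, columns)
print(df.isnull().sum().sum())   # total missing values
print(df.duplicated().sum())     # number of duplicate rows
```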
## ⚙️ Libraries Used

- pandas, numpy – data manipulation
- matplotlib, seaborn – data visualization
- scikit-learn – preprocessing, train-test split, models, metrics
- joblib – saving the final model
- warnings – suppressing unnecessary warnings
## 📝 Project Steps

### Exploratory Data Analysis
- Checked the dataset's shape, column info, missing values, and duplicates.
- Visualized distributions using histograms, pair plots, box plots, and count plots.
- Explored categorical columns such as Extracurricular Activities with pie charts.
- Explored numeric columns such as Hours Studied, Sleep Hours, and Previous Scores.
- Conducted bivariate analysis of the relationships between each feature and the Performance Index.
- Found that Hours Studied has a strong positive correlation with performance, while extracurricular activities have only a minor impact.
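The correlation finding can be reproduced with a one-line pandas call; the five-row series below is an illustrative stand-in for the real columns:

```python
import pandas as pd

# Stand-in data; the project computes this on Student_Performance.csv.
df = pd.DataFrame({
    'Hours Studied': [1, 2, 3, 4, 5],
    'Performance Index': [20.0, 35.0, 50.0, 66.0, 80.0],
})

# Pearson correlation between study time and the target
r = df['Hours Studied'].corr(df['Performance Index'])
print(round(r, 3))
```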
### Feature Engineering
- Converted Extracurricular Activities to numeric (Yes = 1, No = 0).
- Created new features:
  - Effort = Hours Studied + Sample Question Papers Practiced
  - Performance_Interaction = Previous Scores × Hours Studied
- Visualized feature correlations with a heatmap.
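The encoding and the two engineered features can be sketched as follows; the two-row DataFrame is a stand-in for the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'Hours Studied': [7, 4],
    'Previous Scores': [99, 82],
    'Extracurricular Activities': ['Yes', 'No'],
    'Sample Question Papers Practiced': [1, 2],
})

# Yes/No -> 1/0
df['Extracurricular Activities'] = df['Extracurricular Activities'].map({'Yes': 1, 'No': 0})

# Engineered features described above
df['Effort'] = df['Hours Studied'] + df['Sample Question Papers Practiced']
df['Performance_Interaction'] = df['Previous Scores'] * df['Hours Studied']
print(df[['Effort', 'Performance_Interaction']])
```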
### Train-Test Split and Scaling
- Separated features (X) and target (y = Performance Index).
- Split the dataset: 80% train, 20% test.
- Applied MinMaxScaler to normalize the numeric features.
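A minimal sketch of the split-then-scale step, using a random stand-in feature matrix; the key point is that the scaler is fit on the training split only and then applied to both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.random((100, 5))   # stand-in feature matrix
y = rng.random(100)        # stand-in target

# 80/20 split, as in the project
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on train only to avoid leaking test-set statistics
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())  # train features now span [0, 1]
```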
### Modeling
- Models used:
  - Linear Regression
  - Decision Tree Regressor
  - Random Forest Regressor
- Performed hyperparameter tuning with the custom function `tune_multiple_models`.
- Linear Regression gave the best results.
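Fitting the three candidate models follows the standard scikit-learn pattern; the data below is a synthetic linear stand-in, so the scores are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0])   # noiseless linear stand-in target

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))   # R² on the training data
```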
### Evaluation
- Metrics used:
  - R² Score
  - Mean Absolute Error (MAE)
  - Root Mean Squared Error (RMSE)
- Residual analysis showed no obvious pattern, supporting the linear regression assumptions.
- Compared actual vs. predicted values visually with scatter plots.
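All three metrics come straight from scikit-learn; the two arrays here are illustrative stand-ins for the test-set targets and predictions:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([50.0, 60.0, 70.0, 80.0])   # stand-in actuals
y_pred = np.array([52.0, 58.0, 71.0, 79.0])   # stand-in predictions

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root of MSE
print(r2, mae, rmse)
```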
### Saving the Model
- Final model saved as `best_model.pkl` using joblib, ready to be loaded for future predictions:

```python
import joblib

model = joblib.load('best_model.pkl')
predictions = model.predict(new_data)  # new_data: features in the training format
```
## 🔧 Hyperparameter Tuning Module

This project includes a custom module, `hyperparameter_tuning.py`, to simplify tuning multiple models with GridSearchCV.
### Functions

- `grid_search_tuning(model, param_grid, scoring, X_train, y_train)` → returns `best_model, best_params, best_score`.
- `tune_multiple_models(models, param_grids, X_train, y_train)` → returns a dictionary with, for each model:
  - `'best score'` → best R² score
  - `'best params'` → best hyperparameters
  - `'best model'` → trained model pipeline
- Both functions work with Pipelines, including preprocessing steps.
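The module's internals aren't shown here, so the following is only a plausible sketch of `grid_search_tuning` that matches the documented signature (the real `hyperparameter_tuning.py` may differ, e.g. in its cross-validation settings), exercised on a small synthetic Ridge example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical implementation matching the documented signature.
def grid_search_tuning(model, param_grid, scoring, X_train, y_train):
    gs = GridSearchCV(model, param_grid, scoring=scoring, cv=5)
    gs.fit(X_train, y_train)
    return gs.best_estimator_, gs.best_params_, gs.best_score_

rng = np.random.default_rng(0)
X = rng.random((40, 3))
y = X @ np.array([1.0, 2.0, 3.0])   # noiseless linear stand-in target

best_model, best_params, best_score = grid_search_tuning(
    Ridge(), {'alpha': [0.001, 0.1, 1.0]}, 'r2', X, y)
print(best_params, round(best_score, 3))
```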
## 📊 Visualizations

- Histograms and KDE plots for the performance distribution.
- Box plots for numeric features and bivariate analysis.
- Count plots for categorical variables.
- Correlation heatmap to visualize feature relationships.
- Scatter plots of actual vs. predicted performance.
- Residual plots to validate model assumptions.
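A residual plot of the kind used here can be sketched in a few lines of matplotlib; the arrays are illustrative stand-ins, and the `Agg` backend plus `residuals.png` filename are assumptions so the script runs headless:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

y_true = np.array([50.0, 60.0, 70.0, 80.0, 90.0])   # stand-in actuals
y_pred = np.array([52.0, 58.0, 71.0, 79.0, 91.0])   # stand-in predictions
residuals = y_true - y_pred

# Residuals vs. predictions: a patternless cloud around zero supports
# the linear model's assumptions.
fig, ax = plt.subplots()
ax.scatter(y_pred, residuals)
ax.axhline(0, color='red', linestyle='--')
ax.set_xlabel('Predicted Performance Index')
ax.set_ylabel('Residual')
fig.savefig('residuals.png')
```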
## 🔖 Notes

- This project demonstrates end-to-end data analysis and predictive modeling.
- It is suitable for beginners learning the workflow of a regression problem on real-world educational data.
- The pipeline approach keeps preprocessing and modeling combined in a single estimator.