This repository contains a Python implementation of the research paper "Air Quality Prediction by Machine Learning Models: A Predictive Study on the Indian Coastal City of Visakhapatnam" by Gokulan Ravindiran et al. (2023). The study applies five tree-based ensemble machine learning models to predict the Air Quality Index (AQI) from historical air pollutant and meteorological data.
Air pollution is a significant global challenge, impacting both human health and the environment. This project leverages machine learning models to predict AQI levels using air pollutant and meteorological data. The implementation includes:
- Data preprocessing and transformation
- Exploratory Data Analysis (EDA)
- Model training and evaluation using:
  - LightGBM
  - Random Forest
  - CatBoost
  - AdaBoost
  - XGBoost
- Visualization of model performance and feature importance
- Handles missing and non-numeric values
- Transforms features to reduce skewness and kurtosis
- Predicts AQI with a test-set R² above 0.99 for every model in the results table
- Generates feature importance and comparison metrics for the models
- Visualizes AQI trends and pollutant contributions
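The preprocessing features listed above could look roughly like the following minimal sketch. The column names and toy values are hypothetical, and the specific choices (coercing non-numeric entries to NaN, median imputation, `log1p` for right-skewed columns, a skewness threshold of 1.0) are assumptions for illustration, not necessarily what the repository's scripts do.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Coerce columns to numeric, impute gaps, and reduce right skew."""
    out = df.copy()
    for col in out.columns:
        # Non-numeric entries (e.g. "None", "NA") become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
        # Fill missing values with the column median (an assumed strategy).
        out[col] = out[col].fillna(out[col].median())
        # Apply log1p only to non-negative, strongly right-skewed columns.
        if out[col].min() >= 0 and out[col].skew() > skew_threshold:
            out[col] = np.log1p(out[col])
    return out


# Hypothetical toy data: one string column with a bad entry, one with a gap.
raw = pd.DataFrame({
    "PM2.5": ["12.0", "15.5", "None", "14.0", "300.0", "13.2"],
    "Temperature": [28.1, 29.4, None, 30.2, 27.8, 28.9],
})
clean = preprocess(raw)
print(clean)
```

In this toy run the `PM2.5` column is log-transformed because of the single large outlier, while `Temperature` is only imputed.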
The dataset used in this implementation is based on the Central Pollution Control Board (CPCB) data from July 2017 to September 2022. It includes:
- 12 Air Pollutants: PM2.5, PM10, NO, NO2, NOx, NH3, SO2, CO, Ozone, Benzene, Toluene, Xylene
- Meteorological Factors (the study reports 10; nine are named here): Temperature, Relative Humidity, Wind Speed, Wind Direction, Solar Radiation, Air Pressure, Ambient Temperature, Rainfall, and Total Rainfall
The implementation requires the following Python libraries:
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- lightgbm
- xgboost
- catboost
Install all dependencies using:
pip install -r requirements.txt
- Data Preprocessing: Handles missing and non-numeric values, normalizes skewed data, and prepares features for modeling.
- EDA: Analyzes correlations between pollutants and AQI, and visualizes monthly and annual pollutant variations.
- Model Training and Evaluation: Implements and compares the performance of five machine learning models.
- Prediction: Uses trained models to predict AQI and categorize its health impact.
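The final step, mapping a predicted AQI to a health-impact category, can be sketched as below. The bands follow the CPCB National Air Quality Index categories; the function name `aqi_category` is hypothetical, and whether the repository uses exactly this mapping is an assumption.

```python
import bisect

# Inclusive upper bounds of the first five CPCB National AQI bands;
# anything above the last bound falls into "Severe".
_UPPER_BOUNDS = [50, 100, 200, 300, 400]
_CATEGORIES = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_category(aqi: float) -> str:
    """Return the AQI category for a (possibly fractional) predicted value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left finds the first band whose upper bound is >= aqi.
    return _CATEGORIES[bisect.bisect_left(_UPPER_BOUNDS, aqi)]


for value in (42.0, 100.0, 150.7, 305.2, 450.0):
    print(value, aqi_category(value))
```

Because model outputs are continuous, a prediction such as 50.5 is placed in the next band up ("Satisfactory") rather than rounded first; rounding before lookup would be an equally reasonable design choice.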
| | Model | MAE_Training | MSE_Training | RMSE_Training | R2_Training | MAE_Testing | MSE_Testing | RMSE_Testing | R2_Testing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBM | 1.373889 | 15.846370 | 3.980750 | 0.995221 | 1.811602 | 19.478235 | 4.413415 | 0.992536 |
| 1 | RandomForest | 0.444116 | 3.171609 | 1.780901 | 0.999043 | 1.279939 | 20.688324 | 4.548442 | 0.992072 |
| 2 | CatBoost | 1.373889 | 15.846370 | 3.980750 | 0.995221 | 1.811602 | 19.478235 | 4.413415 | 0.992536 |
| 3 | AdaBoost | 1.373889 | 15.846370 | 3.980750 | 0.995221 | 1.811602 | 19.478235 | 4.413415 | 0.992536 |
| 4 | XGBoost | 0.439370 | 0.635832 | 0.797391 | 0.999808 | 1.623464 | 19.362135 | 4.400243 | 0.992580 |
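The original study reports CatBoost as the most accurate model, with an R² of 0.9998. In the results table above, the highest R² belongs to XGBoost (0.9998 on training data, 0.9926 on testing data).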
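The four metrics reported for each model (MAE, MSE, RMSE, R²) follow their standard definitions. A minimal sketch of how they can be computed with NumPy is shown here; the helper name `regression_metrics` and the sample values are hypothetical, and the repository may rely on `scikit-learn` metric functions instead.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, MSE, RMSE and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)  # RMSE is the square root of MSE
    # R² compares the residual sum of squares with the variance around the mean.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical observed vs. predicted AQI values.
metrics = regression_metrics([50, 100, 150, 200], [52, 98, 151, 199])
print(metrics)
```

Evaluating the same function on both the training and the testing split gives the paired columns of the results table; a large gap between the two sets of values is a common sign of overfitting.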
Key contributors to AQI prediction:
- PM2.5
- PM10
- NO2
- CO
- NOx
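As an illustration of how such a ranking can be obtained, the sketch below fits a `RandomForestRegressor` and reads its impurity-based `feature_importances_`. The data is synthetic and the target is deliberately dominated by PM2.5, so the resulting numbers say nothing about the real dataset; which model and importance measure the repository actually uses is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["PM2.5", "PM10", "NO2", "CO", "NOx"]

# Synthetic pollutant readings (illustrative only, not CPCB data).
X = pd.DataFrame(rng.uniform(0, 100, size=(300, len(features))), columns=features)
# Synthetic target: driven mostly by PM2.5, partly by PM10, plus noise.
y = 2.0 * X["PM2.5"] + 0.5 * X["PM10"] + rng.normal(0, 1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort so the top contributor comes first.
importance = pd.Series(model.feature_importances_, index=features)
importance = importance.sort_values(ascending=False)
print(importance)
```

Impurity-based importances can favour high-cardinality or correlated features, so permutation importance (`sklearn.inspection.permutation_importance`) is a useful cross-check.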
The repository includes scripts to visualize:
- Correlation matrices
- Feature importance
- AQI trends (monthly and annual)
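A minimal sketch of two of these plots, a correlation matrix and a monthly AQI trend, is given below. It uses synthetic daily data with hypothetical column names and plain `matplotlib`; the repository's own scripts may use `seaborn` and different styling.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive sessions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily readings for one year (illustrative only).
rng = np.random.default_rng(42)
dates = pd.date_range("2021-01-01", "2021-12-31", freq="D")
pm25 = rng.uniform(10, 120, size=len(dates))
data = pd.DataFrame(
    {
        "PM2.5": pm25,
        "PM10": pm25 * 1.5 + rng.normal(0, 10, size=len(dates)),
        "AQI": pm25 * 2.0 + rng.normal(0, 5, size=len(dates)),
    },
    index=dates,
)

corr = data.corr()  # Pearson correlation between all columns
monthly_aqi = data["AQI"].groupby(data.index.month).mean()

fig, (ax_corr, ax_trend) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: correlation matrix as a colour-coded grid.
image = ax_corr.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_corr.set_xticks(range(len(corr.columns)))
ax_corr.set_xticklabels(corr.columns)
ax_corr.set_yticks(range(len(corr.index)))
ax_corr.set_yticklabels(corr.index)
ax_corr.set_title("Correlation matrix")
fig.colorbar(image, ax=ax_corr)

# Right panel: mean AQI per calendar month.
ax_trend.plot(monthly_aqi.index, monthly_aqi.to_numpy(), marker="o")
ax_trend.set_xlabel("Month")
ax_trend.set_ylabel("Mean AQI")
ax_trend.set_title("Monthly AQI trend")

fig.tight_layout()
```

Call `fig.savefig("aqi_overview.png")` (a hypothetical file name) or `plt.show()` to view the result.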
- Gokulan Ravindiran, Gasim Hayder, Karthick Kanagarathinam, Avinash Alagumalai, Christian Sonne. "Air Quality Prediction by Machine Learning Models: A Predictive Study on the Indian Coastal City of Visakhapatnam." Chemosphere, 2023. [DOI: 10.1016/j.chemosphere.2023.139518](https://doi.org/10.1016/j.chemosphere.2023.139518)
- Central Pollution Control Board (CPCB), India.
Special thanks to the authors of the research paper and the organizations involved for providing the foundational dataset and methodologies.