This project analyzes and models air quality data using data processing, visualization, and machine learning techniques, following the EDA guide by Shahid Zikria on Kaggle. The goal is to predict the concentration of Benzene (C6H6(GT)) in the air based on several sensor readings. The project includes data preprocessing, exploratory data analysis, feature engineering, and the implementation of a simple Artificial Neural Network (ANN) using PyTorch for regression.
Ensure that the following Python libraries are installed:
- pandas: Data manipulation and analysis
- matplotlib: Plotting and data visualization
- seaborn: Statistical data visualization
- numpy: Numerical computing
- scikit-learn: Machine learning tools
- torch: PyTorch library for deep learning
The dataset used in this project contains air quality measurements from different sensors. It includes features like the concentration of CO, NOx, NO2, and other gases, as well as the Benzene concentration, which is the target variable for prediction.
The dataset is stored in a CSV file (AirQuality.csv), with the following characteristics:
- The delimiter used is a semicolon (;).
- Decimal points are represented by commas (,).
- The dataset contains some missing or invalid data, represented by -200.
- The dataset is loaded using pandas, with appropriate handling of the delimiter and decimal separator.
- Unnecessary columns, such as unnamed index columns, are removed.
- Missing data represented by -200 is replaced with NaN, and rows containing NaN values are dropped.
- A heatmap is generated to visualize the distribution of missing data.
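
The loading and cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `load_air_quality` and the tiny in-memory sample are made up for the example, and the assumption that the unwanted columns are named "Unnamed: N" reflects how pandas labels empty header fields.

```python
import io

import numpy as np
import pandas as pd

# Tiny made-up sample in the same layout as AirQuality.csv (illustrative values only).
SAMPLE_CSV = (
    "Date;Time;CO(GT);C6H6(GT);\n"
    "10/03/2004;18.00.00;2,6;11,9;\n"
    "10/03/2004;19.00.00;-200;9,4;\n"
    "10/03/2004;20.00.00;2,2;9,0;\n"
)


def load_air_quality(source):
    """Read the semicolon-delimited file and drop rows with missing readings."""
    # sep=";" handles the delimiter, decimal="," turns "2,6" into 2.6.
    df = pd.read_csv(source, sep=";", decimal=",")
    # A trailing ';' on each row yields an empty "Unnamed: N" column; drop those.
    df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
    # -200 is the sentinel for a missing or invalid reading.
    df = df.replace(-200, np.nan)
    return df.dropna().reset_index(drop=True)


df = load_air_quality(io.StringIO(SAMPLE_CSV))
print(df)
```

For the real data, a file path can be passed in place of the `io.StringIO` object. To visualize missing values before dropping them, something like `seaborn.heatmap(df.isnull())` could be called on the frame right after the `replace` step.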
- The Date column is converted to a datetime format, and the month is extracted as a new feature.
- The data is explored through various visualizations:
  - Histogram of CO levels (CO(GT))
  - Monthly average CO levels
  - Summary statistics of the dataset
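
A short sketch of the date handling and monthly aggregation described above. The rows are made up for illustration, the column name `Month` is a hypothetical choice, and the day/month/year date format is an assumption about the file rather than something stated here.

```python
import pandas as pd

# Made-up rows for illustration; the real values come from AirQuality.csv.
df = pd.DataFrame({
    "Date": ["10/03/2004", "11/03/2004", "01/04/2004", "02/04/2004"],
    "CO(GT)": [2.0, 3.0, 1.0, 2.0],
})

# Assumption: dates are written day/month/year.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Month"] = df["Date"].dt.month  # hypothetical name for the new feature

# Monthly average CO level and summary statistics.
monthly_co = df.groupby("Month")["CO(GT)"].mean()
print(monthly_co)
print(df["CO(GT)"].describe())
```

With matplotlib installed, `df["CO(GT)"].plot(kind="hist")` and `monthly_co.plot(kind="bar")` would be one way to draw the histogram and the monthly averages.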
- Outliers are identified using the Interquartile Range (IQR) method.
- Detected outliers are replaced with the median of the respective column to minimize their impact on the model.
- Box plots are generated before and after outlier handling to visualize the changes.
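
The IQR-based replacement could look roughly like the sketch below. The helper name `replace_outliers_with_median` is hypothetical, and the 1.5 multiplier is the conventional choice, assumed here because the exact threshold is not stated.

```python
import pandas as pd


def replace_outliers_with_median(df, columns, k=1.5):
    """Return a copy where values outside [Q1 - k*IQR, Q3 + k*IQR] become the column median."""
    out = df.copy()
    for col in columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        is_outlier = (out[col] < lower) | (out[col] > upper)
        # The median is taken over the original column, outliers included.
        out.loc[is_outlier, col] = out[col].median()
    return out


# Made-up values: 50.0 lies far above the upper fence of 5.75.
sample = pd.DataFrame({"CO(GT)": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0]})
cleaned = replace_outliers_with_median(sample, ["CO(GT)"])
print(cleaned["CO(GT)"].tolist())  # [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 3.0]
```

Replacing rather than dropping outliers keeps the row count unchanged, which matters when other columns in the same row hold valid readings.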
- A correlation heatmap is created to examine the relationships between different features, which helps in feature selection for the model.
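
As a rough illustration of how the correlation matrix can guide feature selection, the sketch below ranks features by their absolute correlation with the target. The values are made up, and ranking by absolute correlation is one possible reading of "relevance", not necessarily the criterion used in this project.

```python
import pandas as pd

# Made-up readings for illustration only.
df = pd.DataFrame({
    "CO(GT)": [1.0, 2.0, 3.0, 4.0, 5.0],
    "NO2(GT)": [5.0, 3.0, 4.0, 1.0, 2.0],
    "C6H6(GT)": [2.0, 4.1, 5.9, 8.0, 10.1],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Rank the candidate features by strength of correlation with the target.
target = "C6H6(GT)"
target_corr = corr[target].drop(target).abs().sort_values(ascending=False)
print(target_corr)
```

The matrix itself can be drawn with `seaborn.heatmap(corr, annot=True)` when seaborn and matplotlib are available.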
- Features (X) are selected based on their relevance to the target variable (C6H6(GT)).
- The dataset is split into training, validation, and test sets using train_test_split from scikit-learn.
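
A three-way split can be obtained by calling `train_test_split` twice, as sketched below. The 70/15/15 proportions and the `random_state` value are assumptions for the example, since the actual split sizes are not given, and the arrays are random stand-ins for the real features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# First split: hold out 30% of the data (assumed proportion).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
# Second split: divide the held-out part evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```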
- An ANN model is defined using PyTorch, with the following structure:
  - Input layer: size equal to the number of features
  - Hidden layer: 64 neurons with ReLU activation
  - Output layer: single neuron for regression
- The model is trained using the Mean Squared Error (MSE) loss function and the Adam optimizer.
- The model is trained for 100 epochs, with the loss printed every 10 epochs for both training and validation sets.
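
The architecture and training loop described above might be written as in the following sketch. The class name `SimpleANN`, the learning rate of 0.01, full-batch updates, and the synthetic tensors are all assumptions made for the example; only the layer sizes, the MSE loss, the Adam optimizer, and the 100-epoch schedule come from the description.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in data: the target is a noisy sum of the features.
n_features = 4
X_train = torch.randn(200, n_features)
y_train = X_train.sum(dim=1, keepdim=True) + 0.1 * torch.randn(200, 1)
X_val = torch.randn(50, n_features)
y_val = X_val.sum(dim=1, keepdim=True) + 0.1 * torch.randn(50, 1)


class SimpleANN(nn.Module):
    """One hidden layer of 64 ReLU units and a single regression output."""

    def __init__(self, n_inputs):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, 64)
        self.relu = nn.ReLU()
        self.output = nn.Linear(64, 1)

    def forward(self, x):
        return self.output(self.relu(self.hidden(x)))


model = SimpleANN(n_features)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # lr is an assumption

train_losses = []
for epoch in range(1, 101):
    # One full-batch gradient step per epoch (an assumption; mini-batches also work).
    model.train()
    optimizer.zero_grad()
    train_loss = criterion(model(X_train), y_train)
    train_loss.backward()
    optimizer.step()
    train_losses.append(train_loss.item())

    # Report training and validation loss every 10 epochs.
    if epoch % 10 == 0:
        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val), y_val)
        print(f"Epoch {epoch:3d}  train MSE {train_loss.item():.4f}  "
              f"val MSE {val_loss.item():.4f}")
```

The targets are shaped `(n, 1)` to match the model output; passing a 1-D target to `nn.MSELoss` would trigger broadcasting and a misleading loss value.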
- The model's performance is evaluated using the Root Mean Squared Error (RMSE) and R-squared (R²) metrics on both training and test sets.
- These metrics provide insights into the model's accuracy and goodness-of-fit.
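
For reference, RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with scikit-learn, using made-up values in place of real model predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"RMSE: {rmse:.4f}  R²: {r2:.4f}")  # RMSE: 0.1581  R²: 0.9800
```

RMSE is expressed in the same units as the target, while R² is unitless; comparing both on the training and test sets gives a quick check for overfitting.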
- Place the AirQuality.csv file in the appropriate directory (e.g., USB/Air-Quality-Study/).
- Ensure that all required libraries are installed.
- Run the Python script in your preferred environment (e.g., Jupyter Notebook, PyCharm, or command line).
- Observe the output and visualizations to understand the data processing steps, model training, and evaluation results.
The trained model provides predictions for the Benzene concentration based on the input features. The performance metrics (RMSE and R²) help assess the accuracy of these predictions, guiding potential improvements or adjustments to the model.
- Experiment with more complex models (e.g., deeper neural networks, ensemble methods) to improve prediction accuracy.
- Implement more sophisticated imputation methods for missing data.
- Explore feature selection techniques to reduce dimensionality and improve model efficiency.