# -*- coding: utf-8 -*-
"""Electricity Revenues Predictor.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1_xVQPAmAPG0M0y_8899Nsq38yl_-avH5

# **Group Description**

**Group No: 25 Based on Spreadsheet**

**Group Name: ABRACADABRA**

**Team Member Details**

1. Saad Ahmed Pathan (22114077)
2. Samio Ayman (22082403)
3. Nur Shaheila Ashriza Binti Mohd Saupi (22001745)
4. Nurina Humaira Binti Mohd Romzan (22002204)
5. Nur Aina Batrisyia Binti Zakaria (23005013)
6. Siti Hajar Binti Mohd Nor Azman (22002035)

# **1. Project Designing**

Electricity is essential for economic and social development, enabling nations to achieve higher living standards.

In today's world, effective planning and operation of electricity production, revenue generation from production, and energy consumption are imperative. Understanding how energy generates revenue and is utilized by consumers is crucial for better management. This presents an opportunity to develop a supervised machine learning model to forecast future electricity revenues.


1. Initial Phase: We brainstormed the problem and potential approaches to solve it using machine learning concepts. Then, we designed the workflow of our project.


2. Data Mining: We extracted a dataset from Data.gov, covering data from 2015 to 2022. The dataset includes revenue, units sold, and the average number of customers, categorized by customer class for each electric utility operating in Iowa, USA.


3. Data Preprocessing: We examined the dataset, identified null values, and generated a statistical summary of its features.


4. Feature Discussion: We discussed and renamed features for better readability and understanding, facilitating a smoother data environment.


5. Exploratory Data Analysis (EDA) and Visualization: EDA and visualization clarified the relationship between the features and the label (the dependent variable). A heatmap was used to examine the correlations between the independent variables, which helped identify important features. Selecting the right features to improve accuracy was challenging.

6. Feature Selection: We applied Principal Component Analysis (PCA) to reduce dimensionality and used the first principal component (PC1) as the input feature for our models.

7. Model Training and Assessment: We employed Linear Regression, Random Forest Regression, Neural Network Regression, Decision Tree, and XGBoost techniques. After comparing numerous metrics, we determined that the Random Forest Regressor produced the best results.

8. Model Explainability: We used a bar chart to compare the performance of all five models, assisting in selecting the best one. The Random Forest Regressor emerged as the best model for our dataset.

9. Conclusion: We summarized our project, from model selection and evaluation to finding the most suitable model for our dataset. We also highlighted key findings from each model with their respective values.


**Problem Statement**

The goal is to develop a machine learning model capable of accurately forecasting electricity revenues based on the provided features. This model is valuable for utility companies, energy firms, and policymakers who need to optimize electricity consumption, reduce costs, and minimize the environmental impact of energy usage.

Specifically, the model should reliably predict electricity revenues by considering various factors influencing energy consumption, such as consumer types and the number of consumers. This can help utility companies, building managers, and energy firms identify patterns and trends in energy consumption, enabling them to make informed energy decisions. Policymakers can also use this data to create regulations and incentives that promote energy efficiency and sustainability.

# **2. Data Mining**

The dataset used for this project is acquired from the website Data.gov. Data.gov is a comprehensive and open data portal maintained by the United States government. It serves as a centralized repository for accessing a wide range of government datasets, providing the public, researchers, and developers with valuable information for analysis, innovation, and transparency.


The dataset titled **"Electric Utilities Revenue, Units Sold, and Customers by Year"** covers data from 2015 to 2022, detailing the revenue, units sold, and average number of customers categorized by customer class for each electric utility operating in the state of Iowa, USA. This publicly accessible dataset aims to provide insights into the performance and customer base of electric utilities in Iowa. However, no specific license information is provided for this dataset.

**Columns Description**

1. Reporting Year

2. Company Number & Year

3. Type of Utility

4. Utility

5. Operating Revenues - Residential Sales

6. Operating Revenues - Commercial & Industrial Sales

7. Operating Revenues - Sales for Resale

8. Operating Revenues - All Other Sales

9. MWh Sold - Residential

10. MWh Sold - Commercial & Industrial

11. MWh Sold - Sales for Resale

12. MWh Sold - All Other

13. Average No. of Customers - Residential

14. Average No. of Customers - Commercial & Industrial

15. Average No. of Customers - Sales for Resale

16. Average No. of Customers - All Other


**Dataset Source Link**

https://catalog.data.gov/dataset/electric-utilities-revenue-units-sold-and-customers-by-year

# **3. Data Preprocessing**
"""

# Wrap long lines in Google Colaboratory cell output
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

"""Check for missing values, outliers, and inconsistencies in the dataset and handle them appropriately. Missing values can be imputed or dropped based on the extent of missingness and their impact on the analysis."""

# Commented out IPython magic to ensure Python compatibility.
# Import Libraries for analysis and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# %matplotlib inline

# To import datetime library
from datetime import datetime
import datetime as dt

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Import necessary statistical libraries
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import norm

# Import libraries for ML-Model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV

# Library for saving the model
import pickle

# Mount Google Drive to access the dataset
from google.colab import drive
drive.mount('/content/drive')

# Load the dataset
file_path = '/content/drive/MyDrive/Machine Learning Project/electricity_consumption_data.csv'

df = pd.read_csv(file_path)

# Display the shape of the data
df.shape

# Display the first few rows to understand the data
print(df.head())

df.head(5)

df.iloc[745 : 751]

df.tail(5)

df.info()

# Determine the datatype of Each Column
df.dtypes

# Get a statistical summary to check for outliers
print(df.describe())

# Count fully duplicated rows
dup_count = df.duplicated().sum()
print(f"Number of duplicate rows: {dup_count}")

# Find the missing values of each column
null_values = df.isnull().sum()
print(null_values)

# Visualize the share of missing values per column
# (sns.displot is figure-level and creates its own figure)
sns.displot(
    data=df.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25
)
plt.savefig("visualizing_missing_data_with_barplot_Seaborn_distplot.png", dpi=100)

# Remove all rows with missing data (copy so that later column assignments
# do not trigger SettingWithCopyWarning)
data = df.dropna().copy()
data.isna().sum()
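"""Dropping every incomplete row is the simplest option, but the note above also mentions imputation. The following is a minimal sketch of a mixed strategy on a toy frame. The helper `handle_missing`, the 50% threshold, and the toy column names are illustrative assumptions and are not part of the project code.

```python
import numpy as np
import pandas as pd


def handle_missing(frame, drop_threshold=0.5):
    # Drop columns whose share of missing values exceeds drop_threshold,
    # fill remaining numeric gaps with the column median, then drop any
    # rows that still contain missing (non-numeric) values.
    missing_share = frame.isna().mean()
    kept = frame.loc[:, missing_share <= drop_threshold].copy()
    numeric_cols = kept.select_dtypes(include='number').columns
    kept[numeric_cols] = kept[numeric_cols].fillna(kept[numeric_cols].median())
    return kept.dropna()


# Hypothetical toy data, not taken from the Iowa dataset
toy = pd.DataFrame({
    'revenue': [100.0, np.nan, 300.0, 400.0],
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
    'utility': ['A', 'B', None, 'D'],
})
cleaned = handle_missing(toy)
print(cleaned)
```

Whether median imputation is appropriate depends on why the values are missing, so this should be checked against the real dataset before use.
"""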

"""# **4. Variable Description**

RY - Reporting Year

ToU - Type of Utility

U - Utility

ORoRS - Operating Revenues of Residential Sales

ORoCIS - Operating Revenues of Commercial & Industrial Sales

ORoSR - Operating Revenues of Sales for Resale

ORoAOS - Operating Revenues of All Other Sales

ASforR - Amount Sold for Residential in MWh

ASforCI - Amount Sold for Commercial & Industrial in MWh

ASforSR - Amount Sold for Sales for Resale in MWh

ASforAO - Amount Sold for All Other in MWh

ANoCR - Average No. of Customers in Residential

ANoCCI - Average No. of Customers in Commercial & Industrial

ANoCSR - Average No. of Customers in Sales for Resale

ANoCAO - Average No. of Customers in All Other
"""

# Show all columns
df.columns

# Work on a copy of the cleaned data (rows with missing values removed)
df_energy = data.copy()

# Apply One-Hot Encoding
df_energy = pd.get_dummies(df_energy, columns=['ToU', 'U'])

print("DataFrame after One-Hot Encoding:")
print(df_energy)
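"""To make the effect of one-hot encoding concrete, here is a small sketch on a toy frame. The category values ('Municipal', 'Cooperative') and the revenue figures are made up for illustration; only the column names `ToU` and `ORoRS` come from the project.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'ToU': ['Municipal', 'Cooperative', 'Municipal'],
    'ORoRS': [1200.0, 800.0, 950.0],
})

# Each distinct category becomes its own indicator column
encoded = pd.get_dummies(toy, columns=['ToU'])
print(encoded.columns.tolist())
```

A column with many distinct values, such as the utility name `U`, produces one indicator column per value, which can widen the frame considerably.
"""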

# # Rename the columns
# df_rename = df.copy()
# df_rename.rename(columns={'RY': 'reporting_year', 'ToU':'utility_type', 'U':'utility', 'ORoRS ': 'residential_revenues', 'ORoCIS':'commercial_revenues',
#        'ORoSR':'resale_revenues', 'ORoAOS':'other_revenues', 'ASforR ':'residential_sales', 'ASforCI':'commercial_sales', 'ASforSR':'resale_sales', 'ASforAO':'other_sales'
#        ,'ANoCR':'residential_customers', 'ANoCCI':'commercial_customers', 'ANoCSR':'resale_customers', 'ANoCAO':'other_customers'},inplace = True)

df_rename = df.copy()
df_rename.rename(columns={
    'Reporting Year': 'reporting_year',
    'Company Number & Year': 'company_number_year',
    'Type of Utility': 'utility_type',
    'Utility': 'utility',
    'Operating Revenues - Residential Sales': 'residential_revenues',
    'Operating Revenues - Commercial & Industrial Sales': 'commercial_revenues',
    'Operating Revenues - Sales for Resale': 'resale_revenues',
    'Operating Revenues - All Other Sales ': 'other_revenues',
    'MWh Sold - Residential': 'residential_sales',
    'MWh Sold - Commercial & Industrial': 'commercial_sales',
    'MWh Sold - Sales for Resale': 'resale_sales',
    'MWh Sold - All Other': 'other_sales',
    'Average No. of Customers - Residential': 'residential_customers',
    'Average No. of Customers - Commercial & Industrial': 'commercial_customers',
    'Average No. of Customers - Sales for Resale': 'resale_customers',
    'Average No. of Customers - All Other': 'other_customers'
}, inplace=True)

print(df_rename.columns)

# df_rename.columns

df_energy.columns

# Check unique values for each variable
def get_unique_values(df1):
    unique_values = df1.apply(pd.Series.unique)
    return unique_values

unq_values = get_unique_values(df)

for col in df.columns.tolist():
    print("No. of unique values in", col, "is", df[col].nunique())

# Separate columns in list for better analysis
gen_cols=['reporting_year', 'utility_type', 'utility']
rev_cols=['residential_revenues', 'commercial_revenues', 'resale_revenues', 'other_revenues']
sal_cols=['residential_sales', 'commercial_sales', 'resale_sales', 'other_sales']
cus_cols=['residential_customers', 'commercial_customers', 'resale_customers','other_customers']

"""# **5. Data Visualization**"""

# Chart - 01 visualization
# Distribution of "ORoCIS" (commercial_revenues)
# sns.distplot is deprecated, so sns.histplot with a KDE overlay is used instead
plt.figure(figsize=(5,5))
sns.histplot(df_energy['ORoCIS'], kde=True, stat='density', color='blue')

# Chart - 02 visualization
# Distribution of "ASforR" (residential_sales)
plt.figure(figsize=(5,5))
sns.histplot(df_energy['ASforR'], kde=True, stat='density', color='blue')

# Chart - 03 visualization
# Distribution of "ANoCCI" (commercial_customers)
plt.figure(figsize=(5,5))
sns.histplot(df_energy['ANoCCI'], kde=True, stat='density', color='blue')

# Display the heatmap
data['ToU'] = data['ToU'].astype('category').cat.codes
data['U'] = data['U'].astype('category').cat.codes

correlation_matrix = data.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

plt.title('Correlation Matrix Heatmap')
plt.show()
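"""A heatmap is easy to read for a handful of columns, but strongly correlated pairs can also be listed programmatically. The sketch below assumes a 0.9 cutoff; the helper `strong_pairs` and the toy columns are hypothetical and only illustrate the idea.

```python
import pandas as pd


def strong_pairs(frame, threshold=0.9):
    # Return (col_a, col_b, r) for every column pair whose absolute
    # Pearson correlation is at least the threshold.
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Hypothetical toy data: two nearly proportional columns and one unrelated column
toy = pd.DataFrame({
    'customers': [10, 20, 30, 40, 50],
    'mwh_sold': [21, 39, 62, 80, 99],
    'noise': [5, 1, 4, 2, 3],
})
pairs = strong_pairs(toy)
print(pairs)
```

Pairs flagged this way carry largely redundant information, which is one reason a dimensionality-reduction step such as PCA can be useful.
"""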

# Handling outliers & outlier treatments
df = df_energy.copy()
col_list = list(df.describe().columns)

# Find the outliers using boxplot
plt.figure(figsize=(25, 20))
plt.suptitle("Box Plot", fontsize=18, y=0.95)

for n, ticker in enumerate(col_list):

    ax = plt.subplot(8, 4, n + 1)

    plt.subplots_adjust(hspace=0.5, wspace=0.2)

    sns.boxplot(x=df[ticker],color='pink', ax = ax)

    ax.set_title(ticker.upper())
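"""The box plots above show outliers visually. As a sketch of how the same rule can be applied numerically, the helper below flags values outside 1.5 times the interquartile range, which is the default whisker length in these box plots. The function name `iqr_outlier_mask` and the toy values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    # True where a value lies more than k * IQR below Q1 or above Q3
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical toy data with one obvious outlier
toy = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Flagged rows could then be inspected, capped, or removed. Which treatment is suitable depends on whether the extreme values are errors or simply very large utilities.
"""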

"""# **6. Feature Selection**"""

# Dimensionality reduction using PCA
from sklearn.decomposition import PCA

df_energy = pd.get_dummies(data, columns=['ToU', 'U'])

# Note: df_energy still contains the target column 'ORoRS', so the principal
# components computed below also carry information from the target.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_energy)

# Set the number of principal components
pca = PCA(n_components=5)
principal_components = pca.fit_transform(scaled_features)

pca_df = pd.DataFrame(data=principal_components, columns=[f'PC{i+1}' for i in range(principal_components.shape[1])])

print("PCA result:")
print(pca_df)

print("Explained variance ratio by each principal component:")
print(pca.explained_variance_ratio_)

# Get the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_

# Create a DataFrame to store the results
pca_results = pd.DataFrame({'Principal Component': [f'PC{i+1}' for i in range(len(explained_variance_ratio))],
                            'Explained Variance Ratio': explained_variance_ratio})

# Print the results
print(pca_results)
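"""One common way to decide how many components to keep is to look at the cumulative explained variance. The sketch below uses synthetic data and an assumed 95% target; neither comes from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Synthetic data: three strongly related columns plus two independent noise columns
toy = np.hstack([
    base,
    2.0 * base + 0.05 * rng.normal(size=(200, 1)),
    -base + 0.05 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 2)),
])

toy_pca = PCA().fit(StandardScaler().fit_transform(toy))
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_for_95)
```

If PC1 alone explains only a modest share of the variance, using it as the single input discards a lot of information, so this check is worth running on the real data.
"""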

# Get the loadings of the principal components from the PCA fitted above
# (rows are components, columns are the scaled input features)
loadings = pca.components_

# Create a DataFrame to store the loadings
loadings_df = pd.DataFrame(data=loadings, columns=df_energy.columns,
                           index=[f'PC{i+1}' for i in range(loadings.shape[0])])

# Print the loadings
print(loadings_df)

# Feature with the largest absolute loading on PC1
most_important_feature_pc1 = loadings_df.loc['PC1'].abs().idxmax()
print(most_important_feature_pc1)

# Select PC1 as the feature
X = pca_df[['PC1']]

# Assuming ORoRS as the dependent variable for regression
y = df_energy['ORoRS']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting datasets
print("Shapes of the datasets:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test: {y_test.shape}")
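"""In the cells above, the scaler and PCA are fitted on a frame that still contains `ORoRS`, and on all rows before the train/test split. Both can leak target or test information into PC1 and make the scores look better than they would on unseen data. The following is a minimal sketch of an alternative on synthetic data, assuming a scikit-learn pipeline; it is not the method used to produce the results reported in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins: three correlated predictors and a separate target
driver = rng.normal(size=300)
features = np.column_stack([
    driver + 0.1 * rng.normal(size=300),
    2.0 * driver + 0.1 * rng.normal(size=300),
    0.5 * driver + 0.1 * rng.normal(size=300),
])
target = 3.0 * driver + 0.2 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42)

# The scaler and PCA are fitted on the training features only,
# and the target never enters the PCA step.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pipeline.fit(X_tr, y_tr)
test_r2 = pipeline.score(X_te, y_te)
print(test_r2)
```

Applying the same pattern to the project data would mean dropping `ORoRS` from the frame before scaling and splitting before fitting the pipeline.
"""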

"""# **7. Model Selection**"""

from sklearn.linear_model import LinearRegression

# Assuming df_energy, pca_df, X, y, X_train, X_test, y_train, and y_test are already defined

# Initialize and train the Linear Regression model
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = linear_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Linear Regression Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")

# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('PC1')
plt.ylabel('ORoRS')
plt.title('Linear Regression: Actual vs Predicted')
plt.legend()
plt.show()
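"""The same three metrics are recomputed for every model in this section. A small helper, sketched below, would keep those cells consistent; the name `regression_report` is a hypothetical addition, and the numbers are a hand-checkable toy example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    # Return MSE, RMSE and R-squared for one set of predictions
    mse = mean_squared_error(y_true, y_pred)
    return {'mse': mse, 'rmse': float(np.sqrt(mse)), 'r2': r2_score(y_true, y_pred)}


# Toy values: squared errors are 1, 0, 1, 0, so MSE is 0.5
report = regression_report([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(report)
```

Note that MSE and RMSE depend on the scale of the target, while R-squared does not, so all models should be scored on the same (unscaled) target to be comparable.
"""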

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Separate features and target
features = df_energy.drop(columns=['ORoRS'])
target = df_energy['ORoRS']

# Standardize the features and target separately
scaler_features = StandardScaler()
scaled_features = scaler_features.fit_transform(features)

scaler_target = StandardScaler()
scaled_target = scaler_target.fit_transform(target.values.reshape(-1, 1))

# Select PC1 as the feature
X = pca_df[['PC1']]

# Use the scaled target for regression
y = scaled_target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the neural network model
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),  # Input layer with the number of PCs as input shape
    layers.Dense(32, activation='relu'),  # Hidden layer with 32 neurons and ReLU activation
    layers.Dense(1)  # Output layer with a single neuron (for regression)
])

model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test), verbose=0)

# Evaluate the model on the test data
test_loss = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss:.4f}")

# Make predictions on the test data
y_pred = model.predict(X_test)

# Inverse transform predictions and true values
y_pred_inv = scaler_target.inverse_transform(y_pred)
y_test_inv = scaler_target.inverse_transform(y_test)

# Evaluate the model
mse = mean_squared_error(y_test_inv, y_pred_inv)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_inv, y_pred_inv)
print("Neural Network Regression Model Evaluation:")
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'R-squared (R2): {r2}')

# Plot the actual data and model predictions
plt.figure(figsize=(10, 6))
plt.scatter(range(len(y_test_inv)), y_test_inv, label='Actual Data', color='blue')
plt.scatter(range(len(y_pred_inv)), y_pred_inv, label='Predicted Data', color='red')
plt.xlabel('Test sample index')
plt.ylabel('ORoRS')
plt.legend()
plt.title('Neural Network Regression: Actual vs Predicted')
plt.show()

from sklearn.tree import DecisionTreeRegressor, plot_tree

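# Use PC1 as the feature and the unscaled ORoRS as the target, so that the
# error metrics are in the same units as for the other models
X = pca_df[['PC1']]
y = df_energy['ORoRS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree regressor
regressor = DecisionTreeRegressor(random_state=42)

# Fit the regressor to the training data
regressor.fit(X_train, y_train)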

# Visualize the top levels of the decision tree
# (class_names is omitted because it only applies to classification trees)
fig, ax = plt.subplots(figsize=(15, 15))
plot_tree(regressor, max_depth=3, feature_names=['PC1'],
          filled=True, rounded=True, fontsize=10, label='all', ax=ax)
plt.tight_layout()  # Adjust layout to prevent overlapping
plt.show()

# Make predictions on the testing data
y_pred = regressor.predict(X_test)

print("Decision Tree Model Evaluation:")

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Calculate the Root Mean Squared Error
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R2): {r2}")

from sklearn.ensemble import RandomForestRegressor

# Select PC1 as the feature
X = pca_df[['PC1']]

# Assuming ORoRS as the dependent variable for regression
y = df_energy['ORoRS']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Random Forest Regressor Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")

# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('PC1')
plt.ylabel('ORoRS')
plt.title('Random Forest Regressor: Actual vs Predicted')
plt.legend()
plt.show()

import xgboost as xgb

# Assuming df_energy, pca_df, X, y, X_train, X_test, y_train, and y_test are already defined

# Initialize and train the XGBoost regression model
xgbr = xgb.XGBRegressor(verbosity=0)
xgbr.fit(X_train, y_train)

# Predictions on the test set
y_pred = xgbr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("XGBoost Regression Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")

# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('PC1')
plt.ylabel('ORoRS')
plt.title('XGBoost Regression: Actual vs Predicted')
plt.legend()
plt.show()

"""# **8. Model Evaluation**

Based on the evaluation metrics of the five models above, the Random Forest Regressor outperforms the others: it has the lowest Mean Squared Error (MSE) and the highest R-squared value. On this test split, it therefore gives the most accurate predictions and explains the most variance in the target variable.
"""

# Evaluation results for each model (values copied from the cell outputs above)
models = ['Linear Regression', 'Neural Network Regression', 'Decision Tree', 'Random Forest Regressor', 'XGBoost Regression']
mse_values = [231588760521001.2, 222835706743011.4, 6411923390724.246, 5297160256238.563, 13617885819724.05]
rmse_values = [15218040.626867875, 14927682.564383911, 2532177.5985748405, 2301556.051074699, 3690241.9730586843]
r2_values = [0.9067015665821766, 0.9102278439510403, 0.997416876336296, 0.9978659726302849, 0.9945138640986516]

# Plotting
fig, axs = plt.subplots(3, figsize=(15, 15))

# MSE comparison
axs[0].bar(models, mse_values, color=['blue', 'orange', 'green', 'red', 'purple'])
axs[0].set_title('Mean Squared Error (MSE) Comparison')
axs[0].set_ylabel('MSE')

# RMSE comparison
axs[1].bar(models, rmse_values, color=['blue', 'orange', 'green', 'red', 'purple'])
axs[1].set_title('Root Mean Squared Error (RMSE) Comparison')
axs[1].set_ylabel('RMSE')

# R-squared comparison
axs[2].bar(models, r2_values, color=['blue', 'orange', 'green', 'red', 'purple'])
axs[2].set_title('R-squared (R2) Comparison')
axs[2].set_ylabel('R-squared')

plt.tight_layout()
plt.show()
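"""Because the MSE values differ by two orders of magnitude, the bar charts can be hard to read. As a complement, the sketch below ranks the models in a small table. The figures are the ones reported in this notebook, rounded for display; the frame and column names are illustrative.

```python
import pandas as pd

# Values copied from the evaluation results in this notebook (rounded)
results = pd.DataFrame({
    'model': ['Linear Regression', 'Neural Network Regression', 'Decision Tree',
              'Random Forest Regressor', 'XGBoost Regression'],
    'rmse': [15218040.63, 14927682.56, 2532177.60, 2301556.05, 3690241.97],
    'r2': [0.9067, 0.9102, 0.9974, 0.9979, 0.9945],
})

# Lower RMSE is better, so sort in ascending order
ranked = results.sort_values('rmse').reset_index(drop=True)
print(ranked)
```

RMSE is in the same units as the target, which makes it easier to interpret than MSE when comparing models.
"""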

"""**Linear Regression Model Evaluation:**

Mean Squared Error (MSE): 231588760521001.2

Root Mean Squared Error (RMSE): 15218040.626867875

R-squared (R2): 0.9067015665821766


**Neural Network Model Evaluation:**

Mean Squared Error (MSE): 222835706743011.4

Root Mean Squared Error (RMSE): 14927682.564383911

R-squared (R2): 0.9102278439510403


**Decision Tree Model Evaluation:**

Mean Squared Error (MSE): 6411923390724.246

Root Mean Squared Error (RMSE): 2532177.5985748405

R-squared (R2): 0.997416876336296


**Random Forest Regressor Model Evaluation:**

Mean Squared Error (MSE): 5297160256238.563

Root Mean Squared Error (RMSE): 2301556.051074699

R-squared (R2): 0.9978659726302849


**XGBoost Regression Model Evaluation:**

Mean Squared Error (MSE): 13617885819724.05

Root Mean Squared Error (RMSE): 3690241.9730586843

R-squared (R2): 0.9945138640986516


**Best Model**

Random Forest Regressor

Mean Squared Error (MSE): 5297160256238.563

Root Mean Squared Error (RMSE): 2301556.051074699

R-squared (R2): 0.9978659726302849

# **9. Conclusion**

Based on the evaluation of the different models, several key findings can be concluded:

1. **Linear Regression:** The linear regression model performed worst among the models evaluated, with the highest Mean Squared Error (MSE) of approximately 2.32 x 10^14 and the lowest R-squared (R2) value of 0.907. This suggests that a linear fit on PC1 did not fully capture the relationships in the data.

2. **Neural Network Regression:** The neural network model showed slight improvement over the linear regression model, with a lower MSE of approximately 2.23 x 10^14 and a higher R-squared value of 0.910. However, it still exhibited a high MSE, suggesting room for further enhancement.

3. **Decision Tree:** The decision tree model demonstrated significantly lower MSE compared to linear regression and neural network models, with a value of approximately 6.41 x 10^12. It also exhibited a very high R-squared value of 0.997, indicating a strong fit to the data. However, decision trees can be prone to overfitting.

4. **Random Forest Regressor:** The Random Forest regressor outperformed all other models, with the lowest MSE of approximately 5.30 x 10^12 and the highest R-squared value of 0.998. This suggests that the Random Forest model provided the most accurate predictions and best explained the variance in the target variable.

5. **XGBoost Regression:** The XGBoost regression model also performed well, with a relatively low MSE of approximately 1.36 x 10^13 and a high R-squared value of 0.995. While not as high as the Random Forest, it still demonstrated strong predictive performance.

In conclusion, the Random Forest Regressor model is recommended for this project, as it exhibited the best performance in terms of predictive accuracy and model fit. It provided the lowest MSE and highest R-squared value among all models, indicating superior predictive capability. However, depending on specific project requirements, the XGBoost Regression model could also be considered as it demonstrated strong performance as well. The decision tree model, while showing promise, might require additional regularization techniques to mitigate overfitting. The neural network and linear regression models did not perform as well and are less suitable for this dataset.
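
The overfitting concern for the decision tree can be examined with cross-validation. The following is a minimal sketch on synthetic data, assuming `max_depth` and `min_samples_leaf` as the regularization settings; the data, the chosen values, and the dictionary `cv_r2` are illustrative and were not part of our experiments.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic one-feature regression problem standing in for PC1 vs. revenue
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * rng.normal(size=300)

candidates = {
    'unconstrained': DecisionTreeRegressor(random_state=42),
    'regularized': DecisionTreeRegressor(max_depth=4, min_samples_leaf=10,
                                         random_state=42),
}

# Mean R-squared over 5 folds for each tree
cv_r2 = {}
for name, tree in candidates.items():
    scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='r2')
    cv_r2[name] = float(scores.mean())
    print(name, round(cv_r2[name], 3))
```

Averaging over several folds also gives a steadier basis for comparing models than a single 80/20 split.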
"""