# -*- coding: utf-8 -*-
"""Electricity Revenues Predictor.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1_xVQPAmAPG0M0y_8899Nsq38yl_-avH5
# **Group Description**
**Group No: 25 Based on Spreadsheet**
**Group Name: ABRACADABRA**
**Team Member Details~**
1. Saad Ahmed Pathan (22114077)
2. Samio Ayman (22082403)
3. Nur Shaheila Ashriza Binti Mohd Saupi (22001745)
4. Nurina Humaira Binti Mohd Romzan (22002204)
5. Nur Aina Batrisyia Binti Zakaria (23005013)
6. Siti Hajar Binti Mohd Nor Azman (22002035)
# **1. Project Designing**
Electricity is essential for economic and social development, enabling nations to achieve higher living standards.
In today's world, effective planning and operation of electricity production, revenue generation from production, and energy consumption are imperative. Understanding how energy generates revenue and is utilized by consumers is crucial for better management. This presents an opportunity to develop a supervised machine learning model to forecast future electricity revenues.
1. Initial Phase: We brainstormed the problem and potential approaches to solve it using machine learning concepts. Then, we designed the workflow of our project.
2. Data Mining: We extracted a dataset from Data.gov, covering data from 2015 to 2022. The dataset includes revenue, units sold, and the average number of customers, categorized by customer class for each electric utility operating in Iowa, USA.
3. Data Preprocessing: We examined the data, identified null values in the dataset, and obtained a detailed description of each feature's characteristics.
4. Feature Discussion: We discussed and renamed features for better readability and understanding, facilitating a smoother data environment.
5. Exploratory Data Analysis (EDA) and Visualization: EDA and visualization provided concise knowledge of the link between features and the label (the dependent variable). The heatmap was used to understand the association between independent variables, helping to choose important features. Selecting the right elements to improve accuracy was challenging.
6. Feature Selection: We used PCA to reduce the feature set, ultimately choosing the first principal component (PC1) as the single input feature for our project.
7. Model Training and Assessment: We employed Linear Regression, Random Forest Regression, Neural Network Regression, Decision Tree, and XGBoost techniques. After comparing numerous metrics, we determined that the Random Forest Regressor produced the best results.
8. Model Explainability: We used a bar chart to compare the performance of all five models, assisting in selecting the best one. The Random Forest Regressor emerged as the best model for our dataset.
9. Conclusion: We summarized our project, from model selection and evaluation to finding the most suitable model for our dataset. We also highlighted key findings from each model with their respective values.
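Steps 6 and 7 above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the column names `customers`, `mwh_sold`, and `revenue` and all the numbers are hypothetical placeholders, and PCA is fitted here on the input columns only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the utility data (hypothetical columns and values)
rng = np.random.default_rng(42)
customers = rng.uniform(100, 10_000, size=200)
mwh_sold = customers * rng.uniform(8, 12, size=200)
revenue = mwh_sold * 100 + rng.normal(0, 5_000, size=200)
demo = pd.DataFrame({"customers": customers, "mwh_sold": mwh_sold, "revenue": revenue})

# Standardize the input features, then keep the first principal component
features = demo[["customers", "mwh_sold"]]
scaled = StandardScaler().fit_transform(features)
pc1 = PCA(n_components=1).fit_transform(scaled)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    pc1, demo["revenue"], test_size=0.2, random_state=42
)

# Fit a random forest on PC1 and score it on the held-out rows
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
demo_r2 = r2_score(y_test, forest.predict(X_test))
print(f"R-squared on synthetic test data: {demo_r2:.3f}")
```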
**Problem Statement**
The goal is to develop a machine learning model capable of accurately forecasting electricity revenues based on the provided features. This model is valuable for utility companies, energy firms, and policymakers who need to optimize electricity consumption, reduce costs, and minimize the environmental impact of energy usage.
Specifically, the model should reliably predict electricity revenues by considering various factors influencing energy consumption, such as consumer types and the number of consumers. This can help utility companies, building managers, and energy firms identify patterns and trends in energy consumption, enabling them to make informed energy decisions. Policymakers can also use this data to create regulations and incentives that promote energy efficiency and sustainability.
# **2. Data Mining**
The dataset used for this project is acquired from the website Data.gov. Data.gov is a comprehensive and open data portal maintained by the United States government. It serves as a centralized repository for accessing a wide range of government datasets, providing the public, researchers, and developers with valuable information for analysis, innovation, and transparency.
The dataset titled **"Electric Utilities Revenue, Units Sold, and Customers by Year"** covers data from 2015 to 2022, detailing the revenue, units sold, and average number of customers categorized by customer class for each electric utility operating in the state of Iowa, USA. This publicly accessible dataset aims to provide insights into the performance and customer base of electric utilities in Iowa. However, no specific license information is provided for this dataset.
**Columns Description**
1. Reporting Year
2. Company Number & Year
3. Type of Utility
4. Utility
5. Operating Revenues - Residential Sales
6. Operating Revenues - Commercial & Industrial Sales
7. Operating Revenues - Sales for Resale
8. Operating Revenues - All Other Sales
9. MWh Sold - Residential
10. MWh Sold - Commercial & Industrial
11. MWh Sold - Sales for Resale
12. MWh Sold - All Other
13. Average No. of Customers - Residential
14. Average No. of Customers - Commercial & Industrial
15. Average No. of Customers - Sales for Resale
16. Average No. of Customers - All Other
**Dataset Source Link**
https://catalog.data.gov/dataset/electric-utilities-revenue-units-sold-and-customers-by-year
# **3. Data Preprocessing**
"""
# Enable line wrapping in Google Colaboratory output cells
from IPython.display import HTML, display
def set_css():
    display(HTML('''
    <style>
      pre {
        white-space: pre-wrap;
      }
    </style>
    '''))
get_ipython().events.register('pre_run_cell', set_css)
"""Check for missing values, outliers, and inconsistencies in the dataset and handle them appropriately. Missing values can be imputed or dropped based on the extent of missingness and their impact on the analysis."""
# Commented out IPython magic to ensure Python compatibility.
# Import Libraries for analysis and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# %matplotlib inline
# To import datetime library
from datetime import datetime
import datetime as dt
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')
# Import necessary statistical libraries
import scipy.stats as stats
import statsmodels.api as sm
from scipy.stats import norm
# Import libraries for ML-Model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV
# Library for saving the model
import pickle
# Mount Google Drive to access the dataset
from google.colab import drive
drive.mount('/content/drive')
# Load the dataset
file_path = '/content/drive/MyDrive/Machine Learning Project/electricity_consumption_data.csv'
df = pd.read_csv(file_path)
# Display the shape of the data
df.shape
# Display the first few rows to understand the data
print(df.head())
df.head(5)
df.iloc[745 : 751]
df.tail(5)
df.info()
# Determine the datatype of Each Column
df.dtypes
# Get a statistical summary to check for outliers
print(df.describe())
# Count the duplicate rows in the dataframe
dup_Count = len(df)-len(df.drop_duplicates())
# There are no duplicate values in the dataframe
dup_count1 = df[df.duplicated()].shape
dup_count1
# Find the missing values of each column
null_values = df.isnull().sum()
# Visualizing the missing values
plt.figure(figsize=(10,10))
sns.displot(
    data=df.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25
)
plt.savefig("visualizing_missing_data_with_barplot_Seaborn_distplot.png", dpi=100)
# Remove all rows with missing data
data = df.dropna()
data.isna().sum()
"""# **4. Variable Description**
RY - Reporting Year
ToU - Type of Utility
U - Utility
ORoRS - Operating Revenues of Residential Sales
ORoCIS - Operating Revenues of Commercial & Industrial Sales
ORoSR - Operating Revenues of Sales for Resale
ORoAOS - Operating Revenues of All Other Sales
ASforR - Amount Sold for Residential in MWh
ASforCI - Amount Sold for Commercial & Industrial in MWh
ASforSR - Amount Sold for Sales for Resale in MWh
ASforAO - Amount Sold for All Other in MWh
ANoCR - Average No. of Customers in Residential
ANoCCI - Average No. of Customers in Commercial & Industrial
ANoCSR - Average No. of Customers in Sales for Resale
ANoCAO - Average No. of Customers in All Other
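The abbreviations above can be mapped to more readable names with `DataFrame.rename`. The sketch below uses a tiny made-up frame with three of the abbreviated columns; the readable names follow the ones used later in this notebook, and the values are placeholders rather than real data.

```python
import pandas as pd

# Tiny illustrative frame; the values are placeholders, not real data
sample = pd.DataFrame({
    "RY": [2015, 2016],
    "ToU": ["type_a", "type_b"],
    "ORoRS": [1000.0, 1200.0],
})

# Map abbreviations to readable names (subset shown for brevity)
readable_names = {
    "RY": "reporting_year",
    "ToU": "utility_type",
    "ORoRS": "residential_revenues",
}
renamed = sample.rename(columns=readable_names)
print(renamed.columns.tolist())
```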
"""
# Show all columns
df.columns
df_energy = df.copy()
# Convert to DataFrame
df_energy = pd.DataFrame(data)
# Apply One-Hot Encoding
df_energy = pd.get_dummies(df_energy, columns=['ToU', 'U'])
print("DataFrame after One-Hot Encoding:")
print(df_energy)
# # Rename the columns
# df_rename = df.copy()
# df_rename.rename(columns={'RY': 'reporting_year', 'ToU':'utility_type', 'U':'utility', 'ORoRS ': 'residential_revenues', 'ORoCIS':'commercial_revenues',
# 'ORoSR':'resale_revenues', 'ORoAOS':'other_revenues', 'ASforR ':'residential_sales', 'ASforCI':'commercial_sales', 'ASforSR':'resale_sales', 'ASforAO':'other_sales'
# ,'ANoCR':'residential_customers', 'ANoCCI':'commercial_customers', 'ANoCSR':'resale_customers', 'ANoCAO':'other_customers'},inplace = True)
df_rename = df.copy()
df_rename.rename(columns={
    'Reporting Year': 'reporting_year',
    'Company Number & Year': 'company_number_year',
    'Type of Utility': 'utility_type',
    'Utility': 'utility',
    'Operating Revenues - Residential Sales': 'residential_revenues',
    'Operating Revenues - Commercial & Industrial Sales': 'commercial_revenues',
    'Operating Revenues - Sales for Resale': 'resale_revenues',
    'Operating Revenues - All Other Sales ': 'other_revenues',
    'MWh Sold - Residential': 'residential_sales',
    'MWh Sold - Commercial & Industrial': 'commercial_sales',
    'MWh Sold - Sales for Resale': 'resale_sales',
    'MWh Sold - All Other': 'other_sales',
    'Average No. of Customers - Residential': 'residential_customers',
    'Average No. of Customers - Commercial & Industrial': 'commercial_customers',
    'Average No. of Customers - Sales for Resale': 'resale_customers',
    'Average No. of Customers - All Other': 'other_customers'
}, inplace=True)
print(df_rename.columns)
# df_rename.columns
df_energy.columns
# Check unique values for each variable
def get_unique_values(df1):
    unique_values = df1.apply(pd.Series.unique)
    return unique_values
unq_values = get_unique_values(df)
for i in df.columns.tolist():
    print("No. of unique values in ", i, "is", df[i].nunique())
# Separate columns in list for better analysis
gen_cols=['reporting_year', 'utility_type', 'utility']
rev_cols=['residential_revenues', 'commercial_revenues', 'resale_revenues', 'other_revenues']
sal_cols=['residential_sales', 'commercial_sales', 'resale_sales', 'other_sales']
cus_cols=['residential_customers', 'commercial_customers', 'resale_customers','other_customers']
"""# **5. Data Visualization**"""
# Chart - 01 visualization
# Dependent variable "ORoCIS - commercial_revenues"
plt.figure(figsize=(5,5))
sns.distplot(df_energy['ORoCIS'], color = 'Blue')
# Chart - 02 visualization
# Dependent variable "ASforR - residential_sales"
plt.figure(figsize=(5,5))
sns.distplot(df_energy['ASforR'], color = 'Blue')
# Chart - 03 visualization
# Dependent variable "ANoCCI - commercial_customers"
plt.figure(figsize=(5,5))
sns.distplot(df_energy['ANoCCI'], color = 'Blue')
# Display the heatmap
data['ToU'] = data['ToU'].astype('category').cat.codes
data['U'] = data['U'].astype('category').cat.codes
correlation_matrix = data.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
# Handling outliers & outlier treatments
df = df_energy.copy()
col_list = list(df.describe().columns)
# Find the outliers using boxplot
plt.figure(figsize=(25, 20))
plt.suptitle("Box Plot", fontsize=18, y=0.95)
for n, ticker in enumerate(col_list):
    ax = plt.subplot(8, 4, n + 1)
    plt.subplots_adjust(hspace=0.5, wspace=0.2)
    sns.boxplot(x=df[ticker],color='pink', ax = ax)
    ax.set_title(ticker.upper())
"""# **6. Feature Selection**"""
# Feature Selection using PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
df_energy = pd.DataFrame(data)
df_energy = pd.get_dummies(df_energy, columns=['ToU', 'U'])
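# Standardize every column before PCA
# (note: df_energy still includes the target column 'ORoRS' at this point)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_energy)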
# Set the number of principal components
pca = PCA(n_components=5)
principal_components = pca.fit_transform(scaled_features)
pca_df = pd.DataFrame(data=principal_components, columns=[f'PC{i+1}' for i in range(principal_components.shape[1])])
print("PCA result:")
print(pca_df)
print("Explained variance ratio by each principal component:")
print(pca.explained_variance_ratio_)
import pandas as pd
# Get the explained variance ratio of each principal component
explained_variance_ratio = pca.explained_variance_ratio_
# Create a DataFrame to store the results
pca_results = pd.DataFrame({'Principal Component': [f'PC{i+1}' for i in range(len(explained_variance_ratio))],
'Explained Variance Ratio': explained_variance_ratio})
# Print the results
print(pca_results)
# Get the loadings of the principal components
df_raw = pd.read_csv('/content/drive/MyDrive/Machine Learning Project/electricity_consumption_data.csv')
pca = PCA(n_components=5)
pca.fit(df)
loadings = pca.components_
# Create a DataFrame to store the loadings
loadings_df = pd.DataFrame(data=loadings, columns=df.columns)
# Print the loadings
print(loadings_df)
# Each row of loadings_df is one component, so row 0 holds the PC1 loadings
most_important_feature_pc1 = loadings_df.iloc[0].abs().idxmax()
print(most_important_feature_pc1)
# Select PC1 as the feature
X = pca_df[['PC1']]
# Assuming ORoRS as the dependent variable for regression
y = df_energy['ORoRS']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shapes of the resulting datasets
print("Shapes of the datasets:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}")
print(f"y_test: {y_test.shape}")
"""# **7. Model Selection**"""
from sklearn.linear_model import LinearRegression
# Assuming df_energy, pca_df, X, y, X_train, X_test, y_train, and y_test are already defined
# Initialize and train the Linear Regression model
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)
# Predict on the test set
y_pred = linear_regressor.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Linear Regression Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")
# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('PC1')
plt.ylabel('ORoRS')
plt.title('Linear Regression: Actual vs Predicted')
plt.legend()
plt.show()
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Separate features and target
features = df_energy.drop(columns=['ORoRS'])
target = df_energy['ORoRS']
# Standardize the features and target separately
scaler_features = StandardScaler()
scaled_features = scaler_features.fit_transform(features)
scaler_target = StandardScaler()
scaled_target = scaler_target.fit_transform(target.values.reshape(-1, 1))
# Select PC1 as the feature
X = pca_df[['PC1']]
# Use the scaled target for regression
y = scaled_target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the neural network model
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),  # Input layer with the number of PCs as input shape
    layers.Dense(32, activation='relu'),  # Hidden layer with 32 neurons and ReLU activation
    layers.Dense(1)  # Output layer with a single neuron (for regression)
])
model.compile(optimizer='adam', loss='mean_squared_error')
# Train the model
history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test), verbose=0)
# Evaluate the model on the test data
test_loss = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss:.4f}")
# Make predictions on the test data
y_pred = model.predict(X_test)
# Inverse transform predictions and true values
y_pred_inv = scaler_target.inverse_transform(y_pred)
y_test_inv = scaler_target.inverse_transform(y_test)
# Evaluate the model
mse = mean_squared_error(y_test_inv, y_pred_inv)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_inv, y_pred_inv)
print("Neural Network Regression Model Evaluation:")
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'R-squared (R2): {r2}')
# Plot the actual data and model predictions
plt.figure(figsize=(10, 6))
plt.scatter(range(len(y_test_inv)), y_test_inv, label='Actual Data', color='blue')
plt.scatter(range(len(y_pred_inv)), y_pred_inv, label='Predicted Data', color='red')
plt.xlabel('Test sample index')
plt.ylabel('ORoRS')
plt.legend()
plt.title('Neural Network Regression: Actual vs Predicted')
plt.show()
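from sklearn.tree import DecisionTreeRegressor, plot_tree
# Rebuild the split on the unscaled target, since y was replaced by the
# standardized target in the neural network step above
X = pca_df[['PC1']]
y = df_energy['ORoRS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a decision tree regressor
regressor = DecisionTreeRegressor(random_state=42)
# Fit the regressor to the training data
regressor.fit(X_train, y_train)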
# Visualize the decision tree
fig, ax = plt.subplots(figsize=(15, 15))
plot_tree(regressor, max_depth=3, feature_names=['PC1'], class_names=['ORoRS'],
          filled=True, rounded=True, fontsize=10, label='all', ax=ax)
plt.tight_layout() # Adjust layout to prevent overlapping
plt.show()
# Make predictions on the testing data
y_pred = regressor.predict(X_test)
print("Decision Tree Model Evaluation:")
# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
# Calculate the Root Mean Squared Error
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")
# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared (R2): {r2}")
from sklearn.ensemble import RandomForestRegressor
# Select PC1 as the feature
X = pca_df[['PC1']]
# Assuming ORoRS as the dependent variable for regression
y = df_energy['ORoRS']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)
# Predict on the test set
y_pred = rf_regressor.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Random Forest Regressor Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")
# Plotting the results
import matplotlib.pyplot as plt
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('PC1')
plt.ylabel('ORoRS')
plt.title('Random Forest Regressor: Actual vs Predicted')
plt.legend()
plt.show()
import xgboost as xgb
# Assuming df_energy, pca_df, X, y, X_train, X_test, y_train, and y_test are already defined
# Initialize and train the XGBoost regression model
xgbr = xgb.XGBRegressor(verbosity=0)
xgbr.fit(X_train, y_train)
# Predictions on the test set
y_pred = xgbr.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("XGBoost Regression Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R2): {r2}")
# Plotting the results
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('PC1')
plt.ylabel('ORoRS')
plt.title('XGBoost Regression: Actual vs Predicted')
plt.legend()
plt.show()
"""# **8. Model Evaluation**
Based on the evaluation metrics for the five models above, the Random Forest Regressor performs best. It has the lowest Mean Squared Error (MSE) and the highest R-squared value of all the models, which indicates that it gives the most accurate predictions and explains the most variance in the target variable.
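For reference, the three metrics compared in this section can be computed directly from their definitions. The sketch below uses a small made-up set of actual and predicted values (not results from this project) and checks the hand-computed numbers against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: mean of squared residuals; RMSE: its square root (same units as the target)
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should match scikit-learn's implementations
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
print(mse_manual, rmse_manual, r2_manual)
```

Because RMSE is in the same units as the target, it is usually easier to interpret than MSE when the target is a revenue figure.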
"""
# Evaluation results for each model
models = ['Linear Regression', 'Neural Network Regression', 'Decision Tree', 'Random Forest Regressor', 'XGBoost Regression']
mse_values = [231588760521001.2, 222835706743011.4, 6411923390724.246, 5297160256238.563, 13617885819724.05]
rmse_values = [15218040.626867875, 14927682.564383911, 2532177.5985748405, 2301556.051074699, 3690241.9730586843]
r2_values = [0.9067015665821766, 0.9102278439510403, 0.997416876336296, 0.9978659726302849, 0.9945138640986516]
# Plotting
fig, axs = plt.subplots(3, figsize=(15, 15))
# MSE comparison
axs[0].bar(models, mse_values, color=['blue', 'orange', 'green', 'red', 'purple'])
axs[0].set_title('Mean Squared Error (MSE) Comparison')
axs[0].set_ylabel('MSE')
# RMSE comparison
axs[1].bar(models, rmse_values, color=['blue', 'orange', 'green', 'red', 'purple'])
axs[1].set_title('Root Mean Squared Error (RMSE) Comparison')
axs[1].set_ylabel('RMSE')
# R-squared comparison
axs[2].bar(models, r2_values, color=['blue', 'orange', 'green', 'red', 'purple'])
axs[2].set_title('R-squared (R2) Comparison')
axs[2].set_ylabel('R-squared')
plt.tight_layout()
plt.show()
"""**Linear Regression Model Evaluation:**
Mean Squared Error (MSE): 231588760521001.2
Root Mean Squared Error (RMSE): 15218040.626867875
R-squared (R2): 0.9067015665821766
**Neural Network Model Evaluation:**
Mean Squared Error (MSE): 222835706743011.4
Root Mean Squared Error (RMSE): 14927682.564383911
R-squared (R2): 0.9102278439510403
**Decision Tree Model Evaluation:**
Mean Squared Error (MSE): 6411923390724.246
Root Mean Squared Error (RMSE): 2532177.5985748405
R-squared (R2): 0.997416876336296
**Random Forest Regressor Model Evaluation:**
Mean Squared Error (MSE): 5297160256238.563
Root Mean Squared Error (RMSE): 2301556.051074699
R-squared (R2): 0.9978659726302849
**XGBoost Regression Model Evaluation:**
Mean Squared Error (MSE): 13617885819724.05
Root Mean Squared Error (RMSE): 3690241.9730586843
R-squared (R2): 0.9945138640986516
**Best Model**
Random Forest Regressor
Mean Squared Error (MSE): 5297160256238.563
Root Mean Squared Error (RMSE): 2301556.051074699
R-squared (R2): 0.9978659726302849
# **9. Conclusion**
Based on the evaluation of the different models, several key findings can be concluded:
1. **Linear Regression:** The linear regression model performed the poorest among the models evaluated, with the highest Mean Squared Error (MSE) of approximately 2.32 x 10^14 and the lowest R-squared (R2) value of 0.907. This indicates that the linear model captured the relationships in the data less effectively than the other models.
2. **Neural Network Regression:** The neural network model showed slight improvement over the linear regression model, with a lower MSE of approximately 2.23 x 10^14 and a higher R-squared value of 0.910. However, it still exhibited a high MSE, suggesting room for further enhancement.
3. **Decision Tree:** The decision tree model demonstrated significantly lower MSE compared to linear regression and neural network models, with a value of approximately 6.41 x 10^12. It also exhibited a very high R-squared value of 0.997, indicating a strong fit to the data. However, decision trees can be prone to overfitting.
4. **Random Forest Regressor:** The Random Forest regressor outperformed all other models, with the lowest MSE of approximately 5.30 x 10^12 and the highest R-squared value of 0.998. This suggests that the Random Forest model provided the most accurate predictions and best explained the variance in the target variable.
5. **XGBoost Regression:** The XGBoost regression model also performed well, with a relatively low MSE of approximately 1.36 x 10^13 and a high R-squared value of 0.995. While not as high as the Random Forest, it still demonstrated strong predictive performance.
In conclusion, the Random Forest Regressor model is recommended for this project, as it exhibited the best performance in terms of predictive accuracy and model fit. It provided the lowest MSE and highest R-squared value among all models, indicating superior predictive capability. However, depending on specific project requirements, the XGBoost Regression model could also be considered as it demonstrated strong performance as well. The decision tree model, while showing promise, might require additional regularization techniques to mitigate overfitting. The neural network and linear regression models did not perform as well and are less suitable for this dataset.
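One way to probe the overfitting concern raised for the decision tree is to compare training and test scores for an unconstrained tree against a depth-limited one. The sketch below does this on synthetic data; the data, the `max_depth=3` setting, and the variable names are illustrative assumptions, not part of this project's experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem with noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=300)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unconstrained tree can memorize the training rows
full_tree = DecisionTreeRegressor(random_state=42).fit(Xd_train, yd_train)
# Limiting the depth is one simple form of regularization
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(Xd_train, yd_train)

for name, tree in [("unconstrained", full_tree), ("max_depth=3", shallow_tree)]:
    train_r2 = tree.score(Xd_train, yd_train)
    test_r2 = tree.score(Xd_test, yd_test)
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

A large gap between the training and test R2 of the unconstrained tree would point to overfitting; a smaller gap for the depth-limited tree would suggest the constraint helps.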
"""