-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathpdp.Rmd
131 lines (101 loc) · 4.21 KB
/
pdp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
title: "PDP"
author: "William Mburu"
subtitle: "partial dependence plots"
date: "© 2022 "
output:
rmdformats::downcute:
downcute_theme: "chaos"
html_document:
fig_caption: yes
number_sections: yes
self_contained: yes
toc: true
toc_depth: 3
toc_float: true
code_folding: hide
---
<br>
## Analysis in Python
### Loading packages
```{python}
import pandas as pd
import sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```
### Loading data and doing some cleaning
```{python}
# Data manipulation code below here
data = pd.read_csv('train.csv', nrows=50000)
# Remove data with extreme outlier coordinates or negative fares
data = data.query('pickup_latitude > 40.7 and pickup_latitude < 40.8 and ' +
'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and ' +
'pickup_longitude > -74 and pickup_longitude < -73.9 and ' +
'dropoff_longitude > -74 and dropoff_longitude < -73.9 and ' +
'fare_amount > 0'
)
y = data.fare_amount
base_features = ['pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude']
X = data[base_features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
first_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(train_X, train_y)
print("Data sample:")
data.head()
```
```{python}
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots
feat_name = 'pickup_longitude'
pdp_dist = pdp.pdp_isolate(model=first_model, dataset=val_X, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
```
We see a U-shaped kind of a plot which would suggest that being picked up near the center of the longitude values lowers predicted fares on average, because it means shorter trips (on average).
```{python}
for feat_name in base_features:
pdp_dist = pdp.pdp_isolate(model = first_model, dataset=val_X, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
```
## Creating a 2D plot for the features pickup_longitude and dropoff_longitude
```{python}
fnames = ['pickup_longitude', 'dropoff_longitude']
longitudes_partial_plot = pdp.pdp_interact(model=first_model, dataset=val_X,
model_features=base_features, features=fnames)
pdp.pdp_interact_plot(pdp_interact_out=longitudes_partial_plot,
feature_names=fnames, plot_type='contour')
plt.show()
```
We expect the contours to run along the diagonals. We see that prices increase
as we move further up to the upper right side of the plot.
### Direct measures of distances using the absolute distances.
```{python}
# This is the PDP for pickup_longitude without the absolute difference features. Included here to help compare it to the new PDP you create
feat_name = 'pickup_longitude'
pdp_dist_original = pdp.pdp_isolate(model=first_model, dataset=val_X, model_features=base_features, feature=feat_name)
pdp.pdp_plot(pdp_dist_original, feat_name)
plt.show()
# create new features
data['abs_lon_change'] = abs(data.dropoff_longitude - data.pickup_longitude)
data['abs_lat_change'] = abs(data.dropoff_latitude - data.pickup_latitude)
features_2 = ['pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude',
'abs_lat_change',
'abs_lon_change']
X = data[features_2]
new_train_X, new_val_X, new_train_y, new_val_y = train_test_split(X, y, random_state=1)
second_model = RandomForestRegressor(n_estimators=30, random_state=1).fit(new_train_X, new_train_y)
feat_name = 'pickup_longitude'
pdp_dist = pdp.pdp_isolate(model=second_model, dataset=new_val_X, model_features=features_2, feature=feat_name)
pdp.pdp_plot(pdp_dist, feat_name)
plt.show()
```
We see that controlling for absolute distance traveled, the pick up longitude
has a very small impact on predictions.