Generalized Additive Models in Python.
<img src=imgs/pygam_tensor.png>
## Documentation
[pyGAM: Getting started with Generalized Additive Models in Python](https://medium.com/@jpoberhauser/pygam-getting-started-with-generalized-additive-models-in-python-457df5b4705f)
## Installation
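pyGAM is on PyPI and can be installed with pip:

```
pip install pygam
```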
GAMs extend generalized linear models by allowing non-linear functions of features while maintaining additivity.
The result is a very flexible model, where it is easy to incorporate prior knowledge and control overfitting.

## Regression

For **regression** problems, we can use a **linear GAM**, which models:

![alt tag](http://latex.codecogs.com/svg.latex?\mathbb{E}[y|X]=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p))

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM, s, f
from pygam.datasets import wage

X, y = wage(return_X_y=True)

gam = LinearGAM(s(0) + s(1) + f(2)).gridsearch(X, y)

fig, axs = plt.subplots(1, 3)
titles = ['year', 'age', 'education']

for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, width=.95)

    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi, c='r', ls='--')
    ax.set_title(titles[i])
```
<img src=imgs/pygam_wage_data_linear.png>

Even though we allowed **n_splines=20** per numerical feature, our **smoothing penalty** reduces us to just 19 **effective degrees of freedom**:

```
gam.summary()

LinearGAM
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                     19.2602
Link Function:                     IdentityLink Log Likelihood:                                -24116.7451
Number of Samples:                         3000 AIC:                                            48274.0107
                                                AICc:                                           48274.2999
                                                GCV:                                             1250.3656
                                                Scale:                                           1235.9245
                                                Pseudo R-Squared:                                   0.2945
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code
================================= ==================== ============ ============ ============ ============
s(0)                              [15.8489]            20           6.9          5.52e-03     **
s(1)                              [15.8489]            20           8.5          1.11e-16     ***
f(2)                              [15.8489]            5            3.8          1.11e-16     ***
intercept                         0                    1            0.0          1.11e-16     ***
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
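By default each `s()` term uses `n_splines=20` basis functions, and this can be set explicitly per term. A minimal sketch on the same `wage` data (`n_splines` is a real `s()` keyword; the `statistics_['edof']` lookup assumes pyGAM's fitted `statistics_` dict exposes the EDoF under `'edof'`):

```python
from pygam import LinearGAM, s, f
from pygam.datasets import wage

X, y = wage(return_X_y=True)

# the same model with the spline basis size spelled out explicitly
gam = LinearGAM(s(0, n_splines=20) + s(1, n_splines=20) + f(2)).gridsearch(X, y)
print(gam.statistics_['edof'])  # effective degrees of freedom after penalization
```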

With **LinearGAMs**, we can also check the **prediction intervals**:

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM
from pygam.datasets import mcycle

X, y = mcycle(return_X_y=True)

gam = LinearGAM().gridsearch(X, y)
XX = gam.generate_X_grid(term=0)

plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')

plt.scatter(X, y, facecolor='gray', edgecolors='none')
plt.title('95% prediction interval')
```
<img src=imgs/pygam_mcycle_data_linear.png>

And simulate from the posterior:

```python
# continuing the last example with the mcycle dataset
for response in gam.sample(X, y, quantity='y', n_draws=50, sample_at_X=XX):
    plt.scatter(XX, response, alpha=.03, color='k')
plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')
plt.title('draw samples from the posterior of the coefficients')
```

<img src=imgs/pygam_mcycle_data_linear_sample_from_posterior.png>

## Classification

For **binary classification** problems, we can use a **logistic GAM** which models:

![alt tag](http://latex.codecogs.com/svg.latex?log\left(\frac{P(y=1|X)}{P(y=0|X)}\right)=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p))

```python
import matplotlib.pyplot as plt

from pygam import LogisticGAM, s, f
from pygam.datasets import default

X, y = default(return_X_y=True)

gam = LogisticGAM(f(0) + s(1) + s(2)).gridsearch(X, y)

fig, axs = plt.subplots(1, 3)
titles = ['student', 'balance', 'income']

for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, width=.95)

    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi, c='r', ls='--')
    ax.set_title(titles[i])

# and check the accuracy
gam.accuracy(X, y)
```
<img src=imgs/pygam_default_data_logistic.png>

Since the **scale** of the **Binomial distribution** is known, our gridsearch minimizes the **Un-Biased Risk Estimator** (UBRE) objective:

```
gam.summary()

LogisticGAM
=============================================== ==========================================================
Distribution:                      BinomialDist Effective DoF:                                      3.8047
Link Function:                        LogitLink Log Likelihood:                                   -788.877
Number of Samples:                        10000 AIC:                                             1585.3634
                                                AICc:                                             1585.369
                                                UBRE:                                               2.1588
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.4598
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code
================================= ==================== ============ ============ ============ ============
f(0)                              [1000.]              2            1.7          4.61e-03     **
s(1)                              [1000.]              20           1.2          0.00e+00     ***
s(2)                              [1000.]              20           0.8          3.29e-02     *
intercept                         0                    1            0.0          0.00e+00     ***
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

## Poisson and Histogram Smoothing

We can intuitively perform **histogram smoothing** by modeling the counts in each bin as being distributed Poisson via **PoissonGAM**.

```python
import matplotlib.pyplot as plt

from pygam import PoissonGAM
from pygam.datasets import faithful

X, y = faithful(return_X_y=True)

gam = PoissonGAM().gridsearch(X, y)

plt.hist(faithful(return_X_y=False)['eruptions'], bins=200, color='k')
plt.plot(X, gam.predict(X), color='r')
plt.title('Best Lambda: {0:.2f}'.format(gam.lam[0][0]))
```
<img src=imgs/pygam_poisson.png>
232
-
233
- ## Terms and Interactions
234
-
235
- pyGAM can also fit interactions using tensor products via ` te() `
236
- ``` python
237
- from pygam import LinearGAM, s, te
238
- from pygam.datasets import chicago
239
-
240
- X, y = chicago(return_X_y = True )
241
-
242
- gam = PoissonGAM(s(0 , n_splines = 200 ) + te(3 , 1 ) + s(2 )).fit(X, y)
243
- ```
244
-
245
- and plot a 3D surface:
246
-
247
- ``` python
248
- XX = gam.generate_X_grid(term = 1 , meshgrid = True )
249
- Z = gam.partial_dependence(term = 1 , X = XX , meshgrid = True )
250
-
251
- from mpl_toolkits import mplot3d
252
- ax = plt.axes(projection = ' 3d' )
253
- ax.plot_surface(XX [0 ], XX [1 ], Z, cmap = ' viridis' )
254
- ```
255
-
256
- <img src=imgs/pygam_chicago_tensor.png>
257
-
258
- For simple interactions it is sometimes useful to add a by-variable to a term
259
-
260
- ``` python
261
- from pygam import LinearGAM, s
262
- from pygam.datasets import toy_interaction
263
-
264
- X, y = toy_interaction(return_X_y = True )
265
-
266
- gam = LinearGAM(s(0 , by = 1 )).fit(X, y)
267
- gam.summary()
268
- ```

#### Available Terms
- `l()` linear terms
- `s()` spline terms
- `f()` factor terms
- `te()` tensor products
- `intercept`
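These term types combine with `+` into a model specification. A minimal sketch, assuming a hypothetical feature layout (column 0 linear, column 1 smooth, column 2 categorical):

```python
from pygam import LinearGAM, l, s, f, te

# hypothetical layout: X[:, 0] enters linearly, X[:, 1] as a spline,
# X[:, 2] as a factor; te(0, 1) adds a tensor-product interaction
gam = LinearGAM(l(0) + s(1) + f(2) + te(0, 1))
```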

## Custom Models

It's also easy to build custom models by using the base **GAM** class and specifying the **distribution** and the **link function**:

```python
import matplotlib.pyplot as plt

from pygam import GAM
from pygam.datasets import trees

X, y = trees(return_X_y=True)

gam = GAM(distribution='gamma', link='log')
gam.gridsearch(X, y)

plt.scatter(y, gam.predict(X))
plt.xlabel('true volume')
plt.ylabel('predicted volume')
```
<img src=imgs/pygam_custom.png>
294
-
295
- We can check the quality of the fit by looking at the ` Pseudo R-Squared ` :
296
-
297
- ```
298
- gam.summary()
299
-
300
- GAM
301
- =============================================== ==========================================================
302
- Distribution: GammaDist Effective DoF: 25.3616
303
- Link Function: LogLink Log Likelihood: -26.1673
304
- Number of Samples: 31 AIC: 105.0579
305
- AICc: 501.5549
306
- GCV: 0.0088
307
- Scale: 0.001
308
- Pseudo R-Squared: 0.9993
309
- ==========================================================================================================
310
- Feature Function Lambda Rank EDoF P > x Sig. Code
311
- ================================= ==================== ============ ============ ============ ============
312
- s(0) [0.001] 20 2.04e-08 ***
313
- s(1) [0.001] 20 7.36e-06 ***
314
- intercept 0 1 4.39e-13 ***
315
- ==========================================================================================================
316
- Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
317
- ```
318
-
319
- ## Penalties / Constraints
320
- With GAMs we can encode ** prior knowledge** and ** control overfitting** by using penalties and constraints.
321
-
322
- #### Available penalties:
323
- - second derivative smoothing (default on numerical features)
324
- - L2 smoothing (default on categorical features)
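Each term's penalty strength is controlled by its `lam` parameter, which can be set directly or grid-searched. A minimal sketch reusing the `mcycle` data from above (`lam` and the `gridsearch` keyword grid are pyGAM's real API; the grid bounds are arbitrary):

```python
import numpy as np

from pygam import LinearGAM, s
from pygam.datasets import mcycle

X, y = mcycle(return_X_y=True)

# fix the smoothing penalty by hand...
gam = LinearGAM(s(0, lam=0.6)).fit(X, y)

# ...or let gridsearch choose it from a grid of candidate strengths
gam = LinearGAM(s(0)).gridsearch(X, y, lam=np.logspace(-3, 3, 11))
```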

#### Available constraints:
- monotonic increasing/decreasing smoothing
- convex/concave smoothing
- periodic smoothing [soon...]

We can inject our intuition into our model by using **monotonic** and **concave** constraints:

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM, s
from pygam.datasets import hepatitis

X, y = hepatitis(return_X_y=True)

gam1 = LinearGAM(s(0, constraints='monotonic_inc')).fit(X, y)
gam2 = LinearGAM(s(0, constraints='concave')).fit(X, y)

fig, ax = plt.subplots(1, 2)
ax[0].plot(X, y, label='data')
ax[0].plot(X, gam1.predict(X), label='monotonic fit')
ax[0].legend()

ax[1].plot(X, y, label='data')
ax[1].plot(X, gam2.predict(X), label='concave fit')
ax[1].legend()
```
<img src=imgs/pygam_constraints.png>

## API

pyGAM is intuitive, modular, and adheres to a familiar API:

```python
from pygam import LogisticGAM, s, f
from pygam.datasets import toy_classification

X, y = toy_classification(return_X_y=True)

gam = LogisticGAM(s(0) + s(1) + s(2) + s(3) + s(4) + f(5))
gam.fit(X, y)
```

Since GAMs are additive, it is also super easy to visualize each individual **feature function**, `f_i(X_i)`. These feature functions describe the effect of each `X_i` on `y` individually while marginalizing out all other predictors:

```python
import matplotlib.pyplot as plt

plt.figure()
for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    plt.plot(gam.partial_dependence(term=i))
```
<img src=imgs/pygam_multi_pdep.png>

## Current Features

### Models

pyGAM comes with many models out-of-the-box:

- GAM (base class for constructing custom models)
- LinearGAM
- LogisticGAM
- GammaGAM
- PoissonGAM
- InvGaussGAM
- ExpectileGAM

You can mix and match distributions with link functions to create custom models!

```python
gam = GAM(distribution='gamma', link='inverse')
```

### Distributions

- Normal
- Binomial
- Gamma
- Poisson
- Inverse Gaussian

### Link Functions

Link functions map the distribution mean to the linear predictor.
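Concretely, for a link function g the models in the sections above take the general form (restating the document's regression and classification formulas in one place):

![alt tag](http://latex.codecogs.com/svg.latex?g(\mathbb{E}[y|X])=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p))

The identity link recovers the **LinearGAM** formula, and the logit link the **LogisticGAM** one.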

These are the canonical link functions for the above distributions:

- Identity
- Logit
- Inverse
- Log
- Inverse-squared

### Callbacks

Callbacks are performed during each optimization iteration. It's also easy to write your own.

- deviance - model deviance
- diffs - differences of coefficient norm
- accuracy - model accuracy for LogisticGAM
- coef - coefficient logging
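Callbacks are requested by name at construction. A minimal sketch, assuming the `callbacks` constructor argument of **LogisticGAM** accepts the names listed above:

```python
from pygam import LogisticGAM
from pygam.datasets import toy_classification

X, y = toy_classification(return_X_y=True)

# log deviance, coefficient diffs, and accuracy at each iteration
gam = LogisticGAM(callbacks=['deviance', 'diffs', 'accuracy'])
gam.fit(X, y)
```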

You can check a callback by inspecting:

```python
plt.plot(gam.logs_['deviance'])
```
<img src=imgs/pygam_multi_deviance.png>

### Linear Extrapolation

Spline terms extrapolate linearly outside the range of the training data, as shown on the mcycle data below.

<img src=imgs/pygam_mcycle_data_extrapolation.png>
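A minimal sketch of reproducing this kind of plot, reusing the `mcycle` model from the prediction-interval example (the extended grid bounds are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

from pygam import LinearGAM
from pygam.datasets import mcycle

X, y = mcycle(return_X_y=True)
gam = LinearGAM().gridsearch(X, y)

# evaluate the fit on a grid extending past the observed data
XX = np.linspace(X.min() - 10, X.max() + 10, 500).reshape(-1, 1)
plt.plot(XX, gam.predict(XX), 'r--')
plt.scatter(X, y, facecolor='gray', edgecolors='none')
plt.title('extrapolation beyond the training range')
```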
## Citing pyGAM
Please consider citing pyGAM if it has helped you in your research or work: