
Commit 3ccdc86

simplify readme
1 parent 810b69a commit 3ccdc86


README.md

Lines changed: 1 addition & 357 deletions
@@ -11,7 +11,7 @@ Generalized Additive Models in Python.
<img src=imgs/pygam_tensor.png>

- ## Tutorial
+ ## Documentation

[pyGAM: Getting started with Generalized Additive Models in Python](https://medium.com/@jpoberhauser/pygam-getting-started-with-generalized-additive-models-in-python-457df5b4705f)

## Installation
@@ -72,362 +72,6 @@ GAMs extend generalized linear models by allowing non-linear functions of featur
The result is a very flexible model, where it is easy to incorporate prior knowledge and control overfitting.

## Regression
For **regression** problems, we can use a **linear GAM**, which models:

![alt tag](http://latex.codecogs.com/svg.latex?\mathbb{E}[y|X]=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p))

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM, s, f
from pygam.datasets import wage

X, y = wage(return_X_y=True)

# spline terms on year and age, a factor term on education
gam = LinearGAM(s(0) + s(1) + f(2)).gridsearch(X, y)

fig, axs = plt.subplots(1, 3)
titles = ['year', 'age', 'education']

for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, width=.95)

    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi, c='r', ls='--')
    ax.set_title(titles[i])
```
<img src=imgs/pygam_wage_data_linear.png>

Even though we allowed **n_splines=20** per numerical feature, our **smoothing penalty** reduces the fit to just ~19 **effective degrees of freedom**:

```
gam.summary()

LinearGAM
=============================================== ==========================================================
Distribution:                        NormalDist Effective DoF:                                     19.2602
Link Function:                     IdentityLink Log Likelihood:                                -24116.7451
Number of Samples:                         3000 AIC:                                            48274.0107
                                                AICc:                                           48274.2999
                                                GCV:                                             1250.3656
                                                Scale:                                           1235.9245
                                                Pseudo R-Squared:                                   0.2945
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code
================================= ==================== ============ ============ ============ ============
s(0)                              [15.8489]            20           6.9          5.52e-03     **
s(1)                              [15.8489]            20           8.5          1.11e-16     ***
f(2)                              [15.8489]            5            3.8          1.11e-16     ***
intercept                         0                    1            0.0          1.11e-16     ***
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
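
The amount of smoothing is controlled by the `lam` penalty, which `gridsearch` tunes automatically. As a quick illustration of the effect (reusing the wage `X`, `y`, and terms from above, and assuming the fitted model exposes the Effective DoF that `summary()` reports via `statistics_['edof']`), refitting with different `lam` values changes the effective degrees of freedom:

```python
# stronger penalties -> smoother fits -> fewer effective degrees of freedom
for lam in [0.01, 1, 100]:
    g = LinearGAM(s(0) + s(1) + f(2), lam=lam).fit(X, y)
    print(lam, g.statistics_['edof'])
```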

With a **LinearGAM**, we can also check the **prediction intervals**:

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM
from pygam.datasets import mcycle

X, y = mcycle(return_X_y=True)

gam = LinearGAM().gridsearch(X, y)
XX = gam.generate_X_grid(term=0)

plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')

plt.scatter(X, y, facecolor='gray', edgecolors='none')
plt.title('95% prediction interval')
```
<img src=imgs/pygam_mcycle_data_linear.png>

And simulate from the posterior:

```python
# continuing the last example with the mcycle dataset
for response in gam.sample(X, y, quantity='y', n_draws=50, sample_at_X=XX):
    plt.scatter(XX, response, alpha=.03, color='k')
plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')
plt.title('draw samples from the posterior of the coefficients')
```

<img src=imgs/pygam_mcycle_data_linear_sample_from_posterior.png>

## Classification
For **binary classification** problems, we can use a **logistic GAM**, which models:

![alt tag](http://latex.codecogs.com/svg.latex?\log\left(\frac{P(y=1|X)}{P(y=0|X)}\right)=\beta_0+f_1(X_1)+f_2(X_2)+\dots+f_p(X_p))

```python
import matplotlib.pyplot as plt

from pygam import LogisticGAM, s, f
from pygam.datasets import default

X, y = default(return_X_y=True)

gam = LogisticGAM(f(0) + s(1) + s(2)).gridsearch(X, y)

fig, axs = plt.subplots(1, 3)
titles = ['student', 'balance', 'income']

for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, width=.95)

    ax.plot(XX[:, i], pdep)
    ax.plot(XX[:, i], confi, c='r', ls='--')
    ax.set_title(titles[i])

# and check the accuracy
gam.accuracy(X, y)
```
<img src=imgs/pygam_default_data_logistic.png>

Since the **scale** of the **Binomial distribution** is known, our gridsearch minimizes the **Unbiased Risk Estimator (UBRE)** objective:

```
gam.summary()

LogisticGAM
=============================================== ==========================================================
Distribution:                      BinomialDist Effective DoF:                                      3.8047
Link Function:                        LogitLink Log Likelihood:                                   -788.877
Number of Samples:                        10000 AIC:                                             1585.3634
                                                AICc:                                             1585.369
                                                UBRE:                                               2.1588
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.4598
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code
================================= ==================== ============ ============ ============ ============
f(0)                              [1000.]              2            1.7          4.61e-03     **
s(1)                              [1000.]              20           1.2          0.00e+00     ***
s(2)                              [1000.]              20           0.8          3.29e-02     *
intercept                         0                    1            0.0          0.00e+00     ***
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
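
By default, `gridsearch` chooses its objective automatically: GCV when the scale is unknown (as in the LinearGAM example above) and UBRE when it is known. A minimal sketch, assuming `gridsearch`'s `objective` keyword, of forcing the choice explicitly:

```python
# continuing the default example; request the UBRE objective explicitly
gam = LogisticGAM(f(0) + s(1) + s(2)).gridsearch(X, y, objective='UBRE')
```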

## Poisson and Histogram Smoothing
We can intuitively perform **histogram smoothing** by modeling the counts in each bin as Poisson-distributed, via **PoissonGAM**.

```python
import matplotlib.pyplot as plt

from pygam import PoissonGAM
from pygam.datasets import faithful

X, y = faithful(return_X_y=True)

gam = PoissonGAM().gridsearch(X, y)

plt.hist(faithful(return_X_y=False)['eruptions'], bins=200, color='k');
plt.plot(X, gam.predict(X), color='r')
plt.title('Best Lambda: {0:.2f}'.format(gam.lam[0][0]));
```
<img src=imgs/pygam_poisson.png>

## Terms and Interactions

pyGAM can also fit interactions using tensor products via `te()`:
```python
from pygam import PoissonGAM, s, te
from pygam.datasets import chicago

X, y = chicago(return_X_y=True)

# te(3, 1) captures the interaction between features 3 and 1
gam = PoissonGAM(s(0, n_splines=200) + te(3, 1) + s(2)).fit(X, y)
```

and plot a 3D surface:

```python
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

XX = gam.generate_X_grid(term=1, meshgrid=True)
Z = gam.partial_dependence(term=1, X=XX, meshgrid=True)

ax = plt.axes(projection='3d')
ax.plot_surface(XX[0], XX[1], Z, cmap='viridis')
```
<img src=imgs/pygam_chicago_tensor.png>

For simple interactions, it is sometimes useful to add a by-variable to a term:

```python
from pygam import LinearGAM, s
from pygam.datasets import toy_interaction

X, y = toy_interaction(return_X_y=True)

# multiply the spline over feature 0 by the value of feature 1
gam = LinearGAM(s(0, by=1)).fit(X, y)
gam.summary()
```

#### Available Terms
- `l()` linear terms
- `s()` spline terms
- `f()` factor terms
- `te()` tensor products
- `intercept`

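Terms compose with `+`. As a quick illustration, all four term types can be mixed in one model (the data here is synthetic and purely hypothetical):

```python
import numpy as np
from pygam import LinearGAM, l, s, f, te

# hypothetical data: three numeric columns and one categorical column
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),          # column 0: linear term
    rng.uniform(size=100),         # column 1: spline term
    rng.uniform(size=100),         # column 2: part of a tensor product
    rng.integers(0, 3, size=100),  # column 3: factor term
])
y = X[:, 0] + np.sin(4 * X[:, 1]) + X[:, 1] * X[:, 2] + X[:, 3] + rng.normal(scale=.1, size=100)

# the intercept term is included automatically
gam = LinearGAM(l(0) + s(1) + f(3) + te(1, 2)).fit(X, y)
```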

## Custom Models
It's also easy to build custom models by using the base **GAM** class and specifying the **distribution** and the **link function**:

```python
import matplotlib.pyplot as plt

from pygam import GAM
from pygam.datasets import trees

X, y = trees(return_X_y=True)

gam = GAM(distribution='gamma', link='log')
gam.gridsearch(X, y)

plt.scatter(y, gam.predict(X))
plt.xlabel('true volume')
plt.ylabel('predicted volume')
```
<img src=imgs/pygam_custom.png>

We can check the quality of the fit by looking at the `Pseudo R-Squared`:

```
gam.summary()

GAM
=============================================== ==========================================================
Distribution:                         GammaDist Effective DoF:                                     25.3616
Link Function:                          LogLink Log Likelihood:                                   -26.1673
Number of Samples:                           31 AIC:                                              105.0579
                                                AICc:                                             501.5549
                                                GCV:                                                0.0088
                                                Scale:                                               0.001
                                                Pseudo R-Squared:                                   0.9993
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code
================================= ==================== ============ ============ ============ ============
s(0)                              [0.001]              20                        2.04e-08     ***
s(1)                              [0.001]              20                        7.36e-06     ***
intercept                         0                    1                         4.39e-13     ***
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

## Penalties / Constraints
With GAMs we can encode **prior knowledge** and **control overfitting** by using penalties and constraints.

#### Available penalties:
- second derivative smoothing (default on numerical features)
- L2 smoothing (default on categorical features)

#### Available constraints:
- monotonic increasing/decreasing smoothing
- convex/concave smoothing
- periodic smoothing [soon...]

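Penalties, like the constraints demonstrated just below, are set per term. A small sketch, assuming the `penalties` keyword on `s()` accepts the names `'derivative'` and `'l2'`:

```python
from pygam import LinearGAM, s

# hypothetical: override the default penalty choice on each spline term
gam = LinearGAM(s(0, penalties='derivative') + s(1, penalties='l2'))
```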

We can inject our intuition into our model by using **monotonic** and **concave** constraints:

```python
import matplotlib.pyplot as plt

from pygam import LinearGAM, s
from pygam.datasets import hepatitis

X, y = hepatitis(return_X_y=True)

gam1 = LinearGAM(s(0, constraints='monotonic_inc')).fit(X, y)
gam2 = LinearGAM(s(0, constraints='concave')).fit(X, y)

fig, ax = plt.subplots(1, 2)
ax[0].plot(X, y, label='data')
ax[0].plot(X, gam1.predict(X), label='monotonic fit')
ax[0].legend()

ax[1].plot(X, y, label='data')
ax[1].plot(X, gam2.predict(X), label='concave fit')
ax[1].legend()
```
<img src=imgs/pygam_constraints.png>

## API
pyGAM is intuitive, modular, and adheres to a familiar API:

```python
from pygam import LogisticGAM, s, f
from pygam.datasets import toy_classification

X, y = toy_classification(return_X_y=True)

gam = LogisticGAM(s(0) + s(1) + s(2) + s(3) + s(4) + f(5))
gam.fit(X, y)
```

Since GAMs are additive, it is also super easy to visualize each individual **feature function**, `f_i(X_i)`. These feature functions describe the effect of each `X_i` on `y` individually, while marginalizing out all other predictors:

```python
import matplotlib.pyplot as plt

plt.figure()
for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue
    plt.plot(gam.partial_dependence(term=i))
```
<img src=imgs/pygam_multi_pdep.png>

## Current Features
### Models
pyGAM comes with many models out-of-the-box:

- GAM (base class for constructing custom models)
- LinearGAM
- LogisticGAM
- GammaGAM
- PoissonGAM
- InvGaussGAM
- ExpectileGAM

You can mix and match distributions with link functions to create custom models!

```python
gam = GAM(distribution='gamma', link='inverse')
```
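
Of these models, only `ExpectileGAM` is not shown elsewhere in this README; a minimal sketch, reusing the mcycle data, fits two expectiles:

```python
from pygam import ExpectileGAM
from pygam.datasets import mcycle

X, y = mcycle(return_X_y=True)

# expectiles generalize the mean the way quantiles generalize the median
gam50 = ExpectileGAM(expectile=0.5).gridsearch(X, y)
gam95 = ExpectileGAM(expectile=0.95).gridsearch(X, y)
```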

### Distributions

- Normal
- Binomial
- Gamma
- Poisson
- Inverse Gaussian

### Link Functions
Link functions take the distribution mean to the linear prediction. These are the canonical link functions for the above distributions:

- Identity
- Logit
- Inverse
- Log
- Inverse-squared

### Callbacks
414-
Callbacks are performed during each optimization iteration. It's also easy to write your own.
415-
416-
- deviance - model deviance
417-
- diffs - differences of coefficient norm
418-
- accuracy - model accuracy for LogisticGAM
419-
- coef - coefficient logging
420-
421-
You can check a callback by inspecting its log:

```python
plt.plot(gam.logs_['deviance'])
```
<img src=imgs/pygam_multi_deviance.png>

### Linear Extrapolation
<img src=imgs/pygam_mcycle_data_extrapolation.png>

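The plot above uses the mcycle data. A minimal sketch of evaluating a fit beyond its training range, assuming the `gam`, `X`, and `y` from the prediction-interval example:

```python
import numpy as np
import matplotlib.pyplot as plt

# pyGAM's splines extrapolate linearly outside the observed range of X
XX = np.linspace(X.min() - 10, X.max() + 10, 500).reshape(-1, 1)

plt.scatter(X, y, facecolor='gray', edgecolors='none')
plt.plot(XX, gam.predict(XX), 'r--')
plt.title('linear extrapolation outside the training range')
```
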
## Citing pyGAM
Please consider citing pyGAM if it has helped you in your research or work:
