This package aims to be a simple port of the regsubsets function from R. Unlike leaps in R, the package is not optimized yet, and the code still needs work to improve its readability.
For now, I have implemented forward, backward, and best subset selection for linear regression, building on top of the excellent statsmodels package.
In addition, categorical variable contrasts currently have to be coded manually by the user. This will be fixed in the future.
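As a minimal sketch of what manual contrast coding can look like, the snippet below builds treatment (dummy) contrasts with pandas before any model selection is run. The toy data frame and its column names (`speed`, `surface`, `noise`) are hypothetical and only serve as an illustration; this is one common way to do it, not a pyleaps feature.

```python
import pandas as pd

# Hypothetical toy data: "surface" is a categorical predictor with three levels.
toy = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 4.0],
    "surface": ["rough", "smooth", "rough", "coated"],
    "noise": [10.0, 12.5, 11.0, 14.0],
})

# Treatment (dummy) coding: one 0/1 column per level, dropping the first level
# in sorted order ("coated") so that it acts as the reference category.
encoded = pd.get_dummies(toy, columns=["surface"], drop_first=True, dtype=float)

print(encoded.columns.to_list())
```

The encoded frame contains only numeric columns, so it should be usable as input in the same way as the data frame in the usage section, with `noise` as the response.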
The package can be installed with pip. See https://pypi.org/project/pyleaps/ for details.
pip install pyleaps
Any dependencies missing from the environment should be installed automatically.
Any help or collaboration is very welcome. Just let me know what kind of changes you propose and I will be very happy to discuss them.
This section contains a list of future edits:
- Improve general code readability
- Figure out a way to speed up best subset selection. So far, it is much slower than the R counterpart.
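To give a rough idea of why best subset selection is the expensive method, the sketch below counts how many models a naive exhaustive search has to fit compared with a greedy forward search. It assumes a plain enumeration of all non-empty subsets; the helper names are hypothetical and this is not necessarily how pyleaps or leaps organize the search.

```python
from itertools import combinations
from math import comb


def n_best_subset_fits(p):
    """Models fitted by a naive exhaustive search: every non-empty subset of p predictors."""
    return sum(comb(p, k) for k in range(1, p + 1))  # equals 2**p - 1


def n_forward_fits(p):
    """Models fitted by greedy forward selection: p + (p - 1) + ... + 1."""
    return p * (p + 1) // 2


# The five predictors of the data set used in the usage section.
predictors = ["freq", "aoa", "ch_len", "u", "suc_thick"]
subsets = [c for k in range(1, len(predictors) + 1) for c in combinations(predictors, k)]

print(len(subsets), n_best_subset_fits(5), n_forward_fits(5))
print(n_best_subset_fits(20), n_forward_fits(20))
```

The exhaustive count doubles with every additional predictor, while the greedy count only grows quadratically, which is why pruning strategies matter for best subset selection.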
This section demonstrates how to use the package. In this instance, I will use the airfoil self-noise data set from the popular UCI data set repository. Please visit https://archive.ics.uci.edu/ml for further details.
import pandas as pd
import pyleaps
import matplotlib.pyplot as plt
Loading the data set
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat", sep="\t", header=None)
df.columns = ["freq", "aoa", "ch_len", "u", "suc_thick", "sound_db"]
pyleaps.regsubsets(df, "sound_db", df.columns.to_list(), intercept=True, method="full").summary
| | r2 | r2_adj | bic | aic | ssr | vars |
---|---|---|---|---|---|---|
1 | 0.152655 | 0.152091 | 9835.55871 | 9824.928273 | 60570.206223 | [intercept, freq] |
2 | 0.323783 | 0.322882 | 9503.806546 | 9487.860891 | 48337.58386 | [intercept, freq, suc_thick] |
3 | 0.43992 | 0.438799 | 9227.905534 | 9206.64466 | 40035.855465 | [intercept, freq, ch_len, suc_thick] |
4 | 0.484574 | 0.483198 | 9110.342833 | 9083.766741 | 36843.885086 | [intercept, u, freq, aoa, ch_len] |
5 | 0.51571 | 0.514092 | 9024.006788 | 8992.115478 | 34618.219133 | [intercept, u, freq, aoa, ch_len, suc_thick] |
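The columns of the summary are related to each other. The sketch below recomputes `aic`, `bic`, and `r2_adj` for the first row of the table above from its `ssr` and `r2` values. It assumes that the columns follow the usual Gaussian log-likelihood definitions used by statsmodels for OLS, and that the data set has n = 1503 rows; both are assumptions rather than something stated by the package, and the function names are hypothetical.

```python
import math


def gaussian_aic_bic(ssr, n, k):
    """AIC and BIC of an OLS fit with Gaussian errors.

    k is the number of estimated coefficients, including the intercept if present.
    """
    llf = -0.5 * n * (math.log(2 * math.pi) + math.log(ssr / n) + 1)
    aic = -2 * llf + 2 * k
    bic = -2 * llf + k * math.log(n)
    return aic, bic


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with an intercept and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 1503  # assumed number of rows in the airfoil data set

# First row of the table: [intercept, freq], so k = 2 coefficients and p = 1 predictor.
aic, bic = gaussian_aic_bic(60570.206223, n, k=2)
print(round(aic, 3), round(bic, 3), round(adjusted_r2(0.152655, n, p=1), 5))
```

Under these assumptions, `bic - aic` equals `k * (log(n) - 2)`, so BIC penalizes every extra coefficient more heavily than AIC for a sample of this size.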
pyleaps.regsubsets(df, "sound_db", df.columns.to_list(), intercept=False, method="full").summary
| | r2 | r2_adj | bic | aic | ssr | vars |
---|---|---|---|---|---|---|
1 | 0.915417 | 0.915361 | 15074.73871 | 15069.423491 | 1987206.653961 | [u] |
2 | 0.925304 | 0.925204 | 14895.227275 | 14884.596838 | 1754927.356895 | [u, ch_len] |
3 | 0.937199 | 0.937073 | 14641.845804 | 14625.900149 | 1475469.985069 | [u, aoa, ch_len] |
4 | 0.938688 | 0.938524 | 14613.094355 | 14591.833481 | 1440485.372375 | [u, freq, aoa, ch_len] |
5 | 0.939991 | 0.93979 | 14588.128228 | 14561.552136 | 1409876.596034 | [u, freq, aoa, ch_len, suc_thick] |
pyleaps.regsubsets(df, "sound_db", df.columns.to_list(), intercept=True, method="forward").summary
| | r2 | r2_adj | bic | aic | ssr | vars |
---|---|---|---|---|---|---|
1 | 0.152655 | 0.152091 | 9835.55871 | 9824.928273 | 60570.206223 | [intercept, freq] |
2 | 0.323783 | 0.322882 | 9503.806546 | 9487.860891 | 48337.58386 | [intercept, freq, suc_thick] |
3 | 0.43992 | 0.438799 | 9227.905534 | 9206.64466 | 40035.855465 | [intercept, freq, suc_thick, ch_len] |
4 | 0.477646 | 0.476251 | 9130.411111 | 9103.835019 | 37339.129014 | [intercept, freq, suc_thick, ch_len, u] |
5 | 0.51571 | 0.514092 | 9024.006788 | 8992.115478 | 34618.219133 | [intercept, freq, suc_thick, ch_len, u, aoa] |
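In the forward table above, the four-variable model has a lower r2 (0.477646) than the four-variable model found with method="full" (0.484574). This is expected from a greedy search: once a variable has been added it is never removed. The sketch below shows the generic greedy procedure on synthetic data, using NumPy least squares with an intercept. It is an illustration under these assumptions, not the pyleaps implementation, and the function names are hypothetical.

```python
import numpy as np


def ssr(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def forward_order(X, y):
    """Greedy forward selection: at each step add the column that lowers the SSR the most."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = min(remaining, key=lambda c: ssr(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data: column 0 has a strong effect, column 2 a weaker one, column 1 none.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

order = forward_order(X, y)
print(order)
```

Each model in the returned order contains the previous one as a subset, which matches the nested `vars` lists in the forward table.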
pyleaps.regsubsets(df, "sound_db", df.columns.to_list(), intercept=False, method="forward").summary
| | r2 | r2_adj | bic | aic | ssr | vars |
---|---|---|---|---|---|---|
1 | 0.915417 | 0.915361 | 15074.73871 | 15069.423491 | 1987206.653961 | [u] |
2 | 0.925304 | 0.925204 | 14895.227275 | 14884.596838 | 1754927.356895 | [u, ch_len] |
3 | 0.937199 | 0.937073 | 14641.845804 | 14625.900149 | 1475469.985069 | [u, ch_len, aoa] |
4 | 0.938688 | 0.938524 | 14613.094355 | 14591.833481 | 1440485.372375 | [u, ch_len, aoa, freq] |
5 | 0.939991 | 0.93979 | 14588.128228 | 14561.552136 | 1409876.596034 | [u, ch_len, aoa, freq, suc_thick] |
pyleaps.regsubsets(df, "sound_db", df.columns.to_list(), intercept=True, method="backward").summary
| | r2 | r2_adj | bic | aic | ssr | vars |
---|---|---|---|---|---|---|
5 | 0.51571 | 0.514092 | 9024.006788 | 8992.115478 | 34618.219133 | [intercept, u, freq, aoa, ch_len, suc_thick] |
4 | 0.484574 | 0.483198 | 9110.342833 | 9083.766741 | 36843.885086 | [intercept, u, freq, aoa, ch_len] |
3 | 0.427997 | 0.426852 | 9259.565127 | 9238.304253 | 40888.126119 | [intercept, freq, aoa, ch_len] |
2 | 0.227202 | 0.226172 | 9704.462269 | 9688.516614 | 55241.410754 | [intercept, freq, aoa] |
1 | 0.152655 | 0.152091 | 9835.55871 | 9824.928273 | 60570.206223 | [intercept, freq] |
pyleaps.regsubsets(df, "sound_db", df.columns.to_list(), intercept=False, method="backward").summary
| | r2 | r2_adj | bic | aic | ssr | vars |
---|---|---|---|---|---|---|
5 | 0.939991 | 0.93979 | 14588.128228 | 14561.552136 | 1409876.596034 | [u, freq, aoa, ch_len, suc_thick] |
4 | 0.938688 | 0.938524 | 14613.094355 | 14591.833481 | 1440485.372375 | [u, freq, aoa, ch_len] |
3 | 0.937199 | 0.937073 | 14641.845804 | 14625.900149 | 1475469.985069 | [u, aoa, ch_len] |
2 | 0.925304 | 0.925204 | 14895.227275 | 14884.596838 | 1754927.356895 | [u, ch_len] |
1 | 0.915417 | 0.915361 | 15074.73871 | 15069.423491 | 1987206.653961 | [u] |