Commit message:

* clean up test_api just a bit
* added a patsy transformer
* fixed tests and such
* also added a doc to the documentation page
* now testing for different group values
* fixed flake bug
* numpy needs to be loaded in the module. le strange
* change building of design matrix in patsytransformer for stateful transform (#104)
Showing 6 changed files with 222 additions and 5 deletions.
@@ -62,4 +62,5 @@ Usage
    install
    contribution
    mixture-methods
+   preprocessing
    api/modules
@@ -0,0 +1,71 @@
Preprocessing
=============

There are many preprocessors in scikit-lego and in this document we
would like to highlight a few, so that you might be inspired to use
pipelines a little more flexibly.

Patsy Formulas
**************

If you're used to the statistical programming language R you may have
seen a formula object before. A formula is a shorthand way to declare
the variables that go into a statistical model. The Python project
patsy_ took this idea and made it available for Python. In sklego we've
written a wrapper so that you can also use these formulas in your
pipelines.

.. code-block:: python

    import pandas as pd
    from sklego.transformers import PatsyTransformer

    df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                       "b": ["yes", "yes", "no", "maybe", "yes"],
                       "y": [2, 2, 4, 4, 6]})
    X, y = df[["a", "b"]], df[["y"]].values

    pt = PatsyTransformer("a + np.log(a) + b")
    pt.fit(X, y).transform(X)
This will result in the following array:

.. code-block:: python

    array([[1.        , 0.        , 1.        , 1.        , 0.        ],
           [1.        , 0.        , 1.        , 2.        , 0.69314718],
           [1.        , 1.        , 0.        , 3.        , 1.09861229],
           [1.        , 0.        , 0.        , 4.        , 1.38629436],
           [1.        , 0.        , 1.        , 5.        , 1.60943791]])

You might notice that the first column is a constant column of ones: the
intercept. You might also expect 3 dummy variable columns instead of 2.
This is because the design matrix from patsy attempts to keep the
columns in the matrix linearly independent of each other.

If this is not something you want, you can omit the intercept by adding
"- 1" to the formula.

.. code-block:: python

    pt = PatsyTransformer("a + np.log(a) + b - 1")
    pt.fit(X, y).transform(X)

This will result in the following array:

.. code-block:: python

    array([[0.        , 0.        , 1.        , 1.        , 0.        ],
           [0.        , 0.        , 1.        , 2.        , 0.69314718],
           [0.        , 1.        , 0.        , 3.        , 1.09861229],
           [1.        , 0.        , 0.        , 4.        , 1.38629436],
           [0.        , 0.        , 1.        , 5.        , 1.60943791]])

You'll notice that the constant column is gone and that a third dummy
column has taken its place. Again, this is possible because patsy wants
to guarantee that the columns of the matrix are linearly independent of
each other.

The formula syntax is quite powerful; if you'd like to learn more we
refer you to the formulas_ documentation.

.. _patsy: https://patsy.readthedocs.io/en/latest/
.. _formulas: https://patsy.readthedocs.io/en/latest/formulas.html
@@ -0,0 +1,105 @@
import pytest
import numpy as np
import pandas as pd

from sklego.transformers import PatsyTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


@pytest.fixture()
def df():
    return pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                         "b": np.log([10, 9, 8, 7, 6, 5]),
                         "c": ["a", "b", "a", "b", "c", "c"],
                         "d": ["b", "a", "a", "b", "a", "b"],
                         "e": [0, 1, 0, 1, 0, 1]})


def test_basic_usage(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]]
    tf = PatsyTransformer("a + b")
    assert tf.fit(X, y).transform(X).shape == (6, 3)


def test_min_sign_usage(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]]
    tf = PatsyTransformer("a + b - 1")
    assert tf.fit(X, y).transform(X).shape == (6, 2)


def test_apply_numpy_transform(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]]
    tf = PatsyTransformer("a + np.log(a) + b - 1")
    assert tf.fit(X, y).transform(X).shape == (6, 3)


def test_multiply_columns(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]]
    tf = PatsyTransformer("a*b - 1")
    assert tf.fit(X, y).transform(X).shape == (6, 3)


def test_transform_dummy1(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]]
    tf = PatsyTransformer("a + b + d")
    assert tf.fit(X, y).transform(X).shape == (6, 4)


def test_transform_dummy2(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]]
    tf = PatsyTransformer("a + b + c + d")
    assert tf.fit(X, y).transform(X).shape == (6, 6)


def test_mult_usage(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]]
    tf = PatsyTransformer("a*b - 1")
    assert tf.fit(X, y).transform(X).shape == (6, 3)


def test_design_matrix_in_pipeline(df):
    X, y = df[["a", "b", "c", "d"]], df[["e"]].values.ravel()
    pipe = Pipeline([
        ("design", PatsyTransformer("a + np.log(a) + b - 1")),
        ("scale", StandardScaler()),
        ("model", LogisticRegression(solver='lbfgs')),
    ])
    assert pipe.fit(X, y).predict(X).shape == (6,)


def test_subset_categories_in_test(df):
    df_train = df[:5]
    X_train, y_train = df_train[["a", "b", "c", "d"]], df_train[["e"]].values.ravel()

    df_test = df[5:]
    X_test, _ = df_test[["a", "b", "c", "d"]], df_test[["e"]].values.ravel()

    trf = PatsyTransformer("a + np.log(a) + b + c + d - 1")
    trf.fit(X_train, y_train)

    assert trf.transform(X_test).shape[1] == trf.transform(X_train).shape[1]


def test_design_matrix_error(df):
    df_train = df[:4]
    X_train, y_train = df_train[["a", "b", "c", "d"]], df_train[["e"]].values.ravel()

    df_test = df[4:]
    X_test, _ = df_test[["a", "b", "c", "d"]], df_test[["e"]].values.ravel()

    pipe = Pipeline([
        ("design", PatsyTransformer("a + np.log(a) + b + c + d - 1")),
        ("scale", StandardScaler()),
        ("model", LogisticRegression(solver='lbfgs')),
    ])

    pipe.fit(X_train, y_train)
    with pytest.raises(RuntimeError):
        pipe.predict(X_test)
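The last two tests hinge on the "stateful transform" behaviour the commit message mentions: the categories observed during fit are remembered, so transform produces the same columns for unseen data and fails loudly on categories it never saw. A minimal, hypothetical sketch of that idea in plain Python (`StatefulDummyEncoder` is invented for illustration and is not part of sklego or patsy):

```python
class StatefulDummyEncoder:
    """Toy dummy encoder: remembers the categories seen at fit time."""

    def fit(self, values):
        # record the category set once, in a fixed order
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # unseen categories cannot be mapped onto the fitted columns
        for v in values:
            if v not in self.categories_:
                raise RuntimeError(f"unseen category: {v!r}")
        # one dummy column per category learned during fit
        return [[1.0 if v == c else 0.0 for c in self.categories_]
                for v in values]


enc = StatefulDummyEncoder().fit(["a", "b", "a"])
enc.transform(["b", "a"])  # always two columns, in the order learned at fit time
```

This is why `test_subset_categories_in_test` gets matrices of equal width for train and test, while `test_design_matrix_error` expects a `RuntimeError` when the test set contains a category absent from training.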