new feature request - automatically fit multiple variables #27
Comments
Great suggestion. I looked into it, but integrating such an approach is quite a lot of work. At the moment, all functions (figures, predict, etc.) are designed for single-column univariate data, not multi-column univariate data. I will put this on my never-ending-always-getting-longer to-do list.
Thank you for the reply! In the meantime, here is starter code that runs a distfit exploratory data analysis on multiple cores. Pandas DataFrames are used for readability; the code can easily be adapted to use NumPy instead. The illustration below uses a numeric-only dataset called Company Bankruptcy Prediction, which has 6819 rows and 96 columns. Note: error handling is required to run distfit on this dataset, because certain columns raise errors with or without parallel processing.
import numpy as np
from IPython.display import display  # already available in notebooks; needed when run as a script
import pandas as pd
import re
from distfit import distfit
from joblib import Parallel, delayed
import collections
pd.options.display.max_columns = 100
# Numeric-only data from here (sign-in required):
# https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction/download?datasetVersionNumber=2
#----------------------------------------------------------------------------------
#Clean up column names and lower memory.
#----------------------------------------------------------------------------------
df = pd.read_csv("./data.csv")
for c in df.columns:  # clean up column names
    no_beg_end_spaces = c.strip()
    result = re.sub(r"\s+", "_", no_beg_end_spaces)
    df.rename(columns={c: result}, inplace=True)
print('df shape:', df.shape)
display(df.tail(3))
for c in df.columns:
    df[c] = pd.to_numeric(df[c], downcast='float')
#----------------------------------------------------------------------------------
#Use joblib to run distfit on CPU cores in parallel.
#----------------------------------------------------------------------------------
chunks = [df[[c]] for c in df.columns]  # each chunk is a single-column DataFrame due to the univariate constraint
display(chunks[0].head())
display(chunks[1].head())
def get_distfit(series):
    # Return the name of the best-fitting distribution, or 'ERROR' if the fit fails.
    try:
        result = dfit.fit_transform(series.values, verbose=30)
        return result['model']['name']
    except Exception:
        return 'ERROR'
dfit = distfit()
with Parallel(n_jobs=-2, prefer="processes") as parallel:  # n_jobs=-2 uses all cores but one
    results = parallel(delayed(get_distfit)(chunk) for chunk in chunks)
display(list(zip(df.columns, results))[0:5]) #show best distribution by column
display(sorted(collections.Counter(results).items(), key=lambda x:x[1], reverse=True))
#----------------------------------------------------------------------------------
#Get best distribution one column at a time (slower than parallel run).
#----------------------------------------------------------------------------------
sequential_outputs = []
for chunk in chunks:
    sequential_outputs.append(get_distfit(chunk))
display(list(zip(df.columns, sequential_outputs))[0:5]) #show best distribution by column
display(sorted(collections.Counter(sequential_outputs).items(), key=lambda x:x[1], reverse=True))
I recommend extending dfit.fit_transform(X) to accept multiple variables, with each variable fitted individually.
matrix rows = samples
matrix columns = features (variables)
The proposed functionality mirrors the popular scikit-learn API. Here is an example of that API: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
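To make the request concrete, here is a minimal sketch of the column-wise behavior described above. It is not distfit code: `fit_one` and `fit_columns` are hypothetical names, and the standard library's `statistics.NormalDist` stands in for a real univariate fitter so that the example runs without extra dependencies. Rows are samples, columns are features, and a column that fails to fit is recorded instead of stopping the run.

```python
from statistics import NormalDist


def fit_one(values):
    """Stand-in univariate fitter (hypothetical): fit a normal distribution to one column."""
    dist = NormalDist.from_samples(values)  # needs at least two samples
    return {"name": "norm", "loc": dist.mean, "scale": dist.stdev}


def fit_columns(X, fitter=fit_one):
    """Fit every column of X independently.

    X is a list of rows (samples); each row is a list of feature values.
    Returns one result dict per column, in column order. A column whose
    fit raises is reported as {"name": "ERROR", ...} rather than aborting.
    """
    results = []
    for column in zip(*X):  # transpose: iterate over features, not samples
        try:
            results.append(fitter(list(column)))
        except Exception as exc:
            results.append({"name": "ERROR", "reason": str(exc)})
    return results


# Three samples, two features.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
fitted = fit_columns(X)
print([r["name"] for r in fitted])
```

Passing the fitter as an argument keeps the column loop separate from the fitting logic, so a real distfit call could be swapped in for `fit_one` without changing the loop or its error handling.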
Also, parallel processing across a multi-core CPU would be an awesome enhancement! :-)
Guillaume Lemaitre (https://github.com/glemaitre) committed code for sklearn.utils.parallel. He is a developer for the scikit-learn foundation. He may be a good contact on how best to implement parallel processing in Python in 2023.