This Python script provides a comprehensive collection of functions and utilities designed to streamline various stages of the data science workflow, including data preprocessing, exploratory data analysis, model training, and evaluation. The script leverages several popular libraries, such as pandas, numpy, scikit-learn, seaborn, and matplotlib, among others.
To use the functions in this script, ensure that you have the required libraries installed. You can install them using pip:
pip install pandas numpy scikit-learn seaborn matplotlib joblib xgboost lightgbm catboost
- Imputing missing values: Functions to handle missing data using various imputation techniques (KNNImputer, SimpleImputer, etc.).
- Encoding: Functions to encode categorical variables using methods like LabelEncoder, OneHotEncoder, etc.
- Scaling: Functions to scale numerical data using techniques like MinMaxScaler, StandardScaler, and RobustScaler.
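The preprocessing steps listed above can be sketched with scikit-learn directly. This is a minimal illustration on made-up data, not the script's own helper functions; the column names `age` and `city` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): one numeric column with a gap, one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Rome", "Rome", "Paris"],
})

# Imputation: fill the missing age with the median of the observed values.
age_imputed = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# Scaling: standardize the imputed column to zero mean and unit variance.
age_scaled = StandardScaler().fit_transform(age_imputed)

# Encoding: one-hot encode the categorical column (categories are sorted,
# so the output columns are Paris, Rome).
encoder = OneHotEncoder(handle_unknown="ignore")
city_encoded = encoder.fit_transform(df[["city"]]).toarray()

print(age_imputed.ravel())
print(city_encoded)
```

KNNImputer, MinMaxScaler, and RobustScaler follow the same `fit_transform` pattern, so they can be swapped in without changing the structure of the sketch.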
- Dataframe Overview: Functions to get a quick summary of the dataset including shape, data types, and missing values.
- Column Classification: Functions to classify columns into categorical, numerical, and categorical but cardinal (categorical columns with a very large number of unique values).
- Summarization: Functions to generate summaries for categorical and numerical columns.
- Correlation Analysis: Functions to visualize and analyze correlations between features.
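As a rough sketch of what the overview, column classification, and correlation steps above involve, the example below uses plain pandas. The helper `classify_columns` and its thresholds `cat_th` and `car_th` are hypothetical and chosen for illustration; they are not taken from the script, whose own functions may use different rules.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def classify_columns(df, cat_th=10, car_th=20):
    """Hypothetical helper: split columns by dtype and unique-value count."""
    numeric = [c for c in df.columns if is_numeric_dtype(df[c])]
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    # Numeric columns with few distinct values behave like categories.
    num_but_cat = [c for c in numeric if df[c].nunique() < cat_th]
    # Non-numeric columns with many distinct values are "cardinal".
    cat_but_car = [c for c in non_numeric if df[c].nunique() > car_th]
    cat_cols = [c for c in non_numeric + num_but_cat if c not in cat_but_car]
    num_cols = [c for c in numeric if c not in num_but_cat]
    return cat_cols, num_cols, cat_but_car


# Toy data (hypothetical) covering each column type.
df = pd.DataFrame({
    "id": [f"u{i}" for i in range(30)],                 # 30 unique strings
    "segment": ["a", "b", "c"] * 10,                    # 3 unique strings
    "rating": [1, 2, 3, 4, 5] * 6,                      # numeric, 5 unique values
    "income": [1000.0 + 37.5 * i for i in range(30)],   # continuous
})
df["spend"] = 2 * df["income"] + 1                      # perfectly correlated with income

# Quick overview: shape, dtypes, and missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

cat_cols, num_cols, cat_but_car = classify_columns(df)
corr = df[num_cols].corr()
print(cat_cols, num_cols, cat_but_car)
print(corr)
```

The correlation matrix can then be passed to a plotting function such as seaborn's `heatmap` for visual inspection.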
- Supervised Learning: Pre-defined functions to train and evaluate models using various algorithms such as Random Forest, Gradient Boosting, Logistic Regression, etc.
- Unsupervised Learning: Utilities for clustering and dimensionality reduction using KMeans, PCA, AgglomerativeClustering, etc.
- Model Selection: Tools for cross-validation, hyperparameter tuning (GridSearchCV, RandomizedSearchCV), and ensemble methods.
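The model selection workflow described above can be illustrated with a small grid search. This is a sketch using scikit-learn on synthetic data; the parameter grid, the dataset, and the choice of a random forest are assumptions for the example and do not reflect the script's defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (hypothetical stand-in for a real dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Small illustrative grid: 2 x 2 = 4 candidate settings, each scored with 3-fold CV.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set, then scored on held-out data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(round(test_accuracy, 3))
```

RandomizedSearchCV takes the same estimator and scoring arguments but samples a fixed number of settings (`n_iter`) from `param_distributions`, which tends to be cheaper on large grids.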
- Outlier Detection and Removal: Functions to detect and handle outliers using statistical methods like IQR.
- Missing Value Analysis: Functions to summarize and handle missing data, including quick imputation methods.
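The IQR rule mentioned above flags values that fall more than a multiple of the interquartile range outside the quartiles. The sketch below assumes the common 25th/75th percentiles and a multiplier of 1.5; the function name `iqr_bounds` and these defaults are illustrative and may differ from the script's own settings.

```python
import pandas as pd


def iqr_bounds(series, q1=0.25, q3=0.75, k=1.5):
    """Hypothetical helper: return (lower, upper) limits from the IQR rule."""
    lower_q = series.quantile(q1)
    upper_q = series.quantile(q3)
    iqr = upper_q - lower_q
    return lower_q - k * iqr, upper_q + k * iqr


# Toy data (hypothetical) with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

low, high = iqr_bounds(values)

# Detection: values outside the limits. Removal: keep only values inside them.
outliers = values[(values < low) | (values > high)]
cleaned = values[(values >= low) & (values <= high)]

print(low, high)
print(outliers.tolist())
```

Instead of dropping rows, outliers can also be capped to the limits with `values.clip(low, high)`, which keeps the sample size unchanged.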
To use these functions in your project, simply import the required_functions.py file into your Python script:
from required_functions import *
You can then call any of the provided functions directly:
import pandas as pd  # explicit import so pd is defined regardless of what the star import exposes
df = pd.read_csv('your_data.csv')
check_df(df)
cat_cols, num_cols, cat_but_car = grab_col_names(df)
This project is licensed under the MIT License - see the LICENSE file for details.