Data preparation accounts for about 80% of the work during a data science project. Let's take that number down. dataPreparation will allow you to do most of the painful data preparation for a data science project with a minimum amount of code.
This package is:
- fast (uses `data.table` and exponential search)
- RAM efficient (performs operations by reference and column-wise to avoid copying data)
- stable (most exceptions are handled)
- verbose (logs a lot)
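To illustrate what "by reference" means in practice, here is a minimal sketch using plain `data.table`. It does not show dataPreparation's internals; the toy table `dt` and its columns are made up for the illustration.

```r
library(data.table)

# Hypothetical toy table, for illustration only
dt <- data.table(x = 1:3)

# Remember where the table lives in memory
address_before <- address(dt)

# `:=` adds the column in place instead of building a modified copy
dt[, y := x * 2L]

# Compare the memory address before and after the update
same_object <- identical(address_before, address(dt))
print(same_object)
```

Because the table is updated in place, no second copy of the data has to fit in RAM, which matters most on large data sets.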
Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are the following:
- Read: load the data set (this package does not cover this step; for CSV files we recommend `data.table::fread`)
- Correct: most of the time, some mistakes remain after reading (wrong formats, for example) and have to be corrected
- Transform: create new features from date, categorical, character... columns so that the information is usable by an ML algorithm (i.e. numeric or categorical)
- Filter: get rid of useless information in order to speed up computation
- Pre model transformation: specific manipulations for the chosen model (handling NA, discretization, one hot encoding, scaling...)
- Shape: put your data set in a nice shape usable by a ML algorithm
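As an example of the pre model transformation step, the sketch below scales numeric columns. It assumes the companion function `build_scales` from the package documentation (it is not listed in this README), and the toy table is hypothetical.

```r
library(data.table)
library(dataPreparation)

# Hypothetical toy data, for illustration only
toy <- data.table(age = c(20, 30, 40, 50),
                  income = c(1000, 2000, 3000, 4000))

# Learn the scaling parameters for each numeric column
scales <- build_scales(toy, cols = "auto", verbose = FALSE)

# Apply them to the data set
toy <- fast_scale(toy, scales = scales, verbose = FALSE)

# Column means should now be close to zero
print(colMeans(toy))
```

Keeping the learned `scales` in a separate object is a deliberate choice: the parameters computed on a training set can be reapplied to new data, so that both are transformed consistently.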
Here are the functions available in this package to tackle those issues:
Correct | Transform | Filter | Pre model manipulation | Shape |
---|---|---|---|---|
un_factor | generate_date_diffs | fast_filter_variables | fast_handle_na | shape_set |
find_and_transform_dates | generate_factor_from_date | which_are_constant | fast_discretization | same_shape |
find_and_transform_numerics | aggregate_by_key | which_are_in_double | fast_scale | set_as_numeric_matrix |
set_col_as_character | generate_from_factor | which_are_bijection | one_hot_encoder | |
set_col_as_numeric | generate_from_character | remove_sd_outlier | | |
set_col_as_date | fast_round | remove_rare_categorical | | |
set_col_as_factor | target_encode | remove_percentile_outlier | | |
All of these functions are integrated into the full pipeline function `prepare_set`.
For more details on how it works, check out our tutorial.
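The functions can also be called one at a time. The following is a minimal sketch of a few "Correct" and "Filter" steps on the bundled `messy_adult` data set. The order of the calls and the `verbose = FALSE` settings are choices made for this illustration, not necessarily what `prepare_set` does internally.

```r
library(dataPreparation)

data(messy_adult)
n_cols_before <- ncol(messy_adult)

# Correct: detect numerics and dates stored as character and convert them
messy_adult <- find_and_transform_numerics(messy_adult, verbose = FALSE)
messy_adult <- find_and_transform_dates(messy_adult, verbose = FALSE)

# Filter: drop columns that are constant, duplicated, or bijections of another column
messy_adult <- fast_filter_variables(messy_adult, verbose = FALSE)

print(c(before = n_cols_before, after = ncol(messy_adult)))
```

Calling the functions individually gives finer control over each step, while the full pipeline function is the shorter route when the defaults are acceptable.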
Install the package from CRAN:
```r
install.packages("dataPreparation")
```
To have the latest features, install the package from GitHub:
```r
library(devtools)
install_github("ELToulemonde/dataPreparation")
```
Load a toy data set:
```r
library(dataPreparation)
data(messy_adult)
head(messy_adult)
```
Run the full pipeline function:
```r
clean_adult <- prepare_set(messy_adult)
head(clean_adult)
```
That's it. For every function, you can check the documentation and/or the tutorial vignette.
dataPreparation has been developed and used by many active community members. Your help is very valuable to make it better for everyone.
- Check out the call for contributions to see what can be improved, or open an issue if you want something.
- Contribute to add new useful features.
- Contribute to the tests to make it more reliable.
- Contribute to the documentation to make it clearer for everyone.
- Contribute to the examples to share your experience with other users.
- Open an issue if you run into problems during development.
For more details, please refer to CONTRIBUTING.