If you wish to use these workflows with your own datasets, proceed with the following steps:
- Download the code repository and make sure the dependencies are installed.
- Edit `config.json`
  - Edit the config file to align with the field names and data types of your datasets. A sample template based on the sample datasets is provided.
  - Under `datasets`:
    - The key should be the dataset name. The value contains the following:
      - `numeric_fields` contains the field names that should be numeric (e.g. `float` or `int`).
      - `date_fields` contains the field names that are dates.
      - `primary_key` is the field name of the presumed unique identifier.
      - `pre_cluster_exact_matches` (`bool` type) is `true` if records that match exactly in all fields should be pre-clustered before deduplication.
        - Set this to `true` if your dataset contains many records that match exactly in all fields. Otherwise, leave it at the default `false`.
  - Under `tasks`:
    - The key should be in the form `<task_name>-<dataset_name>`, where `task_name` is either `dedup` or `rl`. The value contains the following:
      - `recall_train` (`float` type) is a parameter used for learning the blocking predicates. It ranges from 0.0 to 1.0. See dedupe's API for more detail.
        - Only decrease this parameter (e.g. to 0.9) if too many blocks are generated after model training; too many blocks from the trained model can lead to an out-of-memory error during clustering.
      - `fields` contains the data types for each field. See dedupe's variable definition documentation for more detail.
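
  For illustration, a minimal sketch of what one dataset entry and its deduplication task might look like is shown below. The dataset name `customers`, its field names, and the variable types are hypothetical, and the exact shape of the `fields` value should follow dedupe's variable definition documentation and the provided sample template:

  ```json
  {
    "datasets": {
      "customers": {
        "numeric_fields": ["purchase_amount"],
        "date_fields": ["signup_date"],
        "primary_key": "customer_id",
        "pre_cluster_exact_matches": false
      }
    },
    "tasks": {
      "dedup-customers": {
        "recall_train": 1.0,
        "fields": [
          {"field": "name", "type": "String"},
          {"field": "address", "type": "String"},
          {"field": "signup_date", "type": "String"}
        ]
      }
    }
  }
  ```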
- Set up the folder structure
  - The notebooks rely on the filename conventions and folder structure to determine which file paths to read from and write to, so it is important to set this up correctly.
  - Make sure you've already filled out the `datasets` section of `config.json` (filling out the `tasks` section is not necessary for setting up the folder structure).
  - Make sure you have the `notebooks` folder with the template notebook files.
    - The `notebooks` folder should be in the same directory as the `saved_files` and `python_scripts` folders.
  - Run this script: `python_scripts/init_files_folders.py`. This automatically creates the necessary folders and copies the template Jupyter notebooks.
    - You may also opt to do this yourself, although it may be more cumbersome; see the sketch below.
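
  If you do set things up by hand, the following is a rough sketch of the manual steps, assuming the layout described above. This is not the contents of `init_files_folders.py`, and the per-dataset notebook naming is made up; running the script is the recommended path:

  ```python
  import json
  import shutil
  from pathlib import Path

  # Dataset names come from the "datasets" section of config.json.
  with open("config.json") as f:
      config = json.load(f)

  template_notebooks = list(Path("notebooks").glob("*.ipynb"))

  for dataset_name in config["datasets"]:
      # Create saved_files/<dataset_name>/, where raw_data.csv and outputs will live.
      (Path("saved_files") / dataset_name).mkdir(parents=True, exist_ok=True)

      # Copy each template notebook so there is a per-dataset copy to edit.
      # (Hypothetical naming scheme; the real script may name the copies differently.)
      for template in template_notebooks:
          copy_path = Path("notebooks") / f"{template.stem}-{dataset_name}.ipynb"
          if not copy_path.exists():
              shutil.copy(template, copy_path)
  ```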
- Copy the raw dataset(s) into the appropriate folder(s) in `saved_files`
  - Place each one under its `<dataset_name>` folder with the filename `raw_data.csv`.
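
  Once the raw files are in place, the working directory should look roughly like this (the dataset name `customers` is hypothetical, and the notebook filenames depend on the templates):

  ```
  .
  ├── config.json
  ├── notebooks/          # template notebooks and the copies prepared per dataset
  ├── python_scripts/
  │   └── init_files_folders.py
  └── saved_files/
      └── customers/
          └── raw_data.csv
  ```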
- You're ready to go! Check the `notebooks` folder for the notebook templates prepared for your own datasets.