How to Set Up for Your Own Datasets 🔧

If you wish to use these workflows with your own datasets, proceed with the following steps:

  1. Download the code repository and make sure the dependencies are installed.
  2. Edit config.json
  • Edit the config file to align with the field names and data types of your datasets. A sample template is provided using the sample datasets.
  • Under datasets
    • The key should be the dataset name. The value contains the following:
    • numeric_fields contains the field names that should be numeric (e.g. float or int)
    • date_fields contains the field names that are dates
    • primary_key is the field name of the presumed unique identifier
    • pre_cluster_exact_matches (bool type) is true if records with exact matches in all fields should be pre-clustered before deduplication.
      • Set this to true if exact matches across all fields are rampant in your dataset. Otherwise, leave it at the default, false.
  • Under tasks
    • The key should be in the form <task_name>-<dataset_name> where the task_name is either dedup or rl. The value contains the following:
    • recall_train (float type) is a parameter used for learning the blocking predicates. This parameter ranges from 0.0 to 1.0. See dedupe's API for more detail.
      • Decrease this parameter (e.g. to 0.9) only if the trained model generates too many blocks. Too many blocks can lead to an out-of-memory error during clustering.
    • fields contains the data types for each field. See dedupe's variable definition documentation for more detail.
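The structure described above can be sketched in Python as a dictionary that is then serialized to JSON. This is only an illustration: the dataset name `customers`, all field names, and the helper `find_config_problems` are hypothetical, and the shape of the `fields` entries assumes dedupe's dictionary-style variable definitions, which may differ from the template shipped with the repository. Compare it against the provided sample config.json before relying on it.

```python
import json

# Hypothetical dataset and field names, for illustration only.
config = {
    "datasets": {
        "customers": {
            "numeric_fields": ["age"],
            "date_fields": ["signup_date"],
            "primary_key": "customer_id",
            "pre_cluster_exact_matches": False,
        }
    },
    "tasks": {
        # Key format: <task_name>-<dataset_name>, task_name is dedup or rl.
        "dedup-customers": {
            "recall_train": 1.0,
            # Assumed shape, modeled on dedupe's variable definitions.
            "fields": [
                {"field": "name", "type": "String"},
                {"field": "postcode", "type": "Exact"},
            ],
        }
    },
}


def find_config_problems(cfg):
    """Return a list of human-readable problems found in the tasks section."""
    problems = []
    for key, task in cfg.get("tasks", {}).items():
        task_name, _, dataset_name = key.partition("-")
        if task_name not in ("dedup", "rl") or not dataset_name:
            problems.append(f"{key}: expected <task_name>-<dataset_name>")
        recall = task.get("recall_train")
        if not isinstance(recall, (int, float)) or not 0.0 <= recall <= 1.0:
            problems.append(f"{key}: recall_train must be within 0.0 to 1.0")
    return problems


# Text that could be saved as config.json after review.
config_text = json.dumps(config, indent=2)
```

Calling `find_config_problems(config)` before running the notebooks is a cheap way to catch a mistyped task key or an out-of-range recall_train value.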
  3. Set up the folder structure
  • The notebooks rely on the filename conventions and the folder structure to determine which file paths to read from and write to, so it's important to set this up correctly.
  • Make sure you've already filled out the datasets section of config.json (filling out the tasks section is not necessary to set up the folder structure).
  • Make sure you have the notebooks folder with the template notebook files.
    • The notebooks folder should be in the same directory as saved_files and the python_scripts folders.
  • Run this script: python_scripts/init_files_folders.py. This automatically creates the necessary folders and copies the template Jupyter notebooks.
    • You may also create the folders and copy the notebooks manually, although this is more cumbersome.
  • Copy each raw dataset into the appropriate folder in saved_files. Place it under the <dataset_name> folder with the filename raw_data.csv.
  • You're ready to go! Check the notebooks folder for the notebook templates pre-prepared for your own datasets.
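Before opening the notebooks, it can help to confirm that every dataset has its raw file in place. The sketch below is a hypothetical helper, not part of the repository. Because the exact layout inside saved_files is created by init_files_folders.py and is not spelled out here, it searches saved_files recursively for a folder named after each dataset and checks for raw_data.csv inside it.

```python
from pathlib import Path


def datasets_missing_raw_data(saved_files, dataset_names):
    """Return the dataset names that have no <dataset_name>/raw_data.csv
    anywhere under the saved_files directory."""
    root = Path(saved_files)
    missing = []
    for name in dataset_names:
        # Look at every directory under root whose name matches the dataset.
        found = any(
            candidate.is_dir() and (candidate / "raw_data.csv").is_file()
            for candidate in root.rglob(name)
        )
        if not found:
            missing.append(name)
    return missing
```

For example, `datasets_missing_raw_data("saved_files", ["customers"])` would return an empty list once the hypothetical `customers` dataset has its raw_data.csv copied into place.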