This folder contains configuration files that show how to configure the various functions and let users quickly reproduce the processing flows of different datasets.
# To process your dataset.
python tools/process_data.py --config xxx.yaml
# To analyse your dataset.
python tools/analyze_data.py --config xxx.yaml
The current configuration files fall into the following categories.
Demo configuration files help users quickly become familiar with the basic functions of Data-Juicer. Please refer to the demo folder for details.
We have reproduced the processing flows of some RedPajama datasets. Please refer to the redpajama folder for details.
We have reproduced the processing flows of some BLOOM datasets. Please refer to the bloom folder for details.
We have refined some open-source datasets (including SFT datasets) with Data-Juicer and provide configuration files for the refinement flows. Please refer to the refine_recipe folder for details.
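To try every configuration in one of these folders, the processing command can be generated once per YAML file. The following is a minimal sketch, not part of Data-Juicer: the helper name `build_commands` is hypothetical, and it assumes the configuration files end in `.yaml` and that the commands are run from a directory where `tools/process_data.py` resolves. It only prints the commands so they can be reviewed before running.

```python
from pathlib import Path


def build_commands(config_dir, script="tools/process_data.py"):
    """Return one command (as an argument list) per .yaml file under config_dir.

    Hypothetical helper for illustration; it does not run anything.
    """
    # Search subfolders too, and sort for a stable, reproducible order.
    configs = sorted(Path(config_dir).rglob("*.yaml"))
    return [["python", script, "--config", str(cfg)] for cfg in configs]


if __name__ == "__main__":
    # "demo" is one of the folders named above; adjust the path as needed.
    for cmd in build_commands("demo"):
        print(" ".join(cmd))
```

Passing `script="tools/analyze_data.py"` produces the analysis commands instead.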