
Refine Alpaca-CoT Config Files

This folder contains configuration files that allow users to easily and quickly refine the Alpaca-CoT dataset.

Preprocess

The raw data files can be downloaded from Alpaca-CoT on HuggingFace.
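If helpful, the snapshot can also be fetched programmatically. Below is a minimal sketch using huggingface_hub; the repo id and local directory are assumptions to verify against the Hugging Face page:

# Sketch: download the raw Alpaca-CoT files (repo id and paths are assumptions).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="QingyiSi/Alpaca-CoT",   # assumed dataset repo id; check on HuggingFace
    repo_type="dataset",
    local_dir="./Alpaca-CoT",        # becomes <Alpaca-CoT_src_dir> in the next step
)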

Convert raw Alpaca-CoT data to jsonl

Use raw_alpaca_cot_merge_add_meta.py to select the instruction, input and output columns, merge them into a single text field separated by a space, and add extra META info to the dataset:

python tools/preprocess/raw_alpaca_cot_merge_add_meta.py    \
    --src_dir             <Alpaca-CoT_src_dir>              \
    --target_dir          <target_dir>                      \
    --num_proc            <num_proc>
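Conceptually, each raw sample goes through a conversion like the one sketched below (illustrative only, not the actual script; the meta values shown are made up):

# Minimal sketch of the per-sample merge (illustrative, not the actual script).
def merge_sample(sample: dict, meta: dict) -> dict:
    # Join instruction, input and output into a single "text" field with spaces,
    # treating missing columns as empty, and attach the extra META info.
    parts = [sample.get(key, "") for key in ("instruction", "input", "output")]
    return {"text": " ".join(p for p in parts if p), "meta": meta}

raw = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(merge_sample(raw, {"Dataset": "alpaca"}))
# {'text': 'Translate to French. Good morning Bonjour', 'meta': {'Dataset': 'alpaca'}}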

Split datasets to sub-datasets by language

Use dataset_split_by_language.py to split the dataset into EN and ZH sub-datasets:

python tools/preprocess/dataset_split_by_language.py    \
    --src_dir             <src_dir>                     \
    --target_dir          <target_dir>                  \
    --suffixes            jsonl                         \
    --num_proc            <num_proc>
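The split routes each sample to the EN or ZH sub-dataset according to the language of its text field. The sketch below is a simplified stand-in that only checks for CJK characters; the actual script may rely on a proper language-identification model:

# Simplified language split (illustrative; real tooling may use a trained language identifier).
import json
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs range

def split_by_language(src_path, en_path, zh_path):
    with open(src_path) as src, open(en_path, "w") as en, open(zh_path, "w") as zh:
        for line in src:
            sample = json.loads(line)
            target = zh if CJK.search(sample.get("text", "")) else en
            target.write(json.dumps(sample, ensure_ascii=False) + "\n")

# split_by_language("alpaca_cot.jsonl", "alpaca_cot_en.jsonl", "alpaca_cot_zh.jsonl")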

Process

After preprocessing, modify the dataset path in alpaca-cot-en-refine.yaml and alpaca-cot-zh-refine.yaml, then execute the following commands to reproduce the processing flow of the refined Alpaca-CoT dataset.

# refine English dataset
python tools/process_data.py --config configs/refine_recipe/alpaca_cot/alpaca-cot-en-refine.yaml

# refine Chinese dataset
python tools/process_data.py --config configs/refine_recipe/alpaca_cot/alpaca-cot-zh-refine.yaml
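If you prefer to patch the path programmatically rather than editing the YAML by hand, something like the following works; the dataset_path and export_path keys and the file locations are assumptions to verify against the actual recipe file:

# Sketch: point the recipe at the preprocessed sub-dataset (key names assumed; verify in the recipe).
import yaml  # PyYAML

cfg_file = "configs/refine_recipe/alpaca_cot/alpaca-cot-en-refine.yaml"
with open(cfg_file) as f:
    cfg = yaml.safe_load(f)

cfg["dataset_path"] = "/path/to/alpaca_cot_en.jsonl"           # preprocessed EN data (placeholder)
cfg["export_path"] = "/path/to/alpaca_cot_en_refined.jsonl"    # output location (placeholder)

with open(cfg_file, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)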

Meta Info

Each sample in the refined Alpaca-CoT data contains the meta info listed below:

Alpaca-CoT original meta info

  • Language Tags:
    • EN: Instruction datasets in English
    • CN: Instruction datasets in Chinese
    • ML: [Multi-lingual] Instruction datasets in multiple languages
  • Task Tags:
    • MT: [Multi-task] Datasets containing multiple tasks
    • TS: [Task-specific] Datasets tailored for specific tasks
  • Generation-method:
    • HG: [Human Generated Dataset] Datasets created by humans
    • SI: [Self-Instruct] Datasets generated using self-instruct methods
    • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
    • COL: [Collection of Dataset] Dataset made from a collection of other datasets

Data-Juicer Meta info

  • Dataset: name of the source dataset in Alpaca-CoT
  • Multi-round Dialog (MRD): Multi-round Dialog datasets
  • IFT: Instruction Fine-Tuning datasets
  • SFT: Supervised Fine-Tuning datasets
  • Preference: Preference datasets
  • origin_path: original file path in Alpaca-CoT
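Taken together, the meta block attached to a refined sample looks roughly like this (a sketch; key names follow the tags above, and the values are illustrative and differ per source dataset):

# Illustrative meta block for one refined sample (example values, not real data).
meta = {
    "Dataset": "alpaca",        # source dataset name in Alpaca-CoT
    "Task": "MT",               # multi-task
    "Gen": "SI",                # generated via self-instruct
    "Lang": "EN",               # English
    "MRD": False,               # not a multi-round dialog dataset
    "IFT": False,
    "SFT": True,                # supervised fine-tuning data
    "Preference": False,
    "origin_path": "alpaca/alpaca_data.json",  # hypothetical origin file path
}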

Refined Alpaca-CoT dataset Meta info

| Dataset | Task | Gen | Lang | MRD | IFT | SFT | Preference |
|---|---|---|---|---|---|---|---|
| Chain-of-Thought | MT | HG | EN/CN | | | | |
| GPT4all | MT | COL | EN | | | | |
| GPTeacher | MT | SI | EN | | | | |
| Guanaco | MT | SI | ML | | | | |
| HC3 | TS | MIX | EN/CN | | | | |
| alpaca | MT | SI | EN | | | | |
| Natural-Instructions | MT | COL | ML | | | | |
| belle_cn | TS/MT | SI | CN | | | | |
| instinwild | MT | SI | EN/CN | | | | |
| prosocial-dialog | TS | MIX | EN | | | | |
| finance | TS | COL | EN | | | | |
| xP3 | MT | COL | ML | | | | |
| firefly | MT | COL | CN | | | | |
| instruct | MT | COL | EN | | | | |
| CodeAlpaca | TS | SI | EN | | | | |
| alpacaGPT4 | MT | SI | EN/CN | | | | |
| webGPT | TS | MIX | EN | | | | |
| dolly | TS | HG | EN | | | | |
| baize | MT | COL | EN | | | | |
| hh-rlhf | TS | MIX | EN | | | | |
| OIG | MT | COL | EN | | | | |
| GAOKAO | MT | COL | CN | | | | |
| camel | MT | SI | EN | | | | |
| FLAN-Muffin | MT | COL | EN | | | | |
| COIG | MT | COL | CN | | | | |
| gpt4tools | MT | SI | EN | | | | |
| ShareGPT | MT | MIX | EN | | | | |
| Auto-CoT | MT | COL | EN | | | | |
| MOSS | TS | SI | EN/CN | | | | |
| ultrachat | TS | SI | EN | | | | |
| Chinese-medical | TS | COL | CN | | | | |
| CSL | MT | COL | CN | | | | |
| pCLUE | MT | COL | CN | | | | |
| news_commentary | TS | COL | CN | | | | |
| StackExchange | MT | COL | EN | | | | |
| ConvAI2 | TS | HG | EN | | | | |
| FastChat | MT | SI | EN | | | | |
| Tabular-LLM-Data | MT | COL | EN/CN | | | | |
| ThoughtSource | MT | COL | EN | | | | |