
Refine Alpaca-CoT Config Files

This folder contains configuration files that allow users to easily and quickly refine the Alpaca-CoT dataset.

Preprocess

The raw data files can be downloaded from Alpaca-CoT on HuggingFace.
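If helpful, the snapshot can also be fetched programmatically. Below is a minimal sketch using huggingface_hub; the repo id and local directory are assumptions to verify against the Hugging Face page:

# Sketch: download the raw Alpaca-CoT files (repo id and paths are assumptions).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="QingyiSi/Alpaca-CoT",   # assumed dataset repo id; check on HuggingFace
    repo_type="dataset",
    local_dir="./Alpaca-CoT",        # becomes <Alpaca-CoT_src_dir> in the next step
)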

Convert raw Alpaca-CoT data to jsonl

Use raw_alpaca_cot_merge_add_meta.py to select the instruction, input and output columns, merge them into a single text field separated by a space, and add extra META info to the dataset:

python tools/preprocess/raw_alpaca_cot_merge_add_meta.py    \
    --src_dir             <Alpaca-CoT_src_dir>              \
    --target_dir          <target_dir>                      \
    --num_proc            <num_proc>
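Conceptually, each raw sample goes through a conversion like the one sketched below (illustrative only, not the actual script; the meta values shown are made up):

# Minimal sketch of the per-sample merge (illustrative, not the actual script).
def merge_sample(sample: dict, meta: dict) -> dict:
    # Join instruction, input and output into a single "text" field with spaces,
    # treating missing columns as empty, and attach the extra META info.
    parts = [sample.get(key, "") for key in ("instruction", "input", "output")]
    return {"text": " ".join(p for p in parts if p), "meta": meta}

raw = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(merge_sample(raw, {"Dataset": "alpaca"}))
# {'text': 'Translate to French. Good morning Bonjour', 'meta': {'Dataset': 'alpaca'}}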

Split datasets to sub-datasets by language

Use dataset_split_by_language.py to split the dataset into EN and ZH sub-datasets:

python tools/preprocess/dataset_split_by_language.py    \
    --src_dir             <src_dir>                     \
    --target_dir          <target_dir>                  \
    --suffixes            jsonl                         \
    --num_proc            <num_proc>
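The split routes each sample to the EN or ZH sub-dataset according to the language of its text field. The sketch below is a simplified stand-in that only checks for CJK characters; the actual script may rely on a proper language-identification model:

# Simplified language split (illustrative; real tooling may use a trained language identifier).
import json
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs range

def split_by_language(src_path, en_path, zh_path):
    with open(src_path) as src, open(en_path, "w") as en, open(zh_path, "w") as zh:
        for line in src:
            sample = json.loads(line)
            target = zh if CJK.search(sample.get("text", "")) else en
            target.write(json.dumps(sample, ensure_ascii=False) + "\n")

# split_by_language("alpaca_cot.jsonl", "alpaca_cot_en.jsonl", "alpaca_cot_zh.jsonl")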

Process

After preprocessing, modify the dataset path in alpaca-cot-en-refine.yaml and alpaca-cot-zh-refine.yaml, then execute the following commands to reproduce the processing flow of the refined Alpaca-CoT dataset.

# refine English dataset
python tools/process_data.py --config configs/refine_recipe/alpaca_cot/alpaca-cot-en-refine.yaml

# refine Chinese dataset
python tools/process_data.py --config configs/refine_recipe/alpaca_cot/alpaca-cot-zh-refine.yaml
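If you prefer to patch the path programmatically rather than editing the YAML by hand, something like the following works; the dataset_path and export_path keys and the file locations are assumptions to verify against the actual recipe file:

# Sketch: point the recipe at the preprocessed sub-dataset (key names assumed; verify in the recipe).
import yaml  # PyYAML

cfg_file = "configs/refine_recipe/alpaca_cot/alpaca-cot-en-refine.yaml"
with open(cfg_file) as f:
    cfg = yaml.safe_load(f)

cfg["dataset_path"] = "/path/to/alpaca_cot_en.jsonl"           # preprocessed EN data (placeholder)
cfg["export_path"] = "/path/to/alpaca_cot_en_refined.jsonl"    # output location (placeholder)

with open(cfg_file, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)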

Meta Info

Each sample in the refined Alpaca-CoT data contains the meta info listed below:

Alpaca-CoT original meta info

  • Language Tags:
    • EN: Instruction datasets in English
    • CN: Instruction datasets in Chinese
    • ML: [Multi-lingual] Instruction datasets in multiple languages
  • Task Tags:
    • MT: [Multi-task] Datasets containing multiple tasks
    • TS: [Task-specific] Datasets tailored for specific tasks
  • Generation-method:
    • HG: [Human Generated Dataset] Datasets created by humans
    • SI: [Self-Instruct] Datasets generated using self-instruct methods
    • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
    • COL: [Collection of Dataset] Dataset made from a collection of other datasets

Data-Juicer Meta info

  • Dataset: name of the source dataset in Alpaca-CoT
  • Multi-round Dialog (MRD): Multi-round Dialog datasets
  • IFT: Instruction Fine-Tuning datasets
  • SFT: Supervised Fine-Tuning datasets
  • Preference: Preference datasets
  • origin_path: original file path in Alpaca-CoT
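Taken together, the meta block attached to a refined sample looks roughly like this (a sketch; key names follow the tags above, and the values are illustrative and differ per source dataset):

# Illustrative meta block for one refined sample (example values, not real data).
meta = {
    "Dataset": "alpaca",        # source dataset name in Alpaca-CoT
    "Task": "MT",               # multi-task
    "Gen": "SI",                # generated via self-instruct
    "Lang": "EN",               # English
    "MRD": False,               # not a multi-round dialog dataset
    "IFT": False,
    "SFT": True,                # supervised fine-tuning data
    "Preference": False,
    "origin_path": "alpaca/alpaca_data.json",  # hypothetical origin file path
}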

Refined Alpaca-CoT dataset Meta info

| Dataset | Task | Gen | Lang | MRD | IFT | SFT | Preference |
|---|---|---|---|---|---|---|---|
| Chain-of-Thought | MT | HG | EN/CN | | | | |
| GPT4all | MT | COL | EN | | | | |
| GPTeacher | MT | SI | EN | | | | |
| Guanaco | MT | SI | ML | | | | |
| HC3 | TS | MIX | EN/CN | | | | |
| alpaca | MT | SI | EN | | | | |
| Natural-Instructions | MT | COL | ML | | | | |
| belle_cn | TS/MT | SI | CN | | | | |
| instinwild | MT | SI | EN/CN | | | | |
| prosocial-dialog | TS | MIX | EN | | | | |
| finance | TS | COL | EN | | | | |
| xP3 | MT | COL | ML | | | | |
| firefly | MT | COL | CN | | | | |
| instruct | MT | COL | EN | | | | |
| CodeAlpaca | TS | SI | EN | | | | |
| alpacaGPT4 | MT | SI | EN/CN | | | | |
| webGPT | TS | MIX | EN | | | | |
| dolly | TS | HG | EN | | | | |
| baize | MT | COL | EN | | | | |
| hh-rlhf | TS | MIX | EN | | | | |
| OIG | MT | COL | EN | | | | |
| GAOKAO | MT | COL | CN | | | | |
| camel | MT | SI | EN | | | | |
| FLAN-Muffin | MT | COL | EN | | | | |
| COIG | MT | COL | CN | | | | |
| gpt4tools | MT | SI | EN | | | | |
| ShareGPT | MT | MIX | EN | | | | |
| Auto-CoT | MT | COL | EN | | | | |
| MOSS | TS | SI | EN/CN | | | | |
| ultrachat | TS | SI | EN | | | | |
| Chinese-medical | TS | COL | CN | | | | |
| CSL | MT | COL | CN | | | | |
| pCLUE | MT | COL | CN | | | | |
| news_commentary | TS | COL | CN | | | | |
| StackExchange | MT | COL | EN | | | | |
| ConvAI2 | TS | HG | EN | | | | |
| FastChat | MT | SI | EN | | | | |
| Tabular-LLM-Data | MT | COL | EN/CN | | | | |
| ThoughtSource | MT | COL | EN | | | | |