This repository collects publicly available distillation datasets based on DeepSeek-R1, intended for researchers, students, and practitioners interested in exploring and enhancing the reasoning capabilities of large language models.
These datasets are distilled from DeepSeek-R1 (a loading sketch with the `datasets` library follows the list):
- simplescaling/s1K-1.1: the same 1,000 questions as in s1K, but with traces generated by DeepSeek-R1 instead; the authors report that these traces lead to much better performance.
- open-r1/OpenR1-Math-220k: a large-scale mathematical reasoning dataset of 220k problems from NuminaMath 1.5, each with two to four reasoning traces generated by DeepSeek-R1.
- bespokelabs/Bespoke-Stratos-17k: a reasoning dataset of questions, reasoning traces, and answers.
- open-r1/OpenThoughts-114k-math: a large-scale dataset focused on math.
- hw-hwei/MedThoughts-8K: a reasoning dataset focused on medical Q&A tasks.
- FreedomIntelligence/Medical-R1-Distill-Data: an SFT dataset distilled from DeepSeek-R1 (full-power version), based on verifiable medical problems from HuatuoGPT-o1. A Chinese version is available as FreedomIntelligence/Medical-R1-Distill-Data-Chinese.
- sequelbox/Raiden-DeepSeek-R1: a dataset of 62.9K creative-reasoning and analytic-reasoning responses, testing the limits of DeepSeek-R1's reasoning skills.
- LLaVA-R1-100k: a large-scale multimodal reasoning dataset; image descriptions are generated with GPT-4o, and the reasoning process is generated with the DeepSeek-R1 model.
- open-thoughts/OpenThoughts-114k: a synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles.
- Congliu/Chinese-DeepSeek-R1-Distill-data-110k: a Chinese open-source dataset distilled from the full-power R1 model, covering math, exams, STEM, and general knowledge.
- ServiceNow-AI/R1-Distill-SFT: a dataset distilled with DeepSeek-R1-32B.
- PrimeIntellect/SYNTHETIC-1-SFT-Data: a reasoning dataset (894K) obtained from DeepSeek-R1, generated with crowdsourced compute and annotated with diverse verifiers such as LLM judges and symbolic mathematics verifiers.
- Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B: generated by Meta's Llama 3.1 70B Instruct, Llama 3.3 70B Instruct, and deepseek-ai/DeepSeek-R1-Distill-Llama-70B using the Magpie framework.
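
Any of the Hugging Face datasets above can be loaded the same way. Below is a minimal inspection sketch using the `datasets` library; it assumes only that the chosen dataset has a `train` split, and it prints the column names rather than assuming a schema, since columns differ from dataset to dataset.

```python
# Minimal sketch: stream one example from a dataset listed above and
# inspect its schema. Requires `pip install datasets`.
from datasets import load_dataset

# Streaming avoids downloading all 220k examples up front.
ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)

first = next(iter(ds))
print(sorted(first.keys()))  # check the column names before building a pipeline
```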
The following reasoning datasets are distilled from other models, such as the GPT-4 series (a sketch for converting records into SFT chat format follows the list):
- facebook/natural_reasoning: a large-scale dataset for general reasoning tasks.
- FreedomIntelligence/medical-o1-reasoning-SFT: a medical reasoning dataset distilled from GPT-4o.
- AI-MO/NuminaMath-CoT: approximately 860k math problems, with each solution formatted in a Chain-of-Thought (CoT) manner.
- cognitivecomputations/dolphin-r1: an 800k-sample dataset similar in composition to the one used to train the DeepSeek-R1 Distill models.
- GAIR/LIMO: a reasoning dataset focused mainly on math.
- wangrongsheng/Kimi-K1.5-Distill-data: a small math dataset (531 samples) distilled from Kimi k1.5.
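
Whichever teacher model produced them, most of these datasets reduce to (question, reasoning trace, answer) records. The sketch below shows one common way to wrap such a record into chat-style SFT messages, with the trace inside `<think>` tags as R1-style outputs are typically formatted; the keys `question`, `reasoning`, and `answer` are hypothetical placeholders to be mapped onto each dataset's actual columns.

```python
# Hedged sketch: convert a distilled record into chat-style SFT messages.
# The keys `question`, `reasoning`, and `answer` are hypothetical
# placeholders; rename them to match the dataset you actually load.
def to_sft_messages(record: dict) -> list[dict]:
    # R1-style traces are commonly wrapped in <think> tags, followed by
    # the final answer in the same assistant turn.
    return [
        {"role": "user", "content": record["question"]},
        {
            "role": "assistant",
            "content": f"<think>\n{record['reasoning']}\n</think>\n\n{record['answer']}",
        },
    ]

print(to_sft_messages({
    "question": "What is 7 * 8?",
    "reasoning": "7 * 8 = 56.",
    "answer": "56",
}))
```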