Table of Contents
Key Features: RL via the GRPO algorithm, the "aha" moment, distillation, and the limitations of MCTS and PRM in training R1
DeepSeek-R1-Zero is the first-generation reasoning model by DeepSeek AI, trained via reinforcement learning without supervised fine-tuning (SFT). Its primary limitations were poor readability and language mixing.
To reduce the training cost of reinforcement learning (RL), the critic model, which is typically the same size as the policy model, was omitted. Instead, Group Relative Policy Optimization (GRPO) was used, which estimates the baseline from group scores.
GRPO samples a group of outputs \(\{o_1, o_2, \dots, o_G\}\) from the old policy \(\pi_{\theta_{\text{old}}}\) and then optimizes the policy model \(\pi_{\theta}\) by maximizing the following objective:
\[\mathcal{J}_{GRPO}(\theta) = \mathbb{E} \left[ q \sim p(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O | q) \right]\] \[\frac{1}{G} \sum_{i=1}^{G} \left( \min \left( \frac{\pi_{\theta}(o_i | q)}{\pi_{\theta_{old}}(o_i | q)} A_i, \operatorname{clip} \left( \frac{\pi_{\theta}(o_i | q)}{\pi_{\theta_{old}}(o_i | q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right) - \beta \mathbb{D}_{KL} (\pi_{\theta} \| \pi_{ref}) \right)\] \[\mathbb{D}_{KL} (\pi_{\theta} \| \pi_{ref}) = \frac{\pi_{ref}(o_i | q)}{\pi_{\theta}(o_i | q)} - \log \frac{\pi_{ref}(o_i | q)}{\pi_{\theta}(o_i | q)} - 1\]
where \(\epsilon\) and \(\beta\) are hyper-parameters, and the advantage \(A_i\) is computed using a group of rewards \(\{r_1, r_2, \dots, r_G\}\) corresponding to the outputs within each group:
\[A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \dots, r_G\})}{\text{std}(\{r_1, r_2, \dots, r_G\})}\]
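As a concrete illustration of the update above, here is a minimal sketch (not the DeepSeek implementation) of how the group-relative advantage and the clipped GRPO objective could be computed for a single group of outputs; the function name, sequence-level log-probabilities, and the \(\epsilon\)/\(\beta\) values are assumptions for illustration.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one group of G sampled outputs.

    logp_new, logp_old, logp_ref: (G,) summed log-probs of each output under the
    current, old, and reference policies; rewards: (G,) rule-based rewards.
    eps (clip range) and beta (KL weight) are illustrative values, not the paper's.
    """
    # Group-relative advantage: normalize each reward within its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio pi_theta / pi_theta_old of each sampled output, clipped as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv

    # Unbiased estimator of D_KL(pi_theta || pi_ref) from the formula above.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # The objective is maximized, so its negation is returned as a loss.
    return -(torch.min(unclipped, clipped) - beta * kl).mean()


# Toy usage: one group of G = 4 outputs with binary rule-based rewards.
G = 4
logp_new = torch.randn(G, requires_grad=True)
loss = grpo_loss(logp_new, logp_new.detach(), torch.randn(G),
                 torch.tensor([1.0, 0.0, 0.0, 1.0]))
loss.backward()
```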
The reward determines the optimization direction of RL. A rule-based reward system is adopted, consisting mainly of two types of rewards:
Accuracy rewards: evaluate whether the response is correct.
Format rewards: require the model to enclose its reasoning process within <think> and </think> tags, i.e., responses must be appropriately annotated with <think> tags.
This is the training template for DeepSeek-R1-Zero. The prompt portion is replaced with the question during actual training:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
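The paper does not release the reward implementation, but since the format reward is rule-based, a check over the template's tags could be as simple as the following sketch; the regex pattern and the 0/1 scoring are illustrative assumptions.

```python
import re

# Hypothetical format reward: 1.0 if the completion follows the template's
# <think>...</think> <answer>...</answer> structure, else 0.0.
FORMAT_PATTERN = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

print(format_reward("<think> 2 + 2 = 4 </think> <answer> 4 </answer>"))  # 1.0
print(format_reward("The answer is 4."))                                  # 0.0
```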
The following figure depicts the performance trajectory of the DeepSeek-R1-Zero model on the AIME 2024 benchmark throughout the RL training process.
The following is a brief explanation of each benchmark:
AIME 2024: The American Invitational Mathematics Examination (AIME) is a 15-question, 3-hour test focusing on advanced high school mathematics, including algebra, geometry, and number theory. It serves as a qualifier for the United States Mathematical Olympiad. link
MATH-500: This dataset comprises 500 problems from the MATH benchmark, covering various topics such as algebra, calculus, and probability. It is designed to evaluate mathematical reasoning and problem-solving abilities of AI models. link
GPQA Diamond: The Graduate-Level Google-Proof Q&A Benchmark (GPQA) is a challenging dataset of 448 multiple-choice questions in biology, physics, and chemistry, crafted by domain experts. The “Diamond” subset represents the most difficult questions, assessing the advanced reasoning capabilities of AI models. link
LiveCodeBench: LiveCodeBench is a benchmark designed for evaluating large language models’ coding capabilities. It continuously collects new problems from platforms like LeetCode, AtCoder, and CodeForces, and assesses various aspects such as code generation, execution, and self-repair. link
CodeForces: CodeForces is a competitive programming platform that hosts regular contests where participants solve algorithmic problems under time constraints. It serves as a benchmark for evaluating coding and problem-solving skills, with a rating system to rank participants. link
The significance of the "aha moment" is that, given the right incentives, the model autonomously develops advanced problem-solving strategies through RL. The model learns to rethink its approach, expressed in an anthropomorphic tone ("aha!").
Two questions arise from the results of DeepSeek-R1-Zero.
The pipeline of DeepSeek-R1 attempts to address these questions. It consists of four stages, outlined as follows (stages 1–4).
To prevent instability in the early phases of RL training, DeepSeek-R1 incorporates a cold start phase by fine-tuning the base model with a small dataset of high-quality long CoT data. This dataset is constructed through multiple approaches, including:
By introducing cold-start data, DeepSeek-R1 achieves improved readability compared to DeepSeek-R1-Zero.
After fine-tuning DeepSeek-V3-Base on cold-start data, a large-scale reinforcement learning (RL) process similar to DeepSeek-R1-Zero was applied. This phase enhances the model’s reasoning abilities, particularly in coding, mathematics, science, and logical reasoning—domains with well-defined problems and clear solutions.
One challenge observed during training is language mixing in CoT responses, especially when RL prompts involve multiple languages. To address this, a language consistency reward was implemented, which measures the proportion of target-language words in the response. While ablation studies indicated a minor drop in performance, this alignment improves readability and better matches human preferences. The final reward combines reasoning accuracy and language consistency, and RL training continues until convergence.
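The paper does not specify how the proportion of target-language words is measured. Below is a minimal, hypothetical sketch that approximates it by script membership; the function name and script heuristics are assumptions, not the actual reward.

```python
import re

def language_consistency_reward(text: str, target: str = "en") -> float:
    """Rough proxy: fraction of whitespace-separated tokens written in the
    target language's script. The script heuristics are illustrative."""
    tokens = text.split()
    if not tokens:
        return 0.0
    if target == "en":
        # Count tokens composed only of ASCII characters.
        in_target = sum(1 for t in tokens if all(ord(c) < 128 for c in t))
    elif target == "zh":
        # Count tokens containing at least one CJK character.
        in_target = sum(1 for t in tokens if re.search(r"[\u4e00-\u9fff]", t))
    else:
        raise ValueError(f"unsupported target language: {target}")
    return in_target / len(tokens)

print(language_consistency_reward("The answer is 42."))             # 1.0
print(language_consistency_reward("The 答案 is 42.", target="en"))   # 0.75
```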
Once reasoning-oriented RL converges, the checkpoint is used to generate Supervised Fine-Tuning (SFT) data for the next training phase. Unlike the cold-start data, which primarily targets reasoning, this stage incorporates diverse data types to improve DeepSeek-R1’s general-purpose capabilities.
Reasoning Data: Prompts are curated, and multiple reasoning trajectories are generated using rejection sampling. Unlike earlier phases that relied solely on rule-based rewards, this stage integrates generative reward models by comparing model outputs with ground-truth data using DeepSeek-V3. Additionally, unreadable outputs with mixed languages, excessive length, or unnecessary code blocks are filtered out (see the sketch after this list). Ultimately, this process yields 600k high-quality reasoning samples.
Non-Reasoning Data: To enhance tasks such as writing, factual QA, self-cognition, and translation, 200k additional training samples are incorporated from DeepSeek-V3’s SFT dataset. For complex queries, an intermediate CoT is generated before answering, while for simple queries (e.g., “hello”), CoT is omitted.
Overall, DeepSeek-V3-Base is fine-tuned for two epochs using an 800k-sample dataset.
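As a rough illustration of the kind of filtering described above (not the actual pipeline), the following sketch keeps a sampled reasoning trajectory only if it passes a correctness judgment and simple readability checks; the length threshold, code-block rule, and language-mixing heuristic are assumptions.

```python
import re

def keep_reasoning_sample(completion: str, judged_correct: bool,
                          max_chars: int = 8192) -> bool:
    """Keep a sampled trajectory only if it is judged correct and readable.
    The threshold and heuristics below are illustrative assumptions."""
    if not judged_correct:            # rejection sampling: drop incorrect answers
        return False
    if len(completion) > max_chars:   # drop excessively long outputs
        return False
    if chr(96) * 3 in completion:     # drop outputs containing fenced code blocks
        return False
    # Very rough language-mixing check: reject responses mixing Latin and CJK scripts.
    has_latin = bool(re.search(r"[A-Za-z]", completion))
    has_cjk = bool(re.search(r"[\u4e00-\u9fff]", completion))
    return not (has_latin and has_cjk)
```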
To further align the model with human preferences, a secondary RL stage is introduced, refining both reasoning capabilities and alignment with user expectations. This phase integrates reward signals and diverse prompt distributions:
General Data: A reward model evaluates helpfulness and harmlessness in open-ended tasks, following the DeepSeek-V3 preference pipeline.
By combining reinforcement learning, reward models, and diverse data sources, DeepSeek-R1 is trained to excel in reasoning while maintaining strong alignment with human preferences.
To equip smaller models with reasoning capabilities similar to DeepSeek-R1, open-source models such as Qwen and Llama were fine-tuned using the 800k curated samples from stage 3 of the pipeline.
The base models used include:
Unlike DeepSeek-R1, the distilled models undergo only SFT; no additional RL stage is applied.
The “flagship model” DeepSeek-R1 was evaluated using a range of benchmarks covering math, coding, factual knowledge, and reasoning, including:
For the distilled R1 models, results were reported on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.
For long-output reasoning models, pass@k evaluation is used to reduce variability: k responses are sampled per question at temperature 0.6 and top-p 0.95, and pass@1 is computed as the average correctness across them. AIME 2024 additionally reports cons@64 (majority-vote) results.
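As an illustration, the sketch below computes pass@1 as the average correctness over the sampled responses and a cons@64-style majority vote for a single problem; the function names and the k = 4 example are hypothetical.

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1: average correctness over the k sampled responses for one problem."""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(sampled_answers, reference):
    """cons@k-style majority vote: 1 if the most frequent answer matches the reference."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return float(majority == reference)

# Hypothetical example with k = 4 samples for one problem whose answer is "42".
samples = ["42", "42", "17", "42"]
print(pass_at_1([a == "42" for a in samples]))  # 0.75
print(cons_at_k(samples, "42"))                 # 1.0
```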
DeepSeek-R1 is compared against strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, OpenAI-o1-1217, and for distilled models, QwQ-32B-Preview.
During the development of DeepSeek-R1, Process Reward Models (PRM) and Monte Carlo Tree Search (MCTS) were explored as training approaches but proved unsuccessful.
PRM aims to guide the model toward better reasoning by rewarding correct intermediate steps. However, it has three major limitations:
While PRM can be effective in reranking responses or guided searches, its computational overhead outweighs its benefits in large-scale reinforcement learning.
Inspired by AlphaGo and AlphaZero, MCTS was tested as a method to enhance reasoning by systematically exploring solution paths. The process involved tagging reasoning steps, using a value model to guide searches, and iteratively refining the training process. However, two major challenges emerged:
While MCTS can improve inference when used with a pre-trained value model, iteratively boosting model performance through self-search remains a challenge.
The key finding of this paper is that reasoning ability can be improved through reinforcement learning alone, without SFT.