(ENG) DeepSeek-R1 Paper Review

Table of Contents

  1. Introduction
  2. Approach
  3. Distillation
  4. Experiment
  5. Discussion
  6. Seoul National University NLP Lab Meeting Notes

1. Introduction

Key Features: RL via the GRPO algorithm, the "aha" moment, distillation, and the limitations of MCTS and PRM in training R1

DeepSeek-R1-Zero is the first-generation reasoning model by DeepSeek AI, trained via reinforcement learning (RL) without supervised fine-tuning (SFT). Its primary limitations were poor readability and language mixing (the practice of using two or more languages in the same conversation or sentence). To address these issues, DeepSeek-R1 was introduced in early 2025; it utilizes multi-stage training and cold-start data before RL.

2. Approach

2.1 Reinforcement Learning Algorithm on the Base Model (R-1 Zero)

GRPO

To reduce the training cost of reinforcement learning (RL), the critic model, which would normally be of equivalent size to the policy model, was omitted. Instead, Group Relative Policy Optimization (GRPO) was used, which estimates the baseline from group scores.

GRPO samples a group of outputs \(\{o_1, o_2, \dots, o_G\}\) from the old policy \(\pi_{\theta_{\text{old}}}\) and then optimizes the policy model \(\pi_\theta\) by maximizing the following objective:

\[\mathcal{J}_{GRPO}(\theta) = \mathbb{E} \left[ q \sim p(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O | q) \right]\] \[\frac{1}{G} \sum_{i=1}^{G} \left( \min \left( \frac{\pi_{\theta}(o_i | q)}{\pi_{\theta_{old}}(o_i | q)} A_i, \operatorname{clip} \left( \frac{\pi_{\theta}(o_i | q)}{\pi_{\theta_{old}}(o_i | q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right) - \beta \mathbb{D}_{KL} (\pi_{\theta} \| \pi_{ref}) \right)\] \[\mathbb{D}_{KL} (\pi_{\theta} \| \pi_{ref}) = \frac{\pi_{ref}(o_i | q)}{\pi_{\theta}(o_i | q)} - \log \frac{\pi_{ref}(o_i | q)}{\pi_{\theta}(o_i | q)} - 1\]

where \(\epsilon\) and \(\beta\) are hyper-parameters, and the advantage \(A_i\) is computed using the group of rewards \(\{r_1, r_2, \dots, r_G\}\) corresponding to the outputs within each group:

\[A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \dots, r_G\})}{\text{std}(\{r_1, r_2, \dots, r_G\})}\]
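
To make the objective concrete, here is a minimal PyTorch sketch of the group-normalized advantage and the clipped GRPO loss for a single group of \(G\) sampled outputs. The function name, tensor layout, and hyper-parameter values (`eps`, `beta`) are illustrative assumptions rather than the paper's implementation, which works with per-token log-probabilities.

```python
import torch

def grpo_loss(logprobs, old_logprobs, ref_logprobs, rewards, eps=0.2, beta=0.04):
    """GRPO loss for one group of G sampled outputs (illustrative sketch).

    logprobs, old_logprobs, ref_logprobs: log pi(o_i | q) under the current,
    old, and reference policies, each a tensor of shape (G,).
    rewards: scalar rewards r_i for the G outputs, shape (G,).
    """
    # Group-relative advantage: normalize rewards within the group (no critic needed).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio pi_theta / pi_theta_old.
    ratio = torch.exp(logprobs - old_logprobs)

    # PPO-style clipped surrogate.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)

    # KL estimator from the paper: r - log r - 1, with r = pi_ref / pi_theta.
    log_r = ref_logprobs - logprobs
    kl = torch.exp(log_r) - log_r - 1.0

    # Maximizing the objective is the same as minimizing its negative, averaged over the group.
    return -(surrogate - beta * kl).mean()
```

The key difference from standard PPO is that the baseline comes from group statistics rather than from a learned critic of the same size as the policy model.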

Rule-Based Reward System

The reward signal decides the optimization direction of RL. A rule-based reward system is adopted, consisting mainly of two types of rewards: accuracy rewards, which check the correctness of the final answer with deterministic rules (e.g., verifying a boxed math result or running test cases for code), and format rewards, which enforce that the model places its reasoning between <think> and </think> tags.
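
As a rough sketch of how such rewards could be implemented, assume completions follow the <think>/<answer> template shown below; the regexes, the exact-match answer check, and the additive combination are assumptions, not the paper's exact rules.

```python
import re

TEMPLATE_RE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think> ... </think> <answer> ... </answer> template."""
    return 1.0 if TEMPLATE_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference answer.

    A plain string comparison stands in for the paper's rule-based checks
    (e.g., verifying a boxed math result or running test cases for code).
    """
    match = TEMPLATE_RE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    # How the two signals are combined is an assumption; a simple sum is used here.
    return accuracy_reward(completion, reference_answer) + format_reward(completion)
```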

Training Template

This is the training template for DeepSeek-R1-Zero; the prompt placeholder is replaced with the actual question during training.

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
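
In code, constructing a training prompt from this template is just string substitution; the variable names and the example question below are illustrative.

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {question}. Assistant:"
)

# During training, the placeholder is replaced with the actual question.
prompt = R1_ZERO_TEMPLATE.format(question="What is the sum of the first 100 positive integers?")
```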

Performance, Self-evolution Process and Aha Moment

A figure in the paper depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout RL training: average pass@1 climbs steadily from 15.6% to 71.0%, and reaches 86.7% with majority voting, comparable to OpenAI-o1-0912.


Aha moment of DeepSeek-R1-Zero

The significance of the “aha moment” is that, given the right incentives, the model autonomously develops advanced problem-solving strategies through RL. The model learns to rethink its initial approach, expressing this in an anthropomorphic tone ("aha!").

2.2 Reinforcement Learning with Cold Start

Two questions arise from the results of DeepSeek-R1-Zero:

  1. Can reasoning performance be further improved or can convergence be accelerated by incorporating a small amount of high-quality data as a cold start?
  2. How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities?

The pipeline of DeepSeek-R1 attempts to address these questions. It consists of four stages, outlined as follows (Stages 1–4).

Stage 1: Cold Start

To prevent instability in the early phases of RL training, DeepSeek-R1 incorporates a cold-start phase by fine-tuning the base model with a small dataset of high-quality long CoT data. This dataset is constructed through multiple approaches, including:

  1. Few-shot prompting with a long CoT as an example
  2. Directly prompting models to generate detailed answers with reflection and verification
  3. Gathering DeepSeek-R1-Zero outputs in a readable format and refining them through post-processing by human annotators

By introducing cold-start data, DeepSeek-R1 achieves improved readability (how easy, accessible, or comprehensible the text is) and better performance than DeepSeek-R1-Zero. The structured output format includes a reasoning process followed by a summary, ensuring clear and coherent outputs.

Stage 2: Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold-start data, a large-scale reinforcement learning (RL) process similar to that used for DeepSeek-R1-Zero was applied. This phase enhances the model’s reasoning abilities, particularly in coding, mathematics, science, and logical reasoning: domains with well-defined problems and clear solutions.

One challenge observed during training is language mixing in CoT responses, especially when RL prompts involve multiple languages. To address this, a language consistency reward was introduced, which measures the proportion of target-language words in the response. While ablation studies indicated a minor drop in performance, this alignment improved readability and better matched human preference. The final reward combines reasoning accuracy and language consistency, and RL training continues until convergence.
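
A minimal sketch of such a reward, assuming the target language is English and approximating "target-language words" with simple character-class checks (the paper does not specify the detection rule, so this heuristic is an assumption):

```python
import re

def language_consistency_reward(cot_text: str, target: str = "en") -> float:
    """Proportion of words in the CoT that belong to the target language.

    Detection is a crude heuristic: English words are ASCII-alphabetic,
    Chinese words contain CJK characters.
    """
    words = [w.strip(".,!?;:()\"'") for w in cot_text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    if target == "en":
        hits = sum(bool(re.fullmatch(r"[A-Za-z][A-Za-z'\-]*", w)) for w in words)
    else:  # treat anything containing CJK characters as target-language
        hits = sum(bool(re.search(r"[\u4e00-\u9fff]", w)) for w in words)
    return hits / len(words)
```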

Stage 3: Rejection Sampling and Supervised Fine-Tuning

Once reasoning-oriented RL converges, the checkpoint is used to generate Supervised Fine-Tuning (SFT) data for the next training phase. Unlike the cold-start data, which primarily targets reasoning, this stage incorporates diverse data types to improve DeepSeek-R1’s general-purpose capabilities.

Overall, DeepSeek-V3-Base is fine-tuned for two epochs on a dataset of about 800k samples: roughly 600k reasoning samples collected via rejection sampling from the RL checkpoint, plus about 200k non-reasoning samples (e.g., writing, factual QA, and translation).
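
Schematically, the rejection-sampling step looks like the sketch below; `generate`, `is_correct`, and `is_readable` are hypothetical stand-ins for the RL checkpoint's sampler, the correctness check (rule-based or judged by DeepSeek-V3), and the readability filter that drops mixed-language or otherwise unreadable outputs.

```python
def collect_reasoning_sft_data(prompts, generate, is_correct, is_readable, num_samples=16):
    """Rejection sampling: keep only correct, readable completions for the next SFT round.

    generate(prompt, n)            -> list of n sampled completions from the RL checkpoint
    is_correct(prompt, completion) -> bool, correctness judged by rules or a judge model
    is_readable(completion)        -> bool, filters language mixing, rambling text, etc.
    """
    sft_examples = []
    for prompt in prompts:
        for completion in generate(prompt, num_samples):
            if is_correct(prompt, completion) and is_readable(completion):
                sft_examples.append({"prompt": prompt, "completion": completion})
    return sft_examples
```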

Stage 4: Reinforcement Learning for All Scenarios

To further align the model with human preferences, a secondary RL stage is introduced, refining both reasoning capability and alignment with user expectations. This phase combines multiple reward signals with diverse prompt distributions: rule-based rewards are reused for reasoning data, while reward models trained on human preference data score helpfulness and harmlessness for general data.

By combining reinforcement learning, reward models, and diverse data sources, DeepSeek-R1 is trained to excel in reasoning while maintaining strong alignment with human preferences.
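
One way to picture the combined reward in this stage is a dispatch on prompt type, reusing the rule-based reward for reasoning prompts and a learned preference reward model for general prompts; the interfaces below (`is_reasoning_prompt`, `preference_model.score`) are assumptions made for illustration.

```python
def all_scenarios_reward(prompt, completion, reference_answer,
                         is_reasoning_prompt, rule_based_reward, preference_model):
    """Reward for the final RL stage (illustrative sketch).

    Reasoning prompts (math, code, logic) reuse the rule-based reward from the
    earlier stage; general prompts are scored by a reward model trained on human
    preference data for helpfulness and harmlessness.
    """
    if is_reasoning_prompt(prompt):
        return rule_based_reward(completion, reference_answer)
    return preference_model.score(prompt, completion)
```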

3. Distillation

To equip smaller models with reasoning capabilities similar to DeepSeek-R1, open-source models such as Qwen and Llama were fine-tuned using the 800k curated samples from Stage 3 of the pipeline.

The base models used include Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct.

Unlike DeepSeek-R1, the distilled models undergo only SFT (supervised fine-tuning: a supervised approach in which the model is trained on provided reference answers; it is fast to train and relatively cheap computationally), and no RL is applied.
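
Since distillation here is plain SFT on teacher-generated samples, the training signal is just next-token cross-entropy on the completion tokens. Below is a minimal PyTorch sketch assuming a Hugging-Face-style causal LM that returns logits; the masking convention and names are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_lengths):
    """Next-token cross-entropy on completion tokens only (prompt tokens are masked).

    input_ids: (batch, seq_len) token ids of prompt + teacher-generated completion.
    prompt_lengths: (batch,) number of prompt tokens in each example.
    """
    logits = model(input_ids).logits                 # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]                 # predict token t+1 from position t
    shift_labels = input_ids[:, 1:].clone()

    # Mask labels that are still part of the prompt so only the completion is supervised.
    positions = torch.arange(shift_labels.size(1), device=input_ids.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_lengths.unsqueeze(1) - 1)
    shift_labels[prompt_mask] = -100                 # ignored by cross_entropy

    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)
```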

4. Experiment

The flagship model, DeepSeek-R1, was evaluated on a range of benchmarks covering math, coding, factual knowledge, and reasoning, including MMLU, GPQA Diamond, SimpleQA, AIME 2024, MATH-500, LiveCodeBench, Codeforces, and SWE-bench Verified.

For the distilled R1 models, results were reported on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.

Evaluation Prompts & Setup

For long-output reasoning models, the authors use pass@k evaluation to reduce variability; pass@1 is calculated with a sampling temperature of 0.6 and top-p of 0.95. AIME 2024 additionally reports cons@64 (majority-vote) results.
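
Concretely, pass@1 here is the average correctness over the k sampled responses, and cons@64 checks the majority-vote answer over 64 samples; a small sketch (the data layout is an assumption):

```python
from collections import Counter

def pass_at_1(correct_flags):
    """pass@1 = (1/k) * sum of per-sample correctness over k sampled responses."""
    return sum(correct_flags) / len(correct_flags)

def cons_at_k(sampled_answers, reference_answer):
    """cons@k: majority-vote answer over k samples, scored against the reference."""
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return float(majority_answer == reference_answer)

# Hypothetical example: 64 sampled answers to one AIME problem.
answers = ["42"] * 40 + ["41"] * 24
print(pass_at_1([a == "42" for a in answers]))   # 0.625
print(cons_at_k(answers, "42"))                  # 1.0
```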

Baselines

DeepSeek-R1 is compared against strong baselines, including DeepSeek-V3, Claude-3.5-Sonnet-1022, GPT-4o-0513, OpenAI-o1-mini, OpenAI-o1-1217, and, for the distilled models, QwQ-32B-Preview.

Generation Constraints

The maximum generation length is set to 32,768 tokens. Greedy decoding is avoided for long-output reasoning models because it leads to higher repetition rates and unstable results, so the sampling settings above (temperature 0.6, top-p 0.95) are used instead.

5. Discussion (Unsuccessful Attempts)

During the development of DeepSeek-R1, Process Reward Models (PRM) and Monte Carlo Tree Search (MCTS) were explored but proved unsuccessful for training.

Process Reward Model (PRM)

PRM aims to guide the model toward better reasoning by rewarding correct intermediate steps. However, it has three major limitations:

  1. Defining fine-grained reasoning steps (i.e., breaking a task into small, explicitly delimited sub-steps) in general reasoning is difficult.
  2. Determining the correctness of an intermediate step is challenging, as automated annotations are unreliable and manual annotation does not scale.
  3. PRM introduces reward hacking, requiring frequent retraining and additional resources.

While PRM can be effective in reranking responses or guided searches, its computational overhead outweighs its benefits in large-scale reinforcement learning.

Monte Carlo Tree Search (MCTS)

Inspired by AlphaGo and AlphaZero, MCTS was tested as a method to enhance reasoning by systematically exploring solution paths. The process involved tagging reasoning steps, using a value model to guide searches, and iteratively refining the training process. However, two major challenges emerged:

  1. The search space for token generation is exponentially larger than structured games like chess, making it difficult to scale effectively.
  2. The value model plays a crucial role in search quality, but training a reliable fine-grained value model is inherently difficult.

While MCTS can improve inference when used with a pre-trained value model, iteratively boosting model performance through self-search remains a challenge.

6. Seoul National University NLP Lab Meeting Notes

Paper Review Summary

Training Pipeline of DeepSeek-R1

  1. initial SFT w/ high-quality examples → collected from DeepSeek-R1-Zero
  2. RL focusing on reasoning tasks (coding, math, …)
  3. collection of new training data through rejection sampling & SFT on correct and readable samples in more general domains
  4. final RL across all task types: Rule-based reward or LLM feedback

The key finding of this paper is that reasoning ability can be improved through reinforcement learning alone, without SFT.

Limitations