Notes for DeepSeek Paper

Title: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning


1. Overview

1.1. DeepSeek-R1-Zero and DeepSeek-R1

Key Points:

1.2. Major Components

  1. Rule-based Reward Model (accuracy and format rewards; process-based reward models were tried and abandoned as an unsuccessful attempt)
  2. Reinforcement Learning (GRPO)
  3. Distillation (search algorithms such as MCTS are likewise discussed only as unsuccessful attempts)

2. Goal of the Paper

  1. To explore how large language models (LLMs) can develop reasoning capabilities without any supervised data, focusing on a purely RL-driven, self-evolution process.
  2. To use DeepSeek-V3-Base as the starting model and employ GRPO (Group Relative Policy Optimization) as the RL framework, thereby improving performance on reasoning tasks.
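
The group-relative advantage at the heart of GRPO can be sketched as follows. This is a minimal illustration of the critic-free normalization step only; the full GRPO objective (clipped policy ratios and the KL penalty) is omitted, and the function name is my own.

```python
def group_relative_advantages(rewards):
    """GRPO advantage sketch: for a group of G sampled outputs for one
    prompt, each output's advantage is its reward normalized by the
    group's mean and standard deviation, so no learned value model
    (critic) is needed."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Example: four sampled answers scored by a rule-based reward (1 = correct)
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers get positive advantage, incorrect ones negative, and the advantages in each group are centered around zero.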

3. DeepSeek-R1-Zero

3.1. Contributions

3.2. Distillation


4. Evaluation Results

Aspect | Performance
Reasoning | DeepSeek-R1 achieves 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217 and clearly exceeding DeepSeek-V3.
Knowledge | DeepSeek-R1 outperforms DeepSeek-V3 and other closed-source models on knowledge benchmarks such as MMLU and GPQA Diamond, while remaining slightly below OpenAI-o1-1217.

5. Approach Summary

  1. Large-Scale Reinforcement Learning: Applied directly on top of a base LLM.
  2. DeepSeek-R1-Zero: Base model + RL without any SFT data.
  3. DeepSeek-R1: RL starting from a checkpoint already fine-tuned with thousands of long chain-of-thought examples.
  4. Distillation: Reasoning ability is further distilled into smaller dense models from the main RL-trained checkpoint.
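
Step 4 (distillation) is plain supervised fine-tuning of smaller dense models on reasoning traces sampled from the RL-trained teacher; the paper uses roughly 800K such samples. A hedged sketch of how one training example might be assembled (the function and field names are my own):

```python
def make_sft_example(question: str, teacher_trace: str, teacher_answer: str) -> dict:
    """Package a teacher completion as a supervised fine-tuning pair:
    the student model learns to reproduce the teacher's full chain of
    thought plus final answer, wrapped in the paper's tag format."""
    target = f"<think> {teacher_trace} </think> <answer> {teacher_answer} </answer>"
    return {"prompt": question, "completion": target}

# Hypothetical distilled example
ex = make_sft_example("What is 7 * 8?", "7 * 8 = 56", "56")
```

No RL is applied to the distilled students in the paper; SFT on teacher traces alone transfers the reasoning behavior.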

5.1. DeepSeek-R1-Zero: RL on the Base Model

5.1.1. Reinforcement Learning Algorithm

5.1.2. Prompting Format

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in its mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively.

User: [prompt here] Assistant: <think> [reasoning process here] </think> <answer> [answer here] </answer>
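
The template above can be filled programmatically; a minimal sketch (the function name and constant are my own, the template text follows the paper):

```python
# R1-Zero training template from the paper, with a {prompt} placeholder
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in its mind and then provides the user with the "
    "answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively. "
    "User: {prompt} Assistant:"
)

def build_prompt(question: str) -> str:
    """Fill the template with a user question; generation continues
    from 'Assistant:', so the model emits its own tags and content."""
    return TEMPLATE.format(prompt=question)

p = build_prompt("What is 2 + 2?")
```

Note the template constrains only the output structure, not the content of the reasoning, which is what lets reasoning behavior emerge during RL rather than being imitated.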

5.1.3. Reward Model
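
The paper's reward for R1-Zero is rule-based: an accuracy reward (is the final answer correct, checked by rule) plus a format reward (are the reasoning and answer wrapped in the required tags). A hedged sketch, assuming exact string matching for accuracy and a simplified regex for format (the real checks are task-specific, e.g. verifying a boxed math result or running test cases for code):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think> tags and its
    answer in <answer> tags, else 0.0. Simplified stand-in for the
    paper's rule-based format check."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the text inside the <answer> tags matches the reference
    answer (exact match here for illustration), else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

c = "<think> 2 + 2 = 4 </think> <answer> 4 </answer>"
```

Because both signals are computable by rule, no neural reward model is trained, which the paper argues avoids reward hacking during large-scale RL.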

5.1.4. Training Template

5.2. DeepSeek-R1


6. News & Miscellaneous

6.1. Founder

6.2. GPU Fight

6.3. Privacy

6.4. Political Speech

6.5. Cost

6.6. Memory

6.7. “Uncensored” Version


End of DeepSeek Paper Notes