Notes for DeepSeek Paper

Title: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

1. Overview

1.1. DeepSeek-R1-Zero and DeepSeek-R1

DeepSeek-R1-Zero: A model trained without supervised fine-tuning, but with reinforcement learning (RL).
DeepSeek-R1: A variant distilled from Qwen and Llama, incorporating supervised fine-tuning (SFT) data before RL.

Key Points:

Previous approaches suffered from:
Poor readability
Mixing languages
The DeepSeek approach uses multi-stage training and “cold-start” data before RL.
Performance is comparable to OpenAI-o1-1217 on reasoning tasks.
Post-training (RL) is crucial for improved accuracy on reasoning tasks, alignment with social values, and adaptation to user preferences—requiring fewer computation resources compared to full retraining.
Effective test-time scaling remains an open question for the research community.

1.2. Major Components

Process-based Reward Model
Reinforcement Learning
Search Algorithms

2. Goal of the Paper

To explore how large language models (LLMs) can develop reasoning capabilities without any supervised data, focusing on a purely RL-driven, self-evolution process.
To use DeepSeek V3 Base as the starting model and employ GRPO (Group Relative Policy Optimization) as the RL framework, thereby improving performance on reasoning tasks.

3. DeepSeek-R1 Zero

DeepSeek-R1 Zero applies RL directly to the DeepSeek V3-Base model, initially without supervised fine-tuning (SFT).
After RL converges, new SFT data is created (via rejection sampling on the RL checkpoint) and combined with supervised data from DeepSeek-V3 in domains such as writing and QA.
The DeepSeek V3-Base Model is then re-trained (fine-tuned) on this expanded dataset, followed by another RL step that includes prompts from all scenarios.
The final result is DeepSeek-R1, distilled from Qwen and Llama.

3.1. Contributions

Post-Training with RL: Directly applying RL to a base model without initial SFT.
Enhanced Chain of Thought: Encourages exploration and production of detailed reasoning steps.
Self-Verification & Reflection: DeepSeek-R1 Zero demonstrates self-verification, reflection, and generating advanced CoTs—an important milestone for the research community.
Two RL Stages:
Discover improved reasoning patterns.
Align with human preferences.
Two SFT Stages: Provide seeds for both reasoning and non-reasoning capabilities.

3.2. Distillation

Smaller, distilled models (from larger Qwen or Llama) perform surprisingly well.
Fine-tuned dense models outperform certain benchmarks.

4. Evaluation Results

Aspect	Performance
Reasoning Task	DeepSeek-R1 achieves 79.8, slightly surpassing OpenAI-o1-1217. Better than DeepSeek V3.
Knowledge	DeepSeek-R1 outperforms other closed-source models in various knowledge tests.

5. Approach Summary

Large-Scale Reinforcement Learning: Applied directly on top of a base LLM.
DeepSeek-R1 Zero: Base model + RL without any SFT data.
DeepSeek-R1: RL starting from a checkpoint already fine-tuned with thousands of long chain-of-thought examples.
Distillation: Reasoning ability is further distilled into smaller dense models from the main RL-trained checkpoint.

5.1. DeepSeek-R1-Zero: RL on the Base Model

Minimizes reliance on SFT data by gathering data from RL.
Emphasizes the potential of LLMs to develop reasoning abilities autonomously via self-evolution.

5.1.1. Reinforcement Learning Algorithm

Group Relative Policy Optimization (GRPO)

5.1.2. Prompting Format

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in its mind and then provides the user with the answer. The reasoning process and answer are enclosed within and tags, respectively.

User: [prompt here] Assistant: [reasoning process here] [answer here]

5.1.3. Reward Model

Accuracy Reward
Formatting Rewards (adhering to <think> & <answer> tags)

5.1.4. Training Template

Guides the base model to follow specified instructions.
“Aha moment”: An intermediate version of DeepSeek-R1-Zero learned to generate more structured outputs with reflection and verification.

5.2. DeepSeek-R1

Begins with a small SFT dataset (long chain-of-thought examples).
Uses few-shot prompting, self-generated CoT outputs (from DeepSeek-R1-Zero), and human annotation.
Retrains the base model, and finishes with an additional RL process considering all prompt scenarios.

6. News & Miscellaneous

6.1. Founder

Name: Wun Fung-Laing
Background: Graduated from Zhejiang University (Electrical Engineering) and holds a master’s in Communication Engineering.
Affiliation: High Flyer (Hangzhou-based hedge fund and AI company founded in 2015).

6.2. GPU Fight

Aim: Not to reduce chips but to make them more efficient.
American ban on advanced chips to China; rumor of ~50k GPUs in use.
Jevons Paradox: Increasing efficiency often leads to higher overall consumption.

6.3. Privacy

Information Collected:
Device model, OS, keystroke patterns, IP address, system language
Data from login, sign-up, or linked services
Shared with service providers, business partners, corporate groups, and for legal obligations
User Rights:
Know how personal data is collected and used
Access, change, oppose, or withdraw consent
Request a copy or deletion of data
Data Storage:
Collected and stored in China
Retention depends on data type

6.4. Political Speech

Filtered by default.
One-China Policy is the default stance.

6.5. Cost

Estimated at \$6M (excludes smaller runs, data generation, and DeepSeek R1 training transactions).

6.6. Memory

16 × 80 GB GPU memory in use.

6.7. “Unsensor” Version

An uncensored variant might exist in Perplexity (unverified rumor).

End of DeepSeek Paper Notes