
Understanding the DeepSeek-R1-Zero reinforcement learning training process through TinyZero's data and source code

news 2025/7/15 13:57:39

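1. Introduction

TinyZero (reference 1) is a code repository in which a Berkeley PhD student reproduces DeepSeek-R1-Zero. It uses veRL to run reinforcement learning (RL), training Qwen2.5 models of 0.5B, 1.5B, 3B and other sizes on a number-game dataset, and reports good reasoning results.

The sections below walk through the key training logic in the source code.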

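2. Training process

Raw data

The raw data comes from reference 2. It contains about 490k records, each with only two fields, in the following format: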

{"nums": [95, 11, 56], "target": 28}

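This is a number game: apply basic arithmetic operations (+, -, *, /) to the numbers in nums, using each number only once, so that the final result equals target. In the example above, 95 - 11 - 56 = 28.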
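To make the rules of the game concrete, here is a minimal brute-force sketch that searches for one valid equation. The function name `find_equation` is hypothetical and is not part of TinyZero. The search only tries left-to-right groupings such as ((a op b) op c), so with four numbers it can miss solutions that need a different parenthesization.

```python
from fractions import Fraction
from itertools import permutations, product

# The four operations the game allows; division by zero yields None.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}


def find_equation(nums, target):
    """Return one left-to-right expression that equals target, or None."""
    for order in permutations(nums):  # each number is used exactly once
        for ops in product(OPS, repeat=len(nums) - 1):
            value = Fraction(order[0])  # exact arithmetic, no float rounding
            expr = str(order[0])
            for op, n in zip(ops, order[1:]):
                value = OPS[op](value, Fraction(n))
                if value is None:  # division by zero: abandon this candidate
                    break
                expr = f"({expr} {op} {n})"
            if value is not None and value == target:
                return expr
    return None


print(find_equation([95, 11, 56], 28))
```

For the example record this prints ((95 - 11) - 56), which matches the equation given above.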

Data processing

The full source is in reference 3; only the key steps are explained here.

(1) Training set and test set sizes

The defaults are:

parser.add_argument('--train_size', type=int, default=327680)
parser.add_argument('--test_size', type=int, default=1024)
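As a rough illustration of how these two sizes could carve a training set and a test set out of the raw data, the sketch below takes contiguous index ranges from the front of the dataset. The helper `split_indices`, the contiguous layout, and the round figure of 490,000 records are assumptions made for illustration; the actual script in reference 3 may select records differently.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train_size', type=int, default=327680)
parser.add_argument('--test_size', type=int, default=1024)
args = parser.parse_args([])  # empty argument list: fall back to the defaults


def split_indices(total, train_size, test_size):
    """Return (train, test) index ranges taken from the front of the data."""
    if train_size + test_size > total:
        raise ValueError("dataset is too small for the requested split")
    train = range(train_size)
    test = range(train_size, train_size + test_size)  # no overlap with train
    return train, test


# 490_000 stands in for the "about 490k" records mentioned above.
train_idx, test_idx = split_indices(490_000, args.train_size, args.test_size)
print(len(train_idx), len(test_idx))
```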

(2) Adding a prompt to the raw data

In the function below, dp is one raw record (see the example under Raw data):

def make_prefix(dp, template_type):
    target = dp['target']  # the target value
    numbers = dp['nums']  # the available numbers
    # Prompt used for the default (base) model
    if template_type == 'base':
        """This works for any base model"""
        prefix = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>"""
    # Prompt used for qwen-instruct models
    elif template_type == 'qwen-instruct':
        """This works for Qwen Instruct Models"""
        prefix = f"""<|im_start|>system\nYou are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>\n<|im_start|>user\n Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>\n<|im_start|>assistant\nLet me solve this step by step.\n<think>"""
    return prefix

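(3) Full processing of each record: adding the prompt, the reward information, and other fields

In the function below, example is one raw record (see the example under Raw data).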

def process_fn(example, idx):
    # Build the prompt, see step (2). split and data_source come from the enclosing scope.
    question = make_prefix(example, template_type=args.template_type)
    solution = {
        "target": example['target'],
        "numbers": example['nums'],
    }
    data = {
        "data_source": data_source,  # task name, 'countdown' by default
        "prompt": [{
            "role": "user",
            "content": question,  # the question wrapped in the prompt
        }],
        "ability": "math",
        "reward_model": {
            "style": "rule",
            "ground_truth": solution,  # holds target and numbers
        },
        "extra_info": {
            'split': split,
            'index': idx,
        },
    }
    return data

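Each final record is a JSON-like structure with fields such as prompt and reward_model.

Training

Part of the configuration from the training script in reference 4 is excerpted below: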
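The "style": "rule" entry indicates that the reward is computed by a rule from ground_truth rather than by a learned reward model. The sketch below shows what such a rule could look like for this task. It is an illustration written for this article, not the repository's own reward function: the names `extract_answer`, `uses_numbers_once`, `safe_eval`, and `compute_score`, as well as the score values 0.1 and 1.0, are assumptions.

```python
import ast
import operator
import re
from collections import Counter

# Only the four operations the task allows.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def extract_answer(response):
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return matches[-1].strip() if matches else None


def uses_numbers_once(equation, numbers):
    """True if the integers in the equation match numbers exactly (as a multiset)."""
    found = [int(n) for n in re.findall(r"\d+", equation)]
    return Counter(found) == Counter(numbers)


def safe_eval(equation):
    """Evaluate +, -, *, / arithmetic by walking the AST; None if invalid."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    try:
        return walk(ast.parse(equation, mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None


def compute_score(response, ground_truth, format_score=0.1, full_score=1.0):
    """0.0 without an answer tag, format_score if wrong, full_score if correct."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    if not uses_numbers_once(answer, ground_truth["numbers"]):
        return format_score
    value = safe_eval(answer)
    if value is None or abs(value - ground_truth["target"]) > 1e-5:
        return format_score
    return full_score


gt = {"target": 28, "numbers": [95, 11, 56]}
print(compute_score("<think>95 - 11 = 84</think> <answer> 95 - 11 - 56 </answer>", gt))
```

Giving a small score for a well-formed but wrong answer is one possible design choice: it rewards the model for following the output format before it learns to solve the task.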

python3 -m verl.trainer.main_ppo \
data.train_files=$DATA_DIR/train.parquet \
data.val_files=$DATA_DIR/test.parquet \
data.train_batch_size=256 \
data.val_batch_size=1312 \
data.max_prompt_length=256 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$BASE_MODEL \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size=8 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
critic.optim.lr=1e-5 \
critic.model.path=$BASE_MODEL \
critic.ppo_micro_batch_size=8 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=['wandb'] \
+trainer.val_before_train=False \
trainer.default_hdfs_dir=null \
trainer.n_gpus_per_node=$N_GPUS \
trainer.nnodes=1 \
trainer.save_freq=100 \
trainer.test_freq=100 \
trainer.project_name=TinyZero \
trainer.experiment_name=$EXPERIMENT_NAME \
trainer.total_epochs=15 2>&1 | tee verl_demo.log

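This command runs a Python module with python3 -m: it launches veRL's verl.trainer.main_ppo entry point to train the model with PPO (Proximal Policy Optimization), and 2>&1 | tee verl_demo.log copies both standard output and standard error to verl_demo.log.

Training with veRL (reference 5) requires specifying the data, the models, and the hyperparameters:

(1) Data configuration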

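data.train_files=$DATA_DIR/train.parquet: path to the training data file (Parquet format).
data.val_files=$DATA_DIR/test.parquet: path to the validation data file.
data.train_batch_size=256: batch size used during training.
data.val_batch_size=1312: batch size used during validation.
data.max_prompt_length=256: maximum length of the input prompt, in tokens.
data.max_response_length=1024: maximum length of the generated response, in tokens.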

(2) Actor, rollout, and reference model configuration

actor_rollout_ref.model.path=$BASE_MODEL: path to the pretrained weights of the Actor model.
actor_rollout_ref.actor.optim.lr=1e-6: learning rate of the Actor model.
actor_rollout_ref.actor.ppo_mini_batch_size=128: mini-batch size for the Actor's PPO updates.
actor_rollout_ref.actor.ppo_micro_batch_size=8: micro-batch size for the Actor's PPO updates.
actor_rollout_ref.rollout.log_prob_micro_batch_size=8: micro-batch size for computing log probabilities in the rollout stage.
actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE: tensor-parallel size in the rollout stage, that is, how many GPUs the model is sharded across.
actor_rollout_ref.rollout.gpu_memory_utilization=0.4: fraction of GPU memory that the rollout stage may use (here 40%).
actor_rollout_ref.ref.log_prob_micro_batch_size=4: micro-batch size for computing log probabilities with the reference model (ref model).

(3) Critic model configuration

critic.optim.lr=1e-5: learning rate of the Critic model.
critic.model.path=$BASE_MODEL: path to the pretrained weights of the Critic model.
critic.ppo_micro_batch_size=8: micro-batch size for the Critic's PPO updates.

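(4) Algorithm configuration

algorithm.kl_ctrl.kl_coef=0.001: coefficient of the KL (Kullback-Leibler) divergence term, which limits how far the policy drifts from the reference model and so keeps policy updates stable.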
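One common way a KL coefficient enters PPO training of language models is as a per-token penalty subtracted from the reward. The sketch below illustrates that general idea with plain Python lists. Whether veRL applies the penalty in exactly this form is an assumption, and `kl_penalized_rewards` is a hypothetical helper, not a veRL function.

```python
def kl_penalized_rewards(task_score, logprobs, ref_logprobs, kl_coef=0.001):
    """Per-token rewards: a KL penalty on every token, task score on the last.

    logprobs / ref_logprobs are the log probabilities that the current policy
    and the frozen reference model assign to each generated token.
    """
    if not logprobs or len(logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty sequences of equal length")
    # Simple KL estimate per token: log pi(token) - log pi_ref(token).
    rewards = [-kl_coef * (lp - ref_lp)
               for lp, ref_lp in zip(logprobs, ref_logprobs)]
    # The rule-based score for the whole response is credited to the final token.
    rewards[-1] += task_score
    return rewards


print(kl_penalized_rewards(1.0, [-0.5, -1.0], [-0.7, -1.0]))
```

With a coefficient as small as 0.001, the penalty is a gentle regularizer: the task score dominates the reward unless the policy moves very far from the reference model.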

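(5) Trainer configuration

trainer.logger=['wandb']: use Weights & Biases (WandB) for logging.
+trainer.val_before_train=False: do not run validation before training starts. The leading + is Hydra override syntax for adding a key that is not in the default config.
trainer.default_hdfs_dir=null: no HDFS directory is set (HDFS is a distributed file system).
trainer.n_gpus_per_node=$N_GPUS: number of GPUs used per node.
trainer.nnodes=1: number of nodes (single-node training).
trainer.save_freq=100: save the model every 100 steps.
trainer.test_freq=100: run validation on the test set every 100 steps.
trainer.project_name=TinyZero: WandB project name.
trainer.experiment_name=$EXPERIMENT_NAME: experiment name.
trainer.total_epochs=15: total number of training epochs.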
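The numbers in this configuration imply a training schedule that is easy to work out by hand. The sketch below does the arithmetic, assuming that each training step consumes one batch of train_batch_size prompts, that the default train_size of 327680 is used, and that each rollout batch is passed over once when split into mini-batches. These are assumptions about how the settings interact, not statements taken from the veRL documentation.

```python
# Values taken from the preprocessing defaults and the training command.
train_size = 327680
train_batch_size = 256
ppo_mini_batch_size = 128
total_epochs = 15
save_freq = 100

# Assumption: one training step consumes one batch of prompts.
steps_per_epoch = train_size // train_batch_size          # 1280
total_steps = steps_per_epoch * total_epochs              # 19200
# Assumption: each rollout batch is split into mini-batches and passed over once.
updates_per_step = train_batch_size // ppo_mini_batch_size  # 2
checkpoints = total_steps // save_freq                    # 192

print(steps_per_epoch, total_steps, updates_per_step, checkpoints)
```

Under these assumptions a full run is 19,200 steps, so save_freq=100 and test_freq=100 each fire 192 times.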

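Training results

After RL training, the model outputs its reasoning inside <think> </think> tags and then gives the final result inside <answer> </answer> tags.
[Figure: sample model output]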

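3. Summary

Walking through the concrete data, the preprocessing, and the training setup gives a better understanding of the reinforcement learning method behind DeepSeek-R1-Zero.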

4. References

(1) Project: https://github.com/Jiayi-Pan/TinyZero
(2) Data: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4
(3) Data processing source: https://github.com/Jiayi-Pan/TinyZero/blob/main/examples/data_preprocess/countdown.py
(4) Training source: https://github.com/Jiayi-Pan/TinyZero/blob/main/scripts/train_tiny_zero.sh
(5) veRL: https://verl.readthedocs.io/en/latest/start/quickstart.html
