SimPO: Simple Preference Optimization with a Reference-Free Reward

news 2025/6/26 6:27:06

https://github.com/princeton-nlp/SimPO

Simple code

import paddle


class simpo(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        # Kept from the original; forward does not use it.
        self.loss = paddle.nn.CrossEntropyLoss()

    def forward(self, neg_logit, neg_lab, pos_logit, pos_lab, beta, gamma):
        # Convert raw scores to log-probabilities over the vocabulary
        neg_logit = paddle.nn.functional.log_softmax(neg_logit, -1)
        pos_logit = paddle.nn.functional.log_softmax(pos_logit, -1)

        # Negative (rejected) sample: build [batch, position, token] indices
        batch_indices = paddle.arange(neg_lab.shape[0]).unsqueeze(1).tile([1, neg_lab.shape[1]])
        seq_indices = paddle.arange(neg_lab.shape[1]).unsqueeze(0).tile([neg_lab.shape[0], 1])
        indices = paddle.stack([batch_indices, seq_indices, neg_lab], axis=-1)
        # Use gather_nd to pick each label's log-probability, then average over the sequence
        neg_logit_selected = paddle.mean(paddle.gather_nd(neg_logit, indices), -1)

        # Positive (chosen) sample: same selection
        batch_indices = paddle.arange(pos_lab.shape[0]).unsqueeze(1).tile([1, pos_lab.shape[1]])
        seq_indices = paddle.arange(pos_lab.shape[1]).unsqueeze(0).tile([pos_lab.shape[0], 1])
        indices = paddle.stack([batch_indices, seq_indices, pos_lab], axis=-1)
        pos_logit_selected = paddle.mean(paddle.gather_nd(pos_logit, indices), -1)

        # Difference of average log-probabilities minus the scaled margin
        pi_logratios = pos_logit_selected - neg_logit_selected
        gamma_logratios = gamma / beta
        logits = pi_logratios - gamma_logratios
        # Sigmoid loss with a hard-coded weight of 0.3 on the flipped preference
        losses = (
            -paddle.nn.functional.log_sigmoid(beta * logits) * (1 - 0.3)
            - paddle.nn.functional.log_sigmoid(-beta * logits) * 0.3
        )
        # chosen_rewards = beta * pos_logit_selected
        # rejected_rewards = beta * neg_logit_selected
        return losses.mean()

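This code defines a class named simpo that inherits from paddle.nn.Layer. The constructor initializes a cross-entropy loss, loss, although forward never uses it.

The forward function is the layer's forward pass. It takes six arguments: neg_logit, neg_lab, pos_logit, pos_lab, beta, and gamma. Here neg_logit and pos_logit are the model's output scores for the negative (rejected) and positive (chosen) samples, and neg_lab and pos_lab are the corresponding label token ids.

Inside the function, log_softmax is first applied to neg_logit and pos_logit, which turns the scores into log-probabilities over the classes (not probabilities). Next, arange builds batch and position indices, and gather_nd uses them to pull out the log-probability of each label token. Taking the mean over the sequence dimension then gives one average log-probability per negative sample and per positive sample.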
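To make the selection step concrete, here is a minimal pure-Python sketch of the same idea for a single sequence: apply log-softmax per position, pick the label's entry, and average. The helper names (log_softmax, avg_label_logprob) and the toy numbers are made up for illustration and are not part of the original code.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def avg_label_logprob(logits, labels):
    """Mean log-probability of the label tokens for one sequence.

    logits: per-position score rows, shape [seq_len][vocab]
    labels: token ids, length seq_len
    """
    picked = [log_softmax(row)[tok] for row, tok in zip(logits, labels)]
    return sum(picked) / len(picked)

# Toy sequence: 2 positions, vocabulary of 3 tokens (made-up numbers).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
score = avg_label_logprob(logits, labels)
```

Note that the original forward averages over every position in the sequence, so with padded batches the padded positions would also count unless they are masked out. Masking is not shown here.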

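Next, pi_logratios is computed as the positive sample's average log-probability minus the negative sample's. Then gamma_logratios is computed as gamma divided by beta. Finally, subtracting gamma_logratios from pi_logratios gives logits, so beta * logits equals beta * pi_logratios - gamma.

The loss is computed from logits using the formula (-log_sigmoid(beta * logits) * (1 - 0.3) - log_sigmoid(-beta * logits) * 0.3), where log_sigmoid is the log of the sigmoid function. The hard-coded 0.3 plays the role of a label-smoothing weight: 70% of the loss treats the pair as labeled correctly and 30% treats the preference as flipped. Finally, mean averages the loss over the batch and the result is returned.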
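The margin and loss computation can be checked by hand on scalars. The sketch below mirrors the formula in forward for a single preference pair. The function names and the input values are hypothetical, and smoothing=0.3 simply reproduces the constant hard-coded in the code above.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def simpo_pair_loss(pos_avg_logp, neg_avg_logp, beta, gamma, smoothing=0.3):
    """Loss for one preference pair, following the formula in forward."""
    logits = (pos_avg_logp - neg_avg_logp) - gamma / beta
    return (-log_sigmoid(beta * logits) * (1 - smoothing)
            - log_sigmoid(-beta * logits) * smoothing)

# Made-up average log-probabilities and hyperparameters, for illustration only.
loss_good = simpo_pair_loss(-0.5, -1.5, beta=2.0, gamma=1.0)  # chosen clearly ahead
loss_tie = simpo_pair_loss(-1.0, -1.5, beta=2.0, gamma=1.0)   # beta * logits == 0
loss_bad = simpo_pair_loss(-1.5, -0.5, beta=2.0, gamma=1.0)   # rejected ahead
```

When beta * logits is exactly 0 the loss equals log 2 regardless of the smoothing weight. With a nonzero smoothing weight s, the per-pair loss is minimized where sigmoid(beta * logits) = 1 - s rather than by pushing the logits toward infinity, which is the usual effect of label smoothing.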

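Based on the paper, I put together the following outline:
1. Introduction
Background: learning from human feedback is key; RLHF is a popular approach, and DPO is a simple offline optimization algorithm.
Problem: DPO's training reward is not aligned with the metric used for generation at inference time, which can lead to suboptimal performance.
Contribution: the paper proposes SimPO, a simple and effective offline preference optimization algorithm that aligns the reward function directly with the generation metric, needs no reference model, and introduces a target reward margin to improve performance.
2. SimPO: Simple Preference Optimization
Background: an introduction to the DPO algorithm.
A simple reference-free reward: use the average log-likelihood as the reward, which aligns with the generation metric and needs no reference model.
The SimPO objective: derive the SimPO objective function and introduce the target reward margin.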
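For reference, the reward and objective named in section 2 can be written out as below. This is a sketch in standard preference-optimization notation, which is assumed here rather than taken from the outline: \pi_\theta is the policy, y_w and y_l are the chosen and rejected responses, |y| is the response length in tokens, \beta is a scaling constant, and \gamma is the target reward margin.

```latex
% Length-normalized, reference-free reward
r_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)
  = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i})

% Objective with target reward margin \gamma > 0
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) =
  -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x)
    - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x)
    - \gamma
  \right) \right]
```

In terms of the Paddle code, pos_logit_selected and neg_logit_selected correspond to the two length-normalized log-likelihoods, and beta * logits is the argument of the sigmoid. The code additionally mixes in a 0.3-weighted flipped term, which this formula does not include.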
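3. Experimental setup
Models and training settings: training uses Llama3 and Mistral, each in a Base and an Instruct setup.
Evaluation benchmarks: AlpacaEval 2, Arena-Hard, and MT-Bench.
Baselines: compared against DPO, IPO, KTO, ORPO, and R-DPO.
4. Experimental results
SimPO consistently and significantly outperforms the other methods on all benchmarks.
The Instruct setup brings a significant performance gain.
Both of SimPO's key design choices matter.
Length normalization prevents length exploitation.
The target reward margin affects performance.
An analysis of why SimPO outperforms DPO.
5. Related work
Work related to RLHF.
Work related to preference optimization.
6. Discussion
Conclusion.
Limitations and future work.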