
A Collection of Large-Model Alignment Algorithms (Part 1)


DPO

Theorem:
$$\max_{\mu} \; \mathbb{E}_{x \sim \mu(x)} \left\{ f(x) \right\} + H(\mu), \quad \text{s.t. } \sum_x \mu(x) = 1$$

$$\Rightarrow \mu^*(x) = \frac{e^{f(x)}}{\sum_x e^{f(x)}} \quad \text{(Boltzmann distribution)}$$

Proof:
$$\begin{aligned}
&\max_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ f(x) \right\} + H(\mu) \\
&= \max_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ f(x) - \log \mu(x) \right\} \\
&= \min_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ \log \mu(x) - \log \exp(f(x)) \right\} \\
&= \min_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ \log \frac{\mu(x)}{\exp(f(x))} \right\} \\
&= \min_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ \log \frac{\mu(x)/\bar{z}}{\exp(f(x))/\bar{z}} \right\}, \quad \bar{z} = \sum_x \exp(f(x)) \\
&= \min_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ \log \frac{\mu(x)}{\frac{1}{\bar{z}} \exp(f(x))} - \log \bar{z} \right\} \\
&= \min_{\mu} \underbrace{\mathbb{E}_{x \sim \mu(x)} \left\{ \log \frac{\mu(x)}{\frac{1}{\bar{z}} \exp(f(x))} \right\}}_{D_{\mathrm{KL}}\left( \mu(x) \,\|\, \frac{1}{\bar{z}} \exp(f(x)) \right)}
\end{aligned}$$

The constant $\log \bar{z}$ does not depend on $\mu$ and is dropped; the KL divergence is minimized (to zero) when the two distributions coincide, which gives:
$$\Rightarrow \mu^*(x) = \frac{1}{\bar{z}} \exp(f(x)) = \frac{e^{f(x)}}{\sum_x e^{f(x)}}$$
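As a quick sanity check of the theorem (not part of the original derivation), the NumPy sketch below compares the objective value attained by the softmax distribution against many random distributions on the simplex; the score vector `f` is made up.

```python
# Numeric check: among distributions mu over a finite set, softmax(f) maximizes
# E_{x~mu}[f(x)] + H(mu). The score vector f is arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)                                   # arbitrary scores f(x) over 5 outcomes

def objective(mu, f):
    """E_{x~mu}[f(x)] + H(mu), treating 0*log(0) as 0."""
    nz = mu > 0
    return float(np.dot(mu, f) - np.sum(mu[nz] * np.log(mu[nz])))

mu_star = np.exp(f) / np.exp(f).sum()                    # Boltzmann / softmax distribution

# Compare against many random points on the probability simplex.
best_random = max(objective(rng.dirichlet(np.ones(5)), f) for _ in range(10_000))
print(objective(mu_star, f) >= best_random)              # True: softmax attains the maximum
```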

$$\begin{aligned}
\text{RLHF:}\quad &\max_{\pi_\theta} \mathbb{E}_{x \sim D}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ r_\phi(x, y) \right\} - \beta\, D_{\mathrm{KL}}\left( \pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x) \right) \\
&= \mathbb{E}_{x \sim D} \max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right\} \\
&= \mathbb{E}_{x \sim D} \max_{\pi_\theta} \left\{ \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ r_\phi(x, y) + \beta \log \pi_{\text{ref}}(y|x) \right\} + \beta\, H(\pi_\theta(y|x)) \right\} \\
&\Longleftarrow \max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(y|x)} \Big\{ \underbrace{\tfrac{1}{\beta} r_\phi(x, y) + \log \pi_{\text{ref}}(y|x)}_{f(y|x)} \Big\} + \underbrace{H(\pi_\theta(y|x))}_{H(\mu(y|x))}
\end{aligned}$$

$$\Rightarrow \pi_\theta^*(y|x) = \frac{e^{\frac{1}{\beta} r_\phi(x, y)} \cdot \pi_{\text{ref}}(y|x)}{Z(x)}
\quad\Rightarrow\quad
g^*(x, y) = \beta \log \frac{\pi_\theta^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

$$\Rightarrow \text{BT:}\quad p(y_m > y_n \mid x) = \sigma \left\{ \beta \left( \log \frac{\pi_\theta(y_m|x)}{\pi_{\text{ref}}(y_m|x)} - \log \frac{\pi_\theta(y_n|x)}{\pi_{\text{ref}}(y_n|x)} \right) \right\}$$
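The BT expression above leads directly to the sequence-level DPO loss. Below is a minimal NumPy sketch under assumed inputs, namely the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; all names and numbers are hypothetical, not a production implementation.

```python
# Sketch of the sequence-level DPO loss implied by the BT expression above:
# loss = -log sigmoid( beta * [ (log pi(y_w|x) - log pi_ref(y_w|x))
#                               - (log pi(y_l|x) - log pi_ref(y_l|x)) ] )
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.logaddexp(0.0, -margin)        # numerically stable -log(sigmoid(margin))

# Hypothetical summed log-probabilities for one preference pair.
print(dpo_loss(logp_w=-12.3, logp_l=-15.1, ref_logp_w=-13.0, ref_logp_l=-14.2))
```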


Token-level PPO:

$$\max_{\theta} \mathbb{E}_{x \sim D}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ \sum_{t=1}^T r(s_t, y_t) \right\} - \beta\, D_{\mathrm{KL}}\left( \pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x) \right)$$

where the state is $s_t = (x, y^{<t})$ and the action is the token $y_t$.

$$\begin{aligned}
&= \max_{\theta} \mathbb{E}_{x \sim D}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ \sum_{t=1}^T r(s_t, y_t) - \beta \log \frac{\pi_\theta(y_t \mid y^{<t}, x)}{\pi_{\text{ref}}(y_t \mid y^{<t}, x)} \right\} \\
&= \mathbb{E}_{x \sim D} \max_{\theta} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ \sum_{t=1}^T \underbrace{r(s_t, y_t) + \beta \log \pi_{\text{ref}}(y_t \mid s_t)}_{r'(s_t, y_t)} - \beta \log \pi_\theta(y_t \mid s_t) \right\} \\
&= \mathbb{E}_{x \sim D} \max_{\theta} \sum_{t=1}^T \mathbb{E}_{y_t \sim \pi_\theta(y_t|s_t)} \left\{ r'(s_t, y_t) - \beta \log \pi_\theta(y_t \mid s_t) \right\} \\
&\Longleftarrow \max_{\theta} \sum_{t=1}^T \left\{ \mathbb{E}_{y_t \sim \pi_\theta(y_t|s_t)} \left\{ r'(s_t, y_t) \right\} + \beta\, H(\pi_\theta(y_t \mid s_t)) \right\} \quad \text{(max-entropy soft Q-learning)}
\end{aligned}$$
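Since the sequence-level KL decomposes into per-token log-ratios, the objective can be rewritten with the shaped reward $r'(s_t, y_t) = r(s_t, y_t) + \beta \log \pi_{\text{ref}}(y_t \mid s_t)$. The NumPy sketch below, with hypothetical per-token numbers for a single sampled response, checks that the two ways of writing the objective agree on that sample.

```python
# Check (for one sampled response) that
#   sum_t r(s_t,y_t) - beta * [log pi_theta(y|x) - log pi_ref(y|x)]
# equals
#   sum_t ( r'(s_t,y_t) - beta * log pi_theta(y_t|s_t) ),  with r' = r + beta*log pi_ref.
# All per-token arrays are hypothetical.
import numpy as np

beta = 0.1
logp_theta = np.array([-2.1, -0.7, -1.5, -0.3])     # log pi_theta(y_t | s_t)
logp_ref   = np.array([-2.4, -0.9, -1.2, -0.5])     # log pi_ref(y_t | s_t)
r          = np.array([ 0.0,  0.0,  0.0,  1.0])     # e.g. a terminal reward on the EOS token

lhs = r.sum() - beta * np.sum(logp_theta - logp_ref)       # reward minus beta * KL term
rhs = np.sum(r + beta * logp_ref - beta * logp_theta)      # shaped-reward (entropy) form
print(np.isclose(lhs, rhs))                                # True
```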

Optimizing $\pi_\theta(y_t \mid s_t)$ at time step $t$:

$$= \max_{\theta}\; \underbrace{\mathbb{E}_{y_t \sim \pi_\theta(y_t|s_t)}}_{\mu} \Bigg\{ \underbrace{r'(s_t, y_t) + \sum_{k=t+1}^T \Big\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \Big\}}_{Q^*(s_t, y_t),\ \text{playing the role of } \beta f(y_t|s_t) \text{ in the theorem}} + \beta H(\pi_\theta(y_t|s_t)) \Bigg\}$$

$$\Rightarrow \pi_\theta^*(y_t|s_t) = \frac{ \exp \left\{ \frac{1}{\beta} r'(s_t, y_t) + \frac{1}{\beta} \sum_{k=t+1}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\} \right\} }{Z(s_t)}$$

where we define

$$\frac{1}{\beta}Q^*(s_t,y_t) = \frac{1}{\beta} r'(s_t, y_t) + \frac{1}{\beta} \sum_{k=t+1}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\}$$

$$\begin{aligned} Z(s_t) &= \sum_{y_t} \exp \left\{ \frac{1}{\beta} Q^*(s_t, y_t) \right\} \\ &= \exp \left\{ \frac{1}{\beta} V^*(s_t) \right\} \end{aligned}$$

📌 Notes:
  • $\sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)$: the sum of exponentiated soft Q-values over all possible actions $y_t$, i.e. the partition function, written $Z(s_t)$.
  • $V^*(s_t)$: the soft value function, defined as $V^*(s_t) = \beta \log \sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)$, so that $\sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right) = \exp\left( \frac{1}{\beta} V^*(s_t) \right)$.
  • This identity is the key step in deriving the soft Bellman equation in maximum-entropy RL; it normalizes the optimal policy: $\pi_\theta^*(y_t|s_t) = \frac{\exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)}{\sum_{y_t'} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t') \right)} = \frac{\exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)}{\exp\left( \frac{1}{\beta} V^*(s_t) \right)}$.

where:

  • $Z(s_t)$ is the normalizing constant (partition function).
  • This form exhibits the recursive structure of the optimal policy, analogous to the policy updates in soft Q-learning / maximum-entropy RL.

$$\left\{ \begin{aligned} \pi_\theta^*(y_t|s_t) &= \exp \left\{ \frac{1}{\beta} \left( Q^*(s_t, y_t) - V^*(s_t) \right) \right\} \\ Q^*(s_t, y_t) &= r'(s_t, y_t) + \sum_{k=t+1}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\} \end{aligned} \right.$$

$$\Rightarrow V^*(s_t) = \beta \log \left\{ \sum_{y_t} \exp \left\{ \frac{1}{\beta} Q^*(s_t, y_t) \right\} \right\}$$

$$\Rightarrow Q^* - V^* = \beta \log \pi_\theta^*$$


📌 Notes:
  • $\pi_\theta^*(y_t|s_t)$: the optimal policy (a probability distribution over the action $y_t$ in state $s_t$); it is a Boltzmann distribution.

    • This is the classic maximum-entropy RL form: the policy is proportional to $\exp\left(\frac{1}{\beta}(Q - V)\right)$.
  • $Q^*(s_t, y_t)$: the soft Q-function, i.e. the expected cumulative return (including entropy terms) after taking action $y_t$ in state $s_t$.

    • It consists of the immediate reward $r'(s_t, y_t)$ plus the expected future return (the rewards and entropy of subsequent states).
  • $V^*(s_t)$: the soft value function, i.e. the maximal expected cumulative return (with entropy) from state $s_t$:

    $$V^*(s_t) = \max_{\pi} \mathbb{E}_{\pi} \left[ \sum_{k=t}^T r'(s_k, y_k) + \beta H(\pi(y_k|s_k)) \right]$$

    • Under the optimal policy,

      $$V^*(s_t) = \beta \log \sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)$$

      i.e. a log-sum-exp (softmax) over all possible actions.

  • $Q^* - V^*$: this difference equals exactly $\beta \log \pi_\theta^*$:

    $$Q^*(s_t, y_t) - V^*(s_t) = \beta \log \pi_\theta^*(y_t|s_t)$$

    • This is a key identity in maximum-entropy RL and the core relation of the soft Bellman equation; a numeric sketch of these relations follows this list.
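A small NumPy sketch of these identities for a single state, with made-up Q values: $V^*$ is the $\beta$-scaled log-sum-exp of $Q^*/\beta$, $\pi^*$ is the corresponding softmax, and $Q^* - V^* = \beta \log \pi^*$ holds elementwise.

```python
# Numeric sketch of the soft-value identities for one state s_t.
# The soft Q values over 4 candidate tokens are made up for illustration.
import numpy as np

beta = 0.5
Q = np.array([1.2, -0.3, 0.7, 0.1])                  # hypothetical Q*(s_t, y_t)

def log_sum_exp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

V = beta * log_sum_exp(Q / beta)                     # V*(s_t) = beta * log sum_y exp(Q*/beta)
pi = np.exp((Q - V) / beta)                          # pi*(y_t|s_t) = exp((Q* - V*)/beta)

print(np.isclose(pi.sum(), 1.0))                     # the Boltzmann policy is normalized
print(np.allclose(Q - V, beta * np.log(pi)))         # Q* - V* = beta * log pi*
```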

Simplifying $Q^*$:

$$\begin{aligned}
Q^*(s_t, y_t) &= r'(s_t, y_t) + \mathbb{E}_{y_{t+1} \sim \pi_\theta^*(y_{t+1}|s_{t+1})} \left\{ r'(s_{t+1}, y_{t+1}) \right\} + \beta H(\pi_\theta^*(y_{t+1}|s_{t+1})) \\
&\quad + \sum_{k=t+2}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\} \\
&= r'(s_t, y_t) + \underbrace{\beta\, \mathbb{E}_{y_{t+1} \sim \pi_\theta^*(y_{t+1}|s_{t+1})} \left\{ \log \frac{\exp\left( \frac{1}{\beta} Q^*(s_{t+1}, y_{t+1}) \right)}{\pi_\theta^*(y_{t+1}|s_{t+1})} \right\}}_{\text{expected return plus entropy at step } t+1} \\
&= r'(s_t, y_t) + \beta\, \mathbb{E}_{y_{t+1} \sim \pi_\theta^*(y_{t+1}|s_{t+1})} \left\{ \log \exp\left( \frac{1}{\beta} V^*(s_{t+1}) \right) \right\} = r'(s_t, y_t) + V^*(s_{t+1})
\end{aligned}$$

$$\Rightarrow Q^*(s_t, y_t) = r'(s_t, y_t) + V^*(s_{t+1})$$

where:

$$Q^*(s_t, y_t) = \begin{cases} r(s_t, y_t) + \beta \log \pi_{\text{ref}}(y_t|s_t) + V^*(s_{t+1}), & y_t \neq \text{EOS} \\ r(s_t, y_t) + \beta \log \pi_{\text{ref}}(y_t|s_t), & y_t = \text{EOS} \end{cases} \tag{*}$$
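The case split in (*) suggests a backward recursion over the response. The sketch below (a simplification, not code from any of the referenced papers) assumes for illustration that every candidate token leads to the same next state, so $V^*(s_{t+1})$ is a single scalar per step; the per-token rewards and reference log-probabilities are randomly generated stand-ins.

```python
# Backward soft-Bellman recursion from (*): at the EOS step Q* = r + beta*log pi_ref,
# and earlier Q*(s_t,y_t) = r + beta*log pi_ref + V*(s_{t+1}),
# with V*(s_t) = beta * logsumexp_y( Q*(s_t,y)/beta ).
# Simplification: all candidate tokens at step t share the same next state.
import numpy as np

beta = 0.1
T, A = 4, 3                                              # 4 steps, 3 candidate tokens per step
rng = np.random.default_rng(1)
r = rng.normal(scale=0.1, size=(T, A))                   # stand-in r(s_t, y_t) per candidate
logp_ref = np.log(rng.dirichlet(np.ones(A), size=T))     # stand-in log pi_ref(y_t | s_t)

V_next = 0.0                                             # no soft value beyond the EOS step
for t in reversed(range(T)):
    Q_t = r[t] + beta * logp_ref[t] + V_next             # at t = T-1 this is the EOS case of (*)
    V_t = beta * np.log(np.sum(np.exp(Q_t / beta)))      # V*(s_t)
    pi_t = np.exp((Q_t - V_t) / beta)                    # pi*(y_t | s_t)
    V_next = V_t

print(pi_t.round(3), round(float(pi_t.sum()), 3))        # policy at t = 0 sums to 1
```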


BT-Model:

$$P(y_w > y_l \mid x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))} = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
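A one-line numeric check (with made-up reward values) that the softmax form and the sigmoid form above agree:

```python
# exp(r_w) / (exp(r_w) + exp(r_l)) == sigmoid(r_w - r_l); reward values are made up.
import numpy as np

r_w, r_l = 1.3, -0.4
p_softmax = np.exp(r_w) / (np.exp(r_w) + np.exp(r_l))
p_sigmoid = 1.0 / (1.0 + np.exp(-(r_w - r_l)))
print(np.isclose(p_softmax, p_sigmoid))                  # True
```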

Token-level BT

Summing the token-level rewards along a full response and telescoping the value terms (with $V^*(s_{T+1}) = 0$ at the terminal EOS step, by the second case of (*)):

$$\sum_{t=1}^{T} r(s_t, y_t) = \sum_{t=1}^{T} \left( \beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)} + V^*(s_t) - V^*(s_{t+1}) \right) = V^*(s_1) + \beta \sum_{t=1}^{T} \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}$$


Token-level DPO:

$$\begin{aligned}
\Rightarrow P(y_w > y_l \mid x) &= \sigma \left( \sum_{t=1}^{T_1} r(s_t^w, y_t^w) - \sum_{t=1}^{T_2} r(s_t^l, y_t^l) \right), \quad s_1^w = s_1^l = x \\
&= \sigma \left( \sum_{t=1}^{T_1} \beta \log \frac{\pi_\theta^*(y_t^w|s_t^w)}{\pi_{\text{ref}}(y_t^w|s_t^w)} - \sum_{t=1}^{T_2} \beta \log \frac{\pi_\theta^*(y_t^l|s_t^l)}{\pi_{\text{ref}}(y_t^l|s_t^l)} \right)
\end{aligned}$$

Since both responses start from the same prompt ($s_1^w = s_1^l = x$), the $V^*(s_1)$ terms from the telescoped sums cancel, leaving only the log-ratio terms.

📌 Notes:
  • Token-level DPO: Direct Preference Optimization (DPO) applied at the level of individual tokens, i.e. at each generation step.
  • $P(y_w > y_l \mid x)$: the probability that, given input $x$, the response $y_w$ (winner) is preferred over $y_l$ (loser).
  • $r(s_t, y_t)$: the reward function, typically given by a reward model. In the maximum-entropy framework it can be replaced by $\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}$ (the two are equivalent up to reward shaping; see the next section).
  • $\pi_\theta^*(y_t|s_t)$: the optimal policy, a Boltzmann distribution derived via soft Q-learning / maximum-entropy RL.
  • $\pi_{\text{ref}}(y_t|s_t)$: the reference policy (e.g. the initial language model), used for regularization.
  • $\sigma(z)$: the sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$.
  • $s_1^w = s_1^l = x$: both sequences start from the same input $x$.
  • Final form: the total reward difference decomposes into a sum of per-token log-likelihood ratios, which is exactly the loss form used in DPO; a loss sketch follows this list.
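A minimal NumPy sketch of this token-level DPO loss with hypothetical per-token log-probabilities; because the per-token log-ratios are simply summed, it coincides with the sequence-level DPO loss once the sums run over whole responses.

```python
# Token-level DPO loss sketch: the implicit reward of a response is the sum over its
# tokens of beta * log( pi_theta(y_t|s_t) / pi_ref(y_t|s_t) ); the loss is
# -log sigmoid( beta * (sum of log-ratios of y_w - sum of log-ratios of y_l) ).
# Per-token log-probabilities are hypothetical; responses may differ in length.
import numpy as np

def token_level_dpo_loss(logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l, beta=0.1):
    reward_w = np.sum(logp_theta_w - logp_ref_w)          # sum of log-ratios for y_w
    reward_l = np.sum(logp_theta_l - logp_ref_l)          # sum of log-ratios for y_l
    margin = beta * (reward_w - reward_l)
    return np.logaddexp(0.0, -margin)                     # stable -log(sigmoid(margin))

loss = token_level_dpo_loss(
    logp_theta_w=np.array([-1.2, -0.4, -0.9]), logp_ref_w=np.array([-1.5, -0.6, -1.0]),
    logp_theta_l=np.array([-1.1, -2.3, -0.8, -1.7]), logp_ref_l=np.array([-1.0, -2.0, -0.9, -1.6]),
)
print(loss)
```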

Reward equivalence

$$(*) \Rightarrow \underbrace{\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}}_{\tilde{r}(s_t, y_t)} = r(s_t, y_t) + \underbrace{V^*(s_{t+1}) - V^*(s_t)}_{\Phi(s_{t+1}) - \Phi(s_t)} \quad \text{(reward shaping)}$$

$$\Rightarrow \tilde{r} \text{ and } r \text{ are equivalent.}$$

$$\underbrace{\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}}_{\text{token-level reward}} \;\Rightarrow\; \text{can be generalized to a step-wise reward}$$

📌 Notes:
  • $\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}$: the token-level reward defined under the maximum-entropy RL framework, i.e. the reward attributed to each generated token.
  • $r(s_t, y_t)$: the original reward (e.g. from a reward model).
  • $V^*(s_t)$: the soft value function, the maximal expected return (with entropy) starting from state $s_t$.
  • $\Phi(s_t) = V^*(s_t)$: the potential function used for reward shaping.
  • Reward shaping:
    • By the potential-based reward-shaping theorem (Ng et al.), if the reward is modified to $\tilde{r}(s_t, y_t) = r(s_t, y_t) + \Phi(s_{t+1}) - \Phi(s_t)$, the optimal policy is unchanged.
    • Here $\tilde{r} = \beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}$ is therefore a reward equivalent to the original reward $r$.
  • Conclusion:
    • $\tilde{r}$ and $r$ are equivalent as optimization targets; a numeric check follows this list.
    • This token-level reward can thus be generalized to a step-wise reward for training language models.
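A numeric check of this equivalence (all numbers hypothetical): adding $\Phi(s_{t+1}) - \Phi(s_t)$ to each per-token reward telescopes, so when both responses share $\Phi(s_1)$ and the terminal potential is zero, the BT preference probability is unchanged.

```python
# Potential-based reward shaping check: the shaped per-token reward
# r~_t = r_t + Phi(s_{t+1}) - Phi(s_t) telescopes, so the return difference between two
# responses (and hence sigma of it) is unchanged when they share Phi(s_1) and the
# terminal potential is 0. All values are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def shaped_return(r, phi):
    """sum_t [ r_t + phi[t+1] - phi[t] ]; phi has length len(r)+1, phi[-1] is terminal."""
    return float(np.sum(r + phi[1:] - phi[:-1]))

r_w, r_l = rng.normal(size=5), rng.normal(size=7)            # per-token rewards of y_w, y_l
phi_w = np.concatenate([[0.7], rng.normal(size=4), [0.0]])   # potentials along y_w; Phi(s_1)=0.7
phi_l = np.concatenate([[0.7], rng.normal(size=6), [0.0]])   # potentials along y_l; terminal 0

p_raw    = sigmoid(r_w.sum() - r_l.sum())
p_shaped = sigmoid(shaped_return(r_w, phi_w) - shaped_return(r_l, phi_l))
print(np.isclose(p_raw, p_shaped))                           # True: shaping preserves preferences
```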

References

  • DPO: https://arxiv.org/abs/2305.18290
  • r2Q*: https://arxiv.org/abs/2404.12358
  • Step-DPO: https://arxiv.org/abs/2406.18629
  • RTO: https://arxiv.org/abs/2404.18922
  • TDPO: https://arxiv.org/abs/2404.11999
  • SimPO: https://arxiv.org/abs/2405.14734
  • ORPO: https://arxiv.org/abs/2403.07691
  • DMPO: https://arxiv.org/abs/2406.14868
  • DAPO: https://arxiv.org/abs/2503.14476
  • GSPO: https://arxiv.org/abs/2507.18071
  • GMPO: https://arxiv.org/abs/2507.20673
  • CISPO: https://arxiv.org/abs/2506.13585
  • VAPO: https://arxiv.org/abs/2504.05118
  • TRPO: https://arxiv.org/abs/1502.05477
