A Collection of LLM Alignment Algorithms (Part 1)
DPO
Theorem:
$$\max_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ f(x) \right\} + H(\mu), \quad \text{s.t. } \sum_x \mu(x) = 1$$
$$\Rightarrow \mu^*(x) = \frac{e^{f(x)}}{\sum_x e^{f(x)}} \quad \text{(Boltzmann distribution)}$$
Proof:
$$\begin{aligned}
&\max_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ f(x) \right\} + H(\mu) \\
&= \max_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ f(x) - \log \mu(x) \right\} \\
&= \min_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ \log \mu(x) - \log \exp(f(x)) \right\} \\
&= \min_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ \log \frac{\mu(x)}{\exp(f(x))} \right\} \\
&= \min_{\mu} \mathbb{E}_{x \sim \mu(x)} \left\{ \log \frac{\mu(x)}{\frac{1}{\bar{z}} \exp(f(x))} - \log \bar{z} \right\}, \quad \bar{z} = \sum_x \exp(f(x)) \\
&= \min_{\mu} \underbrace{\mathbb{E}_{x \sim \mu(x)} \left\{ \log \frac{\mu(x)}{\frac{1}{\bar{z}} \exp(f(x))} \right\}}_{D_{KL}\left( \mu(x) \,\|\, \frac{1}{\bar{z}} \exp(f(x)) \right)}
\end{aligned}$$
The constant $\log \bar{z}$ does not depend on $\mu$ and can be dropped; the remaining KL divergence is minimized (equals zero) when the two distributions coincide.
$$\Rightarrow \mu^*(x) = \frac{1}{\bar{z}} \exp(f(x)) = \frac{e^{f(x)}}{\sum_x e^{f(x)}}$$
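As a quick numerical sanity check of this theorem, the NumPy sketch below (illustrative only; the scores `f` and the random candidate distributions are arbitrary choices, not from the text) compares the objective $\mathbb{E}_{x\sim\mu}[f(x)]+H(\mu)$ at the Boltzmann solution against randomly drawn distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=8)                 # arbitrary scores f(x) over 8 outcomes

def objective(mu, f):
    """E_{x~mu}[f(x)] + H(mu)."""
    return float(np.dot(mu, f) - np.sum(mu * np.log(mu + 1e-12)))

mu_star = np.exp(f) / np.exp(f).sum()  # Boltzmann: mu*(x) = e^{f(x)} / sum_x e^{f(x)}

best_random = max(objective(rng.dirichlet(np.ones(8)), f) for _ in range(10_000))
print(objective(mu_star, f))           # ~= log sum_x e^{f(x)}
print(best_random)                     # smaller than the Boltzmann value
```

The optimal value equals $\log \sum_x e^{f(x)}$, consistent with the KL form of the proof.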
$$\begin{aligned}
\text{RLHF: } &\max_{\pi_\theta} \mathbb{E}_{x \sim D}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ r_\phi(x, y) \right\} - \beta \, D_{KL}\left( \pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x) \right) \\
&= \mathbb{E}_{x \sim D} \max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right\} \\
&= \mathbb{E}_{x \sim D} \max_{\pi_\theta} \left\{ \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ r_\phi(x, y) + \beta \log \pi_{\text{ref}}(y|x) \right\} + \beta \, H(\pi_\theta(y|x)) \right\} \\
&\Longleftarrow \max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(y|x)} \Big\{ \underbrace{\tfrac{1}{\beta} r_\phi(x, y) + \log \pi_{\text{ref}}(y|x)}_{f(y|x)} \Big\} + \underbrace{H(\pi_\theta(y|x))}_{H(\mu(y|x))}
\end{aligned}$$
$$\Rightarrow \pi_\theta^*(y|x) = \frac{e^{\frac{1}{\beta} r_\phi(x, y)} \cdot \pi_{\text{ref}}(y|x)}{Z(x)} \quad \Rightarrow \quad r_\phi(x, y) = \beta \log \frac{\pi_\theta^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$
$$\Rightarrow \text{BT: } p(y_m > y_n \mid x) = \sigma \left\{ \beta \left( \log \frac{\pi_\theta(y_m|x)}{\pi_{\text{ref}}(y_m|x)} - \log \frac{\pi_\theta(y_n|x)}{\pi_{\text{ref}}(y_n|x)} \right) \right\}$$
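Maximizing the likelihood of observed preferences under this BT probability gives the DPO loss. A minimal PyTorch sketch, assuming the per-response log-probabilities $\log\pi_\theta(y|x)$ and $\log\pi_{\text{ref}}(y|x)$ (summed over response tokens) have already been computed; the function and tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sequence-level DPO loss: -log sigma(beta * (log-ratio_w - log-ratio_l)).

    Each argument has shape (batch,) and holds log pi(y|x) summed over the
    response tokens, for the chosen (w) and rejected (l) responses.
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# toy usage with fake log-probabilities
fake = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*fake))
```

Only the difference of log-ratios enters the loss, so the intractable $\beta \log Z(x)$ term cancels between the two responses.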
Token-level PPO:
$$\max_{\theta} \mathbb{E}_{x \sim D}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ \sum_{t=1}^T r(s_t, y_t) \right\} - \beta \, D_{KL}\left( \pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x) \right)$$
where the state at step $t$ is $s_t = (x, y^{<t})$ and the action is the token $y_t$.
$$\begin{aligned}
&= \max_{\theta} \mathbb{E}_{x \sim D}\, \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ \sum_{t=1}^T r(s_t, y_t) - \beta \log \frac{\pi_\theta(y_t \mid y^{<t}, x)}{\pi_{\text{ref}}(y_t \mid y^{<t}, x)} \right\} \\
&= \mathbb{E}_{x \sim D} \max_{\theta} \mathbb{E}_{y \sim \pi_\theta(y|x)} \left\{ \sum_{t=1}^T \underbrace{r(s_t, y_t) + \beta \log \pi_{\text{ref}}(y_t \mid s_t)}_{r'(s_t, y_t)} - \beta \log \pi_\theta(y_t \mid s_t) \right\} \\
&= \mathbb{E}_{x \sim D} \max_{\theta} \sum_{t=1}^T \mathbb{E}_{y_t \sim \pi_\theta(y_t|s_t)} \left\{ r'(s_t, y_t) - \beta \log \pi_\theta(y_t \mid s_t) \right\} \\
&\Longleftarrow \max_{\theta} \sum_{t=1}^T \left\{ \mathbb{E}_{y_t \sim \pi_\theta(y_t|s_t)} \left\{ r'(s_t, y_t) \right\} + \beta \, H(\pi_\theta(y_t \mid s_t)) \right\} \quad \text{(Max-Entropy Soft Q-learning)}
\end{aligned}$$
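In practical token-level PPO pipelines, this rewrite corresponds to folding the per-token KL penalty into the reward. A hedged sketch of that reward construction (the placement of the scalar sequence reward on the last token and all tensor shapes are common implementation choices, not dictated by the derivation):

```python
import torch

def kl_shaped_rewards(seq_reward, logp_policy, logp_ref, beta=0.1):
    """Per-token PPO rewards with the KL penalty folded in:
    r_t = -beta * (log pi_theta(y_t|s_t) - log pi_ref(y_t|s_t)),
    plus the sequence-level reward r_phi(x, y) added on the final token.

    seq_reward : (batch,)    scalar reward for each full response
    logp_*     : (batch, T)  per-token log-probs of the sampled tokens
    """
    kl_penalty = -beta * (logp_policy - logp_ref)   # token-wise KL penalty
    bonus = torch.zeros_like(kl_penalty)
    bonus[:, -1] = seq_reward                       # sparse reward at the last token
    return kl_penalty + bonus
```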
Optimizing $\pi_\theta(y_t|s_t)$ at time step $t$:
=maxθEyt∼πθ(yt∣st)⏟μ{r′(st,yt)⏟f(yt∣st)+∑k=t+1T{Eyk∼πθ∗(yk∣sk){r′(sk,yk)}+βH(πθ∗(yk∣sk))}⏟1βQ∗(st,yt)+βH(πθ(yt∣st))}= \max_{\theta} \underbrace{\mathbb{E}_{y_t \sim \pi_\theta(y_t|s_t)}}_{\mu} \left\{ \underbrace{r'(s_t, y_t)}_{f(y_t|s_t)} + \underbrace{\sum_{k=t+1}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\}}_{\frac{1}{\beta} Q^*(s_t, y_t)}+ \beta H(\pi_\theta(y_t|s_t)) \right\} =θmaxμEyt∼πθ(yt∣st)⎩⎨⎧f(yt∣st)r′(st,yt)+β1Q∗(st,yt)k=t+1∑T{Eyk∼πθ∗(yk∣sk){r′(sk,yk)}+βH(πθ∗(yk∣sk))}+βH(πθ(yt∣st))⎭⎬⎫
$$\Rightarrow \pi_\theta^*(y_t|s_t) = \frac{ \exp \left\{ \frac{1}{\beta} r'(s_t, y_t) + \frac{1}{\beta} \sum_{k=t+1}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\} \right\} }{Z(s_t)}$$
where we define
$$\frac{1}{\beta}Q^*(s_t,y_t) = \frac{1}{\beta} r'(s_t, y_t) + \frac{1}{\beta} \sum_{k=t+1}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\}$$
$$Z(s_t) = \sum_{y_t} \exp \left\{ \frac{1}{\beta} Q^*(s_t, y_t) \right\} = \exp \left\{ \frac{1}{\beta} V^*(s_t) \right\}$$
📌 Notes:
- $\sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)$: the sum of exponentiated soft Q-values over all possible actions $y_t$, i.e. the partition function, denoted $Z(s_t)$.
- $V^*(s_t)$: the soft value function, defined as $V^*(s_t) = \beta \log \sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)$, so that $\sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right) = \exp\left( \frac{1}{\beta} V^*(s_t) \right)$.
- This identity is the key step in deriving the Soft Bellman equation in maximum-entropy RL; it normalizes the optimal policy:
$$\pi_\theta^*(y_t|s_t) = \frac{\exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)}{\sum_{y_t'} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t') \right)} = \frac{\exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)}{\exp\left( \frac{1}{\beta} V^*(s_t) \right)}$$

where:
- $Z(s_t)$ is the normalization constant (partition function); it depends only on the state $s_t$, not on $y_t$.
- This form reflects the recursive structure of the optimal policy, as in the policy updates of Soft Q-learning / maximum-entropy RL.
$$\left\{ \begin{aligned} \pi_\theta^*(y_t|s_t) &= \exp \left\{ \frac{1}{\beta} \left( Q^*(s_t, y_t) - V^*(s_t) \right) \right\} \\ Q^*(s_t, y_t) &= r'(s_t, y_t) + \sum_{k=t+1}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\} \end{aligned} \right.$$
$$\Rightarrow V^*(s_t) = \beta \log \left\{ \sum_{y_t} \exp \left\{ \frac{1}{\beta} Q^*(s_t, y_t) \right\} \right\}$$
$$\Rightarrow Q^* - V^* = \beta \log \pi_\theta^*$$
📌 Notes:
- $\pi_\theta^*(y_t|s_t)$: the optimal policy in state $s_t$ (a probability distribution over tokens $y_t$); it is a Boltzmann distribution.
  - This is the classic maximum-entropy RL form: the policy is proportional to $\exp(Q - V)$.
- $Q^*(s_t, y_t)$: the soft Q-function, i.e. the expected cumulative return (including entropy terms) after taking action $y_t$ in state $s_t$.
  - It consists of the immediate reward $r'(s_t, y_t)$ plus the expected future return (subsequent rewards and entropy).
- $V^*(s_t)$: the soft value function, i.e. the maximum expected cumulative return (with entropy) from state $s_t$:
$$V^*(s_t) = \max_{\pi} \mathbb{E}_{\pi} \left[ \sum_{k=t}^T r'(s_k, y_k) + \beta H(\pi(y_k|s_k)) \right]$$
  - Under the optimal policy, $V^*(s_t) = \beta \log \sum_{y_t} \exp\left( \frac{1}{\beta} Q^*(s_t, y_t) \right)$, i.e. the log-sum-exp ("soft max") over all possible actions.
- $Q^* - V^*$: this difference equals exactly $\beta \log \pi_\theta^*$:
$$Q^*(s_t, y_t) - V^*(s_t) = \beta \log \pi_\theta^*(y_t|s_t)$$
  - This is a key identity in maximum-entropy RL and the core relation of the Soft Bellman equation (a numerical check follows this list).
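A small numerical check of these identities (the soft Q-values below are arbitrary toy numbers):

```python
import torch

beta = 0.5
q = torch.tensor([1.0, -0.3, 2.2, 0.7])           # toy soft Q-values Q*(s_t, .)

v = beta * torch.logsumexp(q / beta, dim=0)       # V* = beta * log sum_y exp(Q*/beta)
pi = torch.softmax(q / beta, dim=0)               # pi* = exp((Q* - V*) / beta)

print(torch.allclose(q - v, beta * torch.log(pi)))   # Q* - V* = beta * log pi*  -> True
print(torch.isclose(pi.sum(), torch.tensor(1.0)))    # pi* sums to 1             -> True
```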
Simplifying $Q^*$:
$$\begin{aligned}
Q^*(s_t, y_t) &= r'(s_t, y_t) + \beta H(\pi_\theta^*(y_{t+1}|s_{t+1})) + \mathbb{E}_{y_{t+1} \sim \pi_\theta^*(y_{t+1}|s_{t+1})} \left\{ r'(s_{t+1}, y_{t+1}) \right\} \\
&\quad + \sum_{k=t+2}^T \left\{ \mathbb{E}_{y_k \sim \pi_\theta^*(y_k|s_k)} \left\{ r'(s_k, y_k) \right\} + \beta H(\pi_\theta^*(y_k|s_k)) \right\} \\
&= r'(s_t, y_t) + \underbrace{\beta\, \mathbb{E}_{y_{t+1} \sim \pi_\theta^*(y_{t+1}|s_{t+1})} \left\{ \log \frac{\exp\left( \frac{1}{\beta} Q^*(s_{t+1}, y_{t+1}) \right)}{\pi_\theta^*(y_{t+1}|s_{t+1})} \right\}}_{\text{entropy combined with } \mathbb{E}[Q^*(s_{t+1},\cdot)]} \\
&= r'(s_t, y_t) + \beta\, \mathbb{E}_{y_{t+1} \sim \pi_\theta^*(y_{t+1}|s_{t+1})} \left\{ \log \exp\left( \frac{1}{\beta} V^*(s_{t+1}) \right) \right\} = r'(s_t, y_t) + V^*(s_{t+1})
\end{aligned}$$
$$\Rightarrow Q^*(s_t, y_t) = r'(s_t, y_t) + V^*(s_{t+1})$$
where:
$$Q^*(s_t, y_t) = \begin{cases} r(s_t, y_t) + \beta \log \pi_{\text{ref}}(y_t|s_t) + V^*(s_{t+1}), & y_t \neq \text{EOS} \\ r(s_t, y_t) + \beta \log \pi_{\text{ref}}(y_t|s_t), & y_t = \text{EOS} \end{cases} \tag{*}$$
BT-Model:
$$P(y_w > y_l \mid x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))} = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
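For reference, training a reward model under the BT model amounts to minimizing the negative log-likelihood of the preferences. A minimal sketch, assuming the reward model already outputs scalar scores for the chosen and rejected responses (names are illustrative):

```python
import torch
import torch.nn.functional as F

def bt_loss(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood: -log sigma(r(x, y_w) - r(x, y_l)).

    reward_chosen, reward_rejected: (batch,) scalar scores from the reward model.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```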
Token-level BT
Using $\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)} = Q^*(s_t, y_t) - V^*(s_t) - \beta \log \pi_{\text{ref}}(y_t|s_t) = r(s_t, y_t) + V^*(s_{t+1}) - V^*(s_t)$ (with no $V^*(s_{T+1})$ term at the EOS step, per $(*)$) and summing over the sequence, the value terms telescope:
$$\sum_{t=1}^{T} r(s_t, y_t) = \sum_{t=1}^{T} \beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)} + \underbrace{\sum_{t=1}^{T} \big( V^*(s_t) - V^*(s_{t+1}) \big)}_{=\, V^*(s_1)}$$
Since $V^*(s_1)$ depends only on $x$, it cancels in the BT preference below.
Token-level DPO:
$$\begin{aligned}
\Rightarrow P(y_w > y_l \mid x) &= \sigma \left( \sum_{t=1}^{T_1} r(s_t^w, y_t^w) - \sum_{t=1}^{T_2} r(s_t^l, y_t^l) \right), \quad s_1^w = s_1^l = x \\
&= \sigma \left( \sum_{t=1}^{T_1} \beta \log \frac{\pi_\theta^*(y_t^w|s_t^w)}{\pi_{\text{ref}}(y_t^w|s_t^w)} - \sum_{t=1}^{T_2} \beta \log \frac{\pi_\theta^*(y_t^l|s_t^l)}{\pi_{\text{ref}}(y_t^l|s_t^l)} \right)
\end{aligned}$$
📌 Notes:
- Token-level DPO: Direct Preference Optimization applied per token, i.e. at every step of the generated sequence.
- $P(y_w > y_l \mid x)$: the probability that, given input $x$, the response $y_w$ (winner) is preferred over $y_l$ (loser).
- $r(s_t, y_t)$: the reward function, usually given by a reward model. In the maximum-entropy framework it can be expressed (equivalently, up to a potential-based shaping term; see the reward-equivalence section below) as $\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}$.
- $\pi_\theta^*(y_t|s_t)$: the optimal policy, a Boltzmann distribution obtained from soft Q-learning / maximum-entropy RL.
- $\pi_{\text{ref}}(y_t|s_t)$: the reference policy (e.g. the initial language model), used for regularization.
- $\sigma(z)$: the sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$.
- $s_1^w = s_1^l = x$: both sequences start from the same input $x$.
- Final form: the total reward difference decomposes into a sum of per-token log-likelihood ratios, which is exactly the DPO loss (a token-level sketch follows this list).
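To make the token-level decomposition concrete, here is a hedged sketch that computes the DPO objective from per-token log-ratios with a response mask; the layout (stacking chosen/rejected along a second dimension) and all names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def token_level_dpo_loss(logp_policy, logp_ref, mask, beta=0.1):
    """Token-level DPO loss built from per-token log-ratios.

    logp_policy, logp_ref: (batch, 2, T) per-token log-probs of the sampled
                           tokens for the [chosen, rejected] responses.
    mask                 : (batch, 2, T) 1.0 on response tokens, 0.0 elsewhere.
    """
    ratios = (logp_policy - logp_ref) * mask      # per-token log pi_theta / pi_ref
    seq_ratio = ratios.sum(dim=-1)                # (batch, 2): summed over tokens
    logits = beta * (seq_ratio[:, 0] - seq_ratio[:, 1])
    return -F.logsigmoid(logits).mean()
```

Because the per-token log-ratios are simply summed, this reproduces the sequence-level DPO logits exactly.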
Reward equivalence
$$(*) \Rightarrow \underbrace{\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}}_{\tilde{r}(s_t, y_t)} = r(s_t, y_t) + \underbrace{V^*(s_{t+1}) - V^*(s_t)}_{\Phi(s_{t+1}) - \Phi(s_t)} \quad \text{(reward shaping)}$$
$$\Rightarrow \tilde{r} \text{ and } r \text{ are equivalent.}$$
$$\underbrace{\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}}_{\text{token-level reward}} \Rightarrow \text{can be generalized to a step-wise reward}$$
📌 Notes:
- $\beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}$: the token-level reward defined under the maximum-entropy RL framework (the reward credited to each generated token), denoted $\tilde{r}$.
- $r(s_t, y_t)$: the original reward function (e.g. from a reward model).
- $V^*(s_t)$: the soft value function, the maximum expected return (with entropy) starting from state $s_t$.
- $\Phi(s_t) = V^*(s_t)$: the potential function used for reward shaping.
- Reward shaping:
  - By the potential-based reward-shaping theorem (Ng et al., 1999), if the reward is modified to $\tilde{r}(s_t, y_t) = r(s_t, y_t) + \Phi(s_{t+1}) - \Phi(s_t)$, the optimal policy is unchanged.
  - Here, $\tilde{r} = \beta \log \frac{\pi_\theta^*(y_t|s_t)}{\pi_{\text{ref}}(y_t|s_t)}$ is therefore a reward equivalent to the original reward $r$.
- Conclusion:
  - $\tilde{r}$ and $r$ are equivalent as optimization targets.
  - This token-level reward can therefore be generalized to a step-wise reward for training language models (see the sketch below).
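A hedged sketch of reading off these per-token credits from a trained policy and its reference model (the gather-based indexing and tensor shapes are illustrative assumptions):

```python
import torch

@torch.no_grad()
def token_rewards(policy_logits, ref_logits, tokens, beta=0.1):
    """Per-token credit beta * log(pi_theta(y_t|s_t) / pi_ref(y_t|s_t)).

    policy_logits, ref_logits: (batch, T, vocab) logits over the generated sequence.
    tokens                   : (batch, T) the generated token ids y_t.
    returns                  : (batch, T) step-wise rewards.
    """
    logp_policy = torch.log_softmax(policy_logits, dim=-1)
    logp_ref = torch.log_softmax(ref_logits, dim=-1)
    idx = tokens.unsqueeze(-1)
    per_tok = logp_policy.gather(-1, idx) - logp_ref.gather(-1, idx)
    return beta * per_tok.squeeze(-1)
```

Such step-wise rewards can then be used for finer-grained credit assignment, e.g. scoring intermediate steps (cf. Step-DPO in the references).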
References
- DPO: https://arxiv.org/abs/2305.18290
- r2Q*: https://arxiv.org/abs/2404.12358
- Step-DPO: https://arxiv.org/abs/2406.18629
- RTO: https://arxiv.org/abs/2404.18922
- TDPO: https://arxiv.org/abs/2404.11999
- SimPO: http://arxiv.org/abs/2405.14734
- ORPO: http://arxiv.org/abs/2403.07691
- DMPO: https://arxiv.org/pdf/2406.14868
- DAPO: https://arxiv.org/pdf/2503.14476
- GSPO: https://arxiv.org/abs/2507.18071
- GMPO: https://arxiv.org/abs/2507.20673
- CISPO: https://arxiv.org/abs/2506.13585
- VAPO: https://arxiv.org/abs/2504.05118
- TRPO: https://arxiv.org/abs/1502.05477