
4.5 Common Gradient Descent Algorithms in Optimizers

news 2025/7/27 15:57:37

Gradient descent and its common variants can be stated precisely through the following update rules:

1. Batch Gradient Descent

Goal: minimize the loss function $\mathcal{L}(\theta)$, where $\theta$ denotes the model parameters.
Parameter update rule: $\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta \mathcal{L}(\theta_t)$

$\theta_t$: the parameter values at iteration $t$
$\eta$: the learning rate
$\nabla_\theta \mathcal{L}(\theta_t)$: the gradient of the loss with respect to $\theta$, evaluated at $\theta_t$ (the average gradient over all training samples).
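As a minimal sketch of the batch update rule, the following fits a line $y = wx + b$ by averaging the gradient over every sample at each step. The toy data, the squared-error loss, and the names `batch_gradient` and `batch_gradient_descent` are assumptions made for illustration; they do not come from the original text.

```python
def batch_gradient(theta, xs, ys):
    """Average gradient of the per-sample loss 0.5 * (w*x + b - y)**2 over ALL samples."""
    w, b = theta
    n = len(xs)
    grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
    return [grad_w, grad_b]


def batch_gradient_descent(theta, xs, ys, eta=0.1, steps=2000):
    """Repeat theta <- theta - eta * grad L(theta) for a fixed number of steps."""
    for _ in range(steps):
        grad = batch_gradient(theta, xs, ys)
        theta = [p - eta * g for p, g in zip(theta, grad)]
    return theta


# Hypothetical toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = batch_gradient_descent([0.0, 0.0], xs, ys)
print(w, b)  # should be close to w = 2, b = 1
```

Every step touches the whole dataset, which makes the gradient exact for this loss but costly when the dataset is large.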

2. Stochastic Gradient Descent (SGD)

Each iteration computes the gradient from a single randomly selected sample:
$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$

$\nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$: the gradient computed from the single sample $(x_i, y_i)$ only.
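The single-sample update can be sketched as follows, again on a hypothetical line-fitting problem. The data, the squared-error loss, the fixed seed, and the function names are illustrative assumptions rather than part of the original text.

```python
import random


def sample_gradient(theta, x, y):
    """Gradient of the single-sample loss 0.5 * (w*x + b - y)**2 with respect to (w, b)."""
    w, b = theta
    residual = w * x + b - y
    return [residual * x, residual]


def sgd(theta, xs, ys, eta=0.05, steps=5000, seed=0):
    """Each step draws ONE sample uniformly at random and follows its gradient."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    for _ in range(steps):
        i = rng.randrange(len(xs))
        grad = sample_gradient(theta, xs[i], ys[i])
        theta = [p - eta * g for p, g in zip(theta, grad)]
    return theta


# Hypothetical toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w_sgd, b_sgd = sgd([0.0, 0.0], xs, ys)
print(w_sgd, b_sgd)  # should end up near w = 2, b = 1
```

Because this toy data is noise-free, the single-sample gradients all vanish at the same point and the iterates settle there. On noisy data a constant learning rate would typically leave the parameters fluctuating around the minimum.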

3. Mini-Batch Gradient Descent

Each iteration computes the gradient from a small batch of data:

$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$

$B$: the batch size
$\frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$: the average of the gradients of the samples in the batch.
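A minimal sketch of the mini-batch rule, assuming the same kind of hypothetical line-fitting problem: each step samples $B$ distinct indices and averages their gradients. The data, loss, seed, and function names are assumptions for illustration.

```python
import random


def minibatch_gradient(theta, batch):
    """Average gradient of 0.5 * (w*x + b - y)**2 over the (x, y) pairs in the batch."""
    w, b = theta
    size = len(batch)
    grad_w = sum((w * x + b - y) * x for x, y in batch) / size
    grad_b = sum((w * x + b - y) for x, y in batch) / size
    return [grad_w, grad_b]


def minibatch_gd(theta, xs, ys, batch_size=2, eta=0.1, steps=5000, seed=0):
    """Each step averages the gradient over a random batch of `batch_size` samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    for _ in range(steps):
        indices = rng.sample(range(len(xs)), batch_size)  # B distinct samples
        batch = [(xs[i], ys[i]) for i in indices]
        grad = minibatch_gradient(theta, batch)
        theta = [p - eta * g for p, g in zip(theta, grad)]
    return theta


# Hypothetical toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w_mb, b_mb = minibatch_gd([0.0, 0.0], xs, ys)
print(w_mb, b_mb)  # should end up near w = 2, b = 1
```

Setting `batch_size=1` recovers the single-sample rule, and setting it to `len(xs)` recovers the full-batch rule, so the batch size trades gradient noise against cost per step.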

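4. Gradient Descent with Momentum

A momentum term $v_t$ is introduced to speed up convergence and damp oscillation:

$v_{t+1} = \beta v_t + (1-\beta) \nabla_\theta \mathcal{L}(\theta_t)$
$\theta_{t+1} = \theta_t - \eta \cdot v_{t+1}$

$\beta$: the momentum coefficient (commonly set to 0.9), which controls how strongly past gradients influence the update.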
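The two momentum equations above can be sketched as below. The quadratic test loss, the initial value $v_0 = 0$, and the names `momentum_descent` and `quadratic_grad` are assumptions for illustration. Note that some libraries define momentum without the $(1-\beta)$ factor; this sketch follows the form given above.

```python
def momentum_descent(grad_fn, theta, eta=0.1, beta=0.9, steps=500):
    """Momentum in exponential-moving-average form:
    v <- beta * v + (1 - beta) * grad, then theta <- theta - eta * v."""
    v = [0.0] * len(theta)  # assumed initial momentum v_0 = 0
    for _ in range(steps):
        grad = grad_fn(theta)
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, grad)]
        theta = [p - eta * vi for p, vi in zip(theta, v)]
    return theta


def quadratic_grad(theta):
    """Gradient of the illustrative loss 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return [theta[0], 25.0 * theta[1]]


theta_mom = momentum_descent(quadratic_grad, [1.0, 1.0])
print(theta_mom)  # both coordinates should be very close to 0
```

The test loss is deliberately much steeper along one coordinate than the other, which is the kind of surface where averaging past gradients helps smooth the trajectory.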

5. The Adam Optimizer

Adam combines momentum (the first moment) with an adaptive learning rate (the second moment) and adds bias correction:

$m_{t+1} = \beta_1 m_t + (1-\beta_1) \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(first-moment estimate)}$
$v_{t+1} = \beta_2 v_t + (1-\beta_2) \left( \nabla_\theta \mathcal{L}(\theta_t) \right)^2 \quad \text{(second-moment estimate; the square is element-wise)}$
$\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}} \quad \text{(first-moment bias correction)}$
$\hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}} \quad \text{(second-moment bias correction)}$
$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon}$

$\beta_1, \beta_2$: the decay rates (commonly set to 0.9 and 0.999)
$\epsilon$: a constant for numerical stability (commonly set to $10^{-8}$).
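The five Adam equations can be sketched in plain Python as follows, with $t$ counted from 0 so that the bias-correction exponents are $t+1$ as in the formulas. The quadratic test loss, the zero initial moments, the learning rate of 0.01, and the function names are assumptions for illustration.

```python
import math


def adam(grad_fn, theta, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam update with bias correction; all operations are element-wise."""
    m = [0.0] * len(theta)  # first-moment estimate, assumed m_0 = 0
    v = [0.0] * len(theta)  # second-moment estimate, assumed v_0 = 0
    for t in range(steps):  # t = 0, 1, 2, ...
        grad = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
        v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
        m_hat = [mi / (1 - beta1 ** (t + 1)) for mi in m]
        v_hat = [vi / (1 - beta2 ** (t + 1)) for vi in v]
        theta = [p - eta * mh / (math.sqrt(vh) + eps)
                 for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


def quadratic_grad(theta):
    """Gradient of the illustrative loss 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return [theta[0], 25.0 * theta[1]]


first_step = adam(quadratic_grad, [1.0, 1.0], steps=1)
theta_adam = adam(quadratic_grad, [1.0, 1.0])
print(first_step)  # after one step each coordinate has moved by roughly eta
print(theta_adam)  # should be close to [0, 0]
```

On the very first step the bias-corrected ratio $\hat{m}/\sqrt{\hat{v}}$ is approximately the sign of the gradient, so both coordinates move by about $\eta$ even though their gradients differ by a factor of 25. This is the per-coordinate adaptive step size in action.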

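6. The Mathematical Meaning of Gradient Descent

The update direction of gradient descent is the negative gradient $-\nabla_\theta \mathcal{L}$: in essence, the parameters $\theta$ move along the direction of steepest descent on the loss surface. The learning rate $\eta$ controls the step size. If it is too large, the iterates may oscillate; if it is too small, convergence is slow.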
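The effect of the learning rate can be seen on a one-dimensional example. The sketch below assumes the loss $\mathcal{L}(\theta) = \frac{1}{2}\theta^2$, whose gradient is $\theta$, so each update multiplies $\theta$ by $(1 - \eta)$. The loss, the four learning rates, and the function name are illustrative choices, not values from the original text.

```python
def run_gradient_descent(eta, theta0=1.0, steps=20):
    """Gradient descent on L(theta) = 0.5 * theta**2; returns every iterate."""
    trajectory = [theta0]
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * theta  # the gradient of 0.5 * theta**2 is theta
        trajectory.append(theta)
    return trajectory


slow = run_gradient_descent(eta=0.01)         # tiny steps: still far from 0 after 20 steps
good = run_gradient_descent(eta=0.5)          # halves theta every step
oscillating = run_gradient_descent(eta=1.9)   # overshoots: the sign flips every step
diverging = run_gradient_descent(eta=2.1)     # overshoots so far that |theta| grows

for name, traj in [("slow", slow), ("good", good),
                   ("oscillating", oscillating), ("diverging", diverging)]:
    print(name, traj[:4], "final:", traj[-1])
```

For this particular loss, any $\eta$ between 0 and 2 converges, values between 1 and 2 do so while alternating sign, and values above 2 diverge. The thresholds depend on the curvature of the loss and will differ for other problems.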

Formula summary table

Algorithm	Formula
Batch gradient descent	$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta \mathcal{L}(\theta_t)$
Stochastic gradient descent	$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$
Mini-batch gradient descent	$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$
Momentum	$v_{t+1} = \beta v_t + (1-\beta) \nabla_\theta \mathcal{L}(\theta_t)$, $\theta_{t+1} = \theta_t - \eta \cdot v_{t+1}$
Adam	$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon}$