# Summary of Gradient Descent Optimization Algorithms

WARNING

UNDER CONSTRUCTION

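# Background

To recap: in practice we have a dataset $D = \lbrace(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})\rbrace$ and a loss function $\ell\colon\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ . We want to find a hypothesis function $h$ (with all of its parameters denoted by $\theta$ ) that minimizes the empirical error: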

$\min_\theta \widehat{E}(\theta;D)\text{,}\quad\text{where } \widehat{E}(\theta;D)=\frac{1}{m}\sum_{i=1}^m \ell\mathopen{}\left(h(x^{(i)};\theta), y^{(i)}\right)\mathclose{}.$

This optimization problem can be solved with gradient descent:

$\theta^\prime = \theta - \eta\cdot g$

where $\eta$ is the learning rate and $g = \nabla_\theta\widehat{E}(\theta)$ is the gradient. (For convenience, the empirical error $\widehat{E}$ is abbreviated as $f$ below.)

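Suppose the network weights at the current step are $\theta_t$ and become $\theta_{t+1}$ after one gradient descent update. We introduce the following notation:

Raw gradient $g_t = \nabla_\theta f(\theta_t)$
Weight update $u_t$ (independent of the learning rate), satisfying $\theta_{t+1} = \theta_t - \eta\cdot u_t$

Clearly, for stochastic gradient descent (SGD), $u_t = g_t$ , i.e. $\theta_{t+1} = \theta_t - \eta\cdot g_t$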
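The update rule above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not part of the notes: the hypothesis is the one-parameter linear model $h(x;\theta)=\theta x$, the loss is the squared error, and the toy dataset, learning rate, and step count are made up for the example.

```python
def empirical_risk(theta, data):
    """Mean squared-error loss of the linear hypothesis h(x; theta) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def gradient(theta, data):
    """Derivative of empirical_risk with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)


def gradient_descent(theta, data, lr=0.05, steps=200):
    """Repeat theta <- theta - lr * g for a fixed number of steps."""
    for _ in range(steps):
        g = gradient(theta, data)
        theta = theta - lr * g
    return theta


# Toy dataset generated by y = 3x (an assumption), so the minimizer is theta = 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
theta_hat = gradient_descent(0.0, data)
```

Because every example is used for each gradient, this is the full-batch variant; the learning rate has to be small enough for the iteration to contract, which holds for the values chosen here.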

TODO

First moment (first-order momentum) $m_t$
Second moment (second-order momentum) $v_t$

full GD vs. mini-batch GD

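# Vanilla

# Challenges

Choosing the learning rate
Choosing the learning rate scheduler
If the data are sparse or the features occur with different frequencies (?), we want larger updates for rarely occurring features
Local minima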

# SGD

$\begin{alignedat}{2} u_t &= g_t \\ \theta_{t+1} &= \theta_t - \eta\cdot u_t. \end{alignedat}$
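A minimal mini-batch sketch of this rule follows. The linear model $h(x;\theta)=\theta x$, the squared-error loss, the noise-free toy data, and the values of `lr`, `batch_size`, `epochs`, and `seed` are all assumptions chosen for illustration.

```python
import random


def minibatch_gradient(theta, batch):
    """Gradient of the mean squared error of h(x; theta) = theta * x on one batch."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)


def sgd(theta, data, lr=0.02, batch_size=2, epochs=100, seed=0):
    """Plain SGD: u_t = g_t, theta_{t+1} = theta_t - lr * u_t."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            u = minibatch_gradient(theta, batch)  # u_t = g_t
            theta = theta - lr * u
    return theta


# Noise-free toy data from y = 3x (an assumption), so every batch agrees on theta = 3.
data = [(float(x), 3.0 * x) for x in range(1, 5)]
theta_sgd = sgd(0.0, data)
```

With noisy data the mini-batch gradients would disagree with each other, and a constant learning rate would leave the iterate fluctuating around the minimizer rather than settling on it.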

# SGD with momentum

Introduces the first moment.

$\begin{alignedat}{2} \textcolor{#8b959e}{g_t} &\textcolor{#8b959e}{= \nabla_\theta f(\theta_t)} \\ \textcolor{#8b959e}{m_0} &\textcolor{#8b959e}{= g_0} \\ \textcolor{#F26400}{m_t} &= \textcolor{#F26400}{\gamma\cdot m_{t-1}} + g_t \\ u_t &= \textcolor{#F26400}{m_t} \\ \theta_{t+1} &= \theta_t - \eta\cdot u_t. \end{alignedat}$

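Unrolling the recursion gives $m_t = \sum_{k=0}^{t}\gamma^{t-k} g_k$ , so the first moment $m_t$ is an exponentially decaying sum of past gradients (an exponential moving average up to normalization by $1-\gamma$ ). The weight $\gamma$ is typically set to 0.9.

Pros: helps speed up optimization (carries the iterate across flat regions and smooths oscillating gradients)
Cons: may overshoot the region around the minimum and fail to stop (TODO)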
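The momentum update can be sketched as follows. The function name `sgd_momentum`, the quadratic test objective, and the hyperparameter values are assumptions for illustration; only the recursion itself comes from the formulas above.

```python
def sgd_momentum(grad_fn, theta, lr=0.1, gamma=0.9, steps=500):
    """m_0 = g_0, m_t = gamma * m_{t-1} + g_t, theta_{t+1} = theta_t - lr * m_t."""
    m = None
    for _ in range(steps):
        g = grad_fn(theta)
        m = g if m is None else gamma * m + g  # first step initializes m_0 = g_0
        theta = theta - lr * m                 # u_t = m_t
    return theta


# Hypothetical objective f(theta) = (theta - 2)^2, whose gradient is 2 * (theta - 2).
theta_mom = sgd_momentum(lambda th: 2.0 * (th - 2.0), theta=0.0)
```

On this objective the iterate overshoots $\theta = 2$ and oscillates with shrinking amplitude before settling, which is a small-scale version of the "cannot stop" behavior listed under the cons.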

NOTE

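The SGD with momentum formulas above follow the PyTorch implementation, which differs slightly from the paper^[1]: in the paper, the learning rate $\eta$ multiplies the gradient $g_t$ rather than the momentum $m_t$ , i.e.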

$\begin{alignedat}{2} m_t &= \gamma\cdot m_{t-1} + \eta\cdot g_t \\ \theta_{t+1} &= \theta_t - m_t. \end{alignedat}$

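For a fixed learning rate $\eta$ the two are equivalent (the paper's momentum is simply the PyTorch momentum scaled by $\eta$ ). In practice, however, the learning rate is often adjusted dynamically during training (lr schedule), and with the PyTorch form a change in the learning rate does not affect how the momentum $m_t$ is computed.

The rest of this document uses the PyTorch-style formulation throughout.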
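The difference between the two conventions can be checked numerically. In the sketch below, the function names, the fixed gradient sequence (which ignores the dependence of $g_t$ on $\theta_t$), and the learning-rate schedules are all assumptions made for illustration. Starting the buffer at zero gives the same result as initializing it with the first gradient.

```python
def pytorch_style(grads, lrs, gamma=0.9, theta=0.0):
    """m_t = gamma * m_{t-1} + g_t; theta_{t+1} = theta_t - lr_t * m_t."""
    m = 0.0
    for g, lr in zip(grads, lrs):
        m = gamma * m + g
        theta -= lr * m
    return theta


def paper_style(grads, lrs, gamma=0.9, theta=0.0):
    """m_t = gamma * m_{t-1} + lr_t * g_t; theta_{t+1} = theta_t - m_t."""
    m = 0.0
    for g, lr in zip(grads, lrs):
        m = gamma * m + lr * g
        theta -= m
    return theta


grads = [1.0, -0.5, 2.0, 0.25]    # arbitrary fixed gradient sequence
constant = [0.1] * 4              # fixed learning rate
decayed = [0.1, 0.1, 0.01, 0.01]  # learning rate dropped mid-run

# Gap between the two conventions under each schedule.
same = abs(pytorch_style(grads, constant) - paper_style(grads, constant))
diff = abs(pytorch_style(grads, decayed) - paper_style(grads, decayed))
```

Here `same` is zero up to rounding, while `diff` is clearly nonzero: in the paper-style form the buffer still carries gradients scaled by the old learning rate after the drop.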

# SGD with Nesterov accelerated gradient (NAG)

$\begin{alignedat}{2} \textcolor{#8b959e}{g_t} &\textcolor{#8b959e}{= \nabla_\theta f(\theta_t)} \\ \textcolor{#8b959e}{m_0} &\textcolor{#8b959e}{= g_0} \\ \textcolor{#F26400}{\tilde{g}_t} &= \nabla_\theta f(\theta_t\textcolor{#F26400}{- \eta\gamma\cdot m_{t-1}}) \\ m_t &= \gamma\cdot m_{t-1} + \textcolor{#F26400}{\tilde{g}_t} \\ u_t &= m_t \\ \theta_{t+1} &= \theta_t - \eta\cdot u_t. \end{alignedat}$

The PyTorch implementation instead evaluates the gradient at $\theta_t$ and applies the momentum a second time in the update:

$\begin{alignedat}{2} \textcolor{#8b959e}{g_t} &\textcolor{#8b959e}{= \nabla_\theta f(\theta_t)} \\ \textcolor{#8b959e}{m_0} &\textcolor{#8b959e}{= g_0} \\ m_t &= \gamma\cdot m_{t-1} + g_t \\ u_t &= g_t + \gamma\cdot m_t \\ \theta_{t+1} &= \theta_t - \eta\cdot u_t. \end{alignedat}$
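A sketch of the PyTorch-style Nesterov step is given below, assuming the update used by `torch.optim.SGD` with `nesterov=True`, namely $m_t = \gamma\cdot m_{t-1} + g_t$ followed by $u_t = g_t + \gamma\cdot m_t$. The function name `sgd_nesterov`, the quadratic objective, and the hyperparameter values are hypothetical.

```python
def sgd_nesterov(grad_fn, theta, lr=0.1, gamma=0.9, steps=300):
    """PyTorch-style NAG: the gradient is taken at theta_t, not at a look-ahead point."""
    m = None
    for _ in range(steps):
        g = grad_fn(theta)
        m = g if m is None else gamma * m + g  # m_0 = g_0, then m_t = gamma * m_{t-1} + g_t
        u = g + gamma * m                      # momentum applied once more in the update
        theta = theta - lr * u
    return theta


# Hypothetical objective f(theta) = (theta - 2)^2, whose gradient is 2 * (theta - 2).
theta_nag = sgd_nesterov(lambda th: 2.0 * (th - 2.0), theta=0.0)
```

This form needs only one gradient evaluation per step at the current parameters, so it fits the usual training loop where the gradient is already available before the optimizer runs.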

# Adagrad

Introduces the second moment, marking the arrival of adaptive learning rate optimization algorithms.

$g_{t,i} = \nabla f(\theta_{t,i}).$

( $\theta_{t+1,i} = \theta_{t,i} - \eta\cdot g_{t,i}$ )

$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\textcolor{#F26400}{\sqrt{G_{t,ii}+\epsilon}}}\cdot g_{t,i}$

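where $G_t$ is a diagonal matrix whose entry $G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ up to step $t$ .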
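The per-parameter update can be sketched as below, storing only the diagonal of $G_t$ as a list of running sums of squared gradients. The function names, the badly scaled quadratic objective, and the values of `lr`, `eps`, and `steps` are assumptions for illustration.

```python
import math


def adagrad(grad_fn, theta, lr=0.5, eps=1e-8, steps=500):
    """Per-coordinate update theta_i <- theta_i - lr / sqrt(G_ii + eps) * g_i."""
    theta = list(theta)
    G = [0.0] * len(theta)  # diagonal of G_t: running sums of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        for i in range(len(theta)):
            G[i] += g[i] ** 2
            theta[i] -= lr / math.sqrt(G[i] + eps) * g[i]
    return theta


# Hypothetical badly scaled objective f(theta) = 50 * theta_0^2 + 0.5 * theta_1^2.
def grad_fn(theta):
    return [100.0 * theta[0], 1.0 * theta[1]]


theta_ada = adagrad(grad_fn, [1.0, 1.0])
```

Although the two gradients differ by a factor of 100, dividing by $\sqrt{G_{t,ii}+\epsilon}$ cancels that scale, so in this toy example both coordinates move by roughly `lr` on the first step and follow nearly the same path afterwards.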

Adadelta

RMSprop

# Adam

Adaptive Moment Estimation: uses both the first moment and the second moment.
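As a rough illustration of combining the two moments, the sketch below assumes the commonly used Adam formulation with exponential moving averages and bias correction. These notes do not yet spell out the formulas, so the exact update, the default values of `beta1`, `beta2`, and `eps`, and the test objective are all assumptions here.

```python
import math


def adam(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Scalar Adam sketch: first moment m_t and second moment v_t, both bias-corrected."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g      # moving average of gradients
        v = beta2 * v + (1.0 - beta2) * g * g  # moving average of squared gradients
        m_hat = m / (1.0 - beta1 ** t)         # correct the bias from starting at zero
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Hypothetical objective f(theta) = (theta - 2)^2, whose gradient is 2 * (theta - 2).
theta_adam = adam(lambda th: 2.0 * (th - 2.0), theta=0.0)
```

In this formulation the very first step has magnitude close to `lr` whatever the scale of the gradient, because the bias-corrected ratio reduces to $g_1 / (|g_1| + \epsilon)$.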

# (AdaMax, Nadam, AMSGrad)

# Further reading

Overview:

一个框架看懂优化算法之异同 SGD/AdaGrad/Adam - 知乎专栏
An overview of gradient descent optimization algorithms
Wilson, Ashia C., et al. “The marginal value of adaptive gradient methods in machine learning”. In Advances in Neural Information Processing Systems. 2017.
Parameter updates - CS231n

Momentum/NAG:

https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
https://blog.christianperone.com/2020/11/optimization-deep-learning/

[1] Sutskever et al. “On the importance of initialization and momentum in deep learning”. ICML. 2013.
