Posted: 2016-03-06
, Modified: 2016-03-06
Tags: gradient
Gradient Descent
Definition: A convex function is \(L\)-smooth if \[ f(y) \le \blu{f(x)+\an{\nb f(x),y-x}}+\redd{\fc L2 \ve{y-x}^2}\] or equivalently \[ \ve{\nb f(x)-\nb f(y)}_*\le L\ve{x-y}.\]
Assume \(f\) is \(L\)-smooth.
- Update:
\[\begin{align}
x_{k+1} &= \amin_y \ba{\fc{L}{2} \ve{y-x_k}^2 + \an{\nb f(x),y-x_k}}\\
&= x_k - \rc{L}\nb f(x_k)&\text{for $\ell_2$ norm}.
\end{align}\]
- Lemma: \[f(x')\le f(x) - \fc{\ve{\nb f(x)}_*^2}{2L}.\]
- Guarantee: Let \(R=\max_{x:f(x)\le f(x_0)}\ve{x-x^*}_*\). \[f(x_T)-f(x^*) \le O\pf{LR^2}{T}.\] Obtain an \(\ep\)-approximation in \(\fc{LR^2}{\ep}\) steps.
Proof
See also gradient descent.
- Bound the difference from optimal by the distance to the optimal times the gradient (using convexity and CS).
- Bound the gradient by the progress.
- Now let \(D_k = f(x_k)-f(x^*)\). Measure progress by \(\rc{D_k}\) whose differences telescope.
Let
\(D_k = f(x_k) - f(x^*)\).
\[\begin{align}
\ub{f(x_k)-f(x^*)}{D_k}&\le \an{\nb f(x_k),x_k-x^*} \le \ve{\nb f(x_k)} \ve{x_k-x^*}\le R \ve{\nb f(x_k)}_*\\
D_k - D_{k+1} &\ge \rc{2L}\ve{\nb f(x_k)}_*^2\\
D_k^2 & \le 2LR^2(D_k-D_{k+1})\\
\rc{D_{k+1}}-\rc{D_k} &\ge \rc{2LR^2}\fc{D_k}{D_{k+1}} \ge \rc{2LR^2}\\
\implies \rc{D_T}&\ge T.
\end{align}\]
Mirror descent
Definition:
\(w(x):Q\to \R\) (
\(Q\) convex) is a
distance generating function if the following holds. Then define
Bregman divergence as follows. For all
\(x\in Q\bs \pl Q, y\in Q\),
\[\begin{align}
w(y) &\ge w(x) + \an{\nb w(x),y-x} + \rc2\ve{x-y}^2\\
V_x(y) & = w(y) - \an{\nb w(x),y-x} - w(x).
\end{align}\]
Note we can replace the gradient by a subgradient below.
- Update:
\[\begin{align}
\wt x &= \text{Mirr}(\al\nb f(x))\\
\text{where }\text{Mirr}_x(\xi) &= \amin_{y\in Q} V_x(y)+\an{\xi,y-x}.
\end{align}\]
- Lemma: For all \(u\),
\[\begin{align}
\al (f(x_k)-f(u))
&\le \al\an{\nb f(x_k),x_k-u}\\
&\le \fc{\al^2}2\ve{\nb f(x_k)}^2 +V_{x_k}(u) - V_{x_{k+1}}(u)\\
&= \fc{\al^2}2\ve{\nb f(x_k)}^2 +\rc 2 \ve{z_k-u}^2 -\rc2 \ve{z_{k+1}-u}^2 & \text{for }\ell^2.
\end{align}\]
Generalization:
\[\begin{align}
\al (f(x_k)-f(u))
&\le \al\an{\nb f(x_k),x_k-u}\\
&\le \fc{\al^2}2\ve{\nb f(x_k)}^2 +V_{x_k}(u) - V_{x_{k+1}}(u)
\end{align}\]
- Guarantee: When \(V_{x_0}(x^*)\le \Te, \ve{\nb f(x)}_*\le \rh\) everywhere, \[ f(\ol x) - f(x^*) \le \fc{\sqrt{2\Te}\rh}{\sqrt T}.\] Obtain an \(\ep\)-approximation in \(\fc{2\Te\rh^2}{\ep^2}\) steps.
Examples
\[\begin{align}
w(y) &= \rc2 \ve{y}_2^2\\
V_x(y) &= \rc2\ve{x-y}_2^2\\
w(y) &= \sum_i y_i \ln y_i &\text{w.r.t. }\ell_1\text{ over }\De\\
V_x(y) &= \sum_{y_i} \ln \pf{y_i}{x_i} \ge \prc{x-y}_1^2.
\end{align}\]
Intuition
Note for \(\ved_2\) MD is the same as GD except for a factor \(\al\).
Note 3 formulations of mirror descent.
- Above.
- Set \[\nb w(x_{k+1}) \leftarrow \nb w(x_k) - \al \nb f(x_{k}).\] (By this we mean take the value of \(x_{k+1}\) that makes this true.)
- Regularized follow the leader (RFTL): Take \[x_{k+1} = \amin_y w(y)+ \al \an{y, \sum_{i=0}^k \nb f(x_i)}.\]
(1)\(\iff\)(2): The min is when the gradient is 0.
(2)
\(\iff\)(3): The min is when the gradient is 0. Write this out for
\(k,k-1\)and subtract.
\[\begin{align}
\nb w(x_{k+1}) + \al \sum_{i=0}^k \nb f(x_i) &=0\\
\nb w(x_k) + \al \sum_{i=0}^{k-1} \nb f(x_i) &=0\\
\nb w(x_{k+1})-w(x_k) + \nb f(x_k)&=0
\end{align}\]
Coupling
Unconstrained version (Q=R)
- Update:
\[\begin{align}
\label{eq:weighted}
x_{k+1} &= \tau z_k + (1-\tau) y_k\\
y_{k+1} &= \text{Grad}(x_{k+1})\\
z_{k+1} &= \text{Mirr}_{z_k}(\al \nb f(x_{k+1})).
\end{align}\]
- Lemma: If \(\rc{1-\tau}{\tau}=\al L\), then \[
\al \an{\nb f(x_{k+1}), x_{k+1}-u} \le \al^2L (f(y_k)-f(y_{k+1}))+V_{z_k}(u) - V_{z_{k+1}}(u).
\]
- Guarantee: \[f(\ol x) - f(x^*) \le \fc{\sqrt{2\Te}\rh}{T}.\] Starting at \(f(x_0)-f(x^*)\le d\), \(V_{x_0}(x^*)\le \Te)\), in \(T=4\sfc{L\Te}{d}\) steps, obtain \(f(\ol x)-f(x^*)\le \fc d2\). Hence, get an \(\ep\)-approximate solution in \(O(\sfc{L\Te}{\ep})\) iterations.
Proof
- Why the weird definition for \(x_{k+1}\)? If we defined \(z_{k+1}=\text{Mirr}_{x_{k+1}}(\al \nb f(x_{k+1}))\), the expressions involve \(x_{k+1},z_{k+1}\) and do not telescope.
- Thus, do the mirror descent starting from \(z_k\) instead. If we try to bound the regret now,
\[\begin{align}
\al \an{\nb f(x_{k+1}), x_{k+1}-u} &\le \fc{\al^2}{2} \ve{\nb f(x_k)}^2 + V_{x_{k+1}}(u) - V_{z_{k+1}}(u)\\
&\le \al^2 L (\nb f(x_{k+1}) - \nb f(y_{k+1})) + V_{x_{k+1}}(u) - V_{z_{k+1}}(u).
\end{align}\]
which still does not telescope.
- We want \(z_k\) to take the place of \(x_{k+1}\).
- Now write the real as the fake regret plus another term.
\[\begin{align}
\al \an{\nb f(x_{k+1}), z_{k}-u} & \le \fc{\al^2}{2} \ve{\nb f(x_k)}^2 + V_{z_k}(u) - V_{z_{k+1}}(u)\\
&\le \al^2 L (\nb f(x_{k+1}) - \nb f(y_{k+1})) + V_{z_k}(u) - V_{z_{k+1}}(u) \label{eq:lem3-2}\\
\al \an{\nb f(x_{k+1}),x_{k+1}-u}
&\le \al \an{\nb f(x_{k+1}), z_{k}-u} + \al \an{\nb f(x_{k+1}), x_{k+1}-z_k}\\
&\le \al \an{\nb f(x_{k+1}), z_{k}-u} + \fc{(1-\tau)\al}{\tau} \an{\nb f(x_{k+1}), y_{k}-x_{k+1}}&\text{by \eqref{eq:weighted}}\\
&\le \al^2 L (\nb f(x_{k+1}) - \nb f(y_{k+1})) + V_{z_k}(u) - V_{z_{k+1}}(u) + \fc{(1-\tau)\al}{\tau}\an{\nb f(x_{k+1}), y_k-x_{k+1}}&\eqref{eq:lem3-2},\text{ convexity}\\
&\le \al^2L(f(y_{k})-f(y_{k+1})) + V_{z_k}(u) - V_{z_{k+1}(u)}.
\end{align}\]
Telescoping, balancing \(\al^2Ld=\Te\implies \al = \sfc{\Te}{Ld}\), we get error \(\fc{2\sqrt{L\Te d}}{T}\). It takes \(T=4\sfc{L\Te}{d}\) steps to halve error from \(\fc d2\), and \(O\pa{\sfc{L\Te}{\ep}}\) time to go to \(\ep\) error.
Index cards:
- Gradient lemma
- Gradient guarantee
- DGF and Bregman divergence
- Mirror step
- Mirror lemma
- Mirror guarantee
- Linear coupling
- Linear coupling lemma
- Linear coupling guarantee
- Describe the linear coupling framework. What are the update rules?
- Give the linear coupling lemma.
Page notes
- estimation sequence? Thought experiment. Cutoff \(K\) for \(\ve{\nb f(x)}_2\). Equate gradient and mirror: \[\fc{\ep L}{K^2}=\fc{K^2}{\ep^2}.\]
- Gradient descent guarantee. \(f(x_T)-f(x^*) \le O\pf{L\ve{x_0-x^*}_2^2}{T}\). Distance generating function, Bregman divergence.