(AO15) Linear coupling

Posted: 2016-03-06 , Modified: 2016-03-06

Tags: gradient

Gradient Descent

Definition: A convex function \(f\) is \(L\)-smooth if for all \(x,y\), \[ f(y) \le \blu{f(x)+\an{\nb f(x),y-x}}+\redd{\fc L2 \ve{y-x}^2}\] or equivalently \[ \ve{\nb f(x)-\nb f(y)}_*\le L\ve{x-y}.\]
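As a quick example (my own sanity check, not from the paper): \(f(x)=\rc2\ve{x}_2^2\) is \(1\)-smooth, since \(\nb f(x)=x\) gives \(\ve{\nb f(x)-\nb f(y)}_2=\ve{x-y}_2\), and the first condition holds with equality: \(f(y)=f(x)+\an{x,y-x}+\rc2\ve{y-x}_2^2\).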

Assume \(f\) is \(L\)-smooth.

Claim (gradient descent guarantee): with step size \(\rc L\), i.e. \(x_{k+1}=x_k-\rc L \nb f(x_k)\), we have \(f(x_T)-f(x^*)\le \fc{2LR^2}{T}\), where \(R\ge \ve{x_k-x^*}\) for all \(k\).

Proof

See also gradient descent.

Let \(D_k = f(x_k) - f(x^*)\). \[\begin{align} \ub{f(x_k)-f(x^*)}{D_k}&\le \an{\nb f(x_k),x_k-x^*} \le \ve{\nb f(x_k)}_* \ve{x_k-x^*}\le R \ve{\nb f(x_k)}_*\\ D_k - D_{k+1} &\ge \rc{2L}\ve{\nb f(x_k)}_*^2\\ D_k^2 & \le 2LR^2(D_k-D_{k+1})\\ \rc{D_{k+1}}-\rc{D_k} &= \fc{D_k-D_{k+1}}{D_kD_{k+1}} \ge \rc{2LR^2}\fc{D_k}{D_{k+1}} \ge \rc{2LR^2}\\ \implies \rc{D_T}&\ge \fc{T}{2LR^2} \implies D_T \le \fc{2LR^2}{T}. \end{align}\]
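For concreteness, a minimal sketch of the gradient step used in this proof (step size \(\rc L\)); the quadratic objective and constants are my own illustration, not from the paper.

```python
import numpy as np

# Toy L-smooth objective f(x) = 1/2 x^T A x; here L = largest eigenvalue of A.
A = np.diag([1.0, 4.0, 10.0])
L = 10.0
x_star = np.zeros(3)  # minimizer

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

x = np.array([1.0, 1.0, 1.0])
T = 100
for _ in range(T):
    # Each step decreases f by at least ||grad f(x)||^2 / (2L), the D_k - D_{k+1} bound above.
    x = x - grad_f(x) / L

# By the claim, f(x_T) - f(x*) <= 2 L R^2 / T with R = ||x_0 - x*||.
print(f(x) - f(x_star))
```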

Mirror descent

Definition: \(w(x):Q\to \R\) (\(Q\) convex) is a distance generating function if it is \(1\)-strongly convex with respect to \(\ved\), i.e., the first inequality below holds for all \(x\in Q\bs \pl Q, y\in Q\); the Bregman divergence \(V_x(y)\) is then defined by the second line. \[\begin{align} w(y) &\ge w(x) + \an{\nb w(x),y-x} + \rc2\ve{x-y}^2\\ V_x(y) & = w(y) - \an{\nb w(x),y-x} - w(x). \end{align}\]

Note we can replace the gradient by a subgradient below.

Examples

\[\begin{align} w(y) &= \rc2 \ve{y}_2^2 &\text{w.r.t. }\ell_2\\ V_x(y) &= \rc2\ve{x-y}_2^2\\ w(y) &= \sum_i y_i \ln y_i &\text{w.r.t. }\ell_1\text{ over }\De\\ V_x(y) &= \sum_{i} y_i \ln \pf{y_i}{x_i} \ge \rc2\ve{x-y}_1^2 &\text{(Pinsker)}. \end{align}\]
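A quick numerical sketch of these two Bregman divergences (the vectors are my own toy data): the Euclidean DGF gives half the squared \(\ell_2\) distance, and the entropy DGF over \(\De\) gives the KL divergence, which Pinsker's inequality lower-bounds by \(\rc2\ve{x-y}_1^2\).

```python
import numpy as np

def bregman(w, grad_w, x, y):
    """V_x(y) = w(y) - <grad w(x), y - x> - w(x)."""
    return w(y) - grad_w(x) @ (y - x) - w(x)

# Euclidean: w(y) = 1/2 ||y||_2^2  ==>  V_x(y) = 1/2 ||x - y||_2^2.
w_euc = lambda y: 0.5 * y @ y
grad_w_euc = lambda y: y

# Negative entropy over the simplex: w(y) = sum_i y_i log y_i
#   ==>  V_x(y) = KL(y || x) >= 1/2 ||x - y||_1^2  (Pinsker).
w_ent = lambda y: np.sum(y * np.log(y))
grad_w_ent = lambda y: np.log(y) + 1.0

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.2, 0.3, 0.5])
print(bregman(w_euc, grad_w_euc, x, y), 0.5 * np.sum((x - y) ** 2))
print(bregman(w_ent, grad_w_ent, x, y), 0.5 * np.sum(np.abs(x - y)) ** 2)
```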

Intuition

Note that for \(\ved_2\) (i.e., \(w(y)=\rc2\ve{y}_2^2\)), MD is the same as GD except that the step size is \(\al\) rather than \(\rc L\).

Note there are 3 equivalent formulations of mirror descent.

  1. As above (the \(\text{Mirr}\) operator used below): \[x_{k+1} = \text{Mirr}_{x_k}(\al \nb f(x_k)) = \amin_y \pa{V_{x_k}(y) + \an{\al \nb f(x_k), y}}.\]
  2. Set \[\nb w(x_{k+1}) \leftarrow \nb w(x_k) - \al \nb f(x_{k}).\] (By this we mean take the value of \(x_{k+1}\) that makes this true.)
  3. Regularized follow the leader (RFTL): Take \[x_{k+1} = \amin_y w(y)+ \al \an{y, \sum_{i=0}^k \nb f(x_i)}.\]

(1)\(\iff\)(2): The min in (1) is where the gradient in \(y\) is 0: \(\nb w(x_{k+1}) - \nb w(x_k) + \al \nb f(x_k) = 0\), which is exactly (2).

(2)\(\iff\)(3): The min is when the gradient is 0. Write this out for \(k\) and \(k-1\) and subtract (for the base case take \(x_0 = \amin_y w(y)\), so \(\nb w(x_0)=0\)). \[\begin{align} \nb w(x_{k+1}) + \al \sum_{i=0}^k \nb f(x_i) &=0\\ \nb w(x_k) + \al \sum_{i=0}^{k-1} \nb f(x_i) &=0\\ \implies \nb w(x_{k+1})-\nb w(x_k) + \al \nb f(x_k)&=0. \end{align}\]
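A small sketch of formulation (2) with the entropy DGF: \(\nb w(x) = \ln x + 1\) coordinatewise, so inverting the mirror map and Bregman-projecting onto \(\De\) (which is just normalization) gives the multiplicative-weights update. The linear objective below is my own toy example.

```python
import numpy as np

def mirror_step_entropy(x, g, alpha):
    """Formulation (2): set grad w(x_{k+1}) = grad w(x_k) - alpha * g for
    w(x) = sum_i x_i log x_i, then Bregman-project onto the simplex.
    Since grad w(x) = log(x) + 1, this is x_{k+1} proportional to x_k * exp(-alpha * g)."""
    z = x * np.exp(-alpha * g)
    return z / z.sum()

# Toy objective f(x) = <c, x> over the simplex; grad f = c everywhere.
c = np.array([3.0, 1.0, 2.0])
x = np.ones(3) / 3
alpha = 0.1
for _ in range(200):
    x = mirror_step_entropy(x, c, alpha)

print(x)  # mass concentrates on the coordinate with the smallest c_i
```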

Coupling

Unconstrained version (\(Q=\R^n\))

The iterates are coupled as \(x_{k+1} = \tau z_k + (1-\tau) y_k\) (this is \eqref{eq:weighted}), where \(y_{k+1}\) is a gradient step from \(x_{k+1}\) (step size \(\rc L\)) and \(z_{k+1} = \text{Mirr}_{z_k}(\al \nb f(x_{k+1}))\) is a mirror step.

Proof

  1. Why the weird definition for \(x_{k+1}\)? If we defined \(z_{k+1}=\text{Mirr}_{x_{k+1}}(\al \nb f(x_{k+1}))\), the expressions involve \(x_{k+1},z_{k+1}\) and do not telescope.
  2. With that definition, the regret bound would be \[\begin{align} \al \an{\nb f(x_{k+1}), x_{k+1}-u} &\le \fc{\al^2}{2} \ve{\nb f(x_{k+1})}_*^2 + V_{x_{k+1}}(u) - V_{z_{k+1}}(u)\\ &\le \al^2 L (f(x_{k+1}) - f(y_{k+1})) + V_{x_{k+1}}(u) - V_{z_{k+1}}(u), \end{align}\] and the Bregman terms still do not telescope (the next iteration contributes \(V_{x_{k+2}}(u) - V_{z_{k+2}}(u)\), and \(z_{k+1}\ne x_{k+2}\)). Thus, do the mirror descent step starting from \(z_k\) instead: \(z_{k+1}=\text{Mirr}_{z_k}(\al \nb f(x_{k+1}))\).
  3. We want \(z_k\) to take the place of \(x_{k+1}\).
  4. Now write the real regret as the fake regret plus another term.
\[\begin{align} \al \an{\nb f(x_{k+1}), z_{k}-u} & \le \fc{\al^2}{2} \ve{\nb f(x_{k+1})}_*^2 + V_{z_k}(u) - V_{z_{k+1}}(u)\\ &\le \al^2 L (f(x_{k+1}) - f(y_{k+1})) + V_{z_k}(u) - V_{z_{k+1}}(u) \label{eq:lem3-2}\\ \al \an{\nb f(x_{k+1}),x_{k+1}-u} &= \al \an{\nb f(x_{k+1}), z_{k}-u} + \al \an{\nb f(x_{k+1}), x_{k+1}-z_k}\\ &= \al \an{\nb f(x_{k+1}), z_{k}-u} + \fc{(1-\tau)\al}{\tau} \an{\nb f(x_{k+1}), y_{k}-x_{k+1}}&\text{by \eqref{eq:weighted}}\\ &\le \al^2 L (f(x_{k+1}) - f(y_{k+1})) + V_{z_k}(u) - V_{z_{k+1}}(u) + \fc{(1-\tau)\al}{\tau}(f(y_k)-f(x_{k+1}))&\eqref{eq:lem3-2},\text{ convexity}\\ &\le \al^2L(f(y_{k})-f(y_{k+1})) + V_{z_k}(u) - V_{z_{k+1}}(u) &\text{choosing }\fc{1-\tau}{\tau}=\al L. \end{align}\]

Telescoping over \(k=0,\dots,T-1\), taking \(u=x^*\), and using convexity gives \(\al T(f(\bar x)-f(x^*)) \le \al^2 L d + \Te\), where \(\bar x\) is the average of \(x_1,\dots,x_T\), \(d\ge f(y_0)-f(x^*)\), and \(\Te\ge V_{z_0}(x^*)\). Balancing \(\al^2Ld=\Te\implies \al = \sfc{\Te}{Ld}\), we get error \(\fc{2\sqrt{L\Te d}}{T}\). It takes \(T=4\sfc{L\Te}{d}\) steps to halve the error from \(d\) to \(\fc d2\), and restarting with the halved bound, \(O\pa{\sfc{L\Te}{\ep}}\) iterations to get to \(\ep\) error.
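A sketch of the coupled iteration under the updates stated above, specialized to the Euclidean DGF so that \(\text{Mirr}_{z_k}(\al \nb f(x_{k+1})) = z_k - \al \nb f(x_{k+1})\); the test quadratic, \(\al\), and \(T\) are my own choices, and \(\tau\) is set so that \(\fc{1-\tau}{\tau}=\al L\) as in the last step of the proof.

```python
import numpy as np

# Toy L-smooth objective f(x) = 1/2 x^T A x (my own test function).
A = np.diag([1.0, 4.0, 10.0])
L = 10.0

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

alpha = 0.05
tau = 1.0 / (1.0 + alpha * L)      # so that (1 - tau) / tau = alpha * L

x = y = z = np.array([1.0, 1.0, 1.0])
T = 100
for _ in range(T):
    x = tau * z + (1 - tau) * y    # coupling: x_{k+1} = tau z_k + (1 - tau) y_k
    g = grad_f(x)
    y = x - g / L                  # gradient step from x_{k+1}
    z = z - alpha * g              # mirror step from z_k (Euclidean DGF)

print(f(y))  # f(y_T) - f(x*); here x* = 0
```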

Index cards:

  1. Describe the linear coupling framework. What are the update rules?
  2. Give the linear coupling lemma.

Page notes

  1. estimation sequence? Thought experiment. Cutoff \(K\) for \(\ve{\nb f(x)}_2\). Equate gradient and mirror: \[\fc{\ep L}{K^2}=\fc{K^2}{\ep^2}.\]
  2. Gradient descent guarantee. \(f(x_T)-f(x^*) \le O\pf{L\ve{x_0-x^*}_2^2}{T}\). Distance generating function, Bregman divergence.