Second-order methods

Posted: 2016-04-22 , Modified: 2016-04-22

Tags: Newton, second-order

See gradient descent.

Main points

Proofs

Convergence

Proof (for Lipschitz Hessian). We do the calculations in 1 dimension; in higher dimensions the calculations are the same, except with vectors and matrices. Let \(\te(y)\) denote a quantity in \([-y,y]\). Suppose WLOG that the current point is \(x=0\) with \(f(0)=0\). Let \(d=f'(0)\), \(a=f''(0)\), \(\la = \fc{d}{a^{\rc 2}}\) (the Newton decrement), and \(\De x_{nt} = -\fc{d}{a} = -\fc{\la}{a^{\rc 2}}\). Then \[\begin{align} f(x) &= dx + \rc 2 ax^2 + \te\pa{\rc 6 Lx^3}\\ f\pa{-\fc da} &= -\rc 2 \fc{d^2}a + \te\pa{\rc 6 L \fc{d^3}{a^3}}\\ &\le -\fc{d^2}{a} \ub{\pa{\rc 2 - \rc 6 L \fc{d}{a^2}}}{\ge \al}. \end{align}\]

In order for the underbraced quantity to be \(\ge \al\), noting \(\fc{d}{a^2} = \fc{\la}{a^{\fc 32}}\), we want \(\la \le \fc{3(1-2\al) a^{\fc 32}}L\); since \(a\ge m\), it suffices that \(f'(0)\le \fc{3(1-2\al)m^2}{L}\).
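
A quick numeric check of the guaranteed decrease (the test function, with \(m=L=1\) and \(\al=0.25\), is my own choice, not from these notes):

```python
import numpy as np

# Toy strongly convex function with Lipschitz Hessian:
# f''(x) = 2 - cos x lies in [1, 3], so m = 1; |f'''(x)| = |sin x| <= 1 = L.
f   = lambda x: x**2 + np.cos(x)
df  = lambda x: 2*x - np.sin(x)
d2f = lambda x: 2 - np.cos(x)

m, L, alpha = 1.0, 1.0, 0.25
eta = 3 * (1 - 2*alpha) * m**2 / L   # threshold on |f'(0)|: here 1.5

for x in [0.9, 0.5, 0.2]:
    d, a = df(x), d2f(x)
    assert abs(d) <= eta             # within the regime covered by the bound
    decrease   = f(x) - f(x - d/a)   # actual decrease from a full Newton step
    guaranteed = alpha * d**2 / a    # claimed decrease: at least alpha*d^2/a
    print(f"x={x:4.2f}  decrease={decrease:.4f}  >=  {guaranteed:.4f}")
    assert decrease >= guaranteed
```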

Note that unlike for linear convergence, we don't prove something like \(f(x') - f(x^*)\le \cdots\); we have to work with the gradient to get quadratic convergence. (The gradient also controls the distance to the optimum.) We have \[\begin{align} f'(x) &= d+ ax + \te\pa{\rc 2 L x^2}\\ \ab{f'\pa{-\fc da}} & \le \rc 2 L\fc{d^2}{a^2} \le \fc{L}{2m^2}f'(0)^2. \end{align}\]

Proof (for self-concordant). Work in 1-D. Instead of integrating the self-concordance inequality \(|f'''|\le 2(f'')^{\fc 32}\) directly, we let \(F(y)=y^{-\rc 2}\) and integrate \(\int (F(f''))'\). \[\begin{align} \ab{f''(t)^{-\rc 2} - f''(0)^{-\rc 2}} &= \ab{\int_0^t ((f'')^{-\rc 2})'} \le \int_0^t \ab{-\rc 2 (f'')^{-\fc 32} f'''}\le t\\ \implies \rc{(-t + f''(0)^{-\rc 2})^2} &\ge f''(t) \ge \rc{(t+f''(0)^{-\rc 2})^2}\\ \implies \rc{-t+f''(0)^{-\rc 2}} - f''(0)^{\rc 2} & \ge f'(t)-f'(0) \ge f''(0)^{\rc 2} - \rc{t+f''(0)^{-\rc 2}}\\ \implies f(t) &\le f(0) + (f'(0) - f''(0)^{\rc 2}) t + \ln \pf{f''(0)^{-\rc 2}}{f''(0)^{-\rc 2}-t}\\ \implies f\pa{-t\fc{f'(0)}{f''(0)}} & \le f(0) - \la^2 t - \la t - \ln (1-t\la). \end{align}\] (The last line applies the previous bound along the Newton direction: for \(g(t) = f\pa{-t\fc{f'(0)}{f''(0)}}\) we have \(g'(0) = -\la^2\) and \(g''(0)^{\rc 2} = \la\), where \(\la=\la(0)\) is the Newton decrement; the bound is valid for \(0\le t\la<1\).)
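
To see the gradient squaring in action (same toy function as in the sketch above, so \(\fc{L}{2m^2}=\rc 2\); again my own illustration):

```python
import numpy as np

df  = lambda x: 2*x - np.sin(x)   # f(x) = x^2 + cos x again: m = L = 1
d2f = lambda x: 2 - np.cos(x)

x = 0.9
for _ in range(4):
    g = df(x)
    # next gradient should satisfy |f'(x+)| <= (L/2m^2) f'(x)^2 = g^2/2
    print(f"|f'(x)| = {abs(g):.3e}   bound on next: {g**2/2:.3e}")
    x -= g / d2f(x)               # pure Newton step
```

The printed gradient magnitudes roughly square at each step, which is exactly the quadratic-convergence claim.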

Now consider 2 cases: \(\la\) large (the damped phase, where a backtracking step gives a constant decrease per iteration) and \(\la\) small (where the full Newton step gives quadratic convergence). A numeric check of the bound above is sketched below.
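
Here is a minimal check of the final inequality. The test function \(f(x) = x - \ln x\) is my own choice; it satisfies \(|f'''| = 2/x^3 = 2(f'')^{\fc 32}\) exactly, so the bound should be tight (and it is, up to floating-point error):

```python
import numpy as np

# Extremal self-concordant example: f(x) = x - ln x on x > 0.
f   = lambda x: x - np.log(x)
df  = lambda x: 1 - 1/x
d2f = lambda x: x**-2

x   = 3.0
lam = abs(df(x)) / np.sqrt(d2f(x))   # Newton decrement lambda(x) = 2 here
dx  = -df(x) / d2f(x)                # Newton step

for t in [0.1, 0.3, 0.45]:           # need t*lam < 1
    lhs = f(x + t*dx)
    rhs = f(x) - lam**2 * t - lam*t - np.log(1 - t*lam)
    print(f"t={t}: f(x+t*dx) = {lhs:.6f}  <=  {rhs:.6f}")
    assert lhs <= rhs + 1e-9
```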

More intuition

Implementation issues

  1. Precompute line searches: it can be more efficient to compute \(f(x+t\De x)\) for many values of \(t\) simultaneously; e.g., if \(f\) depends on \(x\) through an affine map \(Ax+b\), precompute \(A\De x\) and \(Ax+b\) once, so each evaluation along the line is cheap.
  2. Computing \(\De x_{nt}=-H^{-1}\nb f(x)\) is more efficient if \(H\) has band structure, sparsity, etc.; see the sketch after this list.
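
A sketch of the second point (the tridiagonal \(H\) and SciPy's symmetric banded solver are my own illustrative choices):

```python
import numpy as np
from scipy.linalg import solveh_banded

n = 6
# Toy tridiagonal positive-definite Hessian: 2 on the diagonal, -1 off it.
H = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
g = np.ones(n)                      # stand-in for the gradient

dx_dense = -np.linalg.solve(H, g)   # dense solve: O(n^3)

# Banded solve: O(n) for fixed bandwidth. solveh_banded takes the upper
# bands of the symmetric matrix, one band per row.
ab = np.zeros((2, n))
ab[0, 1:] = -1                      # superdiagonal
ab[1, :]  = 2                       # main diagonal
dx_banded = -solveh_banded(ab, g)

assert np.allclose(dx_dense, dx_banded)
```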

Review

  1. Describe the Newton method. What is the step size and Newton decrement?
  2. What is the convergence rate for functions with Lipschitz Hessian?
  3. Define self-concordant. What is the convergence for self-concordant functions?

Scraps

? 9.31,

estimating Hessian?