(HMR16) Gradient descent learns linear dynamical systems
Posted: 2016-12-27 , Modified: 2016-12-27
Tags: dynamical systems, quasi-convexity
Posted: 2016-12-27 , Modified: 2016-12-27
Tags: dynamical systems, quasi-convexity
Main theorem:
Note:
A function \(f\) is \(\tau\)-weakly-quasi-convex (\(\tau\)-WQC) over \(B\) with respect to a global minimum \(\te^*\) if there is a positive \(\tau>0\) such that for all \(\te\in B\), \[ \nb f(\te)^T(\te-\te^*) \ge \tau (f(\te) - f(\te^*)). \] \(f\) is \(\Ga\)-weakly smooth if \[ \ve{\nb f(\te)}^2 \le \Ga(f(\te)-f(\te^*)). \]
Note a convex function satisfies \[ f(\te^*) - f(\te) \ge \nb f(\te)^T (\te^*-\te) \] so is 1-WQC. If \(f''\ge a\) everywhere, then \(f(\te)-f(\te^*) \ge \ve{\nb f(\te)}^2\) so \(f\) is \(4a\)-weakly smooth.
For stochastic gradient descent, weak QC and S are enough for convergence guarantees.
Theorem. Suppose \(f\) is \(\tau\)-WQC and \(\Ga\)-WS and \(r\) is an unbiased estimator for \(\nb f(\te)\) with variance \(V\), \(\te^*\in B\), \(\ve{\te_0-\te^*}\le R\). Then projected SGD with a suitable learning rate returns \(\te_K\) with error \[ \E f(\te_K)-f(\te^*) \le O\pa{ \max\bc{\fc{\Ga R^2}{\tau^2 K}, \fc{R\sqrt V}{\tau \sqrt K}} }. \] (Dependence on \(K\) is most important here. Check this.) (Variance makes the convergence \(\rc{\sqrt K}\) rather than \(\rc K\).)
(This is a SISO - single-input single-output system.) The population risk is \[ f(\wh \Te) = \EE_{\{x_t\}, \{\xi_t\}} \ba{ \rc T \sumo tT \ve{\wh y_t - y_t}^2 } \] where \(\wh y_t\) is generated from the estimated parameters without noise. Evert SISO of order \(n\) can be put into controllable canonical form \[ A = \matt{\vec 0}{I_{n-1}}{-a_n}{-a_{n-1:1}}, \quad B = \colthree{0}{\vdots}{1} = e_n. \] Write \(A=CC(a)\), where \(a=-[a_n,\ldots, a_1]\).
A SISO is controllable iff \([B|AB|\cdots|A^{n-1}B]\) has rank \(n\). (See Control theory/2 Controllability.)
Let \[ p_a(z) = z^n + a_1z^{n-1}+\cdots + a_n = \det (zI-A) \] be the characteristic polynomial of \(A\).
The transfer function of the linear system is \[\begin{align} G(z) &= C(zI - A)^{-1}B=:\fc{s(z)}{p(z)}\\ & = \sumo t{\iy} z^{-t} \ub{CA^{t-1}B}{r_{t-1}}\\ & = \fc{c_1+\cdots + c_n z^{n-1}}{z^n + a_1z^{n-1}+\cdots + a_n} \end{align}\]where \(p\) is monic. (We have \(M^{-1} = \fc{\text{cofactor}(M)}{\det(M)}\).)
(Why is the transfer function useful?)
Define the idealized risk as \[ g(\wh A, \wh C) = \sumz k{\iy} (\wh C\wh A^k B - CA^k B)^2. \] In the Fourier domain, \[ g(\wh A, \wh C) = \int_0^{2\pi} |\wh G (e^{i\te}) - G(e^{i\te})|^2\,d\te \] (The hat on the parameters means “estimate”, the hat on \(G\) means “Fourier transform”.)
Explanation. By unfolding the recurrences, then taking the average, \[\begin{align} \E [\ve{\wh y_t - y_t}^2] & = \ve{\wh D - D}^2 + \sumo k{t-1} \ve{\wh C \wh A^{t-k-1} \wh B - CA^{t-k-1} B}^2 + \E[\ve{\xi_t}^2]\\ f(\wh \Te) & = \ve{\wh D - D}^2 + \sumo k{t-1} \pa{1-\fc kT}\ve{\wh C \wh A^{k-1} B - CA^{k-1} B}^2 + \si^2. \end{align}\] The idealized risk takes \(T\to \iy\), and ignores the first term. Now by Parseval’s Theorem, \[\begin{align} G(z) &= \sumo t{\iy} CA^{t-1} Bz^{-t}\\ \wh G(z) &= \sumo t{\iy} \wh C \wh A^{t-1} B z^{-t}\\ \int_0^{2\pi} |\wh G(e^{i\te}) - G(e^{i\te})|^2 & = \sumo t{\iy} \ve{CA^{t-1}B - \wh C \wh A^{t-1} B}^2. \end{align}\](This can give some motivation for definition of \(G\). Simply put \(CA^kB\) as coefficients of power series; we get \(G\). This also motivates looking at the Fourier series.)
(Why is it useful to pass to the Fourier domain?)
Rouche’s Theorem. wikipedia
Let \(T_1=\fc T4\). Using loss function \[ \ell((x,y),\wh\Te) = \rc{T-T_1} \sum_{t>T_1} \ve{\wt y_t - y_t}^2, \] gradients wrt \(\wh a, \wh C, \wh D\) are almost (exponentially) unbiased.
Omitted.
Ex. when we can’t hope to properly recover: \(G(z) = G_1(z) \fc{z^n-r_1^n}{z^n-r_2^n}\) where \(r_1\approx r_2\approx 1\). They are exponentially close but the characteristic polys are very different.
Approximation: Prove that the inverse of a poly can be easily approximated (cf. proof for conjugate gradients, Chebyshev approx).
Using the approximation result, get \(\ab{\fc{pu}{z^{n+d}} - 1} <0.5\), so \(\fc{pu}{z^{n+d}}\in C\) for appropriate \(\tau_i\).