High-dimensional probability
Posted: 2016-08-01 , Modified: 2016-08-04
Tags: random matrix
Based on Ramon von Handel’s ORF570 notes.
Themes:
Trivial bound: \[ \Var[f(X)]\le \rc4 (\sup f - \inf f)^2 \qquad \Var[f(X)] \le \E[(f(X)-\inf f)^2].\]
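Spelled out, using that the variance minimizes the mean squared deviation over constant centerings: \[ \Var[f(X)] = \inf_a \E[(f(X)-a)^2] \le \E\ba{\pa{f(X) - \fc{\sup f+\inf f}{2}}^2}\le \rc4(\sup f - \inf f)^2, \] and taking \(a=\inf f\) gives the second bound.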
Tensorization gives a bound for functions of independent random variables from bounds for functions of each individual random variable.
Theorem (Tensorization of variance): \[\Var[f(X_1,\ldots, X_n)]\le \E\ba{\sumo in \Var_i f(X_1,\ldots, X_n)}\] whenever \(X_{1:n}\) are independent.
This is sharp for linear functions.
Proof. Write \[ f(X_{1:n}) - \E f(X_{1:n}) = \sumo kn \ub{\E[f(X_{1:n})|X_{1:k}] - \E[f(X_{1:n})|X_{1:k-1}]}{\De_k}. \] The \(\De_k\) form a martingale. By orthogonality of martingale increments, \[\begin{align} \Var(f) &= \sumo kn \E[\De_k^2], \\ \E[\De_k^2] &= \E[\E[\wt \De_k |X_{1:k}]^2]\\ & \le \E[(\ub{f - \E[f|X_{1:k-1},X_{k+1:n}]}{\wt \De_k})^2] & \text{Jensen}\\ &= \E[\Var_k f(X_1,\ldots, X_n)], \end{align}\] where independence gives \(\De_k = \E[\wt\De_k|X_{1:k}]\); summing over \(k\) gives the claim.

Define \[\begin{align} D_i f(x) &= (\sup_z-\inf_z)(f(x_{1:i-1},z,x_{i+1:n}))\\ D_i^- f(x) &= f(x) - \inf_z(f(x_{1:i-1},z,x_{i+1:n})).\\ \end{align}\]

Corollary (Bounded difference inequality): Tensorization + trivial inequality.
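Written out, combining tensorization with the two forms of the trivial bound above, the corollary reads \[\Var[f(X_{1:n})] \le \rc 4 \E\ba{\sumo in (D_i f(X))^2} \qquad\text{and}\qquad \Var[f(X_{1:n})] \le \E\ba{\sumo in (D_i^- f(X))^2}.\]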
Example: Consider Bernoulli symmetric matrices. What is the variance of \(\la_{\max}(M) = \an{v_{\max}(M), Mv_{\max}(M)}\)? Fix \(i,j\). Let \[M^- = \amin_{\text{only }M_{ij} \text{ varies}} \la_{\max}(M).\] Then \[D_{ij}^-\la_{\max}(M) \le \an{v_{\max}(M), (M-M^-) v_{\max}(M)}\le 4|v_{\max}(M)_i||v_{\max}(M)_j|.\] Use the corollary to get \(\Var(\la_{\max}(M))\le 16\).
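The arithmetic, assuming the independent coordinates are the entries \((M_{ij})_{i\le j}\): \[\Var(\la_{\max}(M)) \le \E\ba{\sum_{i\le j}(D_{ij}^-\la_{\max}(M))^2} \le \E\ba{16\sum_{i,j} v_{\max}(M)_i^2\, v_{\max}(M)_j^2} = 16\,\E\ba{\ve{v_{\max}(M)}_2^4} = 16.\]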
This is not sharp. (\(\sim n^{-\rc 3}\) is correct.)
Drawbacks to this method:
Inequalities in this section are roughly of the following form (Poincare inequalities): \[\Var(f) \le \E[\ve{\nb f}^2].\] “The validity of a Poincaré inequality for a given distribution is intimately connected to the convergence rate of a Markov process that admits that distribution as a stationary measure.”
A Markov process satisfies: For every bounded measurable \(f\) and \(s,t\in \R_+\), there is a bounded measurable \(P_sf\) such that \[\E[f(X_{t+s})|\{X_r\}_{r\le t}] = (P_s f)(X_t).\] \(\mu\) is stationary if \(\mu(P_tf)=\mu(f)\) for all \(t\in \R_+\), bounded measurable \(f\).
Lemma 2.7. \(\{P_t\}_{t\in \R_+}\) defines a semigroup of linear operators on \(L^p(\mu)\). It is contractive and conservative (\(P_t1=1\) \(\mu\)-a.s.).
Proof. Jensen.
The semigroup in fact acts on any \(f\in L^1(\mu)\).
Lemma 2.9. If \(\mu\) is stationary, for every \(f\), \(\Var_\mu(P_tf)\) is decreasing.
Proof. \(L^2\) contractivity and semigroup property.
The generator is \[\cal L f = \lim_{t\searrow 0} \fc{P_tf-f}t.\] The set of \(f\) where this is defined is the domain; \(\cal L:\text{Dom}(\cal L) \to L^2(\mu)\).
Warning: for Markov processes with continuous sample paths, such as Brownian motion, \(\text{Dom}(\cL)\) is a strict subset of \(L^2(\mu)\). Functional analysis is required for rigor, but results usually extend.
\(P_t\) is the solution of the Kolmogorov equation \[ \ddd{t} P_t f = P_t \cL f, \quad P_0f=f.\] The generator and semigroup commute.
A finite-state Markov process with \[ \Pj[X_{t+\de}=j|X_t=i] = \la_{ij} \de + o(\de), \quad i\ne j\] has generator equal to the transition rate matrix \(\La\) (with \(\La_{ij}=\la_{ij}\) for \(i\ne j\) and \(\La_{ii}=-\sum_{j\ne i}\la_{ij}\)).
Then \(P_t=e^{t\La}\). (In the non-finite case, this makes sense as a power series.)
The semigroup is reversible if each \(P_t\) is self-adjoint on \(L^2(\mu)\): \[\an{f,P_tg}_\mu = \an{P_tf,g}_\mu.\] Reversibility implies \[P_tf(x) =\E[f(X_t)|X_0=x] = \E[f(X_0)|X_t=x];\] i.e., the Markov process viewed backwards has the same law.
For finite state space, this is equivalent to \[\mu_i \La_{ij} = \mu_j \La_{ji},\] detailed balance.
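A minimal sanity check (a standard example, not from the notes): the two-state chain with jump rates \(a,b>0\) has \[\La = \begin{pmatrix} -a & a\\ b & -b\end{pmatrix}, \qquad \mu = \pa{\fc{b}{a+b},\ \fc{a}{a+b}},\] and detailed balance holds since \(\mu_1\La_{12} = \fc{ab}{a+b} = \mu_2\La_{21}\). The eigenvalues of \(\La\) are \(0\) and \(-(a+b)\), so \(P_t = e^{t\La}\) converges to \(\mu\) at rate \(e^{-(a+b)t}\) (spectral gap \(a+b\)).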
Definition. A Markov semigroup is ergodic if \(P_tf \to \mu f\) in \(L^2(\mu)\) as \(t\to \iy\) for every \(f\in L^2(\mu)\).
A measure \(\mu\) satisfies a Poincare inequality for a certain notion of “gradient” if and only if an ergodic Markov semigroup associated to this “gradient” converges exponentially fast to \(\mu\).
The Dirichlet form is \[\cal E(f,g) = -\an{f,\cL g}_\mu.\] Note: for complex-valued functions, we take the real part.
Theorem (Poincare inequality). Let \(P_t\) be a reversible ergodic Markov semigroup with stationary measure \(\mu\). For \(c\ge 0\), TFAE:
Note \(5\Leftarrow 3\implies 1\Leftrightarrow 2\Rightarrow 4\) doesn’t require reversibility.
Example: Gaussian distribution
Note: The O-U process is the solution of the stochastic differential equation \[dX_t = -X_t \,dt + \sqrt2 \, dB_t.\] Revisit this after I learn stochastic calculus.
Tensorization using Poincare inequality:
We prove the Poincare inequality.
In finite dimensions, if \(\mu f=0\), \[ \cE(f,f) \ge (\ub{\la_1}{0}-\la_2) \Var_\mu(f). \] The best constant \(c\) in the Poincare inequality \(\Var_\mu(f)\le c\,\cE(f,f)\) is the reciprocal of the spectral gap \(\la_1-\la_2\), which controls the exponential convergence rate. Note it’s essential that \(\La\) admits a real spectral decomposition.
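Spelled out, assuming a real orthonormal eigendecomposition \(\La v_k = \la_k v_k\) in \(L^2(\mu)\) with \(v_1=1\) and \(0=\la_1\ge \la_2\ge\cdots\): writing \(f = \sum_k a_k v_k\), the condition \(\mu f=0\) gives \(a_1=0\), so \[\cE(f,f) = -\an{f,\La f}_\mu = \sum_{k\ge 2}(-\la_k)a_k^2 \ge (\la_1-\la_2)\sum_{k\ge 2}a_k^2 = (\la_1-\la_2)\Var_\mu(f).\]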
For \(L\)-Lipschitz \(f\): smooth \(f\) and use the Gaussian Poincare inequality to get \(\Var(f)\le \E\ve{\nb f}^2\le L^2\).
Note in one dimension we have \(\Var[f(X)]\le \E[(f(X)-f(0))^2]\le L^2\E[X^2] = L^2\), but this doesn’t help us: if we sum up such bounds along different coordinates, we get \(nL^2\) instead of \(L^2\).

For the semigroup that resamples each coordinate independently at rate 1, \(P_tf = \sum_I p_I E_I f\), where \(E_I\) integrates out the coordinates in \(I\) and \(p_I = (1-e^{-t})^{|I|}e^{-t(n-|I|)}\). Only the \(|I|=0,1\) terms are significant as \(t\to 0\), since \(\sum_{|I|=k} p_I = \binom{n}{k}(1-e^{-t})^k e^{-t(n-k)} = O(t^k)\).
\[\begin{align}\cL f &= \lim_{t\to 0^+} \pf{e^{-tn} f - f + \sum_i (1-e^{-t}) e^{-t (n-1)} \int f\, \mu_i (dx_i|x)}{t}\\ &= \sum_{i=1}^n \ub{\pa{\pa{\int f(x)\mu_i(dx_i|x)} - f(x)}}{=: - \de_if}\\ \cE (f, g) &= -\int f \cL g\,d\mu\\ &= -\int \pa{-\sumo in f\de_i g}\,d\mu\\ &= \sumo in \int \de_if\de_ig\,d\mu \end{align}\] where we used \(\int (f-\de_if)\de_ig\,d\mu=0\): \(f-\de_if = \int f\,\mu_i(dx_i|x)\) doesn’t depend on \(x_i\), while \(\de_ig\) has conditional mean 0 over \(x_i\).

Now apply \(5\implies 1\) of the Poincare theorem.
Chernoff bound: let \(\psi(\la) = \log \E[e^{\la(X-\E X)}]\) be the log-moment generating function and \(\psi^*(t) = \sup_{\la\ge 0}[\la t - \psi(\la)]\) its Legendre transform. Then \(\Pj(X-\E X \ge t) \le e^{-\psi^*(t)}\) for all \(t\ge 0\).
Proof. Exponentiate and Markov.
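In detail: for every \(\la\ge 0\), Markov’s inequality applied to \(e^{\la(X-\E X)}\) gives \[\Pj(X-\E X\ge t)\le \fc{\E[e^{\la(X-\E X)}]}{e^{\la t}} = e^{-(\la t - \psi(\la))}, \qquad\text{so}\qquad \Pj(X-\E X\ge t)\le \inf_{\la\ge 0}e^{-(\la t-\psi(\la))} = e^{-\psi^*(t)}.\]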
The log-moment generating function is continuous and can be investigated using calculus.
Example. Gaussian: \(\psi(\la) = \fc{\la^2\si^2}2\) and \(\psi^*(t) = \fc{t^2}{2\si^2}\) so bound of \(e^{-\fc{t^2}{2\si^2}}\).
A rv is \(\si^2\)-subgaussian if \(\psi(\la)\le \fc{\la^2\si^2}2\). Then we get tail bounds of \(e^{-\fc{t^2}{2\si^2}}\).
Lemma 3.6 (Hoeffding): If \(X\in [a,b]\) a.s., then \(X\) is \((b-a)^2/4\) subgaussian.
Proof. Interpret \(\psi''\) as a variance, get \(\psi''\le \fc{(b-a)^2}{4}\), integrate twice.
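In more detail: under the tilted measure \(d\Pj_\la \propto e^{\la X}\,d\Pj\) (still supported on \([a,b]\)), \[\psi''(\la) = \Var_{\Pj_\la}(X)\le \fc{(b-a)^2}{4}, \qquad \psi(0)=\psi'(0)=0,\] so integrating twice gives \(\psi(\la)\le \fc{\la^2(b-a)^2}{8}\), i.e., \(X\) is \(\fc{(b-a)^2}4\)-subgaussian.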
We want to show \(f\) is subgaussian with variance proxy controlled by a “square gradient” of \(f\).
The subgaussian property does not tensorize.
The proof of subgaussian inequalities can be reduced to a strengthened form of Poincare inequalities, log-Sobolev inequalities, that do tensorize.
Lemma (Azuma): Let \(\cF_k\) be a filtration, and 1. (Martingale difference) \(\De_k\) is \(\cF_k\)-measurable, \(\E[\De_k |\cF_{k-1}]=0\). 2. (Conditional subgaussian) \(\E[e^{\la \De_k}|\cF_{k-1}]\le e^{\la^2\si_k^2/2}\). Then \(\sumo kn \De_k\) is subgaussian with variance proxy \(\sumo kn \si_k^2\).
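The proof conditions on \(\cF_{n-1}\) and inducts: \[\E\ba{e^{\la \sumo kn \De_k}} = \E\ba{e^{\la \sumo k{n-1}\De_k}\,\E[e^{\la \De_n}|\cF_{n-1}]} \le e^{\la^2\si_n^2/2}\,\E\ba{e^{\la \sumo k{n-1}\De_k}} \le \cdots \le e^{\la^2\sumo kn \si_k^2/2}.\]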
Corollary (Azuma-Hoeffding): Replace (2) by \(A_k\le \De_k\le B_k\) where \(A_k,B_k\) are \(\cF_{k-1}\)-measurable. The variance proxy is \(\rc 4 \sumo kn \ve{B_k-A_k}^2_{\iy}\). The tail bound is \(\exp\pa{-\fc{2t^2}{\sumo kn \ve{B_k-A_k}^2_{\iy}}}\).
Theorem 3.11 (McDiarmid): For \(X_{1:n}\) independent, \(f(X)\) is subgaussian with variance proxy \(\rc 4\sumo kn \ve{D_kf}_{\iy}^2\) where \[D_if(x) = (\sup_z-\inf_z)f(x_{1:i-1},z,x_{i+1:n}).\]
Proof. Use Azuma-Hoeffding on martingale differences \(\De_k =\E[f|X_{1:k}] - \E[f|X_{1:k-1}]\).
This is unsatisfactory because the variance proxy is controlled by a uniform upper bound on square gradient rather than its expectation. Something like \(\ve{\sumo kn |D_kf|^2}_{\iy}\) would be better.
The subgaussian property is equivalent to \(\la^{-1}\psi(\la)\precsim \la\), so it suffices to show \(\ddd{\la}(\la^{-1}\psi)\precsim 1\).
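The identity behind this (the Herbst argument), with \(\psi(\la)=\log \E[e^{\la f}]\) for centered \(f\): \[\ddd{\la}\pa{\fc{\psi(\la)}{\la}} = \fc{\la\psi'(\la)-\psi(\la)}{\la^2} = \fc{\Ent(e^{\la f})}{\la^2\,\E[e^{\la f}]},\] so a bound \(\Ent(e^{\la f})\le \fc{\la^2\si^2}{2}\E[e^{\la f}]\) integrates (using \(\psi(\la)/\la\to 0\) as \(\la \searrow 0\)) to \(\psi(\la)\le \fc{\la^2\si^2}{2}\).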
The entropic formulation of subgaussianity and the tensorization inequality tell us that if we prove (for some notion of \(\nb\)) \[ \Ent(e^g) \precsim \E[\ve{\nb g}^2e^g]\] in one dimension, then in any number of dimensions, \[ \Ent(e^{\la f})\precsim \E[\ve{\nb (\la f)}^2e^{\la f}],\] so \(f\) is subgaussian with variance proxy \(\precsim \ve{\ve{\nb f}^2}_{\iy}\).
Example (Random Bernoulli symmetric matrices). Using \(D_{ij}^-\la_{\max(M)}\), get \[ \Pj(\la_{\max}(M) - \E\la_{\max}(M)\ge t)\le e^{-\fc{t^2}{64}}. \] We can’t use the same technique to look at the lower tail because the bound is in terms of different \(M^{(ij)}\)’s.
We have an entropic analogue of just the easy parts of the Poincare inequality equivalence.
Theorem. 1 and 2 are equivalent. 3 implies 1, 2 if \(\Ent_\mu(P_tf)\to 0\) (entropic ergodicity).
Proof.
The log-Sobolev equivalences cannot reproduce the tensorization inequality for entropy.
Theorem (Gaussian log-Sobolev). For independent Gaussian variables, \[\begin{align} \Ent[f] &\le \rc 2 \E [\nb f \cdot \nb \ln f]& (f\ge 0)\\ \Ent[e^f] & \le \rc 2 \E[\ve{\nb f}^2 e^f]. \end{align}\]As a result \(f\) is \(\si^2 = \ve{\ve{\nb f}^2}_{\iy}\) subgaussian and we get Gaussian concentration, \[\Pj[f - \E f\ge t] \le e^{-t^2/2\si^2}.\]
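Given the log-Sobolev inequality, the concentration statement follows from the Herbst argument above, applied to \(\la f\): \[\Ent(e^{\la f})\le \rc2 \E[\ve{\nb(\la f)}^2 e^{\la f}] \le \fc{\la^2\si^2}{2}\E[e^{\la f}] \implies \psi(\la)\le \fc{\la^2\si^2}{2},\] and the Chernoff bound gives the stated tail.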
Proof. Recall \(\cE(f,g) = \mu(f'g')\), \((P_tf)' = e^{-t}P_t f'\). Note \(|P_t(fg)|^2 \le P_t(f^2)P_t(g^2)\) by CS (expand out). \[\begin{align} (\ln P_t f)' (P_tf)' &= e^{-2t} \fc{|P_tf'|^2}{P_tf}\\ |P_t f'|^2 &\le P_t((\ln f)'f') P_t f&\text{by CS}\\ \implies \cE(\ln (P_tf), P_tf) &\le e^{-2t}\cE(\ln f, f) &\text{by }\int\\ \implies \Ent_\mu(f) &\le \rc2 \cE(\ln f, f)&(3\implies 1), \end{align}\] using \(\Ent_\mu(f) = \int_0^\iy \cE(\ln P_tf, P_tf)\,dt\) in the last step. Note several different forms of log-Sobolev, equivalent in the Gaussian case (or anytime the chain rule holds for \(\cE\)): \[\begin{align} \Ent(f) &\le \rc 2 \E[\nb f \cdot \nb \ln f] = \rc2 \cE (\ln f, f)\\ \Ent(f) &\le \rc 2 \E\pf{\ve{\nb f}^2}{f}\\ \Ent(e^f) &\le \rc 2 \E[\ve{\nb f}^2 e^f]\\ \Ent(f^2) &\le 2\E[\ve{\nb f}^2] = 2\cE(f,f)\\ \E[f^2\ln f]-\E[f^2]\ln \ve{f}_2 &\le c\ve{\nb f}_2^2. \end{align}\]Classical Sobolev inequalities are for \(\ved_q\), \(q\ge 2\) and do not tensorize.
Lemma 3.28: Log-Sobolev \(\Ent(f) \le c\cE (\ln f, f)\) implies the Poincare inequality \(\Var(f) \le 2c\cE (f,f)\).
Proof. Apply the log-Sobolev inequality to \(e^{\la f}\) and expand as \(\la\to 0\): \[ \ub{\E[\la f e^{\la f}]}{\la \E f + \la^2 \E [f^2] + o(\la^2)} - \ub{\E[e^{\la f}] \ln \E[e^{\la f}]}{\la \E f + \la^2(\E[f^2] + \E[f]^2)/2 + o(\la^2)} = \fc{\la^2}{2}\Var(f) + o(\la^2) \le c\,\ub{\cE(\la f, e^{\la f})}{\la^2 \cE(f,f)+o(\la^2)}. \] Divide by \(\la^2\) and let \(\la\to 0\).
Equivalent conditions for subgaussianity:
Problems
Rather than measuring the sensitivity of the function \(f\) in terms of a gradient, we will introduce a metric viewpoint that emphasizes the role of Lipschitz functions.
We can express Gaussian concentration and McDiarmid’s inequalities in terms of Lipschitz functions.
In the case of gradient bounds, the sensitivity of the function \(f\) is measured locally, while the Lipschitz property quantifies the sensitivity in a global manner.
Basic question:
For which probability measures \(\mu\) on the metric space \((\mathbb X, d)\) is it true that if \(X\sim \mu\), then \(f(X)\) is \(\si^2\)-subgaussian for every \(f \in Lip(\mathbb X)\)?
Wasserstein distance: For \(\mu,\nu\in \cP_1(\mathbb X)=\set{\rh}{\int d(x_0,x)\rh(dx)<\iy}\), the Wasserstein distance is \[ W_1(\mu, \nu) := \sup_{f\in Lip(\mathbb X)} \ab{\int f\,d\mu - \int f\,d\nu}.\]
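A minimal example: for Dirac masses, \[W_1(\de_x,\de_y) = \sup_{f\in Lip(\mathbb X)}|f(x)-f(y)| = d(x,y),\] with the supremum attained by \(f = d(\cdot, y)\).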
Relative entropy: \(D(\nu||\mu) = \Ent_{\mu}\pa{\dd{\nu}{\mu}}\).
Theorem 4.8 (Bobkov-Gotze): Let \(\mu\in \cP_1(\mathbb X)\). TFAE for \(X\sim \mu\): 1. \(f(X)\) is \(\si^2\)-subgaussian for every \(f\in Lip(\mathbb X)\). 2. \(W_1(\nu,\mu)\le \sqrt{2\si^2 D(\nu||\mu)}\) for every \(\nu\in \cP_1(\mathbb X)\) (a transportation cost inequality).
Example (Pinsker’s inequality): take the trivial metric \(d(x,y)=\mathbf 1_{x\ne y}\), so \(W_1(\mu,\nu)=\ve{\mu-\nu}_{TV}\) and \(f\in Lip(\mathbb X)\) means \(\sup f-\inf f\le 1\), hence \(f(X)\) is \(\rc 4\)-subgaussian by Hoeffding. Bobkov-Gotze then gives \(\ve{\mu-\nu}_{TV}\le \sqrt{\rc 2 D(\nu||\mu)}\).
Proof.
Lemma 4.10 (Gibbs variational principle) \(\log \E_{\mu} e^f = \sup_\nu [\E_\nu [f] - D(\nu||\mu)]\).
Proof. Let \(d\wt\mu = \fc{e^f}{\E_\mu[e^f]}\,d\mu\). Then \(\log \E_\mu[e^f] - D(\nu||\wt \mu) = \E_\nu [f] - D(\nu||\mu)\), and since \(D(\nu||\wt\mu)\ge 0\) with equality iff \(\nu=\wt\mu\), taking the supremum over \(\nu\) gives the claim.
(?? “Dual convex optimization to the variational formula for entropy.”) For the proof of Bobkov-Gotze: apply the variational formula to \(\la f\), take the supremum over \(f\in Lip(\mathbb X)\) to get \(W_1\), then optimize over \(\la\).