Fisher information
Posted: 2016-04-04 , Modified: 2016-04-04
Tags: none
(This generalizes immediately to the multivariate case; for simplicity we consider the univariate case.)
The score is \(s(X;\te):=(\ln f(X;\te))_\te=\fc{f_\te}{f}\), where \(f(x;\te)\) is the density of \(X\). The expectation of the score is 0: \[\E s=\int_{-\iy}^{\iy} s(x;\te) f\,dx=\int_{-\iy}^{\iy} \fc{f_\te}{f}f\,dx=\left(\int_{-\iy}^{\iy}f\,dx\right)_{\te}=0.\] The Fisher information is the variance of the score; since \(\E s=0\), \[I(\te) := \Var(s(X;\te)) = \E [(\ln f)_\te^2] = -\E[(\ln f)_{\te\te}],\] where the last equality holds because \((\ln f)_{\te\te}=\fc{f_{\te\te}}{f}-\left(\fc{f_\te}{f}\right)^2\) and \(\E\fc{f_{\te\te}}{f}=\int_{-\iy}^{\iy} f_{\te\te}\,dx=0\).
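For example, if \(X\sim \operatorname{Bernoulli}(\te)\), then \(\ln f = x\ln \te+(1-x)\ln(1-\te)\), so \[ s=\fc{x}{\te}-\fc{1-x}{1-\te},\qquad \E s=\fc{\te}{\te}-\fc{1-\te}{1-\te}=0,\qquad I(\te)=-\E[(\ln f)_{\te\te}]=\E\left[\fc{x}{\te^2}+\fc{1-x}{(1-\te)^2}\right]=\rc{\te(1-\te)}. \]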
Suppose a data point \(x\) is observed. What is the posterior distribution on the parameter \(\te\)? (With a flat prior, it is proportional to the likelihood \(f(x;\te)\).) Consider the log of this probability, the log-likelihood \(\ln f(x;\te)\). The Fisher information measures how curved in \(\te\) the log-likelihood is, on average over \(x\).
Consider the Fisher information at the MLE. If \(I(\te)\) is large, then we are reasonably certain of the value of \(\te\) (changing \(\te\) by a bit decreases the log-probability of observing \(x\) a lot).[^1] If \(I(\te)\) is small, then we are not certain.
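For instance, for \(X\sim N(\mu,\sigma^2)\) with \(\sigma\) known, \(\ln f=-\fc{(x-\mu)^2}{2\sigma^2}+\text{const}\), so \(I(\mu)=-\E[(\ln f)_{\mu\mu}]=\rc{\sigma^2}\): the smaller the noise \(\sigma\), the more sharply the log-likelihood curves around its peak, and the more precisely a single observation pins down \(\mu\).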
This intuition is formalized by the Cramér-Rao bound: the variance of any unbiased estimator of \(\te\) is lower-bounded by the inverse of the Fisher information.
Theorem (Cramér-Rao): Suppose \(T(X)\) is an unbiased estimator of \(\te\). Then \(\Var(T) \ge \rc{I(\te)}\). More generally, if \(T\) is an unbiased estimator of \(\psi(\te)\), then \(\Var(T) \ge \fc{\psi'(\te)^2}{I(\te)}\).
In higher dimensions, with \(\E_\te T=\psi(\te)\), \[ \Var(T) \succeq \pd{\psi}{\te} I(\te)^{-1}\pd{\psi}{\te}^T. \]
Proof: Suppose \(T=t(X)\) with \(\E_\te T=\te\). Since \(\E s_\te=0\), \(\text{Covar}(T,s_\te)=\E[T s_\te]\). By Cauchy-Schwarz, \[ \Var(T) \ge \fc{\text{Covar}(T,s_\te)^2}{\Var(s_\te)} = \fc{\left(\int t(x)f\,(\ln f)_\te\,dx\right)^2}{I(\te)} = \fc{\left(\int t(x) f_{\te}(x)\,dx\right)^2}{I(\te)} =\fc{((\E T)_\te)^2}{I(\te)}= \rc{I(\te)}, \] since \((\E T)_\te=(\te)_\te=1\) by unbiasedness. (Replacing \(\te\) by \(\psi(\te)\) in the last step gives the general version.)
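A quick numerical sanity check of the bound for the Bernoulli model above (a minimal simulation sketch; the parameter values are arbitrary): for \(n\) i.i.d. observations the information is \(nI(\te)\), so the bound reads \(\Var(T)\ge\rc{nI(\te)}\), and the sample mean, which is unbiased for \(\te\), attains it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 0.3, 200, 20000

# Fisher information of one Bernoulli(theta) observation: I(theta) = 1/(theta(1-theta))
I = 1.0 / (theta * (1 - theta))
bound = 1.0 / (n * I)  # Cramer-Rao bound for estimators based on n samples

# Monte Carlo estimate of Var(sample mean); the sample mean is unbiased for theta
samples = rng.binomial(1, theta, size=(trials, n))
estimates = samples.mean(axis=1)

print("empirical Var(sample mean):", estimates.var())  # ~ theta(1-theta)/n
print("Cramer-Rao bound 1/(n I): ", bound)             # equals theta(1-theta)/n
```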
Now let \(\wh{\te_n}\) be the MLE based on \(n\) i.i.d. samples \(x_1,\ldots,x_n\). Define the standard error by \(\se=\sqrt{\Var_\te(\wh{\te_n})}\).
Theorem (Asymptotic normality of MLE): \(\se\sim \sfc1{nI(\te)}\) and \(\fc{\wh{\te_n}-\te}{\se}\to N(0,1)\).
(With a little more work, we can replace \(\se\) by \(\wh{\se}\), the estimated standard error obtained by plugging \(\wh{\te_n}\) into the formula above.)
Proof: Denote the log-likelihood by \(\ell(\te):= \ln \Pj(x^n|\te) = \sum_{i=1}^n \ln f(x_i;\te)\). Since \(\wh{\te_n}\) maximizes \(\ell\), we have \(\ell'(\wh{\te_n})=0\), so linearizing \(\ell'\) around \(\te\) gives \[ 0=\ell'(\wh{\te_n})\approx \ell'(\te)+(\wh{\te_n}-\te)\ell''(\te)\implies \wh{\te_n}-\te\approx -\fc{\ell'(\te)}{\ell''(\te)}. \] Now \[ \sqrt n(\wh{\te_n}-\te)\approx\fc{\rc{\sqrt n}\ell'(\te)}{-\rc n\ell''(\te)}\to \fc{N(0,I(\te))}{I(\te)}= N\left(0,\rc{I(\te)}\right), \] the top converging in distribution and the bottom in probability, so the quotient converges by Slutsky's theorem. (The top uses the CLT on \(\sum_i (\ln f(x_i;\te))_\te\), whose terms have mean 0 and variance \(I(\te)\); the bottom uses the law of large numbers on \(\rc n\sum_i -(\ln f(x_i;\te))_{\te\te}\), whose terms have mean \(I(\te)\).) Dividing by \(\se=\sfc{1}{nI(\te)}\) gives \(\fc{\wh{\te_n}-\te}{\se}\to N(0,1)\).
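The same Bernoulli model gives a quick simulation check of the theorem (again a minimal sketch with arbitrary parameter values): the MLE is the sample mean, \(I(\te)=\rc{\te(1-\te)}\), and standardizing by \(\se\) should give approximately standard normal values.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 0.3, 500, 10000

# se = sqrt(1/(n I(theta))) with I(theta) = 1/(theta(1-theta))
I = 1.0 / (theta * (1 - theta))
se = np.sqrt(1.0 / (n * I))

# MLE of theta for Bernoulli data is the sample mean
samples = rng.binomial(1, theta, size=(trials, n))
mle = samples.mean(axis=1)

# Standardized MLE should look approximately N(0, 1)
z = (mle - theta) / se
print("mean of z:     ", z.mean())                    # near 0
print("std of z:      ", z.std())                     # near 1
print("P(|z| > 1.96): ", (np.abs(z) > 1.96).mean())   # near 0.05
```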
[^1]: For the sake of discussion, suppose the log-likelihood function is concave in \(\te\), so there aren't other local maxima.