Fisher information

Posted: 2016-04-04 , Modified: 2016-04-04

Tags: none

Definitions

Define the score and Fisher information by \[\begin{align} s(X;\te)&=\pd{\ln f}{\te}\\ I(\te)&=\Var_\te(s(X;\te)) \end{align}\]

(This generalizes immediately to the multivariate case; for simplicity we consider the univariate case.)

The expectation of the score is 0: \[\E s=\int_{-\iy}^{\iy} s(X;\te) f\,dx=\int_{-\iy}^{\iy} \fc{f_\te}{f}f\,dx=\left(\int_{-\iy}^{\iy}f\,dx\right)_{\te}=0.\] Thus \[I(\te) = \Var(s(X;\te)) = \E [((\ln f)_\te)^2] = -\E((\ln f)_{\te\te}),\] where the last equality follows by differentiating \(\int_{-\iy}^{\iy} f\,dx=1\) twice in \(\te\) (so \(\E\left[\fc{f_{\te\te}}{f}\right]=0\)).
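As a sanity check, these identities can be verified numerically. Below is a minimal Monte Carlo sketch (assuming numpy is available; the parameter value and sample size are arbitrary choices) for the Bernoulli(\(\te\)) family, where \(s(x;\te)=\fc{x}{\te}-\fc{1-x}{1-\te}\) and \(I(\te)=\rc{\te(1-\te)}\) in closed form.

```python
# Monte Carlo check of E[s] = 0 and Var(s) = -E[(ln f)_{theta theta}] = I(theta)
# for the Bernoulli(theta) family:
#   ln f(x; theta)        = x ln(theta) + (1 - x) ln(1 - theta)
#   s(x; theta)           = x/theta - (1 - x)/(1 - theta)
#   -(ln f)_{theta theta} = x/theta^2 + (1 - x)/(1 - theta)^2
#   I(theta)              = 1/(theta (1 - theta))
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000).astype(float)

score = x / theta - (1 - x) / (1 - theta)
neg_second_deriv = x / theta**2 + (1 - x) / (1 - theta)**2

print("E[s]             ~", score.mean())             # ~ 0
print("Var(s)           ~", score.var())              # ~ 4.76
print("-E[(ln f)_tt]    ~", neg_second_deriv.mean())  # ~ 4.76
print("exact I(theta)    =", 1 / (theta * (1 - theta)))
```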

Intuition

Suppose a data point \(x\) is observed. What is the posterior distribution on the parameter \(\te\)? Consider the log of this probability; with a flat prior this is the log-likelihood \(\ln f(x;\te)\) (up to a constant). The Fisher information measures how curved the log-likelihood is as a function of \(\te\), in expectation over \(x\).

Consider the Fisher information at the MLE. If \(I(\te)\) is large, then we are reasonably certain of the value of \(\te\) (changing \(\te\) by a bit decreases the log-probability of observing \(x\) a lot).1 If \(I(\te)\) is small, then we are not certain.
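For concreteness, take \(n\) i.i.d. Bernoulli(\(\te\)) observations with \(k\) successes (a standard example, not tied to any particular application). The log-likelihood is \(\ell(\te)=k\ln\te+(n-k)\ln(1-\te)\), the MLE is \(\wh{\te}=k/n\), and the curvature at the MLE is \[ -\ell''(\wh{\te}) = \fc{k}{\wh{\te}^2}+\fc{n-k}{(1-\wh{\te})^2} = \fc{n}{\wh{\te}(1-\wh{\te})} = nI(\wh{\te}), \] which grows linearly in \(n\): more data makes the log-likelihood more sharply peaked around \(\wh{\te}\), i.e., we become more certain of \(\te\).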

Theorems

Cramer-Rao

Wikipedia

This intuition is formalized by Cramer-Rao: the variance of any unbiased estimator for \(\te\) is lower-bounded by the inverse of the Fisher information.

Theorem (Cramer-Rao): Suppose \(T(X)\) is an unbiased estimator of \(\te\). Then \(\Var(T) \ge \rc{I(\te)}\). More generally, if \(T(X)\) is an unbiased estimator of \(\psi(\te)\), then \(\Var(T) \ge \fc{\psi'(\te)^2}{I(\te)}\).

In higher dimensions, \[ \Var(T) \succeq \pd{\psi}{\te} I(\te)^{-1}\pd{\psi}{\te}^T, \] where \(\pd{\psi}{\te}\) is the Jacobian of \(\psi\).

Proof: Suppose \(T=t(X)\) with \(\E T=\te\). By Cauchy-Schwarz, \[ \Var(T) \ge \fc{\text{Covar}(T,s)^2}{\Var(s)} = \fc{\left(\int t(x)\, (\ln f)_\te\, f\,dx\right)^2}{I(\te)} = \fc{\left(\int t(x)\, f_{\te}(x)\,dx\right)^2}{I(\te)} =\fc{((\E T)_\te)^2}{I(\te)}= \rc{I(\te)}, \] where the first equality uses \(\text{Covar}(T,s)=\E[Ts]\) (since \(\E s=0\)) and the last uses \((\E T)_\te=(\te)_\te=1\).
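To illustrate the bound numerically (a rough simulation sketch, assuming numpy; the distribution and sample sizes are arbitrary choices): for \(n\) i.i.d. \(N(\te,1)\) observations, \(I(\te)=1\) per observation, so the Fisher information of the sample is \(n\) and any unbiased estimator of \(\te\) has variance at least \(\rc n\). The sample mean attains this bound; the sample median, also unbiased by symmetry, has variance about \(\fc{\pi}{2n}\), strictly above it.

```python
# Cramer-Rao check: n i.i.d. N(theta, 1) observations have Fisher information n,
# so any unbiased estimator of theta has variance >= 1/n. The sample mean attains
# the bound; the sample median (also unbiased, by symmetry) sits above it.
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 50, 200_000

samples = rng.normal(theta, 1.0, size=(trials, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print("bound 1/(n I)      =", 1 / n)           # 0.02
print("Var(sample mean)   ~", means.var())     # ~ 0.02
print("Var(sample median) ~", medians.var())   # ~ pi/(2n) ~ 0.031
```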

Asymptotic normality

Define the standard error by \(\se=\sqrt{\Var_\te(\wh{\te_n})}\).

Theorem (Asymptotic normality of MLE): \(\se\sim \sfc1{nI(\te)}\) and \(\fc{\wh{\te_n}-\te}{\se}\to N(0,1)\).

(With a little more work, we can replace \(\se\) by \(\wh{\se}\), the estimated standard error.)

Proof: Denote the log-likelihood by \(\ell(\te):= \ln \Pj(x^n|\te) = \sum_{i=1}^n \ln f(x_i;\te)\). Since \(\ell'(\wh{\te_n})=0\) at the MLE, linearizing \(\ell'\) around \(\te\) gives \[ 0=\ell'(\wh{\te_n})\approx \ell'(\te)+(\wh{\te_n}-\te)\ell''(\te)\implies \wh{\te_n}-\te\approx -\fc{\ell'(\te)}{\ell''(\te)}. \] Now \[ \sqrt n(\wh{\te_n}-\te)\approx\fc{\rc{\sqrt n}\ell'(\te)}{-\rc n\ell''(\te)}\to \fc{N(0,I(\te))}{I(\te)} = N\left(0,\rc{I(\te)}\right), \] where the numerator converges in distribution (CLT on \(\sum (\ln f)_\te\), each term with mean 0 and variance \(I(\te)\)) and the denominator converges in probability (LoLN on \(\sum -(\ln f)_{\te\te}\), each term with mean \(I(\te)\)); the two combine by Slutsky's theorem. Dividing by \(\se=\sfc1{nI(\te)}\) gives \(\fc{\wh{\te_n}-\te}{\se}\to N(0,1)\).
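To see the theorem in action, here is a small simulation sketch (assuming numpy; the Bernoulli family and the constants are arbitrary illustrative choices). The Bernoulli MLE is the sample mean, \(I(\te)=\rc{\te(1-\te)}\), so \(\se=\sfc{\te(1-\te)}{n}\), and the standardized errors \(\fc{\wh{\te_n}-\te}{\se}\) should look approximately standard normal.

```python
# Asymptotic normality of the MLE for Bernoulli(theta): theta_hat is the sample
# mean, se = sqrt(theta(1-theta)/n), and (theta_hat - theta)/se ~ N(0, 1) for
# large n, checked across many independent replications.
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 0.3, 400, 50_000

x = rng.binomial(1, theta, size=(trials, n))
theta_hat = x.mean(axis=1)
se = np.sqrt(theta * (1 - theta) / n)
z = (theta_hat - theta) / se

print("mean of z      ~", z.mean())                   # ~ 0
print("std of z       ~", z.std())                    # ~ 1
print("P(|z| > 1.96)  ~", (np.abs(z) > 1.96).mean())  # ~ 0.05
```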


  1. For the sake of discussion, suppose the log-likelihood function is concave in \(\te\), so there aren’t other local maxima.