Reinforcement learning convergence
Posted: 2016-10-24 , Modified: 2016-11-21
Tags: reinforcement learning
Here we’re interested in convergence guarantees for algorithms/methods used in practice. (Rather than, e.g., coming up with provably polynomial-time theoretical algorithms that work but are too slow to be used in practice.)
Recall: Mixing time \(\tau_\ep = \min\set{t}{\max_{P_0} \ve{P_t - P_{\text{eq}}}\le \ep} \le \fc{\ln \pf{N}{\ep}}{\de}\), \(\de=1-\max_{k\ge 2}|\la_k|\).
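As a sanity check on this bound, here is a small numerical sketch, assuming a lazy random walk on a cycle (reversible, uniform stationary distribution) and using \(\ell_1\) distance for the norm above; all names and numbers are illustrative:

```python
import numpy as np

# Illustrative check of the spectral bound tau_eps <= ln(N/eps) / delta
# for a small reversible chain: a lazy random walk on a cycle of N nodes.
N, eps = 6, 1e-3
P = np.zeros((N, N))
for i in range(N):
    P[i, i] = 0.5
    P[i, (i - 1) % N] += 0.25
    P[i, (i + 1) % N] += 0.25

eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
delta = 1 - eigvals[1]                      # spectral gap 1 - max_{k>=2} |lambda_k|
bound = np.log(N / eps) / delta
print("spectral gap:", delta, "bound on mixing time:", bound)

# Empirical mixing time from the worst-case start (a point mass at each state).
pi_eq = np.ones(N) / N                      # uniform is stationary for this chain
dists = np.eye(N)                           # one row per starting state
t = 0
while np.abs(dists - pi_eq).sum(axis=1).max() > eps:
    dists = dists @ P
    t += 1
print("empirical mixing time:", t)
```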
So convergence to \(v_\pi\) happens at rate \(\ga \ve{P_\pi}_2\).
For the nondiscounted case, we get \[ v_t = \rc t[d + P_\pi d+ \cdots + P_\pi^{t-1} d]. \] We find (are stochastic matrices diagonalizable?) that \(v\) is the projection of \(d\) onto the space of 1-eigenvectors.
(Notation is easier if the reward only depends on \(s,a\); then we just get \(v_\pi = r_\pi + \ga P_\pi v_\pi\).)
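A minimal sketch of this evaluation recursion, with a made-up two-state \(P_\pi\) and \(r_\pi\), comparing the iterates against the exact solve of \(v_\pi = r_\pi + \ga P_\pi v_\pi\):

```python
import numpy as np

# Iterative policy evaluation: v_{t+1} = r_pi + gamma * P_pi v_t,
# which contracts toward v_pi = (I - gamma P_pi)^{-1} r_pi at rate gamma.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])               # made-up transition matrix under pi
r_pi = np.array([1.0, 0.0])                 # made-up expected rewards under pi

v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
v = np.zeros(2)
for _ in range(200):
    v = r_pi + gamma * P_pi @ v
print("iterative:", v, "exact:", v_exact)   # errors shrink geometrically in gamma
```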
Why does it improve? Let \(\pi'\) be greedy with respect to \(v_\pi\), so \(r_{\pi'} + \ga P_{\pi'} v_\pi \ge r_\pi + \ga P_\pi v_\pi = v_\pi\). Then \[ v_{\pi'} - v_\pi = (I-\ga P_{\pi'})^{-1}(r_{\pi'} + \ga P_{\pi'} v_\pi - v_\pi) \ge 0. \]
(Note \((I-\ga P_{\pi'})^{-1}\) is a geometric series \(\sum_{k\ge 0}\ga^k P_{\pi'}^k\) so has nonnegative entries.)
Value iteration: \(\ve{v^{n+1}-v^n}\le \fc{\ep(1-\la)}{2\la}\implies \ve{v^{n+1}-v_\la^*}<\eph\). (161)
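Before the proof, a minimal sketch of value iteration using this stopping rule (\(\la\) is the discount; the two-state MDP is made up):

```python
import numpy as np

# Value iteration with the stopping rule ||v^{n+1} - v^n|| <= eps(1-lam)/(2 lam),
# which guarantees ||v^{n+1} - v_lam^*|| < eps/2.
lam, eps = 0.9, 1e-6
# Made-up MDP: P[a, s, s'] transition probabilities and r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

v = np.zeros(2)
while True:
    q = r + lam * np.einsum('ast,t->sa', P, v)   # one-step lookahead for every (s, a)
    v_new = q.max(axis=1)                        # Bellman optimality operator L
    if np.max(np.abs(v_new - v)) <= eps * (1 - lam) / (2 * lam):
        v = v_new
        break
    v = v_new
print("approximately optimal values:", v)
```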
Proof: Let \(v^{n+1}\) be the value if you follow \(\pi^n\) after choosing the best \(a\), and \(v^{*n+1}\) be the value if you follow \(\pi^{n+1}\). Then \[\begin{align} \ve{v^{*n+1} - v^{n+1}} &\le \ve{Lv^{*n+1} - Lv^{n+1}} + \ve{Lv^{n+1} - v^{n+1}}\\ \implies \ve{v^{*n+1} - v^{n+1}} &\le \fc{\la }{1-\la} \ve{v^{n+1}-v^n} \end{align}\] Geometric series gives \[\ve{v^{n+1} - v_\la^*} \le \fc{\la}{1-\la} \ve{v^{n+1} - v^n}.\]

Policy iteration (180)
Do a triangle inequality between \[ v_{*'}^{\ga'} = (I-\ga'P_{*'})^{-1}r_*, \quad v_{*'}^{\ga}, \quad v_\pi^\ga, \quad v_\pi^{\ga'}. \] It is more involved with averaging \(\lim_{T\to \iy}\rc T\cdots\); choose \(\ga\) small enough so that it can be approximated with rectangles, etc.
\[\pi'(s) = \amax_a \sum_{s'} p(s'|s,a) [r(s,a,s') + \ga v_\pi(s')].\]
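A sketch of policy iteration built around this improvement step, with exact evaluation by a linear solve; the table layout `p[a, s, s']`, `r[a, s, s']` is just an assumed convention for the sketch:

```python
import numpy as np

def policy_iteration(p, r, gamma):
    """Policy iteration: evaluate v_pi exactly, then improve greedily.
    p[a, s, s'] are transition probabilities, r[a, s, s'] the rewards r(s, a, s')."""
    n_actions, n_states, _ = p.shape
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve v = r_pi + gamma * P_pi v.
        P_pi = p[pi, np.arange(n_states)]                       # (s, s')
        r_pi = (P_pi * r[pi, np.arange(n_states)]).sum(axis=1)  # expected reward under pi
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Greedy improvement: pi'(s) = argmax_a sum_s' p(s'|s,a)[r(s,a,s') + gamma v(s')].
        q = (p * (r + gamma * v)).sum(axis=2)                   # (a, s)
        pi_new = q.argmax(axis=0)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new
```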
Main idea: to solve a fixed-point equation, need an unbiased approximation of operator \(F\), and \(F\) to be a contraction. (Noise is not a big deal—finiteness is usually enough.)
Abstraction: Solve fixed point equation \[F(x)=x.\] Let \(T^i\) be the set of times when \(x_i\) is updated, and the update equation be \[x_i(t+1) = x_i(t) + \al_i(t) (F_i(x^i(t)) + w_i(t) - x_i(t)).\] The components of \(x^i(t)\) are \(x_j(\tau_j^i(t))\), possibly outdated. (If they are not outdated, \(\tau=t\).)
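A schematic of this asynchronous update, assuming no outdated reads (\(\tau=t\)), one uniformly chosen coordinate per step, step sizes \(1/(\text{number of updates so far})\), and i.i.d. zero-mean noise; the \(F\) in the usage example is a made-up max-norm contraction:

```python
import numpy as np

def async_fixed_point(F, x0, n_steps, rng=None):
    """Schematic asynchronous iteration x_i <- x_i + alpha (F_i(x) + w_i - x_i).
    One randomly chosen coordinate is updated per step; alpha_i(t) is one over
    the number of updates of i so far; w_i is zero-mean noise; x is read fresh."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    counts = np.zeros(len(x))
    for _ in range(n_steps):
        i = rng.integers(len(x))                 # coordinate updated at this time
        counts[i] += 1
        alpha = 1.0 / counts[i]
        w = rng.normal()                         # unbiased noise w_i(t)
        x[i] += alpha * (F(x)[i] + w - x[i])
    return x

# Example: F(x) = A x + b is a contraction in the max norm, so x(t) -> fixed point.
A = np.array([[0.0, 0.45], [0.45, 0.0]])
b = np.array([1.0, 2.0])
x = async_fixed_point(lambda x: A @ x + b, np.zeros(2), 200_000)
print(x, "vs", np.linalg.solve(np.eye(2) - A, b))
```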
Assumptions: Let \(\mathcal F(t)\) be the \(\si\)-field of the history up to time \(t\). Define \(\ve{x}_v=\max_i \fc{|x_i|}{v_i}\).
Theorem: 1, 2, 3, 5 imply that \(x(t)\to x^*\) w.p. 1.
Lemma: Let \(\{\mathcal F(t)\}\) be an increasing sequence of \(\si\)-fields, \(\al(t),w(t-1), B(t)\) be \(\mathcal F(t)\)-measurable. If
and \(W(t+1)=(1-\al(t))W(t) + \al(t) w(t)\), then \(\lim_{t\to \iy}W(t)=0\) w.p. 1.
Proof. Write out the expression for \(W(t)\). The term \(\prodz t{T} (1-\al(t))W^0(t)\to 0\). The other terms: square and use Chebyshev.
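A quick numerical illustration of the lemma, taking \(\al(t)=1/t\) and i.i.d. zero-mean noise (an assumed special case of the conditions; then \(W(t)\) is just a running average of the \(w\)'s):

```python
import numpy as np

# W(t+1) = (1 - alpha(t)) W(t) + alpha(t) w(t): with alpha(t) = 1/t and i.i.d.
# zero-mean noise, W(t) becomes the running average of the w's and tends to 0.
rng = np.random.default_rng(0)
W = 5.0                                      # arbitrary starting value W(0)
for t in range(1, 100_000):
    alpha = 1.0 / t
    W = (1 - alpha) * W + alpha * rng.normal()
print(W)                                     # close to 0
```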
Theorem. 1, 2, 3, 6 imply that \(x(t)\) is bounded.
Proof. We have \(\max_i |F_i(x)|\le \ga \max(\max_j |x_j|, G_0)\).
Look at the relative fluctuations (with respect to the nearest power of \(G_0\)). As \(M(t)=\max_{\tau\le t}\ve{x(\tau)}_\iy\) gets big, it gets more and more difficult for the relative fluctuations (actually, relative to the next smaller \(G\ga^{-k}\), for technical reasons) to be big. To argue this, apply the lemma to the normalized noise \(\wt w_i(t) = \fc{w_i(t)}{G(t)}\) to show that the fluctuations after \(t_0\) go to 0 as \(t_0\to \iy\). (Define \(\wt W(t_0;t_0)=0\) and define \(\wt W(t;t_0)\) by the recurrence but with \(\wt w_i(t)\).) Now \(\wt W_i(t;t_0)\) being small means that \(x_i(t+1)\) can’t “overcome” the \(\ga\) contraction factor.
Adding in assumption 5: Now show that \(\ve{x(t)}_{\iy}\le D_k\) implies that \(\ve{x(t)}_{\iy}\le D_{k+1} := \be (1+2\ep)D_k\) at some later time. The proof is similar, except we establish contraction instead of just non-expansion. When bounding \(x_i(t+1)\), use assumption 5.
Question: what is the convergence rate? (Exercise!) Based on the analysis (proceeding by a geometric sequence), it’s probably linear.
\(c_{iu}\) is the cost of taking action \(u\) at state \(i\).
Here
\[\begin{align} T_i(V) &= \min_{u\in U(i)} \ba{\E[c_{iu}] + \be\sum_{j\in S} p_{ij}(u)V_j}\\ F_{iu} (Q) & = \E c_{iu} + \be \sum_{j\in S} p_{ij}(u) \min_{v\in U(j)} Q_{jv}\\ W_{iu}(t) &= c_{iu} - \E c_{iu} + \min_{v\in U(s(i,u))} Q_{s(i,u),v}(t) - \E[\min_{v\in U(s(i,u))} Q_{s(i,u),v}(t) | \mathcal F(t)] \end{align}\](Note that implicitly \(\E[\min_{v\in U(s(i,u))} Q_{s(i,u),v}(t) \mid \mathcal F(t)] = \sum_{j\in S} p_{ij}(u) \min_{v\in U(j)}Q_{jv}(t)\).)
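A sketch of tabular Q-learning as the asynchronous iteration for this \(F\) (cost-minimization, discount \(\be\)); the `step(i, u)` simulator, uniform exploration, and \(1/(\text{number of visits})\) step sizes are assumptions made for the sketch:

```python
import numpy as np

def q_learning(n_states, n_actions, step, beta, n_steps, rng=None):
    """Tabular Q-learning as the asynchronous iteration for F above:
    Q[i,u] <- Q[i,u] + alpha * (c + beta * min_v Q[j,v] - Q[i,u]),
    where step(i, u) samples a cost c and next state j (an assumed simulator)."""
    rng = rng or np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    i = 0
    for _ in range(n_steps):
        u = rng.integers(n_actions)               # explore uniformly
        c, j = step(i, u)                         # sampled cost and next state
        counts[i, u] += 1
        alpha = 1.0 / counts[i, u]
        Q[i, u] += alpha * (c + beta * Q[j].min() - Q[i, u])
        i = j
    return Q
```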
(Skip: discounted policies.)
(cf. stochastic gradient descent?)