MDP's with continuous state space

Posted: 2016-10-14 , Modified: 2016-10-14

Tags: none

Other pages

Finding optimal policy (given dynamics)

Simplest problem

State is \(s\in \R^n\).

Suppose

Then we can solve this in closed form (geometric series). The best action is the same at each time step, \[\max_{a\in A} \an{(I-\ga U)^{T\,-1}r, a}.\]

For infinite-horizon, we look at instead the average of rewards over next \(T\) time steps as \(T\to \iy\); interesting case is when \(U\) has eigenvalues equal to 1. Straightforward.

Different linear transformations

Finite actions

Now consider a more general case. (We don’t put in probability yet.)

Given fixed discount factor \(\ga\), desired approximation \(\ep\), can we find something that does at most \(\ep\) worse in \(\poly\prc{\ep}\) time?

Questions

  1. Can we find a nice class of functions (SVM, etc.), and find the best \(v\) within that class?
  2. Is there a class that can approximate all possible \(v\)’s within \(\ep\) or constant factor?
  3. Can \(v\)’s be complicated? (Ex. break the space into exponentially many regions.)

Careful: finding the best approximation to a unitary transformation (take \(U_a\) unitary and \(v_a=0\)) with a certain set of unitaries is a well-studied problem that can involve number theory—we want to exclude this. Ex. make sure we’re not in this regime - have the discount factor, or add noise.

Continuous actions

Can also make \(A\) continuous.

11/21/16

Assume that the dynamics and reward are given by \[\begin{align} s_{t+1} &= s_t + U\phi(s_t,a_t)\\ r_{t+1} &= \an{r, \phi(s_t)}. \end{align}\]

Ex. \(\phi\) low-degree monomials. Estimate \(U\), \(r\) with samples.

Then what?

Latent variable models: Instead of predicting the next state directly from action and current state, what if you add an autoencoder-RNN, and the action feeds into the RNN which then generates the next state? The memory of the RNN is the latent, lower-dimensional space.

(Q: can you have an argmax (cf. max pooling, but LOCATION) in a NN? How would you differentiate through it? Assume that the argmax is not affected by the parameters? This seems reasonable.)

Prior work on continuous HMMs