MDPs with continuous state space
Posted: 2016-10-14 , Modified: 2016-10-14
Tags: none
State is \(s\in \R^n\).
Suppose the dynamics are linear: taking action \(a\in A\) (a finite set of vectors in \(\R^n\)) in state \(s\) moves the system to \(s' = Us + a\), the reward in state \(s\) is \(\an{r, s}\), and we maximize the discounted total reward \(\sum_t \ga^t \an{r, s_t}\).
Then we can solve this in closed form (geometric series). The best action is the same at each time step: \[\max_{a\in A} \an{(I-\ga U)^{-\top}r, a}.\]
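To spell out the geometric series (under the linear setup above): the action \(a_j\) taken at time \(j\) contributes \(U^k a_j\) to \(s_{j+k+1}\), so its total discounted contribution to the reward is \[\sum_{k\ge 0} \ga^{j+k+1} \an{r, U^k a_j} = \ga^{j+1}\an{(I-\ga U)^{-\top} r, a_j},\] which (when the spectral radius of \(\ga U\) is less than 1, so the series converges) is maximized by the same \(a\) for every \(j\).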
For the infinite-horizon case, we instead look at the average of the rewards over the next \(T\) time steps as \(T\to \iy\); the interesting case is when \(U\) has eigenvalues equal to 1. Straightforward.
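Reasoning for the eigenvalue-1 condition, under the same assumed dynamics: if every eigenvalue of \(U\) has modulus less than 1, then under a constant action \(a\) the state converges to the fixed point \(s^* = (I-U)^{-1}a\), so \[\lim_{T\to\iy} \frac{1}{T}\sum_{t=1}^{T} \an{r, s_t} = \an{r, (I-U)^{-1}a},\] and the average-reward problem reduces to a single maximization over \(a\). Eigenvalues on the unit circle are what keep the state from settling down.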
Now consider a more general case, where each action \(a\) acts on the state by an affine map \(s \mapsto U_a s + v_a\). (We don’t put in probability yet.)
Given a fixed discount factor \(\ga\) and desired approximation \(\ep\), can we find a policy that does at most \(\ep\) worse than optimal in \(\poly\prc{\ep}\) time?
Questions
Careful: finding the best approximation to a unitary transformation (take \(U_a\) unitary and \(v_a=0\)) using a given set of unitaries is a well-studied problem that can involve number theory; we want to exclude this. Ex. make sure we’re not in this regime by having a discount factor, or by adding noise.
Can also make \(A\) continuous.
Ex. take \(\phi\) to be the low-degree monomials of the state (a feature map in which the dynamics and reward are approximately linear). Estimate \(U\), \(r\) from samples.
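A minimal sketch of that estimation step, assuming the dynamics and reward are (approximately) linear in the monomial features; the helper names and the plain least-squares fit are illustrative:

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(s, degree=2):
    # phi(s): all monomials of the state coordinates up to `degree`,
    # including the constant 1, stacked into one feature vector.
    s = np.asarray(s)
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(s)), d):
            feats.append(np.prod(s[list(idx)]))
    return np.array(feats)

def fit_linear_model(states, next_states, rewards, degree=2):
    # Least-squares estimates of U and r in feature space from sampled
    # transitions, assuming phi(s') ~ U phi(s) and reward(s) ~ <r, phi(s)>.
    Phi = np.array([monomial_features(s, degree) for s in states])        # (N, k)
    Phi_next = np.array([monomial_features(s, degree) for s in next_states])
    Ut, *_ = np.linalg.lstsq(Phi, Phi_next, rcond=None)   # solves Phi @ Ut ~ Phi_next
    r_hat, *_ = np.linalg.lstsq(Phi, np.asarray(rewards), rcond=None)
    return Ut.T, r_hat                                    # so that phi(s') ~ U @ phi(s)
```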
Then what?
Latent variable models: Instead of predicting the next state directly from the action and current state, what if you add an autoencoder-RNN, where the action feeds into the RNN, which then generates the next state? The memory of the RNN is the latent, lower-dimensional state space.
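A rough sketch of one such architecture (PyTorch; the module names and dimensions are illustrative, not a fixed design): the encoder maps the observed state to a low-dimensional latent, a GRU cell carries the latent forward with the action as its input, and the decoder predicts the next state from the latent.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    # Encoder/decoder pair (the autoencoder) plus a GRU cell that carries
    # the latent state forward; the action is the GRU input at each step.
    def __init__(self, state_dim, action_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.cell = nn.GRUCell(input_size=action_dim, hidden_size=latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, state_dim))

    def forward(self, state, actions):
        # state: (batch, state_dim); actions: (batch, T, action_dim).
        # Returns the predicted states for the next T steps.
        h = self.encoder(state)                  # latent "memory" of the RNN
        preds = []
        for t in range(actions.shape[1]):
            h = self.cell(actions[:, t], h)      # action drives the latent dynamics
            preds.append(self.decoder(h))        # decode latent back to state space
        return torch.stack(preds, dim=1)         # (batch, T, state_dim)
```

Training would minimize prediction error on observed trajectories, optionally plus a reconstruction term \(\|\mathrm{decode}(\mathrm{encode}(s)) - s\|^2\) for the autoencoder part.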
(Q: can you have an argmax (cf. max pooling, but returning the LOCATION rather than the value) in a NN? How would you differentiate through it? Assume that the argmax location is not affected by small changes in the parameters, i.e., treat it as a constant when backpropagating? This seems reasonable.)
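One standard differentiable surrogate (not necessarily what is intended above) is a soft-argmax: softmax the scores and output the expected index, which is differentiable; the hard argmax itself gets zero gradient, consistent with treating its location as a constant. A minimal sketch:

```python
import torch

def soft_argmax(scores, temperature=1.0):
    # Differentiable surrogate for argmax over the last dimension:
    # softmax the scores, then take the expected index under those weights.
    weights = torch.softmax(scores / temperature, dim=-1)
    idx = torch.arange(scores.shape[-1], dtype=scores.dtype, device=scores.device)
    return (weights * idx).sum(dim=-1)

scores = torch.tensor([0.1, 2.0, 0.3], requires_grad=True)
loc = soft_argmax(scores, temperature=0.5)   # close to 1.0, and differentiable
loc.backward()                               # gradients flow into the scores
print(loc.item(), scores.grad)

hard = torch.argmax(scores)                  # exact location, but no gradient:
                                             # the index is treated as a constant
```

As the temperature goes to 0, the soft-argmax approaches the hard argmax location.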