See also last week for unfinished threads.
RL project
Alexa/NLP experiments
Directions
- Learning the LQR: suppose we only get a noisy observation of the reward, and possibly there is noise in the dynamical system as well. We don’t know the parameters (the dynamics matrices \(A, B\) and the cost matrices \(Q, R\)). Can we “PAC learn” the optimal policy? (Baseline sketches after this list.)
- Have to define “PAC learn” in this context.
- I think so; cf. contextual bandits, UCRL, LinUCB. There will be many details to work out and many possible ways to state a result. The real question is whether this is actually valuable to work on.
- This is more the optimization standpoint, not the “existence of good solution” standpoint.
- Replace the quadratic cost by a general convex function \(f\).
- Suppose we know the dynamics and \(f\). Approximate the optimal policy (e.g., find a simple class of functions such that optimizing within this class gives an approximate solution).
- So here we just care about “existence of good solution” (in a convenient class).
- Start with linear controls first: how well do they do? (See the rollout sketch after this list.)
- Optimization standpoint: given a nice (e.g., linear) class of functions (for the value or for the policy), find the optimum within that class, knowing the dynamics and \(f\).
- Combine the previous two: don’t know the dynamics or \(f\), and learn them!
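
For reference on the known-parameter case: with quadratic cost the optimal policy is linear state feedback \(u = -Kx\), with \(K\) obtained from the discrete algebraic Riccati equation. A minimal sketch (the matrices \(A, B, Q, R\) below are placeholder values, not from anything above):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder system: x' = A x + B u, per-step cost x^T Q x + u^T R u.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = solve_discrete_are(A, B, Q, R)                  # cost-to-go matrix
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain, u = -K x
print("optimal feedback gain K:", K)
```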
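
For the “don’t know the parameters” direction, the simplest plug-in baseline (certainty equivalence, not an optimism-based algorithm in the UCRL/LinUCB style): excite the noisy system with random inputs, estimate \(A, B\) by least squares, then solve the Riccati equation with the estimates. A sketch under made-up system and noise assumptions:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])   # unknown to the learner
B_true = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[0.1]])
n, m, T = 2, 1, 500

# Roll out with random exploratory inputs and small process noise.
X, U, Xnext = [], [], []
x = np.zeros(n)
for _ in range(T):
    u = rng.normal(size=m)
    x_next = A_true @ x + B_true @ u + 0.01 * rng.normal(size=n)
    X.append(x); U.append(u); Xnext.append(x_next)
    x = x_next

# Least squares on x' ≈ [A B] [x; u] gives estimates of A and B.
Z = np.hstack([np.array(X), np.array(U)])             # shape (T, n+m)
Theta, *_ = np.linalg.lstsq(Z, np.array(Xnext), rcond=None)
A_hat, B_hat = Theta.T[:, :n], Theta.T[:, n:]

# Certainty-equivalent controller: plug estimates into the Riccati solution.
P = solve_discrete_are(A_hat, B_hat, Q, R)
K_hat = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
print("estimated gain K_hat:", K_hat)
```

A PAC-style statement would then bound the excess cost of the plug-in gain in terms of the amount of data and the noise level; that is where the details from the bandit-style analyses would come in.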
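
For the convex-\(f\) direction with known dynamics, one concrete way to ask “how do linear controls do?” is to restrict to \(u = -Kx\) and numerically minimize the finite-horizon rollout cost over \(K\). Sketch with an illustrative Huber-style convex cost (the cost, horizon, and initial states are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # known dynamics
B = np.array([[0.0], [0.1]])
H = 50                                    # rollout horizon
x0s = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # a few initial states

def f(x, u):
    # Convex per-step cost: Huber loss on the state, quadratic penalty on u.
    huber = np.where(np.abs(x) < 1.0, 0.5 * x**2, np.abs(x) - 0.5)
    return huber.sum() + 0.1 * float(u @ u)

def total_cost(K_flat):
    # Total rollout cost of the linear policy u = -K x over all start states.
    K = K_flat.reshape(1, 2)
    cost = 0.0
    for x0 in x0s:
        x = x0
        for _ in range(H):
            u = -K @ x
            cost += f(x, u)
            x = A @ x + B @ u
    return cost

res = minimize(total_cost, x0=np.zeros(2), method="Nelder-Mead")
print("best linear gain:", res.x.reshape(1, 2), "rollout cost:", res.fun)
```

The “existence of good solution” question is then whether the best \(K\) found this way is close to the true optimum for the given convex \(f\).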