Figure out what’s provably known about RL!
Known model
- LP solves in poly time.
- Policy improvement converges to global optimum. Is it poly time? (cf. simplex method is not poly-time, but is under smoothed analysis)
- This seems unclear - it’s an open problem as of 04.
- Does alternating policy estimation/improvement converge? In poly time? (cf. alternating minimization)
- What about attaining the optimal Cesaro sum?
Unknown model
- Episodic: does MC converge? What is the convergence rate (regret)?
- Exploring starts, etc. (ability to choose starts? cf. optimal learning)
- TD learning (non-episodic): Convergence (or non-convergence) rate of
- SARSA (model estimation)
- Q-learning.
Parametrized policy
Suppose payout is convex in policy parameters. But why would this ever be the case???
Or: have to decide between several experts.
References