Inverse RL
Posted: 2017-04-03 , Modified: 2017-04-03
Tags: reinforcement learning
Compute constraints that characterize the set of reward functions under which the observed behavior maximizes reward; use a max-margin heuristic to choose among them.
The human knows the reward function; the robot does not. The robot's payoff is the human's reward.
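A minimal sketch of what the max-margin heuristic could look like, assuming a reward linear in features and given feature expectations for the observed policy and a set of alternatives (this concrete setup is an assumption, not from the notes):

```python
# Minimal sketch of a max-margin heuristic, assuming r(s) = w . phi(s) and given
# feature expectations of the observed policy and of some alternative policies.
# All names here are illustrative assumptions, not from the notes.
import numpy as np
from scipy.optimize import linprog

def max_margin_reward(mu_obs, mu_alts, w_bound=1.0):
    """Find w maximizing the margin t s.t. w.mu_obs >= w.mu_alt + t for every
    alternative, with |w_i| <= w_bound so the LP stays bounded."""
    d = len(mu_obs)
    c = np.zeros(d + 1)
    c[-1] = -1.0                       # linprog minimizes, so minimize -t
    A_ub = [np.append(mu_alt - mu_obs, 1.0) for mu_alt in mu_alts]
    b_ub = np.zeros(len(mu_alts))      # encodes w.(mu_alt - mu_obs) + t <= 0
    bounds = [(-w_bound, w_bound)] * d + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x[:d], res.x[-1]        # reward weights, achieved margin
```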
IRL: \(\Pj(u|\te, x_0) \propto e^{U_\te(x_0,u)}\).
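A small sketch of computing the posterior over \(\te\) under this observation model, assuming a discrete grid of candidate \(\te\) and a finite set of candidate plans (both assumptions for illustration):

```python
# Sketch of P(theta | u_obs) with P(u | theta, x0) proportional to exp(U_theta(x0, u)),
# over a discrete grid `thetas` and a finite plan set `plans` (assumed setup).
import numpy as np

def irl_posterior(thetas, prior, utility, x0, plans, obs_idx):
    """utility(theta, x0, u) returns the scalar U_theta(x0, u);
    obs_idx indexes the observed plan within `plans`."""
    post = np.empty(len(thetas))
    for i, th in enumerate(thetas):
        scores = np.array([utility(th, x0, u) for u in plans])
        scores -= scores.max()                        # numerical stability
        likelihood = np.exp(scores) / np.exp(scores).sum()
        post[i] = prior[i] * likelihood[obs_idx]
    return post / post.sum()
```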
Desiderata:
At time \(t\), both observe \(s_t\); H selects \(a_t^H\) and R selects \(a_t^R\). Both receive the same reward \(r_t=R(s_t,a_t^H, a_t^R;\te)\).
Note: this is a decentralized POMDP; computing the optimal joint policy is NEXP-complete.
Here, private information is restricted to \(\te\), so the reduction to a coordination-POMDP does not blow up the state space (\(|S_C|=|S||\Te|\)). (State is a tuple of world state, reward parameters, and R's belief.)
Belief about \(\te\) is sufficient statistic for optimal behavior. \((\pi^{H*},\pi^{R*})\) depends only on current state and R’s belief.
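A sketch of the corresponding belief update, assuming a discrete \(\Te\) and a known model of how H picks actions given \(\te\):

```python
# Bayesian update of R's belief over theta after observing H act in state s.
# Discrete Theta and the model pi_H(a | s, theta) are assumptions for illustration.
import numpy as np

def update_belief(belief, s, a_H, thetas, pi_H):
    """belief: (k,) array, current P(theta); pi_H(a_H, s, theta): likelihood of H's action."""
    likelihood = np.array([pi_H(a_H, s, th) for th in thetas])
    post = belief * likelihood
    return post / post.sum()
```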
Apprenticeship learning: imitate demonstrations.
ACIRL: 2 phases. First, the human and robot take turns acting (learning); then the robot acts independently (deployment).
Ex. With linear dependence on \(\te\), the optimal deployment policy maximizes the reward induced by the posterior mean of \(\te\).
DBE (demonstration by expert): H greedily maximizes immediate reward. R's best response is to compute the posterior over \(\te\).
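A sketch of why the mean suffices: with reward linear in \(\te\), expected reward under the posterior equals reward under the posterior mean \(\bar\te\), so deployment planning only needs \(\bar\te\) (helper names are assumptions):

```python
# Sketch, assuming r(s) = phi(s)^T theta: expected reward under the posterior
# equals reward under the posterior mean, so hand any MDP planner the reward
# induced by theta_bar.  `phi` and `states` are assumed helpers, not from the notes.
import numpy as np

def posterior_mean_reward(belief, thetas, phi, states):
    """belief: (k,) array P(theta); thetas: (k, d) candidate parameters."""
    theta_bar = np.sum(belief[:, None] * np.asarray(thetas), axis=0)   # posterior mean
    return {s: float(phi(s) @ theta_bar) for s in states}              # reward table
```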
There exist ACIRL games where \(br(br(\pi^E))\ne \pi^E\).
Suppose reward is \(\phi(s)^T\te\).
\[\tau^H = \amax_\tau \phi(\tau)^T \te - \eta\ve{\phi_\te-\phi(\tau)}^2.\]
Optimal \(\pi^R\) under DBE tries to match the observed feature counts: since reward is linear in \(\phi\), a policy's value depends only on its expected feature counts, so matching the demonstrator's counts guarantees the same expected reward for any \(\te\) consistent with the demonstration.
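To illustrate the trade-off in the equation above, a sketch that picks a demonstration from a finite candidate set by trading reward against distance to the target feature counts \(\phi_\te\) (the finite candidate set and \(\eta\) value are assumptions):

```python
# Among candidate demonstrations, pick the one balancing its own reward
# phi(tau)^T theta against the squared distance to the target feature counts
# phi_theta, as in the displayed equation.  Candidate set is an assumed input.
import numpy as np

def best_response_demo(candidate_phis, theta, phi_theta, eta):
    """candidate_phis: list of (d,) feature counts phi(tau), one per candidate tau."""
    scores = [phi @ theta - eta * np.sum((phi_theta - phi) ** 2)
              for phi in candidate_phis]
    return int(np.argmax(scores))   # index of the trajectory H should demonstrate
```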
Coordination problem.