Inverse RL

Posted: 2017-04-03 , Modified: 2017-04-03

Tags: reinforcement learning

[NR00] Inverse reinforcement learning

Compute constraints that characterize the set of reward functions under which the observed behavior is optimal; use a max-margin heuristic to pick one reward from this set.
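
If I remember [NR00] right (finite MDP, observed policy always takes action \(a_1\), \(R\) treated as a vector over states), \(a_1\) is optimal iff

\[(P_{a_1}-P_a)(I-\gamma P_{a_1})^{-1}R \succeq 0 \quad \text{for all } a\ne a_1,\]

and the max-margin heuristic picks, among such \(R\), one maximizing roughly \(\sum_s \min_{a\ne a_1}\{(P_{a_1}(s)-P_a(s))(I-\gamma P_{a_1})^{-1}R\} - \lambda\|R\|_1\) subject to \(|R_s|\le R_{\max}\).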

[HDAR16] Cooperative inverse reinforcement learning

Human knows reward function. Robot does not. Robot payoff is human reward.

IRL: \(\Pj(u|\te, x_0) \propto e^{U_\te(x_0,u)}\).
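
A minimal sketch of the Bayesian update this observation model implies, over a discretized set of reward parameters and a finite candidate demonstration set (all names here are hypothetical, not from the paper):

```python
import numpy as np

def irl_posterior(thetas, prior, demos, U, x0, u_obs):
    """Posterior over theta given an observed demonstration u_obs, under
    P(u | theta, x0) ∝ exp(U_theta(x0, u)), with the likelihood normalized
    over a finite candidate set `demos` (assumed to contain u_obs)."""
    k = demos.index(u_obs)
    post = np.zeros(len(thetas))
    for i, theta in enumerate(thetas):
        utils = np.array([U(x0, u, theta) for u in demos])
        lik = np.exp(utils - utils.max())  # softmax likelihood, stabilized
        lik /= lik.sum()
        post[i] = prior[i] * lik[k]
    return post / post.sum()
```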

Desiderata:

  1. Leverage actions to improve learning.
  2. The human is not a disinterested expert but a cooperative teacher.

Formalism

At time \(t\), both agents observe \(s_t\) and select actions \(a_t^H, a_t^R\); both receive the same reward \(r_t=R(s_t,a_t^H, a_t^R;\te)\).
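
A minimal sketch of this interaction, with placeholder transition, reward, and policy functions (none of these names come from the paper):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class CIRLGame:
    T: Callable[[Any, Any, Any], Any]         # T(s, aH, aR) -> next world state
    R: Callable[[Any, Any, Any, Any], float]  # R(s, aH, aR, theta) -> shared reward
    theta: Any                                # reward parameters: known to H, hidden from R

def rollout(game: CIRLGame, s0, pi_H, pi_R, horizon: int) -> float:
    """Both agents act in the same state and receive the same reward r_t."""
    s, total = s0, 0.0
    for _ in range(horizon):
        aH = pi_H(s, game.theta)  # the human conditions on theta
        aR = pi_R(s)              # the robot cannot
        total += game.R(s, aH, aR, game.theta)
        s = game.T(s, aH, aR)
    return total
```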

Note: this is a decentralized POMDP, for which computing an optimal joint policy is NEXP-complete in general.

Here, private information is restricted to \(\te\), so the reduction to a coordination-POMDP does not blow up the state space (\(|S_C|=|S||\Te|\)). (The state is a tuple of world state, reward parameters, and R’s belief.)

The belief about \(\te\) is a sufficient statistic for optimal behavior: \((\pi^{H*},\pi^{R*})\) depends only on the current state and R’s belief.

Apprenticeship learning: imitate demonstrations.

ACIRL (apprenticeship CIRL): two phases. Human and robot take turns in a learning phase; then the robot acts independently (deployment).

Example: with reward depending linearly on \(\te\), the optimal deployment policy is to maximize the reward induced by the posterior mean of \(\te\).
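
Why the mean suffices: with reward linear in \(\te\), the expected reward of any trajectory under the robot’s belief \(b\) factors through the mean,

\[\mathbb{E}_{\te\sim b}[\phi(\tau)^T\te] = \phi(\tau)^T\,\mathbb{E}_{\te\sim b}[\te],\]

so ranking trajectories by expected reward is the same as ranking them by the reward induced by \(\hat\te = \mathbb{E}_b[\te]\).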

DBE (demonstration-by-expert): the human greedily maximizes immediate reward. The robot’s best response is to compute the posterior over \(\te\) and act on it.

There exist ACIRL games where \(br(br(\pi^E))\ne \pi^E\), i.e., expert demonstration is not a best response to a robot that best-responds to expert demonstration.

Simple approximation scheme

Suppose reward is \(\phi(s)^T\te\).

\[\tau^H = \amax_\tau \phi(\tau)^T \te - \eta\ve{\phi_\te-\phi(\tau)}^2.\]

The optimal \(\pi^R\) under DBE tries to match the observed feature counts. (I don’t get this...)
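
A minimal sketch of the human’s side of this trade-off over a finite candidate set of trajectories (hypothetical helper names; `phi_theta` plays the role of \(\phi_\te\)):

```python
import numpy as np

def best_demo(trajs, phi, theta, phi_theta, eta):
    """Choose tau^H maximizing phi(tau)^T theta - eta * ||phi_theta - phi(tau)||^2:
    trade off the human's own reward against steering the feature counts
    that a feature-matching robot will later imitate."""
    def score(tau):
        f = np.asarray(phi(tau))
        return float(f @ theta - eta * np.sum((phi_theta - f) ** 2))
    return max(trajs, key=score)
```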

Future work

Coordination problem.

Other papers