IRL ⇒ Inverse Reinforcement Learning

📢
Inverse Reinforcement Learning: Infer reward functions from demonstrations

Standard Imitation Learning

  • Copy the actions performed by the expert
  • No reasoning about outcomes of actions

Human Imitation Learning

  • Copy the intent of the expert
  • Might take very different actions!

Problem: many reward functions can explain the same behavior

(Figure: one observed behavior shown alongside several possible reward functions that could all explain it.)

Reward Parameterization

Feature Matching IRL

Linear Reward Function

$$r_\psi(s,a) = \sum_i \psi_i f_i(s,a) = \psi^{\top} f(s,a)$$

If the features $f$ are important, what if we match their expectations?

Let $\pi^{r_\psi}$ be the optimal policy for $r_\psi$.

If we pick $\psi$ such that $\mathbb{E}_{\pi^{r_\psi}}[f(s,a)] = \mathbb{E}_{\pi^*}[f(s,a)]$, the problem is still ambiguous, because many different $\psi$ (and hence many reward functions) satisfy this feature-matching condition.
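To make the feature-matching condition concrete, here is a minimal sketch of estimating $\mathbb{E}_\pi[f(s,a)]$ by Monte Carlo from sampled trajectories; the `rollout_fn` and `features` interfaces are illustrative assumptions, not a specific library API.

```python
import numpy as np

def estimate_feature_expectations(rollout_fn, features, num_trajs=100):
    """Monte Carlo estimate of E_pi[f(s, a)]: sum the feature vectors along each
    sampled trajectory and average over trajectories (a discount factor could be
    folded in here as well).

    rollout_fn() -> list of (s, a) pairs from one episode under pi (illustrative).
    features(s, a) -> feature vector f(s, a) as a 1-D numpy array.
    """
    traj_sums = []
    for _ in range(num_trajs):
        traj_sums.append(sum(features(s, a) for s, a in rollout_fn()))
    return np.mean(traj_sums, axis=0)

# With a linear reward r_psi(s, a) = psi^T f(s, a), matching these estimates for
# the expert demonstrations and for pi^{r_psi} is exactly the condition above.
```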

So suppose we use the maximum margin principle (similar to an SVM):

$$\max_{\psi, m} \; m \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^{\top} \mathbb{E}_\pi [f(s,a)] + m$$

Applying the "SVM trick", the problem becomes

$$\min_\psi \frac{1}{2} \|\psi\|^2 \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s,a)] + 1$$
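Treated concretely, this is just a quadratic program. Below is a minimal sketch using cvxpy (an assumed dependency), where `mu_expert` and `candidate_mus` stand for estimates of the expert's and candidate policies' feature expectations; enumerating a finite candidate set for $\Pi$ this way is an illustrative simplification.

```python
import cvxpy as cp
import numpy as np

def max_margin_weights(mu_expert, candidate_mus):
    """Solve  min 0.5 ||psi||^2  s.t.  psi^T mu_expert >= psi^T mu_pi + 1
    for every candidate policy pi.

    mu_expert: (d,) array; candidate_mus: list of (d,) arrays.
    """
    d = mu_expert.shape[0]
    psi = cp.Variable(d)
    constraints = [psi @ mu_expert >= psi @ mu + 1 for mu in candidate_mus]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(psi)), constraints)
    problem.solve()
    return psi.value
```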

Let's also replace the fixed margin with a measure of how different a policy is from the expert (policies close to $\pi^*$ then need less margin):

$$\min_\psi \frac{1}{2} \|\psi\|^2 \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s,a)] + D(\pi, \pi^*)$$

Note:

$D(\cdot)$ can be either a difference in feature expectations or a KL divergence.

Issues:

  • Maximizing the margin is a somewhat arbitrary choice, and expert suboptimality can only be handled with ad hoc additions such as slack variables
  • The result is a constrained optimization problem, which is awkward to combine with deep/nonlinear reward representations

Abbeel & Ng: Apprenticeship Learning via Inverse Reinforcement Learning
Ratliff et al.: Maximum Margin Planning

Optimal Control as a Model of Human Behavior

$$p(s_{1:T}, a_{1:T}) = \;?$$

What we know:

$$p(O_t|s_t,a_t) = \exp(r(s_t,a_t))$$

But we can do inference under optimality: what is the probability of a trajectory given that it was generated optimally?

$$\begin{split} p(\tau|O_{1:T}) &= \frac{p(\tau, O_{1:T})}{p(O_{1:T})} \\ &\propto p(\tau) \prod_t \exp(r(s_t,a_t)) \\ &= p(\tau) \exp\Big(\sum_{t} r(s_t,a_t)\Big) \end{split}$$

Learning the optimality variable

$$p(O_t|s_t,a_t,\psi) = \exp(r_\psi(s_t,a_t))$$

$$p(\tau|O_{1:T}, \psi) \propto \underbrace{p(\tau)}_{\mathclap{\text{can be ignored: independent of }\psi}} \exp\Big(\sum_t r_\psi(s_t,a_t)\Big)$$

Maximum Likelihood Learning:

$$\max_{\psi} \frac{1}{N} \sum_{i=1}^N \log p(\tau_i | O_{1:T}, \psi) = \max_{\psi} \frac{1}{N} \sum_{i=1}^N r_\psi(\tau_i) - \log Z$$
📢
Why do we subtract a $\log Z$ term here? Because if we only maximized the reward of the demonstrations, we could simply make the reward high everywhere, and the learned reward function would be basically unusable. $Z$ is called the partition function.
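To see where this objective comes from, expand the log-likelihood using the trajectory distribution above; the prior term drops out of the maximization because it does not depend on $\psi$:

$$\log p(\tau_i|O_{1:T},\psi) = \log \frac{p(\tau_i)\exp(r_\psi(\tau_i))}{Z} = \underbrace{\log p(\tau_i)}_{\text{independent of }\psi} + \; r_\psi(\tau_i) - \log Z$$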

IRL Partition Function

$$Z = \int p(\tau) \exp(r_\psi(\tau))\, d\tau$$
🤧
Wow! That looks like an integral over all possible trajectories, which is generally intractable, but let's plug it into the maximum likelihood objective and see what happens anyway!

$$\begin{split} \nabla_\psi L &= \frac{1}{N} \sum_{i=1}^N \nabla_\psi r_\psi(\tau_i) - \frac{1}{Z} \int p(\tau) \exp(r_\psi(\tau)) \nabla_\psi r_\psi(\tau)\, d\tau \\ &= \mathbb{E}_{\tau \sim \pi^*(\tau)}[\nabla_\psi r_\psi(\tau)] - \mathbb{E}_{\tau \sim p(\tau|O_{1:T},\psi)}[\nabla_{\psi} r_\psi(\tau)] \end{split}$$
$$\begin{split} \mathbb{E}_{\tau\sim p(\tau|O_{1:T},\psi)} [\nabla_\psi r_\psi (\tau)] &= \mathbb{E}_{\tau\sim p(\tau|O_{1:T},\psi)} \Big[\nabla_\psi \sum_{t=1}^T r_\psi(s_t,a_t)\Big] \\ &= \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t|O_{1:T},\psi)}[\nabla_\psi r_\psi(s_t,a_t)] \end{split}$$

Note ($\alpha$ and $\beta$ below are the forward and backward messages from control as inference):

$$\begin{split} p(s_t,a_t|O_{1:T}, \psi) &= \overbrace{p(a_t|s_t, O_{1:T}, \psi)}^{\mathclap{=\frac{\beta(s_t,a_t)}{\beta(s_t)}}}\; \underbrace{p(s_t|O_{1:T},\psi)}_{\mathclap{\propto\, \alpha (s_t)\, \beta(s_t)}} \\ &\propto \beta(s_t,a_t)\, \alpha(s_t) \end{split}$$

Let $\mu_t(s_t,a_t) \propto \beta(s_t,a_t)\, \alpha(s_t)$ denote the (normalized) state-action visitation probability at time $t$.

$$\begin{split} \mathbb{E}_{\tau\sim p(\tau|O_{1:T},\psi)} [\nabla_\psi r_\psi (\tau)] &= \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t|O_{1:T},\psi)}[\nabla_\psi r_\psi(s_t,a_t)] \\ &= \sum_{t=1}^T \int\!\!\int \mu_t(s_t,a_t)\, \nabla_\psi r_\psi(s_t,a_t)\, ds_t\, da_t \\ &= \sum_{t=1}^T \vec{\mu}_t^{\,\top} \nabla_\psi \vec{r}_\psi \end{split}$$

Note: this works for small, discrete state and action spaces, where the messages, and hence $\vec{\mu}_t$, can be computed exactly with dynamic programming.
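Putting the pieces together, here is a minimal tabular sketch of one gradient step, assuming a finite MDP with a feature tensor `feats`, transition table `P`, initial distribution `p0`, horizon `T`, and a uniform action prior; all names are illustrative, and a real implementation would work in log space for numerical stability.

```python
import numpy as np

def maxent_irl_gradient(psi, feats, P, p0, demos, T):
    """One gradient of the MaxEnt IRL objective for a linear reward r = psi^T f.

    feats: (S, A, d) array of features f(s, a)
    P:     (S, A, S) array of transition probabilities p(s' | s, a)
    p0:    (S,) initial state distribution
    demos: list of expert trajectories, each a list of (s, a) index pairs
    T:     horizon
    """
    S, A, d = feats.shape
    r = feats @ psi                                   # (S, A) reward table

    # Backward messages: beta_t(s, a) = exp(r(s, a)) * E_{s'}[beta_{t+1}(s')]
    beta_sa = np.zeros((T, S, A))
    beta_s_next = np.ones(S)                          # boundary: beta_{T+1}(s) = 1
    for t in reversed(range(T)):
        beta_sa[t] = np.exp(r) * (P @ beta_s_next)    # (S, A)
        beta_s_next = beta_sa[t].mean(axis=1)         # uniform action prior p(a|s)

    # Forward messages: alpha_1(s) = p(s_1), then push through exp(r) and dynamics
    alpha = np.zeros((T, S))
    alpha[0] = p0
    for t in range(T - 1):
        m = alpha[t][:, None] * np.exp(r) / A         # (S, A)
        alpha[t + 1] = np.einsum('sa,sap->p', m, P)

    # State-action visitation: mu_t(s, a) proportional to beta_t(s, a) * alpha_t(s)
    mu = beta_sa * alpha[:, :, None]
    mu /= mu.sum(axis=(1, 2), keepdims=True)          # normalize per time step

    # Gradient = expert feature counts - soft-optimal visitation feature counts
    grad_demo = np.mean([sum(feats[s, a] for s, a in traj) for traj in demos], axis=0)
    grad_model = np.einsum('tsa,sad->d', mu, feats)
    return grad_demo - grad_model

# Gradient ascent on the reward weights:
#   psi = psi + step_size * maxent_irl_gradient(psi, feats, P, p0, demos, T)
```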

MaxEnt IRL

Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement Learning

It's nice because, for a linear reward $r_\psi = \psi^\top f$, this objective can be shown to solve $\max_\psi \mathcal{H}(\pi^{r_\psi})$ subject to $\mathbb{E}_{\pi^{r_\psi}}[f] = \mathbb{E}_{\pi^*}[f]$: among all policies that match the expert's feature expectations, it picks the maximum-entropy one, i.e., the one that makes the fewest additional assumptions about the expert's behavior.

Extending to high-dimensional spaces

Assume we can sample from the real environment

Idea: estimate the intractable expectation with trajectories sampled from a policy, and correct for the distribution mismatch with importance weights (below).

Importance weights

$$\begin{split} w_j &= \frac{p(\tau_j) \exp(r_\psi(\tau_j))}{\pi(\tau_j)} \\ &= \frac{p(s_1)\prod_t p(s_{t+1}|s_t,a_t) \exp(r_\psi(s_t,a_t))}{p(s_1) \prod_t p(s_{t+1}|s_t,a_t)\, \pi(a_t|s_t)} \\ &= \frac{\exp\big(\sum_t r_\psi(s_t,a_t)\big)}{\prod_t \pi(a_t|s_t)} \end{split}$$
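A minimal sketch of how these weights would be used to estimate the gradient, assuming a function that returns $r_\psi(\tau)$ and its gradient, and policy samples that carry $\sum_t \log \pi(a_t|s_t)$; the names and interfaces are illustrative, and the weights are self-normalized so the unknown partition function cancels.

```python
import numpy as np

def sampled_irl_gradient(reward_and_grad, demo_trajs, policy_samples):
    """Estimate grad_psi L = E_demo[grad r_psi] - E_soft-opt[grad r_psi], with the
    second expectation approximated by importance-weighted policy samples.

    reward_and_grad(tau) -> (r_psi(tau), d r_psi(tau) / d psi)   (illustrative)
    policy_samples: list of (tau_j, sum_t log pi(a_t|s_t)) pairs sampled from pi
    """
    # Expert term: average reward gradient over demonstrations
    demo_term = np.mean([reward_and_grad(tau)[1] for tau in demo_trajs], axis=0)

    # Sample term: log w_j = r_psi(tau_j) - sum_t log pi(a_t|s_t), kept in log space
    log_w, grads = [], []
    for tau, sum_log_pi in policy_samples:
        r, g = reward_and_grad(tau)
        log_w.append(r - sum_log_pi)
        grads.append(g)
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())        # subtract max for numerical stability
    w /= w.sum()                           # self-normalize (partition function cancels)
    sample_term = np.sum(w[:, None] * np.asarray(grads), axis=0)

    return demo_term - sample_term
```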

IRL and GANs

The IRL algorithm described earlier looks like a game:

"The policy tries to look as good as the expert; the reward tries to differentiate the expert from the policy."

Generative Adversarial Networks (GANs)

Inverse RL as a GAN

Finn*, Christiano* et al.: "A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models."

In a GAN, the optimal discriminator is

$$D^*(x) = \frac{p^*(x)}{p_\theta(x) + p^*(x)}$$
That is, at convergence the discriminator outputs the ratio of the real-data density to the sum of the generated and real densities.
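A quick way to see this (a standard GAN result): write the discriminator objective as an integral and maximize it pointwise in $x$,

$$\mathbb{E}_{x\sim p^*}[\log D(x)] + \mathbb{E}_{x\sim p_\theta}[\log(1-D(x))] = \int \big(p^*(x)\log D(x) + p_\theta(x)\log(1-D(x))\big)\, dx,$$

and for fixed $x$ the integrand $a\log y + b\log(1-y)$ is maximized at $y = \frac{a}{a+b}$, which gives $D^*(x) = \frac{p^*(x)}{p^*(x)+p_\theta(x)}$.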

We know that, for IRL, the optimal policy approaches

$$\pi_\theta(\tau) \propto p(\tau) \exp(r_\psi(\tau))$$

The parameterization of the discriminator for IRL is:

$$\begin{split} D_\psi(\tau) &= \frac{p(\tau) \frac{1}{Z} \exp(r(\tau))}{p_\theta(\tau) + p(\tau) \frac{1}{Z}\exp(r(\tau))} \\ &= \frac{p(\tau)\, Z^{-1} \exp(r(\tau))}{p(\tau)\prod_t \pi_\theta(a_t|s_t) + p(\tau)\, Z^{-1} \exp(r(\tau))} \\ &= \frac{Z^{-1} \exp(r(\tau))}{\prod_t \pi_\theta(a_t|s_t) + Z^{-1}\exp(r(\tau))} \end{split}$$

We optimize $D_\psi$ w.r.t. $\psi$:

$$\psi \leftarrow \arg\max_{\psi}\; \mathbb{E}_{\tau \sim p^*}[\log D_{\psi}(\tau)] + \mathbb{E}_{\tau \sim \pi_\theta}[\log (1-D_\psi(\tau))]$$
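A minimal numerical sketch of this discriminator objective, evaluated in log space for stability; the batches of $(r_\psi(\tau), \sum_t \log \pi_\theta(a_t|s_t))$ pairs and the treatment of $\log Z$ as a plain scalar are illustrative assumptions.

```python
import numpy as np

def log_D_and_log_1mD(r_tau, sum_log_pi, log_Z=0.0):
    """log D_psi(tau) and log(1 - D_psi(tau)) for
    D_psi(tau) = Z^{-1} exp(r(tau)) / (prod_t pi_theta(a_t|s_t) + Z^{-1} exp(r(tau)))."""
    a = r_tau - log_Z                      # log of Z^{-1} exp(r(tau))
    b = sum_log_pi                         # log of prod_t pi_theta(a_t|s_t)
    log_denominator = np.logaddexp(a, b)
    return a - log_denominator, b - log_denominator

def discriminator_objective(expert_batch, policy_batch, log_Z=0.0):
    """E_expert[log D] + E_policy[log(1 - D)], the quantity maximized over psi.
    Each batch: list of (r_psi(tau), sum_t log pi_theta(a_t|s_t)) pairs."""
    expert_term = np.mean([log_D_and_log_1mD(r, lp, log_Z)[0] for r, lp in expert_batch])
    policy_term = np.mean([log_D_and_log_1mD(r, lp, log_Z)[1] for r, lp in policy_batch])
    return expert_term + policy_term
```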

What if, instead of using the structured discriminator

$$D_\psi(\tau) = \frac{Z^{-1} \exp(r(\tau))}{\prod_t \pi_\theta(a_t|s_t) + Z^{-1}\exp(r(\tau))}$$

we just use a standard binary neural network classifier as the discriminator? (This is often simpler in practice, but the discriminator then no longer recovers an explicit reward function.)

Suggested Reading

Classic Papers:

Modern Papers: