Supervised Learning of Policies (Behavior Cloning)

Imitation Learning

⚠️
Does it work: NO!!!!

In practice, however, after enough training it actually worked.

The “tightrope walker” scenario: the policy has to follow exactly the same behavior as the demonstrations; as soon as it deviates, it reaches states it has never seen and does not know what to do.

Three camera angles, each supervised to steer slightly differently (the side views are labeled with corrective steering)! This trick helped a lot.

In effect, the training data now contains scenarios that teach the model how to correct its own small mistakes.

Instead of being clever about $p_{\pi_\theta}(o_t)$, let’s be clever about $p_{data}(o_t)$ ⇒ DAgger (Dataset Aggregation)

DAgger (Dataset Aggregation)

🔥
Addresses “distributional shift” by adding on-policy data to the training set

Idea: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$

How: just run $\pi_\theta(a_t|o_t)$, but we need the action labels $a_t$

  1. Train $\pi_\theta(a_t|o_t)$ on the human dataset $D = \{o_1, a_1, \dots, o_N, a_N\}$
  2. Run $\pi_\theta(a_t|o_t)$ to obtain $D_\pi = \{o_1, \dots, o_M\}$
  3. Ask humans to label $D_\pi$ with actions $a_t$
  4. Aggregate $D \leftarrow D \cup D_\pi$ and repeat (a minimal sketch of the loop follows below)
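A minimal sketch of this loop, in pseudocode-style Python. The helpers `train_policy`, `rollout_policy`, and `ask_human_labels` are hypothetical placeholders (not from the lecture) standing in for supervised training, environment rollouts, and the human labeling interface.

```python
# Hypothetical helpers (assumptions, not real APIs):
#   train_policy(dataset)        -> fits pi_theta(a|o) by supervised learning
#   rollout_policy(policy, env)  -> runs pi_theta and returns the observations it visits
#   ask_human_labels(obs)        -> returns expert actions for those observations

def dagger(env, human_dataset, n_iters=10):
    dataset = list(human_dataset)          # D = {(o_i, a_i)} from the human demos
    policy = train_policy(dataset)         # step 1: behavior cloning on D
    for _ in range(n_iters):
        on_policy_obs = rollout_policy(policy, env)    # step 2: D_pi = {o_1..o_M}
        labels = ask_human_labels(on_policy_obs)       # step 3: humans label a_t
        dataset += list(zip(on_policy_obs, labels))    # step 4: D <- D ∪ D_pi
        policy = train_policy(dataset)                 # retrain and repeat
    return policy
```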

But:

  1. Much of the difficulty is in step 3
    1. Humans normally act using closed-loop feedback, so it is unnatural for us to look at an isolated observation and output the optimal action
  2. Alternatively: what if the model simply does not drift?
  3. Then we would need to mimic the expert’s behavior very accurately

Why might we fail to imitate the expert?

[Figure 1 and Figure 2, referenced in the list below]
  1. Non-Markovian behavior
    1. It is unnatural for humans to act in a perfectly Markovian way: human behavior looks more like $\pi_\theta(a_t|o_1, \dots, o_t)$, i.e. conditioned on all previous observations
    2. Address this by adding RNN/LSTM cells to the network (Fig. 1); see the sketch after this list
      1. May still work poorly (Figure 2), e.g. due to causal confusion (see the questions below)
  2. Multimodal behavior (multiple modes/peaks in the true action distribution)
    1. With discrete actions this is not an issue
      1. But continuous actions would require exponentially many discrete bins (for a softmax)
    2. Ways to handle it:
      1. Output a mixture of Gaussians (sketch after this list)
        1. $\pi(a|o)=\sum_i w_i \,\mathcal{N}(\mu_i,\Sigma_i)$
        2. Tradeoffs:
          1. needs more output parameters
          2. modeling high-dimensional actions this way is hard (in theory the number of mixture components needed grows exponentially with the action dimension)
      2. Latent variable models
        1. The output is still a single Gaussian
        2. In addition to the image, we also feed a latent variable (e.g. drawn from a prior distribution) into the model
          1. Conditional Variational Autoencoders (VAEs)
          2. Normalizing flows / RealNVP
          3. Stein Variational Gradient Descent
        3. Complex to train
      3. Autoregressive discretization (sketch below)
        1. Discretizes one action dimension at a time using a network trick ⇒ never incurs the exponential cost
        2. A softmax over bins gives dimension 1; sample a discrete value from it, feed that value into another network with its own softmax to get dimension 2, sample again, and so on
        3. Still have to pick the bins
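A minimal sketch of the history-conditioned idea, $\pi_\theta(a_t|o_1,\dots,o_t)$ (my own illustration, not from the lecture): encode each observation, run an LSTM over the history, and read the action off the final hidden state. Layer sizes and dimensions are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    """pi_theta(a_t | o_1..o_t): encode each observation, run an LSTM over the
    history, and decode the final hidden state into an action."""
    def __init__(self, obs_dim=32, hidden_dim=64, act_dim=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)   # stands in for a CNN on images
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs_seq):                  # obs_seq: (batch, T, obs_dim)
        z = torch.relu(self.encoder(obs_seq))    # per-step features
        h, _ = self.lstm(z)                      # h: (batch, T, hidden_dim)
        return self.head(h[:, -1])               # action conditioned on o_1..o_T

policy = HistoryPolicy()
actions = policy(torch.randn(8, 10, 32))         # 8 sequences of 10 observations each
```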
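And a sketch of the mixture-of-Gaussians head $\pi(a|o)=\sum_i w_i\,\mathcal{N}(\mu_i,\Sigma_i)$, again my own, with made-up sizes and diagonal covariances: the network outputs mixture weights, means, and log-standard-deviations, and behavior cloning minimizes the negative log-likelihood of the expert action.

```python
import torch
import torch.nn as nn

class MixturePolicy(nn.Module):
    """pi(a|o) = sum_i w_i N(mu_i, diag(sigma_i^2)) with k mixture components."""
    def __init__(self, obs_dim=32, act_dim=4, k=5):
        super().__init__()
        self.k, self.act_dim = k, act_dim
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, k * (1 + 2 * act_dim)))

    def forward(self, obs):
        out = self.net(obs)
        logits = out[:, :self.k]                                   # mixture weights w_i (as logits)
        mu = out[:, self.k:self.k * (1 + self.act_dim)]            # component means
        log_std = out[:, self.k * (1 + self.act_dim):]             # component log std devs
        mu = mu.reshape(-1, self.k, self.act_dim)
        log_std = log_std.reshape(-1, self.k, self.act_dim)
        mix = torch.distributions.Categorical(logits=logits)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mu, log_std.exp()), 1)
        return torch.distributions.MixtureSameFamily(mix, comp)

policy = MixturePolicy()
dist = policy(torch.randn(8, 32))                 # batch of 8 observations
loss = -dist.log_prob(torch.randn(8, 4)).mean()   # behavior-cloning NLL on expert actions
```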

[Figure: autoregressive discretization]
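A sketch of autoregressive discretization under the same caveats (my own code, hypothetical sizes): each action dimension gets its own small network and softmax over bins, conditioned on the observation plus the dimensions already sampled, so we never enumerate the joint action space.

```python
import torch
import torch.nn as nn

class AutoregressiveDiscretePolicy(nn.Module):
    """Discretize one action dimension at a time: dimension d is predicted from
    the observation plus the already-sampled values of dimensions 0..d-1."""
    def __init__(self, obs_dim=32, act_dim=3, n_bins=21, low=-1.0, high=1.0):
        super().__init__()
        self.bins = torch.linspace(low, high, n_bins)        # still have to pick the bins
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + d, 64), nn.ReLU(), nn.Linear(64, n_bins))
            for d in range(act_dim))

    def sample(self, obs):                                   # obs: (batch, obs_dim)
        sampled = []
        for head in self.heads:
            x = torch.cat([obs] + sampled, dim=-1)           # condition on previous dims
            idx = torch.distributions.Categorical(logits=head(x)).sample()
            sampled.append(self.bins[idx].unsqueeze(-1))     # map bin index back to a value
        return torch.cat(sampled, dim=-1)                    # (batch, act_dim)

policy = AutoregressiveDiscretePolicy()
action = policy.sample(torch.randn(8, 32))
```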

Questions:

  1. Does including history information (LSTM/RNN) mitigate causal confusion?
    1. My guess is no, since adding history does not by itself address the source of the confusion
  2. Can DAgger mitigate causal confusion?
    1. My guess is yes, since the states where the model gets confused will show up in the on-policy data and will then be labeled manually

What’s wrong with imitation learning:

  1. Humans need to provide the data, which is limited
    1. Deep learning works best when data is plentiful
  2. Humans are not good at providing some kinds of actions
  3. Humans can learn autonomously; can we make machines do the same?
    1. unlimited data from their own experience
    2. continuous self-improvement

Analysis of Imitation Learning

Ross et al., “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”

We assume:

$$\begin{aligned}
& c(s,a) = \begin{cases} 0 & \text{if } a = \pi^*(s) \\ 1 & \text{otherwise} \end{cases}\\
& \pi_\theta(a \ne \pi^*(s)\mid s) \le \epsilon \quad \text{for } s \sim p_{train}(s) \quad \leftarrow \text{the model does well on the training set} \\
& \text{(it is actually enough to have } \mathbb{E}_{p_{train}(s)}\big[\pi_\theta(a \ne \pi^*(s)\mid s)\big] \le \epsilon \text{)}
\end{aligned}$$

then…

With DAgger, $p_{train}(s) \rightarrow p_\theta(s)$, so

$$\mathbb{E}\Big[\sum_t c(s_t,a_t)\Big] \le \epsilon T$$

Without DAgger

$$\begin{aligned}
&\text{if } p_{train}(s) \ne p_\theta(s): \\
& p_\theta(s_t)= \underbrace{(1-\epsilon)^t}_{\text{probability that we made no mistakes}} p_{train}(s_t) + \big(1-(1-\epsilon)^t\big)\, p_{mistake}(s_t)
\end{aligned}$$

We cannot assume anything about $p_{mistake}$.

But we can bound the total variation divergence between the test-time and training state distributions:

$$\big| p_\theta(s_t)-p_{train}(s_t)\big| = \underbrace{\big(1-(1-\epsilon)^t\big)\big|p_{mistake}(s_t)-p_{train}(s_t)\big|}_{\text{substituting the expression above for } p_\theta(s_t)} \le \big(1-(1-\epsilon)^t\big) \times 2$$

Since $(1-\epsilon)^t \ge 1-\epsilon t$ for $\epsilon \in [0,1]$, we have $1-(1-\epsilon)^t \le \epsilon t$, so

$$\big| p_\theta(s_t)-p_{train}(s_t)\big| \le 2\,\epsilon t$$

So finally

$$\begin{aligned}
\mathbb{E}_{p_\theta(s_t)}\Big[\sum_t c(s_t,a_t)\Big] &= \sum_t \sum_{s_t} p_\theta(s_t)\, c_t(s_t) \\
&\le \sum_t \sum_{s_t} \Big( p_{train}(s_t)\, c_t(s_t) + \big|p_\theta(s_t) - p_{train}(s_t)\big|\, c_{max} \Big) \\
&\le \sum_t \big(\epsilon + 2\epsilon t\big) \\
&\le \epsilon T + 2\epsilon T^2 \\
&\in O(\epsilon T^2)
\end{aligned}$$
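For a quick sense of the gap between the two bounds (my own illustration; $\epsilon$ and $T$ are arbitrary choices):

```python
eps, T = 0.01, 1000                          # per-step error rate and horizon (arbitrary)
with_dagger = eps * T                        # O(eps*T): train and test distributions match
without_dagger = eps * T + 2 * eps * T**2    # O(eps*T^2): worst case under distributional shift
print(with_dagger, without_dagger)           # 10.0 vs 20010.0
```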

Another way to imitate: Goal-Conditioned Behavioral Cloning

After clarification from class:

  1. Sometimes we have imperfect demonstrations that end up in many different final states
  2. Those demonstrations may still be similar enough that we can train a single shared policy across those different end states (most of the earlier states can be handled by shared weights)
  3. At deployment / test time, specify the end state (goal) $g$ you want to reach and act according to $\pi_\theta(a|s,g)$ (a minimal sketch follows below)
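A minimal sketch of goal-conditioned behavior cloning (my own illustration, not the paper’s exact method): relabel each demonstration with the end state it actually reached as the goal $g$, then clone actions conditioned on $(s, g)$. The `trajectories` format, network sizes, and MSE loss are assumptions.

```python
import torch
import torch.nn as nn

def goal_conditioned_bc(trajectories, state_dim=16, act_dim=4, epochs=100):
    """trajectories: list of (states, actions) pairs with shapes (T, state_dim), (T, act_dim).
    Trains pi_theta(a | s, g), where g is relabeled as the state the demo actually ended in."""
    policy = nn.Sequential(nn.Linear(2 * state_dim, 128), nn.ReLU(),
                           nn.Linear(128, act_dim))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        for states, actions in trajectories:
            goal = states[-1].expand(len(states), -1)   # hindsight: reached end state as goal
            pred = policy(torch.cat([states, goal], dim=-1))
            loss = ((pred - actions) ** 2).mean()       # simple mean-squared-error cloning
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy

# usage with toy data: one demonstration of 50 steps
demo = [(torch.randn(50, 16), torch.randn(50, 4))]
policy = goal_conditioned_bc(demo, epochs=5)
```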

“Learning to Reach Goals via Iterated Supervised Learning” (Dibya Ghosh, Abhishek Gupta, et al.)