Supervised Learning of Policies (Behavior Cloning)

Imitation Learning

⚠️
Does it work: NO!!!!

In practice, however, after enough training it actually worked.

The “tightrope walker” scenario: the policy has to follow exactly the same behavior as the demonstrations; as soon as it deviates, it reaches states it has never seen and does not know what to do.

Three camera angles, each supervised to steer slightly differently (the side views are labeled with corrective steering)! This trick helped a lot.

In effect, the training data now contains scenarios that teach the model how to correct its own small mistakes.

Instead of being clever about $p_{\pi_\theta}(o_t)$, let’s be clever about $p_{data}(o_t)$ ⇒ DAgger (Dataset Aggregation)

DAgger (Dataset Aggregation)

🔥
Addresses “distributional shift” by adding on-policy data to the training set

Idea: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$

How: just run $\pi_\theta(a_t|o_t)$, but we need the action labels $a_t$

  1. Train $\pi_\theta(a_t|o_t)$ on the human dataset $D = \{o_1, a_1, \dots, o_N, a_N\}$
  2. Run $\pi_\theta(a_t|o_t)$ to obtain $D_\pi = \{o_1, \dots, o_M\}$
  3. Ask humans to label $D_\pi$ with actions $a_t$
  4. Aggregate $D \leftarrow D \cup D_\pi$ and repeat (a minimal sketch of the loop follows below)
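A minimal sketch of this loop, in pseudocode-style Python. The helpers `train_policy`, `rollout_policy`, and `ask_human_labels` are hypothetical placeholders (not from the lecture) standing in for supervised training, environment rollouts, and the human labeling interface.

```python
# Hypothetical helpers (assumptions, not real APIs):
#   train_policy(dataset)        -> fits pi_theta(a|o) by supervised learning
#   rollout_policy(policy, env)  -> runs pi_theta and returns the observations it visits
#   ask_human_labels(obs)        -> returns expert actions for those observations

def dagger(env, human_dataset, n_iters=10):
    dataset = list(human_dataset)          # D = {(o_i, a_i)} from the human demos
    policy = train_policy(dataset)         # step 1: behavior cloning on D
    for _ in range(n_iters):
        on_policy_obs = rollout_policy(policy, env)    # step 2: D_pi = {o_1..o_M}
        labels = ask_human_labels(on_policy_obs)       # step 3: humans label a_t
        dataset += list(zip(on_policy_obs, labels))    # step 4: D <- D ∪ D_pi
        policy = train_policy(dataset)                 # retrain and repeat
    return policy
```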

But:

  1. Much of the difficulty is in step 3
    1. Humans normally act using closed-loop feedback, so it is unnatural for us to look at an isolated observation and output the optimal action
  2. Alternatively: what if the model simply does not drift?
  3. Then we would need to mimic the expert’s behavior very accurately

Why might we fail to imitate the expert?

[Figure 1 and Figure 2, referenced in the list below]
  1. Non-Markovian behavior
    1. It is unnatural for humans to act in a perfectly Markovian way: human behavior looks more like $\pi_\theta(a_t|o_1, \dots, o_t)$, i.e. conditioned on all previous observations
    2. Address this by adding RNN/LSTM cells to the network (Fig. 1); see the sketch after this list
      1. May still work poorly (Figure 2), e.g. due to causal confusion (see the questions below)
  2. Multimodal behavior (multiple modes/peaks in the true action distribution)
    1. With discrete actions this is not an issue
      1. But continuous actions would require exponentially many discrete bins (for a softmax)
    2. Ways to handle it:
      1. Output a mixture of Gaussians (sketch after this list)
        1. $\pi(a|o)=\sum_i w_i \,\mathcal{N}(\mu_i,\Sigma_i)$
        2. Tradeoffs:
          1. needs more output parameters
          2. modeling high-dimensional actions this way is hard (in theory the number of mixture components needed grows exponentially with the action dimension)
      2. Latent variable models
        1. The output is still a single Gaussian
        2. In addition to the image, we also feed a latent variable (e.g. drawn from a prior distribution) into the model
          1. Conditional Variational Autoencoders (VAEs)
          2. Normalizing flows / RealNVP
          3. Stein Variational Gradient Descent
        3. Complex to train
      3. Autoregressive discretization (sketch below)
        1. Discretizes one action dimension at a time using a network trick ⇒ never incurs the exponential cost
        2. A softmax over bins gives dimension 1; sample a discrete value from it, feed that value into another network with its own softmax to get dimension 2, sample again, and so on
        3. Still have to pick the bins
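A minimal sketch of the history-conditioned idea, $\pi_\theta(a_t|o_1,\dots,o_t)$ (my own illustration, not from the lecture): encode each observation, run an LSTM over the history, and read the action off the final hidden state. Layer sizes and dimensions are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    """pi_theta(a_t | o_1..o_t): encode each observation, run an LSTM over the
    history, and decode the final hidden state into an action."""
    def __init__(self, obs_dim=32, hidden_dim=64, act_dim=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)   # stands in for a CNN on images
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs_seq):                  # obs_seq: (batch, T, obs_dim)
        z = torch.relu(self.encoder(obs_seq))    # per-step features
        h, _ = self.lstm(z)                      # h: (batch, T, hidden_dim)
        return self.head(h[:, -1])               # action conditioned on o_1..o_T

policy = HistoryPolicy()
actions = policy(torch.randn(8, 10, 32))         # 8 sequences of 10 observations each
```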
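And a sketch of the mixture-of-Gaussians head $\pi(a|o)=\sum_i w_i\,\mathcal{N}(\mu_i,\Sigma_i)$, again my own, with made-up sizes and diagonal covariances: the network outputs mixture weights, means, and log-standard-deviations, and behavior cloning minimizes the negative log-likelihood of the expert action.

```python
import torch
import torch.nn as nn

class MixturePolicy(nn.Module):
    """pi(a|o) = sum_i w_i N(mu_i, diag(sigma_i^2)) with k mixture components."""
    def __init__(self, obs_dim=32, act_dim=4, k=5):
        super().__init__()
        self.k, self.act_dim = k, act_dim
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, k * (1 + 2 * act_dim)))

    def forward(self, obs):
        out = self.net(obs)
        logits = out[:, :self.k]                                   # mixture weights w_i (as logits)
        mu = out[:, self.k:self.k * (1 + self.act_dim)]            # component means
        log_std = out[:, self.k * (1 + self.act_dim):]             # component log std devs
        mu = mu.reshape(-1, self.k, self.act_dim)
        log_std = log_std.reshape(-1, self.k, self.act_dim)
        mix = torch.distributions.Categorical(logits=logits)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mu, log_std.exp()), 1)
        return torch.distributions.MixtureSameFamily(mix, comp)

policy = MixturePolicy()
dist = policy(torch.randn(8, 32))                 # batch of 8 observations
loss = -dist.log_prob(torch.randn(8, 4)).mean()   # behavior-cloning NLL on expert actions
```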

[Figure: autoregressive discretization]
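A sketch of autoregressive discretization under the same caveats (my own code, hypothetical sizes): each action dimension gets its own small network and softmax over bins, conditioned on the observation plus the dimensions already sampled, so we never enumerate the joint action space.

```python
import torch
import torch.nn as nn

class AutoregressiveDiscretePolicy(nn.Module):
    """Discretize one action dimension at a time: dimension d is predicted from
    the observation plus the already-sampled values of dimensions 0..d-1."""
    def __init__(self, obs_dim=32, act_dim=3, n_bins=21, low=-1.0, high=1.0):
        super().__init__()
        self.bins = torch.linspace(low, high, n_bins)        # still have to pick the bins
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + d, 64), nn.ReLU(), nn.Linear(64, n_bins))
            for d in range(act_dim))

    def sample(self, obs):                                   # obs: (batch, obs_dim)
        sampled = []
        for head in self.heads:
            x = torch.cat([obs] + sampled, dim=-1)           # condition on previous dims
            idx = torch.distributions.Categorical(logits=head(x)).sample()
            sampled.append(self.bins[idx].unsqueeze(-1))     # map bin index back to a value
        return torch.cat(sampled, dim=-1)                    # (batch, act_dim)

policy = AutoregressiveDiscretePolicy()
action = policy.sample(torch.randn(8, 32))
```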

Questions:

  1. Does including history information (LSTM/RNN) mitigate causal confusion?
    1. My guess is no, since adding history does not by itself address the source of the confusion
  2. Can DAgger mitigate causal confusion?
    1. My guess is yes, since the states where the model gets confused will show up in the on-policy data and will then be labeled manually

What’s wrong with imitation learning:

  1. Humans need to provide the data, which is limited
    1. Deep learning works best when data is plentiful
  2. Humans are not good at providing some kinds of actions
  3. Humans can learn autonomously; can we make machines do the same?
    1. unlimited data from their own experience
    2. continuous self-improvement

Analysis of Imitation Learning

Ross et al., “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”

We assume:

$$\begin{aligned}
& c(s,a) = \begin{cases} 0 & \text{if } a = \pi^*(s) \\ 1 & \text{otherwise} \end{cases}\\
& \pi_\theta(a \ne \pi^*(s)\mid s) \le \epsilon \quad \text{for } s \sim p_{train}(s) \quad \leftarrow \text{the model does well on the training set} \\
& \text{(it is actually enough to have } \mathbb{E}_{p_{train}(s)}\big[\pi_\theta(a \ne \pi^*(s)\mid s)\big] \le \epsilon \text{)}
\end{aligned}$$

then…

With DAgger, $p_{train}(s) \rightarrow p_\theta(s)$, so

$$\mathbb{E}\Big[\sum_t c(s_t,a_t)\Big] \le \epsilon T$$

Without DAgger

$$\begin{aligned}
&\text{if } p_{train}(s) \ne p_\theta(s): \\
& p_\theta(s_t)= \underbrace{(1-\epsilon)^t}_{\text{probability that we made no mistakes}} p_{train}(s_t) + \big(1-(1-\epsilon)^t\big)\, p_{mistake}(s_t)
\end{aligned}$$

We cannot assume anything about $p_{mistake}$.

But we can bound the total variation divergence between the test-time and training state distributions:

$$\big| p_\theta(s_t)-p_{train}(s_t)\big| = \underbrace{\big(1-(1-\epsilon)^t\big)\big|p_{mistake}(s_t)-p_{train}(s_t)\big|}_{\text{substituting the expression above for } p_\theta(s_t)} \le \big(1-(1-\epsilon)^t\big) \times 2$$

Since $(1-\epsilon)^t \ge 1-\epsilon t$ for $\epsilon \in [0,1]$, we have $1-(1-\epsilon)^t \le \epsilon t$, so

$$\big| p_\theta(s_t)-p_{train}(s_t)\big| \le 2\,\epsilon t$$

So finally

$$\begin{aligned}
\mathbb{E}_{p_\theta(s_t)}\Big[\sum_t c(s_t,a_t)\Big] &= \sum_t \sum_{s_t} p_\theta(s_t)\, c_t(s_t) \\
&\le \sum_t \sum_{s_t} \Big( p_{train}(s_t)\, c_t(s_t) + \big|p_\theta(s_t) - p_{train}(s_t)\big|\, c_{max} \Big) \\
&\le \sum_t \big(\epsilon + 2\epsilon t\big) \\
&\le \epsilon T + 2\epsilon T^2 \\
&\in O(\epsilon T^2)
\end{aligned}$$
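For a quick sense of the gap between the two bounds (my own illustration; $\epsilon$ and $T$ are arbitrary choices):

```python
eps, T = 0.01, 1000                          # per-step error rate and horizon (arbitrary)
with_dagger = eps * T                        # O(eps*T): train and test distributions match
without_dagger = eps * T + 2 * eps * T**2    # O(eps*T^2): worst case under distributional shift
print(with_dagger, without_dagger)           # 10.0 vs 20010.0
```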

Another way to imitate: Goal-Conditioned Behavioral Cloning

After clarification from class:

  1. Sometimes we have imperfect demonstrations that end up in many different final states
  2. Those demonstrations may still be similar enough that we can train a single shared policy across those different end states (most of the earlier states can be handled by shared weights)
  3. At deployment / test time, specify the end state (goal) $g$ you want to reach and act according to $\pi_\theta(a|s,g)$ (a minimal sketch follows below)
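A minimal sketch of goal-conditioned behavior cloning (my own illustration, not the paper’s exact method): relabel each demonstration with the end state it actually reached as the goal $g$, then clone actions conditioned on $(s, g)$. The `trajectories` format, network sizes, and MSE loss are assumptions.

```python
import torch
import torch.nn as nn

def goal_conditioned_bc(trajectories, state_dim=16, act_dim=4, epochs=100):
    """trajectories: list of (states, actions) pairs with shapes (T, state_dim), (T, act_dim).
    Trains pi_theta(a | s, g), where g is relabeled as the state the demo actually ended in."""
    policy = nn.Sequential(nn.Linear(2 * state_dim, 128), nn.ReLU(),
                           nn.Linear(128, act_dim))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    for _ in range(epochs):
        for states, actions in trajectories:
            goal = states[-1].expand(len(states), -1)   # hindsight: reached end state as goal
            pred = policy(torch.cat([states, goal], dim=-1))
            loss = ((pred - actions) ** 2).mean()       # simple mean-squared-error cloning
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy

# usage with toy data: one demonstration of 50 steps
demo = [(torch.randn(50, 16), torch.randn(50, 4))]
policy = goal_conditioned_bc(demo, epochs=5)
```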

“Learning to Reach Goals via Iterated Supervised Learning” (Dibya Ghosh, Abhishek Gupta, et al.)