Model-based RL

Model-based Methods

🔥
Why model-based methods? Because if we know the transition function $f(s_t, a_t) = p(s_{t+1}|s_t,a_t)$, then we can use trajectory planning ⇒ so we can learn $f(s_t,a_t)$ from data
📌
Model-based RL is prone to errors from distribution mismatch ⇒ much like behavior cloning: when the planned trajectory queries out-of-distribution states, $f(\cdot)$ is likely to be wrong

MPC (Model Predictive Control)

Performance Gap Issue

We want good exploration to gather more data and address the distribution shift, but with a poor early model we cannot deploy good exploration techniques

Basically, early on the model overfits as a function approximator, and the exploration policy gets stuck in a local minimum / maximum

Uncertainty Estimation

Uncertainty estimation helps us reduce the performance gap.

With this trick, the expected reward under a high-variance prediction is low even though the mean prediction is the same, so the planner stops exploiting over-optimistic model errors

However:

  1. We still need exploration to get a better model
  1. Expected value is not the same as pessimistic value

But how to build it?

  1. Use the NN’s own output uncertainty?
    1. It cannot express uncertainty about unseen data accurately ⇒ when queried on out-of-distribution data, the uncertainty output is arbitrary
    1. “The model is certain about the data, but we are not certain about the model”
  1. Estimate Model Uncertainty
    1. Estimate $\arg\max_\theta \log p(\theta|\mathcal{D}) = \arg\max_\theta \log p(\mathcal{D}|\theta)$
    1. Can we instead estimate the distribution $p(\theta|\mathcal{D})$?
      1. We can use this to get a posterior distribution over the next state
        1. $\int p(s_{t+1}|s_t,a_t,\theta)\, p(\theta|\mathcal{D})\, d\theta$
      1. Bayesian Neural Nets
        1. Every weight, instead of being a number, is a distribution
        1. But it’s difficult to represent the full joint distribution over all the parameters
        1. So we make approximations
          1. $p(\theta|\mathcal{D}) = \prod_i p(\theta_i|\mathcal{D})$
            1. Very crude approximation
          1. $p(\theta_i|\mathcal{D}) = \mathcal{N}(\mu_i, \sigma_i^2)$
            1. Instead of learning a single numerical value for each weight, learn its mean and standard deviation/variance
          1. Blundell et al. Weight Uncertainty in Neural Networks
          1. Gal et al. Concrete Dropout
  1. Bootstrap Ensembles
    1. Train several different neural nets on bootstrapped samples ($\mathcal{D}_i$, sampled with replacement from $\mathcal{D}$), one dataset per network (see the sketch after this list)
      1. So each network learns slightly differently
      1. In theory each NN fits its own training data well but makes different mistakes outside of the training data
    1. Then estimate $p(\theta|\mathcal{D}) \approx \frac{1}{N} \sum_i \delta(\theta_i)$
    1. $\int p(s_{t+1}|s_t,a_t,\theta)\, p(\theta|\mathcal{D})\, d\theta \approx \frac{1}{N} \sum_i p(s_{t+1}|s_t,a_t,\theta_i)$
    1. Very crude approximation
      1. Because the number of models is usually small (training many models is expensive)
      1. But it works
      1. Resampling with replacement is usually unnecessary, because SGD and random initialization usually make the models sufficiently independent
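
A minimal PyTorch sketch of the bootstrap-ensemble idea above (the names `make_model`, `train_ensemble`, and `predict` are illustrative, not from the lecture): each member is fit to its own resampled dataset, and the ensemble mean/spread approximate the posterior predictive.

```python
import torch
import torch.nn as nn

def make_model(state_dim, action_dim, hidden=64):
    # Deterministic dynamics model f_theta(s, a) -> s'
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, state_dim),
    )

def train_ensemble(states, actions, next_states, n_models=5, epochs=200, lr=1e-3):
    X = torch.cat([states, actions], dim=-1)
    models = []
    for _ in range(n_models):
        # Bootstrap: resample D_i with replacement (random init alone is often enough)
        idx = torch.randint(0, len(X), (len(X),))
        model = make_model(states.shape[-1], actions.shape[-1])
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            loss = ((model(X[idx]) - next_states[idx]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        models.append(model)
    return models

def predict(models, state, action):
    # Approximate (1/N) sum_i p(s'|s,a,theta_i): report ensemble mean and spread
    with torch.no_grad():
        x = torch.cat([state, action], dim=-1)
        preds = torch.stack([m(x) for m in models])
    return preds.mean(dim=0), preds.std(dim=0)
```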

With Complex Observations (images)

  1. Predicting observations directly is very high-dimensional
    1. Redundancy
  1. Partial Observability

Can we separately learn $p(o_t|s_t)$, which is high-dimensional but not dynamic, and $p(s_{t+1}|s_t,a_t)$, which is low-dimensional but dynamic?

Now we need to maximize

$\max_\phi \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \mathbb{E}\left[\log p_\phi(s_{t+1,i}|s_{t,i},a_{t,i}) + \log p_\phi(o_{t,i}|s_{t,i})\right]$

Note that the expectation is taken over

$(s_t,s_{t+1}) \sim p(s_t,s_{t+1}|o_{1:T},a_{1:T})$

So can we learn an additional approximate posterior (encoder):

$q_\psi(s_t|o_{1:t},a_{1:t})$

We also have other choices:

  1. $q_\psi(s_t,s_{t+1}|o_{1:T},a_{1:T})$ - full smoothing posterior
    1. most accurate
    1. most complicated to train
    1. Preferred when the state is far from fully observed
  1. $q_\psi(s_t|o_t)$
    1. single-step encoder
    1. simplest but less accurate
    1. stochastic case requires variational inference
    1. Deterministic: $q_\psi(s_t|o_t) = \delta(s_t = g_\psi(o_t))$
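
A minimal sketch of the deterministic single-step encoder case under the objective above: encode $s_t = g_\psi(o_t)$, then jointly train a latent dynamics model and an observation decoder with unit-variance Gaussian log-likelihoods, which reduce to squared errors. All module names and shapes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    def __init__(self, obs_dim, state_dim, action_dim, hidden=128):
        super().__init__()
        # Deterministic encoder g_psi(o_t) -> s_t
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, state_dim))
        # Latent dynamics p_phi(s_{t+1}|s_t, a_t), mean prediction only
        self.dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, state_dim))
        # Observation model p_phi(o_t|s_t), mean prediction only
        self.decoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, obs_dim))

    def loss(self, obs, actions, next_obs):
        s = self.encoder(obs)
        s_next = self.encoder(next_obs)
        pred_next = self.dynamics(torch.cat([s, actions], dim=-1))
        recon = self.decoder(s)
        # Unit-variance Gaussian log-likelihoods become squared-error terms
        dynamics_nll = ((pred_next - s_next) ** 2).mean()
        reconstruction_nll = ((recon - obs) ** 2).mean()
        return dynamics_nll + reconstruction_nll
```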

To use uncertainty

Before uncertainty is introduced, we have (assuming deterministic dynamics $s_{t+1} = f(s_t,a_t)$):

$J(a_1,\dots,a_H) = \sum_{t=1}^H r(s_t,a_t)$

Now (with the bootstrap ensemble):

$J(a_1,\dots,a_H) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^H r(s_{t,i},a_t)$
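
A small sketch of evaluating a candidate action sequence under this objective: roll the shared action sequence through each ensemble member and average the returns. It assumes the hypothetical `models` list from the earlier ensemble sketch and a known reward function `reward_fn(s, a)` returning a scalar tensor.

```python
import torch

def evaluate_plan(models, reward_fn, s0, actions):
    """actions: tensor of shape (H, action_dim); returns the return averaged over models."""
    returns = []
    with torch.no_grad():
        for model in models:
            s, total = s0, torch.tensor(0.0)
            for a in actions:
                total = total + reward_fn(s, a)       # r(s_{t,i}, a_t)
                s = model(torch.cat([s, a], dim=-1))  # s_{t+1,i} = f_i(s_{t,i}, a_t)
            returns.append(total)
    return torch.stack(returns).mean()                # (1/N) sum_i sum_t r(s_{t,i}, a_t)
```

This score could be used, for example, inside random shooting or CEM over candidate action sequences.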

Other options:

  1. Moment matching
  1. More complex posterior estimation with BNNs

Further Readings

  1. Classic
    1. Deisenroth et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search.
  1. Recent papers:
    1. Nagabandi et al. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning.
    1. Chua et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.
    1. Feinberg et al. Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning.
    1. Buckman et al. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion.

Can we just backprop through policy / dynamics model?

📌
Generally No!
  1. Similar parameter sensitivity problems as shooting methods
    1. But we no longer have a convenient second-order LQR-like method, because the policy parameters couple all the time steps
  1. Similar problems to training long RNNs with BPTT
    1. Vanishing and exploding gradients

Let’s compare policy gradient:

$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}^\pi_{i,t}$

with backprop (pathwise) gradient:

$\nabla_\theta J(\theta) = \sum_{t=1}^T \frac{\partial a_t}{\partial \theta} \frac{\partial s_{t+1}}{\partial a_t}\left(\sum_{t'=t+1}^T \frac{\partial r_{t'}}{\partial s_{t'}}\left(\prod_{t''=t+2}^{t'} \left(\frac{\partial s_{t''}}{\partial a_{t''-1}}\frac{\partial a_{t''-1}}{\partial s_{t''-1}}+\frac{\partial s_{t''}}{\partial s_{t''-1}}\right)\right)\right)$

Policy gradient might be more stable (if enough samples are used) because it does not require multiplying many Jacobians
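
A minimal sketch of the pathwise gradient above, assuming differentiable `policy`, `dynamics`, and `reward_fn` modules (all hypothetical names): autograd multiplies exactly the Jacobian chains written out in the formula, which is where the exploding/vanishing behavior comes from.

```python
import torch

def pathwise_policy_update(policy, dynamics, reward_fn, s0, horizon, optimizer):
    s, total_reward = s0, torch.tensor(0.0)
    for _ in range(horizon):
        a = policy(s)
        total_reward = total_reward + reward_fn(s, a)
        s = dynamics(torch.cat([s, a], dim=-1))   # gradient flows through every step
    loss = -total_reward                          # gradient ascent on the return
    optimizer.zero_grad()
    loss.backward()                               # backprop multiplies the long Jacobian chains
    optimizer.step()
    return -loss.item()
```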

Parmas et al. PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos. ’18

If we instead use policy gradient on samples generated by the learned dynamics model, there is still a problem (although generating samples no longer requires interacting with the physical world or a simulator): long model rollouts drift away from the real state distribution.

Figure: Left: open-loop planning with long model rollouts; Middle: open-loop planning with shorter rollouts; Right: collect real-world data with the current policy and plan with short rollouts starting from states sampled from the real distribution

But this does not fully solve the problem: the point of model-based RL is sample efficiency, and only doing short-rollout open-loop planning from sampled real-world states restricts how much we can change our policy per iteration.

Model-based RL with Policies

📌
The question is how to use the model’s predictions: to fit Q-functions, value functions, or something else?
  1. Model-Based Acceleration (MBA)
    1. Gu et al. Continuous deep Q-learning with model-based acceleration. ‘16
  1. Model-Based Value Expansion (MVE)
    1. Feinberg et al. Model-based value expansion. ’18
  1. Model-Based Policy Optimization (MBPO)
    1. Janner et al. When to trust your model: model-based policy optimization. ‘19

Good Idea:

  1. Sample Efficient

Bad Idea:

  1. Additional source of bias if the model is not correct
    1. Reduce this by using ensemble methods
  1. For states we have never seen before, we still cannot choose valid actions

Multi-step Models and Successor Representations

📌
The job of the model is to evaluate the policy
$$\begin{aligned} V^\pi(s_t) &= \sum_{t'=t}^\infty \gamma^{t'-t}\, \mathbb{E}_{p(s_{t'}|s_t)}\mathbb{E}_{\pi(a_{t'}|s_{t'})}[r(s_{t'},a_{t'})] \\ &= \sum_{t'=t}^\infty \gamma^{t'-t}\, \mathbb{E}_{p(s_{t'}|s_t)}[r(s_{t'})] \\ &= \sum_{t'=t}^\infty \gamma^{t'-t} \sum_s p(s_{t'}=s|s_t)\, r(s) \\ &= \sum_s \underbrace{\left(\sum_{t'=t}^\infty \gamma^{t'-t}\, p(s_{t'}=s|s_t)\right)}_{\text{adding a $(1-\gamma)$ factor in front makes this $p_\pi(s_{\text{future}}=s|s_t)$}} r(s) \end{aligned}$$
$p_\pi(s_{\text{future}}=s|s_t) = (1-\gamma)\sum_{t'=t}^\infty \gamma^{t'-t}\, p(s_{t'}=s|s_t)$

Therefore

$V^\pi(s_t) = \frac{1}{1-\gamma} \underbrace{\sum_s p_\pi(s_{\text{future}}=s|s_t)\, r(s)}_{\mu^\pi(s_t)^\top \vec{r}}$

Where

$\mu_i^\pi(s_t) = p_\pi(s_{\text{future}} = s_i|s_t)$

is the “successor representation”

Dayan Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993.

$$\begin{aligned} \mu_i^\pi(s_t) &= (1-\gamma) \sum_{t'=t}^\infty \gamma^{t'-t}\, p(s_{t'} = i|s_t) \\ &= (1-\gamma)\,\delta(s_t = i) + \gamma\, \mathbb{E}_{a_t \sim \pi(a_t|s_t),\, s_{t+1} \sim p(s_{t+1}|s_t,a_t)}[\mu_i^\pi(s_{t+1})] \end{aligned}$$

Hmm… this looks like a Bellman backup with “reward” $r(s_t) = (1-\gamma)\,\delta(s_t = i)$
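
A tabular sketch of this backup, assuming a known state-to-state transition matrix $P_\pi$ under the policy (illustrative names only): iterate $\mu \leftarrow (1-\gamma)I + \gamma P_\pi \mu$ and recover $V^\pi = \frac{1}{1-\gamma}\,\mu\,\vec{r}$.

```python
import numpy as np

def successor_representation(P_pi, gamma, iters=1000):
    """P_pi: (S, S) transition matrix under pi; returns mu with mu[s, i] = p_pi(s_future = i | s)."""
    S = P_pi.shape[0]
    mu = (1 - gamma) * np.eye(S)
    for _ in range(iters):
        # Bellman-style backup with "reward" (1 - gamma) * delta(s = i)
        mu = (1 - gamma) * np.eye(S) + gamma * P_pi @ mu
    return mu

def value_from_sr(mu, rewards, gamma):
    # V^pi(s) = 1/(1 - gamma) * sum_i mu[s, i] * r(i)
    return mu @ rewards / (1 - gamma)
```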

But issues:

  1. Not clear if learning successor representation is easier than model-free RL
  1. How to scale to large state spaces?
  1. How to extend to continuous state spaces?

To scale to large state spaces, use successor features of $s_t$:

$\psi_j^\pi(s_t) = \sum_s \mu_s^\pi(s_t)\,\phi_j(s) = \mu^\pi(s_t)^\top \vec{\phi}_j$

If we can express the reward function as a linear combination of features of the state $s$:

$r(s) = \sum_j \phi_j(s)\, w_j = \phi(s)^\top w$

It turns out we can express the value function as

$V^\pi(s_t) = \psi^\pi(s_t)^\top w = \sum_j \mu^\pi(s_t)^\top \vec{\phi}_j\, w_j = \mu^\pi(s_t)^\top \sum_j \vec{\phi}_j\, w_j = \mu^\pi(s_t)^\top \vec{r}$

The dimensionality of $\phi$ and $\psi$ can be much lower than the number of possible states

We can also construct a Q-function-like version:

$\psi_j^\pi(s_t,a_t) = \phi_j(s_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t,a_t),\, a_{t+1} \sim \pi(a_{t+1}|s_{t+1})}[\psi_j^\pi(s_{t+1},a_{t+1})]$
$Q^\pi(s_t,a_t) \approx \psi^\pi(s_t,a_t)^\top w$

(when $r(s) = \phi(s)^\top w$)

Using Successor Features

  1. Recover a Q-function very quickly
    1. Train $\psi^\pi(s_t,a_t)$ via Bellman backups
    1. Get some reward samples $\{s_i, r_i\}$
    1. Get $w \leftarrow \arg\min_w \sum_i \|\phi(s_i)^\top w - r_i\|^2$ (see the sketch after this list)
    1. Recover $Q^\pi(s_t,a_t) \approx \psi^\pi(s_t,a_t)^\top w$
      1. $\pi'(s) = \arg\max_a \psi^\pi(s,a)^\top w$
        1. This is only equivalent to one step of policy iteration, because the Q-function is with respect to the old policy $\pi$
  1. Recover many Q-functions
    1. Train $\psi^{\pi_k}(s_t,a_t)$ for many policies $\pi_k$ via Bellman backups
    1. Get some reward samples $\{s_i, r_i\}$
    1. Get $w \leftarrow \arg\min_w \sum_i \|\phi(s_i)^\top w - r_i\|^2$
    1. Recover $Q^{\pi_k}(s_t,a_t) \approx \psi^{\pi_k}(s_t,a_t)^\top w$ for every $\pi_k$
      1. $\pi'(s) = \arg\max_a \max_k \psi^{\pi_k}(s,a)^\top w$
        1. This is only equivalent to one step of policy iteration, because each Q-function is with respect to its old policy $\pi_k$
        1. This finds the highest-reward policy in each state
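
A small sketch of the “fit $w$” and “recover $Q$” steps above, assuming the feature matrix of $\phi(s_i)$ rows and the successor features $\psi^\pi(s,a)$ are already available (all names illustrative):

```python
import numpy as np

def fit_reward_weights(phis, rewards):
    """phis: (N, d) array whose rows are phi(s_i); rewards: (N,). Least squares for r(s) ~= phi(s)^T w."""
    w, *_ = np.linalg.lstsq(phis, rewards, rcond=None)
    return w

def greedy_action(psi_sa, w):
    """psi_sa: (num_actions, d) successor features psi^pi(s, a) for one state; returns argmax_a psi^T w."""
    return int(np.argmax(psi_sa @ w))
```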

Continuous Successor Representations (C-Learning)

🧙🏽‍♂️
So can we just frame successor representation as classification?
$p^\pi(\underbrace{F=1}_{\text{means $s_{\text{future}}$ is a future state under $\pi$}}|s_t,a_t,s_{\text{future}}) = \frac{p^\pi(s_{\text{future}}|s_t,a_t)}{p^\pi(s_{\text{future}}|s_t,a_t)+p^\pi(s_{\text{future}})}$

Positive set during training:

$\mathcal{D}_+ \sim p^\pi(s_{\text{future}}|s_t,a_t)$

The negative set is just pulled randomly.

To train (this is an on-policy algorithm; a very efficient off-policy variant can also be derived; see the sketch below):

  1. Sample $s \sim p^\pi(s)$ ⇒ run the policy, sample from its trajectories
  1. Sample $s \sim p^\pi(s_{\text{future}}|s_t,a_t)$ ⇒ pick positive future pairs
  1. Update $p^\pi(F=1|s_t,a_t,s)$ using SGD with a cross-entropy loss

Eysenbach, Salakhutdinov, Levine. C-Learning: Learning to Achieve Goals via Recursive Classification. 2020.
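
A rough sketch of the on-policy training loop above (not the exact C-Learning algorithm; the `classifier(s, a, s_future)` signature and the sampling scheme are assumptions): positives are actual future states from the trajectory with a geometrically distributed offset, negatives are random states from the same trajectory.

```python
import torch
import torch.nn.functional as F

def c_learning_update(classifier, optimizer, states, actions, gamma):
    """states/actions: one on-policy trajectory; classifier returns a logit for p(F=1|s_t,a_t,s)."""
    T = len(states) - 1
    t = torch.randint(0, T, (1,)).item()
    # Positive: a real future state, offset drawn geometrically with parameter (1 - gamma)
    k = int(torch.distributions.Geometric(1 - gamma).sample().item()) + 1
    s_pos = states[min(t + k, T)]
    # Negative: a random state, approximating a draw from p^pi(s)
    s_neg = states[torch.randint(0, T, (1,)).item()]
    logits = torch.stack([classifier(states[t], actions[t], s_pos),
                          classifier(states[t], actions[t], s_neg)]).squeeze(-1)
    labels = torch.tensor([1.0, 0.0])
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```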