Offline RL

There is a huge gap between deep supervised learning and deep RL ⇒ supervised learning works so well largely because it has tons of data; offline RL aims to let RL exploit large datasets too.

Formally:

D = \{ (s_i,a_i,s_i',r_i) \} \\ s \sim d^{\pi_\beta}(s) \\ a \sim \pi_\beta(a|s) \\ r \leftarrow r(s,a)
\pi_\beta is the unknown policy that collected the data

Off-Policy Evaluation (OPE)

Given D, estimate the return J(\pi) of some policy \pi

Offline Reinforcement Learning

Given D, learn the best possible policy \pi_\theta (i.e., the best policy whose good performance is supported by evidence in the dataset)

How is this even possible?

  1. Find the “good stuff” in a dataset full of good and bad behaviors
  1. Generalization: Good behavior in one place may suggest good behavior in another place
  1. “Stitching”: parts of good behaviors can be recombined (even mostly-bad trajectories can contain good segments)

(Figure panels: Behavior Cloning vs. “Stitching”)

What can go wrong if we just use off-policy RL (on data collected by other agents)?

Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke, Levine. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

So it works in practice, but there is a huge gap between the purely offline (not fine-tuned) result and the result after online fine-tuning.

Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS ‘19
Q-function overfitting with an off-policy actor-critic algorithm
Fundamental Problem: Counterfactual queries

Online RL algorithms don't have to handle this because they can simply try the action and see what happens ⇒ not possible in the offline setting

Offline RL methods must somehow account for these unseen (”out-of-distribution”) actions, ideally in a safe way

Distribution Shift

In a supervised learning setting, we want to perform empirical risk minimization (ERM):

\theta \leftarrow \argmin_\theta \mathbb{E}_{x \sim p(x), y\sim p(y|x)} [(f_\theta(x)-y)^2]

Given some x^*:

\mathbb{E}_{x \sim p(x), y\sim p(y|x)} [(f_\theta(x)-y)^2] \text{ is low (since distribution is same as training set)} \\ \mathbb{E}_{x \sim \bar{p}(x), y \sim p(y|x)} [(f_\theta(x)-y)^2] \text{ is generally not if $\bar{p}(x) \ne p(x)$}

Yes, neural nets generalize well ⇒ but is it well enough?

If we pick x^* \leftarrow \argmax_x f_\theta(x)

(Figure: the blue curve is the fitted function, the green curve is the true function.)

When we take the \argmax, we are likely selecting a point whose value estimate includes large positive noise!
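A tiny, self-contained numpy illustration (a made-up setup, not from the lecture: the true value is 0 everywhere and the fitted model has unbiased, zero-mean errors) of why the argmax lands on positive noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
f_true = np.zeros(n)                          # true objective: 0 everywhere
f_hat = f_true + rng.normal(0.0, 1.0, n)      # fitted values: unbiased, zero-mean error

print("mean error:", f_hat.mean())            # ~0  -> the ERM objective looks great on average
x_star = np.argmax(f_hat)                     # x* <- argmax_x f_theta(x)
print("f_hat(x*):", f_hat[x_star])            # ~ +3.8: almost pure positive noise
print("f_true(x*):", f_true[x_star])          # 0: the apparent improvement is imaginary
```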

Q-Learning Actor-Critic with Distribution Shift

Q-objective:

\min_{Q} \mathbb{E}_{(s,a) \sim \pi_\beta(s,a)}[(Q(s,a)-y(s,a))^2]

i.e., the Q-function is trained toward the backup target:

Q(s,a) \leftarrow r(s,a) + \mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]

The distribution shift problem kicks in when

\pi_{new} \ne \pi_\beta

And even worse,

\pi_{new} = \argmax_\pi \mathbb{E}_{a \sim \pi(a|s)} [Q(s,a)]

Online RL setting: we can sample the supposedly “optimal” action, observe the outcome, and correct the error in our approximation.

Offline setting: there is no way to discover the error in our approximation.

🧙🏽‍♂️
Existing Challenges with sampling error and function approximation error in standard RL become much more severe in offline RL

Batch RL via Importance Sampling (Traditional Offline Algorithms)

Importance Sampling:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\frac{\pi_\theta(\tau_i)}{\pi_\beta(\tau_i)}}_{\mathclap{\text{importance weight}}} \sum_{t=0}^T \nabla_\theta \gamma^t \log \pi_\theta(a_{t,i}|s_{t,i})\hat{Q}(s_{t,i},a_{t,i})

where the importance weight is exponential in T:

\frac{\pi_\theta(\tau)}{\pi_\beta(\tau)} = \frac{p(s_1) \prod_t p(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)}{p(s_1)\prod_t p(s_{t+1}|s_t,a_t) \pi_\beta(a_t|s_t)} \in O(e^T)
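A small numpy experiment (an illustrative toy, not from the lecture) showing how the variance of the per-trajectory importance weight blows up with the horizon T, even when each per-step ratio stays close to 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_weight(T, spread=0.3):
    # per-step ratios pi_theta(a_t|s_t) / pi_beta(a_t|s_t), fluctuating around 1
    ratios = 1.0 + rng.uniform(-spread, spread, size=T)
    return np.prod(ratios)

for T in (10, 100, 1000):
    weights = np.array([trajectory_weight(T) for _ in range(10_000)])
    # most weights collapse toward 0 while a few explode;
    # the variance grows roughly exponentially in T
    print(f"T={T:4d}  mean={weights.mean():10.3f}  var={weights.var():14.3f}")
```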

Can we fix this?

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \underbrace{(\prod_{t'=0}^{t-1} \frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})})}_{\mathclap{\text{accounts for difference in probability of landing in $s_{t,i}$}}} \nabla_\theta \gamma^t \log \pi_\theta(a_{t,i}|s_{t,i}) \underbrace{(\prod_{t'=t}^T\frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})}) }_{\mathclap{\text{accounts for incorrect $\hat{Q}(s_{t,i},a_{t,i})$}}}\hat{Q}(s_{t,i},a_{t,i})

Classic advanced policy gradient methods simply drop the first product term ⇒ assuming that the new policy \pi_\theta is not too far from \pi_\beta

We can estimate

\hat{Q}(s_{t,i},a_{t,i}) = \mathbb{E}_{\pi_\theta}[\sum_{t'=t}^T \gamma^{t'-t} r_{t'}] \approx \sum_{t'=t}^T \gamma^{t'-t} r_{t',i}

We can simplify the second portion using the fact that actions taken after time t' do not affect the reward at t':

(\prod_{t'=t}^T\frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})})\hat{Q}(s_{t,i},a_{t,i}) = \sum_{t'=t}^T(\prod_{t''=t}^T\frac{\pi_\theta(a_{t'',i}|s_{t'',i})}{\pi_\beta(a_{t'',i}|s_{t'',i})})\gamma^{t'-t} r_{t',i} \approx \sum_{t'=t}^T(\prod_{t''=t}^{t'}\frac{\pi_\theta(a_{t'',i}|s_{t'',i})}{\pi_\beta(a_{t'',i}|s_{t'',i})})\gamma^{t'-t} r_{t',i}

But this is still exponential in T.

🧙🏽‍♂️
To avoid exponentially exploding importance weights, we must use value function estimation!

The Doubly Robust Estimator

Jiang, N. and Li, L. (2015). Doubly robust off-policy value evaluation for reinforcement learning
\begin{split} V^{\pi_\theta}(s_0) &\approx \sum_{t=0}^T (\prod_{t'=0}^t \frac{\pi_\theta(a_{t'}|s_{t'})}{\pi_\beta(a_{t'}|s_{t'})})\gamma^t r_t \\ &= \sum_{t=0}^T(\prod_{t'=0}^t \rho_{t'}) \gamma^t r_t \\ &= \rho_0 r_0 + \rho_0\gamma\rho_1r_1 + \rho_0\gamma \rho_1 \gamma \rho_2 r_2 + \dots \\ &=\rho_0(r_0 + \gamma(\rho_1(r_1+...))) \\ &=\bar{V}^T \end{split}

We notice a recursion relationship:

\bar{V}^{T+1-t} = \rho_t(r_t + \gamma \bar{V}^{T-t})

Bandit case of doubly robust estimation:

V_{DR}(s) = \hat{V}(s) + \rho(s,a)(r_{s,a} - \hat{Q}(s,a))

Doubly Robust Estimation:

\bar{V}_{DR}^{T+1-t} = \hat{V}(s_t) + \rho_t(r_t + \gamma \bar{V}_{DR}^{T-t} - \hat{Q}(s_t,a_t))
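A minimal numpy sketch of the two recursions above, assuming a single logged trajectory with per-step importance ratios ρ_t, rewards r_t, and (for the doubly robust version) learned estimates V̂(s_t) and Q̂(s_t, a_t); the array names are illustrative:

```python
import numpy as np

def is_value(rhos, rewards, gamma=0.99):
    """Plain importance-sampled value: V_bar^{T+1-t} = rho_t * (r_t + gamma * V_bar^{T-t})."""
    v_bar = 0.0
    for rho_t, r_t in zip(reversed(rhos), reversed(rewards)):
        v_bar = rho_t * (r_t + gamma * v_bar)
    return v_bar

def doubly_robust_value(rhos, rewards, v_hat, q_hat, gamma=0.99):
    """Doubly robust: V_DR^{T+1-t} = V_hat(s_t) + rho_t * (r_t + gamma * V_DR^{T-t} - Q_hat(s_t, a_t))."""
    v_dr = 0.0
    for rho_t, r_t, v_t, q_t in zip(reversed(rhos), reversed(rewards),
                                    reversed(v_hat), reversed(q_hat)):
        v_dr = v_t + rho_t * (r_t + gamma * v_dr - q_t)
    return v_dr
```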

Marginalized Importance Sampling

So far we have done importance sampling over action distributions (per-timestep ratios) ⇒ but it is also possible to do it over state(-action) marginal distributions
🧙🏽‍♂️
Instead of using \prod_t \frac{\pi_\theta(a_t|s_t)}{\pi_\beta(a_t|s_t)}, estimate w(s,a) = \frac{d^{\pi_\theta}(s,a)}{d^{\pi_\beta}(s,a)}

How to determine w(s,a)?

Suggested Reading

Batch RL via Linear Fitted Value Functions

Assume we have a feature matrix

\Phi \in \mathbb{R}^{|S| \times K}

where |S| is the number of states and K is the number of features. Then, to do model-based RL in feature space:

  1. Estimate the reward
    1. \Phi w_r \approx r ⇒ w_r = (\Phi^{\top} \Phi)^{-1} \Phi^{\top} \vec{r}
  1. Estimate the transitions
    1. \Phi P_{\Phi} \approx P^{\pi} \Phi
      1. P_\Phi ⇒ estimated feature-space transition matrix \in \mathbb{R}^{K \times K}
      1. P_\Phi = (\Phi^{\top} \Phi)^{-1} \Phi^{\top} P^{\pi} \Phi
    1. P^\pi \in \mathbb{R}^{|S| \times |S|} ⇒ real transition matrix (on states), dependent on the policy
  1. Estimate the value function
    1. V^{\pi} \approx V_{\Phi}^\pi = \Phi w_V
      1. Solving for V^\pi in terms of P^\pi and r:
        1. V^\pi = r + \gamma P^\pi V^\pi
        1. (I-\gamma P^\pi) V^\pi = r
        1. V^\pi = (I - \gamma P^\pi)^{-1}r
    1. w_V = (I - \gamma P_\Phi)^{-1} w_r = (\Phi^\top \Phi-\gamma \Phi^\top P^\pi \Phi)^{-1} \Phi^\top \vec{r} ⇒ Least-Squares Temporal Difference (LSTD); a minimal numpy sketch appears after this list
    1. But we don't know P^\pi
    1. When working from samples, the feature matrix becomes \Phi \in \mathbb{R}^{|D| \times K} (one row per sampled transition)
    1. We then replace P^\pi \Phi with \Phi' such that each row \Phi_i' contains the features of the next state: \Phi_i' = \phi(s_i')
      1. However, if we change the policy, these sampled next states no longer reflect the new P^\pi
    1. Also replace the reward vector with sampled rewards: \vec{r}_i = r(s_i,a_i)
    1. Everything else works exactly the same, except that we now have some sampling error
  1. Improve the policy
    1. \pi'(s) \leftarrow \text{Greedy}(\Phi w_V)
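A minimal numpy sketch of sample-based LSTD as described above; `phi`, `phi_next`, and `rewards` are assumed arrays built from the offline dataset (names are illustrative):

```python
import numpy as np

def lstd_value_weights(phi, phi_next, rewards, gamma=0.99):
    """Sample-based LSTD: w_V = (Phi^T Phi - gamma * Phi^T Phi')^{-1} Phi^T r.

    phi:      (N, K) features of the sampled states,      Phi_i  = phi(s_i)
    phi_next: (N, K) features of the sampled next states, Phi'_i = phi(s'_i)
    rewards:  (N,)   sampled rewards r_i = r(s_i, a_i)
    """
    a = phi.T @ phi - gamma * (phi.T @ phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(a, b)   # solve the linear system instead of forming the inverse
```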

Least Squares Policy Iteration (LSPI)

🧙🏽‍♂️
Replace LSTD with LSTDQ - LSTD but for Q functions
w_Q = (\Phi^\top \Phi-\gamma \Phi^\top \underbrace{\Phi'}_{\mathclap{\Phi'_i = \phi(s_i',\pi(s_i'))}})^{-1} \Phi^\top \vec{r}

LSPI:

  1. Compute w_Q for \pi_k
  1. \pi_{k+1}(s) = \argmax_a \phi(s,a)w_Q
  1. Set \Phi_i' = \phi(s_i', \pi_{k+1}(s_i')) and repeat (a minimal sketch of the loop follows)
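A minimal sketch of the LSPI loop under the same assumptions, where `featurize(s, a)` is a hypothetical state-action feature map, `transitions` come from the offline dataset, and the action set is finite; this is an illustrative sketch, not a reference implementation:

```python
import numpy as np

def lstdq_weights(phi_sa, phi_next_sa, rewards, gamma=0.99):
    """LSTDQ: like LSTD, but Phi'_i = phi(s'_i, pi(s'_i)) uses the current policy's action."""
    a = phi_sa.T @ phi_sa - gamma * (phi_sa.T @ phi_next_sa)
    return np.linalg.solve(a, phi_sa.T @ rewards)

def lspi(transitions, featurize, actions, gamma=0.99, n_iters=20):
    """transitions: list of (s, a, r, s') tuples from the dataset; actions: finite action set."""
    phi_sa = np.array([featurize(s, a) for s, a, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    policy = lambda s: actions[0]                        # arbitrary initial policy
    for _ in range(n_iters):
        # Step 3: Phi'_i = phi(s'_i, pi_k(s'_i)) under the current policy
        phi_next_sa = np.array([featurize(s_next, policy(s_next))
                                for _, _, _, s_next in transitions])
        w_q = lstdq_weights(phi_sa, phi_next_sa, rewards, gamma)   # Step 1
        # Step 2: greedy improvement pi_{k+1}(s) = argmax_a phi(s, a)^T w_Q
        policy = lambda s, w=w_q: max(actions, key=lambda a: featurize(s, a) @ w)
    return policy, w_q
```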

Problem: the \argmax in step 2 still queries the Q-function at actions never seen in the data ⇒ the same distribution-shift problem as before.

Policy Constraints

A fairly old idea; implementing only this generally does not work very well
Q \leftarrow r(s,a) + \mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]
\pi_{new}(a|s) = \argmax_\pi \mathbb{E}_{a \sim \pi(a|s)}[Q(s,a)] \text{ s.t. } D_{KL}(\pi,\pi_\beta) \le \epsilon

The KL constraint D_{KL}(\pi, \pi_\beta) can be too restrictive: if \pi_\beta is highly suboptimal (e.g., close to random), forcing \pi to stay close to it in KL also forces \pi to put probability on bad actions.

So maybe instead we want to pose a “support constraint”

\pi(a|s) > 0 \text{ only if } \pi_\beta(a|s) \ge \epsilon

How to implement?

Implementing policy constraints in practice
  1. Modify the actor objective
    1. Common objective: \theta \leftarrow \argmax_\theta \mathbb{E}_{s \sim D} [\mathbb{E}_{a \sim \pi_\theta(a|s)}[Q(s,a)]]
    1. D_{KL}(\pi, \pi_\beta) = \mathbb{E}[\log \pi(a|s) - \log \pi_\beta (a|s)] = -\mathbb{E}_\pi[\log \pi_\beta(a|s)] - H(\pi)
    1. So we can incorporate the KL divergence into the actor objective
    1. Use Lagrange Multiplier
      1. \theta \leftarrow \argmax_\theta \mathbb{E}_{s \sim D} [\mathbb{E}_{a \sim \pi_\theta(a|s)}[Q(s,a) + \lambda \log \pi_\beta (a|s)]+\lambda H(\pi (a|s))]
      1. In practice, we either solve for the Lagrange multiplier or treat it as a hyperparameter and tune it
  1. Modify the reward function
    1. \bar{r}(s,a) = r(s,a) - D(\pi, \pi_\beta)
    1. Simple modification to directly penalize divergence
    1. Also accounts for future divergence (in later steps)
    1. Wu, Tucker, Nachum, Behavior Regularized Offline Reinforcement Learning. ‘19
  1. Implicit Policy Constraint Methods
    1. Solving the KL-constrained problem via its Lagrangian (straightforward to show via duality), we get
    1. \pi^*(a|s) = \frac{1}{Z(s)} \pi_\beta (a|s) \exp(\frac{1}{\lambda} A^\pi (s,a))
    1. Peters et al. (REPS)
    1. Rawlik et al. (”psi-learning”)
    1. We can do this by
      1. Approximate via weighted max likelihood
      1. \pi_{new}(a|s) = \argmax_{\pi} \mathbb{E}_{\underbrace{(s,a) \sim \pi_\beta}_{(s,a) \sim D}} [\log \pi(a|s) \underbrace{\frac{1}{Z(s)} \exp(\frac{1}{\lambda} A^{\pi_{old}} (s,a)) }_{w(s,a)}]
      1. Imitating good actions “more” than bad actions, as determined by the weight w(s,a) (see the sketch after this list)
    1. Problem: In order to get advantage values / target values we still need to query OOD actions
      1. If we choose λ\lambda correctly the constraints will be respected at convergence, but not necessarily at intermediate steps
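A minimal PyTorch-style sketch of the weighted maximum-likelihood update above (in the spirit of AWR/AWAC). Here `policy(states)` is assumed to return a torch distribution, `advantages` are estimates of A^{\pi_{old}}(s,a) from some critic, and the normalizer Z(s) is folded into the weights and dropped, as is common in practice:

```python
import torch

def weighted_bc_actor_loss(policy, states, actions, advantages, lam=1.0, max_weight=20.0):
    """Imitate dataset actions, weighting "good" actions more via w(s,a) ~ exp(A / lambda)."""
    weights = torch.exp(advantages / lam).clamp(max=max_weight)  # clip to avoid huge weights
    log_probs = policy(states).log_prob(actions)                 # log pi(a|s) for dataset actions
    return -(weights.detach() * log_probs).mean()                # maximize weighted log-likelihood
```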
🧙🏽‍♂️
Can we avoid ALL OOD actions in the Q update?

Instead of doing

Q(s,a) \leftarrow r(s,a) + \mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]

We can do:

Q(s,a) \leftarrow r(s,a) + V(s')

But how do we fit this V function?

  1. V \leftarrow \argmin_{V} \frac{1}{N} \sum_{i=1}^N l(V(s_i), Q(s_i, a_i))
    1. MSE loss: (V(s_i) - Q(s_i, a_i))^2
      1. Problem ⇒ the actions come from the exploration policy \pi_\beta, not from \pi_{new}
      1. Gives us \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]

Think about this:

(Figure: visualization of the distribution of states/actions in the dataset)
  1. For any “exact” state, we've probably seen only a few actions, or none at all
    1. But there may be very “similar” states where we did take actions ⇒ the data is not bins of states but distributions over states
    1. Can we use those “similar” states as extra data points to generalize from?
    1. What if we use a “quantile” / “expectile” ⇒ an expectile is like a quantile, but with a squared rather than absolute loss
      1. An upper expectile / quantile of Q corresponds to the best behavior supported by the data

Implicit Q-Learning

Expectile:

l_2^\tau (x) = \begin{cases} (1-\tau)x^2 &\text{if } x>0 \\ \tau x^2 &\text{otherwise} \end{cases}

Intuition:

  1. MSE penalizes positive and negative errors equally
  1. The expectile loss penalizes positive and negative errors with different weights depending on \tau \in [0,1]

(Figure: weights given to positive and negative errors in expectile regression; taken from the IQL paper.)

Concretely, we fit V with the expectile regression objective:

V \leftarrow \argmin_{V} \mathbb{E}_{(s,a) \sim D} [l_2^\tau(V(s)-Q(s,a))]

In practice, Q(s,a) would be a target network.
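A minimal PyTorch-style sketch of the two losses (expectile regression for V against a target Q network, and the Q update that bootstraps through V(s') so no out-of-distribution actions are ever queried); the network objects and batch layout are assumptions for illustration:

```python
import torch

def expectile_loss(diff, tau=0.9):
    """l_2^tau applied to diff = V(s) - Q(s,a): weight (1 - tau) if diff > 0, else tau."""
    weight = torch.abs(tau - (diff > 0).float())   # = 1 - tau where diff > 0, tau elsewhere
    return (weight * diff ** 2).mean()

def iql_style_losses(v_net, q_net, q_target_net, batch, tau=0.9, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    # V update: expectile regression toward the target Q at *dataset* actions only
    v_loss = expectile_loss(v_net(s) - q_target_net(s, a).detach(), tau)
    # Q update: Q(s, a) <- r + gamma * V(s'); no actions are sampled from pi_new
    target = r + gamma * v_net(s_next).detach()
    q_loss = ((q_net(s, a) - target) ** 2).mean()
    return v_loss, q_loss
```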

Formally, one can show that, given a large enough \tau,

V(s) \leftarrow \max_{a \in \Omega(s)} Q(s,a)

Where

\Omega(s) = \{a: \pi_\beta(a|s) \ge \epsilon \}
Kostrikov, Nair, Levine. Offline Reinforcement Learning with Implicit Q-Learning. ‘21
📌
Oh yes, that's my postdoc's paper (he's the first author)! This is absolutely an amazing paper and he is absolutely an amazing person; it's also one of the papers I read before joining Sergey's group. I hope you get to check it out!

Conservative Q-Learning (CQL)

Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline Reinforcement Learning
Directly repair overestimated actions in the Q function

Intuition: push down on areas that have erroneously high Q-values

We can formally show that \hat{Q}^\pi \le Q^\pi for a large enough \alpha

But with this objective we're also pushing down on genuinely good actions ⇒ so add an additional term that pushes up on actions supported by the data

In practice, we add a regularizer R(\mu) to L_{CQL}(\hat{Q}^\pi) so that \mu doesn't need to be computed explicitly.

L_{CQL}(\hat{Q}^\pi) = \max_{\mu} \{ \alpha\mathbb{E}_{s\sim D, a\sim \mu(a|s)}[\hat{Q}^\pi(s,a)] - \alpha \mathbb{E}_{(s,a) \sim D}[Q(s,a)] \} + \mathbb{E}_{(s,a,s') \sim D}[(Q(s,a) - (r(s,a)+\mathbb{E}[Q(s',a')]))^2]

Common choice: R(\mu) = \mathbb{E}_{s \sim D} [H(\mu(\cdot | s))]

⇒ Then \mu(a|s) \propto \exp[Q(s,a)], and the first term becomes \mathbb{E}_{a \sim \mu(a|s)}[Q(s,a)] + H(\mu) = \log \sum_a \exp(Q(s,a)) (a soft maximum over actions)
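A minimal PyTorch-style sketch of the resulting CQL(H)-style loss for a discrete-action Q-network, where `q_net(s)` returns Q-values for all actions (so the log-sum-exp is exact); for concreteness the Bellman target below uses a simple max backup over a target network, which is an assumption of this sketch rather than the lecture's exact setup:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, q_target_net, batch, alpha=1.0, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    q_all = q_net(s)                                       # (B, |A|) Q-values for all actions
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a) at the dataset actions

    # Conservative term: push down the soft maximum over actions, push up dataset actions
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    # Standard Bellman error term
    with torch.no_grad():
        target = r + gamma * q_target_net(s_next).max(dim=1).values
    bellman = F.mse_loss(q_data, target)

    return alpha * conservative + bellman
```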

Model-based Offline RL

What goes wrong when we cannot collect more data?

MOPO: Model-Based Offline Policy Optimization

Yu*, Thomas*, Yu, Ermon, Zou, Levine, Finn, Ma. MOPO: Model-Based Offline Policy Optimization. ‘20 MOReL : Model-Based Offline Reinforcement Learning. ’20 (concurrent)

“Punish” the policy for exploiting the model (i.e., for going where the model is likely to be wrong):

\tilde{r}(s,a) = r(s,a) -\lambda u(s,a)

where u(s,a) is the model's uncertainty about the state and action

u(s,a) needs to be at least as large as the model's error (according to some kind of divergence) ⇒ getting a good estimate is still an open problem

Current practice: use the disagreement of an ensemble of models as a proxy for this uncertainty
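A minimal sketch of the penalized reward using ensemble disagreement as the uncertainty proxy u(s, a); `ensemble`, `model.predict`, and `reward_fn` are assumed interfaces for illustration (MOPO itself uses a particular heuristic based on the learned models' predicted variances):

```python
import numpy as np

def penalized_reward(ensemble, reward_fn, s, a, lam=1.0):
    """r_tilde(s, a) = r(s, a) - lambda * u(s, a), with u taken from ensemble disagreement."""
    preds = np.stack([model.predict(s, a) for model in ensemble])   # (E, state_dim) next-state predictions
    u = float(np.linalg.norm(preds.std(axis=0)))                    # disagreement across the ensemble
    return reward_fn(s, a) - lam * u
```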

Theoretical analysis:

In particular, \forall \delta \ge \delta_{\min}

Some implications:

\eta_M(\hat{\pi}) \ge \eta_M(\pi^{B})-2\lambda \epsilon_u(\pi^{B})
We can almost always do at least as well as the behavior (exploration) policy, since the model uncertainty is very small on data from \pi^B
\eta_M(\hat{\pi}) \ge \eta_M(\pi^*)-2\lambda \epsilon_u(\pi^*)
Quantifies the optimality “gap” in terms of model error

COMBO: Conservative Model-Based RL

Yu, Kumar, Rafailov, Rajeswaran, Levine, Finn. COMBO: Conservative Offline Model-Based Policy Optimization. 2021.
📌
Basic idea: just like CQL minimizes the Q-values of policy actions, we can minimize the Q-values of state-action tuples generated by the model

Intuition: if the model produces something that looks clearly different from real data, it’s easy for the Q-function to make it look bad

Trajectory Transformer

Janner, Li, Levine. Reinforcement Learning as One Big Sequence Modeling Problem. 2021.
Model the trajectory distribution with an autoregressive sequence model
  1. Train a joint state-action model
    1. p_\beta(\tau) = p_\beta(s_1,a_1, \dots, s_T, a_T)
    1. Intuitively, we want to optimize for a plan (a sequence of actions) that has high probability under the data distribution
  1. Use a big expressive model (Transformer)
    1. The subscripts in the model figure denote the (timestep, dimension) of the state and action vectors
    1. The diagram shows a generic autoregressive sequence model (e.g., an LSTM)
    1. With a transformer, we need a causal mask
      1. GPT-style model
    1. Because we are regressing on both state and action sequences, we can get accurate predictions out to much longer horizons
  1. Control
    1. Beam search, but score candidates by \sum_t r(s_t,a_t) instead of by probability (a minimal sketch follows this list)
      1. Given the current sequence, sample candidate next tokens from the model
      1. Keep the top K sequences with the highest cumulative reward (while still only sampling high-probability continuations as we progress)
    1. Other methods like MCTS work as well
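A minimal sketch of reward-guided beam search as described above; `model.step(seq, top_k)` (returning candidate next tokens with their log-probabilities) and `reward_of(seq)` (summing the decoded rewards) are hypothetical interfaces, not the Trajectory Transformer API:

```python
def beam_search_plan(model, reward_of, init_seq, beam_width=8, horizon=15, top_k=32):
    """Beam search that ranks candidate sequences by cumulative reward instead of likelihood."""
    beams = [list(init_seq)]
    for _ in range(horizon):
        candidates = []
        for seq in beams:
            # expand only high-probability continuations under the sequence model
            for token, _log_prob in model.step(seq, top_k=top_k):
                candidates.append(seq + [token])
        # keep the top-K sequences with the highest cumulative reward sum_t r(s_t, a_t)
        beams = sorted(candidates, key=reward_of, reverse=True)[:beam_width]
    return beams[0]
```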

This works because planning over sequences that are likely under the data distribution keeps the chosen actions (and the predicted states) close to the data ⇒ an implicit form of policy constraint.

Which Offline Algorithm to use?

Why offline RL

Open Problems