Offline RL

There is a huge gap between deep supervised learning and deep RL ⇒ supervised learning works so well largely because it has tons of data; offline RL aims to let RL exploit large datasets too.

Formally:

D = \{ (s_i,a_i,s_i',r_i) \} \\ s \sim d^{\pi_\beta}(s) \\ a \sim \pi_\beta(a|s) \\ r \leftarrow r(s,a)
\pi_\beta is the unknown policy that collected the data

Off-Policy Evaluation (OPE)

Given D, estimate the return J(\pi) of some policy \pi

Offline Reinforcement Learning

Given D, learn the best possible policy \pi_\theta (i.e., the best policy whose good performance is supported by evidence in the dataset)

How is this even possible?

  1. Find the “good stuff” in a dataset full of good and bad behaviors
  1. Generalization: Good behavior in one place may suggest good behavior in another place
  1. “Stitching”: parts of good behaviors can be recombined (even mostly-bad trajectories can contain good segments)

(Figure panels: Behavior Cloning vs. “Stitching”)

What can go wrong if we just use off-policy RL (on data collected by other agents)?

Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke, Levine. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation

So it works in practice, but there is a huge gap between the purely offline (not fine-tuned) result and the result after online fine-tuning.

Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction. NeurIPS ‘19
Q-function overfitting with an off-policy actor-critic algorithm
Fundamental Problem: Counterfactual queries

Online RL algorithms don't have to handle this because they can simply try the action and see what happens ⇒ not possible in the offline setting

Offline RL methods must somehow account for these unseen (”out-of-distribution”) actions, ideally in a safe way

Distribution Shift

In a supervised learning setting, we want to perform empirical risk minimization (ERM):

\theta \leftarrow \argmin_\theta \mathbb{E}_{x \sim p(x), y\sim p(y|x)} [(f_\theta(x)-y)^2]

Given some x^*:

\mathbb{E}_{x \sim p(x), y\sim p(y|x)} [(f_\theta(x)-y)^2] \text{ is low (since distribution is same as training set)} \\ \mathbb{E}_{x \sim \bar{p}(x), y \sim p(y|x)} [(f_\theta(x)-y)^2] \text{ is generally not if $\bar{p}(x) \ne p(x)$}

Yes, neural nets generalize well ⇒ but is it well enough?

If we pick x^* \leftarrow \argmax_x f_\theta(x)

(Figure: the blue curve is the fitted function, the green curve is the true function.)

When we take the \argmax, we are likely selecting a point whose value estimate includes large positive noise!
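A tiny, self-contained numpy illustration (a made-up setup, not from the lecture: the true value is 0 everywhere and the fitted model has unbiased, zero-mean errors) of why the argmax lands on positive noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
f_true = np.zeros(n)                          # true objective: 0 everywhere
f_hat = f_true + rng.normal(0.0, 1.0, n)      # fitted values: unbiased, zero-mean error

print("mean error:", f_hat.mean())            # ~0  -> the ERM objective looks great on average
x_star = np.argmax(f_hat)                     # x* <- argmax_x f_theta(x)
print("f_hat(x*):", f_hat[x_star])            # ~ +3.8: almost pure positive noise
print("f_true(x*):", f_true[x_star])          # 0: the apparent improvement is imaginary
```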

Q-Learning Actor-Critic with Distribution Shift

Q-objective:

\min_{Q} \mathbb{E}_{(s,a) \sim \pi_\beta(s,a)}[(Q(s,a)-y(s,a))^2]

i.e., the Q-function is trained toward the backup target:

Q(s,a) \leftarrow r(s,a) + \mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]

The distribution shift problem kicks in when

\pi_{new} \ne \pi_\beta

And even worse,

\pi_{new} = \argmax_\pi \mathbb{E}_{a \sim \pi(a|s)} [Q(s,a)]

Online RL setting: we can sample the supposedly “optimal” action, observe the outcome, and correct the error in our approximation.

Offline setting: there is no way to discover the error in our approximation.

🧙🏽‍♂️
Existing Challenges with sampling error and function approximation error in standard RL become much more severe in offline RL

Batch RL via Importance Sampling (Traditional Offline Algorithms)

Importance Sampling:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\frac{\pi_\theta(\tau_i)}{\pi_\beta(\tau_i)}}_{\mathclap{\text{importance weight}}} \sum_{t=0}^T \nabla_\theta \gamma^t \log \pi_\theta(a_{t,i}|s_{t,i})\hat{Q}(s_{t,i},a_{t,i})

where the importance weight is exponential in T:

\frac{\pi_\theta(\tau)}{\pi_\beta(\tau)} = \frac{p(s_1) \prod_t p(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)}{p(s_1)\prod_t p(s_{t+1}|s_t,a_t) \pi_\beta(a_t|s_t)} \in O(e^T)
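A small numpy experiment (an illustrative toy, not from the lecture) showing how the variance of the per-trajectory importance weight blows up with the horizon T, even when each per-step ratio stays close to 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def trajectory_weight(T, spread=0.3):
    # per-step ratios pi_theta(a_t|s_t) / pi_beta(a_t|s_t), fluctuating around 1
    ratios = 1.0 + rng.uniform(-spread, spread, size=T)
    return np.prod(ratios)

for T in (10, 100, 1000):
    weights = np.array([trajectory_weight(T) for _ in range(10_000)])
    # most weights collapse toward 0 while a few explode;
    # the variance grows roughly exponentially in T
    print(f"T={T:4d}  mean={weights.mean():10.3f}  var={weights.var():14.3f}")
```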

Can we fix this?

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \underbrace{(\prod_{t'=0}^{t-1} \frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})})}_{\mathclap{\text{accounts for difference in probability of landing in $s_{t,i}$}}} \nabla_\theta \gamma^t \log \pi_\theta(a_{t,i}|s_{t,i}) \underbrace{(\prod_{t'=t}^T\frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})}) }_{\mathclap{\text{accounts for incorrect $\hat{Q}(s_{t,i},a_{t,i})$}}}\hat{Q}(s_{t,i},a_{t,i})

Classic advanced policy gradient methods simply drop the first product term ⇒ assuming that the new policy \pi_\theta is not too far from \pi_\beta

We can estimate

\hat{Q}(s_{t,i},a_{t,i}) = \mathbb{E}_{\pi_\theta}[\sum_{t'=t}^T \gamma^{t'-t} r_{t'}] \approx \sum_{t'=t}^T \gamma^{t'-t} r_{t',i}

We can simplify the second portion using the fact that actions taken after time t' do not affect the reward at t':

(\prod_{t'=t}^T\frac{\pi_\theta(a_{t',i}|s_{t',i})}{\pi_\beta(a_{t',i}|s_{t',i})})\hat{Q}(s_{t,i},a_{t,i}) = \sum_{t'=t}^T(\prod_{t''=t}^T\frac{\pi_\theta(a_{t'',i}|s_{t'',i})}{\pi_\beta(a_{t'',i}|s_{t'',i})})\gamma^{t'-t} r_{t',i} \approx \sum_{t'=t}^T(\prod_{t''=t}^{t'}\frac{\pi_\theta(a_{t'',i}|s_{t'',i})}{\pi_\beta(a_{t'',i}|s_{t'',i})})\gamma^{t'-t} r_{t',i}

But this is still exponential in T.

🧙🏽‍♂️
To avoid exponentially exploding importance weights, we must use value function estimation!

The Doubly Robust Estimator

Jiang, N. and Li, L. (2015). Doubly robust off-policy value evaluation for reinforcement learning
\begin{split} V^{\pi_\theta}(s_0) &\approx \sum_{t=0}^T (\prod_{t'=0}^t \frac{\pi_\theta(a_{t'}|s_{t'})}{\pi_\beta(a_{t'}|s_{t'})})\gamma^t r_t \\ &= \sum_{t=0}^T(\prod_{t'=0}^t \rho_{t'}) \gamma^t r_t \\ &= \rho_0 r_0 + \rho_0\gamma\rho_1r_1 + \rho_0\gamma \rho_1 \gamma \rho_2 r_2 + \dots \\ &=\rho_0(r_0 + \gamma(\rho_1(r_1+...))) \\ &=\bar{V}^T \end{split}

We notice a recursion relationship:

\bar{V}^{T+1-t} = \rho_t(r_t + \gamma \bar{V}^{T-t})

Bandit case of doubly robust estimation:

V_{DR}(s) = \hat{V}(s) + \rho(s,a)(r_{s,a} - \hat{Q}(s,a))

Doubly Robust Estimation:

\bar{V}_{DR}^{T+1-t} = \hat{V}(s_t) + \rho_t(r_t + \gamma \bar{V}_{DR}^{T-t} - \hat{Q}(s_t,a_t))
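A minimal numpy sketch of the two recursions above, assuming a single logged trajectory with per-step importance ratios ρ_t, rewards r_t, and (for the doubly robust version) learned estimates V̂(s_t) and Q̂(s_t, a_t); the array names are illustrative:

```python
import numpy as np

def is_value(rhos, rewards, gamma=0.99):
    """Plain importance-sampled value: V_bar^{T+1-t} = rho_t * (r_t + gamma * V_bar^{T-t})."""
    v_bar = 0.0
    for rho_t, r_t in zip(reversed(rhos), reversed(rewards)):
        v_bar = rho_t * (r_t + gamma * v_bar)
    return v_bar

def doubly_robust_value(rhos, rewards, v_hat, q_hat, gamma=0.99):
    """Doubly robust: V_DR^{T+1-t} = V_hat(s_t) + rho_t * (r_t + gamma * V_DR^{T-t} - Q_hat(s_t, a_t))."""
    v_dr = 0.0
    for rho_t, r_t, v_t, q_t in zip(reversed(rhos), reversed(rewards),
                                    reversed(v_hat), reversed(q_hat)):
        v_dr = v_t + rho_t * (r_t + gamma * v_dr - q_t)
    return v_dr
```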

Marginalized Importance Sampling

So far we have done importance sampling over action distributions (per-timestep ratios) ⇒ but it is also possible to do it over state(-action) marginal distributions
🧙🏽‍♂️
Instead of using \prod_t \frac{\pi_\theta(a_t|s_t)}{\pi_\beta(a_t|s_t)}, estimate w(s,a) = \frac{d^{\pi_\theta}(s,a)}{d^{\pi_\beta}(s,a)}

How to determine w(s,a)?

Suggested Reading

Batch RL via Linear Fitted Value Functions

Assume we have a feature matrix

\Phi \in \mathbb{R}^{|S| \times K}

where |S| is the number of states and K is the number of features. Then, to do model-based RL in feature space:

  1. Estimate the reward
    1. \Phi w_r \approx r ⇒ w_r = (\Phi^{\top} \Phi)^{-1} \Phi^{\top} \vec{r}
  1. Estimate the transitions
    1. \Phi P_{\Phi} \approx P^{\pi} \Phi
      1. P_\Phi ⇒ estimated feature-space transition matrix \in \mathbb{R}^{K \times K}
      1. P_\Phi = (\Phi^{\top} \Phi)^{-1} \Phi^{\top} P^{\pi} \Phi
    1. P^\pi \in \mathbb{R}^{|S| \times |S|} ⇒ real transition matrix (on states), dependent on the policy
  1. Estimate the value function
    1. V^{\pi} \approx V_{\Phi}^\pi = \Phi w_V
      1. Solving for V^\pi in terms of P^\pi and r:
        1. V^\pi = r + \gamma P^\pi V^\pi
        1. (I-\gamma P^\pi) V^\pi = r
        1. V^\pi = (I - \gamma P^\pi)^{-1}r
    1. w_V = (I - \gamma P_\Phi)^{-1} w_r = (\Phi^\top \Phi-\gamma \Phi^\top P^\pi \Phi)^{-1} \Phi^\top \vec{r} ⇒ Least-Squares Temporal Difference (LSTD); a minimal numpy sketch appears after this list
    1. But we don't know P^\pi
    1. When working from samples, the feature matrix becomes \Phi \in \mathbb{R}^{|D| \times K} (one row per sampled transition)
    1. We then replace P^\pi \Phi with \Phi' such that each row \Phi_i' contains the features of the next state: \Phi_i' = \phi(s_i')
      1. However, if we change the policy, these sampled next states no longer reflect the new P^\pi
    1. Also replace the reward vector with sampled rewards: \vec{r}_i = r(s_i,a_i)
    1. Everything else works exactly the same, except that we now have some sampling error
  1. Improve the policy
    1. \pi'(s) \leftarrow \text{Greedy}(\Phi w_V)
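A minimal numpy sketch of sample-based LSTD as described above; `phi`, `phi_next`, and `rewards` are assumed arrays built from the offline dataset (names are illustrative):

```python
import numpy as np

def lstd_value_weights(phi, phi_next, rewards, gamma=0.99):
    """Sample-based LSTD: w_V = (Phi^T Phi - gamma * Phi^T Phi')^{-1} Phi^T r.

    phi:      (N, K) features of the sampled states,      Phi_i  = phi(s_i)
    phi_next: (N, K) features of the sampled next states, Phi'_i = phi(s'_i)
    rewards:  (N,)   sampled rewards r_i = r(s_i, a_i)
    """
    a = phi.T @ phi - gamma * (phi.T @ phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(a, b)   # solve the linear system instead of forming the inverse
```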

Least Squares Policy Iteration (LSPI)

🧙🏽‍♂️
Replace LSTD with LSTDQ - LSTD but for Q functions
w_Q = (\Phi^\top \Phi-\gamma \Phi^\top \underbrace{\Phi'}_{\mathclap{\Phi'_i = \phi(s_i',\pi(s_i'))}})^{-1} \Phi^\top \vec{r}

LSPI:

  1. Compute w_Q for \pi_k
  1. \pi_{k+1}(s) = \argmax_a \phi(s,a)w_Q
  1. Set \Phi_i' = \phi(s_i', \pi_{k+1}(s_i')) and repeat (a minimal sketch of the loop follows)
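A minimal sketch of the LSPI loop under the same assumptions, where `featurize(s, a)` is a hypothetical state-action feature map, `transitions` come from the offline dataset, and the action set is finite; this is an illustrative sketch, not a reference implementation:

```python
import numpy as np

def lstdq_weights(phi_sa, phi_next_sa, rewards, gamma=0.99):
    """LSTDQ: like LSTD, but Phi'_i = phi(s'_i, pi(s'_i)) uses the current policy's action."""
    a = phi_sa.T @ phi_sa - gamma * (phi_sa.T @ phi_next_sa)
    return np.linalg.solve(a, phi_sa.T @ rewards)

def lspi(transitions, featurize, actions, gamma=0.99, n_iters=20):
    """transitions: list of (s, a, r, s') tuples from the dataset; actions: finite action set."""
    phi_sa = np.array([featurize(s, a) for s, a, _, _ in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    policy = lambda s: actions[0]                        # arbitrary initial policy
    for _ in range(n_iters):
        # Step 3: Phi'_i = phi(s'_i, pi_k(s'_i)) under the current policy
        phi_next_sa = np.array([featurize(s_next, policy(s_next))
                                for _, _, _, s_next in transitions])
        w_q = lstdq_weights(phi_sa, phi_next_sa, rewards, gamma)   # Step 1
        # Step 2: greedy improvement pi_{k+1}(s) = argmax_a phi(s, a)^T w_Q
        policy = lambda s, w=w_q: max(actions, key=lambda a: featurize(s, a) @ w)
    return policy, w_q
```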

Problem: the \argmax in step 2 still queries the Q-function at actions never seen in the data ⇒ the same distribution-shift problem as before.

Policy Constraints

A fairly old idea; implementing only this generally does not work very well
Q \leftarrow r(s,a) + \mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]
\pi_{new}(a|s) = \argmax_\pi \mathbb{E}_{a \sim \pi(a|s)}[Q(s,a)] \text{ s.t. } D_{KL}(\pi,\pi_\beta) \le \epsilon

The KL constraint D_{KL}(\pi, \pi_\beta) can be too restrictive: if \pi_\beta is highly suboptimal (e.g., close to random), forcing \pi to stay close to it in KL also forces \pi to put probability on bad actions.

So maybe instead we want to pose a “support constraint”

\pi(a|s) > 0 \text{ only if } \pi_\beta(a|s) \ge \epsilon

How to implement?

Implementing policy constraints in practice
  1. Modify the actor objective
    1. Common objective: \theta \leftarrow \argmax_\theta \mathbb{E}_{s \sim D} [\mathbb{E}_{a \sim \pi_\theta(a|s)}[Q(s,a)]]
    1. D_{KL}(\pi, \pi_\beta) = \mathbb{E}[\log \pi(a|s) - \log \pi_\beta (a|s)] = -\mathbb{E}_\pi[\log \pi_\beta(a|s)] - H(\pi)
    1. So we can incorporate the KL divergence into the actor objective
    1. Use Lagrange Multiplier
      1. \theta \leftarrow \argmax_\theta \mathbb{E}_{s \sim D} [\mathbb{E}_{a \sim \pi_\theta(a|s)}[Q(s,a) + \lambda \log \pi_\beta (a|s)]+\lambda H(\pi (a|s))]
      1. In practice, we either solve for the Lagrange multiplier or treat it as a hyperparameter and tune it
  1. Modify the reward function
    1. \bar{r}(s,a) = r(s,a) - D(\pi, \pi_\beta)
    1. Simple modification to directly penalize divergence
    1. Also accounts for future divergence (in later steps)
    1. Wu, Tucker, Nachum, Behavior Regularized Offline Reinforcement Learning. ‘19
  1. Implicit Policy Constraint Methods
    1. Solving the KL-constrained problem via its Lagrangian (straightforward to show via duality), we get
    1. \pi^*(a|s) = \frac{1}{Z(s)} \pi_\beta (a|s) \exp(\frac{1}{\lambda} A^\pi (s,a))
    1. Peters et al. (REPS)
    1. Rawlik et al. (”psi-learning”)
    1. We can do this by
      1. Approximate via weighted max likelihood
      1. \pi_{new}(a|s) = \argmax_{\pi} \mathbb{E}_{\underbrace{(s,a) \sim \pi_\beta}_{(s,a) \sim D}} [\log \pi(a|s) \underbrace{\frac{1}{Z(s)} \exp(\frac{1}{\lambda} A^{\pi_{old}} (s,a)) }_{w(s,a)}]
      1. Imitating good actions “more” than bad actions, as determined by the weight w(s,a) (see the sketch after this list)
    1. Problem: In order to get advantage values / target values we still need to query OOD actions
      1. If we choose λ\lambda correctly the constraints will be respected at convergence, but not necessarily at intermediate steps
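A minimal PyTorch-style sketch of the weighted maximum-likelihood update above (in the spirit of AWR/AWAC). Here `policy(states)` is assumed to return a torch distribution, `advantages` are estimates of A^{\pi_{old}}(s,a) from some critic, and the normalizer Z(s) is folded into the weights and dropped, as is common in practice:

```python
import torch

def weighted_bc_actor_loss(policy, states, actions, advantages, lam=1.0, max_weight=20.0):
    """Imitate dataset actions, weighting "good" actions more via w(s,a) ~ exp(A / lambda)."""
    weights = torch.exp(advantages / lam).clamp(max=max_weight)  # clip to avoid huge weights
    log_probs = policy(states).log_prob(actions)                 # log pi(a|s) for dataset actions
    return -(weights.detach() * log_probs).mean()                # maximize weighted log-likelihood
```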
🧙🏽‍♂️
Can we avoid ALL OOD actions in the Q update?

Instead of doing

Q(s,a) \leftarrow r(s,a) + \mathbb{E}_{a' \sim \pi_{new}}[Q(s',a')]

We can do:

Q(s,a) \leftarrow r(s,a) + V(s')

But how do we fit this V function?

  1. V \leftarrow \argmin_{V} \frac{1}{N} \sum_{i=1}^N l(V(s_i), Q(s_i, a_i))
    1. MSE loss: (V(s_i) - Q(s_i, a_i))^2
      1. Problem ⇒ the actions come from the exploration policy \pi_\beta, not from \pi_{new}
      1. Gives us \mathbb{E}_{a \sim \pi_\beta}[Q(s,a)]

Think about this:

(Figure: visualization of the distribution of states/actions in the dataset)
  1. For any “exact” state, we've probably seen only a few actions, or none at all
    1. But there may be very “similar” states where we did take actions ⇒ the data is not bins of states but distributions over states
    1. Can we use those “similar” states as extra data points to generalize from?
    1. What if we use a “quantile” / “expectile” ⇒ an expectile is like a quantile, but with a squared rather than absolute loss
      1. An upper expectile / quantile of Q corresponds to the best behavior supported by the data

Implicit Q-Learning

Expectile:

l_2^\tau (x) = \begin{cases} (1-\tau)x^2 &\text{if } x>0 \\ \tau x^2 &\text{otherwise} \end{cases}

Intuition:

  1. MSE penalizes positive and negative errors equally
  1. The expectile loss penalizes positive and negative errors with different weights depending on \tau \in [0,1]

(Figure: weights given to positive and negative errors in expectile regression; taken from the IQL paper.)

Concretely, we fit V with the expectile regression objective:

V \leftarrow \argmin_{V} \mathbb{E}_{(s,a) \sim D} [l_2^\tau(V(s)-Q(s,a))]

In practice, Q(s,a) would be a target network.
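A minimal PyTorch-style sketch of the two losses (expectile regression for V against a target Q network, and the Q update that bootstraps through V(s') so no out-of-distribution actions are ever queried); the network objects and batch layout are assumptions for illustration:

```python
import torch

def expectile_loss(diff, tau=0.9):
    """l_2^tau applied to diff = V(s) - Q(s,a): weight (1 - tau) if diff > 0, else tau."""
    weight = torch.abs(tau - (diff > 0).float())   # = 1 - tau where diff > 0, tau elsewhere
    return (weight * diff ** 2).mean()

def iql_style_losses(v_net, q_net, q_target_net, batch, tau=0.9, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    # V update: expectile regression toward the target Q at *dataset* actions only
    v_loss = expectile_loss(v_net(s) - q_target_net(s, a).detach(), tau)
    # Q update: Q(s, a) <- r + gamma * V(s'); no actions are sampled from pi_new
    target = r + gamma * v_net(s_next).detach()
    q_loss = ((q_net(s, a) - target) ** 2).mean()
    return v_loss, q_loss
```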

Formally, one can show that, given a large enough \tau,

V(s) \leftarrow \max_{a \in \Omega(s)} Q(s,a)

Where

\Omega(s) = \{a: \pi_\beta(a|s) \ge \epsilon \}
Kostrikov, Nair, Levine. Offline Reinforcement Learning with Implicit Q-Learning. ‘21
📌
Oh yes, that's my postdoc's paper (he's the first author)! This is absolutely an amazing paper and he is absolutely an amazing person; it's also one of the papers I read before joining Sergey's group. I hope you get to check it out!

Conservative Q-Learning (CQL)

Kumar, Zhou, Tucker, Levine. Conservative Q-Learning for Offline Reinforcement Learning
Directly repair overestimated actions in the Q function

Intuition: push down on areas that have erroneously high Q-values

We can formally show that \hat{Q}^\pi \le Q^\pi for a large enough \alpha

But with this objective we're also pushing down on genuinely good actions ⇒ so add an additional term that pushes up on actions supported by the data

In practice, we add a regularizer R(\mu) to L_{CQL}(\hat{Q}^\pi) so that \mu doesn't need to be computed explicitly.

L_{CQL}(\hat{Q}^\pi) = \max_{\mu} \{ \alpha\mathbb{E}_{s\sim D, a\sim \mu(a|s)}[\hat{Q}^\pi(s,a)] - \alpha \mathbb{E}_{(s,a) \sim D}[Q(s,a)] \} + \mathbb{E}_{(s,a,s') \sim D}[(Q(s,a) - (r(s,a)+\mathbb{E}[Q(s',a')]))^2]

Common choice: R(\mu) = \mathbb{E}_{s \sim D} [H(\mu(\cdot | s))]

⇒ Then \mu(a|s) \propto \exp[Q(s,a)], and the first term becomes \mathbb{E}_{a \sim \mu(a|s)}[Q(s,a)] + H(\mu) = \log \sum_a \exp(Q(s,a)) (a soft maximum over actions)
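A minimal PyTorch-style sketch of the resulting CQL(H)-style loss for a discrete-action Q-network, where `q_net(s)` returns Q-values for all actions (so the log-sum-exp is exact); for concreteness the Bellman target below uses a simple max backup over a target network, which is an assumption of this sketch rather than the lecture's exact setup:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, q_target_net, batch, alpha=1.0, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    q_all = q_net(s)                                       # (B, |A|) Q-values for all actions
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a) at the dataset actions

    # Conservative term: push down the soft maximum over actions, push up dataset actions
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    # Standard Bellman error term
    with torch.no_grad():
        target = r + gamma * q_target_net(s_next).max(dim=1).values
    bellman = F.mse_loss(q_data, target)

    return alpha * conservative + bellman
```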

Model-based Offline RL

What goes wrong when we cannot collect more data?

MOPO: Model-Based Offline Policy Optimization

Yu*, Thomas*, Yu, Ermon, Zou, Levine, Finn, Ma. MOPO: Model-Based Offline Policy Optimization. ‘20 MOReL : Model-Based Offline Reinforcement Learning. ’20 (concurrent)

“Punish” the policy for exploiting the model (i.e., for going where the model is likely to be wrong):

\tilde{r}(s,a) = r(s,a) -\lambda u(s,a)

where u(s,a) is the model's uncertainty about the state and action

u(s,a) needs to be at least as large as the model's error (according to some kind of divergence) ⇒ getting a good estimate is still an open problem

Current practice: use the disagreement of an ensemble of models as a proxy for this uncertainty
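A minimal sketch of the penalized reward using ensemble disagreement as the uncertainty proxy u(s, a); `ensemble`, `model.predict`, and `reward_fn` are assumed interfaces for illustration (MOPO itself uses a particular heuristic based on the learned models' predicted variances):

```python
import numpy as np

def penalized_reward(ensemble, reward_fn, s, a, lam=1.0):
    """r_tilde(s, a) = r(s, a) - lambda * u(s, a), with u taken from ensemble disagreement."""
    preds = np.stack([model.predict(s, a) for model in ensemble])   # (E, state_dim) next-state predictions
    u = float(np.linalg.norm(preds.std(axis=0)))                    # disagreement across the ensemble
    return reward_fn(s, a) - lam * u
```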

Theoretical analysis:

In particular, \forall \delta \ge \delta_{\min}

Some implications:

\eta_M(\hat{\pi}) \ge \eta_M(\pi^{B})-2\lambda \epsilon_u(\pi^{B})
We can almost always do at least as well as the behavior (exploration) policy, since the model uncertainty is very small on data from \pi^B
\eta_M(\hat{\pi}) \ge \eta_M(\pi^*)-2\lambda \epsilon_u(\pi^*)
Quantifies the optimality “gap” in terms of model error

COMBO: Conservative Model-Based RL

Yu, Kumar, Rafailov, Rajeswaran, Levine, Finn. COMBO: Conservative Offline Model-Based Policy Optimization. 2021.
📌
Basic idea: just like CQL minimizes the Q-values of policy actions, we can minimize the Q-values of state-action tuples generated by the model

Intuition: if the model produces something that looks clearly different from real data, it’s easy for the Q-function to make it look bad

Trajectory Transformer

Janner, Li, Levine. Reinforcement Learning as One Big Sequence Modeling Problem. 2021.
Model the trajectory distribution with an autoregressive sequence model
  1. Train a joint state-action model
    1. p_\beta(\tau) = p_\beta(s_1,a_1, \dots, s_T, a_T)
    1. Intuitively, we want to optimize for a plan (a sequence of actions) that has high probability under the data distribution
  1. Use a big expressive model (Transformer)
    1. The subscripts in the model figure denote the (timestep, dimension) of the state and action vectors
    1. The diagram shows a generic autoregressive sequence model (e.g., an LSTM)
    1. With a transformer, we need a causal mask
      1. GPT-style model
    1. Because we are regressing on both state and action sequences, we can get accurate predictions out to much longer horizons
  1. Control
    1. Beam search, but score candidates by \sum_t r(s_t,a_t) instead of by probability (a minimal sketch follows this list)
      1. Given the current sequence, sample candidate next tokens from the model
      1. Keep the top K sequences with the highest cumulative reward (while still only sampling high-probability continuations as we progress)
    1. Other methods like MCTS work as well
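A minimal sketch of reward-guided beam search as described above; `model.step(seq, top_k)` (returning candidate next tokens with their log-probabilities) and `reward_of(seq)` (summing the decoded rewards) are hypothetical interfaces, not the Trajectory Transformer API:

```python
def beam_search_plan(model, reward_of, init_seq, beam_width=8, horizon=15, top_k=32):
    """Beam search that ranks candidate sequences by cumulative reward instead of likelihood."""
    beams = [list(init_seq)]
    for _ in range(horizon):
        candidates = []
        for seq in beams:
            # expand only high-probability continuations under the sequence model
            for token, _log_prob in model.step(seq, top_k=top_k):
                candidates.append(seq + [token])
        # keep the top-K sequences with the highest cumulative reward sum_t r(s_t, a_t)
        beams = sorted(candidates, key=reward_of, reverse=True)[:beam_width]
    return beams[0]
```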

This works because planning over sequences that are likely under the data distribution keeps the chosen actions (and the predicted states) close to the data ⇒ an implicit form of policy constraint.

Which Offline Algorithm to use?

Why offline RL

Open Problems