CS 285 Notes

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley CS 285 (Sergey Levine).

Reference Notice: Material highly and mostly derived from Prof Levine's lecture slides, some ideas were borrowed from wikipedia & CS189.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

General Introduction to RL (LEC 1)

Supervised ML

Given D={(xi,yi)}D=\{(x_i,y_i)\}

we will learn to predict yy from xx.

usually assumes:

i.i.d. data (previous x,y pair does not affect next x, y pair)

known ground truth outputs in the training


Cannot adapt if something fails.

Reinforcement Learning:

Data is not i.i.d: previous outputs influence future inputs!

Ground truth answer is not known, only know if we succeeded or failed (but we generally know the reward)

Compared to traditional reinforcement learning,

With DRL(Deep Reinforcement Learning), we can solve complex problems end-to-end!

But there are challenges:

  1. Humans can learn incredibly quickly
    1. Deep RLs are usually slow
    1. probably because humans can reuse past knowledge
  1. Not clear what the reward function should be
  1. Not clear what the role of prediction should be

Types of RL Algorithms

Remember the objective of RL:

θ=arg maxθEτpθ(τ)[tr(st,at)]\theta^{*} = \argmax_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t,a_t)]
  1. Policy Gradient
    1. Directly differentiate the above objective
  1. Value-based
    1. Estimate value function or Q-function of the optimal policy (no explicit policy)
    1. Then use those functions to prove policy
  1. Actor-critic (A mix between policy gradient and value-based)
    1. Estimate value function or Q-function of the current policy, use it to improve policy
  1. Model-based RL
    1. Estimate the transition model, and then
      1. Use it for planning (no explicit policy)
      1. Use it to improve a policy
      1. Other variants

Supervised Learning of RL

Model-Based RL


  1. Dyna
  1. Guided Policy Search

Generate samples(run the policy) ⇒

Fit a model of p(st+1st,at)p(s_{t+1}|s_t,a_t)

Then improve the policy(a few options)

Improving policy:

  1. Just use the model to plan (no policy)
    1. Trajectory optimization / optimal control (primarily in continuous spaces)
      1. Backprop to optimize over actions
    1. Discrete planning in discrete action spaces
      1. Monte Carlo tree search
  1. Backprop gradients into the policy
    1. Requires some tricks to make it work
  1. Use the model to learn a separate value function or Q function
    1. Dynamic Programming
    1. Generate simulated experience for model-free learner

Value function based algorithms


  1. Q-Learning, DQN
  1. Temporal Difference Learning
  1. Fitted Value Iteration

Generate samples(run the policy) ⇒

Fit a model of V(s)V(s) or Q(s,a)Q(s,a)

Then improve the policy(set π(s)=arg maxθQ(s,a)\pi(s) = \argmax_\theta Q(s,a))

Direct Policy Gradients


  1. REINFORCE Natural Policy Gradient
  1. Trust Region Policy Optimization

Generate samples (run the policy) ⇒

Estimate the return

Rτ=tr(st,at)R_\tau = \sum_t r(s_t,a_t)

Improve the policy by

θθ+αθE[tr(st,at)]\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}[\sum_t r(s_t,a_t)]



  1. Asynchronous Advantage Actor-Critic (A3C)
  1. Soft actor-critic (SAC)

Generate samples ⇒

Fit a model V(s)V(s) or Q(s,a)Q(s,a)

Improve the policy

θθ+αθE[tr(st,at)]\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}[\sum_t r(s_t,a_t)]

Tradeoff between algorithms

Sample efficiency

“How many samples for a good policy?”
  1. Most important question: Is the algorithm off-policy?
    1. Off policy means being able to improve the policy without generating new samples from that policy
    1. On policy means we need to generate new samples even if the policy is changed a little bit
We may still want to use a less efficient algorithm because of the other tradeoffs and maybe the price of generating new samples is really cheap.

Stability & Ease of use

  1. Does our policy converge, and if it does, to what?
Supervised Learning: Almost always Gradient Descent is very likely to converge Reinforcement Learning: Often not Gradient Descent
  1. Value function fitting:
    1. Fixed point iteration
    1. At best, minimizes error of fit (“Bellman error”)
      1. Not the same as expected reward
    1. At worst, doesn’t optimize anything
      1. Many popular DRL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
  1. Model-based RL
    1. Model is not optimized for expected reward
    1. Model minimizes error of fit
      1. This will converge
    1. No guarantee that better model = better policy
  1. Policy Gradient
    1. The only one that performs Gradient Descent / Ascent on the true objective

Different assumptions

  1. Stochastic or deterministic environments?
  1. Continuous or discrete (states and action)?
  1. Episodic(finite TT) or infinite horizon?
  1. Different things are easy or hard in different settings
    1. Easier to represent the policy?
    1. Easier to represent the model?

Common Assumptions:

  1. Full observability
    1. Generally assumed by value function fitting methods
    1. Can be mitigated by adding recurrence
  1. Episodic Learning
    1. Often assumed by pure policy gradient methods
    1. Assumed by some model-based RL methods
    1. Although other methods not assumed, tend to work better under this assumption
  1. Continuity or smoothness
    1. Assumed by some continuous value function learning methods
    1. Often assumed by some model-based RL methods


Offline RL (Batch RL / fully off-policy RL)

RL Theory

Variational Inference

No RL content in this chapter, but heavy links to RL algorithms

Control as an Inference Problem

Inverse Reinforcement Learning

Transfer Learning & Meta-Learning

Open Problems

Guest Lectures