# CS 285 Notes

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley CS 285 (Sergey Levine).

Reference Notice: Material largely derived from Prof. Levine's lecture slides; some ideas were borrowed from Wikipedia and CS 189.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

# General Introduction to RL (LEC 1)

**Supervised ML**

Given $D=\{(x_i,y_i)\}$

we will learn to predict $y$ from $x$.

It usually assumes:

- i.i.d. data (the previous $(x,y)$ pair does not affect the next one)

- known ground-truth outputs in the training set

Problem:

The model cannot adapt if something fails.

**Reinforcement Learning:**

Data is not i.i.d.: previous outputs influence future inputs!

The ground-truth answer is not known; we only know whether we succeeded or failed (but we generally observe the reward).

Compared to traditional reinforcement learning, deep reinforcement learning (DRL) lets us solve complex problems end-to-end!

But there are challenges:

- Humans can learn incredibly quickly, while deep RL methods are usually slow
  - probably because humans can reuse past knowledge

- It is not clear what the reward function should be

- It is not clear what the role of prediction should be

# Types of RL Algorithms

Remember the objective of RL: find the policy parameters that maximize the expected total reward,

$$\theta^\star = \argmax_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$

- Policy Gradient
  - Directly differentiate the above objective

- Value-based
  - Estimate the value function or Q-function of the optimal policy (no explicit policy)
  - Then use those functions to improve the policy

- Actor-critic (a mix between policy gradient and value-based)
  - Estimate the value function or Q-function of the current policy, use it to improve the policy

- Model-based RL
  - Estimate the transition model, and then
    - use it for planning (no explicit policy), or
    - use it to improve a policy, or
    - other variants

## Supervised Learning of RL

## Model-Based RL

e.g.

- Dyna

- Guided Policy Search

Generate samples (run the policy) ⇒

Fit a model of $p(s_{t+1}|s_t,a_t)$ ⇒

Then improve the policy (a few options)

Improving the policy:

- Just use the model to plan (no policy)
  - Trajectory optimization / optimal control (primarily in continuous spaces): backprop to optimize over actions
  - Discrete planning in discrete action spaces, e.g. Monte Carlo tree search

- Backprop gradients into the policy
  - Requires some tricks to make it work

- Use the model to learn a separate value function or Q-function
  - Dynamic programming
  - Generate simulated experience for a model-free learner
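The model-based loop above can be sketched end-to-end on a toy problem. A minimal illustration, where everything is an assumption made for this example (a made-up 1-D linear system, a deterministic least-squares dynamics fit in place of a full $p(s_{t+1}|s_t,a_t)$, a known reward function, and random-shooting planning):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D system with unknown-to-the-agent dynamics s' = 0.9 s + 0.5 a.
def step(s, a):
    return 0.9 * s + 0.5 * a

# Reward (assumed known to the planner) encourages driving the state to zero.
def reward(s, a):
    return -(s ** 2) - 0.1 * (a ** 2)

# 1) Generate samples by running a (random) policy.
states = rng.uniform(-1, 1, size=200)
actions = rng.uniform(-1, 1, size=200)
next_states = step(states, actions)

# 2) Fit a model of the dynamics: here a linear fit s' ≈ A s + B a by least squares.
X = np.stack([states, actions], axis=1)
(A, B), *_ = np.linalg.lstsq(X, next_states, rcond=None)

# 3) Use the model to plan (no explicit policy): random shooting over action
#    sequences, rolled out under the *learned* model; execute the first action.
def plan(s0, horizon=5, n_candidates=256):
    seqs = rng.uniform(-1, 1, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(seqs):
        s = s0
        for a in seq:
            returns[i] += reward(s, a)
            s = A * s + B * a  # learned dynamics, not the true simulator
    return seqs[np.argmax(returns)][0]  # MPC-style: take only the first action

a0 = plan(1.0)
```

Since the toy system is noiseless and linear, the least-squares fit recovers the true coefficients, and from $s_0 = 1$ the planner picks a negative action to push the state toward zero.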

## Value function based algorithms

e.g.

- Q-Learning, DQN

- Temporal Difference Learning

- Fitted Value Iteration

Generate samples (run the policy) ⇒

Fit a model of $V(s)$ or $Q(s,a)$ ⇒

Then improve the policy (set $\pi(s) = \argmax_a Q(s,a)$)
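As a concrete sketch of the value-based recipe, here is tabular Q-learning on a tiny made-up 2-state MDP (the MDP, hyperparameters, and step count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP: action 1 moves toward / keeps you in
# state 1, and being in state 1 yields reward 1.
P = np.array([[[1.0, 0.0],    # s=0, a=0 -> stay in s=0
               [0.0, 1.0]],   # s=0, a=1 -> go to s=1
              [[1.0, 0.0],    # s=1, a=0 -> back to s=0
               [0.0, 1.0]]])  # s=1, a=1 -> stay in s=1
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])    # reward depends on the current state
gamma, alpha = 0.9, 0.1

Q = np.zeros((2, 2))
s = 0
for _ in range(5000):
    # epsilon-greedy behavior policy (Q-learning is off-policy)
    a = rng.integers(2) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next = rng.choice(2, p=P[s, a])
    # TD target uses max over actions -> estimates the *optimal* Q-function
    target = R[s, a] + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

# improve the policy: pi(s) = argmax_a Q(s, a)
pi = np.argmax(Q, axis=1)
```

Note there is no explicit policy object anywhere; the greedy policy is read off the fitted Q-table at the end.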

## Direct Policy Gradients

e.g.

- REINFORCE

- Natural Policy Gradient

- Trust Region Policy Optimization (TRPO)

Generate samples (run the policy) ⇒

Estimate the return ⇒

Improve the policy by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
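The three steps can be sketched with REINFORCE on a toy one-step problem (the two-armed bandit setup, step size, and noise model are illustrative assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step bandit: 2 actions, action 1 pays more on average.
# The policy is a softmax over per-action logits theta.
true_rewards = np.array([0.0, 1.0])
theta = np.zeros(2)
alpha = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    # 1) generate samples (run the policy)
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    # 2) estimate the return (here just the noisy immediate reward)
    ret = true_rewards[a] + 0.1 * rng.standard_normal()
    # 3) improve the policy: theta <- theta + alpha * grad log pi(a) * return
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0  # gradient of log softmax(theta)[a] w.r.t. theta
    theta += alpha * grad_log_pi * ret

final_probs = softmax(theta)
```

The update is the score-function (likelihood-ratio) estimator of $\nabla_\theta J(\theta)$, so the probability of the higher-reward action climbs toward 1.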

## Actor-critic

e.g.

- Asynchronous Advantage Actor-Critic (A3C)

- Soft actor-critic (SAC)

Generate samples ⇒

Fit a model of $V(s)$ or $Q(s,a)$ ⇒

Improve the policy
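A minimal one-step actor-critic sketch on a made-up 2-state chain (the environment and hyperparameters are illustrative assumptions): the critic is a TD(0) value table, and the actor is a per-state softmax updated with the TD error as the advantage estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state chain: the chosen action *is* the next state,
# and being in state 1 yields reward 1.
R = np.array([0.0, 1.0])      # reward for the current state
gamma = 0.9
theta = np.zeros((2, 2))      # actor: per-state action logits
V = np.zeros(2)               # critic: state-value estimates
alpha_pi, alpha_v = 0.05, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for _ in range(5000):
    probs = softmax(theta[s])
    a = rng.choice(2, p=probs)
    s_next = a                 # deterministic transition: action picks the next state
    r = R[s]
    # critic: TD(0) update; the TD error doubles as the advantage estimate
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td_error
    # actor: policy gradient step weighted by the critic's TD error
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * grad_log_pi * td_error
    s = s_next

pi = np.argmax(theta, axis=1)
```

Unlike pure policy gradient, the return is not estimated from raw rollouts; the critic's $V(s)$ supplies it via the TD error, which is what "fit a model of $V(s)$, then improve the policy" means here.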

## Tradeoff between algorithms

### Sample efficiency

“How many samples for a good policy?”

- Most important question: is the algorithm off-policy?
  - Off-policy means being able to improve the policy without generating new samples from that policy
  - On-policy means we need to generate new samples even if the policy changes only a little bit

### Stability & Ease of use

- Does our policy converge, and if it does, to what?

- Value function fitting:
  - Fixed-point iteration
  - At best, minimizes error of fit (“Bellman error”)
    - Not the same as expected reward
  - At worst, doesn’t optimize anything
    - Many popular DRL value-fitting algorithms are not guaranteed to converge to anything in the nonlinear case

- Model-based RL:
  - The model is not optimized for expected reward
  - The model minimizes error of fit
    - This will converge
  - But there is no guarantee that a better model means a better policy

- Policy gradient:
  - The only one that performs gradient descent/ascent on the true objective

### Different assumptions

- Stochastic or deterministic environments?

- Continuous or discrete (states and actions)?

- Episodic (finite $T$) or infinite horizon?

- Different things are easy or hard in different settings
  - Is it easier to represent the policy?
  - Is it easier to represent the model?

Common Assumptions:

- Full observability
  - Generally assumed by value function fitting methods
  - Can be mitigated by adding recurrence

- Episodic learning
  - Often assumed by pure policy gradient methods
  - Assumed by some model-based RL methods
  - Although not assumed by other methods, they tend to work better under this assumption

- Continuity or smoothness
  - Assumed by some continuous value function learning methods
  - Often assumed by some model-based RL methods

# Exploration

# Offline RL (Batch RL / fully off-policy RL)

# RL Theory

# Variational Inference

No RL content in this chapter, but it connects heavily to RL algorithms.

## Control as an Inference Problem

# Inverse Reinforcement Learning

# Transfer Learning & Meta-Learning

# Open Problems

# Guest Lectures