# CS 285 Notes

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley CS 285 (Sergey Levine).

Reference Notice: Material largely derived from Prof. Levine's lecture slides; some ideas were borrowed from Wikipedia and CS 189.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

# General Introduction to RL (LEC 1)

**Supervised ML**

Given $D=\{(x_i,y_i)\}$

we will learn to predict $y$ from $x$.

It usually assumes:

- i.i.d. data (the previous $(x,y)$ pair does not affect the next one)

- known ground-truth outputs in the training set

Problem:

The model cannot adapt if something fails.

**Reinforcement Learning:**

Data is not i.i.d.: previous outputs influence future inputs!

The ground-truth answer is not known; we only know whether we succeeded or failed (but we generally observe the reward).

Compared to traditional reinforcement learning, deep reinforcement learning (DRL) lets us solve complex problems end-to-end!

But there are challenges:

- Humans can learn incredibly quickly, while deep RL methods are usually slow
  - probably because humans can reuse past knowledge

- It is not clear what the reward function should be

- It is not clear what the role of prediction should be

# Types of RL Algorithms

Remember the objective of RL: find the policy parameters that maximize the expected total reward,

$$\theta^\star = \argmax_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t,a_t)\right]$$

- Policy Gradient
  - Directly differentiate the above objective

- Value-based
  - Estimate the value function or Q-function of the optimal policy (no explicit policy)
  - Then use those functions to improve the policy

- Actor-critic (a mix between policy gradient and value-based)
  - Estimate the value function or Q-function of the current policy, use it to improve the policy

- Model-based RL
  - Estimate the transition model, and then
    - use it for planning (no explicit policy), or
    - use it to improve a policy, or
    - other variants

## Supervised Learning of RL

## Model-Based RL

e.g.

- Dyna

- Guided Policy Search

Generate samples (run the policy) ⇒

Fit a model of $p(s_{t+1}|s_t,a_t)$ ⇒

Then improve the policy (a few options)

Improving the policy:

- Just use the model to plan (no policy)
  - Trajectory optimization / optimal control (primarily in continuous spaces): backprop to optimize over actions
  - Discrete planning in discrete action spaces, e.g. Monte Carlo tree search

- Backprop gradients into the policy
  - Requires some tricks to make it work

- Use the model to learn a separate value function or Q-function
  - Dynamic programming
  - Generate simulated experience for a model-free learner
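The model-based loop above can be sketched end-to-end on a toy problem. A minimal illustration, where everything is an assumption made for this example (a made-up 1-D linear system, a deterministic least-squares dynamics fit in place of a full $p(s_{t+1}|s_t,a_t)$, a known reward function, and random-shooting planning):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D system with unknown-to-the-agent dynamics s' = 0.9 s + 0.5 a.
def step(s, a):
    return 0.9 * s + 0.5 * a

# Reward (assumed known to the planner) encourages driving the state to zero.
def reward(s, a):
    return -(s ** 2) - 0.1 * (a ** 2)

# 1) Generate samples by running a (random) policy.
states = rng.uniform(-1, 1, size=200)
actions = rng.uniform(-1, 1, size=200)
next_states = step(states, actions)

# 2) Fit a model of the dynamics: here a linear fit s' ≈ A s + B a by least squares.
X = np.stack([states, actions], axis=1)
(A, B), *_ = np.linalg.lstsq(X, next_states, rcond=None)

# 3) Use the model to plan (no explicit policy): random shooting over action
#    sequences, rolled out under the *learned* model; execute the first action.
def plan(s0, horizon=5, n_candidates=256):
    seqs = rng.uniform(-1, 1, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(seqs):
        s = s0
        for a in seq:
            returns[i] += reward(s, a)
            s = A * s + B * a  # learned dynamics, not the true simulator
    return seqs[np.argmax(returns)][0]  # MPC-style: take only the first action

a0 = plan(1.0)
```

Since the toy system is noiseless and linear, the least-squares fit recovers the true coefficients, and from $s_0 = 1$ the planner picks a negative action to push the state toward zero.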

## Value function based algorithms

e.g.

- Q-Learning, DQN

- Temporal Difference Learning

- Fitted Value Iteration

Generate samples (run the policy) ⇒

Fit a model of $V(s)$ or $Q(s,a)$ ⇒

Then improve the policy (set $\pi(s) = \argmax_a Q(s,a)$)
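As a concrete sketch of the value-based recipe, here is tabular Q-learning on a tiny made-up 2-state MDP (the MDP, hyperparameters, and step count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP: action 1 moves toward / keeps you in
# state 1, and being in state 1 yields reward 1.
P = np.array([[[1.0, 0.0],    # s=0, a=0 -> stay in s=0
               [0.0, 1.0]],   # s=0, a=1 -> go to s=1
              [[1.0, 0.0],    # s=1, a=0 -> back to s=0
               [0.0, 1.0]]])  # s=1, a=1 -> stay in s=1
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])    # reward depends on the current state
gamma, alpha = 0.9, 0.1

Q = np.zeros((2, 2))
s = 0
for _ in range(5000):
    # epsilon-greedy behavior policy (Q-learning is off-policy)
    a = rng.integers(2) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next = rng.choice(2, p=P[s, a])
    # TD target uses max over actions -> estimates the *optimal* Q-function
    target = R[s, a] + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

# improve the policy: pi(s) = argmax_a Q(s, a)
pi = np.argmax(Q, axis=1)
```

Note there is no explicit policy object anywhere; the greedy policy is read off the fitted Q-table at the end.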

## Direct Policy Gradients

e.g.

- REINFORCE

- Natural Policy Gradient

- Trust Region Policy Optimization (TRPO)

Generate samples (run the policy) ⇒

Estimate the return ⇒

Improve the policy by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
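The three steps can be sketched with REINFORCE on a toy one-step problem (the two-armed bandit setup, step size, and noise model are illustrative assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step bandit: 2 actions, action 1 pays more on average.
# The policy is a softmax over per-action logits theta.
true_rewards = np.array([0.0, 1.0])
theta = np.zeros(2)
alpha = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    # 1) generate samples (run the policy)
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    # 2) estimate the return (here just the noisy immediate reward)
    ret = true_rewards[a] + 0.1 * rng.standard_normal()
    # 3) improve the policy: theta <- theta + alpha * grad log pi(a) * return
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0  # gradient of log softmax(theta)[a] w.r.t. theta
    theta += alpha * grad_log_pi * ret

final_probs = softmax(theta)
```

The update is the score-function (likelihood-ratio) estimator of $\nabla_\theta J(\theta)$, so the probability of the higher-reward action climbs toward 1.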

## Actor-critic

e.g.

- Asynchronous Advantage Actor-Critic (A3C)

- Soft actor-critic (SAC)

Generate samples ⇒

Fit a model of $V(s)$ or $Q(s,a)$ ⇒

Improve the policy
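A minimal one-step actor-critic sketch on a made-up 2-state chain (the environment and hyperparameters are illustrative assumptions): the critic is a TD(0) value table, and the actor is a per-state softmax updated with the TD error as the advantage estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state chain: the chosen action *is* the next state,
# and being in state 1 yields reward 1.
R = np.array([0.0, 1.0])      # reward for the current state
gamma = 0.9
theta = np.zeros((2, 2))      # actor: per-state action logits
V = np.zeros(2)               # critic: state-value estimates
alpha_pi, alpha_v = 0.05, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for _ in range(5000):
    probs = softmax(theta[s])
    a = rng.choice(2, p=probs)
    s_next = a                 # deterministic transition: action picks the next state
    r = R[s]
    # critic: TD(0) update; the TD error doubles as the advantage estimate
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * td_error
    # actor: policy gradient step weighted by the critic's TD error
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_pi * grad_log_pi * td_error
    s = s_next

pi = np.argmax(theta, axis=1)
```

Unlike pure policy gradient, the return is not estimated from raw rollouts; the critic's $V(s)$ supplies it via the TD error, which is what "fit a model of $V(s)$, then improve the policy" means here.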

## Tradeoff between algorithms

### Sample efficiency

“How many samples for a good policy?”

- Most important question: is the algorithm off-policy?
  - Off-policy means being able to improve the policy without generating new samples from that policy
  - On-policy means we need to generate new samples even if the policy changes only a little bit

### Stability & Ease of use

- Does our policy converge, and if it does, to what?

- Value function fitting:
  - Fixed-point iteration
  - At best, minimizes error of fit (“Bellman error”)
    - Not the same as expected reward
  - At worst, doesn’t optimize anything
    - Many popular DRL value-fitting algorithms are not guaranteed to converge to anything in the nonlinear case

- Model-based RL:
  - The model is not optimized for expected reward
  - The model minimizes error of fit
    - This will converge
  - But there is no guarantee that a better model means a better policy

- Policy gradient:
  - The only one that performs gradient descent/ascent on the true objective

### Different assumptions

- Stochastic or deterministic environments?

- Continuous or discrete (states and actions)?

- Episodic (finite $T$) or infinite horizon?

- Different things are easy or hard in different settings
  - Is it easier to represent the policy?
  - Is it easier to represent the model?

Common Assumptions:

- Full observability
  - Generally assumed by value function fitting methods
  - Can be mitigated by adding recurrence

- Episodic learning
  - Often assumed by pure policy gradient methods
  - Assumed by some model-based RL methods
  - Although not assumed by other methods, they tend to work better under this assumption

- Continuity or smoothness
  - Assumed by some continuous value function learning methods
  - Often assumed by some model-based RL methods

# Exploration

# Offline RL (Batch RL / fully off-policy RL)

# RL Theory

# Variational Inference

No RL content in this chapter, but it connects heavily to RL algorithms.

## Control as an Inference Problem

# Inverse Reinforcement Learning

# Transfer Learning & Meta-Learning

# Open Problems

# Guest Lectures