Value-based (and Q-) Learning

Value Function / Q Function Fitting (Value-Based)

Can we omit policy gradient completely from actor-critic methods?

Actor-critic follows the approximate policy gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \, A^{\pi}(s_{i,t},a_{i,t})$$

But the best policy can just be extracted from $\argmax_{a_t} A^{\pi}(s_t,a_t)$:

$$\pi'(a_t|s_t)=\begin{cases} 1 &\quad \text{if } a_t = \argmax_{a_t} A^{\pi}(s_t,a_t) \\ 0 &\quad \text{otherwise} \end{cases}$$

We know that $\pi'$ is as good as $\pi$, and probably better.

Same as before,

$$A^\pi(s,a)=r(s,a)+\gamma\, \mathbb{E}[V^{\pi}(s')] - V^{\pi}(s)$$
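As a tiny concrete illustration, extracting this greedy policy from a (made-up) tabular advantage is just a per-state argmax; the table values below are invented:

```python
import numpy as np

# Hypothetical advantage table A[s, a] for 3 states and 2 actions (made-up numbers).
A = np.array([[ 0.0,  0.5],
              [ 1.2, -0.3],
              [-0.1,  0.4]])

# Greedy policy extraction: pi'(a|s) puts all probability on argmax_a A(s, a).
greedy_actions = A.argmax(axis=1)             # best action index per state
pi_prime = np.zeros_like(A)
pi_prime[np.arange(A.shape[0]), greedy_actions] = 1.0
print(pi_prime)                               # deterministic policy, one-hot over actions per state
```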

Policy Iteration (Fitted Value Iteration)

Fitting the value function is very easy (a least-squares regression) if we know the transition dynamics… but what if we don't know the transition dynamics?

Policy Iteration without transition probability (Fitted Q-Iteration)

What if we change the policy evaluation step to

$$Q^\pi(s,a) \leftarrow r(s,a)+\gamma\, \mathbb{E}_{s' \sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]$$

Now ⇒ if we have an $(s,a,r,s')$ tuple, then even if the policy $\pi'$ changes its action in state $s$, we can still do policy evaluation on that tuple, because the only place the policy shows up is inside the expectation.

This simple change lets us run policy-iteration-style algorithms without actually knowing the transition dynamics.

Instead of using $\mathbb{E}[V(s_i')]$, we use $\max_{a'} Q_\phi(s_i',a')$: the value function $V(s_i')$ is approximated by $\max_{a'} Q_\phi(s_i',a')$, and instead of taking an expectation over possible future states, we use the next state $s_i'$ observed in the sample.
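A minimal sketch of this fitted Q-iteration loop, assuming a generic supervised regressor `q_model` with `fit(X, y)` / `predict(X)` methods and a fixed off-policy dataset of $(s,a,r,s')$ tuples; the names and the regressor interface are assumptions, not something prescribed by the notes:

```python
import numpy as np

def fitted_q_iteration(S, A, R, S2, q_model, n_actions, gamma=0.99, n_iters=50):
    """S, A, R, S2: arrays of sampled (s, a, r, s') tuples (off-policy data).
    q_model: any regressor with fit(X, y) / predict(X); this interface is assumed."""
    X = np.concatenate([S, A[:, None]], axis=1)       # regression inputs: (state, action)
    y = R.copy()                                      # first targets: reward only
    for _ in range(n_iters):
        q_model.fit(X, y)                             # projection step: least-squares fit to targets
        # Bellman backup: y_i = r_i + gamma * max_a' Q_phi(s'_i, a')
        q_next = np.stack(
            [q_model.predict(np.concatenate([S2, np.full((len(S2), 1), a)], axis=1))
             for a in range(n_actions)], axis=1)
        y = R + gamma * q_next.max(axis=1)
    return q_model
```

Here `q_model` could be any supervised regressor (a tree ensemble, a small neural network wrapper, etc.).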

Special Cases of Fitted Q-Iteration

During the exploration phase, using the greedy policy may not be the best choice (we want to maximize entropy!)

“Epsilon Greedy”

$$\pi(a_t|s_t)=\begin{cases} 1-\epsilon &\quad \text{if } a_t = \argmax_{a_t} Q_\phi(s_t,a_t) \\ \epsilon /(|\mathcal{A}|-1) &\quad \text{otherwise} \end{cases}$$

“An epsilon chance of exploring options other than the optimal action.”

Common practice is to vary epsilon during training (early on, when Q is bad, use a larger epsilon).

“Exponential”

$$\pi(a_t|s_t) \propto \exp(Q_\phi(s_t,a_t))$$
The best action will be the most frequent, but some probability is left for exploration. Moreover, the action with the second-largest Q value is sampled more often than clearly worse options.
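A small sketch of both exploration rules, given a vector of Q values for one state; the temperature parameter in the softmax version is an added knob for illustration, not something from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With prob. 1 - eps pick argmax_a Q(s, a); otherwise a uniformly random other action."""
    if rng.random() < epsilon:
        others = np.delete(np.arange(len(q_values)), q_values.argmax())
        return int(rng.choice(others))
    return int(q_values.argmax())

def boltzmann(q_values, temperature=1.0):
    """pi(a|s) proportional to exp(Q(s, a) / T); better actions are sampled more often."""
    logits = q_values / temperature
    probs = np.exp(logits - logits.max())     # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 0.8, -2.0])
print(epsilon_greedy(q), boltzmann(q))
```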

Value Function Learning Theory

For the tabular value iteration algorithm:

The Bellman operator $B$ captures the logic of updating $V(s)$: $(BV)(s) = \max_a \big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(s'|s,a)}[V(s')] \big]$

And two key facts:

$V^*$, the value function of the optimal policy, is a fixed point of $B$

$V^*$ always exists, is always unique, and always corresponds to the optimal policy

Does fixed-point iteration converge to $V^*$?

We can prove that value iteration reaches $V^*$ because $B$ is a contraction.

Contraction means: for any $V$ and $\bar{V}$, we have $\|BV-B\bar{V}\|_\infty \le \gamma\|V-\bar{V}\|_\infty$, where $\gamma \in (0,1)$.

Meaning that $V$ and $\bar{V}$ get closer and closer as we apply $B$ to them.

If we replace $\bar{V}$ with $V^*$ (and use $BV^* = V^*$),

$$\|BV-V^*\|_\infty \le \gamma \|V-V^*\|_\infty$$
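A tiny tabular demo of this contraction on a made-up 2-state, 2-action MDP (all transition probabilities and rewards below are invented for illustration):

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[a, s, s'] transition probabilities, r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
              [[0.1, 0.9], [0.7, 0.3]]])     # action 1
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def bellman(V):
    # (BV)(s) = max_a [ r(s, a) + gamma * sum_s' p(s'|s, a) V(s') ]
    return (r + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)

V_star = np.zeros(2)
for _ in range(2000):                         # iterate long enough to numerically reach the fixed point
    V_star = bellman(V_star)

V = np.zeros(2)
for k in range(10):
    print(k, np.max(np.abs(V - V_star)))      # infinity-norm distance shrinks by about gamma each step
    V = bellman(V)
```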

For the fitted value iteration algorithm:

Step 1 basically applies the Bellman operator $B$.

We will define a new operator for step 2: $\Pi$ (Pi).

Step 2 performs the following:

“Find a value function in the hypothesis space $\Omega$ of our value function model that optimizes the objective”

$$V' \leftarrow \argmin_{V' \in \Omega} \frac{1}{2} \sum_s \|V'(s)-(BV)(s)\|^2$$

So step 2 applies a model-fitting operator $\Pi$ that projects $BV$ onto the hypothesis space $\Omega$. Our iteration algorithm can now be described as

$$V \leftarrow \Pi B V$$

$B$ is still a contraction with respect to the $\infty$-norm:

$$\|BV-B\bar{V}\|_\infty \le \gamma\|V-\bar{V}\|_\infty$$

$\Pi$ is a contraction w.r.t. the $\ell_2$-norm:

$$\|\Pi V- \Pi \bar{V}\|^2 \le \|V - \bar{V} \|^2$$

However, $\Pi B$ is not a contraction of any kind.

So fitted value iteration does not converge in general and often does not converge in practice

The same goes for fitted Q-iteration and batch actor-critic.
🧙🏽‍♂️
Bad theoretical properties, but works in practice lol

But why? Isn’t value function fitting just gradient descent?

Not quite: we don’t propagate gradients through the target values, which themselves depend on the parameters. We can turn this into a true gradient descent algorithm by also differentiating through the targets ⇒ the resulting method is the residual gradient algorithm, which has poor numerical properties and doesn’t work well in practice.
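To make the fitted update $V \leftarrow \Pi B V$ from above concrete, here is a minimal sketch where $\Omega$ is the set of linear functions of a single hand-picked state feature and $\Pi$ is an ordinary least-squares fit; the MDP and the feature values are made up for illustration:

```python
import numpy as np

gamma = 0.9
# Same style of made-up MDP as above: P[a, s, s'] transitions, r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])

Phi = np.array([[1.0], [2.0]])   # one linear feature per state (hypothetical); Omega = {Phi @ w}
w = np.zeros(1)

for _ in range(50):
    V = Phi @ w                                                    # current V inside Omega
    BV = (r + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)    # step 1: Bellman backup B
    w, *_ = np.linalg.lstsq(Phi, BV, rcond=None)                   # step 2: projection Pi (least squares)

print(Phi @ w)                                                     # fitted V after iterating V <- Pi B V
```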

Correlation Problem in Q-Learning

⚠️
Sequential states are strongly correlated, and the target value is always changing (the network is chasing its own tail).

Think about a sequence

In the first few steps, the fitted values are kind of “overfit” to those few transitions.

Viewed as supervised learning of the Q / value function, the data points used to compute the targets at any moment come only from nearby, local states.

We can mitigate this by collecting data from multiple workers at the same time (synchronized parallel Q-learning or asynchronous parallel Q-learning).

Or we use a replay buffer (fitted Q-learning is basically an off-policy method) ⇒ samples are no longer correlated, and each batch contains multiple samples (low-variance gradient).

But where does the data come from?

  1. Need to periodically feed the replay buffer
⚠️
In the full fitted Q-iteration algorithm, we set $\phi \leftarrow \argmin_{\phi} \frac{1}{2} \sum_i \|Q_\phi(s_i,a_i)-y_i\|^2$. But do we actually want to run this all the way to the argmin when the samples have high variance?

Maybe just take one gradient step instead:

$$\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi}{d\phi}(s_i,a_i)\,\big(Q_\phi(s_i,a_i)-y_i\big)$$

Usually ⇒ $K \in [1, 4]$, $N \in [10000, 50000]$

A target network makes sure that we’re not trying to hit a moving target ⇒ now it looks more like supervised regression.

🔥
Popular alternative target network update (Polyak averaging): $\phi' \leftarrow \tau \phi' + (1-\tau) \phi$, with $\tau = 0.999$ working well.
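A minimal PyTorch-style sketch (not the exact DQN recipe) of the inner loop with a replay buffer, a target network, and the Polyak-style update above; the network sizes, buffer contents, and all hyperparameters except $\tau = 0.999$ are placeholders:

```python
import random, collections
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, tau = 4, 2, 0.99, 0.999    # placeholder sizes / hyperparameters
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = collections.deque(maxlen=50_000)             # replay buffer of (s, a, r, s2, done) tuples

def q_update(batch_size=32):
    s, a, r, s2, done = zip(*random.sample(buffer, batch_size))
    s, s2 = torch.tensor(s, dtype=torch.float32), torch.tensor(s2, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r, done = torch.tensor(r, dtype=torch.float32), torch.tensor(done, dtype=torch.float32)

    with torch.no_grad():                                   # target computed with the slow target network
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_phi(s_i, a_i)
    loss = nn.functional.smooth_l1_loss(q_sa, y)            # Huber-style loss on the Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():                                   # Polyak / moving-average target update
        for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
            p_targ.mul_(tau).add_((1 - tau) * p)
```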

Question: regarding the “chasing its own tail” relationship, is there a typical scenario where the Q function chases its tail and exhibits bad behavior?

Classic Deep Q-Learning Algorithm (DQN)

A special case (with specific parameter choices) of Q-learning with a replay buffer and a target network.

A general view of Q-Learning

Online Q-learning (last lecture): evict each transition immediately; process 1 (data collection), process 2 (target update), and process 3 (Q-function regression) all run at the same speed.

DQN: Process 1 and process 3 run at the same speed, process 2 is slow.

Fitted Q-iteration: process 3 runs in the inner loop of process 2, which runs in the inner loop of process 1.

Overestimation in Q-Learning and Double Q-Learning

Predicted Q values are not accurate (they are much, much larger than the true values).
🔥
Why does the Q function think it’s going to get systematically larger values than the true values?

Because of the target value:

$$y_j = r_j + \gamma \underbrace{\max_{a_j '} Q_{\phi '}(s_j ', a_j ')}_{\text{this last term is the problem}}$$

Imagine two random variables $X_1, X_2$.

We can prove:

$$\mathbb{E}[\max(X_1,X_2)] \ge \max(\mathbb{E}[X_1],\mathbb{E}[X_2])$$

But:

$Q_{\phi '}(s',a')$ is not perfect; it is “noisy” ⇒ the $\max$ in the target systematically selects the positive errors during training.
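A quick numerical illustration of the effect: even when all actions have the same true value, the max over noisy estimates is biased upward (the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(5)                                           # 5 actions, all with true value 0
noisy_q = true_q + rng.normal(scale=1.0, size=(100_000, 5))    # noisy estimates of Q(s', a')

print(noisy_q.max(axis=1).mean())   # E[max_a' noisy Q] comes out around +1.16, not 0
print(noisy_q.mean(axis=0).max())   # max_a' E[noisy Q] is roughly 0, as it should be
```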

Note that

$$\max_{a'}Q_{\phi '}(s',a')=Q_{\phi '}\Big(s',\underbrace{\argmax_{a'} Q_{\phi '}(s',a')}_{\text{action selected according to } Q_{\phi '}}\Big)$$

So use “double Q-Learning” ⇒ use two networks

Update $\phi_A$ using $\phi_B$ as the target (value estimator) and $\phi_A$ as the action selector, and do the opposite for the $\phi_B$ update.

$$Q_{\phi_A}(s,a) \leftarrow r+\gamma\, Q_{\phi_B}\big(s',\argmax_{a'} Q_{\phi_A}(s',a')\big)$$

$$Q_{\phi_B}(s,a) \leftarrow r + \gamma\, Q_{\phi_A}\big(s', \argmax_{a'} Q_{\phi_B}(s',a')\big)$$
Intuition: an action that looks “optimal” under $Q_{\phi_A}$ may look optimal only because of noise, but the noise in $Q_{\phi_B}$ for that same action is independent, so its value estimate is not systematically inflated ⇒ the “overestimation” effect is mitigated.

In practice, instead of training two brand-new Q functions, we use the two networks that we already have.

We already have $\phi$ and $\phi'$.

In standard Q-learning:

$$y = r+\gamma\, Q_{\phi '}\big(s', \argmax_{a'} Q_{\phi '}(s',a')\big)$$

Double Q-Learning:

$$y = r+\gamma\, Q_{\phi '}\big(s',\argmax_{a'} Q_{\phi}(s',a')\big)$$
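In code, the only change from the standard target is which network selects the action. A minimal sketch, reusing the hypothetical `q_net` / `target_net` naming from the earlier replay-buffer sketch:

```python
import torch

def double_q_target(r, s2, done, q_net, target_net, gamma=0.99):
    """y = r + gamma * Q_phi'(s', argmax_a' Q_phi(s', a')): current net selects, target net evaluates."""
    with torch.no_grad():
        best_a = q_net(s2).argmax(dim=1, keepdim=True)          # action chosen by phi
        q_eval = target_net(s2).gather(1, best_a).squeeze(1)    # value estimated by phi'
        return r + gamma * (1 - done) * q_eval
```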

Multi-step returns

Q-learning target

$$y_{j,t} = r_{j,t} + \gamma \max_{a_{j,t+1}} Q_{\phi '}(s_{j,t+1},a_{j,t+1})$$

Early in training, the term $\gamma \max_{a_{j,t+1}} Q_{\phi '}(s_{j,t+1},a_{j,t+1})$ is mostly random noise because the network is still bad, so $r_{j,t}$ carries most of the signal. Later in training, as $Q_{\phi'}$ gets better and better, this term dominates (it summarizes many future rewards rather than a single-step reward) and becomes the more important one.

So, as in actor-critic algorithms, we can use something like a “Monte Carlo” sum of rewards over the next $N$ steps (a code sketch follows the list below):

$$y_{j,t} = \sum_{t' =t}^{t+N-1} \gamma^{t'-t}r_{j,t'} + \gamma^N \max_{a_{j,t+N}} Q_{\phi '}(s_{j,t+N},a_{j,t+N})$$
  1. Higher variance
  1. but lower bias
    1. Faster learning, especially early on
  1. but only correct when learning on-policy
    1. Beyond the first step, it matters which actions were actually taken, because the exploration policy that collected the data might differ from the current policy
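A minimal sketch of computing this $N$-step target from one stored trajectory, simply ignoring the off-policy issue noted in the last bullet; the array layout and the function name are assumptions:

```python
import numpy as np

def n_step_target(rewards, q_next_max, t, N, gamma=0.99):
    """rewards: r_{j,t}, r_{j,t+1}, ... along one stored trajectory.
    q_next_max: precomputed max_a' Q_phi'(s_{j,t+N}, a') at the bootstrap state.
    Returns y_{j,t} = sum_{t'=t}^{t+N-1} gamma^(t'-t) r_{j,t'} + gamma^N * q_next_max."""
    discounts = gamma ** np.arange(N)
    return float(np.dot(discounts, rewards[t:t + N]) + gamma ** N * q_next_max)
```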

How to fix the “on-policy” limitation?

  1. Ignore the problem (😂 Yeah this is SOOOO CS)
    1. Often works very well
  1. Cut the trace
    1. dynamically choose N to get only on-policy data
    1. Works well when data mostly on-policy, and action space is small
  1. Importance sampling
    1. “Safe and efficient off-policy reinforcement learning. “ Munos et al. ‘16

Extending Q-Learning to continuous action spaces

Problem with continuous actions:

  1. Our policy has to select $\argmax_a Q_\phi(s,a)$
  1. When evaluating the target $y_j$ for training $Q_\phi$, we also need an $\argmax$
    1. Particularly problematic (since this runs in the inner loop!)

Options:

  1. Continuous Optimization Procedure
    1. Gradient-based optimization (like SGD) is a bit slow in the inner loop
    1. The action space is typically low-dimensional ⇒ no real need for gradient-based optimization
  1. Stochastic optimization
    1. Uses the fact that action space is low-dimensional (easy to solve for optima)
  1. Use a function class that is easy to optimize
    1. Quadratic function?
      1. Gu, Lillicrap, Sutskever, L., ICML 2016
        1. Normalized Advantage Functions (NAF) architecture
    1. No change to algorithm
    1. Just as efficient as Q-Learning
    1. But loses representational power
  1. Learn an approximate maximizer
    1. Learn a function maximizer using neural nets
    1. DDPG (Lillicrap et al. ICLR 2016)
      1. “Deterministic” actor-critic (really approximate Q-Learning)
      1. $\max_{a} Q_{\phi} (s,a) = Q_\phi (s, \argmax_a Q_\phi (s,a))$
        1. The idea is to train another network $\mu_\theta(s)$ such that $\mu_\theta(s) \approx \argmax_a Q_{\phi}(s,a)$
        1. $\frac{dQ_\phi}{d\theta} = \frac{da}{d\theta} \frac{d Q_{\phi}}{da}$ (see the sketch after this list)
    1. classic: NFQCA
    1. recent: TD3 and SAC
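A minimal sketch of the DDPG-style actor update referenced above: train $\mu_\theta(s)$ to (approximately) maximize $Q_\phi(s, \mu_\theta(s))$ by backpropagating $\frac{dQ_\phi}{da}\frac{da}{d\theta}$; network shapes and hyperparameters are placeholders, not values from the notes:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                       # placeholder dimensions
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(s_batch):
    """Gradient ascent on Q_phi(s, mu_theta(s)) w.r.t. theta; only the actor's parameters are
    updated here (gradients also reach q_net, but actor_opt does not apply them)."""
    a = actor(s_batch)                                      # a = mu_theta(s)
    q = q_net(torch.cat([s_batch, a], dim=1)).mean()        # scalar objective: average Q(s, mu(s))
    actor_opt.zero_grad()
    (-q).backward()                                         # minimize -Q == maximize Q, via dQ/da * da/dtheta
    actor_opt.step()

actor_update(torch.randn(32, obs_dim))                      # usage with a random batch of states
```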

Stochastic Optimization

Simple Solution

maxaQ(s,a)max{Q(s,a1),,Q(s,aN)}\max_{a} Q(s,a) \approx \max \{ Q(s,a_1), \dots, Q(s,a_N) \}

Where (a1,,aN)(a_1, \dots, a_N) are sampled from some distribution (e.g. uniform)

  1. Dead simple
  1. Efficiently parallelizable
  1. But - not very accurate

More accurate (works OK for up to ~40 dims):

  1. Cross-entropy method (CEM)
    1. simple iterative stochastic optimization
    1. sample actions from a distribution as in the simple method, then refit the distribution around the best samples and repeat, sampling from a smaller and smaller region (see the sketch after this list)
  1. CMA-ES
    1. substantially less simple iterative stochastic optimization
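A minimal sketch of CEM over the action for a single state, as referenced above; it assumes a callable `q_fn(actions)` that evaluates $Q(s,a)$ for a batch of candidate actions, and the hyperparameters are arbitrary:

```python
import numpy as np

def cem_argmax(q_fn, act_dim, n_iters=4, n_samples=64, n_elite=8, rng=None):
    """Approximate argmax_a Q(s, a): sample candidates, keep the elites, refit a Gaussian, repeat."""
    rng = rng or np.random.default_rng()
    mean, std = np.zeros(act_dim), np.ones(act_dim)
    for _ in range(n_iters):
        samples = mean + std * rng.standard_normal((n_samples, act_dim))
        elites = samples[np.argsort(q_fn(samples))[-n_elite:]]   # best-scoring candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Usage on a toy quadratic "Q function" peaked at a = (1, -2):
best_a = cem_argmax(lambda a: -((a - np.array([1.0, -2.0]))**2).sum(axis=1), act_dim=2)
print(best_a)   # should approach [1, -2]
```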

Implementation Tips

Instability of Q-learning: one common fix is the Huber loss (interpolating between squared error and absolute-value loss).
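For reference, one standard form of the Huber loss on an error $x$, with threshold $\delta$:

$$L_\delta(x) = \begin{cases} \frac{1}{2}x^2 &\quad \text{if } |x| \le \delta \\ \delta\big(|x| - \frac{1}{2}\delta\big) &\quad \text{otherwise} \end{cases}$$

It is quadratic for small errors and linear for large ones, so big Bellman errors don't produce huge gradients.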
  1. Q-learning takes some care to stabilize
    1. Test on easy, reliable tasks first, make sure the implementation is correct
    1. Large replay buffers help improve stability
      1. Looks more like fitted Q-iteration
    1. Takes time, be patient - might be no better than random for a while
    1. Start with high exploration (epsilon) and gradually reduce
    1. Bellman error gradients can be big; clip gradients or use Huber loss
  1. Double Q-learning helps a lot in practice
    1. Simple & no downsides!
  1. N-step returns also help a lot
    1. but have some downsides (will systematically bias the objective)
  1. Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too
  1. Run multiple random seeds; it's very inconsistent between runs

Suggested Readings

  1. Classic papers
    1. Watkins. (1989). Learning from delayed rewards: introduces Q-learning
    1. Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks
  1. Deep reinforcement learning Q-learning papers
    1. Lange, Riedmiller. (2010). Deep auto-encoder neural networks in reinforcement learning: early image-based Q-learning method using autoencoders to construct embeddings
    1. Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
    1. Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
    1. Lillicrap et al. (2016). Continuous control with deep reinforcement learning: continuous Q-learning with actor network for approximate maximization.
    1. Gu, Lillicrap, Sutskever, L. (2016). Continuous deep Q-learning with model-based acceleration: continuous Q-learning with action-quadratic value functions.
    1. Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.