Value-based (and Q-) Learning

Value Function / Q Function Fitting (Value-Based)

Can we omit policy gradient completely from actor-critic methods?

Actor-critic follows the approximate policy gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \, A^{\pi}(s_{i,t},a_{i,t})$$

But the best policy can just be extracted from $\argmax_{a_t} A^{\pi}(s_t,a_t)$:

$$\pi'(a_t|s_t)=\begin{cases} 1 &\quad \text{if } a_t = \argmax_{a_t} A^{\pi}(s_t,a_t) \\ 0 &\quad \text{otherwise} \end{cases}$$

We know that $\pi'$ is as good as $\pi$, and probably better.

Same as before,

$$A^\pi(s,a)=r(s,a)+\gamma\, \mathbb{E}[V^{\pi}(s')] - V^{\pi}(s)$$
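As a tiny concrete illustration, extracting this greedy policy from a (made-up) tabular advantage is just a per-state argmax; the table values below are invented:

```python
import numpy as np

# Hypothetical advantage table A[s, a] for 3 states and 2 actions (made-up numbers).
A = np.array([[ 0.0,  0.5],
              [ 1.2, -0.3],
              [-0.1,  0.4]])

# Greedy policy extraction: pi'(a|s) puts all probability on argmax_a A(s, a).
greedy_actions = A.argmax(axis=1)             # best action index per state
pi_prime = np.zeros_like(A)
pi_prime[np.arange(A.shape[0]), greedy_actions] = 1.0
print(pi_prime)                               # deterministic policy, one-hot over actions per state
```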

Policy Iteration (Fitted Value Iteration)

Fitting the value function is very easy (a least-squares regression) if we know the transition dynamics… but what if we don't know the transition dynamics?

Policy Iteration without transition probability (Fitted Q-Iteration)

What if we change the policy evaluation step to

$$Q^\pi(s,a) \leftarrow r(s,a)+\gamma\, \mathbb{E}_{s' \sim p(s'|s,a)}[Q^\pi(s',\pi(s'))]$$

Now ⇒ if we have an $(s,a,r,s')$ tuple, then even if the policy $\pi'$ changes its action in state $s$, we can still do policy evaluation on that tuple, because the only place the policy shows up is inside the expectation.

This simple change lets us run policy-iteration-style algorithms without actually knowing the transition dynamics.

Instead of using $\mathbb{E}[V(s_i')]$, we use $\max_{a'} Q_\phi(s_i',a')$: the value function $V(s_i')$ is approximated by $\max_{a'} Q_\phi(s_i',a')$, and instead of taking an expectation over possible future states, we use the next state $s_i'$ observed in the sample.
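A minimal sketch of this fitted Q-iteration loop, assuming a generic supervised regressor `q_model` with `fit(X, y)` / `predict(X)` methods and a fixed off-policy dataset of $(s,a,r,s')$ tuples; the names and the regressor interface are assumptions, not something prescribed by the notes:

```python
import numpy as np

def fitted_q_iteration(S, A, R, S2, q_model, n_actions, gamma=0.99, n_iters=50):
    """S, A, R, S2: arrays of sampled (s, a, r, s') tuples (off-policy data).
    q_model: any regressor with fit(X, y) / predict(X); this interface is assumed."""
    X = np.concatenate([S, A[:, None]], axis=1)       # regression inputs: (state, action)
    y = R.copy()                                      # first targets: reward only
    for _ in range(n_iters):
        q_model.fit(X, y)                             # projection step: least-squares fit to targets
        # Bellman backup: y_i = r_i + gamma * max_a' Q_phi(s'_i, a')
        q_next = np.stack(
            [q_model.predict(np.concatenate([S2, np.full((len(S2), 1), a)], axis=1))
             for a in range(n_actions)], axis=1)
        y = R + gamma * q_next.max(axis=1)
    return q_model
```

Here `q_model` could be any supervised regressor (a tree ensemble, a small neural network wrapper, etc.).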

Special Cases of Fitted Q-Iteration

During the exploration phase, using the greedy policy may not be the best choice (we want to maximize entropy!)

“Epsilon Greedy”

$$\pi(a_t|s_t)=\begin{cases} 1-\epsilon &\quad \text{if } a_t = \argmax_{a_t} Q_\phi(s_t,a_t) \\ \epsilon /(|\mathcal{A}|-1) &\quad \text{otherwise} \end{cases}$$

“An epsilon chance of exploring options other than the optimal action.”

Common practice is to vary epsilon during training (early on, when Q is bad, use a larger epsilon).

“Exponential”

$$\pi(a_t|s_t) \propto \exp(Q_\phi(s_t,a_t))$$
The best action will be the most frequent, but some probability is left for exploration. Moreover, the action with the second-largest Q value is sampled more often than clearly worse options.
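A small sketch of both exploration rules, given a vector of Q values for one state; the temperature parameter in the softmax version is an added knob for illustration, not something from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With prob. 1 - eps pick argmax_a Q(s, a); otherwise a uniformly random other action."""
    if rng.random() < epsilon:
        others = np.delete(np.arange(len(q_values)), q_values.argmax())
        return int(rng.choice(others))
    return int(q_values.argmax())

def boltzmann(q_values, temperature=1.0):
    """pi(a|s) proportional to exp(Q(s, a) / T); better actions are sampled more often."""
    logits = q_values / temperature
    probs = np.exp(logits - logits.max())     # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 0.8, -2.0])
print(epsilon_greedy(q), boltzmann(q))
```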

Value Function Learning Theory

For the tabular value iteration algorithm:

The Bellman operator $B$ captures the logic of updating $V(s)$: $(BV)(s) = \max_a \big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(s'|s,a)}[V(s')] \big]$

And two key facts:

$V^*$, the value function of the optimal policy, is a fixed point of $B$

$V^*$ always exists, is always unique, and always corresponds to the optimal policy

Does fixed-point iteration converge to $V^*$?

We can prove that value iteration reaches $V^*$ because $B$ is a contraction.

Contraction means: for any $V$ and $\bar{V}$, we have $\|BV-B\bar{V}\|_\infty \le \gamma\|V-\bar{V}\|_\infty$, where $\gamma \in (0,1)$.

Meaning that $V$ and $\bar{V}$ get closer and closer as we apply $B$ to them.

If we replace $\bar{V}$ with $V^*$ (and use $BV^* = V^*$),

$$\|BV-V^*\|_\infty \le \gamma \|V-V^*\|_\infty$$
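A tiny tabular demo of this contraction on a made-up 2-state, 2-action MDP (all transition probabilities and rewards below are invented for illustration):

```python
import numpy as np

gamma = 0.9
# Made-up MDP: P[a, s, s'] transition probabilities, r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
              [[0.1, 0.9], [0.7, 0.3]]])     # action 1
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def bellman(V):
    # (BV)(s) = max_a [ r(s, a) + gamma * sum_s' p(s'|s, a) V(s') ]
    return (r + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)

V_star = np.zeros(2)
for _ in range(2000):                         # iterate long enough to numerically reach the fixed point
    V_star = bellman(V_star)

V = np.zeros(2)
for k in range(10):
    print(k, np.max(np.abs(V - V_star)))      # infinity-norm distance shrinks by about gamma each step
    V = bellman(V)
```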

For the fitted value iteration algorithm:

Step 1 basically applies the Bellman operator $B$.

We will define a new operator for step 2: $\Pi$ (Pi).

Step 2 performs the following:

“Find a value function in the hypothesis space $\Omega$ of our value function model that optimizes the objective”

$$V' \leftarrow \argmin_{V' \in \Omega} \frac{1}{2} \sum_s \|V'(s)-(BV)(s)\|^2$$

So step 2 applies a model-fitting operator $\Pi$ that projects $BV$ onto the hypothesis space $\Omega$. Our iteration algorithm can now be described as

$$V \leftarrow \Pi B V$$

$B$ is still a contraction with respect to the $\infty$-norm:

$$\|BV-B\bar{V}\|_\infty \le \gamma\|V-\bar{V}\|_\infty$$

$\Pi$ is a contraction w.r.t. the $\ell_2$-norm:

$$\|\Pi V- \Pi \bar{V}\|^2 \le \|V - \bar{V} \|^2$$

However, $\Pi B$ is not a contraction of any kind.

So fitted value iteration does not converge in general and often does not converge in practice

The same goes for fitted Q-iteration and batch actor-critic.
🧙🏽‍♂️
Bad theoretical properties, but works in practice lol

But why? Isn’t value function fitting just gradient descent?

Not quite: we don’t propagate gradients through the target values, which themselves depend on the parameters. We can turn this into a true gradient descent algorithm by also differentiating through the targets ⇒ the resulting method is the residual gradient algorithm, which has poor numerical properties and doesn’t work well in practice.
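To make the fitted update $V \leftarrow \Pi B V$ from above concrete, here is a minimal sketch where $\Omega$ is the set of linear functions of a single hand-picked state feature and $\Pi$ is an ordinary least-squares fit; the MDP and the feature values are made up for illustration:

```python
import numpy as np

gamma = 0.9
# Same style of made-up MDP as above: P[a, s, s'] transitions, r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])

Phi = np.array([[1.0], [2.0]])   # one linear feature per state (hypothetical); Omega = {Phi @ w}
w = np.zeros(1)

for _ in range(50):
    V = Phi @ w                                                    # current V inside Omega
    BV = (r + gamma * np.einsum('ast,t->sa', P, V)).max(axis=1)    # step 1: Bellman backup B
    w, *_ = np.linalg.lstsq(Phi, BV, rcond=None)                   # step 2: projection Pi (least squares)

print(Phi @ w)                                                     # fitted V after iterating V <- Pi B V
```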

Correlation Problem in Q-Learning

⚠️
Sequential states are strongly correlated, and the target value is always changing (the network is chasing its own tail).

Think about a sequence

In the first few steps, the fitted values are kind of “overfit” to those few transitions.

Viewed as supervised learning of the Q / value function, the data points used to compute the targets at any moment come only from nearby, local states.

We can mitigate this by collecting data from multiple workers at the same time (synchronized parallel Q-learning or asynchronous parallel Q-learning).

Or we use a replay buffer (fitted Q-learning is basically an off-policy method) ⇒ samples are no longer correlated, and each batch contains multiple samples (low-variance gradient).

But where does the data come from?

  1. Need to periodically feed the replay buffer
⚠️
In the full fitted Q-iteration algorithm, we set $\phi \leftarrow \argmin_{\phi} \frac{1}{2} \sum_i \|Q_\phi(s_i,a_i)-y_i\|^2$. But do we actually want to run this all the way to the argmin when the samples have high variance?

Maybe just take one gradient step instead:

$$\phi \leftarrow \phi - \alpha \sum_i \frac{dQ_\phi}{d\phi}(s_i,a_i)\,\big(Q_\phi(s_i,a_i)-y_i\big)$$

Usually ⇒ $K \in [1, 4]$, $N \in [10000, 50000]$

A target network makes sure that we’re not trying to hit a moving target ⇒ now it looks more like supervised regression.

🔥
Popular alternative target network update (Polyak averaging): $\phi' \leftarrow \tau \phi' + (1-\tau) \phi$, with $\tau = 0.999$ working well.
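A minimal PyTorch-style sketch (not the exact DQN recipe) of the inner loop with a replay buffer, a target network, and the Polyak-style update above; the network sizes, buffer contents, and all hyperparameters except $\tau = 0.999$ are placeholders:

```python
import random, collections
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, tau = 4, 2, 0.99, 0.999    # placeholder sizes / hyperparameters
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = collections.deque(maxlen=50_000)             # replay buffer of (s, a, r, s2, done) tuples

def q_update(batch_size=32):
    s, a, r, s2, done = zip(*random.sample(buffer, batch_size))
    s, s2 = torch.tensor(s, dtype=torch.float32), torch.tensor(s2, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r, done = torch.tensor(r, dtype=torch.float32), torch.tensor(done, dtype=torch.float32)

    with torch.no_grad():                                   # target computed with the slow target network
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q_phi(s_i, a_i)
    loss = nn.functional.smooth_l1_loss(q_sa, y)            # Huber-style loss on the Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():                                   # Polyak / moving-average target update
        for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
            p_targ.mul_(tau).add_((1 - tau) * p)
```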

Question: regarding the “chasing its own tail” relationship, is there a typical scenario where the Q function chases its tail and exhibits bad behavior?

Classic Deep Q-Learning Algorithm (DQN)

A special case (with specific parameter choices) of Q-learning with a replay buffer and a target network.

A general view of Q-Learning

Online Q-learning (last lecture): evict each transition immediately; process 1 (data collection), process 2 (target update), and process 3 (Q-function regression) all run at the same speed.

DQN: Process 1 and process 3 run at the same speed, process 2 is slow.

Fitted Q-iteration: process 3 runs in the inner loop of process 2, which runs in the inner loop of process 1.

Overestimation in Q-Learning and Double Q-Learning

Predicted Q values are not accurate (they are much, much larger than the true values).
🔥
Why does the Q function think it’s going to get systematically larger values than the true values?

Because of the target value:

$$y_j = r_j + \gamma \underbrace{\max_{a_j '} Q_{\phi '}(s_j ', a_j ')}_{\text{this last term is the problem}}$$

Imagine two random variables $X_1, X_2$.

We can prove:

$$\mathbb{E}[\max(X_1,X_2)] \ge \max(\mathbb{E}[X_1],\mathbb{E}[X_2])$$

But:

$Q_{\phi '}(s',a')$ is not perfect; it is “noisy” ⇒ the $\max$ in the target systematically selects the positive errors during training.
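A quick numerical illustration of the effect: even when all actions have the same true value, the max over noisy estimates is biased upward (the numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.zeros(5)                                           # 5 actions, all with true value 0
noisy_q = true_q + rng.normal(scale=1.0, size=(100_000, 5))    # noisy estimates of Q(s', a')

print(noisy_q.max(axis=1).mean())   # E[max_a' noisy Q] comes out around +1.16, not 0
print(noisy_q.mean(axis=0).max())   # max_a' E[noisy Q] is roughly 0, as it should be
```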

Note that

$$\max_{a'}Q_{\phi '}(s',a')=Q_{\phi '}\Big(s',\underbrace{\argmax_{a'} Q_{\phi '}(s',a')}_{\text{action selected according to } Q_{\phi '}}\Big)$$

So use “double Q-Learning” ⇒ use two networks

Update $\phi_A$ using $\phi_B$ as the target (value estimator) and $\phi_A$ as the action selector, and do the opposite for the $\phi_B$ update.

$$Q_{\phi_A}(s,a) \leftarrow r+\gamma\, Q_{\phi_B}\big(s',\argmax_{a'} Q_{\phi_A}(s',a')\big)$$

$$Q_{\phi_B}(s,a) \leftarrow r + \gamma\, Q_{\phi_A}\big(s', \argmax_{a'} Q_{\phi_B}(s',a')\big)$$
Intuition: an action that looks “optimal” under $Q_{\phi_A}$ may look optimal only because of noise, but the noise in $Q_{\phi_B}$ for that same action is independent, so its value estimate is not systematically inflated ⇒ the “overestimation” effect is mitigated.

In practice, instead of training two brand-new Q functions, we use the two networks that we already have.

We already have $\phi$ and $\phi'$.

In standard Q-learning:

$$y = r+\gamma\, Q_{\phi '}\big(s', \argmax_{a'} Q_{\phi '}(s',a')\big)$$

Double Q-Learning:

$$y = r+\gamma\, Q_{\phi '}\big(s',\argmax_{a'} Q_{\phi}(s',a')\big)$$
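In code, the only change from the standard target is which network selects the action. A minimal sketch, reusing the hypothetical `q_net` / `target_net` naming from the earlier replay-buffer sketch:

```python
import torch

def double_q_target(r, s2, done, q_net, target_net, gamma=0.99):
    """y = r + gamma * Q_phi'(s', argmax_a' Q_phi(s', a')): current net selects, target net evaluates."""
    with torch.no_grad():
        best_a = q_net(s2).argmax(dim=1, keepdim=True)          # action chosen by phi
        q_eval = target_net(s2).gather(1, best_a).squeeze(1)    # value estimated by phi'
        return r + gamma * (1 - done) * q_eval
```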

Multi-step returns

Q-learning target

$$y_{j,t} = r_{j,t} + \gamma \max_{a_{j,t+1}} Q_{\phi '}(s_{j,t+1},a_{j,t+1})$$

Early in training, the term $\gamma \max_{a_{j,t+1}} Q_{\phi '}(s_{j,t+1},a_{j,t+1})$ is mostly random noise because the network is still bad, so $r_{j,t}$ carries most of the signal. Later in training, as $Q_{\phi'}$ gets better and better, this term dominates (it summarizes many future rewards rather than a single-step reward) and becomes the more important one.

So, as in actor-critic algorithms, we can use something like a “Monte Carlo” sum of rewards over the next $N$ steps (a code sketch follows the list below):

$$y_{j,t} = \sum_{t' =t}^{t+N-1} \gamma^{t'-t}r_{j,t'} + \gamma^N \max_{a_{j,t+N}} Q_{\phi '}(s_{j,t+N},a_{j,t+N})$$
  1. Higher variance
  1. but lower bias
    1. Faster learning, especially early on
  1. but only correct when learning on-policy
    1. Beyond the first step, it matters which actions were actually taken, because the exploration policy that collected the data might differ from the current policy
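A minimal sketch of computing this $N$-step target from one stored trajectory, simply ignoring the off-policy issue noted in the last bullet; the array layout and the function name are assumptions:

```python
import numpy as np

def n_step_target(rewards, q_next_max, t, N, gamma=0.99):
    """rewards: r_{j,t}, r_{j,t+1}, ... along one stored trajectory.
    q_next_max: precomputed max_a' Q_phi'(s_{j,t+N}, a') at the bootstrap state.
    Returns y_{j,t} = sum_{t'=t}^{t+N-1} gamma^(t'-t) r_{j,t'} + gamma^N * q_next_max."""
    discounts = gamma ** np.arange(N)
    return float(np.dot(discounts, rewards[t:t + N]) + gamma ** N * q_next_max)
```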

How to fix the “on-policy” limitation?

  1. Ignore the problem (😂 Yeah this is SOOOO CS)
    1. Often works very well
  1. Cut the trace
    1. dynamically choose N to get only on-policy data
    1. Works well when data mostly on-policy, and action space is small
  1. Importance sampling
    1. “Safe and efficient off-policy reinforcement learning. “ Munos et al. ‘16

Extending Q-Learning to continuous action spaces

Problem with continuous actions:

  1. Our policy has to select $\argmax_a Q_\phi(s,a)$
  1. When evaluating the target $y_j$ for training $Q_\phi$, we also need an $\argmax$
    1. Particularly problematic (since this runs in the inner loop!)

Options:

  1. Continuous Optimization Procedure
    1. Gradient-based optimization (like SGD) is a bit slow in the inner loop
    1. The action space is typically low-dimensional ⇒ no real need for gradient-based optimization
  1. Stochastic optimization
    1. Uses the fact that action space is low-dimensional (easy to solve for optima)
  1. Use a function class that is easy to optimize
    1. Quadratic function?
      1. Gu, Lillicrap, Sutskever, L., ICML 2016
        1. Normalized Advantage Functions (NAF) architecture
    1. No change to algorithm
    1. Just as efficient as Q-Learning
    1. But loses representational power
  1. Learn an approximate maximizer
    1. Learn a function maximizer using neural nets
    1. DDPG (Lillicrap et al. ICLR 2016)
      1. “Deterministic” actor-critic (really approximate Q-Learning)
      1. $\max_{a} Q_{\phi} (s,a) = Q_\phi (s, \argmax_a Q_\phi (s,a))$
        1. The idea is to train another network $\mu_\theta(s)$ such that $\mu_\theta(s) \approx \argmax_a Q_{\phi}(s,a)$
        1. $\frac{dQ_\phi}{d\theta} = \frac{da}{d\theta} \frac{d Q_{\phi}}{da}$ (see the sketch after this list)
    1. classic: NFQCA
    1. recent: TD3 and SAC
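A minimal sketch of the DDPG-style actor update referenced above: train $\mu_\theta(s)$ to (approximately) maximize $Q_\phi(s, \mu_\theta(s))$ by backpropagating $\frac{dQ_\phi}{da}\frac{da}{d\theta}$; network shapes and hyperparameters are placeholders, not values from the notes:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                       # placeholder dimensions
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(s_batch):
    """Gradient ascent on Q_phi(s, mu_theta(s)) w.r.t. theta; only the actor's parameters are
    updated here (gradients also reach q_net, but actor_opt does not apply them)."""
    a = actor(s_batch)                                      # a = mu_theta(s)
    q = q_net(torch.cat([s_batch, a], dim=1)).mean()        # scalar objective: average Q(s, mu(s))
    actor_opt.zero_grad()
    (-q).backward()                                         # minimize -Q == maximize Q, via dQ/da * da/dtheta
    actor_opt.step()

actor_update(torch.randn(32, obs_dim))                      # usage with a random batch of states
```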

Stochastic Optimization

Simple Solution

maxaQ(s,a)max{Q(s,a1),,Q(s,aN)}\max_{a} Q(s,a) \approx \max \{ Q(s,a_1), \dots, Q(s,a_N) \}

Where (a1,,aN)(a_1, \dots, a_N) are sampled from some distribution (e.g. uniform)

  1. Dead simple
  1. Efficiently parallelizable
  1. But - not very accurate

More accurate (works OK for up to ~40 dims):

  1. Cross-entropy method (CEM)
    1. simple iterative stochastic optimization
    1. sample actions from a distribution as in the simple method, then refit the distribution around the best samples and repeat, sampling from a smaller and smaller region (see the sketch after this list)
  1. CMA-ES
    1. substantially less simple iterative stochastic optimization
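A minimal sketch of CEM over the action for a single state, as referenced above; it assumes a callable `q_fn(actions)` that evaluates $Q(s,a)$ for a batch of candidate actions, and the hyperparameters are arbitrary:

```python
import numpy as np

def cem_argmax(q_fn, act_dim, n_iters=4, n_samples=64, n_elite=8, rng=None):
    """Approximate argmax_a Q(s, a): sample candidates, keep the elites, refit a Gaussian, repeat."""
    rng = rng or np.random.default_rng()
    mean, std = np.zeros(act_dim), np.ones(act_dim)
    for _ in range(n_iters):
        samples = mean + std * rng.standard_normal((n_samples, act_dim))
        elites = samples[np.argsort(q_fn(samples))[-n_elite:]]   # best-scoring candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

# Usage on a toy quadratic "Q function" peaked at a = (1, -2):
best_a = cem_argmax(lambda a: -((a - np.array([1.0, -2.0]))**2).sum(axis=1), act_dim=2)
print(best_a)   # should approach [1, -2]
```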

Implementation Tips

Instability of Q-learning: one common fix is the Huber loss (interpolating between squared error and absolute-value loss).
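For reference, one standard form of the Huber loss on an error $x$, with threshold $\delta$:

$$L_\delta(x) = \begin{cases} \frac{1}{2}x^2 &\quad \text{if } |x| \le \delta \\ \delta\big(|x| - \frac{1}{2}\delta\big) &\quad \text{otherwise} \end{cases}$$

It is quadratic for small errors and linear for large ones, so big Bellman errors don't produce huge gradients.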
  1. Q-learning takes some care to stabilize
    1. Test on easy, reliable tasks first, make sure the implementation is correct
    1. Large replay buffers help improve stability
      1. Looks more like fitted Q-iteration
    1. Takes time, be patient - might be no better than random for a while
    1. Start with high exploration (epsilon) and gradually reduce
    1. Bellman error gradients can be big; clip gradients or use Huber loss
  1. Double Q-learning helps a lot in practice
    1. Simple & no downsides!
  1. N-step returns also help a lot
    1. but have some downsides (will systematically bias the objective)
  1. Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too
  1. Run multiple random seeds; it's very inconsistent between runs

Suggested Readings

  1. Classic papers
    1. Watkins. (1989). Learning from delayed rewards: introduces Q-learning
    1. Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks
  1. Deep reinforcement learning Q-learning papers
    1. Lange, Riedmiller. (2010). Deep auto-encoder neural networks in reinforcement learning: early image-based Q-learning method using autoencoders to construct embeddings
    1. Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
    1. Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
    1. Lillicrap et al. (2016). Continuous control with deep reinforcement learning: continuous Q-learning with actor network for approximate maximization.
    1. Gu, Lillicrap, Sutskever, L. (2016). Continuous deep Q-learning with model-based acceleration: continuous Q-learning with action-quadratic value functions.
    1. Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.