Actor-critic Methods

Actor-Critic Method

Recall from the policy gradient that

$$
\begin{split}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \\
&\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\underbrace{\Big(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\Big)}_{\text{reward to go } \hat{Q}_{i,t}}
\end{split}
$$

$\hat{Q}_{i,t}$ is a single-sample estimate of the expected reward-to-go if we take action $a_{i,t}$ in state $s_{i,t}$

But $\hat{Q}_{i,t}$ currently has very high variance

  1. $\hat{Q}_{i,t}$ only takes into account one specific chain of state-action pairs
    1. This is because we approximated the gradient by stripping away the expectation and using a single sample

We can get a better estimate by using the true expected reward-to-go

$$
Q^{\pi}(s_t,a_t)=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]
$$
What about baselines? Can we still apply a baseline if we have the true Q-function?

$$
V^{\pi}(s_t)=\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[Q^{\pi}(s_t,a_t)]
$$

It turns out that we can do better (lower variance) than the constant baseline $b_t = \frac{1}{N} \sum_i Q(s_{i,t},a_{i,t})$, because with the value function the baseline can depend on the state.

So now our gradient becomes

$$
\begin{split}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\underbrace{\big(Q^{\pi}(s_{i,t},a_{i,t})-V^{\pi}(s_{i,t})\big)}_{\text{how much better $a_{i,t}$ is than the average action}} \\
&\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^{\pi}(s_{i,t},a_{i,t})
\end{split}
$$

We will also name something new, the advantage function: how much better the action $a_t$ is than the average action under the policy.

$$
A^{\pi}(s_t,a_t)=Q^{\pi}(s_t,a_t)-V^{\pi}(s_t)
$$

The better the estimate of $A^{\pi}$, the lower the variance of $\nabla_\theta J(\theta)$.
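A minimal sketch of this estimator as a surrogate loss in PyTorch (the names `log_probs` and `advantages` are illustrative; the advantages are assumed to be precomputed):

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is the advantage-weighted policy gradient.

    log_probs:  log pi_theta(a_{i,t} | s_{i,t}) for the sampled actions, shape [N*T]
    advantages: estimates of A^pi(s_{i,t}, a_{i,t}),                     shape [N*T]
    """
    # Detach advantages: gradients flow only through log pi_theta.
    return -(log_probs * advantages.detach()).mean()
```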

Fitting Q/Value/Advantage Functions

But the problem is: which one should we fit, $Q^{\pi}$, $V^{\pi}$, or $A^{\pi}$?

Let’s do some approximation and find out

$$
\begin{split}
Q^{\pi}(s_t,a_t)&=r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t,a_t)}[V^{\pi}(s_{t+1})] \\
&\approx r(s_t,a_t)+\gamma V^{\pi}(s_{t+1})
\end{split}
$$
$$
A^{\pi}(s_t,a_t) \approx r(s_t,a_t)+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)
$$

We will introduce the discount factor $\gamma$ later; just take $\gamma=1$ for now.

So we see that it is enough to fit $V^{\pi}$ and use it to approximate both $Q^{\pi}$ and $A^{\pi}$.

$V^{\pi}$ is relatively easier to fit because it does not involve the action; it depends only on the state.

🔥
Note that actor-critic can also fit $Q^{\pi}$.

Policy Evaluation

This is what policy gradient does:

$$
V^{\pi}(s_t) \approx \sum_{t'=t}^T r(s_{t'},a_{t'})
$$

Ideally we want to estimate the expectation of rewards by averaging over multiple rollouts from the same state:

$$
V^{\pi}(s_t) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t'=t}^T r(s_{t'},a_{t'})
$$

But in a model-free setting we cannot reset back to a state and run multiple trials from it.

So…

Monte Carlo policy evaluation
Use empirical returns to train a value function to approximate the expectation
We can use a neural net

Instead of plugging those rewards directly into the policy gradient, we fit a model to them ⇒ this reduces variance

Because even though we cannot visit the same state twice, the function approximator will combine information from similar states

And we can of course use standard supervised losses (e.g. MSE)

Ideal target

$$
y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t},a_{i,t})+\sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t+1}]
$$

Monte Carlo target:

$$
y_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})
$$

Training data would be:

$$
\Big\{\Big(s_{i,t},\ \underbrace{\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})}_{y_{i,t}}\Big)\Big\}
$$
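A small sketch of building these targets for a single sampled trajectory (numpy; `rewards` and `states` are illustrative names):

```python
import numpy as np

def monte_carlo_targets(rewards: np.ndarray) -> np.ndarray:
    """Reward-to-go targets y_t = sum_{t' >= t} r_t' for one trajectory."""
    targets = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        targets[t] = running
    return targets

# Training pairs (s_t, y_t) for supervised regression of V^pi:
# states: [T, state_dim], rewards: [T]
# dataset = list(zip(states, monte_carlo_targets(rewards)))
```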

We can do even better (bootstrapped estimate):

Hmm… it looks like we can modify the ideal target a bit

$$
y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t}, a_{i,t}) + V^{\pi}(s_{i,t+1})
$$

Since we don't know $V^{\pi}$, we approximate it with $\hat{V}_{\phi}^{\pi}(s_{i,t+1})$, our previously fitted value function approximator (a bootstrapped estimate)

$$
y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t}, a_{i,t}) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})
$$

So now training data:

$$
\Big\{\Big(s_{i,t},\ r(s_{i,t},a_{i,t})+\hat{V}^{\pi}_{\phi}(s_{i,t+1})\Big)\Big\}
$$
🔥
Lower variance than Monte Carlo evaluation, but higher bias (because $\hat{V}_{\phi}^{\pi}$ might be incorrect)
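A sketch of one regression step on these bootstrapped targets, assuming a small torch value network `v_phi(states) -> [T, 1]` (an illustrative interface, with $\gamma=1$ for now as above):

```python
import torch
import torch.nn.functional as F

def fit_value_bootstrapped(v_phi, optimizer, states, rewards, next_states, gamma=1.0):
    """One regression step on bootstrapped targets y_t = r_t + gamma * V_phi(s_{t+1})."""
    with torch.no_grad():
        # Targets reuse the current value estimate -> lower variance, but some bias.
        targets = rewards + gamma * v_phi(next_states).squeeze(-1)
    loss = F.mse_loss(v_phi(states).squeeze(-1), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```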

Batch actor-critic algorithm

⚠️
The fitted value function is not guaranteed to converge, for the same reason discussed in the section "Value Function Learning Theory"
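Putting the pieces so far together, one iteration of the batch actor-critic algorithm might look like the following sketch (reusing the earlier `policy_gradient_loss` and `fit_value_bootstrapped` sketches; `sample_trajectories`, `policy`, and `v_phi` are hypothetical helpers, not code from the lecture):

```python
import torch

def batch_actor_critic_step(policy, v_phi, policy_opt, value_opt, env, gamma=1.0):
    """One iteration of batch actor-critic (sketch; helper interfaces are assumed)."""
    # 1. Sample a batch of transitions {(s, a, r, s')} by running pi_theta in the env.
    states, actions, rewards, next_states = sample_trajectories(env, policy)

    # 2. Fit V_phi^pi (here with the bootstrapped regression sketch from above).
    fit_value_bootstrapped(v_phi, value_opt, states, rewards, next_states, gamma)

    # 3. Evaluate advantages A^pi(s, a) ~= r + gamma * V_phi(s') - V_phi(s).
    with torch.no_grad():
        advantages = (rewards + gamma * v_phi(next_states).squeeze(-1)
                      - v_phi(states).squeeze(-1))

    # 4.-5. Advantage-weighted policy gradient step.
    log_probs = policy.log_prob(states, actions)
    loss = policy_gradient_loss(log_probs, advantages)
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```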

Discount Factor

If $T \rightarrow \infty$, $\hat{V}_\phi^\pi$ (the approximator for $V^{\pi}$) can become infinitely large in many cases, so we adjust the reward to prefer "sooner rather than later".

$$
y_{i,t} \approx r(s_{i,t},a_{i,t}) + \gamma \hat{V}_\phi^\pi(s_{i,t+1})
$$

where $\gamma \in [0,1]$ is the discount factor: it makes the reward you receive decay at every timestep ⇒ so the total obtainable reward over an infinite lifetime is actually bounded.

One way to understand how $\gamma$ affects the policy: $\gamma$ effectively adds a "death state" that, once you enter it, you can never leave (and in which no further reward is received).

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\Big(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\Big)
$$
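The discounted reward-to-go inside the parentheses can be computed with one backward pass per trajectory; a small numpy sketch:

```python
import numpy as np

def discounted_reward_to_go(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """Computes sum_{t' >= t} gamma^(t'-t) * r_t' for every t of one trajectory."""
    out = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```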

Online Actor-critic Algorithm

The online version works one transition at a time: take an action $a \sim \pi_\theta(a|s)$, observe $(s,a,r,s')$, update $\hat{V}_\phi^\pi$ towards the bootstrapped target $r+\gamma\hat{V}_\phi^\pi(s')$, evaluate $\hat{A}^\pi(s,a)=r+\gamma\hat{V}_\phi^\pi(s')-\hat{V}_\phi^\pi(s)$, and take a policy gradient step with $\nabla_\theta \log \pi_\theta(a|s)\,\hat{A}^\pi(s,a)$.

In practice: a single-sample update like this is very noisy, so the updates are usually batched across parallel (synchronized or asynchronous) workers.
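A sketch of one such online update from a single transition $(s,a,r,s')$, reusing the hypothetical `policy` and `v_phi` modules from the earlier sketches:

```python
import torch
import torch.nn.functional as F

def online_actor_critic_step(policy, v_phi, policy_opt, value_opt,
                             s, a, r, s_next, gamma=0.99):
    """One online actor-critic update from a single transition. Sketch only."""
    # Critic: regress V_phi(s) onto the bootstrapped target r + gamma * V_phi(s').
    with torch.no_grad():
        target = r + gamma * v_phi(s_next)
    value_loss = F.mse_loss(v_phi(s), target)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Actor: one-step advantage A ~= r + gamma * V_phi(s') - V_phi(s).
    with torch.no_grad():
        advantage = r + gamma * v_phi(s_next) - v_phi(s)
    policy_loss = -(policy.log_prob(s, a) * advantage).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```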

Off-policy Actor-critic algorithms

Idea: collect data, but instead of training on it directly, put it into a replay buffer. At training time, instead of using only the data just collected, sample randomly from the replay buffer.

Starting from the online algorithm, let's see what problems we need to fix:

(1) Under the current policy, our policy might not even have taken the action $a_i$, so we cannot assume we would receive the reward $r(s_i,a_i,s_i')$ ⇒ we may not even arrive at state $s_i'$

(2) For the same reason, the action $a_i$ used in the policy gradient may not be one the current policy would have taken


We can fix problem (1) by learning $Q^{\pi}(s_t,a_t)$ instead ⇒ replace the term $\gamma \hat{V}_\phi^\pi(s_i')$

Now

$$
\mathcal{L}(\phi)=\frac{1}{N}\sum_i \big\|\hat{Q}_\phi^\pi(s_i,a_i)-y_i\big\|^2
$$

And we replace the target value

$$
y_i = r_i + \gamma \hat{Q}^\pi_\phi \big(s_i',\ \underbrace{a_i'^{\pi}}_{\text{sampled from the current policy: } a_i'^{\pi} \sim \pi_\theta(a'|s_i')}\big)
$$

Same for (2): sample an action from the current policy, $a_i^\pi \sim \pi_\theta(a|s_i)$, rather than using the action from the replay buffer

And instead of plugging in the advantage function, we use $\hat{Q}^{\pi}$ directly:

$$
\begin{split}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta (a_i^\pi|s_i)\, \hat{A}^{\pi}(s_i,a_i^\pi) \\
&\approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta (a_i^\pi|s_i)\, \underbrace{\hat{Q}^{\pi}(s_i,a_i^\pi)}_{\text{higher variance}}
\end{split}
$$

It's fine to have higher variance (no baseline), because this is easier, and we no longer need to generate more states ⇒ we can just sample more actions

In exchange use a larger batch size ⇒ all good!

🔥
Still one problem left ⇒ $s_i$ did not come from $p_\theta(s)$ ⇒ there is nothing we can do about this, but it's not too bad: we originally wanted the optimal policy on $p_\theta(s)$, and instead we get the optimal policy on a broader distribution of states.

Now our final result:
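Roughly, the resulting off-policy update might look like the following sketch, under the assumptions above (`replay_buffer.sample`, `policy.sample`, `policy.log_prob`, and the fitted `q_phi` are illustrative interfaces; stabilizers such as target networks are omitted):

```python
import torch
import torch.nn.functional as F

def off_policy_actor_critic_step(policy, q_phi, policy_opt, q_opt, replay_buffer,
                                 batch_size=256, gamma=0.99):
    """One off-policy actor-critic update from replay data. Sketch only."""
    s, a, r, s_next = replay_buffer.sample(batch_size)

    # Critic: y_i = r_i + gamma * Q_phi(s'_i, a'_i), with a'_i ~ pi_theta(.|s'_i).
    with torch.no_grad():
        a_next = policy.sample(s_next)
        target = r + gamma * q_phi(s_next, a_next).squeeze(-1)
    q_loss = F.mse_loss(q_phi(s, a).squeeze(-1), target)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Actor: re-sample actions from the current policy and weight by Q_phi directly
    # (higher variance than an advantage; compensated with a large batch).
    a_pi = policy.sample(s)  # treated as a fixed sample, no reparameterization
    actor_loss = -(policy.log_prob(s, a_pi)
                   * q_phi(s, a_pi).squeeze(-1).detach()).mean()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()
```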

Example Practical Algorithm: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.

Critics as state-dependent baselines

In actor-critic:

$$
\begin{split}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\,\big(Q^{\pi}(s_{i,t},a_{i,t})-V^{\pi}(s_{i,t})\big) \\
&\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^{\pi}(s_{i,t},a_{i,t})
\end{split}
$$

This method of using a fitted model to approximate the value/Q/advantage function:

  1. Lowers variance
  1. Biased as long as the critic is not perfect

In policy gradient:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\Big(\Big(\sum_{t'=t}^T \gamma^{t'-t}r(s_{i,t'},a_{i,t'})\Big)-b\Big)
$$

This method:

  1. No bias
  1. High variance

So can we use a state-dependent baseline to keep the estimator unbiased while reducing the variance a bit?

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \Big(\Big(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\Big)-\hat{V}_\phi^\pi(s_{i,t})\Big)
$$
Not only does the policy gradient remain unbiased when you subtract any constant $b$; it also remains unbiased when you subtract any function that depends only on the state $s_{i,t}$ (and not on the action)
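A sketch of this baselined estimator as a surrogate loss, reusing the `discounted_reward_to_go` helper from above and a fitted `v_phi` as the state-dependent baseline (illustrative interfaces again):

```python
import torch

def baselined_policy_gradient_loss(policy, v_phi, states, actions, rewards, gamma=0.99):
    """Monte Carlo returns minus a state-dependent baseline V_phi(s). Sketch only."""
    returns = torch.as_tensor(discounted_reward_to_go(rewards, gamma),
                              dtype=torch.float32)
    with torch.no_grad():
        baseline = v_phi(states).squeeze(-1)           # depends on the state only
    log_probs = policy.log_prob(states, actions)
    return -(log_probs * (returns - baseline)).mean()  # unbiased, lower variance
```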

Control variates: methods that use state- and action-dependent baselines

$$
\hat{A}^{\pi}(s_t,a_t) = \sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_t)
$$
  1. No bias
  1. Higher variance (because single-sample estimate)

$$
\hat{A}^\pi(s_t,a_t)=\sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-Q_\phi^\pi(s_t,a_t)
$$
  1. Goes to zero in expectation if critic is correct
  1. If critic is not correct, bomb shakalaka
    1. The expectation integrates to an error term that needs to be compensated for

To account for the error in the baseline, we modify the estimator to:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\big(\hat{Q}_{i,t}-Q_\phi^\pi(s_{i,t},a_{i,t})\big) + \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \mathbb{E}_{a_t \sim \pi_\theta(a_t|s_{i,t})}\big[Q^{\pi}_\phi(s_{i,t},a_t)\big]
$$
This uses a critic without introducing bias, provided the second term can be evaluated (Gu et al. 2016, Q-Prop)

Eligibility traces & n-step returns

Again, the critic-based advantage estimator:

$$
\hat{A}_C^\pi(s_t,a_t)=r(s_t,a_t)+\gamma \hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)
$$
  1. Lower variance
  1. Higher bias if the value estimate is wrong

The Monte Carlo Advantage Estimator

$$
\hat{A}_{MC}^\pi(s_t,a_t) = \sum_{t'=t}^\infty \gamma^{t'-t} r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_t)
$$
  1. Unbiased
  1. Higher variance (single-sample estimate)

Can we get something in between?

Facts:

  1. The (discounted) rewards are smaller as $t' \rightarrow \infty$
    1. So bias is much less of a problem when $t'$ is large
  1. Variance is more of a problem for rewards far in the future

$$
\hat{A}_n^\pi(s_t,a_t)=\sum_{t'=t}^{t+n} \gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^{n}\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_t)
$$
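A sketch of the $n$-step estimator for one trajectory, assuming `rewards` holds the per-step rewards and `values[t]` holds $\hat{V}_\phi^\pi(s_t)$ (with one extra entry for the state after the last step); it uses the common convention of $n$ reward terms followed by the $\gamma^n$ bootstrap:

```python
import numpy as np

def n_step_advantage(rewards, values, n=5, gamma=0.99):
    """n-step advantage estimates for every t of one trajectory (sketch).

    rewards: [T] array of r(s_t, a_t)
    values:  [T+1] array with values[t] ~= V_phi(s_t), values[T] for the final next state
    """
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(t + n, T)
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        ret += gamma ** (horizon - t) * values[horizon]  # bootstrap with the value estimate
        adv[t] = ret - values[t]
    return adv
```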

Generalized advantage estimation

The n-step advantage estimator is good, but can we generalize it (make a hybrid)?

Instead of a hard cut between the two estimators, why don't we combine all of the n-step estimators?

$$
\hat{A}_{GAE}^{\pi}(s_t,a_t) = \sum_{n=1}^\infty w_n \hat{A}_n^\pi(s_t,a_t)
$$

where $\hat{A}_n^\pi$ stands for the n-step estimator

“Most prefer cutting earlier (less variance)”

So we set $w_n \propto \lambda^{n-1}$ ⇒ exponential falloff

Then

$$
\hat{A}_{GAE}^\pi(s_t,a_t)=\sum_{t'=t}^\infty (\gamma \lambda)^{t'-t} \delta_{t'}, \quad \text{where } \delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})
$$
🔥
Now we can trade off bias and variance using the parameter $\lambda$
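A sketch of computing GAE via the recursion implied by the $\delta_{t'}$ form, with the same `rewards`/`values` conventions as the $n$-step sketch:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory (sketch).

    rewards: [T], values: [T+1] with values[T] the estimate for the final next state.
    A_t = sum_{t' >= t} (gamma*lam)^(t'-t) * delta_t',
    delta_t' = r_t' + gamma * V(s_{t'+1}) - V(s_t')
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```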

Examples of actor-critic algorithms

Schulman, Moritz, Levine, Jordan, Abbeel '16. High-dimensional continuous control using generalized advantage estimation
  1. Batch-mode actor-critic
  1. Blends Monte Carlo and function approximator returns (GAE)

Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16. Asynchronous methods for deep reinforcement learning
  1. Online actor-critic, parallelized batch
  1. CNN end-to-end
  1. N-step returns with N=4
  1. Single network for actor and critic

Suggested Readings

Classic: Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation

Talks about the content covered in this class

Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
Gu, Lillicrap, Ghahramani, Turner, Levine (2017). Q-Prop: sample-efficient policy-gradient with an off-policy critic: policy gradient with Q-function control variate
https://arxiv.org/pdf/2108.08812.pdf ← Why actor-critic works better in offline training