Actor-critic Methods

Actor-Critic Method

Recall from the policy gradient that

$$
\begin{split}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \\
&\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\underbrace{\Big(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\Big)}_{\text{reward to go } \hat{Q}_{i,t}}
\end{split}
$$

$\hat{Q}_{i,t}$ is a single-sample estimate of the expected reward-to-go if we take action $a_{i,t}$ in state $s_{i,t}$

But $\hat{Q}_{i,t}$ currently has very high variance

  1. $\hat{Q}_{i,t}$ only takes into account one specific chain of state-action pairs
    1. This is because we approximated the gradient by stripping away the expectation and using a single sample

We can get a better estimate by using the true expected reward-to-go

$$
Q^{\pi}(s_t,a_t)=\sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_t,a_t]
$$
What about baselines? Can we still apply a baseline if we have the true Q-function?

$$
V^{\pi}(s_t)=\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}[Q^{\pi}(s_t,a_t)]
$$

It turns out that we can do better (lower variance) than the constant baseline $b_t = \frac{1}{N} \sum_i Q(s_{i,t},a_{i,t})$, because with the value function the baseline can depend on the state.

So now our gradient becomes

$$
\begin{split}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\underbrace{\big(Q^{\pi}(s_{i,t},a_{i,t})-V^{\pi}(s_{i,t})\big)}_{\text{how much better $a_{i,t}$ is than the average action}} \\
&\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^{\pi}(s_{i,t},a_{i,t})
\end{split}
$$

We will also name something new, the advantage function: how much better the action $a_t$ is than the average action under the policy.

$$
A^{\pi}(s_t,a_t)=Q^{\pi}(s_t,a_t)-V^{\pi}(s_t)
$$

The better the estimate of $A^{\pi}$, the lower the variance of $\nabla_\theta J(\theta)$.
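A minimal sketch of this estimator as a surrogate loss in PyTorch (the names `log_probs` and `advantages` are illustrative; the advantages are assumed to be precomputed):

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is the advantage-weighted policy gradient.

    log_probs:  log pi_theta(a_{i,t} | s_{i,t}) for the sampled actions, shape [N*T]
    advantages: estimates of A^pi(s_{i,t}, a_{i,t}),                     shape [N*T]
    """
    # Detach advantages: gradients flow only through log pi_theta.
    return -(log_probs * advantages.detach()).mean()
```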

Fitting Q/Value/Advantage Functions

But the problem is: which one should we fit, $Q^{\pi}$, $V^{\pi}$, or $A^{\pi}$?

Let’s do some approximation and find out

$$
\begin{split}
Q^{\pi}(s_t,a_t)&=r(s_t,a_t)+\gamma\,\mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t,a_t)}[V^{\pi}(s_{t+1})] \\
&\approx r(s_t,a_t)+\gamma V^{\pi}(s_{t+1})
\end{split}
$$
$$
A^{\pi}(s_t,a_t) \approx r(s_t,a_t)+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_t)
$$

We will introduce the discount factor $\gamma$ later; just take $\gamma=1$ for now.

So we see that it is enough to fit $V^{\pi}$ and use it to approximate both $Q^{\pi}$ and $A^{\pi}$.

$V^{\pi}$ is relatively easier to fit because it does not involve the action; it depends only on the state.

🔥
Note that actor-critic can also fit $Q^{\pi}$.

Policy Evaluation

This is what policy gradient does:

$$
V^{\pi}(s_t) \approx \sum_{t'=t}^T r(s_{t'},a_{t'})
$$

Ideally we want to estimate the expectation of rewards by averaging over multiple rollouts from the same state:

$$
V^{\pi}(s_t) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t'=t}^T r(s_{t'},a_{t'})
$$

But in a model-free setting we cannot reset back to a state and run multiple trials from it.

So…

Monte Carlo policy evaluation
Use empirical returns to train a value function to approximate the expectation
We can use a neural net

Instead of plugging those rewards directly into the policy gradient, we fit a model to them ⇒ this reduces variance

Because even though we cannot visit the same state twice, the function approximator will combine information from similar states

And we can of course use standard supervised losses (e.g. MSE)

Ideal target

$$
y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t},a_{i,t})+\sum_{t'=t+1}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t+1}]
$$

Monte Carlo target:

$$
y_{i,t} = \sum_{t'=t}^T r(s_{i,t'},a_{i,t'})
$$

Training data would be:

$$
\Big\{\Big(s_{i,t},\ \underbrace{\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})}_{y_{i,t}}\Big)\Big\}
$$
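A small sketch of building these targets for a single sampled trajectory (numpy; `rewards` and `states` are illustrative names):

```python
import numpy as np

def monte_carlo_targets(rewards: np.ndarray) -> np.ndarray:
    """Reward-to-go targets y_t = sum_{t' >= t} r_t' for one trajectory."""
    targets = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        targets[t] = running
    return targets

# Training pairs (s_t, y_t) for supervised regression of V^pi:
# states: [T, state_dim], rewards: [T]
# dataset = list(zip(states, monte_carlo_targets(rewards)))
```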

We can do even better (bootstrapped estimate):

Hmm… it looks like we can modify the ideal target a bit

$$
y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t}, a_{i,t}) + V^{\pi}(s_{i,t+1})
$$

Since we don't know $V^{\pi}$, we approximate it with $\hat{V}_{\phi}^{\pi}(s_{i,t+1})$, our previously fitted value function approximator (a bootstrapped estimate)

$$
y_{i,t} = \sum_{t'=t}^T \mathbb{E}_{\pi_\theta}[r(s_{t'},a_{t'})|s_{i,t}] \approx r(s_{i,t}, a_{i,t}) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})
$$

So now training data:

$$
\Big\{\Big(s_{i,t},\ r(s_{i,t},a_{i,t})+\hat{V}^{\pi}_{\phi}(s_{i,t+1})\Big)\Big\}
$$
🔥
Lower variance than Monte Carlo evaluation, but higher bias (because $\hat{V}_{\phi}^{\pi}$ might be incorrect)
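A sketch of one regression step on these bootstrapped targets, assuming a small torch value network `v_phi(states) -> [T, 1]` (an illustrative interface, with $\gamma=1$ for now as above):

```python
import torch
import torch.nn.functional as F

def fit_value_bootstrapped(v_phi, optimizer, states, rewards, next_states, gamma=1.0):
    """One regression step on bootstrapped targets y_t = r_t + gamma * V_phi(s_{t+1})."""
    with torch.no_grad():
        # Targets reuse the current value estimate -> lower variance, but some bias.
        targets = rewards + gamma * v_phi(next_states).squeeze(-1)
    loss = F.mse_loss(v_phi(states).squeeze(-1), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```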

Batch actor-critic algorithm

⚠️
The fitted value function is not guaranteed to converge, for the same reason discussed in the section "Value Function Learning Theory"
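Putting the pieces so far together, one iteration of the batch actor-critic algorithm might look like the following sketch (reusing the earlier `policy_gradient_loss` and `fit_value_bootstrapped` sketches; `sample_trajectories`, `policy`, and `v_phi` are hypothetical helpers, not code from the lecture):

```python
import torch

def batch_actor_critic_step(policy, v_phi, policy_opt, value_opt, env, gamma=1.0):
    """One iteration of batch actor-critic (sketch; helper interfaces are assumed)."""
    # 1. Sample a batch of transitions {(s, a, r, s')} by running pi_theta in the env.
    states, actions, rewards, next_states = sample_trajectories(env, policy)

    # 2. Fit V_phi^pi (here with the bootstrapped regression sketch from above).
    fit_value_bootstrapped(v_phi, value_opt, states, rewards, next_states, gamma)

    # 3. Evaluate advantages A^pi(s, a) ~= r + gamma * V_phi(s') - V_phi(s).
    with torch.no_grad():
        advantages = (rewards + gamma * v_phi(next_states).squeeze(-1)
                      - v_phi(states).squeeze(-1))

    # 4.-5. Advantage-weighted policy gradient step.
    log_probs = policy.log_prob(states, actions)
    loss = policy_gradient_loss(log_probs, advantages)
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```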

Discount Factor

If $T \rightarrow \infty$, $\hat{V}_\phi^\pi$ (the approximator for $V^{\pi}$) can become infinitely large in many cases, so we adjust the reward to prefer "sooner rather than later".

$$
y_{i,t} \approx r(s_{i,t},a_{i,t}) + \gamma \hat{V}_\phi^\pi(s_{i,t+1})
$$

where $\gamma \in [0,1]$ is the discount factor: it makes the reward you receive decay at every timestep ⇒ so the total obtainable reward over an infinite lifetime is actually bounded.

One way to understand how $\gamma$ affects the policy: $\gamma$ effectively adds a "death state" that, once you enter it, you can never leave (and in which no further reward is received).

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\Big(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\Big)
$$
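The discounted reward-to-go inside the parentheses can be computed with one backward pass per trajectory; a small numpy sketch:

```python
import numpy as np

def discounted_reward_to_go(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """Computes sum_{t' >= t} gamma^(t'-t) * r_t' for every t of one trajectory."""
    out = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out
```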

Online Actor-critic Algorithm

The online version works one transition at a time: take an action $a \sim \pi_\theta(a|s)$, observe $(s,a,r,s')$, update $\hat{V}_\phi^\pi$ towards the bootstrapped target $r+\gamma\hat{V}_\phi^\pi(s')$, evaluate $\hat{A}^\pi(s,a)=r+\gamma\hat{V}_\phi^\pi(s')-\hat{V}_\phi^\pi(s)$, and take a policy gradient step with $\nabla_\theta \log \pi_\theta(a|s)\,\hat{A}^\pi(s,a)$.

In practice: a single-sample update like this is very noisy, so the updates are usually batched across parallel (synchronized or asynchronous) workers.
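A sketch of one such online update from a single transition $(s,a,r,s')$, reusing the hypothetical `policy` and `v_phi` modules from the earlier sketches:

```python
import torch
import torch.nn.functional as F

def online_actor_critic_step(policy, v_phi, policy_opt, value_opt,
                             s, a, r, s_next, gamma=0.99):
    """One online actor-critic update from a single transition. Sketch only."""
    # Critic: regress V_phi(s) onto the bootstrapped target r + gamma * V_phi(s').
    with torch.no_grad():
        target = r + gamma * v_phi(s_next)
    value_loss = F.mse_loss(v_phi(s), target)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Actor: one-step advantage A ~= r + gamma * V_phi(s') - V_phi(s).
    with torch.no_grad():
        advantage = r + gamma * v_phi(s_next) - v_phi(s)
    policy_loss = -(policy.log_prob(s, a) * advantage).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```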

Off-policy Actor-critic algorithms

Idea: collect data, but instead of training on it directly, put it into a replay buffer. At training time, instead of using only the data just collected, sample randomly from the replay buffer.

Starting from the online algorithm, let's see what problems we need to fix:

(1) Under the current policy, our policy might not even have taken the action $a_i$, so we cannot assume we would receive the reward $r(s_i,a_i,s_i')$ ⇒ we may not even arrive at state $s_i'$

(2) For the same reason, the action $a_i$ used in the policy gradient may not be one the current policy would have taken


We can fix problem (1) by learning $Q^{\pi}(s_t,a_t)$ instead ⇒ replace the term $\gamma \hat{V}_\phi^\pi(s_i')$

Now

$$
\mathcal{L}(\phi)=\frac{1}{N}\sum_i \big\|\hat{Q}_\phi^\pi(s_i,a_i)-y_i\big\|^2
$$

And we replace the target value

$$
y_i = r_i + \gamma \hat{Q}^\pi_\phi \big(s_i',\ \underbrace{a_i'^{\pi}}_{\text{sampled from the current policy: } a_i'^{\pi} \sim \pi_\theta(a'|s_i')}\big)
$$

Same for (2): sample an action from the current policy, $a_i^\pi \sim \pi_\theta(a|s_i)$, rather than using the action from the replay buffer

And instead of plugging in the advantage function, we use $\hat{Q}^{\pi}$ directly:

$$
\begin{split}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta (a_i^\pi|s_i)\, \hat{A}^{\pi}(s_i,a_i^\pi) \\
&\approx \frac{1}{N} \sum_i \nabla_\theta \log \pi_\theta (a_i^\pi|s_i)\, \underbrace{\hat{Q}^{\pi}(s_i,a_i^\pi)}_{\text{higher variance}}
\end{split}
$$

It's fine to have higher variance (no baseline), because this is easier, and we no longer need to generate more states ⇒ we can just sample more actions

In exchange use a larger batch size ⇒ all good!

🔥
Still one problem left ⇒ $s_i$ did not come from $p_\theta(s)$ ⇒ there is nothing we can do about this, but it's not too bad: we originally wanted the optimal policy on $p_\theta(s)$, and instead we get the optimal policy on a broader distribution of states.

Now our final result:
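Roughly, the resulting off-policy update might look like the following sketch, under the assumptions above (`replay_buffer.sample`, `policy.sample`, `policy.log_prob`, and the fitted `q_phi` are illustrative interfaces; stabilizers such as target networks are omitted):

```python
import torch
import torch.nn.functional as F

def off_policy_actor_critic_step(policy, q_phi, policy_opt, q_opt, replay_buffer,
                                 batch_size=256, gamma=0.99):
    """One off-policy actor-critic update from replay data. Sketch only."""
    s, a, r, s_next = replay_buffer.sample(batch_size)

    # Critic: y_i = r_i + gamma * Q_phi(s'_i, a'_i), with a'_i ~ pi_theta(.|s'_i).
    with torch.no_grad():
        a_next = policy.sample(s_next)
        target = r + gamma * q_phi(s_next, a_next).squeeze(-1)
    q_loss = F.mse_loss(q_phi(s, a).squeeze(-1), target)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Actor: re-sample actions from the current policy and weight by Q_phi directly
    # (higher variance than an advantage; compensated with a large batch).
    a_pi = policy.sample(s)  # treated as a fixed sample, no reparameterization
    actor_loss = -(policy.log_prob(s, a_pi)
                   * q_phi(s, a_pi).squeeze(-1).detach()).mean()
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()
```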

Example Practical Algorithm: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.

Critics as state-dependent baselines

In actor-critic:

$$
\begin{split}
\nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\,\big(Q^{\pi}(s_{i,t},a_{i,t})-V^{\pi}(s_{i,t})\big) \\
&\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^{\pi}(s_{i,t},a_{i,t})
\end{split}
$$

This method of using a fitted model to approximate the value/Q/advantage function:

  1. Lowers variance
  1. Biased as long as the critic is not perfect

In policy gradient:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\Big(\Big(\sum_{t'=t}^T \gamma^{t'-t}r(s_{i,t'},a_{i,t'})\Big)-b\Big)
$$

This method:

  1. No bias
  1. High variance

So can we use a state-dependent baseline to keep the estimator unbiased while reducing the variance a bit?

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \Big(\Big(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'},a_{i,t'})\Big)-\hat{V}_\phi^\pi(s_{i,t})\Big)
$$
Not only does the policy gradient remain unbiased when you subtract any constant $b$; it also remains unbiased when you subtract any function that depends only on the state $s_{i,t}$ (and not on the action)
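A sketch of this baselined estimator as a surrogate loss, reusing the `discounted_reward_to_go` helper from above and a fitted `v_phi` as the state-dependent baseline (illustrative interfaces again):

```python
import torch

def baselined_policy_gradient_loss(policy, v_phi, states, actions, rewards, gamma=0.99):
    """Monte Carlo returns minus a state-dependent baseline V_phi(s). Sketch only."""
    returns = torch.as_tensor(discounted_reward_to_go(rewards, gamma),
                              dtype=torch.float32)
    with torch.no_grad():
        baseline = v_phi(states).squeeze(-1)           # depends on the state only
    log_probs = policy.log_prob(states, actions)
    return -(log_probs * (returns - baseline)).mean()  # unbiased, lower variance
```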

Control variates: methods that use state- and action-dependent baselines

$$
\hat{A}^{\pi}(s_t,a_t) = \sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_t)
$$
  1. No bias
  1. Higher variance (because single-sample estimate)

$$
\hat{A}^\pi(s_t,a_t)=\sum_{t'=t}^\infty \gamma^{t'-t}r(s_{t'},a_{t'})-Q_\phi^\pi(s_t,a_t)
$$
  1. Goes to zero in expectation if critic is correct
  1. If critic is not correct, bomb shakalaka
    1. The expectation integrates to an error term that needs to be compensated for

To account for the error in the baseline, we modify the estimator to:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})\big(\hat{Q}_{i,t}-Q_\phi^\pi(s_{i,t},a_{i,t})\big) + \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \nabla_\theta \mathbb{E}_{a_t \sim \pi_\theta(a_t|s_{i,t})}\big[Q^{\pi}_\phi(s_{i,t},a_t)\big]
$$
This uses a critic without introducing bias, provided the second term can be evaluated (Gu et al. 2016, Q-Prop)

Eligibility traces & n-step returns

Again, the critic-based advantage estimator:

$$
\hat{A}_C^\pi(s_t,a_t)=r(s_t,a_t)+\gamma \hat{V}_\phi^\pi(s_{t+1})-\hat{V}_\phi^\pi(s_t)
$$
  1. Lower variance
  1. Higher bias if the value estimate is wrong

The Monte Carlo Advantage Estimator

$$
\hat{A}_{MC}^\pi(s_t,a_t) = \sum_{t'=t}^\infty \gamma^{t'-t} r(s_{t'},a_{t'})-\hat{V}_\phi^\pi(s_t)
$$
  1. Unbiased
  1. Higher variance (single-sample estimate)

Can we get something in between?

Facts:

  1. The (discounted) rewards are smaller as $t' \rightarrow \infty$
    1. So bias is much less of a problem when $t'$ is large
  1. Variance is more of a problem for rewards far in the future

$$
\hat{A}_n^\pi(s_t,a_t)=\sum_{t'=t}^{t+n} \gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^{n}\hat{V}_\phi^\pi(s_{t+n})-\hat{V}_\phi^\pi(s_t)
$$
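A sketch of the $n$-step estimator for one trajectory, assuming `rewards` holds the per-step rewards and `values[t]` holds $\hat{V}_\phi^\pi(s_t)$ (with one extra entry for the state after the last step); it uses the common convention of $n$ reward terms followed by the $\gamma^n$ bootstrap:

```python
import numpy as np

def n_step_advantage(rewards, values, n=5, gamma=0.99):
    """n-step advantage estimates for every t of one trajectory (sketch).

    rewards: [T] array of r(s_t, a_t)
    values:  [T+1] array with values[t] ~= V_phi(s_t), values[T] for the final next state
    """
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(t + n, T)
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        ret += gamma ** (horizon - t) * values[horizon]  # bootstrap with the value estimate
        adv[t] = ret - values[t]
    return adv
```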

Generalized advantage estimation

The n-step advantage estimator is good, but can we generalize it (make a hybrid)?

Instead of a hard cut between the two estimators, why don't we combine all of the n-step estimators?

$$
\hat{A}_{GAE}^{\pi}(s_t,a_t) = \sum_{n=1}^\infty w_n \hat{A}_n^\pi(s_t,a_t)
$$

where $\hat{A}_n^\pi$ stands for the n-step estimator

“Most prefer cutting earlier (less variance)”

So we set $w_n \propto \lambda^{n-1}$ ⇒ exponential falloff

Then

$$
\hat{A}_{GAE}^\pi(s_t,a_t)=\sum_{t'=t}^\infty (\gamma \lambda)^{t'-t} \delta_{t'}, \quad \text{where } \delta_{t'}=r(s_{t'},a_{t'})+\gamma\hat{V}_\phi^\pi(s_{t'+1})-\hat{V}_\phi^\pi(s_{t'})
$$
🔥
Now we can trade off bias and variance using the parameter $\lambda$
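A sketch of computing GAE via the recursion implied by the $\delta_{t'}$ form, with the same `rewards`/`values` conventions as the $n$-step sketch:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory (sketch).

    rewards: [T], values: [T+1] with values[T] the estimate for the final next state.
    A_t = sum_{t' >= t} (gamma*lam)^(t'-t) * delta_t',
    delta_t' = r_t' + gamma * V(s_{t'+1}) - V(s_t')
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```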

Examples of actor-critic algorithms

Schulman, Moritz, Levine, Jordan, Abbeel '16. High-dimensional continuous control using generalized advantage estimation
  1. Batch-mode actor-critic
  1. Blends Monte Carlo and function approximator returns (GAE)

Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16. Asynchronous methods for deep reinforcement learning
  1. Online actor-critic, parallelized batch
  1. CNN end-to-end
  1. N-step returns with N=4
  1. Single network for actor and critic

Suggested Readings

Classic: Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation

Talks about the content covered in this class

Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
Gu, Lillicrap, Ghahramani, Turner, Levine (2017). Q-Prop: sample-efficient policy-gradient with an off-policy critic: policy gradient with Q-function control variate
https://arxiv.org/pdf/2108.08812.pdf ← Why actor-critic works better in offline training