Policy Gradients


"REINFORCE algorithm"

Remember: direct gradient ascent on the RL objective

$$\begin{split} \theta^* &= \argmax_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\textstyle\sum_t r(s_t,a_t)\big]}_{J(\theta)} \\ &= \argmax_{\theta} \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \end{split}$$

Note: we don't know $p(s_1)$ or $p(s_{t+1}|s_t,a_t)$, but we can still estimate the objective from samples.

$$\begin{split} J(\theta)&=\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\underbrace{\textstyle\sum_t r(s_t,a_t)}_{\text{denote as }r(\tau)}\big] \\ &= \int p_\theta(\tau)\,r(\tau)\,d\tau \\ &\approx \frac{1}{N} \sum_i \sum_t \underbrace{r(s_{i,t}, a_{i,t})}_{\text{reward at time step $t$ of the $i$-th sample}} \end{split}$$
The larger our sample size $N$ is, the more accurate our estimate will be.
$$\nabla_\theta J(\theta)=\int \nabla_\theta p_\theta(\tau)\,r(\tau)\, d\tau$$

We will use a convenient identity to rewrite this equation so that it can be evaluated without knowing the distribution $p_\theta(\tau)$:

$$p_\theta(\tau)\nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$

Using this identity, we can rewrite the equation

$$\begin{split} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\,r(\tau)\,d\tau \\ &= \int p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)\,r(\tau)\,d\tau \end{split}$$

Note:

$$p_\theta(s_1,a_1,\dots,s_T,a_T)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$
$$\log p_\theta(s_1,a_1,\dots,s_T,a_T)=\log p(s_1)+ \sum_{t=1}^T \big[\log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\big]$$

So if we consider $\nabla_\theta J(\theta)$, and in particular $\nabla_\theta \log p_\theta(\tau)$, we see that

$$\begin{split} \nabla_\theta \log p_\theta(\tau) &= \nabla_\theta\Big[\log p(s_1) + \sum_{t=1}^T \log \pi_\theta(a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\Big] \\ &= \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \end{split}$$

(The initial-state and transition terms do not depend on $\theta$, so their gradients vanish.)

Therefore,

$$\begin{split} \nabla_\theta J(\theta)&=\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \\ &\approx \frac{1}{N}\sum_{i=1}^N\Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big) \end{split}$$

Then improve the policy by gradient ascent:

$$\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$$

"REINFORCE algorithm" on continuous action spaces

e.g. we can represent $\pi_\theta(a_t|s_t)$ as $\mathcal{N}\big(f_{\text{neural network}}(s_t);\ \Sigma\big)$

$$\nabla_{\theta} \log \pi_\theta(a_t|s_t)=-\frac{1}{2}\Sigma^{-1}(f(s_t)-a_t)\frac{df}{d\theta}$$
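As a sanity check, here is a minimal sketch (not from the lecture) that uses autograd to obtain $\nabla_\theta \log \pi_\theta(a_t|s_t)$ for such a Gaussian policy. The linear map `W` standing in for the neural network, the dimensions, and the fixed diagonal `sigma` are assumptions made up for illustration.

```python
import torch

# Assumed toy setup: a linear "network" f(s) = W s predicts the mean action,
# and the covariance Sigma is fixed and diagonal.
state_dim, action_dim = 3, 2
W = torch.randn(action_dim, state_dim, requires_grad=True)
sigma = 0.5 * torch.ones(action_dim)          # sqrt of the diagonal of Sigma

s_t = torch.randn(state_dim)
a_t = torch.randn(action_dim)                 # the action that was actually sampled

mean = W @ s_t                                # f(s_t)
log_prob = torch.distributions.Normal(mean, sigma).log_prob(a_t).sum()
log_prob.backward()                           # W.grad now holds d log pi_theta / d W
print(W.grad)
```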

  1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run it on the robot)
  2. $\nabla_\theta J(\theta) \approx \sum_i\big(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)\big)\big(\sum_t r(s_t^i, a_t^i)\big)$
  3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
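Putting the three steps together, here is a minimal self-contained REINFORCE sketch on a made-up toy problem (a single state, two discrete actions, horizon $T=1$, softmax policy). Every detail of the environment and the hyper-parameters is an assumption chosen for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                        # logits of a softmax policy over 2 actions

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

alpha, N = 0.1, 50                         # learning rate, trajectories per batch
for _ in range(100):
    grad = np.zeros_like(theta)
    for _ in range(N):
        probs = softmax(theta)
        a = rng.choice(2, p=probs)         # 1. sample tau_i by running pi_theta
        r = 1.0 if a == 0 else 0.0         #    (toy reward: action 0 is good)
        dlogpi = -probs                    # 2. grad of log softmax = e_a - probs
        dlogpi[a] += 1.0
        grad += dlogpi * r
    theta += alpha * grad / N              # 3. gradient ascent step
print(softmax(theta))                      # most probability mass ends up on action 0
```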

Partial Observability

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|o_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big)$$
🔥
The Markov property is not actually used, so we can simply use $o_{i,t}$ in place of $s_{i,t}$.

What’s wrong?

Figure: the blue curve is the PDF of the reward, the green bars are sampled rewards, and the dashed blue curve is the fitted policy distribution.
What if the reward is shifted up by a constant (while everything else stays the same)?

We see huge differences in the fitted policy distributions ⇒ we have a high-variance problem.

How can we modify the policy gradient to reduce variance?

🔥
Causality: the policy at time $t'$ cannot affect the reward at time $t$ if $t < t'$.
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\Big(\sum_{t'=1}^T r(s_{i,t'},a_{i,t'})\Big)$$

Rewards from time steps before $t$ have zero expectation when multiplied by $\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$, so we can drop them from the inner sum. Removing those terms changes the estimator for any finite number of samples, but it remains unbiased.

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\underbrace{\Big(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\Big)}_{\text{reward to go }\hat{Q}_{i,t}}$$

Because we have removed terms from the sum, the variance is lower.

With discount factors

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla _\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \Big(\sum_{t' = t}^T \gamma^{t' -t}\, r(s_{i,t'},a_{i,t'})\Big)$$

The exponent $t'-t$ on $\gamma$ is important: the gradient is NOT decaying as $t$ advances. At the current step we put less weight on far-future rewards, but at future steps those rewards will be valued highly again.

If we instead used the exponent $t'-1$ (or $t-1$), then as we go from $t=0$ to $t \rightarrow \infty$, the gradient contribution would shrink as $t$ increases.
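A small helper like the following sketch (my illustration, not lecture code) computes the discounted reward-to-go $\hat{Q}_{i,t}$ for one sampled trajectory by accumulating backwards:

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """Q_hat[t] = sum_{t' >= t} gamma^(t'-t) * rewards[t'] for a single trajectory."""
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

print(discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```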

Baselines

We want the policy gradient to increase the probability of actions whose reward is above average and decrease the probability of actions whose reward is below average.

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log p_\theta (\tau_i)\,[r(\tau_i)-b]$$

Without a baseline, $b=0$. With a baseline, a simple choice is the average reward:

$$b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)$$

Changing $b$ keeps the estimator unbiased but changes the variance!

$$\begin{split} \mathbb{E}[\nabla_\theta \log p_\theta(\tau)\,b] &= \int p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)\, b\, d\tau \\ &= \int \nabla_\theta p_\theta(\tau)\, b\, d\tau \\ &= b \nabla_\theta \int p_\theta(\tau)\, d\tau \\ &= b\nabla_\theta 1 = 0 \end{split}$$

Average reward is not the best baseline, but it’s pretty good!

Let's first write down the variance:

$$\mathrm{Var} = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\big(\nabla_\theta \log p_\theta(\tau)(r(\tau)-b)\big)^2\big]-\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta\log p_\theta (\tau)(r(\tau)-b)\big]^2$$
$$\begin{split} \frac{d\,\mathrm{Var}}{db} &=\frac{d}{db}\mathbb{E}\big[g(\tau)^2(r(\tau)-b)^2\big], \qquad g(\tau) := \nabla_\theta \log p_\theta(\tau) \\ &=\frac{d}{db}\Big(\mathbb{E}[g(\tau)^2 r(\tau)^2]-2b\,\mathbb{E}[g(\tau)^2 r(\tau)]+b^2\,\mathbb{E}[g(\tau)^2]\Big) \\ &= -2\mathbb{E}[g(\tau)^2 r(\tau)]+2b\,\mathbb{E}[g(\tau)^2]=0 \end{split}$$

(The second term of the variance is just $\nabla_\theta J(\theta)^2$ and does not depend on $b$, since the estimator is unbiased for any $b$, so only the first term matters when differentiating.)

So the optimal value of $b$ is

$$b = \frac{\mathbb{E}[g(\tau)^2 r(\tau)]}{\mathbb{E}[g(\tau)^2]}$$

Intuition: this gives a different baseline for every entry of the gradient; the baseline for each parameter is the expected reward, weighted by the squared magnitude of that parameter's gradient.
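As a quick numerical illustration of this per-parameter baseline (the random arrays below are made-up stand-ins for the per-trajectory gradients $g(\tau_i)$ and rewards $r(\tau_i)$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 1000, 4
g = rng.normal(size=(N, dim))              # stand-in for grad_theta log p_theta(tau_i)
r = rng.normal(loc=5.0, size=N)            # stand-in for r(tau_i)

# Optimal baseline, computed separately for each parameter: E[g^2 r] / E[g^2]
b_opt = (g**2 * r[:, None]).mean(axis=0) / (g**2).mean(axis=0)
grad_est = (g * (r[:, None] - b_opt)).mean(axis=0)   # baselined gradient estimate
print(b_opt, grad_est)
```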

Off-policy setting

Policy gradient is defined to be on-policy

The expectation under $p_\theta(\tau)$ means we need fresh samples every time we update the policy; the problem is that neural networks change only a little with each gradient step, so this is very sample-inefficient.

So we can modify the policy gradient slightly.

Instead of having samples from $p_\theta(\tau)$, we have samples from some other distribution $\bar{p}(\tau)$.

We can use importance sampling to accommodate this case:

$$\begin{split} \mathbb{E}_{x \sim p(x)}[f(x)] &= \int p(x)f(x)\,dx \\ &=\int \frac{q(x)}{q(x)}\, p(x) f(x)\, dx \\ &=\int q(x) \frac{p(x)}{q(x)}f(x)\,dx \\ &= \mathbb{E}_{x \sim q(x)}\Big[\frac{p(x)}{q(x)}f(x)\Big] \end{split}$$
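A quick numerical sanity check of this identity, with two arbitrary Gaussians chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
x_q = rng.normal(1.0, 2.0, size=200_000)                       # samples from q = N(1, 2^2)
w = gaussian_pdf(x_q, 0.0, 1.0) / gaussian_pdf(x_q, 1.0, 2.0)  # importance weights p(x)/q(x)
print((w * f(x_q)).mean())                                     # ~= E_{x~p}[x^2] = 1 for p = N(0, 1)
```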

Applying importance sampling, our objective becomes:

$$J(\theta)=\mathbb{E}_{\tau \sim \bar{p}(\tau)}\Big[\frac{p_\theta(\tau)}{\bar{p}(\tau)}r(\tau)\Big]$$

Where

$$p_\theta(\tau)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

Therefore,

$$\begin{split} \frac{p_\theta(\tau)}{\bar{p}(\tau)} &= \frac{p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)} \\ &=\frac{\prod_{t=1}^T \pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar\pi(a_t|s_t)} \end{split}$$

So we can estimate the objective for some new parameters $\theta'$ using samples from the current policy $\pi_\theta$:

$$J(\theta')=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} r(\tau)\Big]$$
$$\begin{split} \nabla_{\theta '} J(\theta ') &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[\frac{\nabla_{\theta '} p_{\theta '}(\tau)}{p_\theta(\tau)}r(\tau)\Big] \\ &=\mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[\frac{p_{\theta '}(\tau)}{p_\theta(\tau)}\nabla_{\theta '} \log p_{\theta '} (\tau)\, r(\tau)\Big] \\ &=\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\Big(\prod_{t=1}^T\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\Big)\Big(\sum_{t=1}^T \nabla_{\theta '} \log \pi_{\theta '}(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \end{split}$$

Taking causality into account (the current action does not affect past rewards), we get:

$$\nabla_{\theta '} J(\theta ') =\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^T \nabla_{\theta '} \log \pi_{\theta '}(a_t|s_t)\Big(\prod_{t'=1}^t \frac{\pi_{\theta '} (a_{t'}|s_{t'})}{\pi_\theta(a_{t '}|s_{t '})}\Big)\Big(\sum_{t' = t}^T r(s_{t'},a_{t'})\Big(\prod_{t''=t}^{t'}\frac{\pi_{\theta '}(a_{t ''}|s_{t ''})}{\pi_\theta(a_{t ''} | s_{t ''})}\Big)\Big)\Big]$$

Some important terms in this equation:

$\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'} | s_{t'})}{\pi_{\theta}(a_{t'} | s_{t'})}$ ⇒ thanks to causality, future actions do not affect the current weight. However, this product of ratios is still trouble because it is exponential in $T$.

$\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''} | s_{t''})}$ ⇒ we cannot completely ignore this term: stripping it out no longer gives the policy gradient, but rather a "policy iteration"-style algorithm.

Since the first product is exponential in $T$, we can try to write the objective a bit differently.

On-policy policy gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, \hat{Q}_{i,t}$$

Off-policy policy gradient

$$\begin{split} \nabla_{\theta '} J(\theta ') &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \frac{\pi_{\theta '} (s_{i,t}, a_{i,t})}{\pi_\theta(s_{i,t},a_{i,t})} \nabla_{\theta '} \log \pi_{\theta '}(a_{i,t}|s_{i,t})\, \hat{Q}_{i,t} \\ &= \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \underbrace{\frac{\pi_{\theta '}(s_{i,t})}{\pi_\theta(s_{i,t})}}_{\text{Can we ignore this part?}} \frac{\pi_{\theta '}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})} \nabla_{\theta '} \log \pi_{\theta '}(a_{i,t}|s_{i,t})\, \hat{Q}_{i,t} \end{split}$$

This does not necessarily give the optimal policy, but it's a reasonable approximation.
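A convenient way to implement this first-order estimator is as a surrogate loss whose gradient matches it (essentially the PPO surrogate without clipping). The sketch below is my illustration with assumed `[N, T]` tensors, not lecture code:

```python
import torch

def off_policy_pseudo_loss(new_log_probs, old_log_probs, q_hat):
    """new_log_probs: log pi_theta'(a_{i,t}|s_{i,t}) under the parameters being optimized.
    old_log_probs, q_hat: computed under the data-collecting policy pi_theta (constants)."""
    ratios = torch.exp(new_log_probs - old_log_probs.detach())  # pi_theta'/pi_theta per step
    # grad of (ratio * Q_hat) w.r.t. theta' = ratio * grad log pi_theta' * Q_hat,
    # which matches the per-time-step importance-weighted estimator above.
    return -(ratios * q_hat.detach()).sum(dim=1).mean()
```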

Policy Gradient with Automatic Differentiation

We don’t want to be inefficient at computing gradients ⇒ use backprop trick

So why don’t we start from the MLE approach to Policy Gradients?

$$\nabla_\theta J_{\mathrm{ML}} \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})$$

To start doing backprop, let’s define the objective function first.

$$J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})$$

So we will define our "pseudo-loss" approximation as a weighted maximum likelihood:

$$\tilde{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})\, \hat{Q}_{i,t}$$
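In a PyTorch-style implementation, a gradient step on this pseudo-loss might look like the sketch below; `log_probs` (shape `[N, T]`, produced by the policy network with autograd) and `q_hat` are assumed inputs, and `q_hat` is detached so the weights are treated as constants:

```python
import torch

def policy_gradient_step(log_probs, q_hat, optimizer):
    """log_probs: log pi_theta(a_{i,t}|s_{i,t}), shape [N, T], requires grad.
    q_hat: reward-to-go estimates Q_hat_{i,t}, shape [N, T]."""
    pseudo_loss = -(log_probs * q_hat.detach()).sum(dim=1).mean()  # minus: optimizers minimize
    optimizer.zero_grad()
    pseudo_loss.backward()     # backprop through log_probs yields the policy gradient
    optimizer.step()
```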

A few reminders about tuning hyper-parameters

  1. The gradient has very high variance
    1. This isn't the same as supervised learning
  2. Consider using much larger batches
  3. Tweaking learning rates is VERY hard
    1. Adaptive step size rules like ADAM can be OK-ish

Covariant/natural policy gradient

🤔
One nice thing about one-dimensional state and action spaces is that you can easily visualize the policy distribution on a 2D plane.
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

We can view first-order gradient ascent as the solution to a constrained optimization problem:

$$\theta ' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_{\theta}J(\theta) \quad \text{s.t. } \|\theta ' - \theta\|^2 \le \epsilon$$

Think of the constraint region given by $\epsilon$ as a high-dimensional sphere around $\theta$.

We can instead put the constraint on the policy distribution:

$$\theta ' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_{\theta}J(\theta) \quad \text{s.t. } D(\pi_{\theta '}, \pi_\theta) \le \epsilon$$

where $D(\cdot, \cdot)$ is some divergence measure between distributions.

So we want to choose some parameterization-independent divergence measure

Usually KL-divergence

$$D_{KL}(\pi_{\theta '}\|\pi_\theta) = \mathbb{E}_{\pi_{\theta '}}[\log \pi_{\theta '} - \log \pi_\theta]$$

This is a bit hard to plug into our gradient update directly ⇒ we will approximate this divergence with a second-order Taylor expansion around the current parameter value $\theta$.

Advanced Policy Gradients

Why does policy gradient work?

  1. Estimate $\hat{A}^\pi(s_t,a_t)$ (Monte Carlo or function approximator) for the current policy $\pi$
  2. Use $\hat{A}^\pi(s_t,a_t)$ to get an improved policy $\pi'$

We are going back and forth between (1) and (2)

This looks just like the policy iteration algorithm!

Nice thing about policy gradients:

  1. We only move a little toward the policy that is "optimal" for the current advantage estimator, rather than jumping all the way to it (which is safer when the estimator is imperfect)

But how do we formalize this?

If we write the policy gradient objective as

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t \gamma^t r(s_t,a_t)\Big]$$

Then we can calculate how much we improved by

$$J(\theta ') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\Big]$$

The claim is that $J(\theta') - J(\theta)$ equals the expected sum of discounted advantages of the old policy, evaluated along trajectories sampled from the new policy.

We prove this as follows:

$$\begin{split} J(\theta')-J(\theta)&=J(\theta')-\mathbb{E}_{s_0 \sim p(s_0)} [V^{\pi_\theta}(s_0)] \\ &= J(\theta') - \underbrace{\mathbb{E}_{\tau \sim p_{\theta '}(\tau)}[V^{\pi_\theta}(s_0)]}_{\text{same as above: the initial state distribution is the same}} \\ &= J(\theta') - \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t V^{\pi_\theta}(s_t) - \sum_{t=1}^\infty \gamma^t V^{\pi_\theta}(s_t)\Big] \\ &= J(\theta') + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t \big(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\Big] + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t \big(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t \big(r(s_t,a_t) + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] \end{split}$$
🔥
This proof shows that by optimizing $\mathbb{E}_{\tau \sim p_{\theta'}}[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)]$, we are actually improving our policy.
But how does this relate to our policy gradient?
$$\begin{split} \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} \Big[\mathbb{E}_{a_t \sim \pi_{\theta '}(a_t|s_t)}\big[\gamma^t A^{\pi_\theta} (s_t,a_t)\big]\Big] \\ &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} \Big[\underbrace{\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}}_{\text{importance sampling}} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] \\ &\approx \sum_t \underbrace{\mathbb{E}_{s_t \sim p_{\theta}(s_t)}}_{\text{ignoring the distribution mismatch}}\Big[\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] = \bar{A}(\theta') \end{split}$$

We want this approximation to hold, so that

$$J(\theta')-J(\theta) \approx \bar{A}(\theta')$$

And then we can use

$$\theta' \leftarrow \argmax_{\theta'} \bar{A}(\theta')$$

to optimize our policy

and then use $\hat{A}^\pi(s_t,a_t)$ to get the improved policy $\pi'$.

But when is this true?

Bounding the distribution change (deterministic)

We will show:

🔥
$p_\theta(s_t)$ is close to $p_{\theta'}(s_t)$ when $\pi_\theta$ is close to $\pi_{\theta'}$

We will assume that $\pi_\theta$ is a deterministic policy: $a_t = \pi_\theta(s_t)$.

We define closeness of policies as:

$\pi_{\theta'}$ is close to $\pi_\theta$ if $\pi_{\theta'}(a_t \ne \pi_\theta(s_t)\,|\,s_t) \le \epsilon$
$$p_{\theta'}(s_t) = \underbrace{(1-\epsilon)^t}_{\text{probability we made no mistakes}} p_\theta(s_t)+\big(1-(1-\epsilon)^t\big)\underbrace{p_{\text{mistake}}(s_t)}_{\text{some other distribution}}$$

This implies a bound on the total variation divergence:

$$\begin{split} |p_{\theta'}(s_t)-p_\theta(s_t)| &= \big(1-(1-\epsilon)^t\big)\, |p_{\text{mistake}}(s_t)-p_\theta(s_t)| \\ &\le 2\big(1-(1-\epsilon)^t\big) \\ &\le 2\epsilon t \end{split}$$
👉
Useful identity: $\forall \epsilon \in [0,1],\ (1-\epsilon)^t \ge 1 - \epsilon t$
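For completeness, a one-line induction argument for this identity (my addition): the base case is $(1-\epsilon)^0 = 1$, and for the inductive step, using $(1-\epsilon) \ge 0$,

$$(1-\epsilon)^{t+1} = (1-\epsilon)(1-\epsilon)^t \ge (1-\epsilon)(1-\epsilon t) = 1 - \epsilon(t+1) + \epsilon^2 t \ge 1 - \epsilon(t+1)$$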

Bounding the distribution change (arbitrary)

Schulman, Levine, Moritz, Jordan, Abbeel, “Trust Region Policy Optimization”

We will show:

🔥
$p_\theta(s_t)$ is close to $p_{\theta'}(s_t)$ when $\pi_\theta$ is close to $\pi_{\theta'}$

Closeness:

$\pi_{\theta'}$ is close to $\pi_\theta$ if $|\pi_{\theta'}(a_t|s_t) - \pi_\theta(a_t|s_t)| \le \epsilon$ for all $s_t$

Useful Lemma:

If $|p_X(x) - p_Y(y)| = \epsilon$, then there exists a joint distribution $p(x,y)$ such that $p(x) = p_X(x)$, $p(y) = p_Y(y)$, and $p(x=y) = 1 - \epsilon$

Says:

  1. $p_X(x)$ agrees with $p_Y(y)$ with probability $1-\epsilon$
  2. $\pi_{\theta'}(a_t|s_t)$ takes a different action than $\pi_{\theta}(a_t|s_t)$ with probability at most $\epsilon$

Therefore

$$\begin{split} |p_{\theta'}(s_t)-p_{\theta}(s_t)| &= \big(1-(1-\epsilon)^t\big)\,|p_{\text{mistake}}(s_t)-p_\theta(s_t)| \\ &\le 2\big(1-(1-\epsilon)^t\big) \\ &\le 2 \epsilon t \end{split}$$

Bounding the objective value

Assume $|p_{\theta'}(s_t)-p_{\theta}(s_t)| \le 2\epsilon t$. Then, for any function $f$:

$$\begin{split} \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}[f(s_t)] &= \sum_{s_t} p_{\theta'}(s_t)f(s_t) \\ &\ge \sum_{s_t} p_{\theta}(s_t)f(s_t) - |p_{\theta}(s_t)-p_{\theta'}(s_t)|\max_{s_t}f(s_t) \\ &\ge \mathbb{E}_{p_\theta(s_t)}[f(s_t)] - 2\epsilon t \max_{s_t} f(s_t) \end{split}$$

We can then bound the objective value by:

$$\begin{split} \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} \Big[\underbrace{\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}}_{\text{importance sampling}} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] \\ & \ge \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)} \Big[\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] - \sum_{t} 2\epsilon t C \end{split}$$

where $C$ is a constant bounding the quantity inside the state expectation. It is a probability ratio times an advantage, so it is on the order of the largest possible reward times the number of time steps.

So:

$$C \in O(T r_{\max}) \ \text{ or } \ \underbrace{O\Big(\frac{r_{\max}}{1-\gamma}\Big)}_{\text{infinite horizon with discount (geometric series)}}$$

Therefore, for small enough $\epsilon$, $\bar{A}(\theta')$ is guaranteed to be close to the true RL objective improvement.

Policy Gradients with Constraints

We want to constrain:

$$|\pi_{\theta'}(a_t|s_t) - \pi_{\theta}(a_t|s_t)| \le \epsilon$$

which will result in

$$|p_{\theta'}(s_t)-p_\theta(s_t)| \le 2\epsilon t$$

A more convenient bound:

$$|\pi_{\theta'}(a_t|s_t) - \pi_{\theta}(a_t|s_t)| \le \sqrt{\frac{1}{2} D_{KL}\big(\pi_{\theta'}(a_t|s_t)\,\|\,\pi_{\theta}(a_t|s_t)\big)}$$

So the KL divergence bounds the policy difference, and hence the state-marginal difference.

Note:

$$D_{KL}\big(p_1(x)\,\|\,p_2(x)\big) = \mathbb{E}_{x \sim p_1(x)} \Big[\log \frac{p_1(x)}{p_2(x)}\Big]$$
👉
KL Divergence has very convenient properties that make it much easier to approximate
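For example, the KL divergence can be estimated directly from samples of $p_1$; the sketch below (two arbitrary 1-D Gaussians, my choice of numbers) checks the Monte Carlo estimate against the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5

def log_pdf(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

x = rng.normal(mu1, s1, size=200_000)                       # samples from p1
kl_mc = (log_pdf(x, mu1, s1) - log_pdf(x, mu2, s2)).mean()  # E_{x~p1}[log p1 - log p2]
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5
print(kl_mc, kl_exact)                                      # the two agree closely
```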

So, constraining our objective, we have:

$$\theta' \leftarrow \argmax_{\theta'} \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] \quad \text{s.t. } D_{KL}\big(\pi_{\theta '}(a_t|s_t)\,\|\,\pi_\theta(a_t|s_t)\big) \le \epsilon$$

For small $\epsilon$, this is guaranteed to improve $J(\theta') - J(\theta)$.

👉
We can handle the constraint by writing out the Lagrangian and performing (approximate) convex optimization:

$$\mathcal{L}(\theta',\lambda) = \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] - \lambda\Big(D_{KL}\big(\pi_{\theta '}(a_t|s_t)\,\|\,\pi_\theta(a_t|s_t)\big) - \epsilon\Big)$$

"Dual Gradient Descent": alternate between:

  1. Maximize $\mathcal{L}(\theta', \lambda)$ with respect to $\theta'$
  2. $\lambda \leftarrow \lambda + \alpha\big(D_{KL}(\pi_{\theta'}(a_t|s_t)\,\|\,\pi_\theta(a_t|s_t)) - \epsilon\big)$

Intuition: raise $\lambda$ if the constraint is violated too much.
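A tiny runnable sketch of this alternation on a made-up 1-D problem, maximizing $-(x-3)^2$ subject to $x^2 \le 1$, which stands in for the policy objective and the KL constraint (all step sizes are arbitrary):

```python
x, lam = 0.0, 0.0
for _ in range(2000):
    # 1. (approximately) maximize L(x, lambda) w.r.t. x with one gradient step
    grad_x = -2.0 * (x - 3.0) - lam * 2.0 * x
    x += 0.01 * grad_x
    # 2. gradient ascent on lambda: raise it when the constraint x^2 <= 1 is violated
    lam = max(0.0, lam + 0.1 * (x ** 2 - 1.0))
print(x, lam)   # x converges near the constrained optimum x = 1
```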

Natural Gradient

As before, we denote $\bar{A}(\theta')$ as

$$\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big]$$
When we do gradient ascent, we are effectively optimizing the first-order Taylor expansion of the objective within a small region (a trust region) around the current parameters.

Can we use this fact to simplify the optimization step and use one easy algorithm to take care of this constrained optimization problem?

$$\theta' \leftarrow \argmax_{\theta'} \nabla_{\theta} \bar{A}(\theta)^{\top}(\theta'-\theta) \quad \text{s.t. } D_{KL}\big(\pi_{\theta'}(a_t|s_t)\,\|\,\pi_{\theta}(a_t|s_t)\big) \le \epsilon$$

Let's look at the gradient of $\bar{A}(\theta')$:

$$\nabla_{\theta'}\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\, A^{\pi_\theta}(s_t,a_t)\Big]\Big]$$

Evaluating this at $\theta' = \theta$:

$$\begin{split} \nabla_{\theta}\bar{A}(\theta) &= \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\big[\gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t,a_t)\big]\Big] \\ &= \nabla_\theta J(\theta) \end{split}$$

Can we just use the gradient then?

Remember that in the original policy gradient algorithm, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.

But some parameters change the action probabilities much more than others; plain gradient ascent does not account for this, since we are not considering the KL divergence here.

⚠️
So we want to either parameterize the policy so that all parameters have comparable effects, or adapt the learning rate for each parameter.

But this problem can be fixed. Notice that

gradient ascent still poses some kind of constraint: $\|\theta' - \theta\|^2 \le \epsilon$.

Think of it in parameter space: we are constrained to stay within some fixed distance (inside a circle) of the original parameters ⇒ indeed, the learning rate $\alpha = \sqrt{\frac{\epsilon}{\|\nabla_\theta J(\theta)\|^2}}$ makes the gradient step satisfy this constraint.

But with the KL-divergence constraint, we want an ellipse in parameter space (i.e., a circle in distribution space).

We will use a second-order Taylor expansion to approximate the KL divergence and make it easier to compute:

$$D_{KL}(\pi_{\theta '} \| \pi_\theta) \approx (\theta '-\theta)^{\top}\underbrace{F}_{\text{Fisher information matrix}}(\theta ' - \theta)$$
$$F = \mathbb{E}_{\pi_{\theta}}\big[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^{\top}\big]$$

Now we see that the shape of the ellipse in parameter space is determined by the matrix $F$.

We can estimate this expectation for $F$ using samples.

So therefore!

Using this approximation, our update becomes

$$\theta' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_\theta J(\theta) \quad \text{s.t. } \|\theta ' - \theta\|_F^2 \le \epsilon$$

With this, we have the "natural gradient":

$$\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)$$
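A minimal sketch of estimating $F$ from samples and computing the natural gradient direction (the random arrays are stand-ins for the real per-sample $\nabla_\theta \log \pi_\theta(a|s)$ and the policy gradient; the small ridge term is my addition for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 5000, 6
grad_log_pi = rng.normal(size=(N, dim))   # stand-in for per-sample grad_theta log pi_theta(a|s)
pg = rng.normal(size=dim)                 # stand-in for the ordinary policy gradient grad_theta J

F = grad_log_pi.T @ grad_log_pi / N       # sample estimate of E[g g^T]
nat_grad = np.linalg.solve(F + 1e-6 * np.eye(dim), pg)   # F^{-1} grad J without forming F^{-1}
```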

If we want to enforce the constraint exactly, we can choose the step size accordingly.

We can calculate the step size $\alpha$ by

$$\alpha = \sqrt{\frac{2 \epsilon}{\nabla_\theta J(\theta)^{\top} F\, \nabla_\theta J(\theta)}}$$

Or run the conjugate gradient method to compute $F^{-1} \nabla_\theta J(\theta)$ without explicitly inverting $F$, and get $\alpha$ as a by-product (see the Trust Region Policy Optimization paper).

Several algorithms use this trick:

  1. Natural gradient: pick $\alpha$
  2. Trust region policy optimization: pick $\epsilon$
    1. Can solve for the optimal $\alpha$ while solving $F^{-1} \nabla_\theta J(\theta)$
    2. Conjugate gradient works well for this

Recommended Readings

  1. Natural policy gradient
    1. Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust Region Policy Optimization.
    2. Peters, Schaal. Reinforcement learning of motor skills with policy gradients.
  2. Using the importance-sampled (IS) objective directly
    1. Proximal Policy Optimization (PPO)

Is this even a problem in practice?

👉
It's a pretty big problem, especially with continuous action spaces. The natural gradient is generally a good choice to stabilize policy gradient training.

With this kind of problem, vanilla policy gradients do point in the right direction, but they are extremely ill-conditioned (the gradient is too busy decreasing $\sigma$).