Policy Gradients


"REINFORCE algorithm"

Remember: direct gradient ascent on the RL objective

$$\begin{split} \theta^* &= \argmax_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\textstyle\sum_t r(s_t,a_t)\big]}_{J(\theta)} \\ &= \argmax_{\theta} \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p_\theta(s_t,a_t)}[r(s_t,a_t)] \end{split}$$

Note: we don't know $p(s_1)$ or $p(s_{t+1}|s_t,a_t)$, but we can still estimate the objective from samples.

$$\begin{split} J(\theta)&=\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\underbrace{\textstyle\sum_t r(s_t,a_t)}_{\text{denote as }r(\tau)}\big] \\ &= \int p_\theta(\tau)\,r(\tau)\,d\tau \\ &\approx \frac{1}{N} \sum_i \sum_t \underbrace{r(s_{i,t}, a_{i,t})}_{\text{reward at time step $t$ of the $i$-th sample}} \end{split}$$
The larger our sample size $N$ is, the more accurate our estimate will be.
$$\nabla_\theta J(\theta)=\int \nabla_\theta p_\theta(\tau)\,r(\tau)\, d\tau$$

We will use a convenient identity to rewrite this equation so that it can be evaluated without knowing the distribution $p_\theta(\tau)$:

$$p_\theta(\tau)\nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$

Using this identity, we can rewrite the equation

$$\begin{split} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\,r(\tau)\,d\tau \\ &= \int p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)\,r(\tau)\,d\tau \end{split}$$

Note:

$$p_\theta(s_1,a_1,\dots,s_T,a_T)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$
$$\log p_\theta(s_1,a_1,\dots,s_T,a_T)=\log p(s_1)+ \sum_{t=1}^T \big[\log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\big]$$

So if we consider $\nabla_\theta J(\theta)$, and in particular $\nabla_\theta \log p_\theta(\tau)$, we see that

$$\begin{split} \nabla_\theta \log p_\theta(\tau) &= \nabla_\theta\Big[\log p(s_1) + \sum_{t=1}^T \log \pi_\theta(a_t|s_t)+\log p(s_{t+1}|s_t,a_t)\Big] \\ &= \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \end{split}$$

(The initial-state and transition terms do not depend on $\theta$, so their gradients vanish.)

Therefore,

$$\begin{split} \nabla_\theta J(\theta)&=\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_{t=1}^T\nabla_\theta \log \pi_\theta(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \\ &\approx \frac{1}{N}\sum_{i=1}^N\Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big) \end{split}$$

Then improve the policy by gradient ascent:

$$\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$$

"REINFORCE algorithm" on continuous action spaces

e.g. we can represent $\pi_\theta(a_t|s_t)$ as $\mathcal{N}\big(f_{\text{neural network}}(s_t);\ \Sigma\big)$

$$\nabla_{\theta} \log \pi_\theta(a_t|s_t)=-\frac{1}{2}\Sigma^{-1}(f(s_t)-a_t)\frac{df}{d\theta}$$
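As a sanity check, here is a minimal sketch (not from the lecture) that uses autograd to obtain $\nabla_\theta \log \pi_\theta(a_t|s_t)$ for such a Gaussian policy. The linear map `W` standing in for the neural network, the dimensions, and the fixed diagonal `sigma` are assumptions made up for illustration.

```python
import torch

# Assumed toy setup: a linear "network" f(s) = W s predicts the mean action,
# and the covariance Sigma is fixed and diagonal.
state_dim, action_dim = 3, 2
W = torch.randn(action_dim, state_dim, requires_grad=True)
sigma = 0.5 * torch.ones(action_dim)          # sqrt of the diagonal of Sigma

s_t = torch.randn(state_dim)
a_t = torch.randn(action_dim)                 # the action that was actually sampled

mean = W @ s_t                                # f(s_t)
log_prob = torch.distributions.Normal(mean, sigma).log_prob(a_t).sum()
log_prob.backward()                           # W.grad now holds d log pi_theta / d W
print(W.grad)
```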

  1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run it on the robot)
  2. $\nabla_\theta J(\theta) \approx \sum_i\big(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)\big)\big(\sum_t r(s_t^i, a_t^i)\big)$
  3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
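Putting the three steps together, here is a minimal self-contained REINFORCE sketch on a made-up toy problem (a single state, two discrete actions, horizon $T=1$, softmax policy). Every detail of the environment and the hyper-parameters is an assumption chosen for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                        # logits of a softmax policy over 2 actions

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

alpha, N = 0.1, 50                         # learning rate, trajectories per batch
for _ in range(100):
    grad = np.zeros_like(theta)
    for _ in range(N):
        probs = softmax(theta)
        a = rng.choice(2, p=probs)         # 1. sample tau_i by running pi_theta
        r = 1.0 if a == 0 else 0.0         #    (toy reward: action 0 is good)
        dlogpi = -probs                    # 2. grad of log softmax = e_a - probs
        dlogpi[a] += 1.0
        grad += dlogpi * r
    theta += alpha * grad / N              # 3. gradient ascent step
print(softmax(theta))                      # most probability mass ends up on action 0
```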

Partial Observability

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|o_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big)$$
🔥
The Markov property is not actually used, so we can simply use $o_{i,t}$ in place of $s_{i,t}$.

What’s wrong?

Figure: the blue curve is the PDF of the reward, the green bars are sampled rewards, and the dashed blue curve is the fitted policy distribution.
What if the reward is shifted up by a constant (while everything else stays the same)?

We see huge differences in the fitted policy distributions ⇒ we have a high-variance problem.

How can we modify the policy gradient to reduce variance?

🔥
Causality: the policy at time $t'$ cannot affect the reward at time $t$ if $t < t'$.
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\Big(\sum_{t'=1}^T r(s_{i,t'},a_{i,t'})\Big)$$

Rewards from time steps before $t$ have zero expectation when multiplied by $\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$, so we can drop them from the inner sum. Removing those terms changes the estimator for any finite number of samples, but it remains unbiased.

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_{i,t}|s_{i,t})\underbrace{\Big(\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\Big)}_{\text{reward to go }\hat{Q}_{i,t}}$$

Because we have removed terms from the sum, the variance is lower.

With discount factors

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla _\theta \log \pi_\theta (a_{i,t}|s_{i,t}) \Big(\sum_{t' = t}^T \gamma^{t' -t}\, r(s_{i,t'},a_{i,t'})\Big)$$

The exponent $t'-t$ on $\gamma$ is important: the gradient is NOT decaying as $t$ advances. At the current step we put less weight on far-future rewards, but at future steps those rewards will be valued highly again.

If we instead used the exponent $t'-1$ (or $t-1$), then as we go from $t=0$ to $t \rightarrow \infty$, the gradient contribution would shrink as $t$ increases.
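A small helper like the following sketch (my illustration, not lecture code) computes the discounted reward-to-go $\hat{Q}_{i,t}$ for one sampled trajectory by accumulating backwards:

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """Q_hat[t] = sum_{t' >= t} gamma^(t'-t) * rewards[t'] for a single trajectory."""
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

print(discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```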

Baselines

We want the policy gradient to increase the probability of actions whose reward is above average and decrease the probability of actions whose reward is below average.

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log p_\theta (\tau_i)\,[r(\tau_i)-b]$$

Without a baseline, $b=0$. With a baseline, a simple choice is the average reward:

$$b = \frac{1}{N}\sum_{i=1}^N r(\tau_i)$$

Changing $b$ keeps the estimator unbiased but changes the variance!

$$\begin{split} \mathbb{E}[\nabla_\theta \log p_\theta(\tau)\,b] &= \int p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)\, b\, d\tau \\ &= \int \nabla_\theta p_\theta(\tau)\, b\, d\tau \\ &= b \nabla_\theta \int p_\theta(\tau)\, d\tau \\ &= b\nabla_\theta 1 = 0 \end{split}$$

Average reward is not the best baseline, but it’s pretty good!

Let's first write down the variance:

$$\mathrm{Var} = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\big(\nabla_\theta \log p_\theta(\tau)(r(\tau)-b)\big)^2\big]-\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta\log p_\theta (\tau)(r(\tau)-b)\big]^2$$
$$\begin{split} \frac{d\,\mathrm{Var}}{db} &=\frac{d}{db}\mathbb{E}\big[g(\tau)^2(r(\tau)-b)^2\big], \qquad g(\tau) := \nabla_\theta \log p_\theta(\tau) \\ &=\frac{d}{db}\Big(\mathbb{E}[g(\tau)^2 r(\tau)^2]-2b\,\mathbb{E}[g(\tau)^2 r(\tau)]+b^2\,\mathbb{E}[g(\tau)^2]\Big) \\ &= -2\mathbb{E}[g(\tau)^2 r(\tau)]+2b\,\mathbb{E}[g(\tau)^2]=0 \end{split}$$

(The second term of the variance is just $\nabla_\theta J(\theta)^2$ and does not depend on $b$, since the estimator is unbiased for any $b$, so only the first term matters when differentiating.)

So the optimal value of $b$ is

$$b = \frac{\mathbb{E}[g(\tau)^2 r(\tau)]}{\mathbb{E}[g(\tau)^2]}$$

Intuition: this gives a different baseline for every entry of the gradient; the baseline for each parameter is the expected reward, weighted by the squared magnitude of that parameter's gradient.
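As a quick numerical illustration of this per-parameter baseline (the random arrays below are made-up stand-ins for the per-trajectory gradients $g(\tau_i)$ and rewards $r(\tau_i)$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 1000, 4
g = rng.normal(size=(N, dim))              # stand-in for grad_theta log p_theta(tau_i)
r = rng.normal(loc=5.0, size=N)            # stand-in for r(tau_i)

# Optimal baseline, computed separately for each parameter: E[g^2 r] / E[g^2]
b_opt = (g**2 * r[:, None]).mean(axis=0) / (g**2).mean(axis=0)
grad_est = (g * (r[:, None] - b_opt)).mean(axis=0)   # baselined gradient estimate
print(b_opt, grad_est)
```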

Off-policy setting

Policy gradient is defined to be on-policy

The expectation under $p_\theta(\tau)$ means we need fresh samples every time we update the policy; the problem is that neural networks change only a little with each gradient step, so this is very sample-inefficient.

So we can modify the policy gradient slightly.

Instead of having samples from $p_\theta(\tau)$, we have samples from some other distribution $\bar{p}(\tau)$.

We can use importance sampling to accommodate this case:

$$\begin{split} \mathbb{E}_{x \sim p(x)}[f(x)] &= \int p(x)f(x)\,dx \\ &=\int \frac{q(x)}{q(x)}\, p(x) f(x)\, dx \\ &=\int q(x) \frac{p(x)}{q(x)}f(x)\,dx \\ &= \mathbb{E}_{x \sim q(x)}\Big[\frac{p(x)}{q(x)}f(x)\Big] \end{split}$$
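A quick numerical sanity check of this identity, with two arbitrary Gaussians chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
x_q = rng.normal(1.0, 2.0, size=200_000)                       # samples from q = N(1, 2^2)
w = gaussian_pdf(x_q, 0.0, 1.0) / gaussian_pdf(x_q, 1.0, 2.0)  # importance weights p(x)/q(x)
print((w * f(x_q)).mean())                                     # ~= E_{x~p}[x^2] = 1 for p = N(0, 1)
```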

Applying importance sampling, our objective becomes:

$$J(\theta)=\mathbb{E}_{\tau \sim \bar{p}(\tau)}\Big[\frac{p_\theta(\tau)}{\bar{p}(\tau)}r(\tau)\Big]$$

Where

$$p_\theta(\tau)=p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$$

Therefore,

$$\begin{split} \frac{p_\theta(\tau)}{\bar{p}(\tau)} &= \frac{p(s_1)\prod_{t=1}^T \pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T \bar{\pi}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)} \\ &=\frac{\prod_{t=1}^T \pi_\theta(a_t|s_t)}{\prod_{t=1}^T \bar\pi(a_t|s_t)} \end{split}$$

So we can estimate the objective for some new parameters $\theta'$ using samples from the current policy $\pi_\theta$:

$$J(\theta')=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} r(\tau)\Big]$$
$$\begin{split} \nabla_{\theta '} J(\theta ') &= \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[\frac{\nabla_{\theta '} p_{\theta '}(\tau)}{p_\theta(\tau)}r(\tau)\Big] \\ &=\mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[\frac{p_{\theta '}(\tau)}{p_\theta(\tau)}\nabla_{\theta '} \log p_{\theta '} (\tau)\, r(\tau)\Big] \\ &=\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\Big(\prod_{t=1}^T\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\Big)\Big(\sum_{t=1}^T \nabla_{\theta '} \log \pi_{\theta '}(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \end{split}$$

Taking causality into account (the current action does not affect past rewards), we get:

$$\nabla_{\theta '} J(\theta ') =\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^T \nabla_{\theta '} \log \pi_{\theta '}(a_t|s_t)\Big(\prod_{t'=1}^t \frac{\pi_{\theta '} (a_{t'}|s_{t'})}{\pi_\theta(a_{t '}|s_{t '})}\Big)\Big(\sum_{t' = t}^T r(s_{t'},a_{t'})\Big(\prod_{t''=t}^{t'}\frac{\pi_{\theta '}(a_{t ''}|s_{t ''})}{\pi_\theta(a_{t ''} | s_{t ''})}\Big)\Big)\Big]$$

Some important terms in this equation:

$\prod_{t'=1}^t \frac{\pi_{\theta'}(a_{t'} | s_{t'})}{\pi_{\theta}(a_{t'} | s_{t'})}$ ⇒ thanks to causality, future actions do not affect the current weight. However, this product of ratios is still trouble because it is exponential in $T$.

$\prod_{t''=t}^{t'}\frac{\pi_{\theta'}(a_{t''}|s_{t''})}{\pi_\theta(a_{t''} | s_{t''})}$ ⇒ we cannot completely ignore this term: stripping it out no longer gives the policy gradient, but rather a "policy iteration"-style algorithm.

Since the first product is exponential in $T$, we can try to write the objective a bit differently.

On-policy policy gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, \hat{Q}_{i,t}$$

Off-policy policy gradient

$$\begin{split} \nabla_{\theta '} J(\theta ') &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T \frac{\pi_{\theta '} (s_{i,t}, a_{i,t})}{\pi_\theta(s_{i,t},a_{i,t})} \nabla_{\theta '} \log \pi_{\theta '}(a_{i,t}|s_{i,t})\, \hat{Q}_{i,t} \\ &= \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \underbrace{\frac{\pi_{\theta '}(s_{i,t})}{\pi_\theta(s_{i,t})}}_{\text{Can we ignore this part?}} \frac{\pi_{\theta '}(a_{i,t}|s_{i,t})}{\pi_\theta(a_{i,t}|s_{i,t})} \nabla_{\theta '} \log \pi_{\theta '}(a_{i,t}|s_{i,t})\, \hat{Q}_{i,t} \end{split}$$

This does not necessarily give the optimal policy, but it's a reasonable approximation.
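A convenient way to implement this first-order estimator is as a surrogate loss whose gradient matches it (essentially the PPO surrogate without clipping). The sketch below is my illustration with assumed `[N, T]` tensors, not lecture code:

```python
import torch

def off_policy_pseudo_loss(new_log_probs, old_log_probs, q_hat):
    """new_log_probs: log pi_theta'(a_{i,t}|s_{i,t}) under the parameters being optimized.
    old_log_probs, q_hat: computed under the data-collecting policy pi_theta (constants)."""
    ratios = torch.exp(new_log_probs - old_log_probs.detach())  # pi_theta'/pi_theta per step
    # grad of (ratio * Q_hat) w.r.t. theta' = ratio * grad log pi_theta' * Q_hat,
    # which matches the per-time-step importance-weighted estimator above.
    return -(ratios * q_hat.detach()).sum(dim=1).mean()
```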

Policy Gradient with Automatic Differentiation

We don’t want to be inefficient at computing gradients ⇒ use backprop trick

So why don’t we start from the MLE approach to Policy Gradients?

$$\nabla_\theta J_{\mathrm{ML}} \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta (a_{i,t}|s_{i,t})$$

To start doing backprop, let’s define the objective function first.

$$J_{\mathrm{ML}}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})$$

So we will define our "pseudo-loss" approximation as a weighted maximum likelihood:

$$\tilde{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta (a_{i,t}|s_{i,t})\, \hat{Q}_{i,t}$$
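In a PyTorch-style implementation, a gradient step on this pseudo-loss might look like the sketch below; `log_probs` (shape `[N, T]`, produced by the policy network with autograd) and `q_hat` are assumed inputs, and `q_hat` is detached so the weights are treated as constants:

```python
import torch

def policy_gradient_step(log_probs, q_hat, optimizer):
    """log_probs: log pi_theta(a_{i,t}|s_{i,t}), shape [N, T], requires grad.
    q_hat: reward-to-go estimates Q_hat_{i,t}, shape [N, T]."""
    pseudo_loss = -(log_probs * q_hat.detach()).sum(dim=1).mean()  # minus: optimizers minimize
    optimizer.zero_grad()
    pseudo_loss.backward()     # backprop through log_probs yields the policy gradient
    optimizer.step()
```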

A few reminders about tuning hyper-parameters

  1. The gradient has very high variance
    1. This isn't the same as supervised learning
  2. Consider using much larger batches
  3. Tweaking learning rates is VERY hard
    1. Adaptive step size rules like ADAM can be OK-ish

Covariant/natural policy gradient

🤔
One nice thing about one-dimensional state and action spaces is that you can easily visualize the policy distribution on a 2D plane.
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

We can view first-order gradient ascent as the solution to a constrained optimization problem:

$$\theta ' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_{\theta}J(\theta) \quad \text{s.t. } \|\theta ' - \theta\|^2 \le \epsilon$$

Think of the constraint region given by $\epsilon$ as a high-dimensional sphere around $\theta$.

We can instead put the constraint on the policy distribution:

$$\theta ' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_{\theta}J(\theta) \quad \text{s.t. } D(\pi_{\theta '}, \pi_\theta) \le \epsilon$$

where $D(\cdot, \cdot)$ is some divergence measure between distributions.

So we want to choose some parameterization-independent divergence measure

Usually KL-divergence

$$D_{KL}(\pi_{\theta '}\|\pi_\theta) = \mathbb{E}_{\pi_{\theta '}}[\log \pi_{\theta '} - \log \pi_\theta]$$

This is a bit hard to plug into our gradient update directly ⇒ we will approximate this divergence with a second-order Taylor expansion around the current parameter value $\theta$.

Advanced Policy Gradients

Why does policy gradient work?

  1. Estimate $\hat{A}^\pi(s_t,a_t)$ (Monte Carlo or function approximator) for the current policy $\pi$
  2. Use $\hat{A}^\pi(s_t,a_t)$ to get an improved policy $\pi'$

We are going back and forth between (1) and (2)

This looks just like the policy iteration algorithm!

Nice thing about policy gradients:

  1. We only move a little toward the policy that is "optimal" for the current advantage estimator, rather than jumping all the way to it (which is safer when the estimator is imperfect)

But how do we formalize this?

If we write the policy gradient objective as

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t \gamma^t r(s_t,a_t)\Big]$$

Then we can calculate how much we improved by

$$J(\theta ') - J(\theta) = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\Big]$$

The claim is that $J(\theta') - J(\theta)$ equals the expected sum of discounted advantages of the old policy, evaluated along trajectories sampled from the new policy.

We prove this as follows:

$$\begin{split} J(\theta')-J(\theta)&=J(\theta')-\mathbb{E}_{s_0 \sim p(s_0)} [V^{\pi_\theta}(s_0)] \\ &= J(\theta') - \underbrace{\mathbb{E}_{\tau \sim p_{\theta '}(\tau)}[V^{\pi_\theta}(s_0)]}_{\text{same as above: the initial state distribution is the same}} \\ &= J(\theta') - \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t V^{\pi_\theta}(s_t) - \sum_{t=1}^\infty \gamma^t V^{\pi_\theta}(s_t)\Big] \\ &= J(\theta') + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t \big(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\Big] + \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t \big(\gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t \big(r(s_t,a_t) + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\big)\Big] \\ &= \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] \end{split}$$
🔥
This proof shows that by optimizing $\mathbb{E}_{\tau \sim p_{\theta'}}[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)]$, we are actually improving our policy.
But how does this relate to our policy gradient?
$$\begin{split} \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} \Big[\mathbb{E}_{a_t \sim \pi_{\theta '}(a_t|s_t)}\big[\gamma^t A^{\pi_\theta} (s_t,a_t)\big]\Big] \\ &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} \Big[\underbrace{\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}}_{\text{importance sampling}} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] \\ &\approx \sum_t \underbrace{\mathbb{E}_{s_t \sim p_{\theta}(s_t)}}_{\text{ignoring the distribution mismatch}}\Big[\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] = \bar{A}(\theta') \end{split}$$

We want this approximation to hold, so that

$$J(\theta')-J(\theta) \approx \bar{A}(\theta')$$

And then we can use

$$\theta' \leftarrow \argmax_{\theta'} \bar{A}(\theta')$$

to optimize our policy

and then use $\hat{A}^\pi(s_t,a_t)$ to get the improved policy $\pi'$.

But when is this true?

Bounding the distribution change (deterministic)

We will show:

🔥
$p_\theta(s_t)$ is close to $p_{\theta'}(s_t)$ when $\pi_\theta$ is close to $\pi_{\theta'}$

We will assume that $\pi_\theta$ is a deterministic policy: $a_t = \pi_\theta(s_t)$.

We define closeness of policies as:

$\pi_{\theta'}$ is close to $\pi_\theta$ if $\pi_{\theta'}(a_t \ne \pi_\theta(s_t)\,|\,s_t) \le \epsilon$
$$p_{\theta'}(s_t) = \underbrace{(1-\epsilon)^t}_{\text{probability we made no mistakes}} p_\theta(s_t)+\big(1-(1-\epsilon)^t\big)\underbrace{p_{\text{mistake}}(s_t)}_{\text{some other distribution}}$$

This implies a bound on the total variation divergence:

$$\begin{split} |p_{\theta'}(s_t)-p_\theta(s_t)| &= \big(1-(1-\epsilon)^t\big)\, |p_{\text{mistake}}(s_t)-p_\theta(s_t)| \\ &\le 2\big(1-(1-\epsilon)^t\big) \\ &\le 2\epsilon t \end{split}$$
👉
Useful identity: $\forall \epsilon \in [0,1],\ (1-\epsilon)^t \ge 1 - \epsilon t$
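For completeness, a one-line induction argument for this identity (my addition): the base case is $(1-\epsilon)^0 = 1$, and for the inductive step, using $(1-\epsilon) \ge 0$,

$$(1-\epsilon)^{t+1} = (1-\epsilon)(1-\epsilon)^t \ge (1-\epsilon)(1-\epsilon t) = 1 - \epsilon(t+1) + \epsilon^2 t \ge 1 - \epsilon(t+1)$$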

Bounding the distribution change (arbitrary)

Schulman, Levine, Moritz, Jordan, Abbeel, “Trust Region Policy Optimization”

We will show:

🔥
$p_\theta(s_t)$ is close to $p_{\theta'}(s_t)$ when $\pi_\theta$ is close to $\pi_{\theta'}$

Closeness:

$\pi_{\theta'}$ is close to $\pi_\theta$ if $|\pi_{\theta'}(a_t|s_t) - \pi_\theta(a_t|s_t)| \le \epsilon$ for all $s_t$

Useful Lemma:

If $|p_X(x) - p_Y(y)| = \epsilon$, then there exists a joint distribution $p(x,y)$ such that $p(x) = p_X(x)$, $p(y) = p_Y(y)$, and $p(x=y) = 1 - \epsilon$

Says:

  1. $p_X(x)$ agrees with $p_Y(y)$ with probability $1-\epsilon$
  2. $\pi_{\theta'}(a_t|s_t)$ takes a different action than $\pi_{\theta}(a_t|s_t)$ with probability at most $\epsilon$

Therefore

$$\begin{split} |p_{\theta'}(s_t)-p_{\theta}(s_t)| &= \big(1-(1-\epsilon)^t\big)\,|p_{\text{mistake}}(s_t)-p_\theta(s_t)| \\ &\le 2\big(1-(1-\epsilon)^t\big) \\ &\le 2 \epsilon t \end{split}$$

Bounding the objective value

Assume $|p_{\theta'}(s_t)-p_{\theta}(s_t)| \le 2\epsilon t$. Then, for any function $f$:

$$\begin{split} \mathbb{E}_{s_t \sim p_{\theta'}(s_t)}[f(s_t)] &= \sum_{s_t} p_{\theta'}(s_t)f(s_t) \\ &\ge \sum_{s_t} p_{\theta}(s_t)f(s_t) - |p_{\theta}(s_t)-p_{\theta'}(s_t)|\max_{s_t}f(s_t) \\ &\ge \mathbb{E}_{p_\theta(s_t)}[f(s_t)] - 2\epsilon t \max_{s_t} f(s_t) \end{split}$$

We can then bound the objective value by:

$$\begin{split} \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\Big[\sum_t \gamma^t A^{\pi_\theta}(s_t,a_t)\Big] &= \sum_t \mathbb{E}_{s_t \sim p_{\theta'}(s_t)} \Big[\underbrace{\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)}}_{\text{importance sampling}} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] \\ & \ge \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)} \Big[\mathbb{E}_{a_t \sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta '}(a_t|s_t)}{\pi_\theta(a_t|s_t)} \gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] - \sum_{t} 2\epsilon t C \end{split}$$

where $C$ is a constant bounding the quantity inside the state expectation. It is a probability ratio times an advantage, so it is on the order of the largest possible reward times the number of time steps.

So:

$$C \in O(T r_{\max}) \ \text{ or } \ \underbrace{O\Big(\frac{r_{\max}}{1-\gamma}\Big)}_{\text{infinite horizon with discount (geometric series)}}$$

Therefore, for small enough $\epsilon$, $\bar{A}(\theta')$ is guaranteed to be close to the true RL objective improvement.

Policy Gradients with Constraints

We want to constrain:

$$|\pi_{\theta'}(a_t|s_t) - \pi_{\theta}(a_t|s_t)| \le \epsilon$$

which will result in

$$|p_{\theta'}(s_t)-p_\theta(s_t)| \le 2\epsilon t$$

A more convenient bound:

$$|\pi_{\theta'}(a_t|s_t) - \pi_{\theta}(a_t|s_t)| \le \sqrt{\frac{1}{2} D_{KL}\big(\pi_{\theta'}(a_t|s_t)\,\|\,\pi_{\theta}(a_t|s_t)\big)}$$

So the KL divergence bounds the policy difference, and hence the state-marginal difference.

Note:

$$D_{KL}\big(p_1(x)\,\|\,p_2(x)\big) = \mathbb{E}_{x \sim p_1(x)} \Big[\log \frac{p_1(x)}{p_2(x)}\Big]$$
👉
KL Divergence has very convenient properties that make it much easier to approximate
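For example, the KL divergence can be estimated directly from samples of $p_1$; the sketch below (two arbitrary 1-D Gaussians, my choice of numbers) checks the Monte Carlo estimate against the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.0, 1.0, 0.5, 1.5

def log_pdf(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

x = rng.normal(mu1, s1, size=200_000)                       # samples from p1
kl_mc = (log_pdf(x, mu1, s1) - log_pdf(x, mu2, s2)).mean()  # E_{x~p1}[log p1 - log p2]
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5
print(kl_mc, kl_exact)                                      # the two agree closely
```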

So, constraining our objective, we have:

$$\theta' \leftarrow \argmax_{\theta'} \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] \quad \text{s.t. } D_{KL}\big(\pi_{\theta '}(a_t|s_t)\,\|\,\pi_\theta(a_t|s_t)\big) \le \epsilon$$

For small $\epsilon$, this is guaranteed to improve $J(\theta') - J(\theta)$.

👉
We can handle the constraint by writing out the Lagrangian and performing (approximate) convex optimization:

$$\mathcal{L}(\theta',\lambda) = \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big] - \lambda\Big(D_{KL}\big(\pi_{\theta '}(a_t|s_t)\,\|\,\pi_\theta(a_t|s_t)\big) - \epsilon\Big)$$

"Dual Gradient Descent": alternate between:

  1. Maximize $\mathcal{L}(\theta', \lambda)$ with respect to $\theta'$
  2. $\lambda \leftarrow \lambda + \alpha\big(D_{KL}(\pi_{\theta'}(a_t|s_t)\,\|\,\pi_\theta(a_t|s_t)) - \epsilon\big)$

Intuition: raise $\lambda$ if the constraint is violated too much.
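A tiny runnable sketch of this alternation on a made-up 1-D problem, maximizing $-(x-3)^2$ subject to $x^2 \le 1$, which stands in for the policy objective and the KL constraint (all step sizes are arbitrary):

```python
x, lam = 0.0, 0.0
for _ in range(2000):
    # 1. (approximately) maximize L(x, lambda) w.r.t. x with one gradient step
    grad_x = -2.0 * (x - 3.0) - lam * 2.0 * x
    x += 0.01 * grad_x
    # 2. gradient ascent on lambda: raise it when the constraint x^2 <= 1 is violated
    lam = max(0.0, lam + 0.1 * (x ** 2 - 1.0))
print(x, lam)   # x converges near the constrained optimum x = 1
```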

Natural Gradient

As before, we denote $\bar{A}(\theta')$ as

$$\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_{\theta}(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta} (s_t,a_t)\Big]\Big]$$
When we do gradient ascent, we are effectively optimizing the first-order Taylor expansion of the objective within a small region (a trust region) around the current parameters.

Can we use this fact to simplify the optimization step and use one easy algorithm to take care of this constrained optimization problem?

$$\theta' \leftarrow \argmax_{\theta'} \nabla_{\theta} \bar{A}(\theta)^{\top}(\theta'-\theta) \quad \text{s.t. } D_{KL}\big(\pi_{\theta'}(a_t|s_t)\,\|\,\pi_{\theta}(a_t|s_t)\big) \le \epsilon$$

Let's look at the gradient of $\bar{A}(\theta')$:

$$\nabla_{\theta'}\bar{A}(\theta') = \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\, A^{\pi_\theta}(s_t,a_t)\Big]\Big]$$

Evaluating this at $\theta' = \theta$:

$$\begin{split} \nabla_{\theta}\bar{A}(\theta) &= \sum_t \mathbb{E}_{s_t \sim p_\theta(s_t)}\Big[\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\big[\gamma^t \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t,a_t)\big]\Big] \\ &= \nabla_\theta J(\theta) \end{split}$$

Can we just use the gradient then?

Remember that in the original policy gradient algorithm, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.

But some parameters change the action probabilities much more than others; plain gradient ascent does not account for this, since we are not considering the KL divergence here.

⚠️
So we want to either parameterize the policy so that all parameters have comparable effects, or adapt the learning rate for each parameter.

But this problem can be fixed. Notice that

gradient ascent still poses some kind of constraint: $\|\theta' - \theta\|^2 \le \epsilon$.

Think of it in parameter space: we are constrained to stay within some fixed distance (inside a circle) of the original parameters ⇒ indeed, the learning rate $\alpha = \sqrt{\frac{\epsilon}{\|\nabla_\theta J(\theta)\|^2}}$ makes the gradient step satisfy this constraint.

But with the KL-divergence constraint, we want an ellipse in parameter space (i.e., a circle in distribution space).

We will use a second-order Taylor expansion to approximate the KL divergence and make it easier to compute:

$$D_{KL}(\pi_{\theta '} \| \pi_\theta) \approx (\theta '-\theta)^{\top}\underbrace{F}_{\text{Fisher information matrix}}(\theta ' - \theta)$$
$$F = \mathbb{E}_{\pi_{\theta}}\big[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^{\top}\big]$$

Now we see that the shape of the ellipse in parameter space is determined by the matrix $F$.

We can estimate this expectation for $F$ using samples.

So therefore!

Using this approximation, our update becomes

$$\theta' \leftarrow \argmax_{\theta '} (\theta ' - \theta)^{\top}\nabla_\theta J(\theta) \quad \text{s.t. } \|\theta ' - \theta\|_F^2 \le \epsilon$$

With this, we have the "natural gradient":

$$\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)$$
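A minimal sketch of estimating $F$ from samples and computing the natural gradient direction (the random arrays are stand-ins for the real per-sample $\nabla_\theta \log \pi_\theta(a|s)$ and the policy gradient; the small ridge term is my addition for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 5000, 6
grad_log_pi = rng.normal(size=(N, dim))   # stand-in for per-sample grad_theta log pi_theta(a|s)
pg = rng.normal(size=dim)                 # stand-in for the ordinary policy gradient grad_theta J

F = grad_log_pi.T @ grad_log_pi / N       # sample estimate of E[g g^T]
nat_grad = np.linalg.solve(F + 1e-6 * np.eye(dim), pg)   # F^{-1} grad J without forming F^{-1}
```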

If we want to enforce the constraint exactly, we can choose the step size accordingly.

We can calculate the step size $\alpha$ by

$$\alpha = \sqrt{\frac{2 \epsilon}{\nabla_\theta J(\theta)^{\top} F\, \nabla_\theta J(\theta)}}$$

Or run the conjugate gradient method to compute $F^{-1} \nabla_\theta J(\theta)$ without explicitly inverting $F$, and get $\alpha$ as a by-product (see the Trust Region Policy Optimization paper).

Several algorithms use this trick:

  1. Natural gradient: pick $\alpha$
  2. Trust region policy optimization: pick $\epsilon$
    1. Can solve for the optimal $\alpha$ while solving $F^{-1} \nabla_\theta J(\theta)$
    2. Conjugate gradient works well for this

Recommended Readings

  1. Natural policy gradient
    1. Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust Region Policy Optimization.
    2. Peters, Schaal. Reinforcement learning of motor skills with policy gradients.
  2. Using the importance-sampled (IS) objective directly
    1. Proximal Policy Optimization (PPO)

Is this even a problem in practice?

👉
It's a pretty big problem, especially with continuous action spaces. The natural gradient is generally a good choice to stabilize policy gradient training.

With this kind of problem, vanilla policy gradients do point in the right direction, but they are extremely ill-conditioned (the gradient is too busy decreasing $\sigma$).