RL Theory

📌
Effective analysis is very hard in RL without strong assumptions. The trick is to make assumptions that admit interesting conclusions without divorcing us (too much) from reality.

What is the point:

  1. Prove that our RL algorithms will work perfectly every time
    1. Usually not possible with current deep RL methods, which are often not even guaranteed to converge
  1. Understand how errors are affected by problem parameters
    1. Do larger discounts work better than smaller ones?
    1. If we want half the error, do we need 2x the samples, 4x, or something else?
    1. Usually we use precise theory to get imprecise qualitative conclusions about how various factors influence the performance of RL algorithms under strong assumptions, and we try to make those assumptions reasonable enough that the conclusions are likely to apply to real problems (though this is not guaranteed)

⚠️
Don’t take someone seriously if they say their RL algorithm has “provable guarantees”: the assumptions are always unrealistic, and theory is at best a rough guide to what might happen.

Exploration

Performance of RL is greatly complicated by exploration - how likely are we to find potentially sparse rewards?

Theoretical guarantees typically address worst case performance ⇒ but worst case exploration is extremely hard
🧙🏽‍♂️
Goal: Show that the exploration method is $\text{Poly}(|S|, |A|, 1/(1-\gamma))$

Basic Sample Complexity Analysis

RL Theory Textbook. Agarwal, Jiang, Kakade, Sun.
https://rltheorybook.github.io

“Oracle Exploration”: for every $(s,a)$, sample $s' \sim P(s'|s,a)$ $N$ times

Simple “model-based” algorithm

  1. $\hat{P}(s'|s,a) = \frac{\#(s,a,s')}{N}$
    1. Count number of transitions
  1. Given $\pi$, use $\hat{P}$ to estimate $\hat{Q}^\pi$
  1. We want to measure worst case performance
    1. How close is $\hat{Q}^\pi$ to $Q^\pi$?
    1. How close is $\hat{Q}^*$ (the optimal Q-function learned under $\hat{P}$) to $Q^*$ (the optimal Q-function under the real $P$)?
    1. How good is the resulting policy?
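
A minimal sketch of this procedure in the tabular setting (all names here are illustrative assumptions, not from the lecture: `sample_next_state(s, a)` stands in for the sampling oracle, the reward table `r[s, a]` is assumed known, and `pi[s, a]` holds $\pi(a|s)$):

```python
import numpy as np

def estimate_model(sample_next_state, num_states, num_actions, N):
    """'Oracle exploration': sample every (s, a) pair exactly N times and count transitions."""
    counts = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            for _ in range(N):
                counts[s, a, sample_next_state(s, a)] += 1
    return counts / N  # P_hat(s'|s,a) = #(s,a,s') / N

def evaluate_policy(P_hat, r, pi, gamma, num_iters=1000):
    """Estimate Q^pi under the learned model by iterating the Bellman equation
    Q(s,a) <- r(s,a) + gamma * sum_{s'} P_hat(s'|s,a) * sum_{a'} pi(a'|s') Q(s',a')."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(num_iters):
        V = (pi * Q).sum(axis=1)        # V(s') = E_{a' ~ pi}[Q(s', a')]
        Q = r + gamma * (P_hat @ V)     # (S, A, S) @ (S,) sums over s'
    return Q
```

Comparing `evaluate_policy(P_hat, ...)` against the same evaluation under the true model is exactly the "how close is $\hat{Q}^\pi$ to $Q^\pi$" question analyzed below.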

“How close is $\hat{Q}^\pi$ to $Q^\pi$?”

$$
\begin{split}
\|Q^\pi - \hat{Q}^\pi\|_\infty &\le \epsilon \\
\max_{s,a} |Q^\pi(s,a) - \hat{Q}^\pi(s,a)| &\le \epsilon
\end{split}
$$

“How close is $\hat{Q}^*$ to $Q^*$?”

$$\|Q^* - \hat{Q}^*\|_\infty \le \epsilon$$

“How good is the resulting policy?”

$$\|Q^* - Q^{\hat{\pi}}\|_\infty \le \epsilon$$

Concentration Inequalities

How fast our estimate of a random variable converges to its true underlying value (in terms of the number of samples)

Hoeffding’s Inequality (For continuous distributions)

Suppose $X_1, \dots, X_n$ are a sequence of independent, identically distributed (i.i.d.) random variables with mean $\mu$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Suppose that $X_i \in [b_-, b_+]$ with probability $1$. Then

$$\mathbb{P}(\bar{X}_n \ge \mu + \epsilon) \le \exp\left\{-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right\}$$

Therefore, if we estimate $\mu$ with $n$ samples, the probability that we are off by more than $\epsilon$ is at most $2\exp\left\{-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right\}$ (since we can be off on either side). So if we want this probability to be $\delta$:

$$
\begin{split}
\delta &\le 2\exp\left\{-\frac{2n\epsilon^2}{(b_+-b_-)^2}\right\} \\
\log\frac{\delta}{2} &\le -\frac{2n\epsilon^2}{(b_+-b_-)^2} \\
\epsilon^2 &\le \frac{(b_+-b_-)^2}{2n}\log\frac{2}{\delta} \\
\epsilon &\le \frac{b_+-b_-}{\sqrt{2n}}\sqrt{\log\frac{2}{\delta}}
\end{split}
$$
⚠️
Error $\epsilon$ scales as $\frac{1}{\sqrt{n}}$
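
A quick numeric illustration of this scaling (purely illustrative assumptions: i.i.d. uniform samples on $[0,1]$, so $b_-=0$, $b_+=1$, and $\delta = 0.05$):

```python
import numpy as np

rng = np.random.default_rng(0)
delta, b_minus, b_plus = 0.05, 0.0, 1.0

for n in [100, 400, 1600]:
    # Hoeffding bound: eps <= (b_plus - b_minus) * sqrt(log(2/delta) / (2n))
    eps_bound = (b_plus - b_minus) * np.sqrt(np.log(2 / delta) / (2 * n))
    # Empirical error of the sample mean over many repetitions; the bound should
    # hold for at least a (1 - delta) fraction of them.
    errors = np.abs(rng.uniform(b_minus, b_plus, size=(2000, n)).mean(axis=1) - 0.5)
    print(f"n={n:5d}  bound={eps_bound:.4f}  empirical (1-delta)-quantile={np.quantile(errors, 1 - delta):.4f}")
```

Quadrupling $n$ halves the bound, which is the $1/\sqrt{n}$ scaling noted above.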

Concentration for Discrete Distributions

Let $z$ be a discrete random variable that takes values in $\{1, 2, \dots, d\}$, distributed according to $q$. We can write $q$ as a vector $\vec{q} = [\mathbb{P}(z=j)]_{j=1}^d$. Assume we have $N$ i.i.d. samples, and that our empirical estimate of $\vec{q}$ is $[\hat{q}]_j = \sum_{i=1}^N \mathbf{1}\{z_i=j\}/N$.

We have that $\forall \epsilon > 0$,

$$\mathbb{P}\left(\|\hat{\vec{q}} - \vec{q}\|_2 \ge \frac{1}{\sqrt{N}} + \epsilon\right) \le \exp\{-N\epsilon^2\}$$

Which implies:

$$\mathbb{P}\left(\|\hat{\vec{q}} - \vec{q}\|_1 \ge \sqrt{d}\left(\frac{1}{\sqrt{N}} + \epsilon\right)\right) \le \exp\{-N\epsilon^2\}$$

Let

$$\delta = \mathbb{P}\left(\|\hat{\vec{q}} - \vec{q}\|_1 \ge \sqrt{d}\left(\frac{1}{\sqrt{N}} + \epsilon\right)\right)$$

Then

$$
\begin{split}
\delta &\le \exp\{-N\epsilon^2\} \\
\epsilon &\le \frac{1}{\sqrt{N}}\sqrt{\log\frac{1}{\delta}} \\
N &\le \frac{1}{\epsilon^2}\log\frac{1}{\delta}
\end{split}
$$

Using these concentration inequalities, we see that in the “Basic Sample Complexity Analysis” above, the error in the estimated transition distribution is bounded (with probability at least $1-\delta$) by

$$
\begin{split}
\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 &\le \sqrt{|S|}\left(\frac{1}{\sqrt{N}} + \epsilon\right) \\
&\le \sqrt{\frac{|S|}{N}} + \sqrt{\frac{|S|\log(1/\delta)}{N}} \\
&\le c\sqrt{\frac{|S|\log(1/\delta)}{N}}
\end{split}
$$
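
A small empirical check of this $\ell_1$ bound for a single $(s,a)$ pair (illustrative only: a random true next-state distribution and $N$ oracle samples):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, N, delta = 20, 500, 0.05

p_true = rng.dirichlet(np.ones(num_states))              # true P(.|s,a)
samples = rng.choice(num_states, size=N, p=p_true)       # N oracle draws of s'
p_hat = np.bincount(samples, minlength=num_states) / N   # P_hat(.|s,a) = counts / N

l1_error = np.abs(p_hat - p_true).sum()
bound = np.sqrt(num_states) * (1 / np.sqrt(N) + np.sqrt(np.log(1 / delta) / N))
print(f"l1 error = {l1_error:.3f}, bound (holds w.p. >= 1 - delta) = {bound:.3f}")
```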

Now: relate the error in $\hat{P}$ to the error in $\hat{Q}^\pi$.

Recall:

$$
\begin{split}
Q^\pi(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(s'|s,a)}[V^\pi(s')] \\
Q^\pi(s,a) &= r(s,a) + \gamma\sum_{s'} P(s'|s,a)\, V^\pi(s')
\end{split}
$$

So if we write out the transition function $P$, the Q-function $Q^\pi$, the reward function $r$, and the value function $V^\pi$ as matrices/vectors:

Dimensions: $Q^\pi, r \in \mathbb{R}^{|S||A|}$, $P \in \mathbb{R}^{|S||A| \times |S|}$, $V^\pi \in \mathbb{R}^{|S|}$
$$Q^\pi = r + \gamma P V^\pi$$

Remember that the $V$ function is just an expectation over the $Q$ function (a sum), so we can also write it as a matrix product, where $\Pi \in \mathbb{R}^{|S| \times (|S||A|)}$ is a probability matrix encoding the policy:

$$V^\pi = \Pi Q^\pi$$

With this (and $P^\pi = P\Pi$),

$$Q^\pi = r + \gamma P^\pi Q^\pi$$

So

$$
\begin{split}
(I - \gamma P^\pi) Q^\pi &= r \\
Q^\pi &= (I - \gamma P^\pi)^{-1} r
\end{split}
$$

And it turns out the matrix $(I - \gamma P^\pi)$ is always invertible.
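
One way to see this, filled in here as a standard argument (not spelled out in the notes): since every row of $P^\pi$ sums to one and $\gamma < 1$, the Neumann series converges, which gives both invertibility and an interpretation of the inverse:

$$
(I - \gamma P^\pi)^{-1} = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t
$$

i.e., applying $(I - \gamma P^\pi)^{-1}$ to a reward vector accumulates expected discounted rewards along the policy's trajectories. This is exactly the "evaluation operator" that appears in the simulation lemma below.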

Simulation Lemma:

$$
Q^\pi - \hat{Q}^\pi = \gamma \overbrace{(I - \gamma\hat{P}^\pi)^{-1}}^{\text{evaluation operator: turns a reward function into a Q-function}} \underbrace{(P - \hat{P})}_{\text{difference in model probabilities}} V^\pi
$$

Proof:

$$
\begin{split}
Q^\pi - \hat{Q}^\pi &= Q^\pi - (I-\gamma\hat{P}^\pi)^{-1} r \\
&= (I-\gamma\hat{P}^\pi)^{-1}(I-\gamma\hat{P}^\pi) Q^\pi - (I-\gamma\hat{P}^\pi)^{-1} r \\
&= (I-\gamma\hat{P}^\pi)^{-1}(I-\gamma\hat{P}^\pi) Q^\pi - (I-\gamma\hat{P}^\pi)^{-1}(I-\gamma P^\pi) Q^\pi \\
&= (I-\gamma\hat{P}^\pi)^{-1}\left[(I-\gamma\hat{P}^\pi) - (I-\gamma P^\pi)\right] Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P^\pi - \hat{P}^\pi) Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P\Pi - \hat{P}\Pi) Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P - \hat{P})\Pi Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P - \hat{P}) V^\pi
\end{split}
$$
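
A numerical sanity check of this identity on a small random MDP (a sketch; the sizes, policy, and the "wrong" model $\hat{P}$ below are arbitrary assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A)).reshape(S * A, S)      # true model, rows indexed by (s,a)
P_hat = rng.dirichlet(np.ones(S), size=(S, A)).reshape(S * A, S)  # an arbitrary "wrong" model
r = rng.uniform(0, 1, size=S * A)
pi = rng.dirichlet(np.ones(A), size=S)                            # pi[s, a] = pi(a|s)

# Pi in R^{S x (S*A)}: Pi[s', (s', a')] = pi(a'|s')
Pi = np.zeros((S, S * A))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]

P_pi, P_hat_pi = P @ Pi, P_hat @ Pi
Q_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r)           # Q^pi = (I - gamma P^pi)^{-1} r
Q_hat_pi = np.linalg.solve(np.eye(S * A) - gamma * P_hat_pi, r)
V_pi = Pi @ Q_pi                                                  # V^pi = Pi Q^pi

lhs = Q_pi - Q_hat_pi
rhs = gamma * np.linalg.solve(np.eye(S * A) - gamma * P_hat_pi, (P - P_hat) @ V_pi)
print("max |lhs - rhs| =", np.abs(lhs - rhs).max())               # numerically zero: the identity holds
```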

Another Lemma:

Given $P^\pi$ and any vector $\vec{v} \in \mathbb{R}^{|S||A|}$, we have

$$\|(I - \gamma P^\pi)^{-1}\vec{v}\|_\infty \le \frac{\|\vec{v}\|_\infty}{1-\gamma}$$

The “Q-function” corresponding to the “reward” $\vec{v}$ is at most $(1-\gamma)^{-1}$ times larger.

Where does this $(1-\gamma)^{-1}$ come from? The geometric series $\sum_{t=0}^\infty \gamma^t c = \frac{c}{1-\gamma}$.

Let

$$\vec{w} = (I - \gamma P^\pi)^{-1}\vec{v}$$

Then

$$
\begin{split}
\|\vec{v}\|_\infty &= \|(I - \gamma P^\pi)\vec{w}\|_\infty \\
&\ge \|\vec{w}\|_\infty - \gamma\|P^\pi \vec{w}\|_\infty \qquad \text{(triangle inequality)} \\
&\ge \|\vec{w}\|_\infty - \gamma\|\vec{w}\|_\infty \qquad \text{(rows of } P^\pi \text{ sum to 1)} \\
&= (1-\gamma)\|\vec{w}\|_\infty
\end{split}
$$

Putting the lemmas together,

$$
\|(I - \gamma \hat{P}^\pi)^{-1}\vec{v}\|_\infty \le \frac{\|\vec{v}\|_\infty}{1-\gamma}
$$

(the lemma applies equally with $\hat{P}^\pi$ in place of $P^\pi$)

Take the special case $\vec{v} = \gamma(P - \hat{P})V^\pi$:

$$
\begin{split}
Q^\pi - \hat{Q}^\pi &= \gamma(I - \gamma\hat{P}^\pi)^{-1}(P - \hat{P})V^\pi \\
\|Q^\pi - \hat{Q}^\pi\|_\infty &= \|\gamma(I - \gamma\hat{P}^\pi)^{-1}(P - \hat{P})V^\pi\|_\infty \\
&\le \frac{\gamma}{1-\gamma}\|(P - \hat{P})V^\pi\|_\infty \\
&\le \frac{\gamma}{1-\gamma}\left(\max_{s,a}\|P(\cdot|s,a) - \hat{P}(\cdot|s,a)\|_1\right)\|V^\pi\|_\infty
\end{split}
$$

We can bound $\|V^\pi\|_\infty$:

$$\sum_{t=0}^\infty \gamma^t r_t \le \frac{1}{1-\gamma} R_{\max}$$

With this bound,

$$\|Q^\pi - \hat{Q}^\pi\|_\infty \le \frac{\gamma}{(1-\gamma)^2} R_{\max}\left(\max_{s,a}\|P(\cdot|s,a) - \hat{P}(\cdot|s,a)\|_1\right)$$

With the previous bound on

$$
\begin{split}
\|P(\cdot|s,a) - \hat{P}(\cdot|s,a)\|_1 &\le \sqrt{\frac{|S|}{N}} + \sqrt{\frac{|S|\log(1/\delta)}{N}} \\
&\le c\sqrt{\frac{|S|\log(1/\delta)}{N}}
\end{split}
$$

We conclude:

$$\|Q^\pi - \hat{Q}^\pi\|_\infty \le \epsilon = \frac{\gamma}{(1-\gamma)^2} R_{\max}\, c_2 \sqrt{\frac{|S|\log(\delta^{-1})}{N}}$$
⚠️
Error grows quadratically in the horizon, as $(1-\gamma)^{-2}$; each backup accumulates error
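
Inverting the bound (a direct rearrangement, written out here to make the scaling explicit) gives the number of samples needed per $(s,a)$ pair to guarantee error at most $\epsilon$:

$$
N \ge \frac{c_2^2\, \gamma^2 R_{\max}^2\, |S|\, \log(\delta^{-1})}{(1-\gamma)^4\, \epsilon^2}
$$

So halving the error costs $4\times$ the samples (answering the earlier question), and the total sample count $N|S||A|$ is $\text{Poly}(|S|, |A|, 1/(1-\gamma))$, as promised at the start.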

What about $\|Q^* - \hat{Q}^*\|_\infty$?

Supremum: the least upper bound of a function. One useful identity:

$$\left|\sup_x f(x) - \sup_x g(x)\right| \le \sup_x |f(x) - g(x)|$$
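
A one-line justification (standard, added for completeness): for any $x$, $f(x) \le g(x) + |f(x) - g(x)| \le \sup_x g(x) + \sup_x |f(x) - g(x)|$; taking the supremum over $x$ on the left gives $\sup_x f(x) - \sup_x g(x) \le \sup_x |f(x) - g(x)|$, and swapping $f$ and $g$ gives the other direction.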

And

$$\|Q^* - \hat{Q}^*\|_\infty = \left\|\sup_\pi Q^\pi - \sup_\pi \hat{Q}^\pi\right\|_\infty \le \sup_\pi \|Q^\pi - \hat{Q}^\pi\|_\infty \le \epsilon$$

What about $\|Q^* - Q^{\hat{\pi}^*}\|_\infty$?

$$
\begin{split}
\|Q^* - Q^{\hat{\pi}^*}\|_\infty &= \|Q^* - \hat{Q}^{\hat{\pi}^*} + \hat{Q}^{\hat{\pi}^*} - Q^{\hat{\pi}^*}\|_\infty \\
&\le \|Q^* - \underbrace{\hat{Q}^{\hat{\pi}^*}}_{=\,\hat{Q}^*}\|_\infty + \|\underbrace{\hat{Q}^{\hat{\pi}^*} - Q^{\hat{\pi}^*}}_{\text{same policy}}\|_\infty \\
&\le 2\epsilon
\end{split}
$$

Fitted Q-Iteration

Abstract model of exact Q-iteration

$$Q_{k+1} \leftarrow TQ_k = r + \gamma P \max_a Q_k$$

Abstract model of approximate fitted Q-iteration:

$$\hat{Q}_{k+1} \leftarrow \arg\min_{\hat{Q}} \|\hat{Q} - \hat{T}\hat{Q}_k\|$$

Where $\hat{T}$ is the approximate Bellman operator:

$$
\hat{T}Q = \underbrace{\hat{r}}_{\hat{r}(s,a) = \frac{1}{N(s,a)}\sum_i \delta((s_i,a_i) = (s,a))\, r_i} + \gamma \overbrace{\hat{P}}^{\hat{P}(s'|s,a) = \frac{N(s,a,s')}{N(s,a)}} \max_{a'} Q
$$

Note: we never explicitly compute $\hat{P}$ and $\hat{r}$; they are just the implicit effect of averaging together the transitions in the data (different samples produce different gradients).

We will assume an infinity norm here, because there is no convergence guarantee when using the L2 norm.
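
A minimal tabular sketch of this procedure (illustrative assumptions, not from the lecture: `dataset` is a list of $(s, a, r, s')$ tuples in which every $(s,a)$ pair appears at least once, and the "fit" is exact because the Q-function is a table, so only sampling error remains):

```python
import numpy as np
from collections import defaultdict

def fitted_q_iteration(dataset, num_states, num_actions, gamma, num_iters=100):
    # Group transitions by (s, a). Applying T_hat to Q averages over each group:
    # T_hat Q(s, a) = mean_i [ r_i + gamma * max_{a'} Q(s'_i, a') ].
    by_sa = defaultdict(list)
    for s, a, r, s_next in dataset:
        by_sa[(s, a)].append((r, s_next))

    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        targets = np.zeros_like(Q)
        for (s, a), transitions in by_sa.items():
            targets[s, a] = np.mean([r + gamma * Q[s_next].max()
                                     for r, s_next in transitions])
        # Exact tabular "fit": the argmin of ||Q - T_hat Q_k|| is T_hat Q_k itself.
        Q = targets
    return Q
```

With function approximation, the assignment `Q = targets` would instead be a regression step, which is where the approximation error $\epsilon_k$ below comes from.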

Errors come from two sources:

Sampling Error:

$$
\begin{split}
|\hat{T}Q(s,a) - TQ(s,a)| &= \left|\hat{r}(s,a) - r(s,a) + \gamma\left(\mathbb{E}_{\hat{P}}[\max_{a'} Q(s',a')] - \mathbb{E}_{P}[\max_{a'} Q(s',a')]\right)\right| \\
&\le |\hat{r}(s,a) - r(s,a)| + \gamma\left|\mathbb{E}_{\hat{P}}[\max_{a'} Q(s',a')] - \mathbb{E}_{P}[\max_{a'} Q(s',a')]\right|
\end{split}
$$

Note: the first term is controlled by Hoeffding's inequality (rewards are bounded by $R_{\max}$), and the second by the $\ell_1$ concentration bound on $\hat{P}$, since $|\mathbb{E}_{\hat{P}}[f] - \mathbb{E}_{P}[f]| \le \|\hat{P} - P\|_1 \|f\|_\infty$ and $\|\max_{a'} Q\|_\infty \le \|Q\|_\infty$.

So

$$|\hat{T}Q(s,a) - TQ(s,a)| \le 2R_{\max}\sqrt{\frac{\log(\delta^{-1})}{2N}} + c\,\|Q\|_\infty\sqrt{\frac{\log(\delta^{-1})}{N}}$$

And

$$\|\hat{T}Q - TQ\|_\infty \le 2R_{\max}\, c_1 \sqrt{\frac{\log\left(\frac{|S||A|}{\delta}\right)}{2N}} + c_2\,\|Q\|_\infty\sqrt{\frac{\log\left(\frac{|S|}{\delta}\right)}{N}}$$

Approximation Error:

Assume error: $\|\hat{Q}_{k+1} - T\hat{Q}_k\|_\infty \le \epsilon_k$

This is a strong assumption!

$$
\begin{split}
\|\hat{Q}_k - Q^*\|_\infty &= \|\hat{Q}_k - T\hat{Q}_{k-1} + T\hat{Q}_{k-1} - Q^*\|_\infty \\
&= \|(\hat{Q}_k - T\hat{Q}_{k-1}) + (T\hat{Q}_{k-1} - \underbrace{TQ^*}_{Q^* \text{ is a fixed point of } T})\|_\infty \\
&\le \|\hat{Q}_k - T\hat{Q}_{k-1}\|_\infty + \|T\hat{Q}_{k-1} - TQ^*\|_\infty \\
&\le \epsilon_{k-1} + \underbrace{\gamma\|\hat{Q}_{k-1} - Q^*\|_\infty}_{T \text{ is a } \gamma\text{-contraction}} \\
&\le \sum_{i=0}^{k-1}\gamma^i \epsilon_{k-i-1} + \gamma^k\|\hat{Q}_0 - Q^*\|_\infty \qquad \text{(unrolling the recursion)}
\end{split}
$$

$$\lim_{k\to\infty}\|\hat{Q}_k - Q^*\|_\infty \le \sum_{i=0}^\infty \gamma^i \max_k \epsilon_k = \frac{1}{1-\gamma}\|\epsilon\|_\infty$$

Putting it together:

$$
\begin{split}
\|\hat{Q}_k - T\hat{Q}_{k-1}\|_\infty &= \|\hat{Q}_k - \hat{T}\hat{Q}_{k-1} + \hat{T}\hat{Q}_{k-1} - T\hat{Q}_{k-1}\|_\infty \\
&\le \underbrace{\|\hat{Q}_k - \hat{T}\hat{Q}_{k-1}\|_\infty}_{\text{approximation error}} + \underbrace{\|\hat{T}\hat{Q}_{k-1} - T\hat{Q}_{k-1}\|_\infty}_{\text{sampling error}}
\end{split}
$$

Note: from the previous proofs, $\text{Sampling Error}, \text{Approx. Error} \in O((1-\gamma)^{-1})$

More advanced results can be derived with $p$-norms under some distribution.