RL Theory

📌
Effective analysis is very hard in RL without strong assumptions. The trick is to make assumptions that admit interesting conclusions without divorcing us (too much) from reality.

What is the point:

  1. Prove that our RL algorithms will work perfectly every time
    1. Usually not possible with current deep RL methods, which are often not even guaranteed to converge
  1. Understand how errors are affected by problem parameters
    1. Do larger discounts work better than smaller ones?
    1. If we want half the error, do we need 2x the samples, 4x, or something else?
    1. Usually we use precise theory to get imprecise qualitative conclusions about how various factors influence the performance of RL algorithms under strong assumptions, and we try to make those assumptions reasonable enough that the conclusions are likely to apply to real problems (though this is not guaranteed)

⚠️
Don’t take someone seriously if they say their RL algorithm has “provable guarantees”: the assumptions are always unrealistic, and theory is at best a rough guide to what might happen.

Exploration

Performance of RL is greatly complicated by exploration - how likely are we to find potentially sparse rewards?

Theoretical guarantees typically address worst case performance ⇒ but worst case exploration is extremely hard
🧙🏽‍♂️
Goal: Show that the exploration method is $\text{Poly}(|S|, |A|, 1/(1-\gamma))$

Basic Sample Complexity Analysis

RL Theory Textbook. Agarwal, Jiang, Kakade, Sun.
https://rltheorybook.github.io

“Oracle Exploration”: for every $(s,a)$, sample $s' \sim P(s'|s,a)$ $N$ times

Simple “model-based” algorithm

  1. $\hat{P}(s'|s,a) = \frac{\#(s,a,s')}{N}$
    1. Count number of transitions
  1. Given $\pi$, use $\hat{P}$ to estimate $\hat{Q}^\pi$
  1. We want to measure worst case performance
    1. How close is $\hat{Q}^\pi$ to $Q^\pi$?
    1. How close is $\hat{Q}^*$ (the optimal Q-function learned under $\hat{P}$) to $Q^*$ (the optimal Q-function under the real $P$)?
    1. How good is the resulting policy?
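
A minimal sketch of this procedure in the tabular setting (all names here are illustrative assumptions, not from the lecture: `sample_next_state(s, a)` stands in for the sampling oracle, the reward table `r[s, a]` is assumed known, and `pi[s, a]` holds $\pi(a|s)$):

```python
import numpy as np

def estimate_model(sample_next_state, num_states, num_actions, N):
    """'Oracle exploration': sample every (s, a) pair exactly N times and count transitions."""
    counts = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            for _ in range(N):
                counts[s, a, sample_next_state(s, a)] += 1
    return counts / N  # P_hat(s'|s,a) = #(s,a,s') / N

def evaluate_policy(P_hat, r, pi, gamma, num_iters=1000):
    """Estimate Q^pi under the learned model by iterating the Bellman equation
    Q(s,a) <- r(s,a) + gamma * sum_{s'} P_hat(s'|s,a) * sum_{a'} pi(a'|s') Q(s',a')."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(num_iters):
        V = (pi * Q).sum(axis=1)        # V(s') = E_{a' ~ pi}[Q(s', a')]
        Q = r + gamma * (P_hat @ V)     # (S, A, S) @ (S,) sums over s'
    return Q
```

Comparing `evaluate_policy(P_hat, ...)` against the same evaluation under the true model is exactly the "how close is $\hat{Q}^\pi$ to $Q^\pi$" question analyzed below.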

“How close is $\hat{Q}^\pi$ to $Q^\pi$?”

$$
\begin{split}
\|Q^\pi - \hat{Q}^\pi\|_\infty &\le \epsilon \\
\max_{s,a} |Q^\pi(s,a) - \hat{Q}^\pi(s,a)| &\le \epsilon
\end{split}
$$

“How close is $\hat{Q}^*$ to $Q^*$?”

$$\|Q^* - \hat{Q}^*\|_\infty \le \epsilon$$

“How good is the resulting policy?”

$$\|Q^* - Q^{\hat{\pi}}\|_\infty \le \epsilon$$

Concentration Inequalities

How fast our estimate of a random variable converges to its true underlying value (in terms of the number of samples)

Hoeffding’s Inequality (For continuous distributions)

Suppose $X_1, \dots, X_n$ are a sequence of independent, identically distributed (i.i.d.) random variables with mean $\mu$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Suppose that $X_i \in [b_-, b_+]$ with probability $1$. Then

$$\mathbb{P}(\bar{X}_n \ge \mu + \epsilon) \le \exp\left\{-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right\}$$

Therefore, if we estimate $\mu$ with $n$ samples, the probability that we are off by more than $\epsilon$ is at most $2\exp\left\{-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right\}$ (since we can be off on either side). So if we want this probability to be $\delta$:

$$
\begin{split}
\delta &\le 2\exp\left\{-\frac{2n\epsilon^2}{(b_+-b_-)^2}\right\} \\
\log\frac{\delta}{2} &\le -\frac{2n\epsilon^2}{(b_+-b_-)^2} \\
\epsilon^2 &\le \frac{(b_+-b_-)^2}{2n}\log\frac{2}{\delta} \\
\epsilon &\le \frac{b_+-b_-}{\sqrt{2n}}\sqrt{\log\frac{2}{\delta}}
\end{split}
$$
⚠️
Error $\epsilon$ scales as $\frac{1}{\sqrt{n}}$
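
A quick numeric illustration of this scaling (purely illustrative assumptions: i.i.d. uniform samples on $[0,1]$, so $b_-=0$, $b_+=1$, and $\delta = 0.05$):

```python
import numpy as np

rng = np.random.default_rng(0)
delta, b_minus, b_plus = 0.05, 0.0, 1.0

for n in [100, 400, 1600]:
    # Hoeffding bound: eps <= (b_plus - b_minus) * sqrt(log(2/delta) / (2n))
    eps_bound = (b_plus - b_minus) * np.sqrt(np.log(2 / delta) / (2 * n))
    # Empirical error of the sample mean over many repetitions; the bound should
    # hold for at least a (1 - delta) fraction of them.
    errors = np.abs(rng.uniform(b_minus, b_plus, size=(2000, n)).mean(axis=1) - 0.5)
    print(f"n={n:5d}  bound={eps_bound:.4f}  empirical (1-delta)-quantile={np.quantile(errors, 1 - delta):.4f}")
```

Quadrupling $n$ halves the bound, which is the $1/\sqrt{n}$ scaling noted above.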

Concentration for Discrete Distributions

Let $z$ be a discrete random variable that takes values in $\{1, 2, \dots, d\}$, distributed according to $q$. We can write $q$ as a vector $\vec{q} = [\mathbb{P}(z=j)]_{j=1}^d$. Assume we have $N$ i.i.d. samples, and that our empirical estimate of $\vec{q}$ is $[\hat{q}]_j = \sum_{i=1}^N \mathbf{1}\{z_i=j\}/N$.

We have that $\forall \epsilon > 0$,

$$\mathbb{P}\left(\|\hat{\vec{q}} - \vec{q}\|_2 \ge \frac{1}{\sqrt{N}} + \epsilon\right) \le \exp\{-N\epsilon^2\}$$

Which implies:

$$\mathbb{P}\left(\|\hat{\vec{q}} - \vec{q}\|_1 \ge \sqrt{d}\left(\frac{1}{\sqrt{N}} + \epsilon\right)\right) \le \exp\{-N\epsilon^2\}$$

Let

$$\delta = \mathbb{P}\left(\|\hat{\vec{q}} - \vec{q}\|_1 \ge \sqrt{d}\left(\frac{1}{\sqrt{N}} + \epsilon\right)\right)$$

Then

$$
\begin{split}
\delta &\le \exp\{-N\epsilon^2\} \\
\epsilon &\le \frac{1}{\sqrt{N}}\sqrt{\log\frac{1}{\delta}} \\
N &\le \frac{1}{\epsilon^2}\log\frac{1}{\delta}
\end{split}
$$

Using these concentration inequalities, we see that in the “Basic Sample Complexity Analysis” above, the error in the estimated transition distribution is bounded (with probability at least $1-\delta$) by

$$
\begin{split}
\|\hat{P}(\cdot|s,a) - P(\cdot|s,a)\|_1 &\le \sqrt{|S|}\left(\frac{1}{\sqrt{N}} + \epsilon\right) \\
&\le \sqrt{\frac{|S|}{N}} + \sqrt{\frac{|S|\log(1/\delta)}{N}} \\
&\le c\sqrt{\frac{|S|\log(1/\delta)}{N}}
\end{split}
$$
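
A small empirical check of this $\ell_1$ bound for a single $(s,a)$ pair (illustrative only: a random true next-state distribution and $N$ oracle samples):

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, N, delta = 20, 500, 0.05

p_true = rng.dirichlet(np.ones(num_states))              # true P(.|s,a)
samples = rng.choice(num_states, size=N, p=p_true)       # N oracle draws of s'
p_hat = np.bincount(samples, minlength=num_states) / N   # P_hat(.|s,a) = counts / N

l1_error = np.abs(p_hat - p_true).sum()
bound = np.sqrt(num_states) * (1 / np.sqrt(N) + np.sqrt(np.log(1 / delta) / N))
print(f"l1 error = {l1_error:.3f}, bound (holds w.p. >= 1 - delta) = {bound:.3f}")
```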

Now: relate the error in $\hat{P}$ to the error in $\hat{Q}^\pi$.

Recall:

$$
\begin{split}
Q^\pi(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s' \sim P(s'|s,a)}[V^\pi(s')] \\
Q^\pi(s,a) &= r(s,a) + \gamma\sum_{s'} P(s'|s,a)\, V^\pi(s')
\end{split}
$$

So if we write out the transition function $P$, the Q-function $Q^\pi$, the reward function $r$, and the value function $V^\pi$ as matrices/vectors:

Dimensions: $Q^\pi, r \in \mathbb{R}^{|S||A|}$, $P \in \mathbb{R}^{|S||A| \times |S|}$, $V^\pi \in \mathbb{R}^{|S|}$
$$Q^\pi = r + \gamma P V^\pi$$

Remember that the $V$ function is just an expectation over the $Q$ function (a sum), so we can also write it as a matrix product, where $\Pi \in \mathbb{R}^{|S| \times (|S||A|)}$ is a probability matrix encoding the policy:

$$V^\pi = \Pi Q^\pi$$

With this (and $P^\pi = P\Pi$),

$$Q^\pi = r + \gamma P^\pi Q^\pi$$

So

$$
\begin{split}
(I - \gamma P^\pi) Q^\pi &= r \\
Q^\pi &= (I - \gamma P^\pi)^{-1} r
\end{split}
$$

And it turns out the matrix $(I - \gamma P^\pi)$ is always invertible.
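
One way to see this, filled in here as a standard argument (not spelled out in the notes): since every row of $P^\pi$ sums to one and $\gamma < 1$, the Neumann series converges, which gives both invertibility and an interpretation of the inverse:

$$
(I - \gamma P^\pi)^{-1} = \sum_{t=0}^{\infty} \gamma^t (P^\pi)^t
$$

i.e., applying $(I - \gamma P^\pi)^{-1}$ to a reward vector accumulates expected discounted rewards along the policy's trajectories. This is exactly the "evaluation operator" that appears in the simulation lemma below.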

Simulation Lemma:

$$
Q^\pi - \hat{Q}^\pi = \gamma \overbrace{(I - \gamma\hat{P}^\pi)^{-1}}^{\text{evaluation operator: turns a reward function into a Q-function}} \underbrace{(P - \hat{P})}_{\text{difference in model probabilities}} V^\pi
$$

Proof:

$$
\begin{split}
Q^\pi - \hat{Q}^\pi &= Q^\pi - (I-\gamma\hat{P}^\pi)^{-1} r \\
&= (I-\gamma\hat{P}^\pi)^{-1}(I-\gamma\hat{P}^\pi) Q^\pi - (I-\gamma\hat{P}^\pi)^{-1} r \\
&= (I-\gamma\hat{P}^\pi)^{-1}(I-\gamma\hat{P}^\pi) Q^\pi - (I-\gamma\hat{P}^\pi)^{-1}(I-\gamma P^\pi) Q^\pi \\
&= (I-\gamma\hat{P}^\pi)^{-1}\left[(I-\gamma\hat{P}^\pi) - (I-\gamma P^\pi)\right] Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P^\pi - \hat{P}^\pi) Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P\Pi - \hat{P}\Pi) Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P - \hat{P})\Pi Q^\pi \\
&= \gamma(I-\gamma\hat{P}^\pi)^{-1}(P - \hat{P}) V^\pi
\end{split}
$$
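
A numerical sanity check of this identity on a small random MDP (a sketch; the sizes, policy, and the "wrong" model $\hat{P}$ below are arbitrary assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(S), size=(S, A)).reshape(S * A, S)      # true model, rows indexed by (s,a)
P_hat = rng.dirichlet(np.ones(S), size=(S, A)).reshape(S * A, S)  # an arbitrary "wrong" model
r = rng.uniform(0, 1, size=S * A)
pi = rng.dirichlet(np.ones(A), size=S)                            # pi[s, a] = pi(a|s)

# Pi in R^{S x (S*A)}: Pi[s', (s', a')] = pi(a'|s')
Pi = np.zeros((S, S * A))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]

P_pi, P_hat_pi = P @ Pi, P_hat @ Pi
Q_pi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r)           # Q^pi = (I - gamma P^pi)^{-1} r
Q_hat_pi = np.linalg.solve(np.eye(S * A) - gamma * P_hat_pi, r)
V_pi = Pi @ Q_pi                                                  # V^pi = Pi Q^pi

lhs = Q_pi - Q_hat_pi
rhs = gamma * np.linalg.solve(np.eye(S * A) - gamma * P_hat_pi, (P - P_hat) @ V_pi)
print("max |lhs - rhs| =", np.abs(lhs - rhs).max())               # numerically zero: the identity holds
```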

Another Lemma:

Given $P^\pi$ and any vector $\vec{v} \in \mathbb{R}^{|S||A|}$, we have

$$\|(I - \gamma P^\pi)^{-1}\vec{v}\|_\infty \le \frac{\|\vec{v}\|_\infty}{1-\gamma}$$

The “Q-function” corresponding to the “reward” $\vec{v}$ is at most $(1-\gamma)^{-1}$ times larger.

Where does this $(1-\gamma)^{-1}$ come from? The geometric series $\sum_{t=0}^\infty \gamma^t c = \frac{c}{1-\gamma}$.

Let

$$\vec{w} = (I - \gamma P^\pi)^{-1}\vec{v}$$

Then

$$
\begin{split}
\|\vec{v}\|_\infty &= \|(I - \gamma P^\pi)\vec{w}\|_\infty \\
&\ge \|\vec{w}\|_\infty - \gamma\|P^\pi \vec{w}\|_\infty \qquad \text{(triangle inequality)} \\
&\ge \|\vec{w}\|_\infty - \gamma\|\vec{w}\|_\infty \qquad \text{(rows of } P^\pi \text{ sum to 1)} \\
&= (1-\gamma)\|\vec{w}\|_\infty
\end{split}
$$

Putting the lemmas together,

$$
\|(I - \gamma \hat{P}^\pi)^{-1}\vec{v}\|_\infty \le \frac{\|\vec{v}\|_\infty}{1-\gamma}
$$

(the lemma applies equally with $\hat{P}^\pi$ in place of $P^\pi$)

Take the special case $\vec{v} = \gamma(P - \hat{P})V^\pi$:

$$
\begin{split}
Q^\pi - \hat{Q}^\pi &= \gamma(I - \gamma\hat{P}^\pi)^{-1}(P - \hat{P})V^\pi \\
\|Q^\pi - \hat{Q}^\pi\|_\infty &= \|\gamma(I - \gamma\hat{P}^\pi)^{-1}(P - \hat{P})V^\pi\|_\infty \\
&\le \frac{\gamma}{1-\gamma}\|(P - \hat{P})V^\pi\|_\infty \\
&\le \frac{\gamma}{1-\gamma}\left(\max_{s,a}\|P(\cdot|s,a) - \hat{P}(\cdot|s,a)\|_1\right)\|V^\pi\|_\infty
\end{split}
$$

We can bound $\|V^\pi\|_\infty$:

$$\sum_{t=0}^\infty \gamma^t r_t \le \frac{1}{1-\gamma} R_{\max}$$

With this bound,

$$\|Q^\pi - \hat{Q}^\pi\|_\infty \le \frac{\gamma}{(1-\gamma)^2} R_{\max}\left(\max_{s,a}\|P(\cdot|s,a) - \hat{P}(\cdot|s,a)\|_1\right)$$

With the previous bound on

$$
\begin{split}
\|P(\cdot|s,a) - \hat{P}(\cdot|s,a)\|_1 &\le \sqrt{\frac{|S|}{N}} + \sqrt{\frac{|S|\log(1/\delta)}{N}} \\
&\le c\sqrt{\frac{|S|\log(1/\delta)}{N}}
\end{split}
$$

We conclude:

$$\|Q^\pi - \hat{Q}^\pi\|_\infty \le \epsilon = \frac{\gamma}{(1-\gamma)^2} R_{\max}\, c_2 \sqrt{\frac{|S|\log(\delta^{-1})}{N}}$$
⚠️
Error grows quadratically in the horizon, as $(1-\gamma)^{-2}$; each backup accumulates error
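
Inverting the bound (a direct rearrangement, written out here to make the scaling explicit) gives the number of samples needed per $(s,a)$ pair to guarantee error at most $\epsilon$:

$$
N \ge \frac{c_2^2\, \gamma^2 R_{\max}^2\, |S|\, \log(\delta^{-1})}{(1-\gamma)^4\, \epsilon^2}
$$

So halving the error costs $4\times$ the samples (answering the earlier question), and the total sample count $N|S||A|$ is $\text{Poly}(|S|, |A|, 1/(1-\gamma))$, as promised at the start.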

What about $\|Q^* - \hat{Q}^*\|_\infty$?

Supremum: the least upper bound of a function. One useful identity:

$$\left|\sup_x f(x) - \sup_x g(x)\right| \le \sup_x |f(x) - g(x)|$$
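
A one-line justification (standard, added for completeness): for any $x$, $f(x) \le g(x) + |f(x) - g(x)| \le \sup_x g(x) + \sup_x |f(x) - g(x)|$; taking the supremum over $x$ on the left gives $\sup_x f(x) - \sup_x g(x) \le \sup_x |f(x) - g(x)|$, and swapping $f$ and $g$ gives the other direction.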

And

$$\|Q^* - \hat{Q}^*\|_\infty = \left\|\sup_\pi Q^\pi - \sup_\pi \hat{Q}^\pi\right\|_\infty \le \sup_\pi \|Q^\pi - \hat{Q}^\pi\|_\infty \le \epsilon$$

What about $\|Q^* - Q^{\hat{\pi}^*}\|_\infty$?

$$
\begin{split}
\|Q^* - Q^{\hat{\pi}^*}\|_\infty &= \|Q^* - \hat{Q}^{\hat{\pi}^*} + \hat{Q}^{\hat{\pi}^*} - Q^{\hat{\pi}^*}\|_\infty \\
&\le \|Q^* - \underbrace{\hat{Q}^{\hat{\pi}^*}}_{=\,\hat{Q}^*}\|_\infty + \|\underbrace{\hat{Q}^{\hat{\pi}^*} - Q^{\hat{\pi}^*}}_{\text{same policy}}\|_\infty \\
&\le 2\epsilon
\end{split}
$$

Fitted Q-Iteration

Abstract model of exact Q-iteration

$$Q_{k+1} \leftarrow TQ_k = r + \gamma P \max_a Q_k$$

Abstract model of approximate fitted Q-iteration:

$$\hat{Q}_{k+1} \leftarrow \arg\min_{\hat{Q}} \|\hat{Q} - \hat{T}\hat{Q}_k\|$$

Where $\hat{T}$ is the approximate Bellman operator:

$$
\hat{T}Q = \underbrace{\hat{r}}_{\hat{r}(s,a) = \frac{1}{N(s,a)}\sum_i \delta((s_i,a_i) = (s,a))\, r_i} + \gamma \overbrace{\hat{P}}^{\hat{P}(s'|s,a) = \frac{N(s,a,s')}{N(s,a)}} \max_{a'} Q
$$

Note: we never explicitly compute $\hat{P}$ and $\hat{r}$; they are just the implicit effect of averaging together the transitions in the data (different samples produce different gradients).

We will assume an infinity norm here, because there is no convergence guarantee when using the L2 norm.
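
A minimal tabular sketch of this procedure (illustrative assumptions, not from the lecture: `dataset` is a list of $(s, a, r, s')$ tuples in which every $(s,a)$ pair appears at least once, and the "fit" is exact because the Q-function is a table, so only sampling error remains):

```python
import numpy as np
from collections import defaultdict

def fitted_q_iteration(dataset, num_states, num_actions, gamma, num_iters=100):
    # Group transitions by (s, a). Applying T_hat to Q averages over each group:
    # T_hat Q(s, a) = mean_i [ r_i + gamma * max_{a'} Q(s'_i, a') ].
    by_sa = defaultdict(list)
    for s, a, r, s_next in dataset:
        by_sa[(s, a)].append((r, s_next))

    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        targets = np.zeros_like(Q)
        for (s, a), transitions in by_sa.items():
            targets[s, a] = np.mean([r + gamma * Q[s_next].max()
                                     for r, s_next in transitions])
        # Exact tabular "fit": the argmin of ||Q - T_hat Q_k|| is T_hat Q_k itself.
        Q = targets
    return Q
```

With function approximation, the assignment `Q = targets` would instead be a regression step, which is where the approximation error $\epsilon_k$ below comes from.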

Errors come from two sources:

Sampling Error:

$$
\begin{split}
|\hat{T}Q(s,a) - TQ(s,a)| &= \left|\hat{r}(s,a) - r(s,a) + \gamma\left(\mathbb{E}_{\hat{P}}[\max_{a'} Q(s',a')] - \mathbb{E}_{P}[\max_{a'} Q(s',a')]\right)\right| \\
&\le |\hat{r}(s,a) - r(s,a)| + \gamma\left|\mathbb{E}_{\hat{P}}[\max_{a'} Q(s',a')] - \mathbb{E}_{P}[\max_{a'} Q(s',a')]\right|
\end{split}
$$

Note: the first term is controlled by Hoeffding's inequality (rewards are bounded by $R_{\max}$), and the second by the $\ell_1$ concentration bound on $\hat{P}$, since $|\mathbb{E}_{\hat{P}}[f] - \mathbb{E}_{P}[f]| \le \|\hat{P} - P\|_1 \|f\|_\infty$ and $\|\max_{a'} Q\|_\infty \le \|Q\|_\infty$.

So

$$|\hat{T}Q(s,a) - TQ(s,a)| \le 2R_{\max}\sqrt{\frac{\log(\delta^{-1})}{2N}} + c\,\|Q\|_\infty\sqrt{\frac{\log(\delta^{-1})}{N}}$$

And

$$\|\hat{T}Q - TQ\|_\infty \le 2R_{\max}\, c_1 \sqrt{\frac{\log\left(\frac{|S||A|}{\delta}\right)}{2N}} + c_2\,\|Q\|_\infty\sqrt{\frac{\log\left(\frac{|S|}{\delta}\right)}{N}}$$

Approximation Error:

Assume error: $\|\hat{Q}_{k+1} - T\hat{Q}_k\|_\infty \le \epsilon_k$

This is a strong assumption!

$$
\begin{split}
\|\hat{Q}_k - Q^*\|_\infty &= \|\hat{Q}_k - T\hat{Q}_{k-1} + T\hat{Q}_{k-1} - Q^*\|_\infty \\
&= \|(\hat{Q}_k - T\hat{Q}_{k-1}) + (T\hat{Q}_{k-1} - \underbrace{TQ^*}_{Q^* \text{ is a fixed point of } T})\|_\infty \\
&\le \|\hat{Q}_k - T\hat{Q}_{k-1}\|_\infty + \|T\hat{Q}_{k-1} - TQ^*\|_\infty \\
&\le \epsilon_{k-1} + \underbrace{\gamma\|\hat{Q}_{k-1} - Q^*\|_\infty}_{T \text{ is a } \gamma\text{-contraction}} \\
&\le \sum_{i=0}^{k-1}\gamma^i \epsilon_{k-i-1} + \gamma^k\|\hat{Q}_0 - Q^*\|_\infty \qquad \text{(unrolling the recursion)}
\end{split}
$$

$$\lim_{k\to\infty}\|\hat{Q}_k - Q^*\|_\infty \le \sum_{i=0}^\infty \gamma^i \max_k \epsilon_k = \frac{1}{1-\gamma}\|\epsilon\|_\infty$$

Putting it together:

$$
\begin{split}
\|\hat{Q}_k - T\hat{Q}_{k-1}\|_\infty &= \|\hat{Q}_k - \hat{T}\hat{Q}_{k-1} + \hat{T}\hat{Q}_{k-1} - T\hat{Q}_{k-1}\|_\infty \\
&\le \underbrace{\|\hat{Q}_k - \hat{T}\hat{Q}_{k-1}\|_\infty}_{\text{approximation error}} + \underbrace{\|\hat{T}\hat{Q}_{k-1} - T\hat{Q}_{k-1}\|_\infty}_{\text{sampling error}}
\end{split}
$$

Note: from the previous proofs, $\text{Sampling Error}, \text{Approx. Error} \in O((1-\gamma)^{-1})$

More advanced results can be derived with $p$-norms under some distribution.