Exploration

Exploration (Lec 13)

It’s hard for algorithms to know what the rules of the environment are. The card game “Mao” is a good analogy:

  1. The only rule you may be told is this one
  1. Incur a penalty when you break a rule
  1. Can only discover rules through trial and error
  1. Rules don’t always make sense to you

So here is the definition of the exploration problem:

  1. How can an agent discover high-reward strategies that require a temporally extended sequence of complex behaviors that, individually, are not rewarding?
  1. How can an agent decide whether to attempt new behaviors or continue doing the best thing it knows so far?

So, can we derive an optimal exploration strategy?

Tractable (i.e., easy to reason about) - whether it is easy to know if an exploration policy is optimal or not

  1. Multi-armed bandits (1-step, stateless RL) / contextual bandits (1-step, but with a state)
    1. Can formalize exploration as POMDP (partially observed MDP: although there is only one time step, performing actions expands our knowledge) identification
    1. Policy learning is trivial even with the POMDP
  1. Small, finite MDPs
    1. Can frame as Bayesian model identification, and reason explicitly about the value of information
  1. Large or infinite MDPs
    1. Optimal methods don’t work
      1. Exploring with random actions: we oscillate back and forth and might never reach a coherent or interesting place
    1. But we can take inspiration from optimal methods in smaller settings
    1. Use hacks

Bandits

The Drosophila (fruit fly, the classic model organism) of exploration problems
A “one-armed bandit” is a slot machine
Multi-armed bandit ⇒ we have to decide which machine (arm) to play; pulling an arm and observing a low reward doesn’t necessarily mean it was a bad decision

So the definition of the bandit:

Assume $r(a_i) \sim p_{\theta_i}(r_i)$

e.g. $p(r_i = 1) = \theta_i$ and $p(r_i = 0) = 1 - \theta_i$

where (unknown to the agent):

$\theta_i \sim p(\theta)$

This defines a POMDP with $s = [\theta_1, \dots, \theta_n]$

Belief state is $\hat{p}(\theta_1, \dots, \theta_n)$

Solving the POMDP yields the optimal exploration strategy

But this is overkill: the belief state is huge! ⇒ we can do very well with much simpler strategies

We define regret as the cumulative difference from the optimal action up to time step $T$:

$\mathrm{Reg}(T) = T\,\mathbb{E}[r(a^*)] - \sum_{t=1}^T r(a_t)$

How do we beat the bandit?

  1. Variety of relatively simple strategies
  1. Often can provide theoretical guarantees on regret
    1. Empirical performance may vary
  1. Exploration strategies for more complex MDP domains will be inspired by these strategies

Optimistic exploration: $\mathrm{Reg}(T) \in O(\log T)$

  1. Keep track of the average reward $\hat{\mu}_a$ for each action $a$
  1. Exploitation: pick $a = \argmax \hat{\mu}_a$
  1. Optimistic estimate: $a = \argmax \hat{\mu}_a + C \sigma_a$
    1. “Try each arm until you are sure it’s not great”
    1. UCB: $a = \argmax \hat{\mu}_a + \sqrt{\frac{2 \ln T}{N(a)}}$, where $N(a)$ is the number of times we have picked this action (see the sketch below)
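
A minimal sketch of the UCB rule on a Bernoulli bandit; the arm probabilities and horizon below are made-up illustration values:

```python
import numpy as np

def ucb_bandit(true_probs, T=10_000, seed=0):
    """Run UCB on a Bernoulli bandit with (unknown) per-arm success probabilities."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    counts = np.zeros(n_arms)        # N(a): how many times each arm was pulled
    means = np.zeros(n_arms)         # empirical mean reward per arm (mu_hat)
    total_reward = 0.0

    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                                    # pull each arm once to initialize
        else:
            a = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
        r = float(rng.random() < true_probs[a])          # Bernoulli reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]           # running average of rewards
        total_reward += r

    empirical_regret = T * max(true_probs) - total_reward
    return empirical_regret, counts

# Regret should grow roughly like log(T) rather than linearly.
print(ucb_bandit([0.2, 0.5, 0.7]))
```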

To put this into practice (e.g., in an MDP), use $r^+(s,a) = r(s,a) + B(N(s))$ ⇒ easy to add in, but the weight on $B(N(s))$ is hard to tune.

Problem: in continuous state spaces we never see the same state twice ⇒ but some states are more similar than others

Bellemare et al. “Unifying Count-Based Exploration …”

So we will:

  1. Fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$)
  1. $p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states
  1. But how can the density $p_\theta$ correspond to a count?

Notice that the true (empirical) probability is $P(s) = \frac{N(s)}{n}$, and after seeing $s$, $P'(s) = \frac{N(s) + 1}{n+1}$

How do we get $\hat{N}(s)$? Require the density model to satisfy the same two relations:

$p_\theta(s_i) = \frac{\hat{N}(s_i)}{\hat{n}}, \quad p_\theta'(s_i) = \frac{\hat{N}(s_i)+1}{\hat{n}+1}$

Solving these two equations, we find

$\hat{N}(s_i) = \hat{n}\, p_\theta(s_i), \quad \hat{n} = \frac{1 - p_\theta'(s_i)}{p_\theta'(s_i)-p_\theta(s_i)}$

Different classes of bonuses:

  1. UCB bonus: $B(N(s)) = \sqrt{\frac{2 \ln n}{N(s)}}$
  1. MBIE-EB (Strehl & Littman, 2008): $B(N(s)) = \sqrt{\frac{1}{N(s)}}$
  1. BEB (Kolter & Ng, 2009): $B(N(s)) = \frac{1}{N(s)}$
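
A small sketch tying the pseudo-count recovery above to one of these bonuses (MBIE-EB); the before/after density values are hypothetical placeholders rather than outputs of a real density model:

```python
import math

def pseudo_count(p_before, p_after):
    """Recover N_hat(s) from the density model's value at s before (p_before)
    and after (p_after) updating the model on s, using the relations above."""
    n_hat = (1.0 - p_after) / (p_after - p_before)
    return n_hat * p_before                      # N_hat(s) = n_hat * p_theta(s)

def mbie_eb_bonus(N_hat):
    """MBIE-EB style bonus B(N) = sqrt(1 / N)."""
    return math.sqrt(1.0 / N_hat)

# Hypothetical densities: the model assigned 0.001 to s before the update, 0.002 after.
N_hat = pseudo_count(0.001, 0.002)
print(N_hat, mbie_eb_bonus(N_hat))               # roughly "seen once", so a large bonus
```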

So the model needs to be able to output densities, but it doesn’t necessarily need to produce great samples

Opposite considerations from many popular generative models like GANs

Bellemare et al. “CTS” Model: Condition each pixel on its top-left neighborhood

Other models: Stochastic neural networks, compression length, EX2

Probability Matching / Posterior Sampling (Thompson Sampling)

Assume $r(a_i) \sim p_{\theta_i}(r_i)$ ⇒ this defines a POMDP with $s = [\theta_1, \dots, \theta_n]$

Belief state is $\hat{p}(\theta_1, \dots, \theta_n)$

Idea:

  1. Sample $\theta_1, \dots, \theta_n \sim \hat{p}(\theta_1, \dots, \theta_n)$
  1. Pretend the model $\theta_1, \dots, \theta_n$ is correct
  1. Take the optimal action
  1. Update the model
Chapelle & Li, “An Empirical Evaluation of Thompson Sampling.”

⇒ Hard to analyze theoretically but can work very well empirically
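
A minimal sketch of Thompson sampling on a Bernoulli bandit with Beta posteriors; the arm probabilities are made up for illustration:

```python
import numpy as np

def thompson_bernoulli(true_probs, T=10_000, seed=0):
    """Thompson sampling on a Bernoulli bandit: keep a Beta posterior per arm,
    sample a model, act greedily as if the sample were correct, then update."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    alpha = np.ones(n_arms)                      # Beta posterior: 1 + #successes
    beta = np.ones(n_arms)                       # Beta posterior: 1 + #failures
    total_reward = 0.0

    for _ in range(T):
        theta = rng.beta(alpha, beta)            # 1. sample theta ~ p_hat(theta)
        a = int(np.argmax(theta))                # 2./3. pretend it's correct, act optimally
        r = float(rng.random() < true_probs[a])
        alpha[a] += r                            # 4. update the model (posterior)
        beta[a] += 1.0 - r
        total_reward += r

    return T * max(true_probs) - total_reward    # empirical regret

print(thompson_bernoulli([0.2, 0.5, 0.7]))
```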

Osband et al. “Deep Exploration via Bootstrapped DQN”
  1. What do we sample?
  1. How do we represent the distribution?

Bootstrap ensembles:

  1. Given a dataset $D$, resample with replacement $N$ times to get $D_1, \dots, D_N$
  1. Train each model $f_{\theta_i}$ on $D_i$
  1. To sample from $p(\theta)$, sample $i \in [1, \dots, N]$ and use $f_{\theta_i}$
Training $N$ big neural nets is expensive; can we avoid it? In the paper, they trained a single net with multiple heads ⇒ this may seem undesirable since the heads’ outputs are now correlated, but they may be uncorrelated enough to be useful (see the sketch below).
🧙🏽‍♂️
This might work better than epsilon-greedy because instead of possibly oscillating back and forth (and never going anywhere interesting), we commit to a randomized but internally consistent strategy for an entire episode
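
A minimal sketch of the shared-trunk, multi-head idea; the sizes and the per-episode head sampling below are illustrative assumptions, not the paper’s exact setup:

```python
import torch
import torch.nn as nn

class MultiHeadQ(nn.Module):
    """Shared trunk with K independent Q-heads; sampling one head per episode
    approximates sampling a Q-function from the (bootstrapped) posterior."""
    def __init__(self, obs_dim, n_actions, n_heads=10, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_heads)]
        )

    def forward(self, obs, head_idx):
        return self.heads[head_idx](self.trunk(obs))

# At the start of each episode: sample one head, then act greedily with it all episode.
q = MultiHeadQ(obs_dim=8, n_actions=4)
head = torch.randint(len(q.heads), ()).item()
obs = torch.randn(1, 8)
action = q(obs, head).argmax(dim=-1)
print(head, action.item())
```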

Methods that use Information Gain

Bayesian Experimental Design

Suppose we want to determine some latent variable $z$; $z$ might be the optimal action or its value.

Which action do we take?

Let $H(\hat{p}(z))$ be the current entropy of our $z$ estimate

Let $H(\hat{p}(z)|y)$ be the entropy of our $z$ estimate after observation $y$ ($y$ might be $r(a)$ in the case of RL)

The lower the entropy, the more precisely we know $z$

$IG(z,y) = \mathbb{E}_y [H(\hat{p}(z)) - H(\hat{p}(z)|y)]$

This typically depends on the action, so we will use the notation $IG(z, y|a) = \mathbb{E}_y [H(\hat{p}(z)) - H(\hat{p}(z)|y) \mid a]$

🧙🏽‍♂️
One important thing is to decide what $y$ is in our problem and what $z$ is
Russo & Van Roy “Learning to Optimize via Information-Directed Sampling”
$y = r_a, \quad z = \theta_a$

Observe reward and learn parameters of $\hat{p}(r_a)$

$g(a) = IG(\theta_a, r_a|a)$ → information gain of $a$
$\Delta(a) = \mathbb{E}[r(a^*) - r(a)]$ → expected suboptimality of $a$

We want to gain more information but we don’t want our policy to be suboptimal

So we will choose $a$ according to

$\argmin_a \frac{\Delta(a)^2}{g(a)}$
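
A minimal sketch of this selection rule, assuming we already have per-action estimates of $\Delta(a)$ and $g(a)$; the numbers are made up for illustration:

```python
import numpy as np

def ids_action(expected_suboptimality, info_gain, eps=1e-8):
    """Information-directed sampling rule: pick the action minimizing
    Delta(a)^2 / g(a), trading off suboptimality against information gain."""
    delta = np.asarray(expected_suboptimality, dtype=float)
    g = np.asarray(info_gain, dtype=float)
    return int(np.argmin(delta ** 2 / (g + eps)))

# Hypothetical estimates for 3 arms: arm 0 is somewhat suboptimal but very
# informative, so the rule picks it over the nearly optimal but uninformative arm 2.
print(ids_action(expected_suboptimality=[0.3, 0.2, 0.05], info_gain=[0.5, 0.1, 0.001]))
```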

Houthooft et al. “VIME”

Question:

  1. Information gain about what?
    1. Reward $r(s,a)$
      1. Not useful as reward is sparse
    1. State density $p(s)$
      1. A bit strange, but somewhat makes sense!
    1. Dynamics $p(s'|s,a)$
      1. Good proxy for learning the MDP, though still heuristic

Generally, these quantities are intractable to use exactly, regardless of what is being estimated

A few approximations:

  1. Prediction Gain (Schmidhuber ‘91, Bellemare ‘16)
    1. $\log p_\theta'(s) - \log p_\theta(s)$
    1. If the density changed a lot, then the state is novel
  1. Variational Inference (Houthooft et al. “VIME”)
    1. IG can be equivalently written as $D_{KL}(p(z|y), p(z))$
    1. Learn about transitions $p_\theta(s_{t+1}|s_t,a_t)$: $z = \theta$
    1. $y = (s_t, a_t, s_{t+1})$
    1. Intuition: a transition is more informative if it causes the belief over $\theta$ to change
    1. Idea:
      1. Use variational inference to estimate $q(\theta|\phi) \approx p(\theta|h)$
        1. Specifically, optimize the variational lower bound $D_{KL}(q(\theta|\phi), p(h|\theta) p(\theta))$
        1. Represent $q(\theta|\phi)$ as a product of independent Gaussian parameter distributions with mean $\phi$
          1. $p(\theta|D) = \prod_i p(\theta_i |D)$
          1. $p(\theta_i|D) = N(\underbrace{\mu_i, \sigma_i^2}_{\phi})$
        1. See Blundell et al. “Weight Uncertainty in Neural Networks”
      1. Given a new transition $(s,a,s')$, update $\phi$ to get $\phi'$
      1. Use $D_{KL}(q(\theta|\phi'), q(\theta|\phi))$ as the approximate bonus (see the sketch below)
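
A minimal sketch of that bonus for the factorized-Gaussian case above: the KL divergence between the updated posterior $q(\theta|\phi')$ and the old one $q(\theta|\phi)$, with placeholder means and standard deviations rather than a real Bayesian neural network:

```python
import numpy as np

def diag_gaussian_kl(mu_new, sigma_new, mu_old, sigma_old):
    """KL( q(theta|phi') || q(theta|phi) ) for factorized Gaussian posteriors,
    summed over parameters; this is the quantity used as the exploration bonus."""
    var_new, var_old = sigma_new ** 2, sigma_old ** 2
    per_param = (np.log(sigma_old / sigma_new)
                 + (var_new + (mu_new - mu_old) ** 2) / (2.0 * var_old)
                 - 0.5)
    return float(np.sum(per_param))

# Hypothetical posteriors over 3 dynamics-model weights, before and after one
# update on a new transition (s, a, s'): the bigger the shift, the bigger the bonus.
mu_old, sigma_old = np.zeros(3), np.ones(3)
mu_new, sigma_new = np.array([0.1, 0.0, -0.2]), np.array([0.9, 1.0, 0.95])
print(diag_gaussian_kl(mu_new, sigma_new, mu_old, sigma_old))
```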

Counting with Hashes

What if we still count states, but compress them into hashes and count those?

  1. Compress $s$ into a k-bit code via $\phi(s)$, then count $N(\phi(s))$
  1. Shorter codes = more hash collisions
  1. Do similar states get the same hash? Maybe
    1. Depends on the model we choose
    1. We can use an autoencoder to improve the odds of this (see the sketch below)
Tang et al. “#Exploration: A Study of Count-Based Exploration”
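
A minimal sketch of hash-based counting using a random-projection (SimHash-style) code; the learned autoencoder features are omitted, and the feature dimension, code length, and bonus scale are arbitrary choices:

```python
import numpy as np
from collections import Counter

class SimHashCounter:
    """Count states by a k-bit locality-sensitive hash of their features:
    phi(s) = sign(A s), so nearby states tend to fall into the same bucket."""
    def __init__(self, feature_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, feature_dim))   # random projection
        self.counts = Counter()

    def update_and_bonus(self, features, beta=1.0):
        code = tuple((self.A @ features > 0).astype(np.int8))
        self.counts[code] += 1
        return beta / np.sqrt(self.counts[code])         # bonus ~ 1 / sqrt(N(phi(s)))

counter = SimHashCounter(feature_dim=16)
s = np.random.default_rng(1).standard_normal(16)
print(counter.update_and_bonus(s))          # first visit: bonus 1.0
print(counter.update_and_bonus(s + 1e-3))   # a tiny perturbation usually collides => smaller bonus
```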

Implicit Density Modeling with Exemplar Models

Fu et al. “EX2: Exploration with Exemplar Models…”

We need $p_\theta(s)$ to be able to output densities, but it doesn’t necessarily need to produce great samples ⇒ can we explicitly compare the new state to past states?

🧙🏽‍♂️
Intuition: the state is novel if it is easy to distinguish from all previously seen states by a classifier

For each observed state $s$, fit a classifier to classify that state against all past states $D$, and use the classifier error to obtain a density:

$p_\theta(s) = \frac{1-D_s(s)}{D_s(s)}$

where $D_s(s)$ is the probability that the classifier assigns to $s$ being a “positive”

In reality, every state is unique so we regularize the classifier
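
A tiny sketch of how a classifier output would be converted into the implied density above; the value of $D_s(s)$ is a placeholder rather than the output of a trained, regularized exemplar classifier:

```python
def ex2_density(d_s, eps=1e-6):
    """Implied density of state s, given D_s(s): the probability the exemplar
    classifier assigns to s being the 'positive' (i.e., itself) vs. past states."""
    d_s = min(max(d_s, eps), 1.0 - eps)      # avoid division by zero
    return (1.0 - d_s) / d_s

# A very novel state is easy to distinguish (D_s(s) near 1) => low implied density.
print(ex2_density(0.95))   # novel state: small density, big exploration bonus
print(ex2_density(0.55))   # familiar-looking state: larger density
```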

Heuristic estimation of counts via errors (DNF)

We just need to tell whether a state is novel or not

Given a buffer $D = \{(s_i,a_i)\}$, fit $\hat{f}_\theta(s,a)$

We then basically use the error between $\hat{f}_\theta(s,a)$ and the actual $f^*(s,a)$ as a bonus:

$\epsilon(s,a) = \|\hat{f}_\theta(s,a)-f^*(s,a)\|^2$

But what function should we use for $f^*$?

  1. $f^*(s,a) = s'$ (i.e., predict the next state)
  1. $f^*(s,a)$ as a neural network, but with parameters chosen randomly (see the sketch below)
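
A minimal sketch of the second option, in the spirit of random network distillation: a frozen, randomly initialized target network plays the role of $f^*$ and a trained predictor plays $\hat{f}_\theta$; for simplicity it conditions only on $s$, and the sizes are arbitrary:

```python
import torch
import torch.nn as nn

def make_net(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

obs_dim, feat_dim = 8, 32
target = make_net(obs_dim, feat_dim)      # f*: randomly initialized and frozen
predictor = make_net(obs_dim, feat_dim)   # f_hat: trained to match f* on visited states
for p in target.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def novelty_bonus(states):
    """epsilon(s) = ||f_hat(s) - f*(s)||^2; large for states unlike those in the buffer."""
    with torch.no_grad():
        y = target(states)
    return ((predictor(states) - y) ** 2).sum(dim=-1)

def train_step(states):
    # Fit the predictor on states from the buffer; the error then shrinks for
    # familiar states and stays large for novel ones.
    opt.zero_grad()
    novelty_bonus(states).mean().backward()
    opt.step()

buffer_states = torch.randn(64, obs_dim)   # stand-in for states from the buffer D
train_step(buffer_states)
print(novelty_bonus(torch.randn(4, obs_dim)).detach())
```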

We mentioned how $D_{KL}(q(\theta|\phi'), q(\theta|\phi))$ can be seen as information gain; it can also be viewed as a change in the network parameters $\phi$

So if we forget about IG, there are many other ways to measure this

Stadie et al. 2015:

  1. encode image observations using auto-encoder
  1. build predictive model on auto-encoder latent states
  1. use model error as exploration bonus

Schmidhuber et al. (see, e.g., “Formal Theory of Creativity, Fun, and Intrinsic Motivation”):

  1. exploration bonus for model error
  1. exploration bonus for model gradient
  1. many other variations

Suggested Readings

  1. Schmidhuber. (1992). A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers.
  1. Stadie, Levine, Abbeel (2015). Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models.
  1. Osband, Blundell, Pritzel, Van Roy. (2016). Deep Exploration via Bootstrapped DQN.
  1. Houthooft, Chen, Duan, Schulman, De Turck, Abbeel. (2016). VIME: Variational Information Maximizing Exploration.
  1. Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos. (2016). Unifying Count-Based Exploration and Intrinsic Motivation.
  1. Tang, Houthooft, Foote, Stooke, Chen, Duan, Schulman, De Turck, Abbeel. (2016). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.
  1. Fu, Co-Reyes, Levine. (2017). EX2: Exploration with Exemplar Models for Deep Reinforcement Learning.

Unsupervised learning of diverse behaviors

What if we want to recover diverse behavior without any reward function at all?
One possible approach is to describe a target goal, and the machine just tries to reach that goal.
Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. ’18; Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. ‘19

$\bar{x}$ may not be equal to $x_g$ at all (here $x_g$ is the imagined goal proposed by the generative model and $\bar{x}$ is the state actually reached)… but we will do some updates regardless.

However, this means the model tends to generate states that we’ve already seen (since we sample from $p(z)$) ⇒ we are not exploring very well

How do we reach diverse goals?

We will modify step 4 ⇒ instead of doing MLE, $\theta, \phi \leftarrow \argmax_{\theta, \phi} \mathbb{E}[\log p(\bar{x})]$, we will do a weighted MLE:

$\theta,\phi \leftarrow \argmax_{\theta,\phi} \mathbb{E}[w(\bar{x}) \log p(\bar{x})]$
We want to assign a greater weight to rarely seen states, and since we’ve used a generative model for $p_\theta(x|z)$, we can use
$w(\bar{x}) = p_\theta(\bar{x})^{\alpha}, \quad \alpha \in [-1,0)$

⇒ Note that for any $\alpha \in [-1,0)$, the entropy $H(p_\theta(x))$ increases ⇒ proposing broader and broader goals
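
A tiny sketch of these Skew-Fit-style weights; the density values are placeholders standing in for $p_\theta(\bar{x})$ from the generative model:

```python
import numpy as np

def skew_weights(densities, alpha=-1.0):
    """Skew-Fit style weights w(x) = p_theta(x)^alpha with alpha in [-1, 0):
    rarely generated states (low density) get larger weight in the weighted
    maximum-likelihood update of the generative model."""
    assert -1.0 <= alpha < 0.0
    p = np.asarray(densities, dtype=float)
    w = p ** alpha
    return w / w.sum()                       # normalize into a sampling distribution

# Hypothetical model densities for four reached states: the rare one dominates.
print(skew_weights([0.4, 0.3, 0.25, 0.05]))
```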

The RL formulation:

$\max H(p(G))$
  1. $\pi(a|S,G)$ is trained to reach goal $G$
    1. As $\pi$ gets better, the final state $S$ gets closer to $G$ ⇒ $p(G|S)$ becomes more deterministic

To maximize exploration,

$\max H(p(G)) - H(p(G|S)) = \max I(S;G)$

This is mainly for gathering data that leads to a uniform density over the state distribution ⇒ it doesn’t mean the policy itself goes everywhere randomly; what if we want a policy that does?

But is state-entropy really a good objective?
Eysenbach’s theorem: at test time, an adversary will choose the worst goal $G$ ⇒ then the best goal distribution to use for training is $p(G) = \argmax_p H(p(G))$. See Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration.

State Marginal Matching (SMM)

Learn $\pi(a|s)$ to minimize $D_{KL}(p_\pi(s), p^*(s))$

If we want to use intrinsic motivation, we have:

$\tilde{r}(s) = \log p^*(s) - \log p_\pi(s)$

Note that this reward objective sums to exactly the state marginal matching objective above ⇒ however, RL is not aware that $-\log p_\pi(s)$ depends on $\pi$ ⇒ a tail-chasing problem

We can prove that this $\pi^*(a|s)$ is a Nash equilibrium of a two-player game

Learning Diverse Skills

$\pi(a|s,\underbrace{z}_{\text{task index}})$

$\pi(a|s) = \argmax_{\pi} \sum_z \mathbb{E}_{s \sim \pi(s|z)} [r(s,z)]$

We want the states rewarded under task $z$ to be unlikely for other tasks $z' \ne z$; one way:

Use a classifier that gives a softmax output over the possible $z$’s:

$r(s,z) = \log p(z|s)$

Connection to mutual information:

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Gregor et al. Variational Intrinsic Control. 2016
$I(z,s) = \underbrace{H(z)}_{\text{maximized by using a uniform prior $p(z)$}} - \underbrace{H(z|s)}_{\text{minimized by maximizing $\log p(z|s)$}}$
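
A minimal sketch of the diversity reward $r(s,z) = \log p(z|s)$ using a small skill discriminator; the network sizes are arbitrary and the discriminator/policy training loops are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillDiscriminator(nn.Module):
    """Classifier q(z | s): softmax over skill indices given the state."""
    def __init__(self, obs_dim, n_skills, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_skills)
        )

    def forward(self, s):
        return self.net(s)   # logits over z

n_skills, obs_dim = 8, 4
disc = SkillDiscriminator(obs_dim, n_skills)

def diversity_reward(s, z):
    """r(s, z) = log q(z | s): high when the visited state identifies the skill."""
    log_probs = F.log_softmax(disc(s), dim=-1)
    return log_probs[torch.arange(len(z)), z]

s = torch.randn(5, obs_dim)               # batch of states visited under skills z
z = torch.randint(n_skills, (5,))
print(diversity_reward(s, z).detach())
```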