Exploration

Exploration (Lec 13)

It’s hard for algorithms to know what the rules of the environment are. The card game “Mao” is a good analogy:

  1. The only rule you may be told is this one
  1. Incur a penalty when you break a rule
  1. Can only discover rules through trial and error
  1. Rules don’t always make sense to you

So here is the definition of the exploration problem:

  1. How can an agent discover high-reward strategies that require a temporally extended sequence of complex behaviors that, individually, are not rewarding?
  1. How can an agent decide whether to attempt new behaviors or continue doing the best thing it knows so far?

So, can we derive an optimal exploration strategy?

Tractable (i.e., easy to reason about) - whether it is easy to know if an exploration policy is optimal or not

  1. Multi-armed bandits (1-step, stateless RL) / contextual bandits (1-step, but with a state)
    1. Can formalize exploration as POMDP (partially observed MDP: although there is only one time step, performing actions expands our knowledge) identification
    1. Policy learning is trivial even with the POMDP
  1. Small, finite MDPs
    1. Can frame as Bayesian model identification, and reason explicitly about the value of information
  1. Large or infinite MDPs
    1. Optimal methods don’t work
      1. Exploring with random actions: we oscillate back and forth and might never reach a coherent or interesting place
    1. But we can take inspiration from optimal methods in smaller settings
    1. Use hacks

Bandits

The Drosophila (fruit fly, the classic model organism) of exploration problems
A “one-armed bandit” is a slot machine
Multi-armed bandit ⇒ we have to decide which machine (arm) to play; pulling an arm and observing a low reward doesn’t necessarily mean it was a bad decision

So the definition of the bandit:

Assume $r(a_i) \sim p_{\theta_i}(r_i)$

e.g. $p(r_i = 1) = \theta_i$ and $p(r_i = 0) = 1 - \theta_i$

where (unknown to the agent):

$\theta_i \sim p(\theta)$

This defines a POMDP with $s = [\theta_1, \dots, \theta_n]$

Belief state is $\hat{p}(\theta_1, \dots, \theta_n)$

Solving the POMDP yields the optimal exploration strategy

But this is overkill: the belief state is huge! ⇒ we can do very well with much simpler strategies

We define regret as the cumulative difference from the optimal action up to time step $T$:

$\mathrm{Reg}(T) = T\,\mathbb{E}[r(a^*)] - \sum_{t=1}^T r(a_t)$

How do we beat the bandit?

  1. Variety of relatively simple strategies
  1. Often can provide theoretical guarantees on regret
    1. Empirical performance may vary
  1. Exploration strategies for more complex MDP domains will be inspired by these strategies

Optimistic exploration: $\mathrm{Reg}(T) \in O(\log T)$

  1. Keep track of the average reward $\hat{\mu}_a$ for each action $a$
  1. Exploitation: pick $a = \argmax \hat{\mu}_a$
  1. Optimistic estimate: $a = \argmax \hat{\mu}_a + C \sigma_a$
    1. “Try each arm until you are sure it’s not great”
    1. UCB: $a = \argmax \hat{\mu}_a + \sqrt{\frac{2 \ln T}{N(a)}}$, where $N(a)$ is the number of times we have picked this action (see the sketch below)
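
A minimal sketch of the UCB rule on a Bernoulli bandit; the arm probabilities and horizon below are made-up illustration values:

```python
import numpy as np

def ucb_bandit(true_probs, T=10_000, seed=0):
    """Run UCB on a Bernoulli bandit with (unknown) per-arm success probabilities."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    counts = np.zeros(n_arms)        # N(a): how many times each arm was pulled
    means = np.zeros(n_arms)         # empirical mean reward per arm (mu_hat)
    total_reward = 0.0

    for t in range(1, T + 1):
        if t <= n_arms:
            a = t - 1                                    # pull each arm once to initialize
        else:
            a = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
        r = float(rng.random() < true_probs[a])          # Bernoulli reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]           # running average of rewards
        total_reward += r

    empirical_regret = T * max(true_probs) - total_reward
    return empirical_regret, counts

# Regret should grow roughly like log(T) rather than linearly.
print(ucb_bandit([0.2, 0.5, 0.7]))
```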

To put this into practice (e.g., in an MDP), use $r^+(s,a) = r(s,a) + B(N(s))$ ⇒ easy to add in, but the weight on $B(N(s))$ is hard to tune.

Problem: in continuous state spaces we never see the same state twice ⇒ but some states are more similar than others

Bellemare et al. “Unifying Count-Based Exploration …”

So we will:

  1. Fit a density model $p_\theta(s)$ (or $p_\theta(s,a)$)
  1. $p_\theta(s)$ might be high even for a new $s$ if $s$ is similar to previously seen states
  1. But how can the density $p_\theta$ correspond to a count?

Notice that the true (empirical) probability is $P(s) = \frac{N(s)}{n}$, and after seeing $s$, $P'(s) = \frac{N(s) + 1}{n+1}$

How do we get $\hat{N}(s)$? Require the density model to satisfy the same two relations:

$p_\theta(s_i) = \frac{\hat{N}(s_i)}{\hat{n}}, \quad p_\theta'(s_i) = \frac{\hat{N}(s_i)+1}{\hat{n}+1}$

Solving these two equations, we find

$\hat{N}(s_i) = \hat{n}\, p_\theta(s_i), \quad \hat{n} = \frac{1 - p_\theta'(s_i)}{p_\theta'(s_i)-p_\theta(s_i)}$

Different classes of bonuses:

  1. UCB bonus: $B(N(s)) = \sqrt{\frac{2 \ln n}{N(s)}}$
  1. MBIE-EB (Strehl & Littman, 2008): $B(N(s)) = \sqrt{\frac{1}{N(s)}}$
  1. BEB (Kolter & Ng, 2009): $B(N(s)) = \frac{1}{N(s)}$
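
A small sketch tying the pseudo-count recovery above to one of these bonuses (MBIE-EB); the before/after density values are hypothetical placeholders rather than outputs of a real density model:

```python
import math

def pseudo_count(p_before, p_after):
    """Recover N_hat(s) from the density model's value at s before (p_before)
    and after (p_after) updating the model on s, using the relations above."""
    n_hat = (1.0 - p_after) / (p_after - p_before)
    return n_hat * p_before                      # N_hat(s) = n_hat * p_theta(s)

def mbie_eb_bonus(N_hat):
    """MBIE-EB style bonus B(N) = sqrt(1 / N)."""
    return math.sqrt(1.0 / N_hat)

# Hypothetical densities: the model assigned 0.001 to s before the update, 0.002 after.
N_hat = pseudo_count(0.001, 0.002)
print(N_hat, mbie_eb_bonus(N_hat))               # roughly "seen once", so a large bonus
```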

So the model needs to be able to output densities, but it doesn’t necessarily need to produce great samples

Opposite considerations from many popular generative models like GANs

Bellemare et al. “CTS” Model: Condition each pixel on its top-left neighborhood

Other models: Stochastic neural networks, compression length, EX2

Probability Matching / Posterior Sampling (Thompson Sampling)

Assume $r(a_i) \sim p_{\theta_i}(r_i)$ ⇒ this defines a POMDP with $s = [\theta_1, \dots, \theta_n]$

Belief state is $\hat{p}(\theta_1, \dots, \theta_n)$

Idea:

  1. Sample $\theta_1, \dots, \theta_n \sim \hat{p}(\theta_1, \dots, \theta_n)$
  1. Pretend the model $\theta_1, \dots, \theta_n$ is correct
  1. Take the optimal action
  1. Update the model
Chapelle & Li, “An Empirical Evaluation of Thompson Sampling.”

⇒ Hard to analyze theoretically but can work very well empirically
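
A minimal sketch of Thompson sampling on a Bernoulli bandit with Beta posteriors; the arm probabilities are made up for illustration:

```python
import numpy as np

def thompson_bernoulli(true_probs, T=10_000, seed=0):
    """Thompson sampling on a Bernoulli bandit: keep a Beta posterior per arm,
    sample a model, act greedily as if the sample were correct, then update."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    alpha = np.ones(n_arms)                      # Beta posterior: 1 + #successes
    beta = np.ones(n_arms)                       # Beta posterior: 1 + #failures
    total_reward = 0.0

    for _ in range(T):
        theta = rng.beta(alpha, beta)            # 1. sample theta ~ p_hat(theta)
        a = int(np.argmax(theta))                # 2./3. pretend it's correct, act optimally
        r = float(rng.random() < true_probs[a])
        alpha[a] += r                            # 4. update the model (posterior)
        beta[a] += 1.0 - r
        total_reward += r

    return T * max(true_probs) - total_reward    # empirical regret

print(thompson_bernoulli([0.2, 0.5, 0.7]))
```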

Osband et al. “Deep Exploration via Bootstrapped DQN”
  1. What do we sample?
  1. How do we represent the distribution?

Bootstrap ensembles:

  1. Given a dataset $D$, resample with replacement $N$ times to get $D_1, \dots, D_N$
  1. Train each model $f_{\theta_i}$ on $D_i$
  1. To sample from $p(\theta)$, sample $i \in [1, \dots, N]$ and use $f_{\theta_i}$
Training $N$ big neural nets is expensive; can we avoid it? In the paper, they trained a single net with multiple heads ⇒ this may seem undesirable since the heads’ outputs are now correlated, but they may be uncorrelated enough to be useful (see the sketch below).
🧙🏽‍♂️
This might work better than epsilon-greedy because instead of possibly oscillating back and forth (and never going anywhere interesting), we commit to a randomized but internally consistent strategy for an entire episode
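
A minimal sketch of the shared-trunk, multi-head idea; the sizes and the per-episode head sampling below are illustrative assumptions, not the paper’s exact setup:

```python
import torch
import torch.nn as nn

class MultiHeadQ(nn.Module):
    """Shared trunk with K independent Q-heads; sampling one head per episode
    approximates sampling a Q-function from the (bootstrapped) posterior."""
    def __init__(self, obs_dim, n_actions, n_heads=10, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_heads)]
        )

    def forward(self, obs, head_idx):
        return self.heads[head_idx](self.trunk(obs))

# At the start of each episode: sample one head, then act greedily with it all episode.
q = MultiHeadQ(obs_dim=8, n_actions=4)
head = torch.randint(len(q.heads), ()).item()
obs = torch.randn(1, 8)
action = q(obs, head).argmax(dim=-1)
print(head, action.item())
```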

Methods that use Information Gain

Bayesian Experimental Design

Suppose we want to determine some latent variable $z$; $z$ might be the optimal action or its value.

Which action do we take?

Let $H(\hat{p}(z))$ be the current entropy of our $z$ estimate

Let $H(\hat{p}(z)|y)$ be the entropy of our $z$ estimate after observation $y$ ($y$ might be $r(a)$ in the case of RL)

The lower the entropy, the more precisely we know $z$

$IG(z,y) = \mathbb{E}_y [H(\hat{p}(z)) - H(\hat{p}(z)|y)]$

This typically depends on the action, so we will use the notation $IG(z, y|a) = \mathbb{E}_y [H(\hat{p}(z)) - H(\hat{p}(z)|y) \mid a]$

🧙🏽‍♂️
One important thing is to decide what $y$ is in our problem and what $z$ is
Russo & Van Roy “Learning to Optimize via Information-Directed Sampling”
$y = r_a, \quad z = \theta_a$

Observe reward and learn parameters of $\hat{p}(r_a)$

$g(a) = IG(\theta_a, r_a|a)$ → information gain of $a$
$\Delta(a) = \mathbb{E}[r(a^*) - r(a)]$ → expected suboptimality of $a$

We want to gain more information but we don’t want our policy to be suboptimal

So we will choose $a$ according to

$\argmin_a \frac{\Delta(a)^2}{g(a)}$
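
A minimal sketch of this selection rule, assuming we already have per-action estimates of $\Delta(a)$ and $g(a)$; the numbers are made up for illustration:

```python
import numpy as np

def ids_action(expected_suboptimality, info_gain, eps=1e-8):
    """Information-directed sampling rule: pick the action minimizing
    Delta(a)^2 / g(a), trading off suboptimality against information gain."""
    delta = np.asarray(expected_suboptimality, dtype=float)
    g = np.asarray(info_gain, dtype=float)
    return int(np.argmin(delta ** 2 / (g + eps)))

# Hypothetical estimates for 3 arms: arm 0 is somewhat suboptimal but very
# informative, so the rule picks it over the nearly optimal but uninformative arm 2.
print(ids_action(expected_suboptimality=[0.3, 0.2, 0.05], info_gain=[0.5, 0.1, 0.001]))
```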

Houthooft et al. “VIME”

Question:

  1. Information gain about what?
    1. Reward $r(s,a)$
      1. Not useful as reward is sparse
    1. State density $p(s)$
      1. A bit strange, but somewhat makes sense!
    1. Dynamics $p(s'|s,a)$
      1. Good proxy for learning the MDP, though still heuristic

Generally, these quantities are intractable to use exactly, regardless of what is being estimated

A few approximations:

  1. Prediction Gain (Schmidhuber ‘91, Bellemare ‘16)
    1. $\log p_\theta'(s) - \log p_\theta(s)$
    1. If the density changed a lot, then the state is novel
  1. Variational Inference (Houthooft et al. “VIME”)
    1. IG can be equivalently written as $D_{KL}(p(z|y), p(z))$
    1. Learn about transitions $p_\theta(s_{t+1}|s_t,a_t)$: $z = \theta$
    1. $y = (s_t, a_t, s_{t+1})$
    1. Intuition: a transition is more informative if it causes the belief over $\theta$ to change
    1. Idea:
      1. Use variational inference to estimate $q(\theta|\phi) \approx p(\theta|h)$
        1. Specifically, optimize the variational lower bound $D_{KL}(q(\theta|\phi), p(h|\theta) p(\theta))$
        1. Represent $q(\theta|\phi)$ as a product of independent Gaussian parameter distributions with mean $\phi$
          1. $p(\theta|D) = \prod_i p(\theta_i |D)$
          1. $p(\theta_i|D) = N(\underbrace{\mu_i, \sigma_i^2}_{\phi})$
        1. See Blundell et al. “Weight Uncertainty in Neural Networks”
      1. Given a new transition $(s,a,s')$, update $\phi$ to get $\phi'$
      1. Use $D_{KL}(q(\theta|\phi'), q(\theta|\phi))$ as the approximate bonus (see the sketch below)
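
A minimal sketch of that bonus for the factorized-Gaussian case above: the KL divergence between the updated posterior $q(\theta|\phi')$ and the old one $q(\theta|\phi)$, with placeholder means and standard deviations rather than a real Bayesian neural network:

```python
import numpy as np

def diag_gaussian_kl(mu_new, sigma_new, mu_old, sigma_old):
    """KL( q(theta|phi') || q(theta|phi) ) for factorized Gaussian posteriors,
    summed over parameters; this is the quantity used as the exploration bonus."""
    var_new, var_old = sigma_new ** 2, sigma_old ** 2
    per_param = (np.log(sigma_old / sigma_new)
                 + (var_new + (mu_new - mu_old) ** 2) / (2.0 * var_old)
                 - 0.5)
    return float(np.sum(per_param))

# Hypothetical posteriors over 3 dynamics-model weights, before and after one
# update on a new transition (s, a, s'): the bigger the shift, the bigger the bonus.
mu_old, sigma_old = np.zeros(3), np.ones(3)
mu_new, sigma_new = np.array([0.1, 0.0, -0.2]), np.array([0.9, 1.0, 0.95])
print(diag_gaussian_kl(mu_new, sigma_new, mu_old, sigma_old))
```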

Counting with Hashes

What if we still count states, but compress them into hashes and count those?

  1. Compress $s$ into a k-bit code via $\phi(s)$, then count $N(\phi(s))$
  1. Shorter codes = more hash collisions
  1. Do similar states get the same hash? Maybe
    1. Depends on the model we choose
    1. We can use an autoencoder to improve the odds of this (see the sketch below)
Tang et al. “#Exploration: A Study of Count-Based Exploration”
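
A minimal sketch of hash-based counting using a random-projection (SimHash-style) code; the learned autoencoder features are omitted, and the feature dimension, code length, and bonus scale are arbitrary choices:

```python
import numpy as np
from collections import Counter

class SimHashCounter:
    """Count states by a k-bit locality-sensitive hash of their features:
    phi(s) = sign(A s), so nearby states tend to fall into the same bucket."""
    def __init__(self, feature_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, feature_dim))   # random projection
        self.counts = Counter()

    def update_and_bonus(self, features, beta=1.0):
        code = tuple((self.A @ features > 0).astype(np.int8))
        self.counts[code] += 1
        return beta / np.sqrt(self.counts[code])         # bonus ~ 1 / sqrt(N(phi(s)))

counter = SimHashCounter(feature_dim=16)
s = np.random.default_rng(1).standard_normal(16)
print(counter.update_and_bonus(s))          # first visit: bonus 1.0
print(counter.update_and_bonus(s + 1e-3))   # a tiny perturbation usually collides => smaller bonus
```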

Implicit Density Modeling with Exemplar Models

Fu et al. “EX2: Exploration with Exemplar Models…”

We need $p_\theta(s)$ to be able to output densities, but it doesn’t necessarily need to produce great samples ⇒ can we explicitly compare the new state to past states?

🧙🏽‍♂️
Intuition: the state is novel if it is easy to distinguish from all previously seen states by a classifier

For each observed state $s$, fit a classifier to classify that state against all past states $D$, and use the classifier error to obtain a density:

$p_\theta(s) = \frac{1-D_s(s)}{D_s(s)}$

where $D_s(s)$ is the probability that the classifier assigns to $s$ being a “positive”

In reality, every state is unique so we regularize the classifier
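
A tiny sketch of how a classifier output would be converted into the implied density above; the value of $D_s(s)$ is a placeholder rather than the output of a trained, regularized exemplar classifier:

```python
def ex2_density(d_s, eps=1e-6):
    """Implied density of state s, given D_s(s): the probability the exemplar
    classifier assigns to s being the 'positive' (i.e., itself) vs. past states."""
    d_s = min(max(d_s, eps), 1.0 - eps)      # avoid division by zero
    return (1.0 - d_s) / d_s

# A very novel state is easy to distinguish (D_s(s) near 1) => low implied density.
print(ex2_density(0.95))   # novel state: small density, big exploration bonus
print(ex2_density(0.55))   # familiar-looking state: larger density
```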

Heuristic estimation of counts via errors (DNF)

We just need to tell whether a state is novel or not

Given a buffer $D = \{(s_i,a_i)\}$, fit $\hat{f}_\theta(s,a)$

We then basically use the error between $\hat{f}_\theta(s,a)$ and the actual $f^*(s,a)$ as a bonus:

$\epsilon(s,a) = \|\hat{f}_\theta(s,a)-f^*(s,a)\|^2$

But what function should we use for $f^*$?

  1. $f^*(s,a) = s'$ (i.e., predict the next state)
  1. $f^*(s,a)$ as a neural network, but with parameters chosen randomly (see the sketch below)
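
A minimal sketch of the second option, in the spirit of random network distillation: a frozen, randomly initialized target network plays the role of $f^*$ and a trained predictor plays $\hat{f}_\theta$; for simplicity it conditions only on $s$, and the sizes are arbitrary:

```python
import torch
import torch.nn as nn

def make_net(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

obs_dim, feat_dim = 8, 32
target = make_net(obs_dim, feat_dim)      # f*: randomly initialized and frozen
predictor = make_net(obs_dim, feat_dim)   # f_hat: trained to match f* on visited states
for p in target.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def novelty_bonus(states):
    """epsilon(s) = ||f_hat(s) - f*(s)||^2; large for states unlike those in the buffer."""
    with torch.no_grad():
        y = target(states)
    return ((predictor(states) - y) ** 2).sum(dim=-1)

def train_step(states):
    # Fit the predictor on states from the buffer; the error then shrinks for
    # familiar states and stays large for novel ones.
    opt.zero_grad()
    novelty_bonus(states).mean().backward()
    opt.step()

buffer_states = torch.randn(64, obs_dim)   # stand-in for states from the buffer D
train_step(buffer_states)
print(novelty_bonus(torch.randn(4, obs_dim)).detach())
```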

We mentioned how $D_{KL}(q(\theta|\phi'), q(\theta|\phi))$ can be seen as information gain; it can also be viewed as a change in the network parameters $\phi$

So if we forget about IG, there are many other ways to measure this

Stadie et al. 2015:

  1. encode image observations using auto-encoder
  1. build predictive model on auto-encoder latent states
  1. use model error as exploration bonus

Schmidhuber et al. (see, e.g., “Formal Theory of Creativity, Fun, and Intrinsic Motivation”):

  1. exploration bonus for model error
  1. exploration bonus for model gradient
  1. many other variations

Suggested Readings

  1. Schmidhuber. (1992). A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers.
  1. Stadie, Levine, Abbeel (2015). Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models.
  1. Osband, Blundell, Pritzel, Van Roy. (2016). Deep Exploration via Bootstrapped DQN.
  1. Houthooft, Chen, Duan, Schulman, De Turck, Abbeel. (2016). VIME: Variational Information Maximizing Exploration.
  1. Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos. (2016). Unifying Count-Based Exploration and Intrinsic Motivation.
  1. Tang, Houthooft, Foote, Stooke, Chen, Duan, Schulman, De Turck, Abbeel. (2016). #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.
  1. Fu, Co-Reyes, Levine. (2017). EX2: Exploration with Exemplar Models for Deep Reinforcement Learning.

Unsupervised learning of diverse behaviors

What if we want to recover diverse behavior without any reward function at all?
One possible approach is to describe a target goal, and the machine just tries to reach that goal.
Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. ’18; Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. ‘19

$\bar{x}$ may not be equal to $x_g$ at all (here $x_g$ is the imagined goal proposed by the generative model and $\bar{x}$ is the state actually reached)… but we will do some updates regardless.

However, this means the model tends to generate states that we’ve already seen (since we sample from $p(z)$) ⇒ we are not exploring very well

How do we reach diverse goals?

We will modify step 4 ⇒ instead of doing MLE, $\theta, \phi \leftarrow \argmax_{\theta, \phi} \mathbb{E}[\log p(\bar{x})]$, we will do a weighted MLE:

$\theta,\phi \leftarrow \argmax_{\theta,\phi} \mathbb{E}[w(\bar{x}) \log p(\bar{x})]$
We want to assign a greater weight to rarely seen states, and since we’ve used a generative model for $p_\theta(x|z)$, we can use
$w(\bar{x}) = p_\theta(\bar{x})^{\alpha}, \quad \alpha \in [-1,0)$

⇒ Note that for any $\alpha \in [-1,0)$, the entropy $H(p_\theta(x))$ increases ⇒ proposing broader and broader goals
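
A tiny sketch of these Skew-Fit-style weights; the density values are placeholders standing in for $p_\theta(\bar{x})$ from the generative model:

```python
import numpy as np

def skew_weights(densities, alpha=-1.0):
    """Skew-Fit style weights w(x) = p_theta(x)^alpha with alpha in [-1, 0):
    rarely generated states (low density) get larger weight in the weighted
    maximum-likelihood update of the generative model."""
    assert -1.0 <= alpha < 0.0
    p = np.asarray(densities, dtype=float)
    w = p ** alpha
    return w / w.sum()                       # normalize into a sampling distribution

# Hypothetical model densities for four reached states: the rare one dominates.
print(skew_weights([0.4, 0.3, 0.25, 0.05]))
```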

The RL formulation:

$\max H(p(G))$
  1. $\pi(a|S,G)$ is trained to reach goal $G$
    1. As $\pi$ gets better, the final state $S$ gets closer to $G$ ⇒ $p(G|S)$ becomes more deterministic

To maximize exploration,

$\max H(p(G)) - H(p(G|S)) = \max I(S;G)$

This is mainly for gathering data that leads to a uniform density over the state distribution ⇒ it doesn’t mean the policy itself goes everywhere randomly; what if we want a policy that does?

But is state-entropy really a good objective?
Eysenbach’s theorem: at test time, an adversary will choose the worst goal $G$ ⇒ then the best goal distribution to use for training is $p(G) = \argmax_p H(p(G))$. See Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration.

State Marginal Matching (SMM)

Learn $\pi(a|s)$ to minimize $D_{KL}(p_\pi(s), p^*(s))$

If we want to use intrinsic motivation, we have:

$\tilde{r}(s) = \log p^*(s) - \log p_\pi(s)$

Note that this reward objective sums to exactly the state marginal matching objective above ⇒ however, RL is not aware that $-\log p_\pi(s)$ depends on $\pi$ ⇒ a tail-chasing problem

We can prove that this $\pi^*(a|s)$ is a Nash equilibrium of a two-player game

Learning Diverse Skills

$\pi(a|s,\underbrace{z}_{\text{task index}})$

$\pi(a|s) = \argmax_{\pi} \sum_z \mathbb{E}_{s \sim \pi(s|z)} [r(s,z)]$

We want the states rewarded under task $z$ to be unlikely for other tasks $z' \ne z$; one way:

Use a classifier that gives a softmax output over the possible $z$’s:

$r(s,z) = \log p(z|s)$

Connection to mutual information:

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need. Gregor et al. Variational Intrinsic Control. 2016
$I(z,s) = \underbrace{H(z)}_{\text{maximized by using a uniform prior $p(z)$}} - \underbrace{H(z|s)}_{\text{minimized by maximizing $\log p(z|s)$}}$
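
A minimal sketch of the diversity reward $r(s,z) = \log p(z|s)$ using a small skill discriminator; the network sizes are arbitrary and the discriminator/policy training loops are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkillDiscriminator(nn.Module):
    """Classifier q(z | s): softmax over skill indices given the state."""
    def __init__(self, obs_dim, n_skills, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_skills)
        )

    def forward(self, s):
        return self.net(s)   # logits over z

n_skills, obs_dim = 8, 4
disc = SkillDiscriminator(obs_dim, n_skills)

def diversity_reward(s, z):
    """r(s, z) = log q(z | s): high when the visited state identifies the skill."""
    log_probs = F.log_softmax(disc(s), dim=-1)
    return log_probs[torch.arange(len(z)), z]

s = torch.randn(5, obs_dim)               # batch of states visited under skills z
z = torch.randint(n_skills, (5,))
print(diversity_reward(s, z).detach())
```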