Transfer Learning & Meta-RL

Transfer Learning

Transfer Learning: Use experience from one set of tasks for faster learning and better performance on a new task!

How is the knowledge stored? For example, in a policy or value function, in a learned model, or in features/representations.

Several types of transfer learning problems: forward transfer (train on one task, transfer to a new task), multi-task transfer (train on many tasks, transfer to a new task), and meta-learning (learn to learn from many tasks).

Invariance Assumption: Everything that is different between domains is irrelevant

Formally: $p(x)$ differs between domains, but there exists some $z = f(x)$ such that $p(y|z) = p(y|x)$ and $p(z)$ is the same across domains.

Domain adaptation in CV
Train a classifier to distinguish simulated from real transitions, and penalize behaviors in sim that would be impossible in the real world (this assumes we have some experience in the target domain).
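A minimal sketch of this idea, assuming a domain classifier $p(\text{real} \mid s, a, s')$ has already been trained on simulated data plus a small amount of real experience; the function name and the log-odds penalty form are illustrative, not a specific published implementation:

```python
import numpy as np

def domain_corrected_reward(r_sim, p_real_given_transition, penalty_weight=1.0, eps=1e-8):
    """Penalize simulator transitions that a domain classifier deems unlikely in the real world.

    r_sim:                   reward from the simulator for (s, a, s')
    p_real_given_transition: classifier estimate p(real | s, a, s'), trained on a small
                             amount of real (target-domain) experience plus sim data
    The correction log p(real|.) - log p(sim|.) is near zero for transitions that look
    realistic, and strongly negative for transitions only the simulator allows.
    """
    p = np.clip(p_real_given_transition, eps, 1.0 - eps)
    correction = np.log(p) - np.log(1.0 - p)   # classifier log-odds
    return r_sim + penalty_weight * correction

# Example: a transition the classifier thinks is almost impossible in the real world
print(domain_corrected_reward(r_sim=1.0, p_real_given_transition=0.02))  # large negative
```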

Challenges of fine-tuning

🧙🏽‍♂️
Intuition: the more varied the training domain is, the more likely we are to generalize zero-shot to a slightly different domain

Suggested Readings

🧙🏽‍♂️
Can we learn faster by learning multiple tasks?

Multi-task learning can accelerate learning of all tasks that are learned together, and can provide better pre-training (initialization) for a new task.

Or we can add a “what to do” variable $\omega$ (usually a one-hot task indicator) as an input to the policy, so it knows which task to perform.

This is a contextual policy $\pi_\theta(a|s, \omega)$.

A particular choice is a “goal-conditioned” policy, where the “what-to-do” variable $\omega$ is another state (a goal) $g$: $\pi_\theta(a|s, g)$.
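A minimal sketch of such a contextual / goal-conditioned policy in PyTorch; the class name, network sizes, and Gaussian action head are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ContextualPolicy(nn.Module):
    """pi_theta(a | s, omega): condition the policy on a task variable omega.

    omega can be a one-hot task index (multi-task RL) or a goal state g
    (goal-conditioned RL); either way it is concatenated to the state.
    """
    def __init__(self, state_dim, context_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),       # mean of a Gaussian action distribution
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state, omega):
        mean = self.net(torch.cat([state, omega], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())

# Goal-conditioned usage: omega is just another state g with the same dimension as s.
policy = ContextualPolicy(state_dim=10, context_dim=10, action_dim=4)
s, g = torch.randn(1, 10), torch.randn(1, 10)
action = policy(s, g).sample()
```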

Relevant Papers (Goal-Conditioned Policy)

Meta-learning

Vision: use experience from many previous tasks to learn how to learn, so that new tasks can be picked up much faster.

Three Perspectives

  1. RNN
    1. Pro: conceptually simple
    2. Pro: relatively easy to apply
    3. Con: vulnerable to meta-overfitting
    4. Con: challenging to optimize in practice
  2. Gradient-based approach (e.g., MAML)
    1. Pro: good extrapolation (“consistent”)
    2. Pro: conceptually elegant
    3. Con: complex, requires many samples
  3. Inference problem (VAE)
    1. Pro: simple, effective exploration via posterior sampling
    2. Pro: elegant reduction to solving a special POMDP
    3. Con: vulnerable to meta-overfitting
    4. Con: challenging to optimize in practice

Meta-RL and emergent phenomena

Humans and animals seemingly learn behaviors in a variety of ways

Perhaps each of these is a separate “algorithm” in the brain

But maybe these are all emergent phenomena resulting from meta-RL?

Meta-learning with supervised learning

Meta-learning structure

Supervised learning: $f(x) \rightarrow y$

Supervised meta-learning: $f(D^{tr}, x) \rightarrow y$

Meta-learner with RNN
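A minimal sketch of an RNN meta-learner implementing $f(D^{tr}, x) \rightarrow y$; the architecture (a GRU reading the support set as a sequence of $(x, y)$ pairs) and all names are illustrative assumptions, not a specific published model:

```python
import torch
import torch.nn as nn

class RNNMetaLearner(nn.Module):
    """f(D_train, x_query) -> y_hat: meta-learning as a sequence problem.

    The GRU reads the support set as a sequence of (x, y) pairs; the final
    hidden state summarizes "how to solve this task" and is combined with the
    query input to produce a prediction. Adaptation = running the RNN forward,
    with no gradient steps at meta-test time.
    """
    def __init__(self, x_dim, y_dim, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(input_size=x_dim + y_dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden + x_dim, y_dim)

    def forward(self, x_train, y_train, x_query):
        # x_train: (B, K, x_dim), y_train: (B, K, y_dim), x_query: (B, x_dim)
        support = torch.cat([x_train, y_train], dim=-1)
        _, h = self.encoder(support)            # h: (1, B, hidden), task summary
        return self.head(torch.cat([h[-1], x_query], dim=-1))

model = RNNMetaLearner(x_dim=1, y_dim=1)
y_hat = model(torch.randn(8, 5, 1), torch.randn(8, 5, 1), torch.randn(8, 1))  # (8, 1)
```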

Meta-learning in RL

Contextual Policies and Meta-Reinforcement Learning are closely related:

(Figure: left, meta-RL, where the task must be inferred from experience; right, contextual policies, where the task $\omega$ is given as in multi-task RL.)

Meta-RL with recurrent policies:

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^n \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)], \quad \text{where } \phi_i = f_\theta(M_i)$$
Meta-RL with a trained RNN (recurrent policy)

What should $f_\theta(M_i)$ do?

Such a recurrent policy will learn to explore, since exploring in early episodes of a task is part of maximizing return over all the episodes it sees from that task.
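A sketch of such a recurrent meta-policy, in the spirit of RL²; the exact inputs, sizes, and names here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RecurrentMetaPolicy(nn.Module):
    """RL^2-style policy: pi(a_t | s_t, a_{t-1}, r_{t-1}, h_{t-1}).

    Because the hidden state persists across episodes of the same task (MDP M_i),
    the RNN can implement its own "learning algorithm": early episodes explore,
    later episodes exploit what the hidden state has inferred about the task.
    """
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(state_dim + n_actions + 1, hidden)  # +1 for reward
        self.logits = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def step(self, state, prev_action, prev_reward, h):
        a_onehot = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([state, a_onehot, prev_reward.unsqueeze(-1)], dim=-1)
        h = self.rnn(x, h)                  # hidden state = memory + task belief
        dist = torch.distributions.Categorical(logits=self.logits(h))
        return dist, h

# The hidden state h is NOT reset between episodes of the same task,
# only when a new task M_i is sampled.
policy = RecurrentMetaPolicy(state_dim=6, n_actions=3)
h = torch.zeros(1, 128)
dist, h = policy.step(torch.randn(1, 6), torch.tensor([0]), torch.zeros(1), h)
```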

Architectures for Meta-RL

Attention + temporal convolution: Mishra, Rohaninejad, Chen, Abbeel. A Simple Neural Attentive Meta-Learner.
Standard RNN (LSTM): Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel. RL²: Fast Reinforcement Learning via Slow Reinforcement Learning. 2016.
Parallel, permutation-invariant context encoder: Rakelly*, Zhou*, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019.

Gradient-based Meta-RL

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. (MAML)

For meta-learning, the formulation is:

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^n \mathbb{E}_{\pi_{\phi_i}(\tau)}[R(\tau)], \quad \text{where } \phi_i = f_\theta(M_i)$$

What if $f_\theta(M_i)$ is itself an RL algorithm?

$$f_\theta(M_i) = \theta + \alpha \nabla_\theta J_i(\theta)$$

Note: the gradient term requires interacting with $M_i$ to estimate $\nabla_\theta \mathbb{E}_{\pi_\theta}[R(\tau)]$.

$$f_{\mathrm{MAML}}(D^{tr}, x) = f_{\theta'}(x), \quad \theta' \leftarrow \theta - \alpha \sum_{(x,y) \in D^{tr}} \nabla_\theta L(f_\theta(x), y)$$
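A minimal sketch of this MAML update for the supervised formulation above, using a toy linear model with one inner gradient step per task and a second-order outer update; the function and parameter names are illustrative:

```python
import torch

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    """One MAML meta-update on a list of tasks.

    theta: dict of parameter tensors (requires_grad=True)
    tasks: list of (x_tr, y_tr, x_val, y_val) tuples, one per task
    """
    def predict(params, x):                  # tiny linear regression model
        return x @ params["w"] + params["b"]

    def loss(params, x, y):
        return ((predict(params, x) - y) ** 2).mean()

    meta_loss = 0.0
    for x_tr, y_tr, x_val, y_val in tasks:
        # Inner loop: theta' = theta - alpha * grad_theta L_train(theta)
        grads = torch.autograd.grad(loss(theta, x_tr, y_tr), list(theta.values()),
                                    create_graph=True)   # keep graph for 2nd-order grads
        theta_prime = {k: v - inner_lr * g for (k, v), g in zip(theta.items(), grads)}
        # Outer objective: performance of the adapted parameters on held-out data
        meta_loss = meta_loss + loss(theta_prime, x_val, y_val)

    meta_grads = torch.autograd.grad(meta_loss, list(theta.values()))
    with torch.no_grad():
        for (k, v), g in zip(theta.items(), meta_grads):
            v -= outer_lr * g                # meta-update of theta
    return meta_loss.item()

theta = {"w": torch.randn(1, 1, requires_grad=True), "b": torch.zeros(1, requires_grad=True)}
tasks = [(torch.randn(10, 1), torch.randn(10, 1), torch.randn(10, 1), torch.randn(10, 1))
         for _ in range(4)]
print(maml_step(theta, tasks))
```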

Suggested Readings

Meta-RL as a POMDP (Variational Inference)

(Figure: the context encoder combines per-transition features by averaging to form the posterior over the latent task variable, trained as a VAE.)
Rakelly, Zhou, Quillen, Finn, Levine, Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, ICML 2019.

Policy: $\pi_\theta(a_t | s_t, z_t)$

Inference network: $q_\phi(z_t | s_1, a_1, r_1, \dots, s_t, a_t, r_t)$

🧙🏽‍♂️
Very similar to RNN meta-RL, but with a stochastic $z$
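A sketch in the spirit of PEARL's permutation-invariant context encoder with posterior sampling: each context transition $(s, a, r)$ is encoded into a Gaussian factor, and the factors are combined into a posterior over $z$ (the figure note above describes averaging features; the PEARL paper multiplies the per-transition Gaussian factors, which is what this sketch does). Shapes, names, and network sizes are illustrative:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """q_phi(z | context): permutation-invariant posterior over the latent task variable z.

    Each context transition c_n = (s, a, r) is encoded independently into a Gaussian
    factor; the factors are multiplied, so the order of transitions does not matter.
    The policy then conditions on z ~ q_phi(z | context).
    """
    def __init__(self, context_dim, z_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),    # per-transition mean and log-variance
        )

    def forward(self, context):
        # context: (N, context_dim), one row per (s, a, r) transition seen so far
        mu, logvar = self.net(context).chunk(2, dim=-1)
        var = logvar.exp()
        precision = (1.0 / var).sum(dim=0)              # product of Gaussian factors
        post_var = 1.0 / precision
        post_mu = post_var * (mu / var).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())

encoder = ContextEncoder(context_dim=8, z_dim=5)
context = torch.randn(20, 8)            # 20 transitions collected so far in this task
z = encoder(context).rsample()          # posterior sampling drives exploration
# The policy is then pi_theta(a | s, z), as in the equations above.
```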

Suggested Reading