Action Prior in Backward Pass of Control as Inference in RL (1)

Remember

$$p(O_{t+1:T} \mid s_{t+1}) = \int p(O_{t+1:T} \mid s_{t+1}, a_{t+1})\, p(a_{t+1} \mid s_{t+1})\, da_{t+1}$$

We’ve defined

$$V_t(s_t) = \log \beta_t(s_t)$$
$$Q_t(s_t, a_t) = \log \beta_t(s_t, a_t)$$
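For intuition, here is a minimal numpy sketch of these two facts for a single state with a few discrete actions (the sizes and numbers are made up): the state message is the prior-weighted sum of the state-action messages, and $V$ and $Q$ are just their logs.

```python
import numpy as np

# Hypothetical backward messages for one state with three discrete actions.
beta_sa = np.array([0.20, 0.05, 0.60])   # beta_t(s, a) = p(O_{t:T} | s, a)
prior   = np.array([0.50, 0.25, 0.25])   # p(a | s), not necessarily uniform

# Marginalize the action out: beta_t(s) = sum_a beta_t(s, a) p(a | s)
beta_s = np.sum(beta_sa * prior)

V = np.log(beta_s)    # V_t(s)    = log beta_t(s)
Q = np.log(beta_sa)   # Q_t(s, a) = log beta_t(s, a)

# The same V written directly in terms of Q and the log prior (a log-sum-exp).
assert np.isclose(V, np.log(np.sum(np.exp(Q + np.log(prior)))))
```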

We assumed the action prior is uniform, but what if it is not?

$$V(s_t) = \log \int \exp\big(Q(s_t, a_t) + \log p(a_t \mid s_t)\big)\, da_t$$
$$Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}\big[\exp(V(s_{t+1}))\big]$$
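Here is a minimal sketch of one such backward step in a tabular MDP with a non-uniform action prior, using `scipy.special.logsumexp` for the soft maximization; the function name and array shapes are my own choices, not from the text.

```python
import numpy as np
from scipy.special import logsumexp

def backward_step(r, log_prior, P, V_next):
    """One soft backward step with a non-uniform action prior.

    r         : (S, A)     reward r(s, a)
    log_prior : (S, A)     log p(a | s)
    P         : (S, A, S') transition probabilities p(s' | s, a)
    V_next    : (S',)      V_{t+1}(s')
    """
    # Q(s, a) = r(s, a) + log E_{s'}[exp(V_{t+1}(s'))]
    Q = r + np.log(P @ np.exp(V_next))
    # V(s) = log sum_a exp(Q(s, a) + log p(a | s))
    V = logsumexp(Q + log_prior, axis=1)
    return Q, V
```

With `log_prior` set to the constant $-\log |A|$ this reduces to the familiar uniform-prior soft backup, up to an additive constant in $V$.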

Now let

$$\tilde{Q}(s_t, a_t) = r(s_t, a_t) + \log p(a_t \mid s_t) + \log \mathbb{E}\big[\exp(V(s_{t+1}))\big]$$
$$V(s_t) = \log \int \exp\big(\tilde{Q}(s_t, a_t)\big)\, da_t = \log \int \exp\big(Q(s_t, a_t) + \log p(a_t \mid s_t)\big)\, da_t$$

Oh! Now we’ve seen that with a modification to the reward function, $r(s_t, a_t) \to r(s_t, a_t) + \log p(a_t \mid s_t)$, we can recover $V$ and $Q$ under a different action prior: the non-uniform prior gets folded into the reward, and the backup takes the same form as in the uniform case.
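As a sanity check, a small sketch on a random tabular MDP (made-up sizes, assuming numpy and scipy are available) confirming that folding $\log p(a_t \mid s_t)$ into the reward and then running the plain, prior-free log-sum-exp backup reproduces the same $V$:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
S, A = 4, 3                                              # hypothetical sizes
r = rng.normal(size=(S, A))                              # r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))               # p(s' | s, a)
log_prior = np.log(rng.dirichlet(np.ones(A), size=S))    # log p(a | s)
V_next = rng.normal(size=S)                              # V_{t+1}(s')

# Backup with an explicit (non-uniform) action prior.
Q = r + np.log(P @ np.exp(V_next))
V = logsumexp(Q + log_prior, axis=1)

# Fold log p(a | s) into the reward, then do a plain log-sum-exp backup.
Q_tilde = r + log_prior + np.log(P @ np.exp(V_next))
V_tilde = logsumexp(Q_tilde, axis=1)

assert np.allclose(V, V_tilde)   # identical value functions
```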