Optimal Control and Planning Algorithms (and LQR)

Optimal Control and Planning Algorithms

🔥
In model-based learning, we need a policy extraction step to obtain a policy from a model of the environment.

Model-free methods completely discard the transition dynamics and instead learn from sampled data points.

  1. But we often do know the dynamics
    1. Games
    1. Easily Modeled Systems
    1. Simulated Environments
  1. Or we can learn the dynamics
    1. System identification (fit unknown parameters of a known model)
    1. Learning (fit a general-purpose model to observed transition data)
Knowing the dynamics makes things easier.

But how do we extract policies from known dynamics?

Objective of optimal control:

\min_{a_1, \dots, a_T} \mathbb{E}\left[c(s_1,s_2,\dots,s_T)\mid a_1,\dots,a_T\right]

Open-loop Planning

a_1,\dots,a_T = \argmax_{a_1,\dots,a_T} \mathbb{E}\left[\sum_t r(s_t,a_t) \mid a_1,\dots,a_T\right]

This is sub-optimal because it commits to the whole sequence up front ⇒ after executing some actions we will receive new information that could change what we would do next.

Stochastic Optimization

Abstracts away optimal control/planning ⇒ just optimizes a function

a_1,\dots,a_T = \argmax_{a_1,\dots,a_T} J(a_1,\dots,a_T) \rightarrow A = \argmax_{A} J(A)

Simplest Method: Guess and Check (“Random Shooting” method)

  1. Pick A_1, \dots, A_N from some distribution (e.g. uniform)
  1. Choose A_i based on \argmax_i J(A_i)
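
To make the recipe concrete, here is a minimal numpy sketch; the objective J, the horizon, and the uniform action bounds are placeholders (assumptions for illustration, not from the lecture):

```python
import numpy as np

def random_shooting(J, horizon, action_dim, num_samples=1000, low=-1.0, high=1.0):
    """Guess and check: sample N open-loop action sequences, return the best one under J."""
    # Each candidate A_i is a full action sequence of shape (horizon, action_dim).
    candidates = np.random.uniform(low, high, size=(num_samples, horizon, action_dim))
    returns = np.array([J(A) for A in candidates])  # evaluate J(A_i) for every candidate
    return candidates[np.argmax(returns)]           # pick A_i = argmax_i J(A_i)
```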

Cross-Entropy Method (CEM)

Iterative improvement of the “random shooting method”

  1. Sample A_1, \dots, A_N from p(A) (typically starting from a Gaussian distribution)
  1. Evaluate J(A_1), \dots, J(A_N)
  1. Pick the elites A_{i_1}, \dots, A_{i_M} with the highest values (M < N)
  1. Refit p(A) to the elites A_{i_1}, \dots, A_{i_M}
  1. Repeat this process
See also CMA-ES ⇒ CEM with momentum
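
A minimal sketch of this loop in numpy, assuming a diagonal Gaussian for p(A) and a user-supplied objective J (names and defaults are illustrative):

```python
import numpy as np

def cross_entropy_method(J, horizon, action_dim, num_samples=500, num_elites=50, iters=10):
    """Iteratively refit a diagonal Gaussian p(A) to the elite action sequences."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # 1. Sample A_1..A_N from the current p(A).
        samples = mean + std * np.random.randn(num_samples, horizon, action_dim)
        # 2. Evaluate J(A_1)..J(A_N), 3. pick the M elites with the highest value.
        returns = np.array([J(A) for A in samples])
        elites = samples[np.argsort(returns)[-num_elites:]]
        # 4. Refit p(A) to the elites (small floor on std to avoid collapse).
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the mean of p(A) is the planned action sequence
```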

  1. Very fast if parallelized
  1. Extremely Simple
  1. But very harsh dimensionality limit
  1. Only open-loop planning

Closed-loop Planning

Monte-Carlo Tree Search (MCTS)

Discrete planning can be viewed as a tree-search problem, but with vanilla tree search the complexity is exponential in the number of time steps.

So can we restrict the search without expanding the whole tree?

The idea is to stop at some depth and, from that state, approximate the outcomes that come after by using a baseline policy.

Which path do we search first?

Choose nodes with best reward, but also prefer rarely visited nodes

So the algorithm:

  1. Find a leaf s_l using \text{TreePolicy}(s_1)
  1. Evaluate the leaf using \text{DefaultPolicy}(s_l)
    1. Start from the root and execute every action in the tree to arrive at the leaf / a new leaf (due to randomness)
  1. Update all values in the tree between s_1 and s_l
  1. Repeat this iteration, then choose the best action from s_1

UCT \text{TreePolicy}(s_t)

  1. If s_t is not fully expanded, choose a new a_t; else choose the child s_{t+1} with the best Score
\text{Score}(s_t)=\frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2 \ln N(s_{t-1})}{N(s_t)}}
Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods
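
A small sketch of this selection rule in Python; the Node bookkeeping (visit count N, summed return Q, children) is a hypothetical minimal structure for illustration, not anything from the survey above:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal per-node statistics: N(s) visit count, Q(s) summed return, children keyed by action.
    N: int = 0
    Q: float = 0.0
    children: dict = field(default_factory=dict)

def uct_score(child: Node, parent: Node, C: float = 1.0) -> float:
    """Score(s_t) = Q(s_t)/N(s_t) + 2C * sqrt(2 ln N(s_{t-1}) / N(s_t))."""
    return child.Q / child.N + 2 * C * math.sqrt(2 * math.log(parent.N) / child.N)

def best_child(parent: Node) -> Node:
    """TreePolicy step at a fully expanded node: descend to the child with the best score."""
    return max(parent.children.values(), key=lambda child: uct_score(child, parent))
```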

Shooting Methods vs. Collocation

  1. Shooting method: optimize over actions only
    1. Earlier actions have larger effects on the trajectory
  1. Collocation method: Optimize over actions and states, with constraints

Trajectory Optimization

Because this is a trajectory optimization lecture, we will use x_t for the state and u_t for the action.

If we want to use gradients, it usually helps to use a second-order method when optimizing via backpropagation!

Linear Case: Linear Quadratic Regulator (LQR)

Objective:

\min_{u_1,\dots,u_T} c(x_1,u_1)+c(f(x_1,u_1),u_2)+\cdots+c(f(f(\dots)\dots),u_T)

We will assume that

f(x_t,u_t)=F_t\begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t
c(x_t,u_t)=\frac{1}{2} \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top C_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top c_t

Base case: Solve for u_T only

Q(x_T,u_T)=\text{const}+\frac{1}{2}\begin{bmatrix} x_T \\ u_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ u_T \end{bmatrix} + \begin{bmatrix} x_T \\ u_T \end{bmatrix}^\top c_T

We can break C_T and c_T into blocks that affect x and blocks that affect u.

(Note that here we assume C_{u_T,x_T} = C_{x_T,u_T}^\top.)

C_T = \begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix}
c_T = \begin{bmatrix} c_{x_T} \\ c_{u_T} \end{bmatrix}
\nabla_{u_T} Q(x_T,u_T) = C_{u_T,x_T} x_T + C_{u_T,u_T} u_T + c_{u_T}^{\top} = 0
u_T = -C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T}) = K_T x_T + k_T
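
Reading off the coefficients (same pattern as the K_{T-1}, k_{T-1} expressions derived below):

K_T = -C_{u_T,u_T}^{-1} C_{u_T,x_T} \qquad k_T = -C_{u_T,u_T}^{-1} c_{u_T}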

Since u_T is fully determined by x_T, we can plug it in and write the value function as

\begin{split} V(x_T) &= \text{const} + \frac{1}{2} \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix} + \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^\top c_T \\ &=\text{const} + \frac{1}{2} x_T^\top V_T x_T + x_T^\top v_T \end{split}

where

V_T = C_{x_T,x_T} + C_{x_T,u_T}K_T + K_T^\top C_{u_T,x_T} + K_T^\top C_{u_T,u_T}K_T
v_T = c_{x_T} + C_{x_T,u_T}k_T + K_T^\top c_{u_T} + K_T^\top C_{u_T,u_T}k_T

So what if we solve for u_{T-1}? ⇒ u_{T-1} affects x_T!

f(x_{T-1},u_{T-1}) = x_T = F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + f_{T-1}
Q(x_{T-1},u_{T-1})=\text{const}+\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top C_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^{\top} c_{T-1} + V(f(x_{T-1},u_{T-1}))
V(x_T)=\text{const} + \frac{1}{2} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T F_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T f_{T-1} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top v_T

Combining the two equations

Q(x_{T-1},u_{T-1})=\text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top Q_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top q_{T-1}
Q_{T-1} = C_{T-1} + F_{T-1}^\top V_T F_{T-1}
q_{T-1} = c_{T-1} + F_{T-1}^\top V_T f_{T-1} + F_{T-1}^\top v_T

Now we can take the derivative:

\nabla_{u_{T-1}} Q(x_{T-1}, u_{T-1}) = Q_{u_{T-1},x_{T-1}} x_{T-1} + Q_{u_{T-1},u_{T-1}} u_{T-1} + q_{u_{T-1}}^\top = 0
u_{T-1} = K_{T-1} x_{T-1} + k_{T-1}

Where

K_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} Q_{u_{T-1},x_{T-1}}
k_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} q_{u_{T-1}}

So it looks like we can start from the last timestep and do a backward recursion!

After the backward recursion, we can compute our u_t and x_t by a forward recursion: u_t = K_t x_t + k_t and x_{t+1} = f(x_t, u_t).
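
A compact numpy sketch of the backward and forward recursions above. The inputs F, f, C, c (time-indexed lists matching the definitions above, with F[t], f[t] only needed for t < T-1) and the state dimension n_x are assumptions about how the problem is packed; constant terms are dropped since they do not affect the minimizer:

```python
import numpy as np

def lqr(F, f, C, c, x1, n_x):
    """Backward recursion for K_t, k_t, then forward recursion for x_t, u_t.

    Dynamics: x_{t+1} = F[t] @ [x_t; u_t] + f[t]
    Cost:     0.5 * [x;u]^T C[t] [x;u] + [x;u]^T c[t]
    """
    T = len(C)
    K, k = [None] * T, [None] * T
    V, v = np.zeros((n_x, n_x)), np.zeros(n_x)  # value terms from the step after T are zero
    for t in reversed(range(T)):
        # Q_t = C_t + F_t^T V_{t+1} F_t,  q_t = c_t + F_t^T V_{t+1} f_t + F_t^T v_{t+1}
        if t == T - 1:
            Q, q = C[t], c[t]                   # base case: Q_T = C_T, q_T = c_T
        else:
            Q = C[t] + F[t].T @ V @ F[t]
            q = c[t] + F[t].T @ V @ f[t] + F[t].T @ v
        Q_xx, Q_xu = Q[:n_x, :n_x], Q[:n_x, n_x:]
        Q_ux, Q_uu = Q[n_x:, :n_x], Q[n_x:, n_x:]
        q_x, q_u = q[:n_x], q[n_x:]
        K[t] = -np.linalg.solve(Q_uu, Q_ux)     # K_t = -Q_{u,u}^{-1} Q_{u,x}
        k[t] = -np.linalg.solve(Q_uu, q_u)      # k_t = -Q_{u,u}^{-1} q_u
        # V_t = Q_xx + Q_xu K_t + K_t^T Q_ux + K_t^T Q_uu K_t (and v_t analogously)
        V = Q_xx + Q_xu @ K[t] + K[t].T @ Q_ux + K[t].T @ Q_uu @ K[t]
        v = q_x + Q_xu @ k[t] + K[t].T @ q_u + K[t].T @ Q_uu @ k[t]

    # Forward recursion: u_t = K_t x_t + k_t,  x_{t+1} = F_t [x_t; u_t] + f_t
    xs, us, x = [], [], x1
    for t in range(T):
        u = K[t] @ x + k[t]
        xs.append(x); us.append(u)
        if t < T - 1:
            x = F[t] @ np.concatenate([x, u]) + f[t]
    return xs, us, K, k
```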

LQR for Stochastic and Nonlinear Systems

Stochastic dynamics:

If

p(x_{t+1}\mid x_t,u_t) = \mathcal{N}\left(F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t,\ \Sigma_t \right)

Then just choose actions according to the linear case (since the randomness is symmetric)

Nonlinear case - Differential Dynamic Programming (DDP) / Iterative LQR (iLQR)

Can we approximate a nonlinear system as a linear-quadratic system locally (near a nominal trajectory)?
f(x_t,u_t) \approx f(\hat{x}_t,\hat{u}_t)+\nabla_{x_t,u_t} f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}
c(x_t,u_t) \approx c(\hat{x}_t, \hat{u}_t)+\nabla_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}+\frac{1}{2} \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^\top \nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t) \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}

We will denote the deviations x_t - \hat{x}_t and u_t - \hat{u}_t as \delta x_t and \delta u_t.

\bar{f}(\delta x_t, \delta u_t) = F_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}
\bar{c}(\delta x_t, \delta u_t) = \frac{1}{2} \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top C_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top c_t

Where F_t = \nabla_{x_t,u_t} f(\hat{x}_t, \hat{u}_t), C_t = \nabla_{x_t,u_t}^2 c(\hat{x}_t, \hat{u}_t), and c_t = \nabla_{x_t,u_t} c(\hat{x}_t, \hat{u}_t)

Now we can use LQR with dynamics \bar{f} and cost \bar{c} to find \delta x_t and \delta u_t

iLQR is basically just an approximation of Newton’s method for solving the control objective.

But we used a first-order dynamics approximation here; to get the full Newton’s method, we need to use a second-order dynamics approximation (this is DDP).

🔥
But this may be a bad idea, because we always jump to the minimum of the approximation, and the approximation is only usable within a short range. We can scale the k_t term by a constant \alpha in the forward pass ⇒ this controls how far we move from our starting trajectory (the smaller \alpha is, the closer we stay to the original trajectory).
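
Putting the pieces together, a rough iLQR sketch with the \alpha-scaled forward pass. The helpers dynamics, cost, linearize, quadratize, and lqr_backward (essentially the backward half of the LQR recursion sketched earlier, returning the gains K_t, k_t) are assumed to be supplied by the user; this is an illustration of the loop under those assumptions, not a tuned implementation:

```python
import numpy as np

def ilqr(dynamics, cost, x1, u_init, linearize, quadratize, lqr_backward,
         num_iters=50, alphas=(1.0, 0.5, 0.25, 0.1, 0.05)):
    """Iterative LQR: linearize/quadratize around the nominal trajectory, run the LQR
    backward pass in (dx, du), then do an alpha-scaled forward pass with the real dynamics."""
    T, n_x = len(u_init), len(x1)
    # Roll out the initial nominal trajectory (x_hat, u_hat).
    x_hat, u_hat = [x1], list(u_init)
    for t in range(T - 1):
        x_hat.append(dynamics(x_hat[t], u_hat[t]))
    best_cost = cost(x_hat, u_hat)

    for _ in range(num_iters):
        # Backward pass on the local linear-quadratic model (f_t = 0 in delta coordinates,
        # since the nominal trajectory satisfies the dynamics).
        F = [linearize(x_hat[t], u_hat[t]) for t in range(T)]
        f = [np.zeros(n_x) for _ in range(T)]
        CC = [quadratize(x_hat[t], u_hat[t]) for t in range(T)]
        C, c = [Ct for Ct, _ in CC], [ct for _, ct in CC]
        K, k = lqr_backward(F, f, C, c, n_x)

        # Forward pass with the *real* dynamics, scaling k_t by alpha (line search):
        #   u_t = u_hat_t + alpha * k_t + K_t (x_t - x_hat_t)
        for alpha in alphas:
            xs, us = [x1], []
            for t in range(T):
                u = u_hat[t] + alpha * k[t] + K[t] @ (xs[t] - x_hat[t])
                us.append(u)
                if t < T - 1:
                    xs.append(dynamics(xs[t], u))
            new_cost = cost(xs, us)
            if new_cost < best_cost:  # accept the first alpha that improves the real cost
                x_hat, u_hat, best_cost = xs, us, new_cost
                break
    return x_hat, u_hat
```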

Additional Reading

  1. Mayne, Jacobson. (1970). Differential dynamic programming.
    1. Original differential dynamic programming algorithm.
  1. Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization.
    1. Practical guide for implementing non-linear iterative LQR.
  1. Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
    1. Probabilistic formulation and trust region alternative to deterministic line search.