EECS 127 Class Notes

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley EECS127.

Reference Notice: Material highly and mostly derived from Prof. Ranade’s Lectures

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Why Optimization

  1. Control + Robotics
  1. Resource allocation problem
  1. Communications
  1. Machine Learning
🔥
Gauss used the least squares technique to predict where planets would appear in the sky

Optimization Forms

General:

minf0(x)subject tofi(x)bi for i=1,2,,m\min f_0(\vec{x}) \\ \text{subject to} f_i(\vec{x})\le b_i \text{ for } i=1,2,\dots,m

Notation:

Least Squares

Find x\vec{x} such that AxbA\vec{x} \approx \vec{b} by minimizing Axb2||A\vec{x}-\vec{b}||^2

It is easier to minimize the squared norm than the norm itself because the squared norm is differentiable and convex (any local minimum is a global minimum)

🔥
Quadratic functionConvex function\text{Quadratic function} \sub \text{Convex function}

Solution:

We can derive the solution either by setting the derivative of Axb2||A\vec{x}-b||^2 to zero or by projecting b\vec{b} onto Col(A)Col(A), the column space of matrix AA.

Intuition of method 2 (projection) ⇒ projection of b\vec{b} onto Col(A)Col(A) would form a right angle triangle and we know that perpendicular distances are the shortest.

We derive by:

we want (Axb)=ewhere e must be orthogonal to all of the columns of AAe=0A(Axb)=0AAx=Abx=(AA)1Ab\begin{split} & \text{we want } (A\vec{x}- \vec{b})=\vec{e} \\ & \text{where } \vec{e} \text{ must be orthogonal to all of the columns of } A \\ & A^{\top}\vec{e}=0 \\ & A^{\top}(A\vec{x}-\vec{b})=0 \\ & A^{\top}A\vec{x}=A^{\top}\vec{b} \\ & \vec{x}^{*}=(A^\top A)^{-1} A^\top \vec{b} \end{split}
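Below is a minimal numpy sketch (not from the lecture) that checks this closed-form solution against numpy's built-in least-squares solver and verifies that the residual is orthogonal to the columns of A.

```python
# Check x* = (A^T A)^{-1} A^T b against numpy's least-squares solver.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # tall matrix: more equations than unknowns
b = rng.standard_normal(50)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # solve (A^T A) x = A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))            # True
# The residual e = Ax* - b is orthogonal to every column of A:
print(np.allclose(A.T @ (A @ x_normal - b), 0))  # True
```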

Linear Algebra Review

Vectors, Norms, Gram-Schmidt, Fundamental Theorem of Linear Algebra

Vector

xRn\vec{x} \in \mathbb{R}^n

Norms

If we have a vector space XX

then a function from XRX \rightarrow \mathbb{R} is a norm provided that it satisfies

  1. xX,x0\forall \vec{x} \in X, ||\vec{x}|| \ge 0, and x=0    x=0||x|| = 0 \iff \vec{x} = \vec{0}
  1. Triangle Inequality: x+yx+y||\vec{x} + \vec{y}|| \le ||\vec{x}|| + ||\vec{y}||
  1. αx=α×x||\alpha \vec{x}|| = |\alpha| \times ||\vec{x}||

LP Norm

xp=(ixip)1/p,1p<||\vec{x}||_p = (\sum_i |x_i|^p)^{1/p}, 1 \le p < \infin

Extreme case of pp \rightarrow \infin

x=maxixi||\vec{x}||_\infin = \max_{i} |x_i|

TODO: Search proof for this

Intuition:

x=limp((maxixi)pundefineddominates+iarg maxxixip)1/p||\vec{x}||_\infin = \lim_{p \rightarrow \infin} (\underbrace{(\max_i |x_i| )^{p}}_{\mathclap{\text{dominates}}}+\sum_{i \ne \argmax |x_i|} |x_i|^p)^{1/p}

L0-Norm (Cardinality)

x0=iI{xi0}||\vec{x}||_0 = \sum_i \mathbb{I}\{x_i \ne 0\}

L2-Norm (Euclidean Norm)

x2=ixi2||\vec{x}||_2 = \sqrt{\sum_i x_i^2}

Cauchy-Schwarz Inequality

<x,y>=xy=x2×y2×cosθx2×y2<\vec{x},\vec{y}>=\vec{x}^{\top}\vec{y} = ||x||_2 \times ||y||_2 \times \cos \theta \le ||x||_2 \times ||y||_2

where θ\theta is the angle between vector x\vec{x} and y\vec{y}

Hölder's Inequality

Generalization of Cauchy-Schwarz
p,q1,s.t. 1p+1q=1xtyixiyixpyqp,q \ge 1, \text{s.t. } \frac{1}{p}+\frac{1}{q} = 1 \\ |\vec{x}^t \vec{y}| \le \sum_{i} |x_iy_i| \le ||\vec{x}||_p||\vec{y}||_q

Proof not in scope for this class

First Optimization Problem

maxxys.t. xp1,yRn is constant\max \vec{x}^\top \vec{y} \\ \text{s.t. } ||\vec{x}||_p \le 1, \vec{y} \in \mathbb{R}^n \text{ is constant} \\

p=1,

xi={sign(yi)if arg maxiyi=i0otherwisex_i = \begin{cases} \text{sign} (y_i) &\text{if } \argmax_i |y_i| = i \\ 0 &\text{otherwise} \end{cases}

Produces sparse solution

maxx11xy=maxiyi=y\max_{||\vec{x}||_1 \le 1} \vec{x}^\top \vec{y} = \max_i |y_i| = ||\vec{y}||_\infin

p = 2,

We will choose the x\vec{x} on the unit circle in the direction of y\vec{y} because we want cosθ\cos \theta to be 1.

p=p = \infin

xi={1yi01yi<0,x=sign(y)x_i = \begin{cases} 1 &y_i \ge 0 \\ -1 &y_i < 0 \end{cases}, \\ \vec{x} = \text{sign}(\vec{y})
maxx1xy=iyi=y1\max_{||\vec{x}||_\infin \le 1} \vec{x}^\top \vec{y} = \sum_i |y_i| = ||\vec{y}||_1
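A quick numeric sanity check of these three maximizers (my own sketch): the optimal values over the unit L1, L2, and L-infinity balls are ||y||_inf, ||y||_2, and ||y||_1 respectively.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(5)

# p = 1: put +/-1 on the coordinate where |y_i| is largest
x1 = np.zeros_like(y)
i = np.argmax(np.abs(y))
x1[i] = np.sign(y[i])

# p = 2: unit vector in the direction of y
x2 = y / np.linalg.norm(y)

# p = infinity: sign pattern of y
xinf = np.sign(y)

print(np.isclose(x1 @ y, np.linalg.norm(y, np.inf)))  # True
print(np.isclose(x2 @ y, np.linalg.norm(y, 2)))       # True
print(np.isclose(xinf @ y, np.linalg.norm(y, 1)))     # True
```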

Gram-Schmidt / Orthonormalization + QR Decomposition

We have a vector space XX and given basis a1,,an\vec{a_1}, \dots, \vec{a_n}

We can generate an orthonormal basis for the vector space

v1=a1a12\vec{v_1} = \frac{\vec{a}_1}{||\vec{a_1}||_2}
v2=a2projv1a2a2projv1a22\vec{v_2}=\frac{\vec{a}_2-proj_{\vec{v}_1}{\vec{a_2}}}{||\vec{a}_2-proj_{\vec{v}_1}{\vec{a_2}}||_2}
vk=aki<kprojviakaki<kprojviak2\vec{v_k}=\frac{\vec{a}_k - \sum_{i<k} proj_{\vec{v}_i}\vec{a}_k}{||\vec{a}_k - \sum_{i<k} proj_{\vec{v}_i}\vec{a}_k||_2}

where

proj_{\vec{v}_{to}} \vec{v}_{from} = (\vec{v}_{from} \cdot \hat{v}_{to}) \hat{v}_{to} =\frac{\vec{v}_{to} \cdot \vec{v}_{from}}{||\vec{v}_{to}||_2^2} \vec{v}_{to} \quad (\hat{v}_{to} = \vec{v}_{to}/||\vec{v}_{to}||_2)

QR Decomposition

A=QRA = QR

where

A=[a1a2an]A = \begin{bmatrix} \vec{a}_1 &\vec{a}_2 &\cdots &\vec{a}_n \end{bmatrix}
Q=[q1q2qn]Q = \begin{bmatrix} \vec{q}_1 &\vec{q_2} &\cdots &\vec{q}_n \end{bmatrix}
R=[r11r12r1n0r22r2n000rnn]upper triangular matrixR = \begin{bmatrix} r_{11} &r_{12} &\cdots &r_{1n} \\ 0 &r_{22} &\cdots &r_{2n} \\ \vdots &\vec{0} &\ddots &\vdots \\ 0 &0 &\cdots &r_{nn} \end{bmatrix} \leftarrow \text{upper triangular matrix}

The entries rijr_{ij} are taken from the Gram-Schmidt equations!
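A small Gram-Schmidt sketch in numpy (assuming the columns of A are linearly independent), compared against the QR factorization; the R factor is recovered as Q^T A.

```python
import numpy as np

def gram_schmidt(A):
    """Return Q with orthonormal columns spanning the column space of A."""
    Q = np.zeros_like(A, dtype=float)
    for k in range(A.shape[1]):
        v = A[:, k].copy()
        for i in range(k):                      # subtract projections onto earlier q_i
            v -= (Q[:, i] @ A[:, k]) * Q[:, i]
        Q[:, k] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
Q = gram_schmidt(A)
R = Q.T @ A                                     # r_ij = <q_i, a_j>

print(np.allclose(Q.T @ Q, np.eye(3)))          # orthonormal columns
print(np.allclose(Q @ R, A))                    # A = QR
print(np.allclose(np.tril(R, -1), 0))           # R is upper triangular
```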

Fundamental Theorem of Linear Algebra

ARm×nN(A)undefineddirect sumR(A)=RnA \in \mathbb{R}^{m \times n} \\ N(A) \underbrace{\oplus}_{\mathclap{\text{direct sum}}} R(A^{\top}) = \mathbb{R}^n
For any vector in Rn\mathbb{R}^n, we can write it as the sum of a component in the null space of AA and a component in the range space of AA^{\top}.

We can also say:

R(A)N(A)=RmR(A)\oplus N(A^{\top}) = \mathbb{R}^m

To prove this, we have to use the Orthogonal Decomposition Theorem


Orthogonal Decomposition Theorem (Thm 2.1)

Let XX be any general vector space, and SS be a subspace, then

xX,x=s+rwhere sS,rS\forall \vec{x} \in X, \vec{x}=\vec{s}+\vec{r} \\ \text{where } s \in S, \vec{r} \in S^{\bot}
Note: S={rsS,<r,s>=0}S^{\bot} = \{\vec{r}|\forall \vec{s} \in S, <\vec{r},\vec{s}> = 0\}

This can be summarized by

X=SSX = S \oplus S^{\bot}

Thanks to the ODT, now we only want to show that

N(A)=R(A)N(A) = R(A^{\top})^{\bot}

This means we need to show:

N(A)R(A)R(A)N(A)\begin{align} N(A) \sube R(A^{\top})^{\bot} \\ R(A^{\top})^{\bot} \sube N(A) \end{align}

To show (1)

Let xN(A)\vec{x} \in N(A), show xR(A)\vec{x} \in R(A^{\top})^{\bot}

xN(A)Ax=0\because \vec{x} \in N(A) \\ \therefore A\vec{x} = \vec{0}

We want to prove:

wR(A),<x,w>=0\forall \vec{w} \in R(A^{\top}), <\vec{x},\vec{w}>=0

We know that since wR(A)\vec{w} \in R(A^{\top}), w=Ay\vec{w} = A^{\top} \vec{y} for some y\vec{y}

<\vec{x},\vec{w}>=<\vec{x},A^{\top}\vec{y}>=\underbrace{\vec{x}^{\top}A^{\top}\vec{y}}_{\mathclap{\text{it's a scalar, so we can simply transpose it}}}=\vec{y}^{\top}A\vec{x}=0

To show (2)

Let xR(A)\vec{x} \in R(A^{\top})^{\bot}, show xN(A)\vec{x} \in N(A)

xR(A)w=Ay,where yRm,<x,w>=0\because \vec{x} \in R(A^{\top})^{\bot} \\ \therefore \forall \vec{w} = A^{\top}\vec{y}, \text{where } \vec{y} \in \mathbb{R}^m, \\ <\vec{x},\vec{w}>=0
\begin{split} <\vec{x},A^{\top}\vec{y}> &=\vec{x}^{\top}A^{\top}\vec{y} \\ &= \vec{y}^{\top}A\vec{x} = 0, \quad \forall \vec{y} \in \mathbb{R}^m \end{split}

A specific AxA\vec{x} might be orthogonal to one particular y\vec{y}, but the equality must hold for every y\vec{y} (for example \vec{y} = A\vec{x}), so therefore

Ax=0A\vec{x} = 0

Diagonalization of Matrices

A=UΛU1A = U\Lambda U^{-1}

Not all matrices are diagonalizable

Matrices are diagonalizable when for each eigenvalue:

Algebraic Multiplicity=Geometric Multiplicity\text{Algebraic Multiplicity} = \text{Geometric Multiplicity}

Algebraic Multiplicity:

When finding eigenvalues for a matrix we find roots of the polynomial det(λIA)det(\lambda I - A), the “algebraic multiplicity” of an eigenvalue λi\lambda_i is how many times λi\lambda_i appears as root of the polynomial.

Geometric Multiplicity (for λi\lambda_i):

dim(N(λiIA))\dim(N(\lambda_i I -A))
📌
Important Property: N(λiIA)=ΦiN(\lambda_i I - A) = \Phi_i is exactly the eigen-space of AA corresponding to eigenvalue λi\lambda_i

Symmetric Matrices

A=A or Aij=AjiASnA = A^{\top} \text{ or } A_{ij} = A_{ji} \leftrightarrow A \in S^n

e.g.

  1. Covariance Matrices
  1. Graph Laplacians (matrix representing connectivity in a graph)

Properties:

  1. Eigenvalues λi,λiR\forall \lambda_i, \lambda_i \in \mathbb{R}
  1. Eigenspaces for distinct eigenvalues λiλj\lambda_i \ne \lambda_j are orthogonal (ΦiΦj\Phi_i \bot \Phi_j)
    1. Φi=N(λiIA)\Phi_i = N(\lambda_i I - A)
  1. If μi\mu_i is the algebraic multiplicity of λi\lambda_i, then \dim(\Phi_i)=\mu_i
    1. geometric and algebraic multiplicities are always equal
  1. 1-3 show that AA is always diagonalizable
    1. ASnA=UΛUA \in S^{n} \rightarrow A = U\Lambda U^{\top}
      1. where UU ⇒ orthonormal
      1. Λ\Lambda ⇒ diagonal

Spectral Theorem

ASnA \in S^n

Then

A=UΛUA = U\Lambda U^{\top}

Where UU is an orthonormal(unitary) matrix, Λ\Lambda is diagonal, and

U=[u1u2urundefinedRange space R(A)ur+1unundefinedNull Space N(A)]U = \begin{bmatrix} \underbrace{\begin{matrix} \vec{u}_1 &\vec{u}_2 &\cdots &\vec{u}_r \end{matrix}}_{\mathclap{\text{Range space $R(A)$}}} &\underbrace{\begin{matrix} \vec{u}_{r+1} &\cdots &\vec{u}_n \end{matrix} }_{\mathclap{\text{Null Space $N(A)$}}} \end{bmatrix}
Λ=[λ100000λ200000λr000000000000]\Lambda = \begin{bmatrix} \lambda_1 &0 &\cdots &0 &0 &\cdots &0 \\ 0 &\lambda_2 &\cdots &0 &0 &\cdots &0 \\ \vdots &\vdots &\ddots &\vdots &\vdots &\ddots &\vdots \\ 0 &0 &\cdots &\lambda_r &0 &\cdots &0 \\ 0 &0 &\cdots &0 &0 &\cdots &0\\ \vdots &\vdots &\ddots &\vdots &\vdots &\ddots &\vdots \\ 0 &0 &\cdots &0 &0 &\cdots &0 \end{bmatrix}

Note: the eigenvalues are ordered so that (i<j) \implies (\lambda_i \ge \lambda_j), and for 1 \le i \le r, (\lambda_i, \vec{u}_i) are eigenvalue-eigenvector pairs of AA

We can also write

A=irλiuiui\begin{split} A &= \sum_i^r \lambda_i \vec{u}_i\vec{u}_i^{\top} \end{split}
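A numeric check of the spectral theorem (my own sketch) using numpy's symmetric eigensolver: both A = UΛU^T and the rank-one expansion reconstruct A.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                       # symmetric matrix

lam, U = np.linalg.eigh(A)              # real eigenvalues and orthonormal U
print(np.allclose(U @ np.diag(lam) @ U.T, A))                    # A = U Λ U^T
print(np.allclose(sum(lam[i] * np.outer(U[:, i], U[:, i])
                      for i in range(4)), A))                    # rank-one sum
print(np.allclose(U.T @ U, np.eye(4)))                           # orthonormal U
```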

Variational Characterization of Eigenvalues of a symmetric matrix (& Rayleigh Coefficient)

Given ASnA \in S^n

r \text{ (Rayleigh coefficient)} = \frac{\vec{x}^{\top}A\vec{x}}{\vec{x}^{\top}\vec{x}} = \frac{\vec{x}^{\top}A\vec{x}}{||\vec{x}||_2^2}

Important property:

λmin(A)rλmax(A)\lambda_{\min}(A) \le r \le \lambda_{\max}(A)

and

λmax(A)=maxx2=1xAxλmin(A)=minx2=1xAx\lambda_{\max}(A) = \max_{||\vec{x}||_2=1} \vec{x}^{\top}A\vec{x} \\ \lambda_{\min}(A) = \min_{||\vec{x}||_2=1} \vec{x}^{\top}A\vec{x}

Note:

xAx=xUΛUx=yΛy(y=Ux)=i=1rλiyi2i=1rλmaxyi2=λmaxy22=λmaxx22i=1rλminyi2=λminy22=λminx22\begin{split} \vec{x}^{\top}A\vec{x}&=\vec{x}^{\top}U\Lambda U^{\top}\vec{x} \\ &=\vec{y}^{\top}\Lambda\vec{y}\quad (\vec{y}=U^{\top}\vec{x}) \\ &=\sum_{i=1}^r \lambda_i y_i^2 \\ &\quad \le \sum_{i=1}^r \lambda_{\max} y_i^2 = \lambda_{\max}||\vec{y}||_2^2 = \lambda_{\max}||\vec{x}||_2^2 \\ &\quad \ge \sum_{i=1}^r \lambda_{\min}y_i^2=\lambda_{\min}||\vec{y}||_2^2=\lambda_{\min} ||\vec{x}||_2^2 \end{split}

Note that both UU and UU^{\top} are orthonormal matrices, and they preserve norm.

And it is clear which values of x\vec{x} to choose to attain the max or the min.

PCA + SVD

Principal Component Analysis + Singular Value Decomposition
🔥
Idea: I have n-dimensional data but my data seems like they have some kind of structure in a lower dimension, how do I extract this out?

Goal of PCA:

Given data vectors x1,x2,,xn\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n, find kk orthonormal vectors wi\vec{w}_i that minimize the projection error

Err=1ni=1nei2Err = \frac{1}{n}\sum_{i=1}^n e_i^2

where

\begin{split} e_i^2 &= ||\vec{x}_i - \sum_{j=1}^k <\vec{w}_j,\vec{x}_i>\vec{w}_j ||^2 \\ &= ||\vec{x}_i||_2^2 - \sum_{j=1}^k <\vec{w}_j, \vec{x}_i>^2 \end{split}

So our problem becomes

\begin{split} \max_{\{\vec{w}_1, \cdots, \vec{w}_k\}} \sum_{i=1}^n \sum_{j=1}^k \frac{1}{n}<\vec{w}_j,\vec{x}_i>^2 &= \frac{1}{n}\sum_{j=1}^k||X\vec{w}_j||^2 \\ &= \frac{1}{n}\sum_{j=1}^k(\vec{w}_j^{\top}\underbrace{X^{\top}X}_{{\text{symmetric, use spectral theorem!}}}\vec{w}_j) \end{split}

It's easier to prove if we define the data points as rows, so we will proceed this way… but later we can flip everything.

Data Matrix X=[x1x2xn]\text{Data Matrix } X = \begin{bmatrix} \vec{x}_1^{\top} \\ \vec{x}_2^{\top} \\ \vdots \\ \vec{x}_n^{\top} \end{bmatrix}
C = \frac{1}{n}XX^{\top}=\frac{1}{n} \begin{bmatrix} ||\vec{x}_1||^2 &<\vec{x}_1,\vec{x}_2> &\cdots &<\vec{x}_1,\vec{x}_n> \\ <\vec{x}_1,\vec{x}_2> &||\vec{x}_2||^2 &\cdots &<\vec{x}_2,\vec{x}_n> \\ \vdots &\vdots &\ddots &\vdots \\ <\vec{x}_1,\vec{x}_n> &<\vec{x}_2,\vec{x}_n> &\cdots &||\vec{x}_n||^2 \end{bmatrix}

Note:

XXSn,CSnXX^{\top} \in S^n, C \in S^n

We will also define

D=1nXXD = \frac{1}{n}X^{\top}X

Note that DSnD \in S^n as well

The SVD Decomposition of A can be written as

A=Uundefinedm×mΣundefinedm×nVundefinedn×nA = \underbrace{U}_{m \times m} \underbrace{\Sigma}_{m \times n} \underbrace{V^{\top}}_{n \times n}

where the diagonal values of Σ\Sigma are called singular values of AA ⇒ they are the square roots of the (shared nonzero) eigenvalues of both AAA^{\top}A and AAAA^{\top}

The orthonormal eigenvectors of AAA^{\top}A are called the right singular vectors of AA, and they sit in the columns of VV, ordered so that each corresponds to the square root of its eigenvalue in Σ\Sigma

The orthonormal eigenvectors of AAAA ^{\top} are called the left singular vectors of AA, and they sit in the columns of UU, in the same order

Graphical Interpretation: U,VRotation / ReflectionU, V \rightarrow \text{Rotation / Reflection}, ΣScaling\Sigma \rightarrow Scaling

To understand this graphical interpretation for a general vector, think about decomposing the vector x\vec{x} in the basis of right singular vectors (the columns of VV)

\vec{x}=\alpha_1 \vec{v}_1 + \alpha_2 \vec{v}_2 + \cdots + \alpha_r \vec{v}_r

And think about how VV^{\top} first expresses x\vec{x} in the coordinates of the right singular vectors (the eigenvectors of AAA^{\top}A), then Σ\Sigma scales each coordinate, and finally UU maps the result into the basis of the left singular vectors (the eigenvectors of AAAA^{\top})

We can write this in the compact form

A=i=1rσiuiviA = \sum_{i=1}^r \sigma_i \vec{u}_i\vec{v}_i^{\top}
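A short numpy check (my own sketch) of the facts above: the singular values are the square roots of the eigenvalues of A^T A, and the rank-one sum reproduces A.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]        # eigenvalues, descending

print(np.allclose(s, np.sqrt(np.maximum(eigvals, 0))))   # σ_i = sqrt(λ_i(A^T A))
print(np.allclose(sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(3)), A))
```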

🔥
Singular Value Decomposition is most of the time not unique, because every time we have a repeated eigenvalue (λi=λj,ij\lambda_i = \lambda_j, i \ne j) we can order the eigenbasis in different ways.

Vector Calculus

f: \mathbb{R}^n \rightarrow \mathbb{R}
Varaiya ⇒ the main theorem

Scalar Calculus Review

Say we have a function

f(x):RRf(x): \mathbb{R} \rightarrow \mathbb{R}

Then the derivative

dfdx\frac{df}{dx}

Tells the instant rate of change of f with respect to x

Taylor’s theorem

Let x0Rx_0 \in \mathbb{R} be a fixed point, then

f(x_0+\Delta x) = f(x_0) + \frac{df}{dx}\bigg|_{x=x_0} \Delta x + \frac{1}{2} \frac{d^2f}{dx^2} \bigg|_{x=x_0} (\Delta x)^2 + \cdots

We usually use Taylor's theorem as an approximation tool, and if we now extend our understanding of calculus to vectors, we can make linear approximations using linear algebra!

Dimensions of Vector Gradients

f(x):Rn×1Rf(\vec{x}): \mathbb{R}^{n \times 1} \rightarrow \mathbb{R}
ΔxRn×1\Delta \vec{x} \in \mathbb{R}^{n \times 1}

Therefore the derivative (or the transpose of gradient)

dfdx=fxR1×n\frac{df}{d\vec{x}} = \nabla f |_{\vec{x}}^{\top} \in \mathbb{R}^{1 \times n}

Notion of Gradient

f(x)\nabla f(\vec{x}) captures change according to all components of x\vec{x}

f(x)=[fx1fx2fxn]\nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} &\frac{\partial f}{\partial x_2} &\cdots &\frac{\partial f}{\partial x_n} \end{bmatrix}^{\top}

Notion of Hessian

2f(x)i,j=2fxixj\nabla^2 f(\vec{x}) _{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}
If we have nice smooth functions, the order of differentiation can be interchanged (this is not true in general), so the Hessian is often symmetric.

Jacobian Matrix

Jacobian matrix describes the derivative of a vector function with respect to a vector

f(x):RnRmf(\vec{x}): \mathbb{R}^n \rightarrow \mathbb{R}^m
J=[fx1fxn]=[Tf1Tfm]=[f1x1f1xnfmx1fmxn]{J} ={\begin{bmatrix}{\dfrac {\partial \mathbf {f} }{\partial x_{1}}}&\cdots &{\dfrac {\partial \mathbf {f} }{\partial x_{n}}}\end{bmatrix}}={\begin{bmatrix}\nabla ^{\mathrm {T} }f_{1}\\\vdots \\\nabla ^{\mathrm {T} }f_{m}\end{bmatrix}}={\begin{bmatrix}{\dfrac {\partial f_{1}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{1}}{\partial x_{n}}}\\\vdots &\ddots &\vdots \\{\dfrac {\partial f_{m}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{m}}{\partial x_{n}}}\end{bmatrix}}

Taylor’s Theorem for Vectors

f(\vec{x}_0+\Delta \vec{x}) = f(\vec{x}_0)+\nabla f |_{\vec{x}=\vec{x}_0}^{\top} (\Delta \vec{x}) + \frac{1}{2}(\Delta \vec{x})^{\top} \underbrace{\nabla^2 f|_{\vec{x} = \vec{x}_0}}_{\mathclap{\text{Hessian}}} (\Delta \vec{x}) + \cdots
In practice you would never use the higher-order terms because they are very hard to compute
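A small numeric check (my own sketch, with a made-up test function) that the second-order Taylor expansion built from the gradient and Hessian matches the function up to third-order error.

```python
# Test function f(x) = exp(x_0) + x_0 * x_1^2, with hand-written gradient/Hessian.
import numpy as np

def f(x):    return np.exp(x[0]) + x[0] * x[1] ** 2
def grad(x): return np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])
def hess(x): return np.array([[np.exp(x[0]), 2 * x[1]],
                              [2 * x[1],     2 * x[0]]])

x0 = np.array([0.3, -0.7])
dx = 1e-2 * np.array([1.0, 2.0])

taylor2 = f(x0) + grad(x0) @ dx + 0.5 * dx @ hess(x0) @ dx
print(abs(f(x0 + dx) - taylor2))    # on the order of ||dx||^3, i.e. ~1e-6
```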

The main theorem

f:RnR,differentiable everywheref: \mathbb{R}^{n} \rightarrow \mathbb{R}, \text{differentiable everywhere}

Then the optimization:

minf(x)s.t. xΩ\min f(\vec{x}) \\ \text{s.t. } \vec{x} \in \Omega

Where Ω\Omega is an open set (boundary points are not included), ΩRn\Omega \sube \mathbb{R}^n

Then if x\vec{x}^* is an optimal solution of the optimization problem, then:

dfdx(x)=0\frac{df}{dx}(\vec{x}^*)=0

Optimization Forms

General:

minf0(x)subject tofi(x)bi for i=1,2,,m\min f_0(\vec{x}) \\ \text{subject to} f_i(\vec{x})\le b_i \text{ for } i=1,2,\dots,m

Notation:

Noise/Perturbation/Sensitivity Analysis

Ax=yA\vec{x}=\vec{y}
yy+δy and because of this xx+δx\vec{y} \leftarrow \vec{y} +\vec{\delta_y} \text{ and because of this } \vec{x} \leftarrow \vec{x} + \vec{\delta_x}
We want to measure how sensitive our solution is to a perturbation in our measurement

We want to know how big is δx\delta \vec{x}, in particular

δx2x2\frac{||\vec{\delta_x}||_2}{||\vec{x}||_2}

The problem becomes:

A(x+δx)=y+δyAδx=δyδx=A1δyδx2=A1δy2\begin{split} A(\vec{x}+\vec{\delta_x}) &= \vec{y} + \vec{\delta_y} \\ A\vec{\delta_x} &= \vec{\delta_y} \\ \vec{\delta_x} &= A^{-1} \vec{\delta_y} \\ ||\vec{\delta_x}||_2 &= ||A^{-1} \vec{\delta_y}||_2 \end{split}

Recall the L2 matrix norm

A2=maxy2=1Ay2=maxyAy2y2||A||_2 = \max_{||\vec{y}||_2 = 1} ||A\vec{y}||_2 = \max_{\vec{y}} \frac{||A\vec{y}||_2}{||\vec{y}||_2}

Also:

Ax2=y2A2x2y2x2y2A2||A\vec{x}||_2 = ||\vec{y}||_2 \\ ||A||_2 ||\vec{x}||_2 \ge ||\vec{y}||_2 \\ ||\vec{x}||_2 \ge \frac{||\vec{y}||_2}{||A||_2}

Therefore

δx2=A1δy2δx2A12δy2\begin{split} ||\vec{\delta_x}||_2 &= ||A^{-1} \vec{\delta_y}||_2 \\ ||\vec{\delta_x}||_2 &\le ||A^{-1}||_2 ||\vec{\delta_y}||_2 \\ \end{split}

Combining those two

δx2x2A12δy2A2y2δx2x2(A2)(A12)[δy2y2]\begin{split} \frac{||\delta_x||_2}{||x||_2} &\le ||A^{-1}||_2 ||\delta_y||_2 \frac{||A||_2}{||y||_2} \\ \frac{||\delta_x||_2}{||x||_2} &\le (||A||_2)(||A^{-1}||_2) [\frac{||\delta_y||_2}{||y||_2}] \end{split}

We know that A2=σmax(A)||A||_2 = \sigma_{max}(A), and A12=1/σmin(A)||A^{-1}||_2 = 1 / \sigma_{min}(A)

δx2x2σmax(A)σmin(A)undefinedcondition number of a matrixδy2y2\frac{||\delta_x||_2}{||x||_2} \le \underbrace{\frac{\sigma_{max}(A)}{\sigma_{min}(A)}}_{\mathclap{\text{condition number of a matrix}}} \frac{||\delta_y||_2}{||y||_2}
Condition Number of a matrix is σmax/σmin\sigma_{max} / \sigma_{min}
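A small numpy experiment (my own sketch) illustrating the bound: the relative error in x stays below cond(A) times the relative error in y, and it blows up for an ill-conditioned A.

```python
import numpy as np

rng = np.random.default_rng(5)

def sensitivity(A):
    x = rng.standard_normal(A.shape[1])
    y = A @ x
    dy = 1e-6 * rng.standard_normal(len(y))
    dx = np.linalg.solve(A, dy)                       # A δx = δy
    rel_x = np.linalg.norm(dx) / np.linalg.norm(x)
    bound = np.linalg.cond(A, 2) * np.linalg.norm(dy) / np.linalg.norm(y)
    return rel_x, bound

A_good = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
A_bad = np.diag([1.0, 1.0, 1e-6])                     # cond ≈ 1e6

print(sensitivity(A_good))   # both numbers small, rel_x <= bound
print(sensitivity(A_bad))    # rel_x blows up, still <= the (huge) bound
```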

Least Squares

Find x\vec{x} such that AxbA\vec{x} \approx \vec{b} by minimizing Axb2||A\vec{x}-\vec{b}||^2

It is relatively easy to minimize the squared norm rather than the norm because now the squared norm is differentiable and convex(any local minimum is a global minimum)

🔥
Quadratic functionConvex function\text{Quadratic function} \sub \text{Convex function}

Solution:

We can either derive the solution by minimizing derivative of Axb2||A\vec{x}-b||^2 or projecting b\vec{b} to Col(A)Col(A) - column space of matrix AA.

Intuition of method 2 (projection) ⇒ projection of b\vec{b} onto Col(A)Col(A) would form a right angle triangle and we know that perpendicular distances are the shortest.

We derive by:

we want (Axb)=ewhere e must be orthogonal to all of the columns of AAe=0A(Axb)=0AAx=Abx=(AA)1Ab\begin{split} & \text{we want } (A\vec{x}- \vec{b})=\vec{e} \\ & \text{where } \vec{e} \text{ must be orthogonal to all of the columns of } A \\ & A^{\top}\vec{e}=0 \\ & A^{\top}(A\vec{x}-\vec{b})=0 \\ & A^{\top}A\vec{x}=A^{\top}\vec{b} \\ & \vec{x}^{*}=(A^\top A)^{-1} A^\top \vec{b} \end{split}

Sensitivity analysis on least squares

x=(AA)1Ab\vec{x} = (A^{\top}A)^{-1}A^\top \vec{b}

rewrite:

(AA)x=Ab(A^\top A)\vec{x} = A^\top \vec{b}

Look at the conditional number on AAA^\top A

Minimum Norm Problem

System of equations:

Ax=bA\vec{x} = \vec{b}

where ARm×n,xRn,bRmA \in \mathbb{R}^{m \times n}, \vec{x} \in \mathbb{R}^n, \vec{b} \in \mathbb{R}^m

If mnm \gg n, then we have an overdetermined system (generally no solution)

If mnm \ll n, then we have an underdetermined system (infinitely many solutions)

One common solution for the underdetermined state is to pick the minimum energy solution

minx22s.t. Ax=b\min ||\vec{x}||_2^2 \\ \text{s.t. } A\vec{x}=\vec{b}

So how can we optimize this?

  1. Components of x\vec{x} that are in N(A)N(A)
    1. x=y+z,yN(A),zR(A)\vec{x} = \vec{y} + \vec{z}, \vec{y} \in N(A), \vec{z} \in R(A^{\top})
    1. Ax=A(y+z)=0+Az=bA\vec{x} = A(\vec{y}+\vec{z})=0+A\vec{z}=\vec{b}
    1. x22=y22+z22||\vec{x}||_2^2 = ||\vec{y}||_2^2 + ||\vec{z}||_2^2 by pythagorean theorem
    1. So w,z=Aw\exists \vec{w}, \vec{z} = A^{\top} \vec{w}
    1. Az=bAAundefinedIf A has full rank, then this square matrix is invertiblew=bA\vec{z}=\vec{b} \rightarrow \underbrace{A A^{\top}}_{\mathclap{\text{If $A$ has full rank, then this square matrix is invertible}}} \vec{w} = \vec{b}
    1. w=(AA)1bz=A(AA)1b\vec{w} = (AA^{\top})^{-1} \vec{b} \rightarrow \vec{z} = A^{\top}(AA^{\top})^{-1}\vec{b}
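A numpy sketch of the formula above: the minimum-norm solution z = A^T (A A^T)^{-1} b of a wide system matches the pseudo-inverse solution, and adding any null-space component only increases the norm.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 8))        # wide: infinitely many solutions
b = rng.standard_normal(3)

x_min = A.T @ np.linalg.solve(A @ A.T, b)     # A^T (A A^T)^{-1} b
x_pinv = np.linalg.pinv(A) @ b

print(np.allclose(A @ x_min, b))              # feasible
print(np.allclose(x_min, x_pinv))             # same minimum-norm solution

# Any other solution x_min + n with n in N(A) has at least as large a norm:
z = rng.standard_normal(8)
n = z - np.linalg.pinv(A) @ (A @ z)           # component of z in N(A)
print(np.linalg.norm(x_min + n) >= np.linalg.norm(x_min))  # True
```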

Tikhonov Regularization

minxW1Axb22+W2xx022\min_{\vec{x}} ||W_1A\vec{x}-\vec{b}||_2^2+ ||W_2\vec{x}-\vec{x}_0||_2^2

Where W1W_1 is a weight matrix, and W2W_2 and x0\vec{x}_0 encode supplementary (prior) information

🔥
Prof. Ranade also included MAP and MLE examples of this regularization technique (and in special forms of ridge regression and MSE). See my CS189 Notes for those concepts.

Ridge Regression

Can we shift the eigenvalues so that least squares is less sensitive to changes in the observations?

(A+λI)v1undefinedeigenvector of A=Av1+λv1=λ1v1+λv1(A+\lambda I) \underbrace{\vec{v}_1}_{\mathclap{\text{eigenvector of $A$}}} = A\vec{v}_1 + \lambda \vec{v}_1 = \lambda_1 \vec{v}_1 + \lambda \vec{v}_1
🧙🏽‍♂️
CS 189 also talks about this: basically least squares + an L2 norm penalty on the mean loss. Notation is a bit different though; CS189 uses λ\lambda in the objective f(x)f(x) while here we use λ2\lambda^2.

Consider now the objective

minAxb2+λ2x22\min ||A\vec{x} - \vec{b}||^2 + \lambda^2 ||\vec{x}||_2^2
Copied from Prof. Shewchuk's lecture notes
Also copied from Prof. Shewchuk's slides
Another interpretation of the L2 penalty is that we're basically augmenting the data matrix with lambda I
\nabla f(\vec{x}) = 2A^{\top}A\vec{x}- 2A^{\top}\vec{b} + 2\lambda^2\vec{x}

Set this gradient to 0

(AA+λ2I)x=Ab(A^{\top}A+\lambda^2I) \vec{x} = A^\top \vec{b}

So now:

x=(AA+λ2I)1Ab\vec{x}^* = (A^{\top}A+\lambda^2 I)^{-1}A^\top \vec{b}
🔥
Also now AA+λ2IA^{\top}A + \lambda^2I matrix is absolutely invertible because every eigenvalue is now lower-bounded by λ2\lambda^2
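A quick numpy sketch of the closed-form ridge solution above, showing how the solution shrinks as λ grows (the matrix and data here are made up).

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((40, 4))
b = rng.standard_normal(40)

def ridge(lmbda):
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lmbda**2 * np.eye(n), A.T @ b)

print(np.linalg.norm(ridge(0.0)))    # equals the ordinary least-squares norm
print(np.linalg.norm(ridge(1.0)))    # smaller
print(np.linalg.norm(ridge(10.0)))   # much smaller: shrinkage
```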

If we let

x=Vz\vec{x} = V\vec{z}

And

A=UΣVA = U\Sigma V^\top

Remember we have:

minAxb2+λ2x22=minAVzb2+λ2x22=minUΣzb2+λ2x22\begin{split} & \min ||A\vec{x} - \vec{b}||^2 + \lambda^2 ||\vec{x}||_2^2 \\ =& \min ||AV\vec{z}-\vec{b}||^2 + \lambda^2||\vec{x}||^2_2 \\ =& \min ||U\Sigma \vec{z} - \vec{b}||^2 + \lambda^2 ||\vec{x}||_2^2 \\ \end{split}

Therefore

\begin{split} \vec{x}^* &= V(\Sigma^\top \Sigma+\lambda^2I)^{-1} \Sigma^\top U^\top \vec{b} \\ &=V \begin{bmatrix} diag(\frac{\sigma_i}{\sigma_i^2 + \lambda^2}) & \vec{0} \end{bmatrix} U^\top \vec{b} \end{split}

Lasso Regression

Formulation:

minxAxb1\min_{\vec{x}} ||A\vec{x} - \vec{b}||_1

How do we solve this?

Let Axb=eA\vec{x}-\vec{b} = \vec{e}

then

minAxe=be1\min_{A\vec{x} - \vec{e} = \vec{b}} ||\vec{e}||_1

Let’s consider a problem (with similar formulation)

minx1s.t.Ax=b\min ||\vec{x}||_1 \\ \text{s.t.} \\ A\vec{x} = \vec{b}

Suppose AA is wide with full row rank (so that there are infinitely many solutions)

Let xix_i = xi+xix_i^+ - x_i^-, then xi=xi++xi|x_i| = x_i^+ + x_i^-

Our program then becomes

mini=1nxi++i=1nxiA(x+x)=bxi+0,xi0\min \sum_{i=1}^n x_i^+ + \sum_{i=1}^n x_i^- \\ A(\vec{x}^+ - \vec{x}^-) = \vec{b} \\ x_i^+ \ge 0, x_i^- \ge 0

Claim: At the optimum, this new program will choose at most one of x_i^+, x_i^- to be nonzero.

Suppose instead that x_i^+ > 0, x_i^- > 0.

Then replacing them with x_i^+ - \epsilon, x_i^- - \epsilon (for small \epsilon > 0) leaves A(\vec{x}^+ - \vec{x}^-) unchanged but decreases the objective by 2\epsilon, a contradiction.
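A sketch of the x⁺/x⁻ LP reformulation above using scipy.optimize.linprog (the data here is made up); the recovered solution is feasible and typically sparse.

```python
# minimize sum(x+) + sum(x-)  s.t.  A(x+ - x-) = b,  x+, x- >= 0
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(8)
m, n = 3, 10
A = rng.standard_normal((m, n))          # wide matrix
b = rng.standard_normal(m)

c = np.ones(2 * n)                       # objective: sum of x+ and x-
A_eq = np.hstack([A, -A])                # A x+ - A x- = b
res = linprog(c, A_eq=A_eq, b_eq=b)      # default bounds are variables >= 0

x = res.x[:n] - res.x[n:]
print(np.allclose(A @ x, b))             # feasible
print(np.sum(np.abs(x) > 1e-8))          # typically only m nonzeros: sparse
```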

Another problem:

minxi=1nxbi1\min_{\vec{x}} \sum_{i=1}^n ||\vec{x} - \vec{b}_i||_1

This is solvable with a LP

Lasso Regression

minxAxb22+λx1\min_{\vec{x}} ||A\vec{x}-\vec{b}||_2^2 + \lambda ||\vec{x}||_1

Nice property: encourages sparsity

Lasso can be reformulated as a quadratic program; subgradient descent, least-angle regression (LARS), and forward stagewise algorithms can be used to solve Lasso.

Total Least Squares

What if, in least squares, both XX and yy have perturbations?
(X+X~)w=y+y~(X+\tilde{X})\vec{w} = \vec{y} + \vec{\tilde{y}}

We can rewrite the problem as

[X+X~y+y~]undefinedZ~[w1]=0\underbrace{\begin{bmatrix} X+\tilde{X} &\vec{y}+\vec{\tilde{y}} \end{bmatrix}}_{\mathclap{\tilde{Z}}} \begin{bmatrix} \vec{w} \\ -1 \end{bmatrix} = \vec{0}

We know Z~\tilde{Z} has rank at most nn

If we let Z=[Xy]Z = \begin{bmatrix} X &\vec{y} \end{bmatrix}

Then the optimization problem becomes (an eckart-young problem)

minrk(Z~)nZZ~F\min_{rk(\tilde{Z}) \le n} ||Z - \tilde{Z}||_F

Low-rank Approximation (SVD)

Is there a way to store the most important parts of a matrix? ⇒ SVD?

But there are multiple perspectives on what the "important parts" of a matrix are

  1. Matrix as an operator
  1. Matrix as a chunk of data

So we can define matrix norms

“Frobenius Norm” ⇒ As a chunk of data

AF=i,jAij2=trace(AA)||A||_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{trace(A^{\top}A)}
🔥
Frobenius Norms are invariant to orthonormal transforms
UAF=AUF=AF||UA||_F = ||AU||_F=||A||_F

👆TODO: Prove this as an exercise!

“L2-Norm / Spectral Norm” ⇒ As an operator ← max scaling of a vector

A2=maxx2=1Ax2=λmax(AA)=σmax(A)||A||_2 = \max_{||\vec{x}||_2=1} ||A\vec{x}||_2 = \sqrt{\lambda_{max} (A^{\top}A)} = \sigma_{max} (A)

Gradient Descent

Primarily for unconstrained optimization problems
p^* = \min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x})

We want to find pp^* and xx^*

“Guess and get better”

The best direction to improve is to use the gradient direction

xk+1=xkηundefinedstep sizef(xk)\vec{x}_{k+1} = \vec{x}_k - \underbrace{\eta}_{\mathclap{\text{step size}}} \nabla f(\vec{x}_k)
Usually SGD with constant step size is not going to converge because each data point pulls the parameters in different directions ⇒ but if we use time-dependent decreasing stepsize it’s more likely to converge
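A minimal gradient-descent loop in numpy (my own sketch) on the least-squares objective ||Ax - b||^2, using a constant step size 1/L where L = 2||A||_2^2 is the Lipschitz constant of the gradient.

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)

x = np.zeros(5)
eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)      # safe constant step size
for _ in range(2000):
    x = x - eta * 2 * A.T @ (A @ x - b)          # gradient of ||Ax - b||^2

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_star, atol=1e-6))          # converged to least squares
```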

Projected GD

minxXf(x)\min_{\vec{x} \in X} f(\vec{x})

Where

So we will do the following:

xk+1=ΠX(xkηf(xk))\vec{x}_{k+1} = \Pi_X (\vec{x}_k - \eta \nabla f(\vec{x}_k))

Where

ΠX(y)=arg minvXyv22\Pi_X(\vec{y}) = \argmin_{\vec{v} \in X} ||\vec{y} - \vec{v}||_2^2
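A small sketch of projected gradient descent, assuming the constraint set is the nonnegative orthant X = {x : x >= 0}, for which the projection is just clipping at zero.

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)

def project(v):
    return np.maximum(v, 0.0)             # Π_X for X = {x : x >= 0}

x = np.zeros(5)
eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)
for _ in range(2000):
    x = project(x - eta * 2 * A.T @ (A @ x - b))   # gradient step, then project

print(x)                                   # nonnegative least-squares solution
print(np.min(x) >= 0)                      # True: iterates stay feasible
```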

Conditional GD / Frank Wolfe

Let γk\gamma_k be a predetermined sequence

yk=arg minyXf(xk)y\vec{y}_k = \argmin_{\vec{y} \in X} \nabla f(\vec{x}_k)^{\top} \cdot \vec{y}

Notice that here we want to find the point in XX whose inner product with the gradient is as small (most negative) as possible; since we want to move against the gradient, yk\vec{y}_k gives the best feasible descent direction

xk+1=(1γk)xk+γkyk=xk+γ(ykxk)\begin{split} \vec{x}_{k+1} &= (1-\gamma_k) \vec{x}_k + \gamma_k \vec{y}_k \\ &=\vec{x}_k + \gamma(\vec{y}_k - \vec{x}_k) \end{split}
Has a nice sparsity property

We are basically replacing the projection with something in the set that maximizes the direction of gradient

Newton’s Method

Approximate the function as a quadratic function and descend to the lowest point in the quadratic estimation
f(x+v)=f(x)+f(x)v+12v2f(x)vundefinedQuadratic Approximation+f(\vec{x} + \vec{v}) = \underbrace{f(\vec{x}) + \nabla f(\vec{x})^\top \vec{v} + \frac{1}{2} \vec{v}^\top \nabla^2 f(\vec{x}) \vec{v}}_{\mathclap{\text{Quadratic Approximation}}} + \cdots

Suppose

So Newton step:

xk+1=xk(2f(xk))1f(xk)x_{k+1} = x_k - (\nabla^2 f(\vec{x}_k))^{-1} \nabla f(\vec{x}_k)
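A minimal Newton's-method sketch on a made-up strongly convex test function, f(x) = Σ_i exp(a_i^T x) + (1/2)||x||^2, using its exact gradient and Hessian.

```python
import numpy as np

rng = np.random.default_rng(11)
Adata = rng.standard_normal((6, 3)) * 0.5     # rows a_i (scaled to keep exp tame)

def grad(x):
    w = np.exp(Adata @ x)                     # weights exp(a_i^T x)
    return Adata.T @ w + x

def hess(x):
    w = np.exp(Adata @ x)
    return Adata.T @ (w[:, None] * Adata) + np.eye(3)

x = np.zeros(3)
for _ in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))     # Newton step

print(np.linalg.norm(grad(x)))                    # ≈ 0: stationary point found
```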

If HH is singular?

Black (Gradient Descent) vs. Blue (Newton’s Methods)

Comparison to other methods:

Coordinate Descent

Given an optimization objective

minxf0(x)\min_{\vec{x}} f_0(\vec{x})

The coordinate descent will update one coordinate per iteration; let xRn\vec{x} \in \mathbb{R}^n

(x_{t+1})_i = \begin{cases} \argmin_{(x_{t+1})_i} f_0(\vec{x}) &\text{if } i \equiv (t+1) \pmod n \\ (x_{t})_i &\text{otherwise} \end{cases}

Partitioning Problem

minxWx,WSns.t. i[1,n],xi2=1\min \vec{x}^\top W \vec{x}, W \in \mathbb{S}^n \\ \\ \text{s.t. } \forall i \in [1,n], x_i^2 = 1

Not a convex problem ⇒ the feasible set (each x_i^2 = 1, i.e. x_i = \pm 1) is not convex

Write out the lagrangian

L(x,ν)=xWx+i=1nνi(xi21)=xWx+xdiag(ν)xi=1nνi=x(W+diag(ν))xi=1nνi\begin{split} L(\vec{x}, \vec{\nu}) &= \vec{x}^\top W \vec{x} +\sum_{i=1}^n \nu_i (x_i^2 - 1) \\ &=\vec{x}^\top W \vec{x} + \vec{x}^\top diag(\vec{\nu}) \vec{x} - \sum_{i=1}^n \nu_i \\ &=\vec{x}^\top (W+diag(\vec{\nu})) \vec{x} - \sum_{i=1}^n \nu_i \\ \end{split}
g(ν)=minxL(x,ν)={i=1nνiif W+diag(ν)0otherwise (choose infinitely large x in negative eigenvalue direction)\begin{split} g(\vec{\nu}) &= \min_{\vec{x}} L(\vec{x}, \vec{\nu}) \\ &= \begin{cases} -\sum_{i=1}^n \nu_i &\text{if } W + diag(\vec{\nu}) \ge 0 \\ -\infin &\text{otherwise (choose infinitely large $\vec{x}$ in negative eigenvalue direction)} \end{cases} \end{split}

So dual problem (Semidefinite Program SDP):

maxi=1nνis.t. W+diag(ν)0\max -\sum_{i=1}^n \nu_i \\ \text{s.t. } W + diag(\vec{\nu}) \ge 0

One dual-feasible choice is:

\forall i,\ \nu_i = -\lambda_{\min} (W) \rightarrow p^* \ge d^* \ge n \cdot \lambda_{\min}(W)

Convexity

Convex Combination

i=1nλixi\sum_{i=1}^n \lambda_i \vec{x}_i is a convex combination of xi\vec{x}_i if λi0\lambda_i \ge 0 and i=1nλi=1\sum_{i=1}^n \lambda_i = 1

Convex Set

Set CC is convex if any line segment joining any two points in the set is contained in the set.

OR

Set CC is convex if any convex combination of points inside the set is contained inside the set

Separating Hyperplane Theorem

If C,DC,D are two disjoint convex sets, then there exists a hyperplane ax=b\vec{a}^\top \vec{x} = b that separates the two sets.

Convex Function (Bowl)

f:RnRf: \mathbb{R}^n \rightarrow \mathbb{R}

ff is convex if domain of ff is a convex set and satisfies the Jensen’s Inequality:

f(\theta \vec{x} + (1-\theta)\vec{y})\le \theta f(\vec{x})+(1-\theta)f(\vec{y})

Epigraph:

\text{Epi }f = \{(\vec{x},t) \mid \vec{x} \in \text{Domain}(f),\ f(\vec{x}) \le t\}

Property:

ff is a convex function if and only if Epi f\text{Epi }f is a convex set

First-order conditions

Let the convex function ff be differentiable.

Then ff is convex if and only if

x,yDomain(f),f(y)f(x)+f(x)(yx)\forall \vec{x}, \vec{y} \in \text{Domain}(f), f(\vec{y}) \ge f(\vec{x}) + \nabla f(\vec{x})^\top (\vec{y}-\vec{x})

“Tangent line is always below the function”

If \nabla f(\vec{x}^*) = 0, then f(\vec{y}) \ge f(\vec{x}^*)+0^\top(\vec{y}-\vec{x}^*) = f(\vec{x}^*) ⇒ \vec{x}^* is a global min

Second-order conditions

Let ff be a twice differentiable function whose domain Dom(f)Dom(f) is convex. Then ff is convex if and only if:

2f(x)0\nabla^2 f(\vec{x}) \ge 0

Strict Convexity

Strict Convexity → Convex

Dom(f)Dom(f) should be convex

Zero-order condition (θ(0,1)\theta \in (0,1)):

x,yDom(f),f(θx+(1θ)y)<θf(x)+(1θ)f(y)\forall x,y \in Dom(f), f(\theta\vec{x} + (1-\theta)\vec{y}) < \theta f(\vec{x}) + (1-\theta) f(\vec{y})

First-order condition:

f(y)>f(x)+f(x)(yx)f(\vec{y}) > f(\vec{x})+\nabla f(\vec{x})^\top (y-x)

Second-order condition:

2f(x)>0\nabla^2 f(\vec{x}) > 0

Strong Convexity

Strongly Convex → Strict Convex → Convex

ff is differentiable, Dom(f)Dom(f) is convex

Then ff is μ\mu-strongly (μ>0\mu > 0) convex if

xyDom(f):f(y)f(x)+f(x)(yx)+μ2yx2\forall \vec{x} \ne \vec{y} \in Dom(f): \\ f(\vec{y}) \ge f(\vec{x})+\nabla f(\vec{x})^\top(\vec{y}-\vec{x})+\frac{\mu}{2}||\vec{y} - \vec{x}||^2

Note the second-order taylor approximation:

f(y)f(x)+f(x)(yx)+12(yx)2f(x)(yx)f(\vec{y}) \approx f(\vec{x}) + \nabla f(\vec{x})^\top (\vec{y}-\vec{x}) + \frac{1}{2}(\vec{y} - \vec{x})^\top \nabla^2 f(\vec{x})(\vec{y} - \vec{x})

If we look at the term μ2yx2\frac{\mu}{2} ||\vec{y} - \vec{x}||^2, we can actually rewrite it as

12(yx)diag(μ,μ,,μ)(yx)\frac{1}{2}(\vec{y}-\vec{x})^\top diag(\mu, \mu, \dots, \mu) (\vec{y}-\vec{x})

So we are saying that we want the hessian

2f(x)μI\nabla^2 f(x) \ge \mu I
We are lower-bounding the hessian at every value

L-Smooth

It’s like the upper-bound counterpart of “strong convexity” ⇒ strong convexity says a quadratic bowl fits underneath your function, and L-smoothness says another quadratic bowl fits on top of it (the function is upper-bounded by a quadratic)

Convex Optimization

p=minf0(x)s.t.i[1,m],fi(x)0p^* = \min f_0(\vec{x}) \\ \text{s.t.} \\ \forall i\in [1,m], f_i(\vec{x}) \le 0

Assuming f0(x)f_0(\vec{x}) convex and i,fi(x)\forall i, f_i(\vec{x}) convex

Problem Transformations

  1. Addition of “slack variables” ⇒ “epigraph” reformulation
    1. minxXf0(x)mint:f(x)t,xXt\min_{\vec{x} \in X} f_0(\vec{x}) \rightarrow \min_{t: f(\vec{x}) \le t, \vec{x} \in X} t

Linear Program

mincxs.t.Ax=bPxq\min \vec{c}^\top \vec{x} \\ \text{s.t.} \\ A\vec{x} = \vec{b} \\ P\vec{x} \le \vec{q}

Theorem:

If minxXcx\min_{\vec{x} \in X} \vec{c}^\top \vec{x}, and XX is a closed convex set, then:

xBoundary(X)\vec{x}^* \in Boundary(X)

Proof:

Assume x\vec{x}^* is in the interior of the set XX; then there exists some ball of radius r>0r > 0 around x\vec{x}^* that is contained in XX:

\forall \vec{z}: ||\vec{x}^* - \vec{z}||_2 \le r \rightarrow \vec{z} \in X

Consider z=αc\vec{z} = -\alpha \vec{c}, where α=rc2\alpha = \frac{r}{||\vec{c}||_2}

Consider: x+z\vec{x}^* + \vec{z}

f_0(\vec{x}^*+\vec{z}) = \vec{c}^\top \vec{x}^* - \alpha \vec{c}^\top \vec{c} < f_0(\vec{x}^*) \quad (\text{assuming } \vec{c} \ne \vec{0})

Contradiction!

Minimax Inequality

With sets X,YX, Y and FF being any function

minxXmaxyYF(x,y)maxyYminxXF(x,y)\min_{x \in X} \max_{y \in Y} F(x,y) \ge \max_{y \in Y} \min_{x \in X} F(x,y)

Lagrangian as a Dual

We have optimization questions of the form

p=minf0(x)s.t.i[1,m],fi(x)0i[1,p],hi(x)=0p^* = \min f_0(\vec{x}) \\ \\ \text{s.t.} \\ \forall i \in [1,m], f_i(\vec{x}) \le 0 \\ \forall i \in [1,p], h_i(\vec{x}) = 0

Define the Lagrangian function:

L(x,λ,ν)=f0(x)+i=1mλifi(x)+i=1pνihi(x)L(\vec{x},\vec{\lambda}, \vec{\nu}) = f_0(\vec{x}) + \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \sum_{i=1}^p \nu_i h_i(\vec{x})

So, the Lagrangian problem:

minxL(x,λ,ν)=g(λ,ν)s.t. λ0\min_{\vec{x}} L(\vec{x},\vec{\lambda},\vec{\nu}) = g(\vec{\lambda},\vec{\nu}) \\ \text{s.t. } \vec{\lambda} \ge 0

Properties of gg:

  1. gg is a function of only λ,ν\vec{\lambda}, \vec{\nu}
  1. L(x,λ,ν)L(\vec{x}, \vec{\lambda}, \vec{\nu}) is an affine function of λ,ν\vec{\lambda}, \vec{\nu}
    1. Affine functions are both convex and concave
  1. What can we say about gg?
    1. Fact: pointwise max of convex function is convex
    1. Fact: pointwise min of concave function is concave
    1. gg is a pointwise min of concave functions
      1. Because for each fixed xi\vec{x}_i, Ki(λ,ν)=L(xi,λ,ν)K_i(\vec{\lambda},\vec{\nu}) = L(\vec{x}_i, \vec{\lambda}, \vec{\nu}) is affine (hence concave) in the dual variables
        1. g(\vec{\lambda}, \vec{\nu}) = \min_i K_i(\vec{\lambda},\vec{\nu})
      1. Therefore gg is a concave function of λ,ν\vec{\lambda}, \vec{\nu}
  1. g(λ,ν)g(\vec{\lambda}, \vec{\nu}) is a lower bound on the primal optimal pp^*

Dual problem:

maxλ0,νg(λ,ν)=dDual Problem\max_{\vec{\lambda} \ge 0, \nu} g(\vec{\lambda}, \vec{\nu}) = d^* \rightarrow \text{Dual Problem}

Maximization of a concave function with linear constraints ⇒ Convex program

Properties:

  1. Number of dual variables = number of constraints in the primal
  1. Always convex problem, even if primal is not!
  1. dd^* may not always give you the optimal value of your primal
    1. but it is always a lower bound dpd^* \le p^*
      1. Weak Duality
    1. We call
      1. pdp^* - d^*
        1. Duality Gap
      1. If p=dp^* = d^*
        1. Strong Duality
        1. Sometimes hold sometimes don’t

Minmax in Lagrangian

Consider

\begin{split} \max_{\vec{\lambda} \ge 0, \vec{\nu}} L(\vec{x}, \vec{\lambda}, \vec{\nu}) &= \max_{\vec{\lambda} \ge 0, \vec{\nu}} {f_0(\vec{x}) + \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \sum_{i=1}^n \nu_i h_i(\vec{x})} \\ &= \begin{cases} f_0(\vec{x}) &\text{if $\vec{x}$ is feasible} \\ \infin &\text{otherwise} \end{cases} \end{split}

Note

p=minxmaxλ0,νL(x,λ,ν) p^* = \min_{\vec{x}} \max_{\vec{\lambda} \ge 0, \vec{\nu}} L(\vec{x}, \vec{\lambda}, \vec{\nu})

Because of minmax inequality,

pdp^* \ge d^*

Strong Duality / Slater’s Condition

Strong duality holds if the primal problem is convex and x0\exists x_0 such that fi(x0)<0if_i(x_0) < 0 \forall i (“strictly feasible”)

Refined Slater’s Condition

Assume problem is convex, assume we have f1,f2,,fkf_1, f_2, \dots, f_k that are affine functions as constraints (other constraints fk+1,,fmf_{k+1}, \dots, f_m are not affine)

Then strong duality holds if

x0:i[1,k]fi(x0)0i[k+1,m]fi(x0)<0\exists \vec{x}_0: \\ \forall i \in [1,k] f_i(\vec{x}_0) \le 0 \wedge \forall i \in [k+1,m] f_i(\vec{x}_0) < 0

Dual of LP

Suppose the problem:

minAxbcx\min_{A\vec{x} \le \vec{b}} \vec{c}^\top \vec{x}

Then

L(x,λ)=cx+λ(Axb)=(Aλ+c)xbλL(\vec{x}, \vec{\lambda}) = \vec{c}^\top \vec{x} + \vec{\lambda}^\top(A\vec{x}-\vec{b}) = (A^\top \vec{\lambda} + \vec{c})^\top \vec{x} - \vec{b}^\top\vec{\lambda}

Therefore

g(λ)=minxL(x,λ)={if Aλ+c0bλif Aλ+c=0\begin{split} g(\vec{\lambda}) &= \min_{\vec{x}} L(\vec{x}, \vec{\lambda}) \\ &= \begin{cases} -\infin &\text{if } A^\top \vec{\lambda}+c \ne 0 \\ -\vec{b}^\top \vec{\lambda} &\text{if } A^\top \vec{\lambda}+c = 0 \end{cases} \end{split}

So dual is (and strong duality holds for any LP):

maxλ0,Aλ+c=0bλ=d\max_{\vec{\lambda} \ge 0, A^\top \vec{\lambda} + \vec{c} =0} -\vec{b}^\top \vec{\lambda} = d^*

Certificate of Optimality

Let (λ1,ν1)(\vec{\lambda}_1, \vec{\nu}_1) be a dual feasible point,

We know that for any feasible point in the dual problem, the optimum of the primal function is always lower-bounded by the dual objective

pg(λ1,ν1)p^* \ge g(\vec{\lambda}_1, \vec{\nu}_1)

Then if we found a point x1\vec{x}_1 in a primal feasible set,

f0(x1)pf0(x1)g(λ1,ν1)f_0(\vec{x}_1)-p^* \le f_0(\vec{x}_1) - g(\vec{\lambda}_1, \vec{\nu}_1)
So when we’ve picked a primal point whose objective is ϵ\epsilon away from the dual objective, we know it is at most ϵ\epsilon away from the optimal value ⇒ used in e.g. dual ascent

Complementary Slackness

Original Problem:

p=minf0(x)s.t.fi(x)0(i,1im)hi(x)=0(i,1ip)p^* = \min f_0(\vec{x}) \\ \text{s.t.} \\ f_i(\vec{x}) \le 0 (\forall i, 1 \le i \le m) \\ h_i(\vec{x}) = 0 (\forall i, 1 \le i \le p)

Consider primal optimal x\vec{x}^* and dual optimal (λ,ν)(\vec{\lambda}^*, \vec{\nu}^*)

Assume strong duality holds d=g(λ,ν)=p=f0(x)d^* = g(\vec{\lambda}^*, \vec{\nu}^*) = p^* = f_0(\vec{x}^*)

Note:

\begin{split} g(\vec{\lambda}^*, \vec{\nu}^*) &= \min_{\vec{x}} (f_0(\vec{x}) + \sum_{i=1}^m \lambda_i^* f_i(\vec{x}) + \sum_{i=1}^p \nu_i^* h_i(\vec{x})) \\ &\le f_0(\vec{x}^*) + \underbrace{\sum_{i=1}^m \lambda_i^* f_i(\vec{x}^*)}_{\le 0} + \underbrace{\sum_{i=1}^p \nu_i^* h_i(\vec{x}^*)}_{= 0} \\ &\le f_0(\vec{x}^*) + \underbrace{0}_{\mathclap{\text{since } \lambda_i^* \ge 0 \text{ and } f_i(\vec{x}^*) \le 0}} + 0 \end{split}

But in our assumption, we’ve assumed p=dp^* = d^*?

This forces the inequality above to become equalities!

g(λ,ν)=f0(x)+i=1mλifi(x)+i=1pνihi(x)=f0(x)+0+0\begin{align} g(\vec{\lambda}^*, \vec{\nu}^*) &= f_0(\vec{x}^*) + \sum_{i=1}^m \lambda_i^* f_i(\vec{x}^*) + \sum_{i=1}^p \nu_i^* h_i(\vec{x}^*) \\ &= f_0(\vec{x}^*) + 0 + 0 \end{align}

We’ve showed that:

minxL(x,λ,ν)=L(x,λ,ν)\min_{\vec{x}} L(\vec{x},\vec{\lambda}^*,\vec{\nu}^*) = L(\vec{x}^*, \vec{\lambda}^*, \vec{\nu}^*)

Also:

i=1mλifi(x)=0i[1,m],λifi(x)=0\sum_{i=1}^m \lambda_i^* f_i(\vec{x}^*) = 0 \\ \forall i \in [1,m], \lambda_i^* \cdot f_i(\vec{x}^*) = 0

Therefore

👆
Those two conditions are called “complementary slackness” Note that we did not use convexity in deriving this, only assumed strong duality.

KKT Conditions (Karush-Kuhn-Tucker Conditions)

KKT Condition Part I (Necessary Conditions)

x,λ,ν\vec{x}^*, \vec{\lambda}^*, \vec{\nu}^* optimality implies

  1. i[1,m],fi(x)0\forall i \in [1,m], f_i(\vec{x}^*) \le 0
    1. feasible set
  1. i[1,p],hi(x)=0\forall i \in [1,p], h_i(\vec{x}^*) = 0
    1. feasible set
  1. i[1,m],λi0\forall i \in [1,m], \lambda_i^* \ge 0
    1. constraint on λ\lambda, dual feasible set
  1. i[1,m],λifi(x)=0\forall i \in [1,m], \lambda_i^* \cdot f_i(\vec{x}^*) = 0
    1. complementary slackness
  1. f0(x)+λifi(x)+νihi(x)=0\nabla f_0(\vec{x}^*) + \sum \lambda_i^* \nabla f_i(\vec{x}^*) + \sum \nu_i^* \nabla h_i(\vec{x}^*) = 0
    1. Gradient of lagrangian is equal to 0
    1. We proved earlier that x=minxL(x,λ,ν)\vec{x}^* = \min_{\vec{x}} L(\vec{x}, \vec{\lambda}^*, \vec{\nu}^*)
    1. implied by “the main theorem”

👆
Having those conditions do not necessarily mean that we have an optimality, but a way that we find optimal λ,ν\vec{\lambda}, \vec{\nu} is to find points where those property are met and that gives us a base space to search for optimality

KKT Condition Part II (Sufficient Condition)

x~,λ~,ν~\tilde{x}, \tilde{\lambda}, \tilde{\nu} are points that satisfy

  1. i[1,m],fi(x~)0\forall i \in [1,m], f_i(\tilde{x}) \le 0
  1. i[1,p],hi(x~)=0\forall i \in [1,p], h_i(\tilde{x}) = 0
  1. i[1,m],λ~i0\forall i \in [1,m], \tilde{\lambda}_i \ge 0
  1. i[1,m],λ~ifi(x~)=0\forall i \in [1,m], \tilde{\lambda}_i \cdot f_i(\tilde{x}) = 0
  1. f0(x~)+λ~ifi(x~)+ν~ihi(x~)=0\nabla f_0(\tilde{x}) + \sum \tilde{\lambda}_i \nabla f_i(\tilde{x}) + \sum \tilde{\nu}_i \nabla h_i(\tilde{x}) = 0

Then:

x~,λ~,ν~\tilde{x}, \tilde{\lambda}, \tilde{\nu} are primal and dual optimum

Linear Program

Special Case of a CONVEX optimization problem

Standard Form of LP

mincxs.t.Ax=bx0\min \vec{c}^\top \vec{x} \\ \text{s.t.} \\ A\vec{x} = \vec{b} \\ \vec{x} \ge 0

  1. Eliminate inequalities
    1. Add slack variables
    1. j=1naijxjbi\sum_{j=1}^n a_{ij} x_j \le b_i ⇒ \sum_{j=1}^n a_{ij} x_j + s_i = b_i,\ s_i \ge 0
  1. Unconstrainted variables
    1. xiRx_i \in \mathbb{R}xi=xi+xix_i = x_i^+ - x_i^- where xi+0,xi0x_i^+ \ge 0, x_i^- \ge 0

Dual of Standard Form

maxbλs.t.Aλ+c=0λ0\max -\vec{b}^\top \vec{\lambda} \\ \text{s.t.} \\ A^\top \vec{\lambda} + \vec{c} = 0 \\ \vec{\lambda} \ge 0

Simplex Algorithm

Polyhedron

Extreme Point of Polyhedron PP:

{xP¬(yx,zxP,λ[0,1]:x=λy+(1λ)z)}\{ \vec{x} \in P | \neg(\exists \vec{y} \ne \vec{x},\vec{z} \ne \vec{x} \in P, \lambda \in [0,1] : \vec{x} = \lambda\vec{y} + (1-\lambda) \vec{z} ) \}

Fact:

If we assume

Then:

Proof:

P={xAxb},p=minxPcxP = \{x|Ax \le b\}, p^* = \min_{x \in P} c^\top x

Let QQ be the set of all possible solutions

Q={xAxb,cx=p}Q = \{x | Ax\le b, c^\top x = p^* \}

If uu is a vertex of QQ, how do we prove uu is a vertex of PP

Suppose for contradiction that uu is not a vertex of PP

then y,zP,y,zu\exists y, z \in P, y, z \ne u and λy+(1λ)z=u,λ(0,1)\lambda y + (1-\lambda) z = u, \lambda \in (0,1)

We know cu=pc^\top u = p^*λcy+(1λ)cz=cu=p\lambda c^\top y + (1-\lambda) c^\top z = c^\top u = p^*

We know since pp^* is the optimal,

cyp,czpc^\top y \ge p^*, c^\top z \ge p^*

Since both c^\top y and c^\top z are at least p^* and their convex combination equals p^*, we infer

cy=cz=pc^\top y = c^\top z = p^*

So now y,zQy, z \in Q

But uu is a vertex in QQ, we have a contradiction!

Therefore uu must be a vertex in PP.

Quadratic Programs

min12xHx+cxs.t.AxbCx=d\min \frac{1}{2} \vec{x}^\top H \vec{x} + \vec{c}^\top \vec{x} \\ \text{s.t.} \\ A\vec{x} \le \vec{b} \\ C\vec{x} = \vec{d}

If HSnH \in \mathbb{S}^n and H0H \ge 0 then this problem is convex

Eigenvalues of HSnH \in \mathbb{S}^n (assuming unconstrained problem)

  1. HH has at least one negative eigenvalue
    1. p=p^* = -\infin
  1. H0H \ge 0, cR(H)\vec{c} \in R(H)
    1. Rewrite f(x)=12(xx0)H(xx0)+αf(\vec{x}) = \frac{1}{2} (\vec{x} - \vec{x}_0)^\top H (\vec{x} - \vec{x}_0) + \alpha
      1. where \vec{x}_0 satisfies H\vec{x}_0 = -\vec{c} and \alpha is the resulting constant offset
    1. If HH is invertible, then
      1. x=H1c\vec{x}^* = - H^{-1} \vec{c}
    1. If not (HH has a null-space), use the pseudo-inverse (since any Hx0=c-H \vec{x}_0 = \vec{c} would suffice)
      1. H^{\dagger} = V\Sigma^{\dagger}U^\top, where \Sigma^{\dagger} inverts only the nonzero singular values
  1. H0,cR(H)H \ge 0, \vec{c} \notin R(H)
    1. c=Hx0+r\vec{c} = -H \vec{x}_0 + \vec{r}, where \vec{r} \in N(H), \vec{r} \ne 0
    1. Then f(-t\vec{r}) = \frac{1}{2}t^2\vec{r}^\top H \vec{r} - t\vec{c}^\top \vec{r} = -t||\vec{r}||_2^2 \rightarrow -\infin, so p^* = -\infin

Cones

Cone: Set of points CRnC \sube \mathbb{R}^n is a cone iff xC,α0,αxC\forall \vec{x} \in C, \forall \alpha \ge 0, \alpha \vec{x} \in C.

Polyhedral Cone
Second-order Cone in R3\mathbb{R}^3

Second Order Cone Program

minqxs.t.i{1,,m},Aix+bi2cix+di\min \vec{q}^\top \vec{x} \\ \text{s.t.} \\ \forall i \in \{1,\dots,m\}, ||A_i \vec{x} + \vec{b}_i||_2 \le \vec{c}_i^\top \vec{x} + d_i
🤔
Note that each constraint must lay in a second-order cone! (Aix+bi,cix+di)C(A_i \vec{x} + \vec{b}_i, \vec{c}_i^\top \vec{x} + d_i) \in C

Application of Optimizations

Linear Quadratic Regulator (LQR)

xt+1=Axt+But\vec{x}_{t+1} = A\vec{x}_t + B\vec{u}_t

Cost:

C=\sum_{t=0}^{N-1} \frac{1}{2} (\vec{x}_t^\top Q \vec{x}_t + \vec{u}_t ^\top R \vec{u}_t ) + \frac{1}{2} \vec{x}_N^\top Q_f \vec{x}_N

Note: Assume QQ and Q_f are PSD and RR is positive definite (hence invertible)

The goal of LQR is to find the inputs \vec{u}_t that minimize the cost C subject to the dynamics.

🧙🏽‍♂️
Turns out to be a quadratic program, and instead of solving the problem in a normal QP formulation (which grows in runtime as the number of timesteps grows), we will use a Riccati equation to solve this

The Riccati equation here is based on the KKT conditions (whereas the standard derivation is based on the dynamic programming / Bellman equation)

The dual: Slater’s Conditions hold ⇒ Strong duality holds

L(x0,,u0,,λ1,,λN)=t=0N112(xtQxt+utRut)+12xNQfxN+t=0N1λt+1(Axt+Butxt+1)\begin{split} L(\vec{x}_0, \dots, \vec{u}_0, \dots,\vec{\lambda}_1, \dots, \vec{\lambda}_N) &= \sum_{t=0}^{N-1} \frac{1}{2} (\vec{x}_t Q \vec{x}_t + \vec{u}_t^\top R \vec{u}_t) + \frac{1}{2} \vec{x}_N^\top Q_f \vec{x}_N + \sum_{t=0}^{N-1} \vec{\lambda}_{t+1}^\top (A\vec{x}_t + B\vec{u}_t - \vec{x}_{t+1}) \\ \end{split}

Consider KKT Conditions here:

utL=Rut+Bλt+1=0,t=0,1,,N1\nabla_{\vec{u}_t}L = R^\top\vec{u}_t + B^{\top} \vec{\lambda}_{t+1} = 0, \forall t = 0, 1, \dots, N-1
\nabla_{\vec{x}_t} L = Q^\top \vec{x}_t +A^\top \vec{\lambda}_{t+1}- \vec{\lambda}_t = 0, \forall t = 1, \dots, N-1
xNL=QfxNλN=0\nabla_{\vec{x}_N} L = Q_f \vec{x}_N - \vec{\lambda}_N = 0

If we rearrange a bit,

We have (1) dynamics of the “adjoint system”, where λ\lambda is the co-state

λN=QfxN\vec{\lambda}_N = Q_f \vec{x}_N
λt=Qxt+Aλt+1\vec{\lambda}_t = Q^\top \vec{x}_t + A^\top \vec{\lambda}_{t+1}

We can also solve for ut\vec{u}_t

ut=(R)1Bλt+1\vec{u}_t = -(R^\top)^{-1}B^\top \vec{\lambda}_{t+1}

Can we solve this using backward induction?

Goal: Find optimal ut\vec{u}_t
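Yes: below is a sketch of the standard finite-horizon Riccati backward recursion (the notation may differ slightly from lecture), which yields the optimal feedback u_t = -K_t x_t; the dynamics and cost matrices here are toy values.

```python
# P_N = Q_f, and for t = N-1,...,0:
#   K_t = (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
#   P_t = Q + A^T P_{t+1} (A - B K_t)
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])      # toy double-integrator dynamics
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]]); Qf = 10 * np.eye(2)
N = 20

P = Qf
K = [None] * N
for t in reversed(range(N)):                 # backward induction
    K[t] = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K[t])

# Roll the closed-loop system forward from some initial state:
x = np.array([5.0, 0.0])
for t in range(N):
    u = -K[t] @ x
    x = A @ x + B @ u
print(x)          # driven toward the origin
```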

Support Vector Machines (SVM)

Consider binary classification problem

{(xi,yi)}\{ (\vec{x}_i, y_i) \} where i,yi{1,1}\forall i, y_i \in \{-1, 1\}

Separating Hyperplane

A separating hyperplane for the data is

f(x)=wx+bf(\vec{x}) = \vec{w}^\top \vec{x} + b

such that

🧙🏽‍♂️
We have so many separating hyperplanes (given the datapoints are linearly separable), which one do we use? Idea: Use the hyperplane that maximizes the distance between classes

Define distance to hyperplane from a class to be the minimum of all distances of all points in the class to the hyperplane.

So formulation of SVM:

maxw,b,mms.t.yi(wxi+b)>0wxi+bw2m\max_{\vec{w}, b, m} m \\ \text{s.t.} \\ y_i(\vec{w}^\top \vec{x}_i + b) > 0 \\ \frac{|\vec{w}^\top \vec{x}_i + b|}{||\vec{w}||_2} \ge m

If (m,w,b)(m, \vec{w}, b) is a solution, then (m,αw,αb)(m, \alpha \vec{w}, \alpha b) is also a solution ⇒ there’s unlimited representation, so let’s impose norm on w\vec{w}!

w2=1m||\vec{w}||_2 = \frac{1}{m}

With this, we can write

minw22s.t.yi(wxi+b)1\min ||\vec{w}||_2^2 \\ \text{s.t.} \\ y_i (\vec{w}^\top \vec{x}_i + b) \ge 1

This is called a hard-margin SVM

We also have “Soft-margin SVM

minw,b,ξ12w22+C(i=1nξi)s.t.yi(wxi+b)1ξiξi0\min_{\vec{w}, b, \vec{\xi}} \frac{1}{2}||\vec{w}||_2^2 + C(\sum_{i=1}^n \xi_i)\\ \text{s.t.} \\ y_i(\vec{w}^\top \vec{x}_i + b) \ge 1 - \xi_i \\ \xi_i \ge 0

Here C>0C > 0 is a hyperparameter ⇒ large C means we want to minimize the slack(more like hard-margin SVM)

Hinge Loss formulation of SVM

Suppose we consider the 0-1 loss for classification

L01(yi,f(xi))={0if yif(xi)>01if yif(xi)<0L_{01}(y_i,f(x_i)) = \begin{cases} 0 &\text{if } y_i f(\vec{x}_i) > 0 \\ 1 &\text{if } y_i f(\vec{x}_i) < 0 \end{cases}

For the case of SVM, f(\vec{x}_i) = \vec{w}^\top \vec{x}_i + b, and the loss looks at the sign of y_i f(\vec{x}_i)

So in order to do a correct classification, we want to

min1ni=1nL01(yi,wxi+b)\min \frac{1}{n} \sum_{i=1}^n L_{01}(y_i, \vec{w}^\top \vec{x}_i + b)

We cannot optimize this because it is not convex

So we use hinge loss

Lhinge(yi,f(xi))=max(1yif(xi),0)L_{hinge}(y_i, f(\vec{x}_i)) = \max(1-y_i f(\vec{x}_i),0)
Hinge loss is the blue line here

So now we have

minw,b1nLhinge(yi,wxi+b)+λw22\min_{\vec{w}, b} \frac{1}{n} \sum L_{hinge}(y_i, \vec{w}^\top \vec{x}_i + b) + \lambda ||\vec{w}||_2^2
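A small numpy sketch of subgradient descent on this hinge-loss-plus-L2 objective, on a made-up linearly separable dataset (all names and constants here are assumptions, not from the notes).

```python
import numpy as np

rng = np.random.default_rng(12)
n = 100
X = rng.standard_normal((n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # true separator: x1 + x2 = 0

lam, eta = 0.01, 0.1
w, b = np.zeros(2), 0.0
for it in range(500):
    margins = y * (X @ w + b)
    active = margins < 1                          # points violating the margin
    # subgradient of (1/n) Σ max(1 - y_i(w^T x_i + b), 0) + λ||w||^2
    gw = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
    gb = -y[active].sum() / n
    w -= eta * gw
    b -= eta * gb

print(np.mean(np.sign(X @ w + b) == y))           # training accuracy ≈ 1.0
```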
📌
We will see that this form of hinge loss formulation is exactly the same optimization problem as soft-margin SVM

Remember Soft-margin SVM formulation:

minw,b,ξ12w22+C(i=1nξi)s.t.yi(wxi+b)1ξiξi0\min_{\vec{w}, b, \vec{\xi}} \frac{1}{2}||\vec{w}||_2^2 + C(\sum_{i=1}^n \xi_i)\\ \text{s.t.} \\ y_i(\vec{w}^\top \vec{x}_i + b) \ge 1 - \xi_i \\ \xi_i \ge 0

Transform the formulation

minw,b,ξ12w22+C(i=1nξi)s.t.ξimax(1yi(wxi+b),0)\min_{\vec{w}, b, \vec{\xi}} \frac{1}{2}||\vec{w}||_2^2 + C(\sum_{i=1}^n \xi_i)\\ \text{s.t.} \\ \xi_i \ge \max (1-y_i(\vec{w}^\top \vec{x}_i + b), 0)

We can claim that at optimum, the constraint is tight (otherwise the objective function can be lowered)

\min_{\vec{w}, b} \frac{1}{2}||\vec{w}||_2^2 + C\bigg(\sum_{i=1}^n L_{hinge}(y_i, \vec{w}^\top \vec{x}_i + b)\bigg)

Therefore if we let C=12nλC = \frac{1}{2n \lambda} then this is equivalent to the hinge-loss formulation

Dual perspective of Soft-margin SVM

L(w,b,ξ,α,β)=12w22+Ci=1nξi+i=1nαi((1ξi)yi(wxi+b))+i=1nβi(ξi)=12w22i=1nαiyi(wxi+b)+i=1nαi+i=1n(Cαiβi)ξi\begin{split} L(\vec{w},b,\vec{\xi},\vec{\alpha},\vec{\beta}) &= \frac{1}{2}||\vec{w}||_2^2 + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i ((1-\xi_i) - y_i(\vec{w}^\top\vec{x}_i + b)) + \sum_{i=1}^n \beta_i (-\xi_i) \\ &=\frac{1}{2} ||\vec{w}||_2^2 - \sum_{i=1}^n \alpha_i y_i(\vec{w}^\top\vec{x}_i + b) + \sum_{i=1}^n \alpha_i + \sum_{i=1}^n (C-\alpha_i-\beta_i) \xi_i \end{split}

We have the primal and dual formulations:

p=minw,b,ξmaxα,βL()d=maxα,βminw,b,ξL()p^* = \min_{\vec{w},\vec{b}, \vec{\xi}} \max_{\vec{\alpha}, \vec{\beta}} L(\cdots) \\ d^* = \max_{\vec{\alpha}, \vec{\beta}} \min_{\vec{w},\vec{b}, \vec{\xi}} L(\cdots)

Consider first-order KKT conditions

wL=wαiyixi=0w=i=1nαiyixi\nabla_{\vec{w}} L = \vec{w} - \sum \alpha_i y_i \vec{x}_i = 0 \Longrightarrow \vec{w} = \sum_{i=1}^n \alpha_iy_i\vec{x}_i
Lb=αiyi=0\frac{\partial L}{\partial b} = -\sum \alpha_i y_i = 0
Lξi=Cαiβi=0\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0

Consider complementary slackness

αi((1ξi)yi(wxi+b))=0,i\alpha_i ((1-\xi_i) - y_i(\vec{w}^\top\vec{x}_i + b)) = 0, \forall i
βiξi=0\beta_i \xi_i = 0

Combining those equations,

  1. if αi0\alpha_i \ne 0, αiC\alpha_i \ne C, 0<αi<C0 < \alpha_i < C
    1. (1ξi)yi(wxi+b)=0(1-\xi_i) - y_i(\vec{w}^\top\vec{x}_i + b) = 0
    1. βi=Cαi\beta_i = C - \alpha_iβi0\beta_i \ne 0ξi=0\xi_i = 0
    1. yi(wxi+b)=1y_i(\vec{w}^\top \vec{x}_i + b) = 1
      1. Exactly on the margin!
      1. “Support vectors”
  1. if αi0,αi=C\alpha_i \ne 0, \alpha_i = C
    1. Cαiβi=βi=0C-\alpha_i - \beta_i = -\beta_i = 0
      1. Then ξi\xi_i can choose to not be zero(but it’s also possible to be 0)
    1. Also, (1ξi)yi(wxi+b)=0(1-\xi_i) - y_i(\vec{w}^\top\vec{x}_i + b) = 0
      1. yi(wxi+b)=1ξi1y_i(\vec{w}^\top\vec{x}_i + b) = 1-\xi_i \le 1
  1. αi=0\alpha_i=0
    1. βi=C0\beta_i = C \ne 0
    1. ξi=0\xi_i = 0
      1. No margin violation
      1. Decision boundary not dependent on this point

The dual problem

L()=12(αiyixi)(αiyixi)+αiL(\cdots) = -\frac{1}{2} (\sum \alpha_i y_i\vec{x}_i)^\top(\sum \alpha_i y_i \vec{x}_i) + \sum \alpha_i

By substituting in KKT conditions

We have expressed the whole lagrangian in dual variables!

L()=12[α1y1αnyn][x1x2xn][x1x2xn][α1y1αnyn]+αi=12αdiag(y)XXdiag(y)undefinedQα+αi\begin{split} L(\cdots) &= -\frac{1}{2}\begin{bmatrix} \alpha_1 y_1 &\cdots &\alpha_ny_n \end{bmatrix} \begin{bmatrix} \vec{x}_1^\top \\ \vec{x}_2^\top \\ \vdots \\ \vec{x}_n^\top \end{bmatrix} \begin{bmatrix} \vec{x}_1 & \vec{x}_2 & \cdots & \vec{x}_n \end{bmatrix} \begin{bmatrix} \alpha_1 y_1 \\ \vdots \\ \alpha_n y_n \end{bmatrix} + \sum \alpha_i \\ &=-\frac{1}{2} \vec{\alpha}^\top \underbrace{diag(\vec{y}) XX^\top diag(\vec{y})}_{Q} \vec{\alpha} + \sum \alpha_i \end{split}

Kernels(Lifting)

Instead of using

f(x)=wx+bf(\vec{x}) = \vec{w}^\top \vec{x}+b

Use

f(x)=wΦ(x)+bf(\vec{x})=\vec{w}^\top\Phi(\vec{x})+b

Interior Point Methods

minf0(x)s.t.fi(x)0,iAx=b\min f_0(\vec{x}) \\ \text{s.t.} \\ f_i(\vec{x}) \le 0, \forall i \\ A\vec{x} = \vec{b}

Conditions:

  1. fi(x)f_i(\vec{x}) are convex, twice differentiable
  1. ARp×nA \in \mathbb{R}^{p \times n}, rank(A)=prank(A) = p, A has full row rank
  1. Problem is solvable, x\vec{x}^* exists
  1. Slater’s condition holds ⇒ strong duality

With those conditions, for optimal points, the KKT conditions are necessary and sufficient!

x,λ,ν s.t.Ax=b,fi(x)0,λ0,f0(x)+i=1mλifi(x)+Aν=0,λifi(x)=0,i\exists \vec{x}^*, \vec{\lambda}^*, \vec{\nu}^* \text{ s.t.} \\ A\vec{x}^* = \vec{b}, f_i(\vec{x}^*) \le 0, \vec{\lambda}^* \ge 0, \\ \nabla f_0(\vec{x}^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(\vec{x}^*) + A^\top \vec{\nu}^* = 0, \\ \lambda_i^* \cdot f_i(\vec{x}^*) = 0, \forall i

Those works for:

Motivation:

🔥
Convert our problem to a series of equality constrained problems and remove inequality constraints ⇒ IPM

Barrier Function

minAx=bf0(x)+i=1nI(fi(x))\min_{A\vec{x} = b} f_0(\vec{x})+\sum_{i=1}^n I_-(f_i(\vec{x}))

Where

I_-: \mathbb{R} \rightarrow \mathbb{R} \text{ s.t. } I_-(u) = \begin{cases} 0 &\text{if } u\le 0\\ \infin &\text{if } u > 0 \end{cases}

Can we make a differentiable convex approximation to this function?

Iapprox(u)=1tlog(u)I_{approx}(u) = -\frac{1}{t} \log(-u)

Where t>0t > 0, as tt gets larger and larger the approximation is sharper and sharper

IapproxI_{approx} is called the approximate barrier function

Therefore we transformed the original problem into an approximate form, using logarithmic barrier function

ϕ(x)=i=1mlog(fi(x))\phi(\vec{x}) = -\sum_{i=1}^m \log(-f_i(\vec{x}))
minAx=bf0(x)+1tϕ(x)\min_{A\vec{x} = \vec{b}} f_0(\vec{x}) + \frac{1}{t} \phi(\vec{x})

Compute the gradient and hessian

ϕ(x)=i=1m1fi(x)fi(x)\nabla \phi(\vec{x}) = \sum_{i=1}^m -\frac{1}{f_i(\vec{x})}\nabla f_i(\vec{x})
2ϕ(x)=i=1m1fi(x)2fi(x)fi(x)+i=1m1fi(x)2fi(x)\nabla^2 \phi(\vec{x}) = \sum_{i=1}^m \frac{1}{f_i(\vec{x})^2} \nabla f_i(\vec{x})\nabla f_i(\vec{x})^\top + \sum_{i=1}^m -\frac{1}{f_i(\vec{x})} \nabla^2 f_i(\vec{x})

Assume that this is solvable by Newton + unique solution

Solution: x(t)\vec{x}^*(t) ⇒ belongs to the “central path”

This solution should satisfy the KKT condition

Ax(t)=b,fi(x(t))<0,ν^:tf0(x(t))+ϕ(x(t))+Aν^=0A \vec{x}^*(t) = b, f_i(\vec{x}^*(t)) < 0, \\ \exists \hat{\vec{\nu}} : t \cdot \nabla f_0(\vec{x}^*(t)) + \nabla \phi(\vec{x}^*(t)) + A^\top \hat{\vec{\nu}} = 0

From this define

λi(t)=1tfi(x(t)),ν(t)=ν^/t\lambda_i^*(t) = -\frac{1}{t f_i(\vec{x}^*(t))}, \vec{\nu}^*(t) = \hat{\vec{\nu}}/t

We claim that λ,ν\vec{\lambda}^*, \vec{\nu}^* is the feasible duals for the original problem

Let’s write the FOC condition of the transformed problem

tf0(x(t))+1fi(x(t))fi(x(t))+Aν^=0f0(x(t))+λi(t)fi(x(t))+Aν(t)=0t \cdot \nabla f_0(\vec{x}^*(t)) + \sum -\frac{1}{f_i(\vec{x}^*(t))} \cdot \nabla f_i(\vec{x}^*(t)) + A^\top \hat{\vec{\nu}} =0 \\ \nabla f_0(\vec{x}^*(t)) + \sum \lambda_i^*(t) \cdot \nabla f_i(\vec{x}^*(t)) + A^\top \nu^*(t) = 0
🔥
Notice that now this is the FOC of the lagrangian of the original problem!

Means ⇒ x(t)\vec{x}^*(t) is the minimizer of L(x,λ(t),ν(t))L(\vec{x}, \vec{\lambda}^*(t), \vec{\nu}^*(t))

But we can derive a partial dual for the original problem

g(λ(t),ν(t))=minxL(x,λ(t),ν(t))=f0(x(t))+i=1mλi(t)fi(x(t))+ν(t)(Ax(t)b)=f0(x(t))mt\begin{split} g(\vec{\lambda}^*(t), \vec{\nu}^*(t)) &= \min_{\vec{x}} L(\vec{x}, \vec{\lambda}^*(t), \vec{\nu}^*(t)) \\ &= f_0(\vec{x}^*(t)) + \sum_{i=1}^m \lambda_i(t) \cdot f_i(\vec{x}^*(t))+\vec{\nu}^*(t)^\top (A\vec{x}^*(t)-\vec{b}) \\ &= f_0(\vec{x}^*(t)) -\frac{m}{t} \end{split}

Another way of stating λi(t)=1tfi(x(t))\lambda_i(t) = -\frac{1}{t f_i(\vec{x}^*(t))}:

λi(t)fi(x(t))=1t-\lambda_i(t) f_i(\vec{x}^*(t)) = \frac{1}{t}

This is “approximate complementary slackness” ⇒ as tt \rightarrow \infin, complementary slackness is satisfied

Say I want accuracy ϵ\epsilon

Set ϵ=mt\epsilon = \frac{m}{t}t=mϵt = \frac{m}{\epsilon} ⇒ solve the approximate problem using equality constrained Newton’s method

Barrier Methods

Given strictly feasible x,t>0,μ>1,ϵ>0x, t > 0, \mu > 1, \epsilon >0

Repeat:

  1. Centering
    1. Compute x(t)\vec{x}^*(t) by minimizing tf0(x)+ϕ(x)t f_0(\vec{x}) + \phi(\vec{x}) s.t. Ax=bA\vec{x} = \vec{b}
    1. Starting at x1\vec{x}_1, using Newton’s method
  1. Update x:=x(t)\vec{x} := \vec{x}^*(t)
  1. Stopping criterion: quit if m/tm/t < ϵ\epsilon
  1. Increase t=μtt = \mu \cdot t
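A compact sketch of the barrier method above for a toy problem with only inequality constraints, min (1/2)||x - x_target||^2 s.t. x >= 0 (so f_i(x) = -x_i); the centering step uses Newton with a simple feasibility backtracking line search. The variable names and tolerances are my own choices.

```python
import numpy as np

x_target = np.array([1.0, -2.0, 0.5, -0.3])     # made-up data
m = len(x_target)                               # number of inequality constraints

def center(x, t, iters=50):
    """Minimize t*f0(x) + phi(x) with Newton, where phi(x) = -sum(log(x))."""
    for _ in range(iters):
        g = t * (x - x_target) - 1.0 / x        # gradient
        H = t * np.eye(m) + np.diag(1.0 / x**2) # Hessian
        step = np.linalg.solve(H, g)
        alpha = 1.0                             # backtrack to stay strictly feasible
        while np.any(x - alpha * step <= 0):
            alpha *= 0.5
        x = x - alpha * step
    return x

x = np.ones(m)          # strictly feasible start
t, mu, eps = 1.0, 10.0, 1e-8
while m / t >= eps:
    x = center(x, t)    # centering step (warm-started from previous x)
    t *= mu             # increase t

print(x)                          # ≈ max(x_target, 0): negative targets go to ~0
print(np.maximum(x_target, 0.0))
```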