CS189 Final Review Doc

Created by Yunhao Cao (Github@ToiletCommander) in Spring 2022 for UC Berkeley CS189.

Reference Notice: Material largely derived from Prof. Shewchuk's lecture notes; some ideas were borrowed from Wikipedia, Stanford CS231N, and Andrew Ng's intro ML course at Stanford.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Last Update: 2022-05-13 21:46 PST ⇒ After final release

Machine Learning Abstraction

Application / Data ⇒ Is the data labeled? 1. Yes: categorical (classification) or quantitative (regression) 2. No: similarity (clustering) or positioning (dimensionality reduction)
Model ⇒ Decision functions: linear, polynomial, logistic, neural net, nearest neighbours, decision trees; features; low vs. high capacity (affects overfitting, underfitting, inference)
Optimization Problem ⇒ Variables (weights), objective function, constraints
Optimization Algorithm ⇒ Gradient descent, simplex, SVD

Classifiers

Suppose we have the Bayes optimal classifier $r(x)$; that is, $r(x)$ outputs the prediction y that minimizes the Bayes risk $R(r(x))$.

3 ways to build classifiers

  1. Generative Models(e.g. LDA)
    1. Assume sample points come from probability distributions, different for each class.
    1. Guess form of distributions
    1. For each class C, fit distribution parameters to the class-C points, giving $f(X|Y=C)$
    1. For each class C, estimate $P(Y=C)$
    1. Bayes' Theorem gives $P(Y|X)$
    1. If 0-1 loss, pick the class C that maximizes
      1. $P(Y=C|X=x) \propto f(X=x|Y=C)\,P(Y=C)$
  1. Discriminative Models (e.g. logistic regression)
    1. Model $P(Y|X)$ directly.
  1. Decision Boundary
    1. Model $r(x)$ directly

ROC Curve (Receiver Operating Characteristics)

X axis - False Positive Rate (FP) - Predictor predicts + but actually is -

Y axis - Sensitivity / True Positive Rate (TP) - Predictor predicts + and right

1 - x ⇒ Specificity / True Negative Rate (TN) - Predictor predicts - and right

1 - y ⇒ False Negative Rate (FN) - Predictor predicts - but actually is +

We want to minimize x and (1-y)

🔥
The ROC curve has to lie above the line $y=x$; if it doesn't, we can simply flip the classifier's predictions.

Data Cleaning

Centering

🔥
Subtract the mean $\hat{\mu}$ from each sample point:
$X \rightarrow \dot{X}$

Decorrelating

Let $R$ be the uniform distribution on the sample points; then

$$Var(R)=\frac{1}{n} \dot{X}^{\top}\dot{X}$$

We want to perform a linear transformation that maps the sample points to an axis-aligned distribution.

We apply the rotation $Z = \dot{X}V$, where $Var(R)=V\Lambda V^{\top}$ is the eigendecomposition. Then

$$Var(Z) = \frac1n Z^{\top}Z = \frac1n V^{\top}\dot{X}^{\top}\dot{X}V=V^{\top}Var(R)V=V^{\top}V\Lambda V^{\top}V = \Lambda$$

Sphering

Apply the transform $W = \dot{X}\,Var(R)^{-1/2}$

Recall that $\Sigma^{-1/2}$ maps ellipsoids to spheres.

Whitening

Centering + Sphering

$X \rightarrow W$
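
Below is a minimal numpy sketch of the centering → decorrelating → sphering pipeline described in this section. The function name and the small `eps` guard against zero eigenvalues are my own assumptions.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Center and sphere the rows of X (n samples x d features)."""
    X_dot = X - X.mean(axis=0)               # centering: X -> X_dot
    cov = X_dot.T @ X_dot / len(X_dot)        # Var(R) = (1/n) X_dot^T X_dot
    eigvals, V = np.linalg.eigh(cov)          # Var(R) = V Lambda V^T
    Z = X_dot @ V                             # decorrelating: Var(Z) = Lambda
    W = Z / np.sqrt(eigvals + eps)            # sphering: Var(W) = I
    return W                                  # multiply by V.T for X_dot Var(R)^(-1/2) exactly
```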

Ideas

Feature Lifting (Kernel)

Suppose we have a data point $X_i$; the lifted data point is $\Phi(X_i)$.

Parabolic Lifting Map

🔥
$\Phi(X_1),\dots,\Phi(X_n)$ are linearly separable if $X_1, \dots, X_n$ are separable by a hypersphere.
$$\Phi: \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}$$
$$\Phi(X_i)=\begin{bmatrix} X_i \\ \|X_i\|^2 \end{bmatrix}$$

p-degree polynomial

$$\Phi: \mathbb{R}^d \rightarrow \mathbb{R}^{O(d^p)}$$
$$\Phi(X_i)=\begin{bmatrix} x_1^3 &x_1^2x_2 &x_1x_2^2 &x_2^3 &x_1^2 &x_1x_2 &x_2^2 &x_1 &x_2 \end{bmatrix}^{\top} \quad \text{(e.g. } d=2,\ p=3\text{)}$$

Polynomial Kernel:

$$k(x,z)=(x^{\top}z+1)^p = \Phi(x)^{\top}\Phi(z)$$

where $\Phi(x)$ contains every monomial in $x$ of degree $0, \dots, p$

Demonstration of d=2, p=2

$$\begin{split} &(x^{\top}z+1)^2 \\ &= x_1^2z_1^2+x_2^2z_2^2+2x_1z_1x_2z_2+2x_1z_1+2x_2z_2+1 \\ &=\begin{bmatrix} x_1^2 &x_2^2 &\sqrt2x_1x_2 &\sqrt2x_1 &\sqrt2x_2 &1 \end{bmatrix} \begin{bmatrix} z_1^2 &z_2^2 &\sqrt2z_1z_2 &\sqrt2z_1 &\sqrt2z_2 &1 \end{bmatrix}^{\top} \\ &=\Phi(x)^{\top}\Phi(z) \end{split}$$

This saves a lot of computation: we only need an inner product in the original $d$-dimensional space instead of explicitly constructing the $O(d^p)$ polynomial features.
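
As a quick sanity check of the $d=2$, $p=2$ demonstration above, the explicit lifting and the kernel give the same inner product (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 lifting for d = 2, matching the demonstration above."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(x, z, p=2):
    """k(x, z) = (x.z + 1)^p, computed without lifting."""
    return (x @ z + 1.0) ** p

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), poly_kernel(x, z))   # same value, O(d) work
```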

Gaussian Kernel (RBF - Radial Basis Function Kernel)

$$\Phi(x)=\exp(-\frac{x^2}{2\sigma^2})\begin{bmatrix} 1 &\frac{x}{\sigma\sqrt{1!}} &\frac{x^2}{\sigma^2\sqrt{2!}} &\frac{x^3}{\sigma^3\sqrt{3!}} &\dots \end{bmatrix}^{\top} \quad \text{(for $d=1$)}$$

An infinite-dimensional vector; $\Phi(x) \cdot \Phi(z)$ is a series that converges to $k(x,z)$.

RBF Kernel Function:

$$k(x,z)=\exp(-\frac{\|x-z\|^2}{2\sigma^2})$$
🔥
Very popular in practice: 1. gives a very smooth h (infinitely differentiable) 2. behaves like k-nearest neighbours, but smoother 3. oscillates less than polynomials (depending on $\sigma$) 4. $k(x,z)$ acts like a similarity measure 5. sample points "vote" for the value at z, and closer points get a weightier vote.
Choose $\sigma$ by cross-validation. A larger $\sigma$ means wider Gaussians and a smoother h, but more bias and less variance.

Transformation of Space in finding weights

In many ML algorithms we want to find a decision function that takes the form of a geometric shape, like a hyperplane / isocontour / isosurface, in the original data space (x-space). However, to do the optimization we usually look for a weight vector (a point) $\vec{w}$ in w-space that represents the geometric shape in x-space. Therefore we have implicitly performed a transformation from x-space to w-space.

Transformation to w-space in perceptron algorithm

For the perceptron algorithm, the transformation is performed below:

x-space ⇒ w-space
decision boundary ⇒ in x-space: the hyperplane $\{z : w \cdot z = 0\}$ (all points with inner product 0 with $w$); in w-space: the point $w$
data point $x$ ⇒ in x-space: the point $x$; in w-space: the hyperplane $\{z : x \cdot z = 0\}$ (all points with inner product 0 with $x$)

Since we want to force the inequality $x \cdot w \ge 0$ (for $y_i = 1$), then

  1. in x-space, points $x$ with $y_i = 1$ should be on the same side of $\{z : w \cdot z = 0\}$ as $w$
  1. in w-space, $w$ should be on the same side of $\{ z : x \cdot z = 0\}$ as the points $x$ that have $y_i=1$

(Batch) Gradient Descent (GD / BGD)

🔥
Walk downhill in the "steepest" direction (the negative gradient). Complexity: $O(nd)$ per step.

With a risk function $R(w)$, calculate:

$$\nabla R(w) = \begin{bmatrix} \frac{\partial R}{\partial w_1} \\ \frac{\partial R}{\partial w_2} \\ \vdots \\ \frac{\partial R}{\partial w_d} \end{bmatrix}$$

Then each descent step performs:

$$w \leftarrow w - \epsilon\nabla R(w)$$

where $\epsilon$ is the learning rate.

Stochastic Gradient Descent (SGD)

🔥
Instead of GD on the entire batch, do GD on a single training point. Complexity: $O(d)$ per step.

With a per-sample cost function $L(w,X_i)$, calculate:

$$\nabla L(w) = \begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_d} \end{bmatrix}$$

Then each descent step performs:

$$w \leftarrow w - \epsilon\nabla L(w)$$

where $\epsilon$ is the learning rate.

🤔
Note: SGD does not necessarily move in the "steepest" descent direction for the whole batch (and is not guaranteed to work for every problem that GD works for); techniques like RMSProp / momentum can help.
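
Here is a minimal numpy sketch contrasting the two update rules on a toy mean squared-error risk; the synthetic data, learning rate, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # toy n x d design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_risk(w):
    """Gradient of R(w) = (1/n) ||Xw - y||^2."""
    return 2.0 / len(X) * X.T @ (X @ w - y)

eps = 0.05                                               # learning rate
w_gd = np.zeros(3)
for _ in range(500):                                     # batch GD: O(nd) per step
    w_gd -= eps * grad_risk(w_gd)

w_sgd = np.zeros(3)
for _ in range(5000):                                    # SGD: O(d) per step
    i = rng.integers(len(X))
    w_sgd -= eps * 2.0 * X[i] * (X[i] @ w_sgd - y[i])    # gradient of one point's loss
```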

Quadratic Form

A way to visualize symmetric matrices. Shows how applying the matrix affects the length of a vector.

Matrix A stretching along the direction of eigenvector (1,1) with eigenvalue 2 and shrinking along the direction of eigenvector (1,-1) with eigenvalue -1/2.

If a symmetric matrix is diagonal:

  1. eigenvectors are coordinate axes
  1. ellipsoids are axis-aligned

A symmetric matrix MM is:

  1. Positive Definite (PD)
    1. if $\forall w \ne 0,\ w^{\top}Mw > 0$
    1. or: all eigenvalues are positive
  1. Positive Semi-Definite (PSD)
    1. if $\forall w,\ w^{\top}Mw \ge 0$
    1. or: all eigenvalues are non-negative
  1. Indefinite
    1. if both positive and negative eigenvalues exist
  1. Invertible (non-singular)
    1. if it has no zero eigenvalue
⚠️
Note that a positive definite matrix's quadratic form has exactly one point with value 0, while a positive semidefinite (but not definite) one has infinitely many points with value 0.

By eigendecomposition, we see that any real symmetric matrix has:

$$A=V\Lambda V^{\top}$$

where $\Lambda$ is a diagonal matrix containing the eigenvalues of $A$ and $V$ is an orthogonal matrix whose columns are the unit eigenvectors of $A$.

Note that $\|A^{-1}x\| = 1$ (equivalently $x^{\top}A^{-2}x = 1$) defines an ellipsoid whose axes and radii are given by the eigenvectors and eigenvalues of $A$.

Neton’s Method

An iterative optimization method for a smooth function $J(w)$.

🔥
You are at point $v$. Approximate $J(w)$ near $v$ by a quadratic function and jump to that quadratic's critical point.

Taylor’s series of cost function around point v:

J(w)wJ(w)w=v+2J(w)w=v(wv)+O(wv2)\nabla J(w) \approx \nabla_wJ(w)\bigg\rvert_{w=v}+\nabla^2 J(w) \bigg\rvert_{w=v}(w-v)+O(||w-v||^2)

Where 2J(w)w=v\nabla^2 J(w) \bigg\rvert_{w=v} is the Hessian of JJ evaluated at point vv

We will have further exapsnsions like 3J(w)w=v\nabla^3 J(w)\bigg\rvert_{w=v}but those are simplified to O(wv2)O(||w-v||^2) since we are only approximating with quadratic function.

By setting J(w)=0\nabla J(w) = 0, we find

w=v(2J(w)w=v)1J(w)w=vw = v-(\nabla^2 J(w)\bigg\rvert_{w=v})^{-1}\nabla J(w)\bigg\rvert_{w=v}
Note that we don’t want to compute a matrix inverse directly. It is faster to solve a linear system of equations, by Cholesky factorization or the conjugate gradient method.

We set $e=-(\nabla^2 J(v))^{-1}\nabla J(v)$; then we just need to solve

$$(\nabla^2 J(v))\,e=-\nabla J(v)$$

Therefore, in conclusion, the whole process of Newton’s method:

  1. Pick a starting point $w$
  1. Repeat until convergence
    1. $e \leftarrow$ solution of the linear system $(\nabla^2 J(w))\,e = -\nabla J(w)$
    1. $w \leftarrow w + e$
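
A short sketch of the loop above. The quadratic test function is an assumed example with a constant Hessian, so a single Newton step lands on its critical point.

```python
import numpy as np

def newtons_method(grad, hess, w0, n_iters=20):
    """Repeatedly solve (hess(w)) e = -grad(w) and step w <- w + e."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        e = np.linalg.solve(hess(w), -grad(w))   # never form the inverse explicitly
        w = w + e
    return w

# J(w) = (w1 - 3)^2 + 2 (w2 + 1)^2 has its minimum at (3, -1).
grad = lambda w: np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])
hess = lambda w: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newtons_method(grad, hess, [0.0, 0.0]))    # -> [ 3. -1.]
```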

Bias-Variance Decomposition

Bias ⇒ error due to the inability of hypothesis $h$ to fit the ground truth $g$ perfectly, e.g. fitting a quadratic $g$ with a linear $h$
Variance ⇒ error due to fitting random noise in the data

For each model, we have

$$X_i \sim D, \quad \epsilon_i \sim D', \quad y_i=g(X_i)+\epsilon_i$$

Note that $D'$ has mean $0$.

We fit a hypothesis $h$ to $X, y$, where $h$ is a random variable (its weights depend on the random training data).

Take any arbitrary point $z \in \mathbb{R}^d$ (not necessarily a sample point, just any point from the ground-truth distribution) and let $\gamma = g(z)+\epsilon,\ \epsilon \sim D'$.

Then the risk function when the loss is squared error is:

$$\begin{split} R(h)&=\mathbb{E}[L(h(z),\gamma)] \quad \text{(the expectation is taken over all possible training sets, i.e. values of the weights)} \\ &=\mathbb{E}[(h(z)-\gamma)^2] \\ &=\mathbb{E}[h(z)^2] + \mathbb{E}[\gamma^2] - 2\mathbb{E}[\gamma h(z)] \\ &=Var(h(z))+\mathbb{E}[h(z)]^2+Var(\gamma)+\mathbb{E}[\gamma]^2 - 2\mathbb{E}[\gamma]\mathbb{E}[h(z)] \\ &=(\mathbb{E}[h(z)]-\mathbb{E}[\gamma])^2+Var(h(z))+Var(\gamma) \\ &=(\mathbb{E}[h(z)]-g(z))^2+Var(h(z))+Var(\epsilon) \end{split}$$

The first term is the squared bias, the second is the variance, and the third is the irreducible error of the method.

$\gamma$ and $h(z)$ are independent, since $h(z)$ depends only on the training data and $\gamma$ only on the label of z.

🔥
We cannot precisely measure bias or variance of real-world data
  1. For many distributions, $variance \rightarrow 0$ as $n \rightarrow \infty$
  1. If $h$ can fit $g$ exactly, then $bias \rightarrow 0$ as $n \rightarrow \infty$
  1. Adding a good feature reduces bias; adding a bad feature rarely increases it.
  1. Adding a feature usually increases variance (added dimension).
  1. Noise in the test set only affects $Var(\epsilon)$
  1. Noise in the training set only affects $bias$ and $Var(h)$
  1. We can test learning algorithms by choosing $g$ and making synthetic data
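
The last point above can be done in a few lines: choose a ground truth $g$, generate many synthetic training sets, fit a linear $h$ to each, and estimate the bias² and variance of $h(z)$ at one test point. The particular $g$, noise level, and sample sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x)            # chosen ground truth g
z = 0.3                                        # an arbitrary test point

preds = []
for _ in range(2000):                          # many independent training sets
    X = rng.uniform(size=20)
    y = g(X) + 0.3 * rng.normal(size=20)       # y_i = g(X_i) + eps_i
    w = np.polyfit(X, y, deg=1)                # fit a linear h
    preds.append(np.polyval(w, z))

preds = np.array(preds)
print((preds.mean() - g(z)) ** 2,              # bias^2 at z
      preds.var())                             # Var(h(z)); irreducible error is 0.3^2
```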

Feature Selection

We want to identify poorly predictive features and ignore them so that we can:

  1. enjoy better inference speed
  1. less variance

Best Subset Selection

🔥
Train a classifier on each of the $2^d-1$ nonempty subsets of features. Choose the best classifier by cross-validation. Very slow.

Forward Stepwise Selection

  1. Start with null model (0 features)
  1. repeatedly add best feature until validation errors start increasing
    1. At each outer iteration, inner loop tries every feature and chooses the best by validation
  1. Requires training $O(d^2)$ models instead of $O(2^d)$
  1. But
    1. it won't find the best 2-feature model if neither of those features yields the best 1-feature model.

Backward stepwise selection

  1. Start with all d features
  1. repeatedly remove feature whose removal gives best reduction in validation error
  1. Also trains $O(d^2)$ models
  1. Additional heuristic: only try to remove features with small weights
    1. Small relative to the variance of least-squares regression
    1. The variance of least-squares regression is proportional to $\sigma^2 (X^{\top}X)^{-1}$
    1. The z-score of weight $w_i$ is $z_i = \frac{w_i}{\sigma\sqrt{v_i}}$ where $v_i$ is the ith diagonal entry of $(X^{\top}X)^{-1}$
    1. A small z-score hints that the "true" $w_i$ could be zero.

Ensemble Learning

🔥
Methods with low bias but high variance (like decision trees) can use this technique to reduce variance.

Idea: average the answers of a bunch of decision trees.

Bagging

Run the same learning algorithm on many random subsamples of one training set.

Works well on most learning algorithms, maybe not k-nearest neighbours?

🔥
Given n-point training sample, generate random subsample of size n’ by sampling with replacement.

If $n'=n$, then each subsample contains about 63.2% of the distinct training points.

Different Learning Algorithms

Using different hypothesis classes can also help reduce variance.

Random Forests

Discussed in the “Algorithms” section

Adaboost

Similar in spirit to random forests, but the learners are trained sequentially on reweighted data; often more powerful.

Discussed in the “Algorithms” section

Kernel Trick

Kernel Trick is developed from the fact that in many algorithms,

  1. weights can be written as a linear combination of sample points
    1. Therefore we can write $w = \sum_{i=1}^n a_i X_i = X^{\top} a$, $a \in \mathbb{R}^n$
  1. we can use inner products of $\Phi(x)$ only, instead of computing $\Phi(x)$ directly.
    1. We define the kernel function $k(x,z) = \Phi(x) \cdot \Phi(z)$
    1. Different kernel functions can be substituted to use different feature maps.
    1. The real magic happens when we can compute $k(x,z)$ really quickly.

Kernel Ridge Regression

Remember in Ridge Regression,

$$(X^{\top}X+\lambda I')w=X^{\top}y \qquad w = (X^{\top}X+\lambda I')^{-1}X^{\top}y$$

To dualize ridge regression, we need the weights to be a linear combination of the sample points, but that will not happen unless we also penalize the bias term $w_{d+1}=\alpha$.

Therefore we center $X$ and $y$ so that the "expected" value of the bias is zero:

$$X_i \leftarrow X_i - \mu_X, \quad y_i \leftarrow y_i - \mu_y, \quad X_{i,d+1}=1$$
🔥
The actual bias won’t usually be exactly zero, but it will often be close enough that we won’t do too much harm by penalizing it.

Now we have

(XX+λI)w=Xy(X^{\top}X+\lambda I)w = X^{\top}y

Suppose aa is the solution to the following:

(XX+λI)a=y(XX^{\top}+\lambda I)a=y

Then we see that

Xy=XXXa+XλIa=(XX+λI)XaX^{\top}y=X^{\top}XX^{\top}a+X^{\top}\lambda I a=(X^{\top}X+\lambda I)X^{\top}a

The LHS of the above equation matches the RHS of the original ridge-regression normal equations, and the factor $X^{\top}a$ on the RHS plays the role of $w$ in the original ridge-regression solution.

🔥
So we've shown that if we set $w = X^{\top}a$, where $a$ solves the dual system above, then $w$ is a solution whose weights are a linear combination of the sample points.

Therefore we want to find $a$, the dual solution that minimizes the dual form of ridge regression (obtained by substituting $w=X^{\top}a$):

$$\|XX^{\top}a-y\|^2+\lambda\|X^{\top}a\|^2$$

Therefore the entire algorithm:

We define $K=XX^{\top}$, where $K_{ij}=k(X_i,X_j)$.

$K$ is singular if $n > d+1$ (and sometimes even if it's not) ⇒ we have to choose a positive $\lambda$.

Training:

$$\text{Solve }(XX^{\top} + \lambda I)a = y \text{ for } a$$

Optimized:

$$\begin{split} \forall i,j \quad K_{ij} \leftarrow k(X_i,X_j) &\qquad\Longleftarrow O(n^2d) \\ \text{Solve } (K+\lambda I)a=y \text{ for } a &\qquad\Longleftarrow O(n^3) \end{split}$$

Testing:

$$\begin{split} h(z) &= w^{\top}z \\ &= a^{\top}Xz \\ &=\sum_{i=1}^n a_i (X_i^{\top}z) \\ &=\sum_{i=1}^n a_i k(X_i,z) \quad \Longleftarrow O(nd) \end{split}$$
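
A minimal sketch of the dual training and testing steps above, using the RBF kernel from earlier; the centering of $X$ and $y$ is omitted, and `lam` / `sigma` are illustrative hyperparameters.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for every pair of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit(X, y, lam=1e-2, sigma=1.0):
    K = rbf_kernel(X, X, sigma)                              # K_ij = k(X_i, X_j), O(n^2 d)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)      # (K + lam I) a = y, O(n^3)

def kernel_ridge_predict(a, X_train, z, sigma=1.0):
    return rbf_kernel(z[None, :], X_train, sigma) @ a        # h(z) = sum_i a_i k(X_i, z)
```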

Comparison between dual and primal

Dual ⇒ solve an $n \times n$ linear system, $O(n^3+n^2d)$
Primal ⇒ solve a $d \times d$ linear system, $O(d^3+d^2n)$

Therefore we prefer the dual when $d > n$ (e.g. when we add a lot of polynomial / RBF features).

Kernel Perceptrons

Featurized(Primal) Perceptron Algorithm:

Training

$$\begin{split} &w \leftarrow y_1 \Phi(X_1) \\ &\text{while some } y_i\,\Phi(X_i)\cdot w < 0 \\ &\quad w \leftarrow w + \epsilon y_i \Phi(X_i) \\ \end{split}$$

Testing

$$h(z)=w\cdot \Phi(z)$$

We dualize the problem with $w = \Phi(X)^{\top}a$.

Same as before, let $K = \Phi(X)\Phi(X)^{\top}$, so $K_{i,j} = k(X_i,X_j)$.

🔥
Now $\Phi(X_i) \cdot w = (\Phi(X)w)_i = (\Phi(X)\Phi(X)^{\top}a)_i=(Ka)_i$

Training:

$$\begin{split} &a \leftarrow \begin{bmatrix} y_1 &0 &\dots &0 \end{bmatrix} \quad \text{start point is arbitrary, but cannot be zero} \\ &\forall i, j,\ K_{ij} \leftarrow k(X_i, X_j) \quad\Longleftarrow O(n^2d) \\ &\text{while some } y_i(Ka)_i < 0 \\ &\quad a_i \leftarrow a_i + \epsilon y_i \quad \Longleftarrow O(1) \text{ time; update } Ka \text{ in } O(n) \text{ time} \\ \end{split}$$

Testing:

Same as kernelized ridge regression,

$$\begin{split} h(z) &= w^{\top}z \\ &= a^{\top}Xz \\ &=\sum_{i=1}^n a_i (X_i^{\top}z) \\ &=\sum_{i=1}^n a_i k(X_i,z) \quad \Longleftarrow O(nd) \end{split}$$

Kernel Logistic Regression

Training:

Our starting point would be zero.

SGD:

$$a_i \leftarrow a_i + \epsilon(y_i-s((Ka)_i))$$

GD:

$$a \leftarrow a + \epsilon(y-s(Ka))$$
🔥
For SGD, update $Ka$ in $O(n)$ time each iteration instead of recomputing $Ka$ from scratch each time.

Testing:

$$h(z)\leftarrow s\left(\sum_{i=1}^n a_i k(X_i,z)\right)$$

High-Dimensional Data

🔥
Intuition about high-dimensional data is very important

Distribution of random points

Suppose we have a random point $p \sim N(0,I) \in \mathbb{R}^d$.

The vast majority of such random points lie at approximately the same distance from the mean, in a thin shell. (Think about the chi-square distribution.)

Angles between random points

$$\cos \theta = \frac{p \cdot q}{\|p\| \|q\|}=\frac{p_1}{\|p\|}$$
$$\mathbb{E}(\cos \theta)=0; \quad Var(\cos\theta)\approx \frac{1}{d} \ \text{(standard deviation} \approx \tfrac{1}{\sqrt{d}}\text{)}$$

Therefore as d becomes larger, $\theta$ becomes more and more likely to be very close to 90°!

Learning Theory

🔥
If we want to generalize (to things we haven't learned about), we must constrain what hypotheses we allow our learner to consider.

Range Space (Set System) $A$: a pair $(P,H)$ where

$P$ is the set of all possible test/training points

$H$ is the set of hypotheses (ranges/classifiers); each hypothesis states which subset of P ($h \subseteq P$) is in class $C$

Examples of classifiers:

  1. Power set classifier: $P$ is a set of $k$ numbers, $H$ is the power set of P, containing all $2^k$ subsets of $P$
    1. This classifier cannot generalize at all: given any subset of $P$ as training data, many hypotheses fit it, and we cannot tell which one generalizes well to the unseen data until we see that data.
  1. Linear classifier: $P = \mathbb{R}^d$, $H$ is the set of all halfspaces $\{x: w \cdot x \ge -\alpha\}$, over all $w, \alpha$

A great question to ask about a certain class of hypotheses would be:

🔥
How well does the training error predict the test error?

Let all training and test points be drawn independently from a probability distribution $D$ defined on the domain $P$.

Suppose our hypothesis is $h \in H$. Then:

the risk / generalization error $R(h)$ of $h$ is the probability that $h$ misclassifies a random point $x$ drawn from $D$ ⇒ the average test error

the empirical risk / training error $\hat{R}(h)$ is the percentage of training points misclassified by $h$

Hoeffding’s inequality tells us probability of bad estimate:

“How likely that a number drawn from a binomial distribution will be far from its mean”

$$\mathbb{P}(|\hat{R}(h)-R(h)|>\epsilon) \le 2e^{-2\epsilon^2n}$$

So... we want to choose the $\hat{h} \in H$ that minimizes $\hat{R}(\hat{h})$, a technique called empirical risk minimization.

However, it is often computationally infeasible to pick the best hypothesis ⇒ SVMs find a linear classifier with zero training error when the training data is linearly separable, but when it isn't, SVMs try to find a linear classifier with low training error, not the minimum.

🔥
It's okay to have infinitely many hypotheses, but it's not okay to have too many dichotomies.

Dichotomies

A dichotomy of $X$ is $X \cap h$, where $h \in H$

It picks out the training points that h predicts are in class C

There are up to $2^n$ dichotomies. The more dichotomies, the more likely it is that one of them will get lucky and have misleadingly low empirical risk.

Given $\Pi$ dichotomies,

$$\mathbb{P}(\text{at least one dichotomy has } |\hat{R}-R|>\epsilon) \le \delta, \quad \text{where } \delta = 2\Pi e^{-2\epsilon^2n}$$
$$\text{Fixing a value of $\delta$ and solving for $\epsilon$: } \forall h \in H, \quad |\hat{R}-R|\le \epsilon=\sqrt{\frac{1}{2n}\ln\left(\frac{2\Pi}{\delta}\right)}$$

Therefore, the smaller we make $\Pi$ and the bigger we make $n$, the more accurately we can predict the risk from the empirical risk.

Suppose we choose $\hat{h}$ as our hypothesis; then with probability $\ge 1 - \delta$,

$$R(\hat{h})\le \hat{R}(\hat{h})+\epsilon\le \hat{R}(h^*)+\epsilon\le R(h^*)+2\epsilon$$

where

$$\epsilon = \sqrt{\frac{1}{2n}\ln\left(\frac{2\Pi}{\delta}\right)}$$

and $h^*$ is the optimal hypothesis in the class $H$.

Therefore, with enough training data and a limit on the number of dichotomies, empirical risk minimization usually chooses a classifier close to the best one in the hypothesis class.

After choosing δ\delta and ϵ\epsilon,

the sample complexity is the number of training points needed to achieve this $\epsilon$ with probability $\ge 1-\delta$:

$$n \ge \frac{1}{2\epsilon^2}\ln\left(\frac{2\Pi}{\delta}\right)$$

Shatter Function

number of dichotomies: $\Pi_H(X) = |\{X \cap h: h \in H\}| \in [1,2^{n}]$ where $n = |X|$

shatter function: $\Pi_H(n) = \max_{|X|=n,\ X \subseteq P} \Pi_H(X)$ ⇒ the maximum number of dichotomies of any training set of size n

It's called a shatter function because $H$ shatters a set $X$ of n points if $\Pi_H(X)=2^n$.

e.g. linear classifiers in the plane, $H$ = set of all halfplanes, $\Pi_H(3)=8$

Fact: for all range spaces, $\Pi_H(n)$ is either polynomial in $n$ or $2^n$

Cover’s theorem

Linear classifiers in $\mathbb{R}^d$ allow up to $\Pi_H(n)=2\sum_{i=0}^d {n-1 \choose i}$ dichotomies of n points.

For $n \le d+1$, $\Pi_H(n)=2^n$

For $n \ge d+1$, $\Pi_H(n) \le 2\left(\frac{e(n-1)}{d}\right)^d \quad \Longleftarrow$ polynomial in n with exponent d

🔥
Linear classifiers need only $n \in O(d)$ training points for the training error to predict the test error well.

VC Dimension (Vapnik-Chervonenkis Dimension)

$$VC(H)=\max\{n: \Pi_H(n) = 2^n\} \quad \Longleftarrow \text{can be } \infty$$

The VC dimension bounds the exponent of a polynomial shatter function.

Theorem:

$$\Pi_H(n) \le \sum_{i=0}^{VC(H)}{n \choose i}$$

Therefore,

$$\Pi_H(n) \le \left(\frac{e \, n}{VC(H)}\right)^{VC(H)} \quad (n \ge VC(H))$$

Also: $O(VC(H))$ training points suffice for accuracy (though the hidden constant is big).

Algorithms

Perceptron Algorithm

🤔
Augment the data with an extra feature fixed to 1 so that one weight can act as a bias, offsetting the separating hyperplane from the origin.

Label

$$y_i = \begin{cases} 1 &\text{if } X_i \in \text{class C} \\ -1 &\text{otherwise} \end{cases}$$

Goal:

Find weights ww such that

$$X_i \cdot w \ge 0 \text{ if } y_i = 1, \qquad X_i \cdot w \le 0 \text{ if } y_i = -1$$

Loss Function:

$$L(\hat{y}_i,y_i)=\begin{cases} 0 &\text{if } y_i\hat{y}_i \ge 0 \\ -y_i\hat{y}_i &\text{otherwise} \end{cases}$$

Cost/Risk Function

$$\begin{split} R(w)&=\frac1n \sum_{i=1}^nL(X_i \cdot w, y_i) \\ &=\frac1n \sum_{i: y_iX_i \cdot w < 0} -y_iX_i \cdot w \end{split}$$

If w classifies all of $X_1, \dots, X_n$ correctly, then $R(w) = 0$.

Optimizer:

  1. Gradient descent or SGD.
    1. Both are guaranteed to work if the data is linearly separable
    1. Guaranteed to classify all data correctly in at most $O(r^2/\gamma^2)$ iterations, where $r=\max_i \|X_i\|$ is the radius of the data and $\gamma$ is the maximum margin (defined in Max-Margin Classifier).
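
A small sketch of perceptron training as described above, assuming the data matrix is already augmented with a 1s column and the labels are ±1; the iteration cap is my own safeguard for non-separable data.

```python
import numpy as np

def perceptron(X, y, eps=1.0, max_iters=10000):
    """SGD on the perceptron risk; returns w with y_i (X_i . w) >= 0 if separable."""
    w = y[0] * X[0].astype(float)                 # arbitrary nonzero start, e.g. y_1 X_1
    for _ in range(max_iters):
        wrong = np.where(y * (X @ w) < 0)[0]      # points violating y_i X_i . w >= 0
        if len(wrong) == 0:
            break                                 # R(w) = 0: everything classified correctly
        w += eps * y[wrong[0]] * X[wrong[0]]      # step on one misclassified point
    return w
```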

Hard-Margin Support Vector Machine (Hard-Margin SVM)

The problem with the perceptron algorithm is that infinitely many hyperplanes can fit the data, since nothing constrains the weight vector beyond classifying the training data correctly (it can fit the training data 100% but may not generalize well to the validation or test set). This can lead to increased variance.

🔥
One idea that generalizes well to the validation / test set is to find the unique boundary that maximizes the margin between the two classes in the training set.

We will enforce the constraints:

$$y_i(w \cdot X_i + \alpha) \ge 1 \quad \text{ for } i \in [1,n]$$

Note that the RHS is 1, which makes it impossible for the weight vector $w$ to be set to zero.

🔥
Note: the signed distance from the hyperplane to $X_i$ is $\frac{w}{\|w\|}\cdot X_i + \frac{\alpha}{\|w\|}$. Therefore the margin is $\min_i \frac1{\|w\|} |w \cdot X_i + \alpha| \ge \frac1{\|w\|}$. To maximize the margin, we can instead minimize $\|w\|$. (At the optimal solution the margin is exactly $\frac1{\|w\|}$, because at least one constraint holds with equality.)

Therefore the optimization problem:

$$\text{Find } w \text{ and } \alpha \text{ that minimize } \|w\|^2 \quad \text{s.t. } \forall i \in [1,n],\ y_i(X_i \cdot w + \alpha) \ge 1$$

This is a quadratic program in $d+1$ dimensions with $n$ constraints, and it has a unique solution (if the points are linearly separable).

The reason we don't minimize $\|w\|$ is that it is not smooth at $w=0$, whereas $\|w\|^2$ is smooth everywhere, making it easier to optimize.
Left: SVM in 3D w-space(w1,w2,alpha), Right: 2D Cross-section at w1=1/17. The SVM is trying to find a solution that lays in the pocket that is as close to the origin as possible.

Soft-Margin SVM

🔥
Hard-margin SVM fails if the data is not linearly separable, and it is sensitive to outliers.
Example where one outlier moves the entire boundary of a hard-margin SVM a lot

Idea: allow some points to violate the margin, with slack variables $\xi_i \ge 0$:

$$y_i(X_i \cdot w + \alpha) \ge 1 - \xi_i$$

Now we redefine the margin to be $1/\|w\|$, since the margin is no longer the distance from the decision boundary to the nearest sample point.

🔥
We also want to add a term to the loss function that penalizes the abuse of slack variable.

Therefore the optimization problem:

$$\text{Find } w, \alpha, \xi_i \text{ that minimize } \|w\|^2+C\sum_{i=1}^n \xi_i \quad \begin{split} \text{s.t. } &\forall i,\ y_i(X_i \cdot w + \alpha) \ge 1 - \xi_i \\ &\forall i,\ \xi_i \ge 0 \end{split}$$

$C>0$ is a scalar regularization hyperparameter that trades off:

small C ⇒ desire: maximize the margin $\frac1{\|w\|}$; danger: underfitting; outliers: less sensitive; boundary: more "flat"
big C ⇒ desire: keep most slack variables zero or small; danger: overfitting; outliers: very sensitive; boundary: more sinuous

Gaussian Discriminant Analysis(GDA)

⚠️
We assume that each class comes from a normal (Gaussian) distribution. For each class, estimate the mean $\mu_c$ and the prior $\pi_c = P(Y=C)$. The notation for $\sigma$ differs slightly between QDA and LDA.

Isotropic Gaussians

$$X \sim N(\mu,\sigma^2 I): \quad f(x)=\frac1{(\sqrt{2\pi}\sigma)^d}\exp\left(-\frac{\|x-\mu\|^2}{2\sigma^2}\right)$$

Anisotropic Gaussians(Different variances along different directions)

$$X \sim N(\mu, \Sigma): \quad f(x)=\frac{1}{\sqrt{(2\pi)^d|\Sigma|}}\exp\left(-\frac{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)}{2}\right)$$

where $\Sigma$ is the $d \times d$ positive semidefinite covariance matrix:

$$\Sigma=Var(X)=\begin{bmatrix} Var(X_1) &Cov(X_1,X_2) &\dots &Cov(X_1,X_d) \\ Cov(X_2,X_1) &Var(X_2) &\dots &Cov(X_2,X_d) \\ \vdots &\vdots &\ddots &\vdots \\ Cov(X_d,X_1) &Cov(X_d,X_2) &\dots &Var(X_d) \end{bmatrix}$$

Positive semidefinite: $\forall \vec{w},\ \vec{w}^{\top}\Sigma \vec{w} \ge 0$

and $\Sigma^{-1}$ is the $d \times d$ PSD precision matrix.

Let $q(x) = (x-\mu)^{\top}\Sigma^{-1}(x-\mu)$; then $f(x) = n(q(x))$ where $n(\cdot)$ is a simple, monotonic, convex function (an exponential of the negative of its argument). $n(\cdot)$ does not affect the isosurfaces.

$q(x)$ is the squared distance from $\Sigma^{-1/2}x$ to $\Sigma^{-1/2}\mu$:

$$d(x,\mu)=\|\Sigma^{-1/2}x-\Sigma^{-1/2}\mu\|=\sqrt{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)}=\sqrt{q(x)}$$

Important fact: if you understand the isosurfaces of a quadratic form, then you understand the isosurfaces of a Gaussian; the only difference is that the maximum isovalue is at the mean of the Gaussian.

Suppose there are only two classes C and D. The Bayes decision rule $r^*(x)$ predicts the class C that maximizes $P(Y=C|X=x) \propto f(X=x|Y=C)\,\pi_C$.

Since $\ln(\omega)$ is monotonically increasing for $\omega > 0$,

$$\begin{split} &\argmax_c P(Y=C|X=x)\\ = &\argmax_c Q_c(x) \\ = &\argmax_c \ln((\sqrt{2\pi})^d f_c(x) \pi_c) \\ = &\argmax_c -\frac{\|x-\mu_c\|^2}{2\sigma_c^2}-d\ln(\sigma_c)+\ln(\pi_c) \quad \text{if isotropic} \\ = &\argmax_c -\frac{(x-\mu_c)^{\top}\Sigma_c^{-1}(x-\mu_c)}{2}-\frac12\ln(|\Sigma_c|)+\ln(\pi_c) \quad \text{if anisotropic} \end{split}$$

MLE for Isotropic Gaussian

For isotropic Gaussian, use the following MLE:

$$L(\mu,\sigma|X_1,\dots,X_n) = P(X_1,\dots,X_n|\mu,\sigma)=f(X_1)f(X_2)\cdots f(X_n)$$

“log likelihood”

$$\ell(\mu,\sigma|X_1,X_2,\cdots,X_n)=\ln f(X_1)+\ln f(X_2)+ \cdots +\ln f(X_n)$$

After computing the derivative, we see

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$$
$$\hat{\sigma}^2 = \frac{1}{nd}\sum_{i=1}^n\|X_i-\hat{\mu}\|^2$$

We don't know $\mu$ exactly, so we substitute $\hat{\mu}$ for $\mu$ when computing $\hat{\sigma}^2$.

Quadratic Discriminant Analysis (QDA)

⚠️
Assume each class has its own covariance $\Sigma_c$ or variance $\sigma_c^2$.

Compute the conditional mean $\hat{\mu}_c$ and conditional variance $\hat{\sigma}_c^2$ (or $\hat{\Sigma}_c$) of each class C separately.

Estimate the priors $\hat{\pi}_c = \frac{n_c}{n}$ (the number of points in class C divided by the total number of sample points).

$$\hat{\sigma}_c^2 = \frac{1}{n_cd}\sum_{i:y_i=C}\|X_i-\hat{\mu}_c\|^2$$
$$\hat{\Sigma}_c = \frac{1}{n_c} \sum_{i: y_i = C} (X_i-\hat{\mu}_c)(X_i-\hat{\mu}_c)^{\top}$$

Note: if the covariance matrix has zero eigenvalues, standard QDA doesn't work. We can fix it by adding a regularization term $\lambda I$ to the matrix $\Lambda$ in $\hat{\Sigma}_c = V\Lambda V^{\top}$ and using $\tilde{\Sigma}_c = V(\Lambda + \lambda I) V^{\top}$ as the new covariance matrix. We want $\lambda$ to be relatively small so it doesn't distort the covariance matrix too much.

Note: for two classes, the posterior is $P(Y=C|X=x) = s(Q_C(x) - Q_D(x))$, where $s(\cdot)$ is the sigmoid function.

Linear Discriminant Analysis (LDA)

⚠️
Assume every class has the same covariance $\Sigma$ or variance $\sigma^2$.

Compute the conditional mean $\hat{\mu}_c$ of each class C separately.

Assuming the same variance across classes eliminates the quadratic term, so the decision function becomes linear.

$$\hat{\sigma}^2 = \frac{1}{nd} \sum_c \sum_{i: y_i=C} \|X_i - \hat{\mu}_c\|^2$$
$$\hat{\Sigma} = \frac{1}{n} \sum_c \sum_{i:y_i = C} (X_i-\hat{\mu}_c)(X_i-\hat{\mu}_c)^{\top}$$
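
A compact sketch of the QDA/LDA parameter estimates above. The function name `fit_gda` is hypothetical; `shared_cov=True` pools the covariance across classes as in LDA.

```python
import numpy as np

def fit_gda(X, y, shared_cov=False):
    """Estimate per-class means, priors and covariances (QDA); pool covariances for LDA."""
    classes = np.unique(y)
    n, d = X.shape
    params, pooled = {}, np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                       # conditional mean mu_c
        diff = Xc - mu
        pooled += diff.T @ diff
        params[c] = {"mu": mu,
                     "prior": len(Xc) / n,         # pi_c = n_c / n
                     "Sigma": diff.T @ diff / len(Xc)}   # QDA: per-class covariance
    if shared_cov:
        for c in classes:
            params[c]["Sigma"] = pooled / n        # LDA: one pooled covariance
    return params
```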

Regression

Need:

  1. A regression function $h(x|p)$, where $p$ is a parameter vector
  1. Cost function to optimize

Regression Functions.

(1) Linear: $h(x|w,\alpha) = w \cdot x + \alpha$

(2) Polynomial [equivalent to linear with added polynomial features]

(3) Logistic: $h(x|w,\alpha) = s(w \cdot x + \alpha)$ ⇒ a neural net with no hidden layer, one output node, and a sigmoid activation. Only separates linearly separable classes.

Loss Functions:

(A) Squared error $L(\hat{y},y) = (\hat{y}-y)^2$

(B) Absolute error $L(\hat{y},y)=|\hat{y}-y|$

(C) Logistic loss (cross-entropy loss) $L(\hat{y},y)=-y\ln(\hat{y})-(1-y)\ln(1-\hat{y})$

Cost Functions to Optimize:

(a) Mean loss $J(h) = \frac1n \sum_{i=1}^n L(h(X_i),y_i)$

(b) Maximum loss $J(h) = \max_{i=1}^n L(h(X_i), y_i)$

(c) Weighted sum $J(h) = \sum_{i=1}^n \omega_i L(h(X_i),y_i)$

(d) L2 penalized $J(h) = J_0(h) + \lambda\|w\|^2$ where $J_0(h)$ is one of (a), (b), or (c)

(e) L1 penalized $J(h) = J_0(h) + \lambda\|w\|_1$

Least Squares Linear Regression

(1) Linear Function + (A) Squared Error + (a) Mean Loss

🔥
We augment the data matrix with a column of 1s so that the last weight $w_{d+1}$ acts as the offset $\alpha$ of the hyperplane from the origin.

By calculus,

$$X^{\top}Xw=X^{\top}y$$

However, if $X^{\top}X$ is singular, the problem is underconstrained (infinitely many solutions for $w$).

$$w = (X^{\top}X)^{-1}X^{\top}y$$

$X^{\top}X$ is always PSD; it is singular exactly when the sample points lie on a common hyperplane in the augmented feature space (e.g. when $n < d+1$). If $X^{\top}X$ is singular, we can use the pseudoinverse of this matrix to compute $w$.
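
A minimal numpy sketch of solving the normal equations, plus the pseudoinverse route that also handles a singular $X^{\top}X$; the toy data is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])   # augment with a 1s column
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w = np.linalg.solve(X.T @ X, X.T @ y)             # normal equations X^T X w = X^T y

w_pinv, *_ = np.linalg.lstsq(X, y, rcond=None)    # pseudoinverse solution; works even if
                                                  # X^T X is singular or ill-conditioned
```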

Logistic Regression

(3) Logistic Function + (C) Logistic Loss + (a) Mean Loss

Solved by Gradient Descent

🔥
Usually used for classification. The labels $y_i$ can be probabilities, but usually they're all 0 or 1.
In LDA, the posterior probabilities are often modeled well by a logistic function. So why not just try to fit a logistic function directly to the data, skipping Gaussian modeling?
$$J=\sum_{i=1}^n L(s_i, y_i)=-\sum_{i=1}^n \left( y_i \ln s_i + (1-y_i)\ln(1-s_i) \right)$$

where $s_i = s(X_i \cdot w)$

$$\nabla_{w}J = - X^{\top} (y-s(Xw))$$
$$\nabla^2_w J(w)=X^{\top}\Omega X, \quad \Omega = \begin{bmatrix} s_1(1-s_1) & 0 &\cdots &0 \\ 0 &s_2(1-s_2) & \cdots &0 \\ \vdots &\vdots &\ddots &\vdots \\ 0 &0 &\dots &s_n(1-s_n) \end{bmatrix}$$

Newton's method also works well here; each iteration solves a weighted least-squares problem, and it converges quickly.
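
A minimal batch gradient-descent sketch using the gradient formula above; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_gd(X, y, eps=0.1, n_iters=2000):
    """X is n x d (augmented with a 1s column); y has entries in [0, 1]."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        s = sigmoid(X @ w)
        w += eps * X.T @ (y - s)     # w <- w - eps * grad J = w + eps * X^T (y - s)
    return w
```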

📎
Logistic regression ALWAYS separates linearly separable points: as we scale $w$ to infinite length, $s(X_i \cdot w) \rightarrow 1$ for points in class C, $s(X_i \cdot w) \rightarrow 0$ for points not in class C, and $J(w) \rightarrow 0$. A 2018 paper by Soudry, Hoffer, Nacson, Gunasekar, and Srebro shows that gradient descent applied to logistic regression eventually converges to the maximum margin classifier.
🔥
Compared to LDA: 1. LDA is stable for well-separated classes; logistic regression is surprisingly unstable there. 2. LDA handles more than 2 classes elegantly; logistic regression needs to be modified (softmax regression). 3. LDA is slightly more accurate when the classes are nearly normal. 4. Logistic regression puts more emphasis on the decision boundary / always separates linearly separable points. 5. Misclassified points far from the decision boundary have the biggest effect in logistic regression, whereas LDA gives all points the same weight. 6. Logistic regression is more robust on some non-Gaussian distributions (distributions with large skew). 7. Logistic regression naturally fits labels between 0 and 1.

Least-Squares Polynomial Regression

Replace each $X_i$ with a feature vector $\Phi(X_i)$ containing all terms of degree $0,\dots,p$, e.g. for $d=2$, $p=2$:

$$\Phi(X_i)=\begin{bmatrix} X_{i1}^2 &X_{i1}X_{i2} &X_{i2}^2 &X_{i1} &X_{i2} &1 \end{bmatrix}^{\top}$$

Not numerically stable... higher-degree polynomials tend to oscillate.

Developed from MLE for multivariate normal distributions, see lecture 12 for detail.

Weighted Least Squares Regression

(1) Linear Function + (A) Squared Loss + (c) Weighted Sum Cost

🔥
Some points might be more trusted than others, or there might be certain points that we want to fit particularly well.

Collect the weights $\omega_i$ in an $n \times n$ diagonal matrix $\Omega$.

Find $w$ that minimizes $(Xw-y)^{\top}\Omega(Xw-y)=\sum_{i=1}^n \omega_i(X_i \cdot w - y_i)^2$

Solve for w via the normal equations:

$$X^{\top}\Omega X w = X^{\top}\Omega y$$

Ridge Regression

(1) Linear Function + (A) Squared Loss + (d - a) L2 Penalized Mean Loss

🔥
Adds a regularization (penalty) term for shrinkage: to encourage a small $\|w'\|$.
  1. Guarantees positive definite normal equations
    1. Always a unique solution
Left: without the penalty term, the cost function can have many minima and the problem is ill-posed.

After setting J=0\nabla J = 0,

$$(X^{\top}X+\lambda I')w=X^{\top}y$$

where $I'$ is the identity matrix with its last diagonal entry set to zero (so we don't penalize the bias term $\alpha$).

🔥
Ideally, features should be "normalized" to have the same variance. Alternatively, use an asymmetric penalty by replacing $I'$ with another diagonal matrix.
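
A short sketch of the ridge normal equations with the modified identity $I'$, assuming the bias column is the last column of $X$.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam I') w = X^T y, where I' leaves the bias unpenalized."""
    d = X.shape[1]
    I_prime = np.eye(d)
    I_prime[-1, -1] = 0.0                  # don't penalize alpha (last weight)
    return np.linalg.solve(X.T @ X + lam * I_prime, X.T @ y)
```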

Bayesian Justification for Ridge Regression:

Suppose we have a prior on $w'$: $w' \sim N(0,\sigma^2)$

then

$$f(w|X,y)=\frac{f(y|X,w)\times \text{prior } f(w')}{f(y|X)}=\frac{L(w)f(w')}{f(y|X)}$$

Maximize log posterior:

$$\begin{split} \text{log posterior}&=\ln L(w)+\ln f(w')-\text{const}\\ &=-\text{const}\|Xw-y\|^2-\text{const}\|w'\|^2-\text{const} \end{split}$$

So we minimize $\|Xw-y\|^2+\lambda\|w'\|^2$.

Lasso Regression

(1) Linear Function + (A) Squared Loss + (e - a) L1 Penalized Mean Loss

Similar to ridge regression, but it has the advantage that it often naturally sets some of the weights exactly to zero (not always; due to the geometry of the L1 penalty the weights are just more likely to land at zero).

The isosurfaces of $\|w'\|_1$ are cross-polytopes (we write $w'$ because we don't want to penalize $\alpha$).

The unit cross-polytope is the convex hull of all the positive and negative unit coordinate vectors (think of a shrink wrap stretched tightly around those points).

A typical isosurface of $\|w'\|_1$

Compared to ridge regression, the solution given by lasso is more likely to be sparse, because the isosurfaces of the regularization term and the isosurfaces of the squared distance are more likely to meet at a coordinate axis (a vertex of the cross-polytope).

Lasso can be reformulated as a quadratic program; subgradient descent, least-angle regression (LARS), and forward stagewise algorithms can be used to solve it.

Decision Trees

Still widely used: fast, interpretable, invariant to scaling/translation, robust to irrelevant features, and can achieve 100% training accuracy, but prone to overfitting. Random forests and AdaBoost are introduced to curb some of that overfitting.

  1. Cuts x-space into rectangular cells
  1. Works well with both categorical(split by subset) and quantitative(split by boundary) features
  1. Interpretable result
  1. Decision boundary can be arbitrarily complicated

Learning:

def GrowTree(S):
    if all_same_class(S) or exceeds_max_depth(S):   # stopping conditions
        return most_common_class(S)
    else:
        # choose the best splitting feature j and splitting value (or subset) beta
        j, beta = best_split(S)
        S_left, S_right = partition(S, j, beta)     # filter points into the two children
        return Node(j, beta, GrowTree(S_left), GrowTree(S_right))
⚠️
We may want to stop early (max depth, or stop when 95% of a node's points share a class) because: 1. most of the node's points have the same class anyway (the rest are often outliers), and 2. a complete tree may overfit (its cells get too tiny).
📌
Instead of returning labels, we can also return the posterior of each class, $P(Y=C|v)$. This is pretty reasonable if there are many points in each leaf.

How to select best split:

  1. Try all splits
  1. Choose the split that maximizes the information gain $H(S)-H_{after}$

where $H(S)$ is the entropy of an index set $S$ ⇒ the average surprise

The surprise of $Y$ being in class C is $-\log_2(p_C)$

  1. an event with probability 1 gives us zero surprise
  1. an event with probability 0 gives us infinite surprise

The entropy of an index set $S$ is the average surprise

$$H(S)=-\sum_C p_C \log_2(p_C), \qquad p_C = \frac{|\{i\in S: y_i = C\}|}{|S|}$$
  1. If all points in S belong to the same class ⇒ $H(S)=-1\log_2(1)=0$
  1. Half class C, half class D ⇒ $H(S)=-0.5\log_2(0.5)-0.5\log_2(0.5)=1$
  1. n points, all in different classes ⇒ $H(S)=-\log_2 \frac{1}{n} = \log_2(n)$

where $H_{after}$ is the weighted average entropy after the split:

$$H_{after} = \frac{|S_l|H(S_l)+|S_r|H(S_r)}{|S_l|+|S_r|}$$
🔥
The entropy function we defined is strictly concave. This is important because we want the interior (linear combination of left+right splits) to be strictly below the curve.
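
A small numpy sketch of the entropy and information-gain computations above; the toy labels at the end are just an example split.

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_C p_C log2 p_C, the average surprise of the labels in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(left, right):
    """H(S) - H_after for a proposed split of S into left and right."""
    n_l, n_r = len(left), len(right)
    h_after = (n_l * entropy(left) + n_r * entropy(right)) / (n_l + n_r)
    return entropy(np.concatenate([left, right])) - h_after

# Splitting {C,C,C,D,D,D} into {C,C,C,D} and {D,D} gains about 0.46 bits.
print(info_gain(np.array(["C", "C", "C", "D"]), np.array(["D", "D"])))
```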

Multivariate splits

🔥
Better classifier at cost of worse interpretability or speed

Standard decision trees are fast because they only check one feature at each node. With multivariate splits, each node may look at many features at once, which slows down classification a lot when there are hundreds of features.

Pruning

  1. Grow tree
  1. greedily remove each split whose removal improves validation performance
  1. More reliable than stopping early
📌
The reason pruning often works better than stopping early is that a split that doesn't make much progress is often followed by a split that makes a lot of progress.

Decision Tree Regression

  1. Uses the concept of decision tree
  1. Creates a piecewise constant regression function

Cost:

$$J(S)=\frac{1}{|S|}\sum_{i \in S}(y_i-\mu_S)^2$$

where $\mu_S$ is the mean label $y_i$ over the sample points $i \in S$.

🔥
We choose split that minimizes the weighted average of the costs of the children after the split, instead of minimizing entropy.

Random Forests

🔥
Random sampling (bagging) alone isn't enough: the bagged decision trees look very similar, because the first split almost always chooses the same feature that best "splits" the data.

Idea:

  1. At each tree node, we are only allowed to split on m features (out of d)
    1. not allowed to split on the other d-m features
    1. $m \approx \sqrt{d}$ works well for classification, and $m \approx d/3$ for regression
    1. $m$ is a hyperparameter

Pros:

  1. Sometimes test error reduction up to 100x or 1000x of decision trees

Cons:

  1. Slow
  1. Loses interpretability

Variant: Multivariate split random forest

Left: Axis-aligned splits, middle: splits with lines & arbitrary rotations, right: splits with conic sections
🔥
Special note: For axis-aligned splits Random Forests, the decision boundaries are still segments of linear boundaries since Random Forests are just linear combinations of Decision Trees.

Neural Networks

Oh yes, please also check out my notes for Andrew Ng’s DL Specialization

🔥
Very similar to the perceptron; a network of such layered units is called a multi-layer perceptron (MLP).
Problem with the perceptron: it can only form linear combinations, so it cannot model an XOR gate.

Idea 1: Feature Lifting

Idea 2: Activation Function (Used by NN!) ⇒ Added Non-linearity between the linear combinations, can be as simple as clamping the output

Historically the sigmoid was the default, but now we can use tanh, ReLU, leaky ReLU, softmax, etc.

Therefore steps to train neural network

  1. Design architecture (# of hidden layers, activation function, etc.)
  1. Choose Cost Function (BCE Loss, Squared Loss, etc.)
  1. Choose Optimizer (Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient Descent, Momentum, RMSProp, Adam, AMSGrad, etc.)
  1. Training Loop
    1. Forward prop ⇒ back prop

Problem for Neural Net:

  1. Has a symmetry problem
    1. If the weights are initialized to zero, there's no difference between one hidden unit and another - they all compute the same thing
    1. Break symmetry with random weight initialization
  1. Vanishing/exploding gradients ⇒ gradients fail to pass through a network that is too deep / gradients become too big
    1. Gradient Clipping(for exploding gradient)
    1. BCE loss for the last layer
    1. Skip connections
    1. Centering the hidden units
      1. replace sigmoids with tanh
    1. Whitening Training Data (to gain a great bowl shape in w-space)
    1. BatchNorm layer
  1. If dimension of data is big, then second-order optimization is not applicable
    1. Newton’s method required too much computation power for the Hessian
    1. Alternative - Stochastic Levenberg Marquardt: approximates a diagonal Hessian

To avoid overfitting:

  1. Random init weights
  1. Bagging
  1. L2 Regularization
  1. Dropout

For CNN, see my notes for Andrew Ng.

Principal Component Analysis (PCA)

3D points projected to 2D by PCA

A technique for dimensional reduction (subset of unsupervised learning)

🔥
Find the k directions that capture most of the variation. Before doing PCA, make sure the data is centered.
Normalize the data? Sometimes yes, sometimes no. If some features are much larger than others, they will tend to dominate the Euclidean distance, so if features are in different units of measurement we should probably normalize; if they're in the same unit, it depends on the context (usually we shouldn't).

Reason for PCA:

  1. Speed - Reducing number of dimensions makes some computations cheaper
  1. Variance Reduction - Remove irrelevant dimensions to reduce overfitting
    1. Like subset selection, but features are not axis-aligned
    1. instead linear combinations of input features
  1. Finding a small basis for representing maximum variations in complex things
  1. Minimizes the mean squared projection distance

Projection: the orthogonal projection of a point $x$ onto a vector $w$ is

$$proj_{w}(x)=\begin{cases} (x\cdot w)w &\quad \text{if $w$ is a unit vector, $\|w\|=1$} \\ \frac{x \cdot w}{\|w\|^2}w &\quad \text{otherwise} \end{cases}$$

We want to pick the best k directions, $span\{v_1,v_2,\dots,v_k\}$, to project our inputs onto.

The Singular Value Decomposition of X,

$$X = U \Lambda V^{\top}$$

gives us the best basis to project onto.

Note: here $U,V$ are orthonormal matrices (their columns are unit, mutually orthogonal vectors) and $\Lambda$ is a diagonal matrix of singular values $\sigma_i$.

Columns of $U$ are unit eigenvectors of $XX^{\top}$; columns of $V$ (rows of $V^{\top}$) are eigenvectors of $X^{\top}X$. The singular values $\sigma_i$ are the square roots ($\sigma_i=\sqrt{\lambda_i}$) of the eigenvalues of both $XX^{\top}$ and $X^{\top}X$.

We can verify that by...

$$X^{\top}X = V \Lambda^{\top} U^{\top}U\Lambda V^{\top}=V\Lambda^2V^{\top}$$
$$XX^{\top} = U \Lambda V^{\top}V\Lambda^{\top} U^{\top}=U\Lambda^2U^{\top}$$

Oh yes, now we’ve expressed those two symmetric matrices as their eigen-decomposition form.

🔥
If the data in $X$ are stored row by row, then the principal components of $X$ are the rows of $V^{\top}$. If the data in $X$ are stored column by column, then the principal components of $X$ are the columns of $U$. We pick the first k principal components as the ones with the k biggest singular values.
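
A minimal SVD-based PCA sketch matching the description above (data stored row by row, so the principal directions are the rows of $V^{\top}$):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components."""
    X_dot = X - X.mean(axis=0)                      # center first
    U, s, Vt = np.linalg.svd(X_dot, full_matrices=False)
    components = Vt[:k]                             # directions with the k largest sigma_i
    return X_dot @ components.T, components         # k-dim coordinates, principal directions
```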

Clustering

🔥
Partition data into clusters so points in a cluster are more similar than across clusters

K-Means Clustering

Normalize the data? Sometimes yes, sometimes no. If some features are much larger than others, they will tend to dominate the Euclidean distance, so if features are in different units of measurement we should probably normalize; if they're in the same unit, it depends on the context (usually we shouldn't).

Partition n points into k disjoint clusters

Assign each input point $X_i$ a cluster label $y_i \in [1,k]$.

Cluster $i$ has mean $\mu_i = \frac1{n_i} \sum_{y_j=i} X_j$, given $n_i$ points in cluster $i$.

Find $y$ that minimizes $\sum_{i=1}^k \sum_{y_j=i} \|X_j - \mu_i\|^2$

NP-hard; trying every partition costs $O(nk^n)$.

Lloyd’s Algorithm (Finds a Local Minimum on the Cost Function)

🔥
Each alternating step never increases the cost function; if a step changes nothing, the objective can no longer improve and the algorithm should terminate.

First, start with either

  1. Forgy method: choose k random sample points to be the initial $\mu_i$ ⇒ go to step 2
  1. Random partition: randomly assign each sample point to a cluster ⇒ go to step 1
  1. k-means++: like Forgy, but with a biased distribution so that each center is chosen with a preference for points far from previous centers.

Alternates between

  1. fixing $y_j$, updating $\mu_i$
    1. the optimal $\mu_i$ is the mean of the points in cluster i (by calculus)
  1. fixing $\mu_i$, updating $y_j$
    1. the optimal $y$ assigns each point $X_j$ to the closest center $\mu_i$
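
A compact sketch of Lloyd's algorithm with Forgy initialization; the empty-cluster guard and the iteration cap are my own safeguards.

```python
import numpy as np

def lloyd(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]        # Forgy: k random sample points
    for _ in range(n_iters):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        y = d2.argmin(axis=1)                                # assign each point to closest center
        new_mu = np.array([X[y == i].mean(axis=0) if np.any(y == i) else mu[i]
                           for i in range(k)])               # move centers to cluster means
        if np.allclose(new_mu, mu):
            break                                            # nothing changed: terminate
        mu = new_mu
    return y, mu
```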

k-Medoids Clustering

Generalizes k-means clustering beyond Euclidean distance

Specify a distance (dissimilarity) function between points, $d(x,y)$.

Replace the mean with the medoid, the sample point that minimizes the total distance to the other points in the same cluster.

Hierarchical Clustering

🔥
One difficulty with k-means/k-medoids is that it is hard to choose the number of clusters k before we start, and there isn't any reliable way to do so.
  1. Creates a tree
    1. Every subtree is a cluster

Two techniques

  1. agglomerative clustering (the greedy agglomerative algorithm naturally takes $O(n^3)$ time)
    1. From bottom up
    1. Start with each point being a cluster, repeatedly fuse pairs
  1. divisive clustering
    1. From top down
    1. Start with all points in the same cluster, repeatedly split it.

We need a distance function between clusters A,B

  1. complete linkage $d(A,B)=\max\{d(w,x): w \in A, x \in B\}$
  1. single linkage $d(A,B)=\min\{d(w,x): w \in A, x \in B\}$
  1. average linkage $d(A,B)=\frac{1}{|A||B|}\sum_{w \in A}\sum_{x \in B}d(w,x)$
  1. centroid linkage $d(A,B) = d(\mu_A, \mu_B)$ where $\mu_S$ is the mean of cluster $S$
Visualization of Hierarchical Tree using Dendrogram: The vertical axis encodes all the linkage distances
Comparison of three linkages, the complete linkage gives the best-balanced dendrogram, whereas the single linkage gives a very unbalanced dendrogram that is sensitive to outliers

Random Projection

An alternative to PCA as preprocess for clustering, classification, regression

  1. Sometimes preserves distance better than PCA
    1. best when projecting a very high-dim space to a medium-dim space

We pick a small $\epsilon$, a small $\delta$, and a random subspace $S \subset \mathbb{R}^d$ of dimension $k = \lceil \frac{2\ln(1/\delta)}{\epsilon^2/2-\epsilon^3/3} \rceil$.

The subspace $S$ can be obtained by choosing $k$ random directions and using Gram-Schmidt to orthonormalize them.

We project any point $q$ to $\hat{q} = \sqrt{\frac{d}{k}} \cdot proj_{S}(q)$.

The constant $\sqrt{\frac{d}{k}}$ is there to preserve distances.
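
A short sketch of the construction above: choose k random directions, orthonormalize them (here with a QR factorization instead of explicit Gram-Schmidt), project, and rescale by $\sqrt{d/k}$.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project the rows of X (n x d) onto a random k-dim subspace, scaled by sqrt(d/k)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis of a random subspace
    return np.sqrt(d / k) * (X @ Q)                # hat{q} coordinates; distances roughly preserved
```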

Johnson-Lindenstrauss Lemma(modified)

For any two points $q,w \in \mathbb{R}^d$,

$$(1-\epsilon)\|q-w\|^2 \le \|\hat{q}-\hat{w}\|^2\le (1+\epsilon)\|q-w\|^2$$

holds with probability $\ge 1-2\delta$.

🔥
Typical values: $\epsilon \in [0.02,0.5]$, $\delta \in [1/n^3,0.05]$
⚠️
The squared distance between two points after projection might change by 2% - 50%. For all pairwise point distances to be accurate with high probability, take $\delta < 1/n^2$, which needs a subspace of dimension $\Theta(\log n)$. Reducing $\delta$ doesn't cost much, but reducing $\epsilon$ costs more.

k-Nearest Neighbour

🔥
Idea: when querying a point q, find the k nearest sample points to q and return the "average" (majority or mean) label of those points.
As k becomes larger, the decision boundary is smoother.

Depending on dimension,

  1. 2-5 dimensions: Voronoi diagrams
  1. Medium dimension (up to ~30): k-d trees
  1. Large dim
    1. exhaustive k-NN with PCA or random projection
    1. locality sensitive hashing

Exhaustive k-NN Algorithm:

Given query point q:

  1. Scan through all n sample points, computing the (squared) distance to q
  1. Maintain a max-heap with the k shortest distances seen so far
    1. When we encounter a sample point closer to q than the point at the top of the heap, remove the top of the heap and insert the new point.

Time to train: O(0)O(0)

Time to query: $O(nd+n\log(k))$; expected $\Theta(nd+k \log(n) \log(k))$ if the points are scanned in random order.
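
A minimal sketch of the exhaustive query with a max-heap; Python's `heapq` is a min-heap, so distances are negated, and the majority vote at the end assumes a classification setting.

```python
import heapq
import numpy as np

def knn_query(X, labels, q, k):
    """Exhaustive k-NN: keep the k shortest squared distances in a max-heap."""
    heap = []                                    # holds (-dist2, label); heap[0] is the farthest
    for x, lab in zip(X, labels):
        d2 = float(((x - q) ** 2).sum())
        if len(heap) < k:
            heapq.heappush(heap, (-d2, lab))
        elif d2 < -heap[0][0]:                   # closer than the current k-th nearest
            heapq.heapreplace(heap, (-d2, lab))
    votes = [lab for _, lab in heap]
    return max(set(votes), key=votes.count)      # majority vote among the k neighbours
```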

Voronoi Diagrams

🔥
Original VD only supports 1-NN, there are modified order-k Voronoi digrams, but nobody uses those because the size gets really bad. (for 2D it is O(k2n)O(k^2n))

Let $X$ be a point set. The Voronoi cell (always a convex polyhedron or polytope) of $w \in X$ is

$$Vor_w = \{p \in \mathbb{R}^d: \|p-w\| \le \|p-v\| \quad \forall v \in X\}$$

The Voronoi diagram of $X$ is the set of $X$'s Voronoi cells.

$$\text{Size (number of vertices)} \in O(n^{\lceil d/2 \rceil})$$

2D ⇒ $O(n\log n)$ time to compute the diagram and a trapezoidal map for point location, $O(\log n)$ query time

dD ⇒ Use binary space partition tree (BSP tree) for point location

k-d Trees

Build:

Very similar to decision trees, differences:

  1. Choose the splitting feature with the greatest width ⇒ feature $i = \argmax_{i} (X_{ji} - X_{ki})$
    1. We want each cell to stay as close to cubical as possible
    1. Or we can simply rotate through the features, which builds the tree faster by a factor of $O(d)$
  1. Choose the splitting value ⇒ the median point for feature i
    1. Guarantees $\lfloor \log_2 n \rfloor$ tree depth
  1. Each internal node stores a sample point ⇒ usually it's the splitting point

Build time: $O(nd \log n)$, or $O(n \log n)$ if we rotate through the features

Query:

$\epsilon$ ⇒ parameter for approximate NN ⇒ in a high-dimensional space there are likely to be several points nearly equidistant from the query point, so approximate NN saves a lot of time

$$\begin{split} &Q \leftarrow \text{heap containing the root node with key zero} \\ &r \leftarrow \text{distance to nearest point seen so far (a variable for 1-NN, a max-heap for k-NN)} \\ &\text{while Q not empty and } (1+\epsilon) \cdot minkey(Q) < r \\ &\quad B \leftarrow removemin(Q) \\ &\quad w \leftarrow \text{B's sample point} \\ &\quad r \leftarrow \min\{r, dist(q,w)\} \\ &\quad B', B'' \leftarrow \text{child boxes of B} \\ &\quad \text{if } (1+\epsilon) \cdot dist(q,B')<r \text{ then insert B' into Q with key } dist(q,B') \\ &\quad \text{if } (1+\epsilon) \cdot dist(q,B'') < r \text{ then insert B'' into Q with key } dist(q,B'') \\ &\text{return the point that determined r} \end{split}$$
🔥
Worst case, we have to visit every node in the tree to find the exact nearest neighbour, then the k-d tree is slower than the simple exhaustive search.

AdaBoost

AdaBoost, Clearly Explained
https://youtu.be/LsK-xG1cLYA

👆Check out the video above, very useful

Idea: we will train $T$ classifiers (weak learners) $G_1, \dots, G_T$ and combine them into a big metalearner $M(z)$.

Each weak learner gets a different "amount of say" $\beta_t$ based on how much error it made during training.

Risk Function:

$$R(\beta_1,\dots,\beta_T,G_1,\dots,G_T)=\frac1n \sum_{i=1}^n L(M(X_i),y_i)$$

(Exponential) Loss Function:

$$L(\hat{y_i},y_i)=e^{-\hat{y_i}y_i}=\begin{cases} e^{-\hat{y_i}} &\quad y_i=1 \\ e^{\hat{y_i}} &\quad y_i=-1 \end{cases}$$

With this loss, we pick the $G_t$ that minimizes the sum of sample weights over all misclassified points.

Sample weights (initialize each weight to $\frac{1}{n}$):

$$w_i^{(t+1)}=w_i^{(t)}e^{-\beta_{t}y_iG_t(X_i)}=\begin{cases} w_i^{(t)}e^{-\beta_t} &\quad y_i = G_t(X_i) \\ w_i^{(t)}e^{\beta_t} &\quad y_i \ne G_t(X_i) \end{cases}$$

Amount of say: (Equation obtained by setting derivative of risk to 0)

$$\beta_t = \frac12\ln\left(\frac{1-err_t}{err_t}\right)$$

Where

$$err_t = \frac{\sum_{i: G_t(X_i) \ne y_i}{w_i^{(t)}}}{\sum_{i}{w_i^{(t)}}}$$
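
A compact sketch of the whole loop with decision stumps as the weak learners; the stump search, the clipping of `err`, and all helper names are my own assumptions rather than part of the notes.

```python
import numpy as np

def stump_fit(X, y, w):
    """Weak learner G_t: the decision stump minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best

def adaboost(X, y, T=20):
    """y must have entries in {-1, +1}."""
    w = np.full(len(X), 1.0 / len(X))                  # initialize sample weights to 1/n
    learners = []
    for _ in range(T):
        err, j, thresh, sign = stump_fit(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)          # avoid division by zero in the log
        beta = 0.5 * np.log((1 - err) / err)           # amount of say
        pred = np.where(X[:, j] > thresh, sign, -sign)
        w *= np.exp(-beta * y * pred)                  # up-weight misclassified points
        learners.append((beta, j, thresh, sign))
    return learners

def adaboost_predict(learners, X):
    score = sum(b * np.where(X[:, j] > t, s, -s) for b, j, t, s in learners)
    return np.sign(score)
```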