STAT 134 Notes

Notes for UC Berkeley’s STAT 134, created by Yunhao Cao (Github@ToiletCommander)

Disclaimer:

  1. Some of the images and resources are taken from notes on STAT 134’s official site and the official textbook (Jim Pitman’s “Probability”).
  1. Some of the material is from CS70’s class materials.
  1. Some of the images and resources are taken from Google image search or other websites.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Last Update: 2022-05-08 21:00 PST ⇒ Updated through Ch 6.5 + MGF + Additional Topics (MLE and MAP)

😊
Happy RRR Week!

Notations

Sets

|A| ⇒ Cardinality of set A, the number of elements in A

\emptyset = \{\} ⇒ Empty Set Notation

A \subseteq B or B \supseteq A ⇒ A is a subset of B, or B is a superset of A

A \subset B or B \supset A ⇒ A is a strict subset of B, or B is a strict superset of A

\alpha \in A, \beta \notin A ⇒ \alpha is a member of the set A, but \beta is not a member of the set A

A \cup B ⇒ The UNION of A and B (the set of elements in A, in B, or in both)

A \cap B ⇒ The INTERSECTION of A and B

A \cap B = \emptyset ⇒ A and B are disjoint

B - A = B \setminus A ⇒ Relative Complement of A in B, or Set Difference between B and A.

Note that A \setminus \emptyset = A

A^\complement = \overline{A} = A' ⇒ A’s Complement

Significant Sets

\mathbb{N} ⇒ Natural Numbers

\mathbb{Z}, \mathbb{Z}^+, \mathbb{Z}^- ⇒ Integers, Positive Integers, Negative Integers

\mathbb{Q} ⇒ Rational Numbers, \mathbb{Q} = \{\frac{a}{b}: a,b \in \mathbb{Z}, b \neq 0\}

\mathbb{R} ⇒ Real Numbers

\mathbb{C} ⇒ Complex Numbers, \mathbb{C} = \{a+bi : a,b \in \mathbb{R}\}

Boolean Logic

\neg P \equiv \overline{P} \equiv (\sim P)

Not P

P \times Q \equiv P \cdot Q \equiv P \wedge Q

P and Q

P + Q \equiv P \vee Q

P or Q

P \rightarrow Q \equiv P \implies Q \equiv \neg P \vee Q \equiv \neg Q \rightarrow \neg P

P implies Q, equivalent to not P or Q, and equivalent to its contrapositive (not Q implies not P)

Probability Notations

\Omega ⇒ Outcome Space

\omega ⇒ A single outcome (sample point) in \Omega

Basic Principle of Probability

“Complement Rule”

P(\overline{A}) = 1 - P(A)

“Principle of Inclusion-Exclusion”

|A_1 \cup A_2 \cup A_3 \cup \dots \cup A_n| = \sum_{k=1}^{n}(-1)^{k-1} \sum_{S\subseteq\{1,\dots, n\} : |S| = k} |\bigcap_{i \in S} A_i|
|A_1 \cup A_2 \cup \dots \cup A_n| = \sum_{i=1}^n |A_i| - \sum_{i<j} |A_i \cap A_j| + \sum_{i<j<k} |A_i \cap A_j \cap A_k| - \dots + (-1)^{n-1}|A_1 \cap A_2 \cap \dots \cap A_n|

Similarly, the “Principle of Inclusion-Exclusion” for probability:

P(A_1 \cup A_2 \cup \dots \cup A_n) = \sum_{k=1}^n (-1)^{k-1} \sum_{S \subseteq \{1,\dots,n\}:|S|=k} P(\bigcap_{i \in S} A_i)
P(A_1 \cup A_2 \cup \dots \cup A_n) = \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \dots + (-1)^{n-1}P(A_1 \cap A_2 \cap \dots \cap A_n)

“Two Events are DISJOINT”

P(A \cap B) = 0

“Two Events are INDEPENDENT”

P(A \cap B) = P(A) \cdot P(B)
P(A|B)=P(A|\overline{B}) = P(A)

Conditional Probability

P(A|B) denotes the probability of event A happening given that event B has happened.

P(A|B)=1-P(\overline{A}|B)
P(A|B)=\frac{P(A \cap B)}{P(B)}
P(A|B) \cdot P(B) = P(A \cap B)
P(A) = P(A|B) \cdot P(B) + P(A|\overline{B}) \cdot P(\overline{B})

Bayes’ Rule

P(A|B)=\frac{P(B|A) \cdot P(A)}{P(B)}
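
A quick worked example with hypothetical numbers: suppose a disease has prevalence P(D)=0.01, a test detects it with probability P(+|D)=0.95, and gives a false positive with probability P(+|\overline{D})=0.05. Using Bayes’ Rule, with the rule of average conditional probabilities in the denominator:

P(D|+)=\frac{P(+|D)P(D)}{P(+|D)P(D)+P(+|\overline{D})P(\overline{D})}=\frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.161

So even a positive result from a fairly accurate test leaves the probability of disease at only about 16%.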

Counting & Sampling

Counting

Counting Sequence(Order Matters) Without Replacement

First Rule of Counting: If an object can be made by a succession of k choices, where there are n_1 ways of making the first choice, and for every way of making the first choice there are n_2 ways of making the second choice, and for every way of making the first and second choices there are n_3 ways of making the third choice, and so on up to the k-th choice with n_k ways, then the total number of distinct objects that can be made in this way is the product n_1 \times n_2 \times n_3 \times \dots \times n_k.
n_c = (n)_k = \frac{n!}{(n-k)!}

Counting Sequence With Replacement

n_c = n^k

Counting Sets without replacement

“n choose k”

n_c={n \choose k} = {n \choose n-k} = \frac{(n)_k}{k!} = \frac{n!}{(n-k)!k!}
Second Rule of Counting: Assume an object is made by a succession of choices, and the order in which the choices are made does not matter. Let A be the set of ordered objects and let B be the set of unordered objects. If there exists an m-to-1 function f: A \rightarrow B, we can count the number of ordered objects (pretending that the order matters) and divide by m (the number of ordered objects per unordered object) to obtain |B|, the number of unordered objects.

Counting Sets with replacement

Say you have unlimited quantities of apples, bananas and oranges. You want to select 5 pieces of fruit to make a fruit salad. How many ways are there to do this? In this example, S = {1, 2, 3}, where 1 represents apples, 2 represents bananas, and 3 represents oranges. k = 5 since we wish to select 5 pieces of fruit. Ordering does not matter; selecting an apple followed by a banana will lead to the same salad as a banana followed by an apple.

It may seem natural to apply the Second Rule of Counting because order does not matter. Let’s consider this method. We first pretend that order matters and observe that the number of ordered objects is 3^5, as discussed above. How many ordered options are there for every unordered option? The problem is that this number differs depending on which unordered object we are considering. Let’s say the unordered object is an outcome with 5 bananas. There is only one such ordered outcome. But if we are considering 4 bananas and 1 apple, there are 5 such ordered outcomes (represented as 12222, 21222, 22122, 22212, 22221).

Assume we have one bin for each element of S, so n bins in total. For example, if we selected 2 apples and 1 banana, bin 1 would have 2 elements and bin 2 would have 1 element. In order to count the number of multisets, we need to count how many different ways there are to fill these bins with k elements. We don’t care about the order of the bins themselves, just how many of the k elements each bin contains. Let’s represent each of the k elements by a 0 in the binary string, and separations between bins by a 1.

Example of placement where |S| = 5 and k = 4

Counting the number of multisets is now equivalent to counting the number of placements of the k 0’s

The length of our binary string is k + n − 1, and we are choosing which k locations should contain 0’s. The remaining n − 1 locations will contain 1’s.

n_c = {n+k-1 \choose k}
Zeroth Rule of Counting: If a set A can be placed into a one-to-one correspondence with a set B (i.e. you can find a bijection between the two — an invertible pair of maps that map elements of A to elements of B and vice-versa), then |A| = |B|.
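
As a sanity check of the counting-with-replacement formula (a minimal sketch, using the fruit-salad numbers n = 3 types and k = 5 picks from the example above), we can enumerate every multiset directly and compare the count with {n+k-1 \choose k}:

```python
import math
from itertools import combinations_with_replacement

n, k = 3, 5  # 3 fruit types, choose 5 with repetition (example above)

# Enumerate every multiset of size k drawn from n types.
multisets = list(combinations_with_replacement(range(n), k))

# Stars-and-bars count: C(n + k - 1, k).
formula = math.comb(n + k - 1, k)

print(len(multisets), formula)  # both should be 21
```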

Sampling

A population of N total, with G good and B bad. Sample size n = g + b, with 0 ≤ g ≤ n.

Sampling Sets with replacement

P(\text{g good and b bad}) = {n \choose g}\frac{G^gB^b}{N^n}

Sampling Sets without replacement (Hypergeometric Dist.)

P(\text{g good and b bad}) = {n \choose g} \frac{(G)_g(B)_b}{(N)_n} = \frac{{G \choose g}{B \choose b}}{{N \choose n}}
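
A quick numeric check (with a hypothetical population of N = 20, G = 12, B = 8 and a sample of n = 5) that the two forms above agree, using math.perm for the falling factorials (N)_n:

```python
import math

def falling(n, k):
    """Falling factorial (n)_k = n! / (n-k)!."""
    return math.perm(n, k)

# Hypothetical population: N = 20 items, G = 12 good, B = 8 bad; sample n = 5, want g = 3 good.
N, G, B = 20, 12, 8
n, g = 5, 3
b = n - g

p1 = math.comb(n, g) * falling(G, g) * falling(B, b) / falling(N, n)
p2 = math.comb(G, g) * math.comb(B, b) / math.comb(N, n)
print(p1, p2)  # the two forms of P(g good and b bad) should match
```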

Probability Concepts

Consecutive Odds Ratios

Mainly used for binomial distribution

“Analyze the chance of k successes with respect to k-1 successes”
R(k) = \frac{P(k \text{ successes})}{P(k-1 \text{ successes})}

Law of Large Numbers

As n \rightarrow \infin, the sample mean has a higher and higher probability of being within any \epsilon of the true distribution mean, \mathbb{P}(|\frac{1}{n}S_n-\mu| < \epsilon) \rightarrow 1, no matter how small \epsilon is.

Random Variable

A random variable X on a sample space \Omega is a function X:\Omega \rightarrow \mathbb{R} that assigns to each sample point \omega \in \Omega a real number X(\omega).
From CS70 Sp22 Notes 15

Probability of a value of a discrete random variable

P(X=x) = \sum_{\omega: X(\omega)=x} P(\omega)

Has to satisfy:

\forall x, P(X=x) \geq 0 \\ \sum_{x} P(X=x) = 1

Note:

For a discrete r.v., the CDF (cumulative distribution function) is a step function

Probability of a value of a continuous random variable

P(X=x)=0

Probability of a range of value of a continuous random variable

P(a<X<b)=\int_{a}^{b}f(x)dx

Has to satisfy:

\int_{-\infin}^{\infin}f(x)dx=1 \\ \forall x, f(x) \ge 0

Continuous Random Variable

For continuous r.v.

There’s a Probability Density Function (PDF) and a Cumulative Distribution Function (CDF)

\text{PDF: } f(x) \quad \\ \forall x, f(x) \ge 0 \text{ and } \int_{-\infin}^{\infin}f(x)dx = 1
\text{CDF: } F(x)=P(X<x)=\int_{-\infin}^{x}f(x)dx \\ \lim_{x \to -\infin}F(x) = 0 \text{ and } \lim_{x \to \infin} F(x)=1

Inverse Distribution Function

For what value of x is there probability 1/2 that X ≤ x?
x = F^{-1}(p)

Either calculate the inverse function of F(x)F(x) or solve equation F(x)=pF(x) = p, treating pp as the variable.


Inverse CDF applied to the standard normal distribution

For any cumulative distribution function F with inverse function F^{-1}, if U has the uniform (0,1) distribution, then F^{-1}(U) has the distribution with CDF F(x).

Simulating binomial(n=2,p=0.5) with the uniform distribution. g(u) here is equivalent to F(x) while g(U) here is equivalent to X
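
A minimal sketch of this simulation idea in Python, using the exponential distribution (covered later in these notes) as the target: its CDF is F(t) = 1 - e^{-\lambda t}, so F^{-1}(u) = -\ln(1-u)/\lambda, and feeding Uniform(0,1) draws through F^{-1} should produce samples with mean about 1/\lambda.

```python
import math
import random

random.seed(0)
lam = 2.0                       # rate parameter of the target exponential distribution
n = 100_000

# F(t) = 1 - exp(-lam * t)  =>  F^{-1}(u) = -ln(1 - u) / lam
samples = [-math.log(1.0 - random.random()) / lam for _ in range(n)]

print(sum(samples) / n)         # should be close to 1/lam = 0.5
```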

Probability Distribution

The distribution of a discrete random variable X is the collection of values \{(a, P[X = a]): a \in A \}, where A is the set of all possible values taken by X.

The distribution of a continuous random variable X is defined by its Probability Density Function f(x).

Example: for a die, let X denote the number shown when we roll the die once.

P(X=1) = P(X=2) = P(X=3) = P(X=4) = P(X=5) = P(X=6) = \frac16

Joint Distribution

The joint distribution for two discrete random variables X and Y is the collection of values \{((a, b), P[X = a,Y = b]) : a \in A , b \in B\}, where A is the set of all possible values taken by X and B is the set of all possible values taken by Y.

The joint distribution for two continuous random variables X and Y is defined by the joint probability density function f(x,y). It gives the density of probability per unit area for values of (X,Y) near the point (x,y): P(X\in dx, Y \in dy)=f(x,y)dxdy.

In other words, list all probabilities of all possible X and Y values.

Has to satisfy the following constraints:

\forall(a,b), P(X=a,Y=b) \geq 0 \\ \sum_{a,b} P(X=a,Y=b) = 1
f(x,y) \ge 0 \\ \int_{-\infin}^{\infin}\int_{-\infin}^{\infin}f(x,y)dxdy=1

One useful equation:

P(X=a) = \sum_{b \in B} P(X=a, Y=b)

Symmetry of Joint Distribution

Let X_1,X_2,\dots,X_n be random variables with joint distribution defined by

P(x_1,x_2,\dots,x_n)=P(X_1=x_1,\dots,X_n=x_n)

Then the joint distribution is symmetric if P(x_1,\dots,x_n) is a symmetric function of (x_1,\dots,x_n). In other words, the value of P(x_1,\dots,x_n) is unchanged if we permute any of its arguments.

If the joint distribution is symmetric, then all n! possible orderings of the random variables X_1, \dots, X_n have the same joint distribution, which means that X_1, \dots, X_n are exchangeable. Exchangeable random variables have the same distribution.

Independence of random variables

Random variables X and Y on the same probability space are said to be independent if the events X=a and Y=b are independent for all values a, b. Equivalently, the joint distribution of independent r.v.’s decomposes as

\forall (a,b), P[X=a, Y=b] = P[X=a] \cdot P[Y=b]

Conditional Distribution

Usually used when two variables are dependent.

“Conditional Distribution of Y given X=x”

P(Y=y|X=x)

Continuous Variables:

f_y(y|X=x)=P(Y=y|X=x)=\frac{P(Y \in dy, X \in dx)}{P(X \in dx)} = \frac{f(x,y)}{f_x(x)}
\int f_y(y|X=x) dy = 1
where f(x,y) is the probability density function of the joint distribution of X and Y

Rule of Average Conditional Probabilities:

P(Y=y)=\sum_x P(Y=y, X=x)=\sum_x P(Y=y|X=x)P(X=x)

Conditional Expectation

\mathbb{E}(Y|X) is defined as the function of X whose value at X=x is \mathbb{E}(Y|X=x)

\text{discrete case: }\mathbb{E}(Y|A)=\sum_y yP(Y=y|A) \\ \text{continuous case: } \mathbb{E}(Y|X=x)=\int y f_y(y|X=x)dy

Properties of Conditional Expectation

E(X+Y|A)=E(X|A)+E(Y|A)
E(Y)=E(E(Y|A))=\sum_i E(Y|A_i)P(A_i)
P(Y)=\sum_i P(Y|A_i)P(A_i) \\ P(Y)=\int_a P(Y|A=a)f_{A}(a) da \\ f_y(y)=\int f_y(y|X=x)f_x(x)dx

Identical Distribution

X and Y have the same range, and for every possible value in the range,

\forall v, P(X=v) = P(Y=v)

If X and Y have the same distribution, then...

  1. any event defined in terms of X has the same probability as the corresponding event defined in terms of Y
  1. g(X) has the same distribution as g(Y)

Combination of Random Variables

P(X+Y=k)=\sum_xP(X=x,Y=k-x)
f_z(z)dz=\int_x f(x,z-x)dxdy \\ dy = dz \text{ since } y=z-x
f_z(z)=\int_x f_x(x)f_y(z-x)dx \quad \text{(if X and Y are independent)}
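
A numerical illustration of the convolution formula (my own sketch, assuming two independent Uniform(0,1) variables): discretizing the integral should reproduce the triangular density of Z = X + Y, which peaks at z = 1 with height 1.

```python
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)

# Densities of two independent Uniform(0,1) variables on the grid.
f_x = np.ones_like(x)
f_y = np.ones_like(x)

# Discretized convolution: f_Z(z) = integral of f_X(x) * f_Y(z - x) dx
f_z = np.convolve(f_x, f_y) * dx
z = np.arange(len(f_z)) * dx

# The density of the sum of two uniforms peaks at z = 1 with value 1.
print(z[np.argmax(f_z)], f_z.max())   # roughly (1.0, 1.0)
```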

If Z=Y/X, then the event Z \in dz is drawn in the following diagram

z=\frac{y}{x} \rightarrow y=zx

Note the area of the heavily shaded region here. It looks similar to a parallelogram, so its area is dx \cdot [(z+dz)x-zx]=dz\,dx\,|x|.

P(X \in dx,Z \in dz)=f(x,xz)|x|dxdz

Therefore, if we integrate out X,

f_z(z)=\int_x f(x,xz)|x|dx
🙄
If X and Y are independent positive random variables, f_z(z)=0 \quad (z \le 0).

Equality of random variables

P(X=Y) = 1 \Leftrightarrow X=Y

“Equality implies identical distribution”

X = Y \implies \forall v, P(X=v)= P(Y=v)

Probability of Events of two Random Variables

P(X<Y) = \sum_{(x,y):x<y}P(x,y) =\sum_{x}\sum_{y:y>x}P(x,y)
P(X=Y)=\sum_{(x,y):x=y}P(x,y)=\sum_{x}P(X=x,Y=x)

Symmetry of R.V.

The random variable X is said to be symmetric around 0 if:

P(X=-x)=P(X=x) \Leftrightarrow \text{-X has the same distribution as X} \Leftrightarrow P(X \ge a) = P(-X \le -a) = P(X \le -a)

Linear Function Mapping of Continuous Random Variable

Suppose Y = aX+b and X has PDF f_X(x),

then Y has PDF f_Y(y) = \frac{1}{|a|}f_X(\frac{y-b}{a})

One-to-one Differentiable Function of Continuous Random Variable

Let X be a r.v. with density f_X(x) on the range (a,b). Let Y = g(X) where g is strictly increasing or decreasing on the interval (a,b).

For an infinitesimal interval dy near y, the event Y \in dy is identical to the event X \in dx.

Thus f_Y(y)|dy| = f_X(x)|dx|. The absolute values are needed because for a decreasing function g(x), dy and dx have opposite signs, and only the ratio of their magnitudes matters. Thus:

f_Y(y)=\frac{f_X(x)}{|\frac{dy}{dx}|}

Change of Variable Principle

If X has the same distribution as Y, then g(X) has the same distribution as g(Y), for any function g(\cdot)

Max and Min of Independent R.V.s

CDF makes it easy to find dist of max and mins.

X_{max} = \max(X_1,\dots,X_n) \quad \text{and} \quad X_{min} = \min(X_1,\dots,X_n)

For any number x:

  1. X_{max} \le x \equiv (\forall i, X_i \le x)
  1. X_{min} \ge x \equiv (\forall i, X_i \ge x)

So...

\begin{align} F_{max}(x) &= P(X_{max} \le x) \\ &=P(X_1 \le x, X_2 \le x, \dots, X_n \le x) \\ &=P(X_1 \le x)P(X_2 \le x)\dots P(X_n \le x) \quad \text{(independence)} \\ &=F_1(x)F_2(x) \cdots F_n(x) \end{align}
\begin{align} F_{min}(x) &= P(X_{min} \le x) \\ &= 1 - P(X_{min}>x) \\ &=1 - P(X_1 > x, X_2 > x, \dots, X_n > x) \\ &=1 - P(X_1 > x)P(X_2 > x)\dots P(X_n > x) \quad \text{(independence)} \\ &=1 - (1-F_1(x))(1-F_2(x)) \cdots (1-F_n(x)) \end{align}
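
A quick Monte Carlo check of these formulas (a sketch assuming n = 4 iid Uniform(0,1) variables, for which F_i(x) = x): F_{max}(x) = x^n and F_{min}(x) = 1-(1-x)^n.

```python
import random

random.seed(1)
n, trials, x = 4, 100_000, 0.7

count_max = count_min = 0
for _ in range(trials):
    u = [random.random() for _ in range(n)]
    count_max += max(u) <= x
    count_min += min(u) <= x

print(count_max / trials, x**n)            # F_max(0.7) = 0.7^4 = 0.2401
print(count_min / trials, 1 - (1 - x)**n)  # F_min(0.7) = 1 - 0.3^4 = 0.9919
```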

Expectation

For Discrete R.V.

\mathbb{E}(X) = \sum_{x}x \cdot P(x)
\mathbb{E}(X)=\sum_{k=1}^{\infin}P(X \ge k) \quad (\text{for } X \text{ taking values in } \{0,1,2,\dots\})

For continuous R.V.

\mathbb{E}(X) = \int_{-\infin}^{\infin}xf(x)dx

If we have a non-negative r.v.

\mathbb{E}(X)=\int_{0}^{\infin}P(X \ge x)dx=\int_{0}^{\infin}(1-F(x))dx
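
For example, for T \sim Exponential(\lambda) (covered later in these notes), which is non-negative with P(T \ge t) = e^{-\lambda t}:

\mathbb{E}(T)=\int_0^{\infin}P(T \ge t)dt=\int_0^{\infin}e^{-\lambda t}dt=\frac{1}{\lambda}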

Linearity of Expectation

\mathbb{E}(aX+bY+c) = a\mathbb{E}(X) + b\mathbb{E}(Y) + c, \text{ where } a,b,c \in \mathbb{R}

Independent R.V. Expectation

For Independent R.V. X, Y, we have:

\mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y)

Variance

Var(X)=\sigma_x^2 = \mathbb{E}[(X-\mathbb{E}[X])^2] = \mathbb{E}(X^2) - \mathbb{E}(X)^2
👉
We use variance (squared error) instead of absolute error, \mathbb{E}(|X-\mu_x|), because it is easier to calculate

Variance of the sum of n variables

Var(\sum_k X_k)=\sum_k Var(X_k)+2\sum_{j<k} Cov(X_j,X_k)

Coefficient Property

Var(cX+b) = c^2Var(X), \text{ where } b,c \in \mathbb{R}

Independent R.V. Variance

For Independent R.V. X, Y, we have:

Var(X+Y) = Var(X)+Var(Y)

Standard Deviation \sigma

\sigma_x=\sqrt{Var(X)}

Standardizations of Random Variable

X^* = \frac{X-\mu_x}{\sigma_x}

“X in standard units”, with \mathbb{E}(X^*)=0 and Var(X^*) = \sigma_{X^*} = 1

Skewness of R.V.

Let X be a random variable with E(X) = \mu and Var(X) = \sigma^2, and let X^* = \frac{X-\mu}{\sigma} be X in standard units.

Skewness(X) = E[(X^{*})^3]= \frac{E[(X-\mu)^3]}{\sigma^3}

Markov’s Inequality

Note: For nonnegative R.V.s only

\text{If } X \ge 0, \\ P(X>a) \le \frac{\mathbb{E}(X)}{a}

Chebychev’s Inequality

P(|X-\mathbb{E}(X)| \ge k\sigma_x)\le \frac{1}{k^2} \\ P(|X-\mu| \ge c) \le \frac{Var(X)}{c^2}
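
For example, with k = 2 (no assumption on the distribution of X other than a finite variance):

P(|X-\mathbb{E}(X)| \ge 2\sigma_x)\le \frac{1}{4}

i.e. at least 75% of the probability of any distribution lies within 2 standard deviations of its mean.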

Order Statistics

Let X_1, X_2, X_3, \dots, X_n be independent random variables, each with the same pdf f(x) and cdf F(x).

Denote by X_{(1)}, X_{(2)}, \dots, X_{(n)} the smallest, the second smallest, etc. among X_1, \dots, X_n. What is the distribution of X_{(k)}?

\begin{align} f_{(k)}(x)dx &=P(X_{(k)}\in dx) \\ &=P(\text{one of the X's} \in dx, \text{exactly } k-1 \text{ of the others } < x) \\ &=nP(X_1\in dx,\text{exactly } k-1 \text{ of the others } < x) \\ &=nP(X_1\in dx)P(\text{exactly } k-1 \text{ of the others } < x) \\ &=nf(x)dx {n-1 \choose k-1} (F(x))^{k-1}(1-F(x))^{n-k} \end{align} \\ (-\infin < x < \infin)

Covariance Cov(X,Y)

Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y) \\ Var(X-Y)=Var(X)+Var(Y)-2Cov(X,Y)
X, Y \text{ independent} \implies Cov(X,Y)=0
Cov(X,Y)=\mathbb{E}[(X-\mathbb{E}(X))(Y-\mathbb{E}(Y))] = \mathbb{E}(XY)-\mathbb{E}(X)\mathbb{E}(Y)

Cov(X,Y)>0 ⇒ X and Y are positively dependent: P(X|Y)>P(X), P(Y|X)>P(Y)
Cov(X,Y)=0 ⇒ X and Y are uncorrelated (covariance equal to zero doesn’t always imply independence)
Cov(X,Y)<0 ⇒ X and Y are negatively dependent: P(X|Y)<P(X), P(Y|X)<P(Y)

Covariance of the same variable

Cov(X,X)=Var(X)

Bilinearity of Covariance

Standard Form:

Cov(\sum_i a_iX_i,\sum_j b_jY_j)=\sum_i \sum_j a_i b_j Cov(X_i,Y_j)

Simpler Form:

Cov(X,Y+Z)=Cov(X,Y)+Cov(X,Z)
Cov(W+X,Y)=Cov(W,Y)+Cov(X,Y)

Correlation

Because the magnitude of Cov(X,Y) is hard to interpret, we standardize it to the correlation

Corr(X,Y)=\frac{Cov(X,Y)}{\sigma_x \sigma_y} \in [-1,1]
🤧
If the correlation is equal to \pm1, the two variables have a linear relationship. For details, see the proof above.

Moment Generating Function

Let X be a random variable; then the MGF (Moment Generating Function) of X is defined as

M_X(t)=\mathbb{E}(e^{tX})

Important Properties

Equality of MGF means Equality of CDF

M_X(t)=M_Y(t) \implies F_X(x)=F_Y(x)

Jensen’s Inequality

M_X(t) \ge e^{\mu t}, \quad \mu = \mathbb{E}(X)

Upper tail of random variable using Markov’s Inequality

\mathbb{P}(X\ge a)=\mathbb{P}(e^{tX} \ge e^{ta}) \le \frac{\mathbb{E}(e^{tX})}{e^{ta}}=e^{-ta}M_X(t) \quad (t > 0)

Linear Transformation of Random Variable

M_{\alpha X + \beta}(t)=\mathbb{E}(e^{\alpha tX+\beta t})=e^{\beta t}\mathbb{E}(e^{\alpha tX})=e^{\beta t}M_X(\alpha t)

Linear Combination of Independent Random Variables

M_{\sum_{i=1}^n a_iX_i}(t)=\prod_{i=1}^n M_{X_i}(a_it)
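
As an example of this property (using the standard fact, not derived in these notes, that a Poisson(\mu) variable has MGF e^{\mu(e^t-1)}): for independent X \sim Poisson(\mu_1) and Y \sim Poisson(\mu_2),

M_{X+Y}(t)=M_X(t)M_Y(t)=e^{\mu_1(e^t-1)}e^{\mu_2(e^t-1)}=e^{(\mu_1+\mu_2)(e^t-1)}

which is the MGF of Poisson(\mu_1+\mu_2), matching the “Sum of Independent Poisson Variables” result later in these notes.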

Additional Topic - MLE and MAP

💡
Maximum Likelihood Estimation and Maximum A Posteriori estimation are two topics that I personally got confused by while studying CS189. So here I will give links to some great videos and a few mathematical formulas for MAP and MLE.
What are Maximum Likelihood (ML) and Maximum a posteriori (MAP)? ("Best explanation on YouTube")
https://youtu.be/9Ahdh_8xAEI
Maximum Likelihood, clearly explained!!!
https://youtu.be/XepXtl9YKwc
(ML 6.1) Maximum a posteriori (MAP) estimation
https://youtu.be/kkhdIriddSI

Say we have observations \vec{x}, and we want to estimate a parameter \theta, where \theta is a random variable.

MLE does the following:

\theta_{MLE} = \argmax_{\theta} \mathbb{P}(x|\theta)

MAP does the following:

\theta_{MAP}=\argmax_{\theta} \mathbb{P}(\theta|x)
💡
We see that MLE maximizes the likelihood of the observations given the parameter, while MAP maximizes the posterior probability of the parameter given the observations.

MLE - iterate through the possible distribution parameters and find the parameter under which producing \vec{x} is most likely.

MAP - iterate through the possible distribution parameters and find the parameter that is most likely to be right given the observed data.

By Bayes Theorem,

\mathbb{P}(\theta|x)=\frac{\mathbb{P}(x|\theta)\mathbb{P}(\theta)}{\mathbb{P}(x)}

Note that

\mathbb{P}(x) = \mathbb{E}(\mathbb{P}(x|\theta))=\int_{\theta} f(x|\theta)f(\theta)d\theta

Since \mathbb{P}(x) does not depend on any individual \theta, we can view \mathbb{P}(x) as a constant, and therefore,

\mathbb{P}(\theta|x) \propto \mathbb{P}(x|\theta)\mathbb{P}(\theta) = \mathbb{P}(x,\theta)

Therefore, MAP can also be written as

\theta_{MAP} = \argmax_{\theta} \mathbb{P}(\theta|x)=\argmax_{\theta} \mathbb{P}(x|\theta)\mathbb{P}(\theta)
💡
See the difference between the terms optimized by MAP and MLE? MAP has the additional factor \mathbb{P}(\theta). If the prior \mathbb{P}(\theta) is the same across all \theta (a uniform prior distribution), then MAP is identical to MLE.
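
A minimal sketch in Python (my own illustration, not from the course): estimating the head probability p of a coin from x heads in n flips. The MLE is x/n; with a Beta(a, b) prior the MAP estimate is the posterior mode (x + a - 1) / (n + a + b - 2), and a flat prior (a = b = 1) makes the two coincide.

```python
# MLE vs MAP for a coin's head probability p, with a Beta(a, b) prior on p.
def mle(x, n):
    # argmax_p P(x | p) for a Binomial(n, p) likelihood
    return x / n

def map_estimate(x, n, a, b):
    # posterior is Beta(x + a, n - x + b); its mode is the MAP estimate
    return (x + a - 1) / (n + a + b - 2)

x, n = 7, 10                      # observed 7 heads in 10 flips (hypothetical data)
print(mle(x, n))                  # 0.70
print(map_estimate(x, n, 1, 1))   # flat prior: identical to MLE, 0.70
print(map_estimate(x, n, 5, 5))   # prior pulls the estimate toward 0.5: ~0.611
```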

Properties of MLE and MAP

  1. Note that both MLE and MAP are point estimators: if the parameter is continuous (and usually this is the case), they pick the point where the likelihood or posterior density is maximized (the mode), not, say, the point that splits the area under the density in half (the median).
  1. MLE is more prone to overfitting than MAP, since the prior in MAP restricts the estimated parameter to a plausible region.
  1. The asymptotic behavior of MAP and MLE is the same: as we collect more data, the MAP and MLE estimates tend to converge.
  1. Special Property of MLE (not applicable to MAP): T = g(\theta) \implies T_{MLE} = g(\theta_{MLE})

Distribution

Bernoulli (Indicator) Distribution I_n \sim Bernoulli(p)

P(I_n=1) = p \\ P(I_n=0) = 1-p
\mathbb{E}(I_n) = p
Var(I_n) = p(1-p)

Uniform Distribution on a finite set

Suppose we have a finite set of equally likely outcomes, \Omega = \{\omega_1, \omega_2, \dots, \omega_n\}

\forall i, P(\omega_i) = \frac{1}{n}

Uniform(a,b) distribution

Distribution of a point picked uniformly at random from the interval (a,b)

For a < x < y < b, the probability that the point falls in the interval (x,y) is...

P(x < X < y) = \frac{y-x}{b-a}

For b-a = 1, the long-run frequency of falling in (x,y) is almost exactly equal to y-x.

\text{PDF: } f(x) = \begin{cases} \frac{1}{b-a} &\text{if } a<x<b \\ 0 &\text{otherwise} \end{cases}

Empirical Distribution

Opposed to theoretical distribution, empirical distribution is the distribution of your observed data.

Suppose we have data X = \{x_1, x_2, \dots, x_n\}

P_n(a,b) = \frac{|\{i:1 \leq i \leq n, a \lt x_i \lt b\}|}{n}

In other words, P_n(a,b) gives the proportion of the numbers in the list that lie in the interval (a,b)

Estimating Empirical Distribution With Continuous PDF

The distribution of a data list can be displayed in a histogram, and such a histogram smooths out the data to display the general shape of the empirical distribution.

Such histograms often follow a smooth curve f(x). And it is safe to assume \forall x, f(x) \ge 0

The idea is that if (a,b) is a bin interval, then the area under the bar between (a,b) should be roughly equal to the area under the curve between (a,b) ⇒ the proportion of data from a to b is roughly equal to the area under f(x).

P_n(a,b) \approx \int_{a}^{b}f(x)dx

f(x) functions like a continuous PDF estimate for the distribution of the data.

Now we can also use an indicator function to write this proportion as an average

I_{(a,b)}(x) = \begin{cases} 1 &\text{if } x \in (a,b) \\ 0 &\text{otherwise} \end{cases}

So I_{(a,b)}(x_i) is an indicator stating whether x_i is in the range (a,b)

P_n(a,b) = \frac{1}{n}\sum_{i=1}^{n} I_{(a,b)}(x_i) \approx \int_{a}^{b}f(x)dx=\int_{-\infin}^{\infin}I_{(a,b)}(x)f(x)dx

Integration Approximation of Averages

If the empirical distribution of the list (x_1,x_2,\dots,x_n) is well approximated by the theoretical distribution with PDF f(x), then the average value of a function g(x) over the n values can be approximated by

\frac{1}{n}\sum_{i=1}^ng(x_i)\approx\int_{-\infin}^{\infin}g(x)f(x)dx

Binomial Distribution X \sim B(n,p)

“probability of k successes in n trials with success rate of p”

P(X=k) = {n \choose k} p^k (1-p)^{n-k}
R(k) = [\frac{n-k+1}{k}]\frac{p}{1-p}
\mathbb{E}(X) = np \\ Var(X)= np(1-p)

Note: “Binomial Expansion”

(p+q)^n = \sum_{k=0}^n {n \choose k} p^k q^{n-k}

Square Root Law

For large n, in n independent trials with probability p of success on each trial: the number of successes is around np, give or take an error on the order of \sqrt{npq} (where q = 1-p).

Normal Distribution as an approximation for Binomial Distribution

👉
Used when p \approx \frac{1}{2}

Provided n is large enough in X \sim B(n,p), X can be approximated with a normal distribution.

If approximating the probability of a to b successes in a binomial distribution, use b + \frac12 and a - \frac12 as the boundaries instead of b and a; this is called the continuity correction. It is very important for small values of \sqrt{npq}.
P(a \leq X \leq b) \approx \Phi(\frac{b+\frac12-\mu}{\sigma}) - \Phi(\frac{a-\frac12-\mu}{\sigma})

Note: use \mu = \mathbb{E}(X) and \sigma^2 = Var(X) of the binomial distribution for the approximating normal curve f(x)

👉
How good is the normal approximation? It’s best when \sigma = \sqrt{npq} is big and p is close to \frac12
W(n,p) = \max_{0 \leq a \leq b \leq n}|P(\text{a to b}) - N(\text{a to b})| \approx \frac1{10}\frac{|1-2p|}{\sqrt{npq}}

W(n,p) denotes the WORST ERROR in the normal approximation to the binomial distribution.
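
A numeric check of the normal approximation with continuity correction (a sketch with hypothetical parameters n = 100, p = 0.5; \Phi is evaluated through math.erf):

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p = 100, 0.5
a, b = 45, 55
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

# Exact binomial probability P(a <= X <= b) versus the continuity-corrected normal approximation.
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))
approx = Phi((b + 0.5 - mu) / sigma) - Phi((a - 0.5 - mu) / sigma)
print(exact, approx)  # both roughly 0.728
```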

Poisson Distribution as an approximation for binomial distribution

When n is large and p is close to 0, the normal distribution cannot properly approximate the binomial distribution, so we use the Poisson instead! (If p is close to 1, swap the roles of success and failure, i.e. use p' = q = 1-p, and mirror the approximation.)

X \sim B(n,p) \approx Poisson(\mu = np)

Probability of the Most Likely Number of Successes

Taken from Wikipedia
m = \lfloor np+p \rfloor = \lceil np + p -1 \rceil
P(m) \sim \frac1{\sqrt{2\pi}\sigma}

Normal Distribution X \sim N(\mu , \sigma^2)

\text{PDF}: f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}, \quad -\infin < x < \infin

Standard Normal Distribution (\mu = 0, \sigma = 1)

\text{PDF}: \phi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac12z^2}
\text{CDF}: \Phi(z) = \int_{-\infin}^z\phi(y)dy

where z = \frac{x-\mu}{\sigma} = x (since \mu = 0, \sigma = 1)

Skew Normal Approximation

“Third derivative of \phi(z)”

\phi'''(z) = (3z-z^3)\phi(z)

“Skew-normal PDF”

\text{skew-normal PDF} = \phi(z) - \frac16Skewness(n,p)\cdot\phi'''(z)
Skewness(n,p)=\frac{1-2p}{\sqrt{npq}} = \frac{1-2p}{\sigma}

“Skew-normal CDF”

\text{skew-normal CDF} = \Phi(z)-\frac{1}{6}Skewness(X) \cdot (z^2-1) \cdot \phi(z)

“0 to b success rate” for binomial distribution

P(\text{0 to b successes}) \approx \Phi(z) - \frac16Skewness(n,p)\cdot(z^2-1)\cdot \phi(z)

where z = \frac{b+\frac12-\mu}{\sigma}, \quad \mu=np, \quad \sigma=\sqrt{npq}

Sum of Independent Normal Variables

If X \sim N(\lambda,\sigma^2) and Y \sim N(\mu,r^2), then X+Y \sim N(\lambda + \mu,\sigma^2+r^2)

Joint Distribution for Independent Standard Normal Distributions

Let X and Y be standard normal, that is X, Y \sim N(\mu=0,\sigma^2=1)

\phi(z)=ce^{-\frac{1}{2}z^2} \quad (c=\frac{1}{\sqrt{2\pi}})
f(x,y)=\phi(x)\phi(y)=c^2e^{-\frac{1}{2}(x^2+y^2)}
📎
Notice the rotational symmetry of this joint distribution.

Rayleigh Distribution R \sim Rayleigh

It’s the distribution of the radius R=\sqrt{X^2+Y^2} under the above joint distribution of two independent standard normals.

\text{PDF: } f_R(r)=re^{-\frac12r^2} \quad (r>0)
\text{CDF: } F_R(r)=\int_0^r se^{-\frac12s^2}ds=1-e^{-\frac12r^2}
\mathbb{E}(R)=\sigma\sqrt{\frac{\pi}{2}}
Var(R)=\frac{4-\pi}{2}\sigma^2

Derivation of Rayleigh Distribution

R=\sqrt{X^2+Y^2}
\phi(x,y)=c^2e^{-\frac12(x^2+y^2)}=c^2e^{-\frac12r^2}
P(R \in dr)=2 \pi r \cdot dr \cdot c^2e^{-\frac12r^2}

And therefore

f_R(r)=2 \pi r c^2e^{-\frac12r^2}

Notice that f_R(r) must integrate to 1 over (0,\infin); if we calculate this, we see that \int_{0}^{\infin}f_R(r)dr = 2\pi c^2, and therefore c=1/\sqrt{2\pi}.

Chi-Square Distribution

The joint density of n independent standard normal variables at every point on the sphere of radius r in n-dimensional space is:

(\frac{1}{\sqrt{2\pi}})^n \exp(-\frac12r^2)

For independent standard normal Z_i, let the following denote the distance in n-dimensional space

R_n=\sqrt{Z_1^2+\cdots+Z_n^2}

So the n-dimensional volume of a thin spherical shell of thickness dr at radius r is

c_nr^{n-1}dr

where c_n is the (n-1)-dimensional volume of the “surface” of a sphere of radius 1 in n dimensions

c_2 = 2\pi, c_3 = 4\pi, \dots
P(R_n \in dr)=c_nr^{n-1}(1/\sqrt{2\pi})^ne^{-\frac12r^2}dr \quad(r>0)

Through a change of variable and evaluating c_n we see that R_n^2 \sim Gamma(r=n/2, \lambda = 1/2).

We say that R_n^2 follows the chi-square distribution with n degrees of freedom.

\mathbb{E}(R_n^2)=n \\ Var(R_n^2)=2n
Skewness(R_n^2)=4/\sqrt{2n}

Standard Bivariate Normal Distribution

X and Y have the standard bivariate normal distribution with correlation \rho iff:

Y=\rho X + \sqrt{1-\rho^2}Z

where X and Z are independent standard normal variables

Joint Density:

f(x,y)=\frac{1}{2\pi \sqrt{1-\rho^2}}\exp\{-\frac{1}{2(1-\rho^2)}(x^2-2\rho xy+y^2)\}

Properties:

Marginals ⇒ Both X and Y have the standard normal distribution
Conditionals given X ⇒ Given X=x, Y \sim N(\rho x, 1- \rho^2)
Conditionals given Y ⇒ Given Y=y, X \sim N(\rho y, 1-\rho^2)
Independence ⇒ X and Y are independent iff \rho = 0

Bivariate Normal Distribution as a description for Linear Combinations of Independent Normal Variables

Let

V=\sum_i a_iZ_i, \quad W=\sum_i b_i Z_i

where Z_i \sim N(\mu_i,\sigma_i^2) are independent normal variables.

Then the joint distribution of V, W is bivariate normal.

Where

\mu_V=\sum_i a_i\mu_i, \mu_W=\sum_i b_i \mu_i \\ \sigma_V^2=\sum_i a_i^2\sigma_i^2, \sigma_W^2=\sum_i b_i^2\sigma_i^2 \\ Cov(V,W)=\sum_i a_i b_i \sigma_i^2 \\ \rho=Cov(V,W)/\sigma_V\sigma_W

Independence

Two linear combinations V=\sum_i a_i Z_i and W = \sum_i b_iZ_i of independent normal (\mu_i, \sigma_i^2) variables Z_i are independent iff they are uncorrelated, that is, if and only if \sum_i a_i b_i \sigma_i^2 = 0.

Bivariate Normal Distribution

Random variables U and V have the bivariate normal distribution with parameters \mu_U, \mu_V, \sigma_U^2, \sigma_V^2, \rho iff the standardized variables

U^*=\frac{U-\mu_U}{\sigma_U}, V^*=\frac{V-\mu_V}{\sigma_V}

have the standard bivariate normal distribution with correlation \rho. Then,

\rho=Corr(U^*,V^*)=Corr(U,V)

and U, V are independent iff \rho = 0

Derivation

📎
Goal: to construct a pair of correlated standard normal variables.

We start with a pair of independent standard normal variables, X and Z.

Let Y be the projection of (X,Z) onto an axis at an angle \theta to the X-axis.

We see on the diagram that Y=X\cos(\theta)+Z\sin(\theta)

By rotational symmetry of the joint distribution of X, Z, the distribution of Y is standard normal.

E(X)=E(Y)=E(Z)=0
SD(X)=SD(Y)=SD(Z)=1
\begin{align} \rho = Corr(X,Y)=E(XY) &=E[X(X\cos(\theta)+Z\sin(\theta))] \\ &=E(X^2)\cos(\theta)+E(XZ)\sin(\theta) \\ &=\cos(\theta) \end{align}

since E(X^2)=Var(X)=1 and E(XZ)=E(X)E(Z)=0.

Some special cases:

\theta=0 ⇒ \rho=1, Y=X
\theta=\frac{\pi}{2} ⇒ \rho=0, Y=Z
\theta=\pi ⇒ \rho=-1, Y=-X

Since we have \rho = \cos(\theta), \theta = \arccos(\rho).

Therefore, \sin(\theta)=\sqrt{1-\rho^2}

And

Y=\rho X + \sqrt{1-\rho^2}Z

Poisson Distribution X \sim Poisson(\lambda)

P(X=k) = e^{-\lambda}\frac{\lambda^k}{k!}
\mathbb{E}(X) = \lambda \\ Var(X) = \lambda

When \lambda is small the distribution is piled up near 0, and as \lambda gets bigger and bigger, the Poisson distribution becomes closer to the normal distribution (there’s a proof that as n \rightarrow \infin, p =\frac{\lambda}{n} \rightarrow 0, the approximation becomes better and better)

Sum of Independent Poisson Variables

If N_1, ..., N_j are independent Poisson random variables with parameters \mu_1, ..., \mu_j, then S = N_1 + N_2 + \cdots + N_j satisfies S \sim Poisson(\sum_{i=1}^j\mu_i)

Skew-normal approximation for Poisson Distribution

If N_\mu \sim Poisson(\mu), then for b = 0, 1, ...

P(N_\mu \le b) \approx \Phi(z)-\frac{1}{6\sqrt{\mu}}(z^2-1)\phi(z)

where z = \frac{b+\frac12-\mu}{\sqrt{\mu}}, \phi(z) is the standard normal curve, and \Phi(z) is the standard normal CDF.

Multinomial Distribution

“a generalization of the binomial distribution” ⇒ Binomial Distribution With Multiple Types of Outcomes (instead of 2)

Let N_i denote the number of results in category i in a sequence of independent trials with probability p_i for a result in the i^{th} category on each trial, 1 \le i \le m, where \sum_{i=1}^{m}p_i = 1. Then for every m-tuple of non-negative integers (n_1,n_2,\dots,n_m) with sum n:

P(N_1=n_1,N_2=n_2,\dots,N_m=n_m) = \frac{n!}{n_1!n_2!\cdots n_m!}p_1^{n_1}p_2^{n_2}p_3^{n_3} \cdots p_m^{n_m}
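
A small worked example (hypothetical numbers): the probability that 12 rolls of a fair die show exactly two of each face.

```python
import math

def multinomial_pmf(counts, probs):
    """P(N_1 = counts[0], ..., N_m = counts[m-1]) for given category probabilities."""
    n = sum(counts)
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)          # multinomial coefficient n! / (n_1! ... n_m!)
    prob = 1.0
    for c, p in zip(counts, probs):
        prob *= p**c                        # p_1^{n_1} ... p_m^{n_m}
    return coef * prob

counts = [2] * 6              # two of each face in n = 12 rolls
probs = [1 / 6] * 6           # fair die
print(multinomial_pmf(counts, probs))  # ≈ 0.0034
```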

Sum of Independent R.V.s

Let S_n be the sum and \tilde X_n = \frac{S_n}{n} the average of n independent random variables X_1, X_2, \dots, X_n, each with the same distribution as X

Square Root Law

\mathbb{E}(S_n) = n\mathbb{E}(X), \sigma(S_n)=\sqrt{n}\sigma_x \\ \mathbb{E}(\tilde X_n) = \mathbb{E}(X), \sigma(\tilde X_n)=\frac{\sigma_x}{\sqrt{n}}

Skewness

Skewness(S_n)=\frac{Skewness(X)}{\sqrt{n}}

With Chebychev’s Inequality

P(|\tilde X_n-\mu_x| \ge \epsilon) \le \frac{\sigma_x^2}{n\epsilon^2}

Central Limit Theorem(Normal Approx.)

For large n, the distribution of SnS_n is approximately normal, which means:

E(S_n)=n\mu_x \\ \sigma^2(S_n) = n\sigma_x^2 \\ S_n \sim N(n\mu_x,n\sigma_x^2)

Note: To approximate the probability of S_n taking a specific range of values, we need to use a continuity correction (add and subtract \frac12 \times n on the upper and lower bounds)

👉
Why \frac12 \times n instead of \frac12? Because nX can only take values that are multiples of n

For all a \le b,

P(a \le \frac{S_n-n\mu_x}{\sigma\sqrt{n}} \le b) \approx \Phi(b) - \Phi(a)
👉
S_n^*, the “S_n in standard units”, equals \frac{S_n-\mathbb{E}(S_n)}{SD(S_n)} = \frac{S_n-n\mu_x}{\sigma \sqrt{n}}

Skewed-Normal Approximation

P(S_n^* \le z) \approx \Phi(z)-\frac{1}{6\sqrt{n}}Skewness(X)(z^2-1)\phi(z)

Hypergeometric Distribution

This is also the section of “sampling without replacement”

P(\text{g good and b bad}) = {n \choose g} \frac{(G)_g(B)_b}{(N)_n} = \frac{{G \choose g}{B \choose b}}{{N \choose n}}
P(S_n=g)=\frac{{G \choose g}{B \choose b}}{{N \choose n}}

where b = n-g

\mathbb{E}(S_n)=np \\ Var(S_n) = \frac{N-n}{N-1} \cdot npq

where p = \frac{G}{N} and q = \frac{B}{N}.

\sqrt{\frac{N-n}{N-1}} is the “finite population correction factor”

Exponential Distribution T \sim Exponential(\lambda)

A random time T has the exponential distribution with rate \lambda ⇒ \lambda is the probability of death per unit time

\text{PDF: } f(t)=\lambda e^{-\lambda t} \quad (t \ge 0)
\text{CDF: } F(t)=P(T \le t)=\int_{0}^{t} \lambda e^{-\lambda s}ds = -e^{-\lambda s}\big|_{0}^{t}=1-e^{-\lambda t} \quad (t \ge 0)
\mathbb{E}(T) = \frac{1}{\lambda} \\ Var(T) = \frac{1}{\lambda^2}

Memoryless Property

A positive random variable T has exponential(λ) distribution for some λ > 0 if and only if T has the memoryless property

P(T>t+s|T>t)=P(T>s) \qquad (s \ge 0, t \ge 0)

“Given survival to time t, the chance of surviving a further time s is the same as the chance of surviving to time s in the first place.”
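
The forward direction is a one-line computation using the CDF above:

P(T>t+s|T>t)=\frac{P(T>t+s)}{P(T>t)}=\frac{e^{-\lambda(t+s)}}{e^{-\lambda t}}=e^{-\lambda s}=P(T>s)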

Relation to Geometric Distribution

The exponential distribution on (0, \infin) is the continuous analog of the geometric distribution on \{1,2,3,\dots\}

Relation to Poisson Arrival Process

A sequence of independent Bernoulli(Indicator) trials, with probability p of success on each trial, can be characterized in two ways:

  1. Counts of successes - the number of successes in n trials is Binomial(n,p)
  1. Times between successes - the waiting time until the first success is Geometric(p), and the waiting times between each success and the next are independent with the same geometric distribution.

These characterizations of Bernoulli trials lead to two analogous descriptions of a Poisson Arrival Process with rate λ.

👉
Arrivals are at times marked with X on the time line, think of arrivals representing something like incoming calls or customers entering a store

Gamma Distribution T_r \sim Gamma(r,\lambda)

If T_r is the time of the r-th arrival after time 0 in a Poisson process with rate \lambda, or if T_r = W_1 + W_2 + \cdots + W_r where the W_i are independent with W_i \sim Exponential(\lambda) distribution, then T_r \sim Gamma(r,\lambda)

\text{PDF} (t \ge 0) \text{: } f(t)=P(T_r \in dt)/dt=P(N_t=r-1)\cdot\lambda=e^{-\lambda t}\frac{(\lambda t)^{r-1}}{(r-1)!}\cdot\lambda

Note: N_t is the number of arrivals by time t in the Poisson process with rate \lambda (N_t \sim Poisson(\mu=\lambda t))

“The probability per unit time that the r-th arrival comes around time t is the probability of exactly r-1 arrivals by time t multiplied by the arrival rate”

P(T_r>t)=P(N_t \le r-1) = \sum_{k=0}^{r-1}e^{-\lambda t}\frac{(\lambda t)^k}{k!}

T_r > t iff there are at most r-1 arrivals in the interval (0, t]

\text{CDF: } P(T_r \le t) = 1-P(T_r>t)

Expectation and Variance

\mathbb{E}(T_r)=\frac{r}{\lambda} \\ Var(T_r) = \sigma_{T_r}^2=\frac{r}{\lambda^2}

General Gamma Distribution with r \in \mathbb{R}

In the previous PDF definition, we’ve only defined r \in \mathbb{Z^{+}}

PDF for real r > 0:

\text{PDF}(t \ge 0) \text{: } f_{r,\lambda}(t)=\begin{cases} [\Gamma(r)]^{-1}\lambda^r t^{r-1}e^{-\lambda t} &t\ge0 \\ 0 &t < 0 \end{cases}

Where

\Gamma(r) = \int_{0}^{\infin}t^{r-1}e^{-t}dt
\forall r \in \mathbb{Z^+}, \Gamma(r) = (r-1)!

If we apply integration by parts,

\Gamma(r+1)=r\Gamma(r)

Geometric Distribution X \sim Geo(p)

“number X of Bernoulli trials needed to get one success”

P(X=k)=(1-p)^{k-1}p \quad (k \ge 1)
\mathbb{E}(X) = \frac{1}{p} \\ Var(X) = \sigma_x^2 = \frac{1-p}{p^2}
Skewness(X)=\frac{2-p}{\sqrt{1-p}}

Beta Distribution X \sim Beta(r,s)

For r,s > 0, the beta distribution on (0,1) is defined by the density:

\text{PDF: } f(x)=\frac{1}{B(r,s)}x^{r-1}(1-x)^{s-1} \qquad (0<x<1)
\mathbb{E}(X)=\frac{r}{r+s}
\mathbb{E}(X^2)=\frac{(r+1)r}{(r+s+1)(r+s)}
Var(X)=\frac{rs}{(r+s)^2(r+s+1)}

where

B(r,s)=\int_{0}^{1} x^{r-1}(1-x)^{s-1}dx

So we see that B(r,s)B(r,s) serves the purpose of normalizing the PDF to integrate to 1.

For all positive r,s

B(r,s)=\int_0^1x^{r-1}(1-x)^{s-1}dx=\frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)} \\ \text{where } \Gamma(r) = (r-1)! \text{ for positive integer } r

We see that

\begin{align} \mathbb{E}(X^k)&=\int_0^1 x^kf(x)dx \\ &=\int_0^1x^k\frac1{B(r,s)}x^{r-1}(1-x)^{s-1}dx \\ &=\frac1{B(r,s)}\int_0^1 x^{r+k-1}(1-x)^{s-1}dx \\ &=\frac{B(r+k,s)}{B(r,s)} \end{align}

Beta Distribution as a distribution to calculate Order Statistics

The k-th order statistic of n independent uniform(0,1) random variables has the Beta(r=k, s = n-k+1) distribution.

Since f_{(k)} is a pdf it must integrate to 1 over [0,1], and therefore

\int_0^1x^{k-1}(1-x)^{n-k}dx=\frac{1}{n{n-1 \choose k-1}}=\frac{(k-1)!(n-k)!}{n!}