CS 170 Class Notes

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley CS170.

Reference Notice: Material highly and mostly derived from the DPV textbook

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Big-O Notation (DPV 0)

Sloppiness in the analysis of running times can lead to an unacceptable level of inaccuracy in the result
Also possible to be too precise - the time taken by one such step depends crucially on the particular processor and details such as caching strategy.

🔥
Instead of reporting that an algorithm takes, say, 5n^3+4n+3 steps on an input of size n, it is much simpler to leave out lower-order terms and just say that the algorithm takes time O(n^3) - big O of n^3.

Let f(n) and g(n) be functions from positive integers to positive reals. We say f = O(g) - f grows no faster than g - if there is a constant c>0 such that f(n) \le c \cdot g(n) for all n.

Just as O(\cdot) is an analog of \le, we can also define the analogs of \ge and = as follows:

f=\Omega(g) \text{ means } g = O(f) \\ f = \Theta(g) \text{ means } f = O(g) \text{ and } f = \Omega(g)

Reduction (DPV 7.1)

If any subroutine for task QQ can also be used to solve PP , we say PP reduces to QQ. Often, PP is solvable by a single call to QQ’s subroutine, which means any instance xx of PP can be transformed into an instance yy of QQ such that P(x)P (x) can be deduced from Q(y)Q(y)

Reductions enhance the power of algorithms: Once we have an algorithm for problem Q (which could be shortest path, for example) we can use it to solve other problems.

Graphs

How is a graph represented?

Adjacency matrix

We represent a graph by an adjacency matrix, let there be n=Vn = |V| vertices v1,,vnv_1, \dots, v_n

a_{i,j} = \begin{cases} 1 \quad \text{if there is an edge from $v_i$ to $v_j$} \\ 0 \quad \text{otherwise} \end{cases}

For an undirected graph the matrix is symmetric since an edge {u,v}\{u, v\} can be taken in either direction

The presence of a particular edge can be checked in constant time, but the matrix takes O(n^2) space.

Adjacency List

With size proportional to the number of edges, consists of V|V| linked lists, one per vertex. The linked list for vertex uu holds the names of vertices to which uu has an outgoing edge.

Total size of the data structure is O(|E|), but checking for an edge is no longer constant time (it requires sifting through u's adjacency list)
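Not from the notes, just a minimal Python sketch of the two representations on a small made-up directed graph, to make the space/lookup trade-off concrete:

```python
# Minimal sketch: two representations of the same directed graph on n vertices.
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # hypothetical example edges

# Adjacency matrix: O(n^2) space, O(1) edge lookup.
adj_matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    adj_matrix[u][v] = 1

# Adjacency list: O(|E|) space, edge lookup scans u's list.
adj_list = [[] for _ in range(n)]
for u, v in edges:
    adj_list[u].append(v)

print(adj_matrix[0][2])   # 1, constant-time check
print(2 in adj_list[0])   # True, but requires scanning the list
```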

Algorithms

Fibonacci(DPV 0)

F_n = \begin{cases} F_{n-1}+F_{n-2} &\text{if } n>1, \\ 1 &\text{if } n = 1, \\ 0 &\text{if } n = 0 \end{cases}
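A minimal iterative sketch of this recurrence (the naive recursive version takes exponential time; this is just one standard linear-number-of-additions alternative, not pseudocode from the notes):

```python
def fib(n: int) -> int:
    """Return F_n using the recurrence F_n = F_{n-1} + F_{n-2}."""
    a, b = 0, 1  # F_0, F_1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```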

Basic Arithmetic (DPV 1.1)

🔥
How much does the size of a number change when we change bases? Recall the rule for converting logarithms from base a to base b: \log_b N = (\log_a N)/(\log_a b). So the size of integer N in base a is the same as its size in base b, times a constant factor \log_a b. In big-O notation, therefore, the base is irrelevant, and we write the size simply as O(\log N).

Incidentally, this function \log N appears repeatedly in our subject, in many guises. Here's a sampling:

  1. \log N is the power to which you need to raise 2 in order to obtain N.
  1. Going backward, \lceil \log N \rceil can also be seen as the number of times you must halve N to get down to 1. This is useful when a number is halved at each iteration of an algorithm.
  1. \lceil \log (N+1) \rceil is the number of bits in the binary representation of N.
  1. \lfloor \log N \rfloor is also the depth of a complete binary tree with N nodes.
  1. It is even the sum 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{N}, to within a constant factor.

Addition

Basic property of numbers: The sum of any three single-digit numbers is at most two digits long (holds for any base b ≥ 2).

This simple rule gives us a way to add two numbers in any base: align their right-hand ends, and then perform a single right-to-left pass in which the sum is computed digit by digit, maintaining the overflow as a carry. Since we know each individual sum is a two-digit number, the carry is always a single digit, and so at any given step, three single-digit numbers are added.

We will not spell out the pseudocode because we are so familiar with it. We will go ahead and analyze its efficiency: we want the answer expressed as a function of the size of the input: the number of bits of x and y, the number of keystrokes needed to type them in.

Suppose x and y are each n bits long; in this chapter we will consistently use the letter n for the sizes of numbers. Then the sum of x and y is n+1 bits at most, and each individual bit of this sum gets computed in a fixed amount of time. The total running time for the addition algorithm is therefore of the form c_0 + c_1 n, where c_0 and c_1 are some constants; in other words, it is linear. Instead of worrying about the precise values of c_0 and c_1, we will focus on the big picture and denote the running time as O(n).
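Although the notes skip the pseudocode, here is a small illustrative sketch of the bit-by-bit addition being analyzed (binary strings, most significant bit first; the function name and representation are made up for illustration):

```python
def add_binary(x: str, y: str) -> str:
    """Grade-school addition of two binary strings, one bit at a time (O(n))."""
    n = max(len(x), len(y))
    x, y = x.zfill(n), y.zfill(n)
    carry = 0
    out = []
    for i in range(n - 1, -1, -1):          # single right-to-left pass
        s = int(x[i]) + int(y[i]) + carry   # sum of three single digits
        out.append(str(s % 2))
        carry = s // 2                       # carry is always a single digit
    if carry:
        out.append('1')
    return ''.join(reversed(out))

print(add_binary('1101', '1011'))  # 11000 (13 + 11 = 24)
```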

Some readers may be confused at this point: Why O(n)O(n) operations? Isn’t binary addition something that computers today perform by just one instruction? There are two answers.

First, it is certainly true that in a single instruction we can add integers whose size in bits is within the word length of today’s computers—32 perhaps. But, as will become apparent later in this chapter, it is often useful and necessary to handle numbers much larger than this, perhaps several thousand bits long. Adding and multiplying such large numbers on real computers is very much like performing the operations bit by bit. Second, when we want to understand algorithms, it makes sense to study even the basic algorithms that are encoded in the hardware of today’s computers. In doing so, we shall focus on the bit complexity of the algorithm, the number of elementary operations on individual bits—because this accounting reflects the amount of hardware, transistors and wires, necessary for implementing the algorithm.

Multiplication

The grade-school algorithm for multiplying two numbers x and y is to create an array of intermediate sums, each representing the product of x by a single digit of y. These values are appropriately left-shifted and then added up. Suppose for instance that we want to multiply 13 \times 11, or in binary notation, x = 1101 and y = 1011.

If xx and yy are both nn bits, then there are nn intermediate rows, with lengths of up to 2n2n bits (taking the shifting into account). The total time taken to add up these rows, doing two numbers at a time, is:

\underbrace{O(n)+O(n)+\cdots +O(n)}_\text{$n-1$ times} = O(n^2)

But there are better ways to do this…

Division

Modular Arithmetic (DPV 1.2, unfinished)

Just like multiplication, division works on computers with limited bits.

Modular arithmetic is a system for dealing with restricted ranges of integers. We define x \text{ modulo } N to be the remainder when x is divided by N; that is, if x = qN + r with 0 \le r < N, then x \text{ modulo } N is equal to r. This gives an enhanced notion of equivalence between numbers: x and y are congruent modulo N if they differ by a multiple of N, or in symbols

x \equiv y \mod N \longleftrightarrow \text{$N$ divides $(x-y)$}

Another interpretation is that modular arithmetic deals with all the integers, but divides them into N equivalence classes, each of the form \{i + kN : k \in \mathbb{Z}\} for some i between 0 and N-1

🔥
Two’s complement uses a lot of modular arithmetic.

It uses n bits to represent numbers in the range [-2^{n-1}, 2^{n-1} - 1] and is usually described as follows:

  1. Positive integers, in the range [0, 2^{n-1} - 1], are stored in regular binary and have a leading bit of 0.
  1. Negative integers -x, with 1 \le x \le 2^{n-1}, are stored by first constructing x in binary, then flipping all the bits, and finally adding 1. The leading bit in this case is 1.

Here's a much simpler way to think about it: any number in the range -2^{n-1} to 2^{n-1} - 1 is stored modulo 2^n. Negative numbers -x therefore end up as 2^n - x. Arithmetic operations like addition and subtraction can be performed directly in this format, ignoring any overflow bits that arise.

Hashing (DPV 1.5)

Hash Tables

We’re talking about IPV4 here, IP: 0.0.0.0 ~ 255.255.255.255

We will have an array of size n=250n = 250 and each key kk is hashed by hash function h(k):kxh(k): k \rightarrow x where x{0,1,,n1}x \in \{0, 1, \dots, n-1\}. Each array element is a linked list

📌
To ensure that hash table operations have good performance: 1. have the size of the hash table be about twice as large as the number of items 2. assume that the size of the domain of all data items is N = n^k, a power of n; then each data item can be considered as a k-tuple of integers modulo n, and H = \{h_a : a \in \{0, \dots, n-1\}^k\} is a universal family of hash functions

Hash Functions

Let the number of buckets be n = 257 instead of n = 250; note that 257 is a prime number

Write any IP address x as a quadruple x = (x_1, \dots, x_4); we will fix any four numbers a_1, a_2, a_3, a_4 modulo n and map the address x to h(x) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \mod 257.

We have defined the hash function

h_a(x_1, \dots, x_4) = \sum_{i=1}^4 a_i \cdot x_i \mod n

With this hash function,

Property: Consider any pair of distinct IP addresses x = (x_1, \dots, x_4), y = (y_1, \dots, y_4). If the coefficients a = (a_1, \dots, a_4) are chosen uniformly at random from \{0, 1, \dots, n-1\}, then

\mathbb{P}\{ h_a(x) = h_a(y) \} = \frac{1}{n}
This condition guarantees that the expected lookup time for any item is small.

Proof:

Since x = (x_1, \dots, x_4) and y = (y_1, \dots, y_4) are distinct, these quadruples must differ in some component; without loss of generality let us assume that x_4 \ne y_4. We wish to compute the probability

Pr[h_a(x_1, \dots, x_4) = h_a(y_1, \dots, y_4)]

that is, the probability that

\sum_{i=1}^4 a_i \cdot x_i \equiv \sum_{i=1}^4 a_i \cdot y_i \mod n

This last equation can be rewritten as

\sum_{i=1}^3 a_i \cdot (x_i - y_i) \equiv a_4 \cdot (y_4 - x_4) \mod n

Suppose that we draw a random hash function ha by picking a=(a1,a2,a3,a4)a = (a_1, a_2, a_3, a_4) at random. We start by drawing a1a_1, a2a_2, and a3a_3, and then we pause and think:

What is the probability that the last drawn number a4a_4 is such that equation above holds?

We know that once a_1, a_2, a_3 have been drawn, the left-hand side is some fixed value c modulo n, and since n is prime and x_4 \ne y_4, the quantity (y_4 - x_4) has a unique inverse modulo n.

Thus for the previous equation to hold,

a_4 = c \cdot (y_4 - x_4)^{-1} \mod n

The probability of this happening is 1/n1/n
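A small sketch of this hash family for IPv4-style keys (n = 257 buckets as above; the random draw of a is the only source of randomness, and the helper names are made up):

```python
import random

n = 257  # prime number of buckets

def draw_hash():
    """Draw h_a by picking a = (a1, ..., a4) uniformly from {0, ..., n-1}^4."""
    a = [random.randrange(n) for _ in range(4)]
    def h(x):  # x is an IP address given as a 4-tuple (x1, ..., x4)
        return sum(ai * xi for ai, xi in zip(a, x)) % n
    return h

h = draw_hash()
# any two distinct addresses collide with probability exactly 1/n over the draw of a
print(h((128, 32, 168, 80)), h((128, 32, 168, 81)))
```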

Universal Hash Functions

Family of hash functions that satisfy the following property:

For any two distinct data items xx and yy, exactly H/n|H|/n of all the hash functions in HH map xx and yy to the same bucket, where nn is the number of buckets

In our example, we used the family

H = \{h_a : a \in \{0,\dots,n-1\}^4\}

which is a universal family of hash functions.

In other words, for any two data items, the probability these items collide is 1/n if the hash function is randomly drawn from a universal family.

📌
This property implies that hash table operations have good performance in expectation

Divide and Conquer (DPV 2)

Recurrence Relationships (DPV 2.2)

Divide and conquer often follow a generic pattern of runtime:

T(n)=aT(\lceil n/b \rceil)+O(n^d)
🔥
Master Theorem: If for some constants a>0, b>1, d\ge 0, T(n)=aT(\lceil n/b \rceil)+O(n^d), then: T(n)=\begin{cases} O(n^d) &\text{if } d>\log_b a \\ O(n^d \log n) &\text{if } d = \log_b a \\ O(n^{\log_b a}) &\text{if } d < \log_b a \end{cases}

To prove master theorem,

we will assume that n is a power of b (this will not influence the final bound because n is at most a multiplicative factor of b away from some power of b)

Notice that the size of the subproblems decreases by a factor of b with each level of recursion, and therefore reaches the base case after \log_b n levels. The branching factor is a, so the k-th level of the tree is made up of a^k subproblems, each of size n/b^k. The total work done at this level is

a^k\times O(\frac{n}{b^k})^d=O(n^d)\times(\frac{a}{b^d})^k

As kk goes from 0 to logbn\log_b n, these numbers form a geometric series with ratio a/bda/b^d. Therefore we can find the sum by three cases:

  1. a/b^d < 1
    1. The series is decreasing, and its sum is just given by the first term O(n^d)
  1. a/b^d = 1
    1. We will have O(\log n) copies of the first term
    1. O(n^d \log n)
  1. a/b^d > 1
    1. The series is increasing, and its sum is given by its last term
    1. n^d (\frac{a}{b^d})^{\log_b n} = n^d (\frac{a^{\log_b n}}{(b^{\log_b n})^d}) = a^{\log_b n} = a^{(\log_a n)(\log_b a)} = n^{\log_b a}
    1. O(n^{\log_b a})

Multiplication (DPV 2.1)

Gauss’s trick:

Product of two complex numbers:

(a+bi)(c+di)=ac-bd+(bc+ad)i

Can in fact be done with just three multiplications ac, bd, (a+b)(c+d) because:

bc+ad=(a+b)(c+d)-ac-bd

Using the same idea, if xx and yy are two n-bit integers, we can represent them as left and right halves of bits

x = \fbox{ $x_L$ } \fbox{ $x_R$ } = 2^{n/2}x_L+x_R \\ y = \fbox{ $y_L$ } \fbox{ $y_R$ } = 2^{n/2}y_L+y_R
\begin{split} x\times y &= (2^{n/2}x_L+x_R)(2^{n/2}y_L+y_R) \\ &=2^nx_L y_L + 2^{n/2}(x_Ly_R+x_Ry_L)+x_Ry_R \end{split}

We have recursion expression of

T(n)=4T(n/2)+O(n) \in O(n^2)

for this multiplication relationship

But if we use Gauss's trick, we only need x_L y_L, x_R y_R, (x_L+x_R)(y_L+y_R) to compute the multiplication, since x_Ly_R+x_Ry_L = (x_L+x_R)(y_L+y_R)-x_Ly_L-x_Ry_R

(Karatsuba Algorithm)

T(n)=3T(n/2+1)+O(n) \in O(n^{\log_2 3}) \approx O(n^{1.59})
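A recursive Python sketch of this idea on nonnegative integers (using Python's built-in arithmetic for the base case and for shifts; a real implementation would work on digit arrays, and the cutoff value is arbitrary):

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply two nonnegative integers with three recursive half-size products."""
    if x < 16 or y < 16:          # small base case: fall back to hardware multiply
        return x * y
    n = max(x.bit_length(), y.bit_length())
    half = n // 2
    xL, xR = x >> half, x & ((1 << half) - 1)
    yL, yR = y >> half, y & ((1 << half) - 1)
    p1 = karatsuba(xL, yL)
    p2 = karatsuba(xR, yR)
    p3 = karatsuba(xL + xR, yL + yR)            # (xL+xR)(yL+yR)
    # xL*yR + xR*yL = p3 - p1 - p2, so only three recursive multiplications are needed
    return (p1 << (2 * half)) + ((p3 - p1 - p2) << half) + p2

print(karatsuba(1234567, 7654321) == 1234567 * 7654321)  # True
```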

MergeSort

“Split the list into two halves, recursively sort each half, and merge the two sorted sublists”

The merge procedure does a constant amount of work per recursive call, thus \text{merge}(\cdot) \in O(n), and the overall time taken by mergesort is:

T(n)=2T(n/2)+O(n)O(nlogn)T(n) = 2T(n/2)+O(n) \in O(n \log n)

We will also define an iterative version of mergesort, where \text{inject}(array, element) means to append an element to the end of an array and \text{eject}(array) means to remove the element at the front of the array and return it.
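A short sketch of the recursive version described above (merge written with explicit index scanning; not the notes' pseudocode):

```python
def merge(a, b):
    """Merge two sorted lists in O(len(a) + len(b)) time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def mergesort(arr):
    """Split in half, sort each half recursively, then merge: O(n log n)."""
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    return merge(mergesort(arr[:mid]), mergesort(arr[mid:]))

print(mergesort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```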

🔥
There's an \Omega(n \log n) lower bound for sorting, which means that merge sort is optimal!

The reason why comparison-based sorting must take at least \Omega(n \log n) time is that every permutation of the elements must appear as a leaf of the tree representing the sorting algorithm, so the tree has n! leaves and therefore \text{depth} \ge \log(n!) \propto n\log n

Median

When looking for a recursive solution, it is paradoxically often easier to work with a more general version of the problem—for the simple reason that this gives a more powerful step to recurse upon. In our case, the generalization we will consider is selection

Divide and Conquer for selection:

For any number vv, imagine splitting list SS into three categories: elements smaller than vv, those equal to vv (there might be duplicates), and those greater than vv.

For instance, if the array S:

is split on v=5, then

We can check which sub-array the k-th smallest element falls in by using the length of each sub-array

Worst case:

  1. We select v to be the smallest/largest element of the array each time, causing the array to only shrink by 1.
n+(n-1)+(n-2)+\cdots+\frac{n}{2}=\Theta(n^2)

We will define a good selection of v to be one that strips away at least 25% of the original array.

Therefore

T(n)\le T(3n/4)+O(n) \in O(n)
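A sketch of the selection procedure just described, with v chosen uniformly at random (which gives expected linear time; the deterministic median-of-medians pivot would guarantee it). The example data is made up:

```python
import random

def select(S, k):
    """Return the k-th smallest element of S (k is 1-indexed), expected O(n)."""
    v = random.choice(S)
    smaller = [x for x in S if x < v]
    equal   = [x for x in S if x == v]
    greater = [x for x in S if x > v]
    if k <= len(smaller):                      # k-th smallest lies left of v
        return select(smaller, k)
    if k <= len(smaller) + len(equal):         # it is v itself
        return v
    return select(greater, k - len(smaller) - len(equal))

data = [7, 3, 9, 1, 5, 2, 8]
print(select(data, 4))  # 5, the median
```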

Matrix Multiplication (DPV 2.5)

Z_{ij}=\sum_{k=1}^n X_{ik}Y_{kj}
👆
Formula implies an O(n^3) algorithm, which was widely believed to be the best runtime possible, but later the German mathematician Volker Strassen announced a significantly more efficient algorithm based upon divide and conquer.

Matrix Multiplication can be viewed as blocks

But it is still O(n^3), by:

T(n)=8T(n/2)+O(n^2)

Turns out XYXY can be computed from just seven n/2×n/2n/2 \times n/2 subproblems

XY = \begin{bmatrix} P_5+P_4-P_2+P_6 &P_1+P_2 \\ P_3+P_4 &P_1+P_5-P_3-P_7 \\ \end{bmatrix}

where

\begin{split} P_1 = A(F-H) \quad &P_5 = (A+D)(E+H) \\ P_2 = (A+B)H \quad &P_6 = (B-D)(G+H) \\ P_3 = (C+D)E \quad & P_7 = (A-C)(E+F) \\ P_4 = D(G-E) \quad & \\ \end{split}

The new runtime is

T(n)=7T(n/2)+O(n^2)\in O(n^{\log_2 7}) \approx O(n^{2.81})

Fast Fourier Transform (DPV 2.6)

e^(iπ) in 3.14 minutes, using dynamics | DE5
https://youtu.be/v0YEaeIClKY
But what is the Fourier Transform? A visual introduction.
https://youtu.be/spUNpyF58BY
The Fast Fourier Transform (FFT): Most Ingenious Algorithm Ever?
https://youtu.be/h7apO7q16V0

Product of two degree-d polynomials is a polynomial of degree 2d:

A(x)=a_0+a_1x+\cdots+a_dx^d, \\ B(x)=b_0+b_1x+\cdots+b_dx^d

Then

\begin{split} C(x)&=A(x)\cdot B(x)\\ &=c_0+c_1x+\cdots+c_{2d}x^{2d} \end{split}

and C(x)C(x) will have coefficients

c_k = a_0b_k+a_1b_{k-1}+\cdots+a_kb_0=\sum_{i=0}^k a_ib_{k-i}

where a_i = 0 \wedge b_i = 0 for all i > d

Computing each c_k takes O(k), resulting in an overall \Theta(d^2) steps since we have 2d+1 coefficients

So therefore do we have a faster algorithm for solving this problem?

🔥
Note: Coefficient is present in decreasing degree order

Here A holds the coefficients of the polynomial (describing the function), and the powers of \omega are the points at which the polynomial is evaluated.

So now that we have evaluated the polynomials at those representation points and multiplied the values to obtain 2n+1 values, how do we interpolate back?

The definitive FFT Algorithm:

The FFT takes as input a vector a = (a_0, \dots, a_{n-1}) and a complex number \omega whose powers 1, \omega, \dots, \omega^{n-1} are complex n-th roots of unity.

\begin{bmatrix} A(\omega^{0}) \\ A(\omega^{1}) \\ \vdots \\ A(\omega^{n-1}) \end{bmatrix} = \begin{bmatrix} 1 &1 &1 &\cdots &1 \\ 1 &\omega^{1} &\omega^2 &\cdots &\omega^{n-1} \\ \vdots &\vdots &\vdots &\ddots &\vdots \\ 1 &\omega^{n-1} &\omega^{2(n-1)} &\cdots &\omega^{(n-1)(n-1)} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{bmatrix}

We can apply divide and conquer to the matrix multiplication steps

The final product is the vector
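A minimal recursive sketch of this divide and conquer (n assumed to be a power of two; it splits into even- and odd-indexed coefficients, which is the standard formulation, not the notes' circuit diagram):

```python
import cmath

def fft(a, omega):
    """Evaluate the polynomial with coefficients a at 1, omega, ..., omega^(n-1)."""
    n = len(a)                      # n assumed to be a power of 2
    if n == 1:
        return a
    even = fft(a[0::2], omega * omega)   # A_even at the (n/2)-th roots of unity
    odd  = fft(a[1::2], omega * omega)   # A_odd  at the (n/2)-th roots of unity
    out = [0] * n
    w = 1
    for j in range(n // 2):
        out[j]          = even[j] + w * odd[j]   # A(omega^j)
        out[j + n // 2] = even[j] - w * odd[j]   # A(omega^(j + n/2))
        w *= omega
    return out

a = [1, 2, 3, 4]                               # A(x) = 1 + 2x + 3x^2 + 4x^3
omega = cmath.exp(2j * cmath.pi / len(a))      # primitive 4th root of unity
print([round(v.real, 6) + round(v.imag, 6) * 1j for v in fft(a, omega)])
```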

Summary:

The divide-and-conquer step of the FFT can be drawn as a very simple circuit. Here is how a problem of size n is reduced to two subproblems of size n/2 (for clarity, one pair of outputs (j, j + n/2) is singled out):

the edges are wires carrying complex numbers from left to right. A weight of jj means “multiply the number on this wire by ωj\omega^j .

  1. For n inputs there are \log_2 n levels, each with n nodes, for a total of O(n \log n) operations
  1. the inputs are arranged by increasing last bit of the binary representation of their index, resolving ties by looking at the next more significant bit(s).
    1. For n=8, the resulting order in binary is 000, 100, 010, 110, 001, 101, 011, 111
  1. There is a unique path between each input a_j and each output A(\omega^k)
    1. This path is most easily described using the binary representations of j and k (shown in Figure 10.4 for convenience). There are two edges out of each node, one going up (the 0-edge) and one going down (the 1-edge).
    1. To get to A(\omega^k) from any input node, simply follow the edges specified in the bit representation of k, starting from the rightmost bit
  1. On the path between a_j and A(\omega^k), the labels add up to jk \mod 8.
    1. Contribution of input a_j to output A(\omega^k) is a_j\omega^{jk}
  1. Notice that the FFT circuit is a natural for parallel computation and direct implementation in hardware.

Graphs (DPV 3 + 4)

DFS (Depth-first Search) in undirected graphs (DPV 3.2)

Surprisingly versatile linear-time procedure that reveals a wealth of information about a graph

🔥
What parts of the graph are reachable from a given vertex?

Explore

explore(G,v) \in O(|E|)

Depth-First Search Algorithm (DFS)

First step:

Marking visited to false, O(V)O(|V|)

Second step:

Every edge is examined exactly twice (for an edge \{x,y\} \in E, we visit it from explore(x) and from explore(y)) ⇒ O(|E|)

Connected Components

We can also check what connected components exist in the graph by simply modifying a bit in the DFS algorithm

Where here ccnumccnum identifies the connect component to which vv belongs

cccc is initialized to zero and incremented each time DFSDFS calls exploreexplore

Previsit and postvisit ordering

We will also note down the times of two important events: the moment of first discovery and of final departure.
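A minimal sketch tying the pieces together: DFS over an adjacency-list dict, recording pre/post clock values and a connected-component number per vertex (graph and names are illustrative, not from the notes):

```python
def dfs(graph):
    """Depth-first search recording pre/post numbers and connected components."""
    visited, pre, post, ccnum = set(), {}, {}, {}
    clock = 0
    cc = 0

    def explore(u):
        nonlocal clock
        visited.add(u)
        ccnum[u] = cc
        pre[u] = clock; clock += 1        # previsit: moment of first discovery
        for v in graph[u]:
            if v not in visited:
                explore(v)
        post[u] = clock; clock += 1       # postvisit: moment of final departure

    for u in graph:
        if u not in visited:
            cc += 1                       # incremented each time DFS calls explore
            explore(u)
    return pre, post, ccnum

g = {'a': ['b', 'c'], 'b': ['a'], 'c': ['a'], 'd': []}   # undirected, arcs in both directions
print(dfs(g))
```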

Directed Acyclic graphs (DAGs) (DPV 3.3.2)

Good for modeling relationships like causalities, hierarchies, and temporal dependencies.

All DAGs can be linearized. ⇒ perform tasks in decreasing order of their post numbers

only edges (u,v) in a graph for which post(u) < post(v) are back edges ⇒ a DAG cannot have back edges ⇒ in a DAG, every edge leads to a vertex with a lower post number

So therefore

🔥
Acyclicity, linearizability, and the absence of back edges during a depth-first search - are in fact one and the same thing

Also, since every DAG has at least one source and one sink…We can do this linearization

  1. Find a source, output it, and delete it from graph
  1. Repeat until the graph is empty

Strongly Connected Components

Two nodes uu and vv of a directed graph are connected if there is a path from uu to vv and a path from vv to uu.
🔥
Every directed graph is a DAG of its strongly connected components

And… turns out the meta-graph can be found in linear time by making further use of DFS.

Property 1: If the exploreexplore subroutine is started at node uu, then it will terminate precisely when all nodes reachable from uu have been visited.

So if we call exploreexplore on a node that lies somewhere in a sinksink strongly connected component, then we will retrieve exactly that component.

Property 2: The node that receives the highest post number in a DFS must lie in a source strongly connected component.
Property 3: If C and C' are strongly connected components, and there is an edge from a node in C to a node in C', then the highest post number in C is bigger than the highest post number in C'.

Proof of property 3:

In proving Property 3, there are two cases to consider. If the depth-first search visits component C before component C', then clearly all of C and C' will be traversed before the procedure gets stuck (see Property 1). Therefore the first node visited in C will have a higher post number than any node of C'. On the other hand, if C' gets visited first, then the depth-first search will get stuck after seeing all of C' but before seeing any of C, in which case the property follows immediately.

We can use the reverse graph GRG^{R} and property 2 to find a node from the sinksink component of graph GG

Once we have found the first strongly connected component and deleted it from the graph, the node with the highest post number among those remaining will belong to a sink strongly connected component of whatever remains of GG. Therefore we can keep using the post numbering from our initial depth-first search on GRG^R to successively output the second strongly connected component, the third strongly connected component, and so on. The resulting algorithm is this.

  1. Run DFS on G^R
  1. Run the undirected connected components algorithm on G, and during the DFS, process the vertices in decreasing order of their post numbers from step 1.

This is linear time, only the constant is about twice of normal DFS search

Paths in Graphs: Distances (DPV 4.1)

The distance between two nodes is the length of the shortest path between them.

To get a concrete feel for this notion, consider a physical realization of a graph that has a ball for each vertex and a piece of string for each edge. If you lift the ball for vertex s high enough, the other balls that get pulled up along with it are precisely the vertices reachable from s. And to find their distances from s, you need only measure how far below s they hang.

Breadth-First Search (BFS) (DPV 4.2)

Initially the queue QQ consists only of the root, ss, the one node at distance 00. And for each subsequent distance d=1,2,3,d = 1, 2, 3, \dots, there is a point in time at which QQ contains all the nodes at distance dd and nothing else. As these nodes are processed (ejected off the front of the queue), their as-yet-unseen neighbors are injected into the end of the queue.
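A minimal sketch of this, with a deque playing the role of the queue (graph and names are made up):

```python
from collections import deque

def bfs(graph, s):
    """Distances (in number of edges) from s; unreachable nodes stay at infinity."""
    dist = {u: float('inf') for u in graph}
    dist[s] = 0
    Q = deque([s])                    # the queue holds the current frontier
    while Q:
        u = Q.popleft()               # eject from the front
        for v in graph[u]:
            if dist[v] == float('inf'):
                dist[v] = dist[u] + 1
                Q.append(v)           # inject at the end
    return dist

g = {'s': ['a', 'b'], 'a': ['c'], 'b': ['c'], 'c': [], 'd': []}
print(bfs(g, 's'))   # {'s': 0, 'a': 1, 'b': 1, 'c': 2, 'd': inf}
```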

Lengths on Edges (DPV 4.3)

We will annotate every edge e \in E with a length l_e.

Dijkstra’s Algorithm (DPV 4.4)

we can think of Dijkstra’s algorithm as just BFS, except it uses a priority queue instead of a regular queue, so as to prioritize nodes in a way that takes edge lengths into account.

Runtime Analysis:

Dijkstra's algorithm is structurally identical to breadth-first search. However, it is slower because the priority queue primitives are computationally more demanding than the constant-time eject's and inject's of BFS. Since makequeue takes at most as long as |V| insert operations, we get a total of |V| deletemin and |V| + |E| insert/decreasekey operations. The time needed for these varies by implementation; for instance, a binary heap gives an overall running time of O((|V| + |E|) \log |V|).
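A sketch using Python's heapq as the binary heap; heapq has no decreasekey, so stale heap entries are skipped instead ("lazy deletion"), which keeps the same asymptotic bound. The example graph is made up:

```python
import heapq

def dijkstra(graph, s):
    """Shortest distances from s; graph[u] is a list of (v, length) pairs."""
    dist = {u: float('inf') for u in graph}
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)            # deletemin
        if d > dist[u]:
            continue                          # stale entry; stands in for decreasekey
        for v, length in graph[u]:
            if dist[u] + length < dist[v]:
                dist[v] = dist[u] + length    # the "update" step
                heapq.heappush(heap, (dist[v], v))
    return dist

g = {'s': [('a', 4), ('b', 1)], 'a': [('c', 1)], 'b': [('a', 2)], 'c': []}
print(dijkstra(g, 's'))  # {'s': 0, 'a': 3, 'b': 1, 'c': 4}
```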

Priority Queue Implementation in Dijkstra (DPV 4.5)

Array

The simplest implementation of a priority queue is as an unordered array of key values for all potential elements (the vertices of the graph, in the case of Dijkstra’s algorithm).

Initially, these values are set to \infty. An insert or decreasekey is fast, because it just involves adjusting a key value, an O(1) operation. A deletemin, on the other hand, requires a linear-time scan of the list.

Binary Heap

Elements are stored in a complete binary tree, in which each level is filled in from left to right, and must be full before the next level is started.

And special ordering constraint: the key value of any node of the tree is less than or equal to that of its children.

To insert, place the new element at the bottom of the tree (in the first available position), and let it "bubble up." That is, if it is smaller than its parent, swap the two and repeat (Figure 4.11(b)–(d)). The number of swaps is at most the height of the tree, which is \lfloor \log_2 n \rfloor when there are n elements.

A decreasekey is similar, except that the element is already in the tree, so we let it bubble up from its current position.

To deletemin, return the root value. To then remove this element from the heap, take the last node in the tree (in the rightmost position in the bottom row) and place it at the root. Let it "sift down": if it is bigger than either child, swap it with the smaller child and repeat (Figure 4.11(e)–(g)). Again this takes O(\log n) time.

The regularity of a complete binary tree makes it easy to represent using an array. The tree nodes have a natural ordering: row by row, starting at the root and moving left to right within each row. If there are n nodes, this ordering specifies their positions 1,2,,n1, 2, \dots , n within the array. Moving up and down the tree is easily simulated on the array, using the fact that node number jj has parent j/2\lfloor j/2 \rfloor and children 2j2j and 2j+12j + 1

d-ary Heap

A d-ary heap is identical to a binary heap, except that nodes have d children instead of just two. This reduces the height of a tree with n elements to \Theta(\log_d n) = \Theta((\log n)/(\log d)).

Inserts are therefore sped up by a factor of \Theta(\log d). Deletemin operations, however, take a little longer, namely O(d \log_d n)

Negative-value edges? (DPV 4.6.1)

Dijkstra’s method can be thought as a sequence of “Update” operations

But Dijkstra's method marks off vertices once they have been used to calculate shorter distances; this cannot be done with negative-value edges, because an already-processed vertex might later receive a shorter distance via a negative edge.

🔥
So can we just simply update ALL the edges |V|-1 times? Bellman-Ford Algorithm \in O(|V||E|)
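A minimal sketch of Bellman-Ford on an edge list (the extra round at the end anticipates the negative-cycle check discussed in the next subsection; vertices and edges are made up):

```python
def bellman_ford(n, edges, s):
    """Single-source shortest paths with possibly negative edges.
    edges is a list of (u, v, length); vertices are 0..n-1. O(|V||E|)."""
    dist = [float('inf')] * n
    dist[s] = 0
    for _ in range(n - 1):                 # |V| - 1 rounds of updating ALL edges
        for u, v, length in edges:
            if dist[u] + length < dist[v]:
                dist[v] = dist[u] + length
    # One extra round: any further improvement means a reachable negative cycle.
    has_negative_cycle = any(dist[u] + l < dist[v] for u, v, l in edges)
    return dist, has_negative_cycle

edges = [(0, 1, 4), (0, 2, 2), (2, 1, -3), (1, 3, 1)]
print(bellman_ford(4, edges, 0))   # ([0, -1, 2, 0], False)
```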

Detecting Negative Cycles (DPV 4.6.2)

In such situations, it doesn’t make sense to even ask about shortest paths. There is a path of length 2 from A to E. But going round the cycle, there’s also a path of length 1, and going round multiple times, we find paths of lengths 0, −1, −2, and so on.

Fortunately, it is easy to automatically detect negative cycles and issue a warning. Such a cycle would allow us to endlessly apply rounds of update operations, reducing dist estimates every time.

🔥
So instead of stopping after |V| - 1 iterations (in Bellman-Ford), perform one extra round. There is a negative cycle if and only if some dist value is reduced during this final round.

Shortest Paths in DAGs, which cannot have negative cycles (DPV 4.7)

There are two subclasses of graphs that automatically exclude the possibility of negative cycles:

  1. graphs without negative edges
    1. Dijkstra
  1. graphs without cycles
    1. Can be solved in linear time!

Linearize (that is, topologically sort) the dag by depth-first search, and then visit the vertices in sorted order, updating the edges out of each.

Greedy Algorithms (DPV 5)

A player who is focused entirely on immediate advantage is easy to defeat. But in many other games such as Scrabble, it is possible to do quite well by simply making whichever move seems best at the moment

Minimum Spanning Trees

Suppose you are asked to network a collection of computers by linking selected pairs of them. Each link also has a maintenance cost, reflected in that edge’s weight. What is the cheapest possible network?

Trees

A tree is an undirected graph that is connected and acyclic

Property:

  1. A tree on n nodes has n-1 edges
  1. Any connected, undirected graph G=(V,E) with |E| = |V|-1 is a tree
  1. An undirected graph is a tree if and only if there is a unique path between any pair of nodes

🧙🏽‍♂️
Removing a cycle edge cannot disconnect a graph

The cut property:

Suppose edges X are part of a minimum spanning tree of G = (V, E). Pick any subset of nodes S for which X does not cross between S and V - S, and let e be the lightest edge across this partition. Then X \cup \{e\} is part of some MST.

Minimum spanning tree (Kruskal and Prim) algorithm animation demo - bilibili
An animated demonstration of how Kruskal's or Prim's algorithm builds a minimum-weight spanning tree of a connected weighted undirected graph.
https://www.bilibili.com/video/BV1Eb41177d1/

Data structure for disjoint sets - Union by rank

Faster find:

When we want to merge sets, if the representatives (roots) of the sets are rxr_x and ryr_y , do we make rxr_x point to ryr_y or the other way around? Since tree height is the main impediment to computational efficiency, a good strategy is to make the root of the shorter tree point to the root of the taller tree.

Properties:

  1. \forall x, \text{rank}(x) < \text{rank}(\pi(x))
  1. Any node of rank k has at least 2^k nodes in its tree
  1. If there are n elements overall, there can be at most n/2^k nodes of rank k

Kruskal’s Method

Sort the edges by their weights; every time, check if the smallest remaining edge would add a cycle to the graph. If it doesn't, then it gets added to the graph.

O(|E| \log |E|) \approx O(|E| \log |V|) for sorting edges and O(|E| \log |V|) for union and find operations

But what if the weights are small, so that sorting can be done in linear time? Can we perform union and find faster than \log n?

during each find, when a series of parent pointers is followed up to the root of a tree, we will change all these pointers so that they point directly to the root ⇒ this path compression gives the faster find

Now the amortized cost of find and union is O(1)
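A compact sketch combining the sorted-edge loop with a disjoint-set structure using union by rank and path compression (edge weights and names are made up):

```python
def kruskal(n, edges):
    """Minimum spanning tree by Kruskal's method; edges are (weight, u, v)."""
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:       # path compression: repoint everything at the root
            parent[x], x = root, parent[x]
        return root

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return False               # adding this edge would create a cycle
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx
        parent[ry] = rx                # shorter tree points to taller tree
        if rank[rx] == rank[ry]:
            rank[rx] += 1
        return True

    mst = []
    for w, u, v in sorted(edges):      # consider edges in increasing weight
        if union(u, v):
            mst.append((u, v, w))
    return mst

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(kruskal(4, edges))   # [(0, 1, 1), (1, 2, 2), (2, 3, 4)]
```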

Prim’s Method

Start from an arbitrary node; at each step we add the shortest edge connecting some node already in the tree to one that isn't yet.

In other words, maintain one group of already-selected vertices and another group containing the vertices not yet selected. Among the edges not yet chosen that connect the first group to the second, the one with the smallest weight is selected.

Huffman Encoding O(n \log n)

What is the most economic way of writing strings consisting of letters A,B,C,D?

The obvious choice is to use 2 bits per symbol—say codeword 00 for A, 01 for B, 10 for C, and 11 for D - in which case 260 megabits are needed in total. Can there possibly be a better encoding than this?

In search of inspiration, we take a closer look at our particular sequence and find that the four symbols are not equally abundant.

Is there some sort of variable-length encoding, in which just one bit is used for the frequently occurring symbol A, possibly at the expense of needing three or more bits for less common symbols?

A danger with having codewords of different lengths is that the resulting encoding may not be uniquely decipherable. For instance, if the codewords are \{0, 01, 11, 001\}, the decoding of strings like 001 is ambiguous. We will avoid this problem by insisting on the prefix-free property: no codeword can be a prefix of another codeword.

🔥
Any prefix-free encoding can be represented by a full binary tree ⇒ a binary tree in which every node has zero or two children ⇒ every codeword is generated by a path from root to leaf.

To measure how “good” a coding tree is, we want a tree whose leaves each correspond to a symbol and minimizes the overall length of the encoding

\text{cost of tree} = \sum_{i=1}^n \underbrace{f_i}_{\text{frequency of the $i$-th symbol}} \times \text{depth of $i$-th symbol in tree}
This tells us that the two symbols with the smallest frequencies MUST be at the bottom of the optimal tree, as children of the lowest internal node ⇒ because otherwise swapping either of those two with whatever is currently at the lowest level would improve the cost of the tree.

This suggests that we start constructing the tree greedily: find the two symbols with the smallest frequencies, say ii and jj, and make them children of a new node, which then has frequency fi+fjf_i + f_j

There is another way to write this cost function that is very helpful. Although we are only given frequencies for the leaves, we can define the frequency of any internal node to be the sum of the frequencies of its descendant leaves; this is, after all, the number of times the internal node is visited during encoding or decoding. During the encoding process, each time we move down the tree, one bit gets output for every nonroot node through which we pass.

And therefore

The cost of a tree is the sum of the frequencies of all leaves and internal nodes, except the root
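A heap-based sketch of the greedy construction (the frequencies loosely follow the A/B/C/D example above; the tree encoding as nested tuples and the walk helper are just one way to do it):

```python
import heapq

def huffman(freqs):
    """Build a Huffman code from {symbol: frequency}; returns {symbol: codeword}."""
    # heap entries are (frequency, tiebreak id, tree); trees are (symbol, left, right)
    heap = [(f, i, (sym, None, None)) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)        # the two smallest frequencies...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (None, t1, t2)))   # ...become siblings
        count += 1
    codes = {}
    def walk(node, prefix):
        sym, left, right = node
        if sym is not None:
            codes[sym] = prefix or '0'
        else:
            walk(left, prefix + '0')
            walk(right, prefix + '1')
    walk(heap[0][2], '')
    return codes

print(huffman({'A': 70, 'B': 3, 'C': 20, 'D': 37}))  # A gets the 1-bit codeword
```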


Entropy

Average number of bits needed to encode a single draw from the distribution
\text{number of bits needed to encode sequence} = \sum_{i=1}^n m p_i \log(1/p_i)

Measures the randomness of a set of mixed elements

\text{entropy} = \sum_{i=1}^n p_i \log(\frac{1}{p_i})

Horn Formula

A framework for having a computer do logical reasoning!

A literal is either: a variable or a variable’s negation

Then we have two kinds of clauses:

  1. Implications, whose left-hand side is an AND of any number of positive literals and whose right-hand side is a single positive literal: a \implies b
  1. Pure negative clauses, consisting of an OR of any number of negative literals: (\bar{u} \vee \bar{v} \vee \bar{y})

Given a set of clauses, the goal is to find a satisfying assignment: determine whether there is a consistent explanation, i.e., an assignment of true/false values to the variables that satisfies all the clauses.

Our approach:

  1. Start with all variables falsefalse
  1. and proceed to set some of them to truetrue
    1. one by one and only if we absolutely have to because an implication would be violated

Horn formulas lie at the heart of Prolog (“programming by logic”), a language in which you program by specifying desired properties of the output, using simple logical expressions

To see why the algorithm is correct, notice that if it returns an assignment, this assignment satisfies both the implications and the negative clauses, and so it is indeed a satisfying truth assignment of the input Horn formula. So we only have to convince ourselves that if the algorithm finds no satisfying assignment, then there really is none. This is so because our “stingy” rule maintains the following invariant:

If a certain set of variables is set to true, then they must be true in any satisfying assignment.
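A small sketch of this greedy "stingy" procedure; the clause representation and the example formula are made up for illustration:

```python
def horn_sat(implications, negative_clauses):
    """Greedy Horn-SAT: implications are (list_of_premises, conclusion);
    negative clauses are lists of variables, at least one of which must be false."""
    true_vars = set()
    changed = True
    while changed:                      # set a variable to true only when forced
        changed = False
        for premises, conclusion in implications:
            if all(p in true_vars for p in premises) and conclusion not in true_vars:
                true_vars.add(conclusion)
                changed = True
    for clause in negative_clauses:     # check the pure negative clauses at the end
        if all(v in true_vars for v in clause):
            return None                 # unsatisfiable
    return true_vars

# (True => w), (w AND y => x), plus the negative clause (not w OR not x)
imps = [([], 'w'), (['w', 'y'], 'x')]
print(horn_sat(imps, [['w', 'x']]))     # {'w'} is a satisfying assignment
```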

Set cover

County is in its early stages of planning and is deciding where to put schools. There are only two constraints: each school should be in a town, and no one should have to travel more than 30 miles to reach one of them.

What is the minimum number of schools needed?

Let S_x be the set of towns within 30 miles of town x; a school at x will essentially "cover" these other towns. Then how many sets S_x must be picked in order to cover all the towns in the county?

Greedy solution:

  1. Repeat until all elements of B are covered
    1. Pick the set S_i with the largest number of uncovered elements

But we will see that this is not optimal, but it’s not too far off, so let’s modify this.


Suppose B contains n elements and that the optimal cover consists of k sets; then the greedy algorithm will use at most k \ln n sets.

Let n_t be the number of elements still not covered after t iterations of the greedy algorithm. Since these remaining elements are covered by the optimal k sets,

there must be some set with at least n_t/k of them.

Therefore, the greedy strategy will ensure that

n_{t+1} \le n_t - \frac{n_t}{k}=n_t(1-\frac{1}{k})

Which implies n_t \le n_0(1-1/k)^t.

If we use an inequality

1-x \begin{cases} < e^{-x} &\quad \forall x \ne 0 \\ =e^{-x} &\quad \text{if } x =0 \end{cases}

Therefore

n_t \le n_0(1-\frac{1}{k})^t < n_0(e^{-1/k})^t = ne^{-t/k}
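A sketch of the greedy rule analyzed above; the towns and coverage sets are made up:

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most still-uncovered elements.
    Uses at most k ln n sets if the optimum uses k (see the analysis above)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            return None                 # some element cannot be covered at all
        chosen.append(best)
        uncovered -= set(best)
    return chosen

towns = range(1, 7)
S = [[1, 2, 3], [3, 4], [4, 5, 6], [1, 4]]   # hypothetical coverage sets
print(greedy_set_cover(towns, S))            # [[1, 2, 3], [4, 5, 6]]
```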

Dynamic Programming (DPV 6)

Shortest Paths in DAGS

👉
Every DAG can be expressed by linearized graph

In this particular graph, we see that in order for us to find shortest distance from SS to DD

dist(D) = \min\{dist(B)+1, dist(C)+3\}

We can compute all distances in a single pass:

Longest Increasing Subsequences

String Edit Distance O(nm)

Subproblems of string edit distance: E(i-1, j), E(i, j-1), E(i-1, j-1)
E(i, j) = \min\{1 + E(i-1, j), 1 + E(i, j-1), diff(i, j) + E(i-1, j-1)\}
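A direct sketch of this table-filling recurrence (the SNOWY/SUNNY pair is DPV's example, whose edit distance is 3):

```python
def edit_distance(x: str, y: str) -> int:
    """Classic O(nm) dynamic program; E[i][j] = edit distance of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i                      # delete all of x[:i]
    for j in range(m + 1):
        E[0][j] = j                      # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            E[i][j] = min(1 + E[i - 1][j],          # deletion
                          1 + E[i][j - 1],          # insertion
                          diff + E[i - 1][j - 1])   # match / substitution
    return E[n][m]

print(edit_distance('SNOWY', 'SUNNY'))   # 3
```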

Knapsack

During a robbery, a burglar finds much more loot than he had expected and has to decide what to take. His bag (or "knapsack") will hold a total weight of at most W pounds. There are n items to pick from, of weight w_1, \dots, w_n and dollar value v_1, \dots, v_n. What's the most valuable combination of items he can fit into his bag?

Knapsack with repetition O(nW)

If there are unlimited quantities of each item available,
K(w) = \text{maximum value achievable with a knapsack of capacity $w$}

If the optimal solution to K(w)K(w) includes item i, then removing this item from the knapsack leaves an optimal solution to K(wwi)K(w − w_i).

K(w) = \max_i \{ K(w - w_i) + v_i \}

Knapsack without repetition

K(w, j) = \text{maximum value achievable using a knapsack of capacity $w$ and items } 1, \dots, j
K(w, j) = \max\{K(w - w_j, j - 1) + v_j, K(w, j - 1)\}
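A sketch of the without-repetition table (the weights/values are DPV's four-item example with W = 10, where the optimum without repetition is 46):

```python
def knapsack(W, weights, values):
    """0/1 knapsack: K[w][j] = best value with capacity w and items 1..j. O(nW)."""
    n = len(weights)
    K = [[0] * (n + 1) for _ in range(W + 1)]
    for j in range(1, n + 1):
        for w in range(W + 1):
            K[w][j] = K[w][j - 1]                          # skip item j
            if weights[j - 1] <= w:                        # or take it (at most once)
                K[w][j] = max(K[w][j], K[w - weights[j - 1]][j - 1] + values[j - 1])
    return K[W][n]

print(knapsack(10, [6, 3, 4, 2], [30, 14, 16, 9]))   # 46 (items 1 and 3)
```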

(Chain) Matrix Multiplication

Observation: Order of multiplications make a big difference in the final runtime!
👉
Notice that a particular parenthesization can be represented very naturally by a binary tree in which the individual matrices correspond to the leaves, the root is the final product, and interior nodes are intermediate products.
For a tree to be optimal, its subtrees must also be optimal.

Let the subproblems corresponding to the subtrees be of the form A_i \times A_{i+1} \times \cdots \times A_j. For 1 \le i \le j \le n, we define a function

C(i,j) = \text{minimum cost of multiplying } A_i \times A_{i+1} \times \cdots \times A_j

When there’s nothing to multiply ⇒ C(i,i)=0C(i,i) = 0

We can split the subproblem into smaller subproblems

C(i,j) = \min_{i \le k < j} \{ C(i,k)+C(k+1,j) + m_{i-1} \cdot m_k \cdot m_j \}

The subproblems constitute a two-dimensional table, each of whose entries takes O(n) time to compute; the overall runtime is thus O(n^3).

Shortest Paths

We started this chapter with a dynamic programming algorithm for the elementary task of finding the shortest path in a dag. We now turn to more sophisticated shortest-path problems and see how these too can be accommodated by our powerful algorithmic technique.

Shortest Reliable Paths \in O(|V||E|)

In a communications network, for example, even if edge lengths faithfully reflect transmission delays, there may be other considerations involved in choosing a path. For instance, each extra edge in the path might be an extra "hop" fraught with uncertainties and dangers of packet loss.

dist(v,i) = \min_{(u,v) \in E} \{ dist(u,i-1) + l(u,v) \}

All-pairs shortest paths (Floyd-Warshall Algorithm) \in O(|V|^3)

What if we want to find the shortest path not just between s and t but between all pairs of vertices? One approach would be to execute our general shortest-path algorithm, Bellman-Ford (since there may be negative edges), |V| times, once for each starting node. The total running time would then be O(|V|^2|E|). We'll now see a better alternative, the O(|V|^3) dynamic programming-based Floyd-Warshall algorithm.

👉
If we disallow intermediate nodes, then the shortest path from u to v is simply the edge (u,v) ⇒ So can we just slowly expand the set of allowed intermediates, one node at a time, updating the shortest path lengths at each step; once we've expanded to the whole set of vertices, we have found the true shortest paths!

If we number the vertices V as \{1,2,\dots,n\}, we denote by dist(i,j,k) the length of the shortest path from i to j in which only nodes \{1,2,\dots,k\} can be used as intermediates.

What happens when we expand the intermediate set to include an extra node k? We must reexamine all pairs i, j and check whether using k as an intermediate point gives us a shorter path from i to j.

Then using kk gives us a shorter path from ii to jj iff

dist(i,k,k-1) + dist(k,j,k-1) < dist(i,j,k-1)

So the algorithm (Floyd-Warshall Algorithm) \in O(|V|^3):
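A sketch of the triple loop, with the k-index kept implicit by updating the table in place (edge list and example values are made up):

```python
def floyd_warshall(n, edges):
    """All-pairs shortest paths; edges are (u, v, length), vertices 0..n-1. O(|V|^3)."""
    INF = float('inf')
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, length in edges:
        dist[u][v] = min(dist[u][v], length)
    for k in range(n):                    # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

edges = [(0, 1, 3), (1, 2, -2), (0, 2, 4), (2, 3, 1)]
print(floyd_warshall(4, edges)[0])   # [0, 3, 1, 2]
```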

Traveling Salesman Problem (TSP) \in O(n^2 2^n)

A traveling salesman is getting ready for a big sales tour. Starting at his hometown, he will conduct a journey in which each of his target cities is visited exactly once before he returns home. Given the pairwise distances between cities, what is the best order in which to visit them, so as to minimize the overall distance traveled?

If we denote the cities by 1, \dots, n, the salesman's hometown being 1, and let D = (d_{ij}) be the matrix of intercity distances, the goal is to design a tour that starts and ends at 1, includes all other cities exactly once, and has minimum total length.

Brute force takes O(n!) time, and although DP improves this runtime a lot, it is still not possible to solve this in polynomial time.

Subproblems:

For a subset of cities S = \{1,j, \dots\} \subseteq \{1,2, \dots, n\} that includes 1, and j \in S, let C(S,j) be the length of the shortest path visiting each node in S exactly once, starting at 1 and ending at j.

When |S|>1, we define C(S,1) = \infty since the path cannot both start and end at 1.

C(S,j) = \min_{i\in S: i \ne j} C(S-\{j\},i)+d_{ij}

The subproblems are ordered by |S|, and the whole algorithm is:

There are at most 2^n \times n subproblems, and each one takes linear time to solve, so the overall complexity is O(n^2 2^n)

Independent Sets in Trees \in O(|V|+|E|)

A subset of nodes S \subseteq V is an independent set of graph G = (V,E) if there are no edges between them.

Like before, we can form this problem into a tree of nodes representing sub-problems

We will define the subproblem (u is a node in the tree, which also represents a node in the graph):

We will simply select any node of the graph's vertex set V as the root of the tree.
I(u) = \text{size of largest independent set of subtree hanging from $u$}

Our goal is to find I(r), where r represents the root of the tree

I(u) = \max \{1 + \sum_{w \in grandchildren(u)} I(w), \sum_{w \in children(u)} I(w) \}

If the independent set includes uu, then we get one point for it, but we aren’t allowed to include the children of uu (since they are directly connected to uu)—therefore we move on to the grandchildren.

The number of subproblems is exactly the number of vertices, and with a little care the runtime can be made linear, O(|V|+|E|)

Linear Programming and Reductions (DPV 7)

Linear Programming (DPV 7.1 + 7.4)

We are given a set of variables, and we want to assign real values to them so as to

  1. satisfy a set of linear equations / inequalities involving these variables
  1. maximize or minimize a given linear objective function
A typical linear program

An inequality designates a half-space, the region on one side of the line.

Thus the set of all feasible solutions of this linear program, that is, the points (x1, x2) which satisfy all constraints, is the intersection of five half-spaces, and is a convex polygon.

We want to find the point in this polygon at which the objective function—the profit—is maximized. The points with a profit of c dollars lie on the line x_1 + 6x_2 = c, which has a slope of -1/6.

Since the goal is to maximize c, we must move the line as far up as possible, while still touching the feasible region. The optimum solution will be the very last feasible point that the profit line sees and must therefore be a vertex (or edge) of the polygon, as shown in the figure.

Standard Form of LP (DPV 7.1)

  1. All variables are nonnegative
  1. Constraints are all equations
  1. objective function is to be minimized

Duality of LP (DPV 7.4)

Duality theorem: If a linear program has a bounded optimum, then so does its dual, and the two optimum values coincide
🤔
When the primal is the LP that expresses the max-flow problem, it is possible to assign interpretations to the dual variables that show the dual to be none other than the minimum-cut problem

Simplex Algorithm

Fact: each vertex is the unique point at which some subset of hyperplanes meet.

So to find a vertex:

Suppose we have a generic LP

\max \{c^\top x\} \\ Ax \le b \\ x \ge 0

For each iteration, we want to

  1. Check if the current vertex is optimal
  1. Determine where to move next

Note:

The origin is a vertex that we can start with.

  1. The current vertex (the origin) is optimal iff all c_i \le 0
  1. Otherwise we can increase some x_i for which c_i > 0 (release the tight constraint x_i \ge 0) until we hit some other constraint (loose ⇒ tight)

Now we arrived at some other vertex uu

Then we transform uu into the origin(shifting the coordinate system from the usual to the “local” view of uu)

These local coordinates consist of (appropriately scaled) distances y1,,yny_1, \dots , y_n to the nn hyperplanes (inequalities) that define and enclose uu:

The revised “local” LP has the following three properties:

  1. It includes the inequalities y \ge 0, which are simply the transformed versions of the inequalities defining u
  1. u is the origin in y-space
  1. The cost function becomes \max c_u + \tilde{c}^\top y, where c_u is the value of the objective function at u and \tilde{c} is a transformed cost vector

The simplex function moves from vertex to neighboring vertex, stopping when the objective function is locally optimal, that is, when the coordinates of the local cost vector are all zero or negative.


Runtime Analysis

Suppose there are nn variables and mm inequalities

At each iteration:

  1. Each neighbor shares n-1 inequalities with the current vertex, so there are at most n \cdot m neighbors
    1. Choose which inequality to drop (n choices) and which new one to add (m choices)
  1. Naive way to perform each iteration
    1. Check each potential neighbor to see whether it really is a vertex of the polyhedron and to determine its cost
      1. Finding cost is just a dot product
      1. Checking if it is a true vertex involves solving a system of n linear equations in n unknowns
        1. by Gaussian Elimination O(n^3)
    1. \in O(mn^4) per iteration
  1. Not-naive way to perform each iteration
    1. Turns out that the per-iteration overhead of rewriting the LP in terms of the current local coordinates is just O((m+n)n)O((m + n)n)
      1. exploits the fact that the local view changes only slightly between iterations, in just one of its defining inequalities
    1. To select the best neighbor
      1. we recall that the (local view of) the objective function is of the form maxcu+c~y\max c_u +\tilde{c} \cdot y where cuc_u is the value of the objective function at uu.
      1. We pick any c~i>0\tilde{c}_i > 0 (if there is none, then the current vertex is optimal and simplex halts)
    1. O(mn)\in O(mn) per iteration
  1. At most \binom{m+n}{n} iterations (vertices)
    1. Exponential in nn
    1. In practice, such exponential examples do not occur

Finding a starting vertex for simplex

It turns out that finding a starting vertex can be reduced to an LP and solved by simplex!

Recall standard form:

\min c^\top x \\ \text{s.t. } Ax=b, \; x \ge 0
  1. First make sure that the RHS of the equations are all non-negative
    1. If b_i < 0, just multiply both sides of the i-th equation by -1
    1. Create an LP as follows
      1. Create artificial variables z_1, \dots, z_m \ge 0, where m is the number of equations
      1. Add z_i to the LHS of the i-th equation
      1. Let the objective to be minimized be \sum_i z_i
    1. It's easy to come up with a starting vertex
      1. Set each z_i = b_i and all other variables to 0
    1. If the optimum value of z_1 + \cdots + z_m is zero, then all z_i's obtained by simplex are zero, and hence from the optimum vertex of the new LP we get a starting feasible vertex of the original LP, just by ignoring the z_i
    1. If \sum_i z_i is not zero ⇒ the original linear program is infeasible

Degeneracy in Simplex Algorithm

When a vertex has more than n (the number of variables) constraints intersecting at it, then choosing any one of the sets of n inequalities would give us the same solution

Unboundedness of LP

Network Flow (Maximum Flow \in O(|V| \cdot |E|^2)) (DPV 7.2)

The networks we are dealing with consist of a directed graph G = (V, E); two special nodes s, t \in V, which are, respectively, a source and sink of G; and capacities c_e > 0 on the edges.

We would like to send as much as possible from s to t without exceeding the capacities of any of the edges. A particular shipping scheme is called a flow and consists of a variable f_e for each edge e of the network, satisfying the following two properties:

  1. Doesn't violate edge capacities: 0 \le f_e \le c_e
  1. For all nodes u except s and t, the amount of flow entering and the amount of flow leaving u should be the same ⇒ flow is conserved
    1. \sum_{(w,u) \in E} f_{wu} = \sum_{(u,z) \in E} f_{uz}

The size of a flow is the total quantity sent from s to t and is the quantity leaving s

size(f) = \sum_{(s,u) \in E} f_{su}

We can solve max flow by Ford-Fulkerson

Ford-Fulkerson in 5 minutes
Step by step instructions showing how to run Ford-Fulkerson on a flow network. Sources: 1. http://www.win.tue.nl/~nikhil/courses/2WO08/07NetworkFlowI.pdf LinkedIn: https://www.linkedin.com/in/michael-sambol-076471ba
https://www.youtube.com/watch?v=Tl90tNtKvxs
  1. Find an augmenting path in the graph
    1. BFS(Edmonds–Karp), A* search, RL Model
  1. Construct the residual graph by augmenting the path based on the original graph and augmenting path
  1. Do this until no augmenting path can be found.
cs170-coding-notebooks-fa22/hw8.ipynb at master · ToiletCommander/cs170-coding-notebooks-fa22
Coding Jupyter Notebooks for Fall 2022 Iteration of CS 170 - cs170-coding-notebooks-fa22/hw8.ipynb at master · ToiletCommander/cs170-coding-notebooks-fa22
https://github.com/ToiletCommander/cs170-coding-notebooks-fa22/blob/master/hw8/hw8.ipynb
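A minimal sketch of the augmenting-path loop with BFS path selection (Edmonds-Karp, O(|V||E|^2)); the dict-of-dicts capacity format and the example network are made up, and this is separate from the notebook linked above:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Ford-Fulkerson with BFS augmenting paths; capacity is a dict-of-dicts,
    mutated in place so it doubles as the residual network."""
    for u in list(capacity):                          # ensure reverse residual edges exist
        for v in list(capacity[u]):
            capacity.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {s: None}
        Q = deque([s])
        while Q and t not in parent:                  # BFS for an augmenting path
            u = Q.popleft()
            for v, c in capacity[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    Q.append(v)
        if t not in parent:
            return flow                               # no augmenting path left
        path, v = [], t                               # recover the path s -> t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity[u][v] for u, v in path)
        for u, v in path:
            capacity[u][v] -= bottleneck              # use up forward residual capacity
            capacity[v][u] += bottleneck              # add reverse (cancellation) capacity
        flow += bottleneck

caps = {'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'a': 1, 't': 2}, 't': {}}
print(max_flow(caps, 's', 't'))   # 4
```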

The max flow problem can also be reduced to a LP problem

Once we have studied the simplex algorithm in more detail (DPV 7.6), we will be in a position to understand exactly how it handles flow LPs, which is useful as a source of inspiration for designing direct max-flow algorithms. It turns out that in fact the behavior of simplex has an elementary interpretation:

in each iteration simplex looks for an sts \rightarrow t path whose edges (u,v)(u, v) can be of two types

  1. (u,v)(u,v) is in the original network, and is not yet at full capacity
  1. The reverse edge (v,u)(v,u) is in the original network, and there is some flow along it

If the current flow is f, then in the first case, edge (u, v) can handle up to c_{uv} - f_{uv} additional units of flow, and in the second case, up to f_{vu} additional units (canceling all or part of the existing flow on (v, u)). These flow-increasing opportunities can be captured in a residual network G^f = (V, E^f), which has exactly the two types of edges listed, with residual capacities c^f:

c^f_{uv} = \begin{cases} c_{uv} - f_{uv} &\quad \text{if $(u,v) \in E$ and $f_{uv} < c_{uv}$} \\ f_{vu} &\quad \text{if $(v,u) \in E$ and $f_{vu} > 0$} \end{cases}

Thus we can equivalently think of simplex as choosing an sts − t path in the residual network

🧙🏽‍♂️
Not only does simplex correctly compute a maximum flow, but it also generates a short proof of the optimality of this flow!

Partition the nodes of the oil network (Figure 7.4) into two groups, L={s,a,b}L = \{s, a, b\} and R={c,d,e,t}R = \{c, d, e, t\}:

Any oil transmitted must pass from LL to RR. Therefore, no flow can possibly exceed the total capacity of the edges from LL to RR, which is 7. But this means that the flow we found earlier, of size 7, must be optimal

🔥
More generally, an (s,t)(s, t)-cut partitions the vertices into two disjoint groups LL and RR such that ss is in LL and tt is in RR. Its capacity is the total capacity of the edges from LL to RR, and as argued previously, is an upper bound on any flow: Pick any flow ff and any (s,t)(s, t)-cut (L,R)(L, R). Then size(f)capacity(L,R)size(f ) \le capacity(L, R).

Some cuts are large and give loose upper bounds— cut({s,b,c},{a,d,e,t})cut(\{s, b, c\}, \{a, d, e, t\}) has a capacity of 19. But there is also a cut of capacity 7, which is effectively a certificate of optimality of the maximum flow. Such a cut always exists.

Max-flow min-cut theorem

The size of the maximum flow in a network equals the capacity of the smallest (s,t)(s, t)-cut. Moreover, our algorithm automatically finds this cut as a by-product!

Suppose ff is the final flow when the algorithm terminates. We know that node tt is no longer reachable from ss in the residual network GfG^f . Let LL be the nodes that are reachable from ss in GfG^f , and let R=VLR = V − L be the rest of the nodes. Then (L,R)(L, R) is a cut in the graph G:

We claim that:

size(f)=capacity(L,R)size(f) = capacity(L,R)

Runtime:

Each iteration requires $O(|E|)$ time if DFS or BFS is used to find an $s$-$t$ path.

Suppose all edges in the original network have integer capacities C\le C. Then an inductive argument shows that on each iteration of the algorithm, the flow is always an integer and increases by an integer amount. Therefore, since the maximum flow is at most CEC|E| ⇒ the number of iterations is at most this much.

But what if $C$ is big? If paths are chosen in a sensible manner, in particular by using BFS, which finds the augmenting path with the fewest edges, then the number of iterations is at most $O(|V||E|)$.

Bipartite Matching

There is an edge between a boy and girl if they like each other (for instance, Al likes all the girls). Is it possible to choose couples so that everyone has exactly one partner, and it is someone they like? In graph-theoretic jargon, is there a perfect matching?

Create a new source node, s, with outgoing edges to all the boys; a new sink node, t, with incoming edges from all the girls; and direct all the edges in the original bipartite graph from boy to girl (Figure 7.8). Finally, give every edge a capacity of 1. Then there is a perfect matching if and only if this network has a flow whose size equals the number of couples.

Nice property about max flow problem: if all edge capacities are integers, then the optimal flow found by our algorithm is integral

Zero-sum Games (DPV 7.5)

Rock-paper-scissors game’s payoff matrix illustrated as Row’s gain

Now suppose the two of them play this game repeatedly. If Row always makes the same move, Column will quickly catch on and will always play the countermove, winning every time. Therefore Row should mix things up: we can model this by allowing Row to have a mixed strategy, in which on each turn she plays $r$ with probability $x_1$, $p$ with probability $x_2$, and $s$ with probability $x_3$. This strategy is specified by the vector $x = (x_1, x_2, x_3)$. Similarly, Column's mixed strategy is some $y = (y_1, y_2, y_3)$.

On any given round of the game, there is an xiyjx_i \cdot y_j chance that Row and Column will play the i-th and j-th moves, respectively.

Therefore the expected (average) gain of Row is:

i,jGijxiyj\sum_{i,j} G_{ij} \cdot x_i y_j
📌
Notice that Column can force the expected gain to be zero by choosing $y = (1/3, 1/3, 1/3)$, no matter what strategy Row chooses to play, and Row can do the same. In short, the best each player can do is to play completely randomly, with an expected payoff of zero.

If we investigate with a nonsymmetric game ⇒ then the “best” each player can do might be different

Imagine a presidential election scenario in which there are two candidates for office, and the moves they make correspond to campaign issues on which they can focus (the initials stand for economy, society, morality, and tax cut). The payoff entries are millions of votes lost by Column.

Suppose now Row's strategy is $r = (x_1, x_2)$. Column $c = (y_1, y_2)$ will then respond so as to minimize Row's gain:

minc=(y1,y2){3x12x2,x1+x2}\min_{c = (y_1, y_2)} \{3x_1 - 2x_2, -x_1 + x_2\}

Since Row is forced to announce its strategy first, it should pick $r$ to maximize this worst case:

maxrminc{3x12x2,x1+x2}\max_{r} \min_{c} \{3x_1 - 2x_2, -x_1 + x_2\}

We can transform this problem into an LP:

Objective of Column
Objective of Row

Dual of this LP: if Column announces its strategy first and then Row chooses

Multiplicative Weights Algorithm / Hedge Algorithm

Per-step regret RTTO(lnnT)\frac{R_T}{T} \in O(\sqrt{\frac{\ln n}{T}})

Consider the following online optimization setup:

The algorithm:
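The setup and pseudocode are left blank in these notes; as a placeholder, here is a minimal sketch of the standard Hedge update over $n$ experts with losses in $[0,1]$ (the names and the choice $\eta \approx \sqrt{\ln n / T}$ are ours), which attains the per-step regret quoted above:

```python
import math

def hedge(losses, eta):
    """Hedge / multiplicative weights sketch: `losses` is a T x n list of per-expert
    losses in [0, 1]. Play the normalized weights, then exponentially downweight
    each expert by its loss. Returns the algorithm's total expected loss."""
    n = len(losses[0])
    w = [1.0] * n                                      # one weight per expert
    total = 0.0
    for loss in losses:
        Z = sum(w)
        p = [wi / Z for wi in w]                       # distribution played this round
        total += sum(pi * li for pi, li in zip(p, loss))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
    return total
```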

Multiplicative Weights in Minmax Games

Bounds:

Circuit Evaluation (DPV 7.7)

We are given a Boolean circuit, a DAG of gates of the following types

Additionally, one of the gates is designated as the output.

CIRCUIT VALUE problem: apply Boolean logic to the gates in topological order, does the output evaluate to true?

🔥
We can translate this problem into a linear program, by creating a variable xgx_g for each gate gg, with constraints 0xg10 \le x_g \le 1

⬆️Additional Constraints imposed by different gates

These constraints force each $x_g$ to be $0$ (false) or $1$ (true).

It turns out that CIRCUIT VALUE is, in a sense, the most general problem solvable in polynomial time: every algorithm eventually runs on a computer, and the computer is ultimately a Boolean combinational circuit implemented on a chip. If the algorithm runs in polynomial time, it can be rendered as a Boolean circuit consisting of polynomially many copies of the computer's circuit, one per unit of time.

Hence, the fact that CIRCUIT VALUE reduces to linear programming means that every problem solvable in polynomial time reduces to linear programming as well!

NP-Complete Problems (DPV 8)

Search Problems (DPV8.1)

A search problem is specified by an algorithm CC that takes two inputs, an instance II and a proposed solution SS, and runs in time polynomial in I|I|. We say SS is a solution to II if and only if C(I,S)=trueC(I,S) = \text{true}.

SAT(Satisfiability) problem

(xyz)(xyˉ)(yzˉ)(zxˉ)(xˉyˉzˉ)(x \vee y \vee z) (x \vee \bar{y})(y \vee \bar{z})(z \vee \bar{x})(\bar{x} \vee \bar{y} \vee \bar{z})

Given a Boolean formula in Conjunctive Normal Form (CNF), which is a collection of clauses, each consisting of a disjunction of several literals (a literal is either a Boolean variable or the negation of one), find a satisfying truth assignment.

Horn Formula:

2-literals case:

Traveling salesman problem (TSP)

Given nn vertices 1,,n1, \dots, n and all n(n1)/2n(n-1)/2 distances between them, as well as budget bb. We want to find a tour, a cycle that passes through every vertex exactly once, of total cost bb or less (or report that no such tour exists)

In other words (in search problem’s perspective), we want to find a permutation τ(1),,τ(n)\tau(1), \dots, \tau(n) such that when they are toured in this order, the total distance covered is at most bb:

dτ(1),τ(2)+dτ(2),τ(3)++dτ(n),τ(1)bd_{\tau(1), \tau(2)} + d_{\tau(2), \tau(3)} + \cdots + d_{\tau(n), \tau(1)} \le b

There's a budget so that this is a search problem ⇒ a proposed solution should be easy to recognize (checkable in polynomial time).

Euler and Rudrata

Euler’s Bridge problem: Is there a path that visits each edge exactly once?

Or in other words: When can a graph be drawn without lifting the pencil from the paper?

🤔
Euler's answer: iff (1) the graph is connected, and (2) every vertex, with the possible exception of two (the start and end vertices of the walk), has even degree

Rudrata’s Cycle problem: Can one visit all the squares of the chessboard, without repeating any square, in one long walk that ends at the starting square and at each step makes a legal knight move?

In other words: Given a graph, find a cycle that visits each vertex exactly once - or report that no such cycle exists ⇒ Oops! Looks similar to TSP!

Cuts and Bisections

Balanced Cut problem: Given a graph with nn vertices and a budget bb, partition the vertices into two sets SS and TT such that S,Tn/3|S|, |T| \ge n/3 and such that there are at most bb edges between SS and TT

Applications: clustering, e.g. the problem of segmenting an image into its constituent components (say, an elephant standing in a grassy field with a clear blue sky above). Put a node for each pixel, and put an edge between nodes whose corresponding pixels are spatially close together and also similar in color.

Integer Linear Programming

We mentioned in DPV 7 that there is a different, polynomial algorithm for linear programming. But the situation changes completely if, in addition to specifying a linear objective function and linear inequalities, we also constrain the solution (the values of the variables) to be integers.

In order to define a search problem, we define a goal $g$ (the counterpart, for maximization problems, of a budget constraint). Then we have:

\text{find integer } x \text{ such that } \quad Ax \le b, \quad c^\top x \ge g

Special case of ILP (Zero-One Equations [ZOE]): given an $m \times n$ matrix $A$ with 0-1 entries, find a 0-1 vector $x$ such that $Ax = \mathbf{1}$, the all-ones vector.

Three-Dimensional Matching

nn boys, nn girls and nn pets, and the compatibilities among them are specified by a set of triples {b,g,p}\{ b, g, p \}, each containing a boy, a girl, and a pet. We want to find nn disjoint triples and thereby create nn harmonious households

Each triple is shown as a triangular-shaped node joining boy, girl, and pet.

Independent set, vertex cover, and clique

Independent Set Problem: Given a graph and an integer gg, find gg vertices that are independent (no two of which have an edge between them)

Vertex Cover (Special Case of Set Cover): input is a graph and a budget bb, find bb vertices that cover every edge

Set Cover: Given a set $E$ and several subsets of it, $S_1, \dots, S_m$, along with a budget $b$. Select $b$ of the subsets so that their union is $E$.

CLIQUE problem: Given a graph and a goal gg, find a set of gg vertices such that all possible edges between them are present.

Longest Path: Given a graph $G$ with nonnegative edge weights and two distinguished vertices $s$ and $t$, along with a goal $g$, we want to find a path from $s$ to $t$ with total weight at least $g$.

Knapsack and subset sum

KNAPSACK problem (DPV 6.4, DP solution O(nW)\in O(nW)): Given integer weights w1,,wnw_1, \dots, w_n and integer values v1,,vnv_1, \dots, v_n for nn items. Also given a weight capacity WW and goal gg, find a set of items whose total weight is at most WW and total value is at least gg.

UNARY KNAPSACK: KNAPSACK with unary-coded integers (IIIIIIIIIIII=12IIIIIIIIIIII = 12)

SUBSET SUM: Each item’s value is equal to its weight (given in binary), goal gg is the same as the capacity WW ⇒ finding a subset of given set of integers that adds up to exactly WW

NP-Complete Problems (DPV 8.2)

The world is full of search problems, some can be solved efficiently, some seem to be very hard

| Hard Problems (NP-Complete) | Easy Problems (in P) |
| --- | --- |
| 3 SAT | 2 SAT, HORN SAT |
| Traveling Salesman Problem | Minimum Spanning Tree |
| Longest Path | Shortest Path |
| 3D Matching | Bipartite Matching |
| Knapsack | Unary Knapsack |
| Independent Set | Independent Set on Trees |
| Integer LP | Linear Programming |
| Rudrata Path | Euler Path |
| Balanced Cut | Minimum Cut |

P and NP

Search Problems: There is an efficient checking algorithm $C$ that takes as input an instance $I$ (the data specifying the problem) and a proposed solution $S$, and outputs $\text{true}$ iff $S$ is a solution to $I$, in time polynomial in $|I|$.

We denote the class of all search problems by NPNP (nondeterministic polynomial time).

We also denote the class of all search problems that can be solved in polynomial time by PNPP \sube NP

🔥
Is P=NPP = NP? Most researchers think PNPP \ne NP.

if $P = NP$, there's an efficient method to prove any mathematical statement, thus eliminating the need for mathematicians! ⇒ this gives a nice intuition for why $P \ne NP$; however, proving this is extremely difficult.

Reductions in search problems

The problems on the left side of the table are all, in some sense, exactly the same problem, except that they are stated in different languages. We will use reductions to show that those problems are the hardest problems in $NP$: if even one of them has a polynomial time algorithm, then every problem in $NP$ has a polynomial time algorithm.

A reduction from search problem AA to search problem BB is a polynomial-time algorithm ff that transforms any instance II of AA into an instance f(I)f(I) of BB, together with another polynomial-time algorithm hh that maps any solution SS of f(I)f(I) back into a solution h(S)h(S) of II. If f(I)f(I) has no solution, then neither does II.

Now let’s define the class of hardest search problems

A search problem is NP-complete if all other search problems reduce to it.

Nice properties about reduction:

A side note on factoring:

Factoring is the task of finding all prime factors of a given integer, but the difficulty of factoring is of a different nature, and nobody believes that factoring is NP-complete. One major difference is that a solution always exists (there is no scenario in which we must report that no solution exists), and the problem succumbs to the power of quantum computation.

The Reductions (DPV 8.3)

Coping with NP-Completeness (DPV 9)

If we have a hard problem

  1. Start by proving that it is NP-Complete
    1. A proof by generalization
    1. Simple reduction from another NP-Complete problem to this problem
  1. Once problem is proved NP-Complete
    1. Special cases of the problem might be solvable in linear time
    1. e.g. graph problems restricted to trees can often be solved by DP in linear time
    1. approximation algorithm
      1. falls short of the optimum but never by too much
      1. Greedy algorithm for set cover O(logn)O(\log n)
    1. heuristics
      1. algorithms with no guarantees on either the running time or the degree of approximation
      1. rely on ingenuity, intuition, a good understanding of the application, meticulous experimentation to tackle the problem

Intelligent Exhaustive Search (DPV 9.1)

Backtracking

It is often possible to reject a solution by looking at just a small portion of it

By quickly checking and discrediting some partial assignment, we are able to prune some portion of the search space.

Consider the Boolean formula ϕ(w,x,y,z)\phi(w,x,y,z) specified by the set of clauses

(wxyz),(wxˉ),(xyˉ),(yzˉ),(zwˉ),(wˉzˉ)(w \vee x \vee y \vee z), (w \vee \bar{x}), (x \vee \bar{y}), (y \vee \bar{z}), (z \vee \bar{w}), (\bar{w} \vee \bar{z})

We will start by branching on any one variable (ww)

No clause is immediately violated, thus neither of those partial assignments can be eliminated outright. Let’s keep branching on xx with w=0w=0

Now we see that w=0,x=1w=0, x=1 violates (wxˉ)(w \vee \bar{x}) and this helps us prune a good chunk of search space.

In the case of Boolean satisfiability, each node of the search tree can be described either by a partial assignment or by the clauses that remain when those values are plugged into the original formula.

For instance, if $w = 0$ and $x = 0$, then any clause containing $\bar{w}$ or $\bar{x}$ is instantly satisfied, while the literals $w$ and $x$ are false and can be removed from their clauses. What's left is

(yz),(yˉ),(yzˉ)(y \vee z), (\bar{y}), (y\vee\bar{z})

Likewise for w=0,x=1w=0, x=1

(),(yzˉ)(), (y \vee \bar{z})

The "empty clause" ( ) rules out satisfiability.

Thus the nodes of the search tree, representing partial assignments, are themselves SAT sub-problems. This alternative representation is helpful for making the two decisions that repeatedly arise: which subproblem to expand next, and which branching variable to use.

Since the benefit of backtracking lies in its ability to eliminate portions of the search space, and since this happens only when an empty clause is encountered, it makes sense to choose the subproblem that contains the smallest clause and to then branch on a variable in that clause.

More abstractly, a backtracking algorithm requires a test that looks at a subproblem and quickly declares one of three outcomes

  1. Failure: the subproblem has no solution
  1. Success: A solution to the subproblem is found
  1. Uncertainty

In the case of SAT, this test declares failure if there is an empty clause, success if there are no clauses, and uncertainty otherwise.
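A minimal Python sketch of this backtracking scheme for CNF-SAT; the representation and names are ours (a clause is a list of signed integers, $+v$ / $-v$ meaning variable $v$ appears positively / negated):

```python
def backtracking_sat(clauses, assignment=None):
    """Backtracking SAT sketch. Returns a satisfying (partial) assignment as a dict,
    or None if the formula is unsatisfiable."""
    if assignment is None:
        assignment = {}
    # Plug the partial assignment into the formula
    remaining = []
    for clause in clauses:
        kept, satisfied = [], False
        for lit in clause:
            var, want = abs(lit), lit > 0
            if var in assignment:
                if assignment[var] == want:
                    satisfied = True        # clause already satisfied: drop it
                    break
            else:
                kept.append(lit)            # literal still undetermined
        if not satisfied:
            if not kept:
                return None                 # empty clause => failure, prune this branch
            remaining.append(kept)
    if not remaining:
        return assignment                   # no clauses left => success
    # Uncertain: branch on a variable from the smallest remaining clause
    var = abs(min(remaining, key=len)[0])
    for value in (True, False):
        result = backtracking_sat(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None
```

On the example formula above, encoded as `[[1,2,3,4],[1,-2],[2,-3],[3,-4],[4,-1],[-1,-4]]` with $w,x,y,z$ as variables $1,2,3,4$, this returns `None`: every branch eventually produces an empty clause.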

Branch-and-bound

The same backtracking principle can be generalized from search problems to optimization problems.

Suppose we have a minimization problem

As before, we need a basis for eliminating partial solutions, since there is no other source of efficiency in our method. To reject a subproblem, we must be certain that its cost exceeds that of some other solution we have already encountered. But its exact cost is unknown to us and is generally not efficiently computable. So instead we use a quick lower bound on this cost.

Use this on TSP problem with graph G=(V,E)G = (V,E) with edge lengths de>0d_e > 0:

Approximation Algorithms (DPV 9.2)

Suppose we have an algorithm AA for our problem which given the instance II returns a solution with value A(I)A(I), the approximation ratio of algorithm AA is defined as

αA=maxIA(I)OPT(I)\alpha_A = \max_I \frac{A(I)}{OPT(I)}

$\alpha_A$ measures the factor by which the output of algorithm $A$ exceeds the optimal solution, on the worst-case input.

When faced with an NP-complete optimization problem, a reasonable goal is to look for an approximation algorithm AA whose αA\alpha_A is as small as possible.

Vertex Cover

Given an undirected graph G=(V,E)G = (V,E), find a subset of the vertices SVS \sube V that touches every edge and S|S| is minimized.

Matching: a subset of edges, no two of which share a vertex. A matching is maximal if no more edges can be added to it.

🧙🏽‍♂️
Maximal matchings will help us find good vertex covers, and moreover, they are easy to generate: repeatedly pick edges that are disjoint from the ones chosen already, until this is no longer possible.

Any vertex cover of a graph $G$ must be at least as large as the number of edges in any matching in $G$; that is, any matching provides a lower bound on OPT, simply because each edge of the matching must be covered by one of its endpoints in any vertex cover.

Let $S$ be the set that contains both endpoints of each edge in a maximal matching $M$. Then $S$ must be a vertex cover: if it isn't, that is, if it doesn't touch some edge $e \in E$, then $M$ could not possibly be maximal, since we could still add $e$ to it. But our cover $S$ has $2|M|$ vertices. And from the previous paragraph we know that any vertex cover must have size at least $|M|$, so we are done ($S$ is our approximation of the vertex cover)!

This simple procedure always returns a vertex cover whose size is at most twice optimal!
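A sketch of this procedure (our own naming; `edges` is a list of vertex pairs):

```python
def vertex_cover_2approx(edges):
    """Greedily build a maximal matching, then take both endpoints of every matched edge.
    The returned cover has size at most twice the minimum vertex cover."""
    matched = set()            # vertices already used by the matching
    cover = set()
    for u, v in edges:
        if u not in matched and v not in matched:   # edge disjoint from matching so far
            matched.update((u, v))
            cover.update((u, v))
    return cover
```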

Clustering

We have some data that we want to divide into groups ⇒ first define a notion of "distance", which has to satisfy the usual "metric" properties

  1. d(x,y)0d(x,y) \ge 0 for all x,yx,y
  1. d(x,y)=0d(x,y) = 0 iff x=yx=y
  1. d(x,y)=d(y,x)d(x,y) = d(y,x)
  1. d(x,y)d(x,z)+d(z,y)d(x,y) \le d(x,z) + d(z,y)
    1. Triangle inequality

The problem is NP-hard, but there’s an approximation algorithm:
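The algorithm itself is not reproduced in these notes; a sketch of the usual farthest-first-traversal heuristic that the guarantee below refers to (assuming `dist` is a metric; names are ours):

```python
def k_cluster_centers(points, k, dist):
    """Farthest-first traversal: pick an arbitrary first center, then repeatedly pick
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda x: min(dist(x, c) for c in centers))
        centers.append(farthest)
    return centers
```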

🧙🏽‍♂️
The resulting diameter is guaranteed to be at most twice optimal

Let $x \in X$ be the point farthest from $\mu_1, \dots, \mu_k$ (in other words the next center we would have chosen, if we wanted $k+1$ of them), and let $r$ be its distance to its closest center. Then every point in $X$ must be within distance $r$ of its cluster center. By the triangle inequality, this means every cluster has diameter at most $2r$.

we have identified k+1k + 1 points {μ1,μ2,,μk,x}\{\mu_1, \mu_2, \dots , \mu_k, x\} that are all at a distance at least rr from each other ⇒ Any partition into kk clusters must put two of these points in the same cluster and must therefore have diameter at least rr.

TSP

We saw that the triangle inequality helped make the k-CLUSTER problem approximable. In the TRAVELING SALESMAN PROBLEM, if the distances between cities satisfy the metric properties, then there is an algorithm that outputs a tour of length at most $1.5$ times optimal. We will look at a weaker result that achieves a factor of $2$.

Is there a structure that is easy to compute and plausibly related to the best traveling salesman tour? Turns out to be the minimum spanning tree.

Removing any edge from a traveling salesman tour leaves a path through all the vertices, which is a spanning tree, therefore

TSP costcost of this pathMST cost\text{TSP cost} \ge \text{cost of this path} \ge \text{MST cost}

Now an MST is not necessarily a tour. To turn it into one, use each edge twice: following the shape of the MST, we end up with a tour that visits all the cities, some of them more than once.

Therefore this tour has a length at most twice the MST cost

To fix the problem of visiting some cities more than once, the tour should simply skip any city it is about to revisit, and instead move directly to the next new city in its list (By the triangle inequality, these bypasses can only make the overall tour shorter).
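A sketch of this MST-doubling tour (using Prim's algorithm on a distance matrix `d`; names are ours):

```python
def tsp_2approx(d):
    """MST-doubling tour: build an MST with Prim's algorithm, then take a preorder walk
    of the tree (this is exactly "use each edge twice, then shortcut repeated cities").
    `d` is a symmetric matrix of metric distances; the tour returned costs <= 2 * OPT."""
    n = len(d)
    in_tree = [False] * n
    parent = [0] * n
    best = [float("inf")] * n
    best[0] = 0.0
    children = {v: [] for v in range(n)}
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if u != 0:
            children[parent[u]].append(u)   # record the MST edge parent[u] - u
        for v in range(n):
            if not in_tree[v] and d[u][v] < best[v]:
                best[v], parent[v] = d[u][v], u
    tour, stack = [], [0]                   # preorder DFS of the MST from the root
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour
```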

General TSP

The general TSP is much harder to approximate if the distances do not satisfy the triangle inequality!

Recall that we have a polynomial-time reduction from the RUDRATA problem that produces a TSP instance $I(G, C)$:

This means even an approximate solution to TSP would enable us to solve RUDRATA PATH!

So unless P=NPP = NP, there cannot exist an efficient approximation algorithm for the general TSP.

Knapsack

This approximation scheme: given any $\epsilon > 0$, it returns a solution of value at least $(1-\epsilon)$ times the optimal value, in time that scales only polynomially in the input size and in $1/\epsilon$.

We earlier had an $O(nW)$ dynamic-programming solution to the Knapsack problem. Using a similar technique, a running time of $O(nV)$ can also be achieved. Neither of these running times is polynomial, because $W$ and $V$ can be very large, exponential in the size of the input.

Let’s consider the O(nV)O(nV) algorithm. In the bad case when VV is large, what if we simply scale down all the values in some way?

Let’s see why this works. First of all, since the rescaled values v^i\hat{v}_i are all at most n/ϵn/\epsilon, the dynamic program is efficient, running in time O(n3/ϵ)O(n^3/\epsilon).

Now suppose the optimal solution to the original problem is to pick some subset of items SS, with total value KK^*, the rescaled value of this same assignment is

iSv^i=iSvinϵvmaxiS(vinϵvmax1)Knϵvmaxn\sum_{i \in S} \hat{v}_i = \sum_{i \in S} \lfloor v_i \cdot \frac{n}{\epsilon v_{\max}} \rfloor \ge \sum_{i \in S} (v_i \cdot \frac{n}{\epsilon v_{\max}} - 1) \ge K^* \cdot \frac{n}{\epsilon v_{\max}} -n

Therefore the optimal assignment for the shrunken problem, call it S^\hat{S}, has a rescaled value of at least this much. In terms of the original values, assignment S^\hat{S} has a value of at least

iS^viiS^v^iϵvmaxn(Knϵvmaxn)ϵvmaxn=KϵvmaxK(1ϵ)\sum_{i \in \hat{S}} v_i \ge \sum_{i \in \hat{S}} \hat{v}_i \cdot \frac{\epsilon v_{\max}}{n} \ge (K^* \cdot \frac{n}{\epsilon v_{\max}} - n) \cdot \frac{\epsilon v_{\max}}{n} = K^* - \epsilon v_{\max} \ge K^*(1-\epsilon)
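A sketch of the whole scheme (rescale, run the value-indexed DP on the rescaled instance, recover the set); the function and variable names are ours, and the DP is the straightforward $O(nV)$ table, giving $O(n^3/\epsilon)$ total time:

```python
import math

def knapsack_fptas(values, weights, W, eps):
    """Returns a set of item indices of total weight <= W and value >= (1 - eps) * OPT."""
    n = len(values)
    vmax = max(values)
    vhat = [math.floor(v * n / (eps * vmax)) for v in values]   # rescaled values <= n/eps
    V = sum(vhat)                                               # at most n^2 / eps
    INF = float("inf")
    # dp[i][v] = minimum weight of a subset of the first i items with rescaled value v
    dp = [[INF] * (V + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(1, n + 1):
        for v in range(V + 1):
            dp[i][v] = dp[i - 1][v]                             # skip item i
            if v >= vhat[i - 1]:
                take = dp[i - 1][v - vhat[i - 1]] + weights[i - 1]
                if take < dp[i][v]:
                    dp[i][v] = take                             # take item i
    best_v = max(v for v in range(V + 1) if dp[n][v] <= W)      # best feasible rescaled value
    chosen, v = [], best_v                                      # walk the table backwards
    for i in range(n, 0, -1):
        if dp[i][v] != dp[i - 1][v]:                            # item i must have been taken
            chosen.append(i - 1)
            v -= vhat[i - 1]
    return chosen[::-1]
```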

The approximability hierarchy

NP-Complete optimization problems are classified as follows (assuming PNPP \ne NP, if P=NPP = NP then all classes collapse into one):

Local Search Heuristics (DPV 9.3)

For a minimization problem, "try out" nearby solutions and keep them if they improve on the current one

On each iteration, the current solution is replaced by a better one close to it, in its neighborhood. This neighborhood structure is something we impose upon the problem and is the central design decision in local search.

TSP with local search

Assume we have all interpoint distances between nn cities, giving a search space of (n1)!(n − 1)! different tours. What is a good notion of neighborhood?
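One standard answer is the 2-change neighborhood: remove two edges of the tour and reconnect the two resulting paths the other way. A sketch of local search with this neighborhood (names are ours; `d` is a distance matrix and `tour` a starting permutation):

```python
def two_change_local_search(d, tour):
    """Keep replacing tour edges (a,b), (c,e) by (a,c), (b,e) whenever that is shorter."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                a, b = tour[i], tour[i + 1]
                c, e = tour[j], tour[(j + 1) % n]
                if a == e:                       # the two removed edges must be disjoint
                    continue
                if d[a][c] + d[b][e] < d[a][b] + d[c][e]:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])   # reconnect
                    improved = True
    return tour
```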

Graph Partitioning

Removing the restriction on the sizes of AA and BB would give the MINIMUM CUT problem, which can be efficiently solved using flow techniques.

However, this present variant is NP-hard.

Let (A,B)(A,B) with A=B|A| = |B| be a candidate solution; We will define its neighbors to be all solutions obtainable by swapping one pair of vertices across the cut: all solutions of the form (A{a}+{b},B{b}+{a})(A - \{a\} + \{b\}, B - \{b\} + \{a\}) where aA,bBa \in A, b \in B.

How do we improve local search procedures?

Dealing with Local Optima

Randomization and restarts

Randomization

When there are many local optima, randomization is a way of making sure that there is at least some probability of getting to the right one.

Simulated Annealing

As the problem size grows, the ratio of bad to good local optima often increases, sometimes to the point of being exponentially large. In such cases, simply repeating the local search a few times is ineffective.

Randomized Algorithms

We now study randomized algorithms instead of deterministic ones. A randomized algorithm $A$ can be seen as a regular algorithm that takes two inputs: the usual input $x$ and a uniformly random string $r \in \{0,1\}^R$ for some $R$. We generally categorize randomized algorithms into two classes: Las Vegas algorithms, which are always correct but whose running time is a random variable (e.g. QuickSort below), and Monte Carlo algorithms, which have bounded running time but may return a wrong answer with small probability (e.g. Freivalds' and Karger's algorithms below).

Primality Testing (DPV 1.3)

Is there some test that tells us whether a number is prime without factoring out the number?

Fermat’s Little Theorem (assuming pp is prime)

ap11modpa^{p-1} \equiv 1 \mod p

Proof:

a=3, p=7

Let SS be the nonzero integers modulo pp; that is, S={1,2,,p1}S = \{1, 2, \dots , p − 1\}. One crucial observation: the effect of multiplying these numbers by a (modulo p) is simply to permute them.

Let’s expand this equivalence to a set

{1,2,,6}={3×1mod7,3×2mod7,,3×6mod7}\{1,2,\dots,6\} = \{3\times 1 \mod 7, 3 \times 2 \mod 7, \dots, 3 \times 6 \mod 7\}

If we multiply everything in the set together,

6!366!mod76! \equiv 3^6 \cdot 6! \mod 7

Dividing each side by $6!$, we get exactly Fermat's Little Theorem

361mod73^6 \equiv 1 \mod 7

We’ll prove that when the elements of S are multiplied by aa modulo pp, the resulting numbers are all distinct and nonzero. And since they lie in the range [1, p − 1], they must simply be a permutation of S.

The numbers $a \cdot i \mod p$ are distinct, because if $a \cdot i \equiv a \cdot j \mod p$, then dividing both sides by $a$ gives $i \equiv j \mod p$.

They are nonzero because ai0a \cdot i \equiv 0 similarly implies i0i \equiv 0. (And we can divide by aa, because by assumption it is nonzero and therefore relatively prime to pp)

Therefore we can write SS

S={1,2,,p1}={a1modp,a2modp,,a(p1)modp}S = \{1, 2, \dots , p − 1\} = \{a \cdot 1 \mod p, a \cdot 2 \mod p, \dots , a \cdot (p − 1) \mod p\}

Multiply each element in SS

(p1)!ap1(p1)!modp(p-1)! \equiv a^{p-1}(p-1)! \mod p

Dividing by (p1)!(p − 1)! (which we can do because it is relatively prime to p, since p is assumed prime) then gives the theorem.

So Fermat’s theorem suggests a “factorless” test
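A sketch of that test (our own code; `trials` controls the error probability, and Carmichael numbers, discussed below, can still fool it):

```python
import random

def fermat_test(N, trials=20):
    """Pick random bases a and check a^(N-1) == 1 (mod N). A composite non-Carmichael N
    fails for at least half of all bases, so the error probability is <= 2^-trials."""
    if N < 4:
        return N in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, N - 1)
        if pow(a, N - 1, N) != 1:
            return False      # definitely composite
    return True               # probably prime
```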

The problem is that Fermat’s theorem is not an if-and-only-if condition; it doesn’t say what happens when N is not prime, so in these cases the preceding diagram is questionable. In fact, it is possible for a composite number N to pass Fermat’s test (that is, aN11modNa^{N −1} \equiv 1 \mod N) for certain choices of aa.

It turns out that certain extremely rare composite numbers N , called Carmichael numbers, pass Fermat’s test for all aa relatively prime to NN . On such numbers our algorithm will fail; but they are pathologically rare, and we will later see how to deal with them.

We will ignore those Carmichael numbers for now; we will show that a composite (non-Carmichael) $N$ fails Fermat's test for at least half of the possible values of $a$.

Lemma: If aN1≢1modNa^{N −1} \not\equiv 1 \mod N for some aa relatively prime to NN, then it must hold for at least half the choices of a<Na < N.

Proof: Fix some value of aa for which aN1≢1modNa^{N-1} \not\equiv 1 \mod N, the key is to notice that every element b<Nb < N that passes Fermat’s test with respect to NN has a twin, aba \cdot b that fails the test:

(ab)N1aN1bN1aN1≢1modN(a \cdot b)^{N-1} \equiv a^{N-1} \cdot b^{N-1} \equiv a^{N-1} \not\equiv 1 \mod N

Moreover, all these elements aba \cdot b, for fixed aa but different choices of bb, are distinct, for the same reason ai≢aja \cdot i \not\equiv a \cdot j in the proof of Fermat’s test: just divide by aa.

The one-to-one function babb \rightarrow a \cdot b shows that at least as many elements fail the test as pass it

Therefore the algorithm in Figure 1.7 has the following property

Generating Random Primes

Let π(x)\pi(x) be the number of primes x\le x, then π(x)xlnx\pi(x) \approx \frac{x}{\ln x}, more precisely

limxπ(x)x/ln(x)=1\lim_{x \to \infin} \frac{\pi(x)}{x / \ln(x)} = 1

Such abundance makes it simple to generate a random n-bit prime:
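A sketch of this generate-and-test loop (our own code, using the Fermat test inline; assumes `n_bits` is reasonably large):

```python
import random

def random_prime(n_bits):
    """A random n-bit number is prime with probability about 1/(n ln 2), so repeatedly
    sampling and primality-testing succeeds after O(n) tries in expectation."""
    while True:
        cand = random.getrandbits(n_bits) | (1 << (n_bits - 1)) | 1   # force n bits, odd
        if all(pow(random.randrange(2, cand - 1), cand - 1, cand) == 1 for _ in range(20)):
            return cand
```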

Carmichael numbers

The smallest Carmichael number is $561 = 3 \cdot 11 \cdot 17$; it is not a prime, yet it fools the Fermat test, because $a^{560} \equiv 1 \mod 561$ for all values of $a$ relatively prime to $561$. For a long time it was thought that there might be only finitely many numbers of this type, but we now know there are infinitely many of them, though they are exceedingly rare.

There is a way around Carmichael numbers, using a slightly more refined primality test due to Rabin and Miller.

Write $N-1$ in the form $2^t u$ with $u$ odd. We will choose a random base $a$ and check the value of $a^{N-1} \mod N$. Perform this computation by first determining $a^u \mod N$ and then repeatedly squaring, to get the sequence:

aumodN,a2umodN,,a2tu=aN1modNa^u \mod N, a^{2u} \mod N, \dots, a^{2^tu} = a^{N-1} \mod N

If aN1≢1modNa^{N-1} \not\equiv 1 \mod N, then NN is composite by Fermat’s little theorem, and we’re done. If aN11modNa^{N-1} \equiv 1 \mod N, we conduct a little follow-up test
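The follow-up test checks where the sequence first becomes $1$: for a prime $N$, the value just before the first $1$ must be $-1 \pmod N$. A sketch of one round (our own code; assumes $N$ is odd and greater than $4$):

```python
import random

def rabin_miller_round(N):
    """One round of the refined test: write N - 1 = 2^t * u with u odd, then examine
    the sequence a^u, a^(2u), ..., a^(N-1) mod N for a random base a."""
    u, t = N - 1, 0
    while u % 2 == 0:
        u //= 2
        t += 1
    a = random.randrange(2, N - 1)
    x = pow(a, u, N)                    # a^u mod N
    if x == 1 or x == N - 1:
        return True                     # N passes this round
    for _ in range(t - 1):
        x = pow(x, 2, N)                # repeated squaring toward a^(N-1)
        if x == N - 1:
            return True
    return False                        # N is definitely composite
```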

Quicksort
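The notes refer to "steps 1 through 3" of QuickSort without reproducing them; a minimal sketch of the randomized version being analyzed (uniformly random pivot, split into $L$ and $R$, recurse):

```python
import random

def quicksort(A):
    """Randomized QuickSort sketch: pick a uniformly random pivot, compare every other
    element to it to form L (smaller) and R (larger), and recurse on both sides."""
    if len(A) <= 1:
        return A
    pivot = random.choice(A)
    L = [x for x in A if x < pivot]
    E = [x for x in A if x == pivot]
    R = [x for x in A if x > pivot]
    return quicksort(L) + E + quicksort(R)
```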

The key insight is that the runtime t(A,r)t(A, r) is, up to a constant factor, equal to the number of comparisons performed by QuickSort.

This is because for every recursive call on some array of size, say, ll, the total time spent at that level of recursion (steps 1 through 3 above) is Θ(l)\Theta(l), whereas the number of comparisons is also l1=Θ(l)l − 1 = \Theta(l).

Furthermore, for any pair of indices ij{1,,n}i \ne j \in \{1, \dots , n\}, A[i]A[i] is compared to A[j]A[j] either 00 or 11 times throughout the entire run of the algorithm; this is because for two items to be compared, one of them had to have been the pivot, in which case the non-pivot item is then put into either L or R and never has a chance to be compared with the pivot again.

Thus if we define an indicator random variable Xi,jX_{i,j} which is 11 if the i-th smallest element is ever compared with the j-th smallest and 00 otherwise, we have that

t(A,r)Ci<jXi,j t(A,r) \le C \sum_{i<j} X_{i,j}

where CC is a constant

Applying linearity of expectation,

Ert(A,r)Er[Ci<jXi,j]=Ci<jErXi,j\mathbb{E}_r t(A,r) \le \mathbb{E}_r [C \sum_{i < j} X_{i,j}] = C\sum_{i<j} \mathbb{E}_r X_{i,j}

Let’s try to bound ErXi,j\mathbb{E}_r X_{i,j}

What is the probability that the ith and jth smallest elements (call them ai,aja_i, a_j) are ever compared? If we trace the QuickSort recursion tree from the root, both elements start off being in the same recursive subproblem, then when the root recursion node picks a random pivot, if that pivot is either <ai< a_i or >aj> a_j then aia_i and aja_j either both go to LL or both go to RR (i.e. stay in the same recursive subproblem). Otherwise, if the pivot is in S={ai,ai+1,,aj}S =\{a_i, a_{i+1}, \dots , a_j \}, then they do not go to the same recursive subproblem.

Thus what decides whether Xi,jX_{i,j} equals 0 or 1 is the following: it is equal to 1 iff the first time a pivot is picked in the set S, it is picked to be either aia_i or aja_j ; otherwise Xi,j=0X_{i,j} = 0 since ai,aja_i, a_j will be sent separately to LL and RR and never have a chance to be compared again. Thus P(Xi,j=1)=2/(ji+1)\mathbb{P}(X_{i,j} = 1) = 2/(j − i + 1).

Ert(A,r)Ci<j2ji+1\mathbb{E}_r t(A,r) \le C \sum_{i<j} \frac{2}{j-i+1}

Doing some algebraic manipulation

\begin{split} \mathbb{E}_r t(A,r) &\le C \sum_{i<j} \frac{2}{j-i+1} \\ &= C\sum_{i=1}^{n-1} \sum_{j=i+1}^n \frac{2}{j-i+1} \\ &= 2C \sum_{k=1}^{n-1} \frac{k}{n-k+1} \\ &\le 2C n \sum_{k=1}^{n-1} \frac{1}{n-k+1} \quad (\text{since $k\le n$}) \\ &= 2Cn\sum_{t=2}^n \frac{1}{t} \\ &\le 2C n \int_{1}^n \frac{1}{t} dt \\ &=2C n\ln n \end{split}

We proved expected runtime, but can we prove that the runtime is low with high probability?

Recall Markov’s inequality (With ZZ as a nonnegative random variable)

λ>0,P(Z>λ)<E[Z]λ\forall \lambda > 0, \mathbb{P}(Z > \lambda) < \frac{\mathbb{E} [Z]}{\lambda}

Therefore for the runtime $Z$ of QuickSort, $\mathbb{E} Z < C n \log n$; since $Z$ is nonnegative (algorithms cannot run in negative time), we can apply Markov's inequality: $\mathbb{P}(Z > 200\, C n \log n) < 1/100$. Thus with at least 99% probability the runtime is $O(n \log n)$.

Freivalds’ Algorithm

How do we verify computation correctness?

Freivalds’ algorithm specifically focuses on verifying matrix-matrix multiplication. That is, we have two matrices A,BRn×nA, B \in \mathbb{R}^{n \times n} and would like to know C=ABC = A B. We could of course multiply these matrices ourselves in Θ(n3)\Theta(n^3) time (or faster using Strassen or more recent algorithms), but if we upload A,BA, B to the cloud which returns some CC to us, is there a way to verify that C=A×BC = A \times B much faster than doing the multiplication ourselves from scratch?

Answer is yes using randomization: pick a uniformly random x{0,1}nx \in \{0, 1\}^n and check whether ABx=CxABx = Cx.

🧙🏽‍♂️
Note this check can be implemented in O(n2)O(n^2) time instead of O(n3)O(n^3) time via associativity, since ABx=A(Bx)ABx = A(Bx)
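A sketch of the check (our own code on plain Python lists; `rounds` repetitions drive the error probability down, as discussed below):

```python
import random

def freivalds_check(A, B, C, rounds=1):
    """Randomized check of C == A*B: pick x in {0,1}^n uniformly and compare A(Bx)
    with Cx, which takes only O(n^2) time per round."""
    n = len(A)
    def matvec(M, x):
        return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    for _ in range(rounds):
        x = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, x)) != matvec(C, x):
            return False          # definitely C != A*B
    return True                   # wrong with probability <= (1/2)^rounds when C != A*B
```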

Claim: If C=A×BC = A \times B, then Px(ABx=Cx)=1\mathbb{P}_x (ABx = Cx) = 1, otherwise Px(ABx=Cx)1/2\mathbb{P}_x (ABx = Cx) \le 1/2

Proof:

For the second half (assuming CA×BC \ne A \times B), let’s define D=ABCD = AB - C

Then D0D \ne 0 and in particular its ith column DiD_i is nonzero for some ii.

Now suppose $ABz = Cz$ for some particular $z \in \{0,1\}^n$, i.e. $Dz = 0$. Then if $z'$ is $z$ with the $i$-th entry flipped, we have $Dz' \ne 0$, since $Dz' = D(z \pm e_i) = Dz \pm D e_i = \pm D_i \ne 0$, where $e_i$ is the $i$-th standard basis vector.

Now let’s pair all the 2n2^n elements of {0,1}n\{0,1\}^n into 2n12^{n-1} pairs, where each pair differs in only the i-th entry. Then we see that in each pair (z,z)(z, z’) at most one of DzDz or DzDz’ can equal zero, but not both. Thus {z{0,1}n:Dz=0}2n1|\{z \in \{0,1\}^n : Dz = 0 \} | \le 2^{n-1}, leading to the probability for the second half.


If we want to reduce our probability of returning an incorrect answer to some desired failure probability pp, we can for example set N=log21/pN = \lceil \log_2 1/p \rceil and pick NN random vectors x1,,xN{0,1}nx_1, \dots, x_N \in \{0,1\}^n independently. We then declare A×B=CA \times B = C iff ABxi=Cxi,i{1,,N}ABx_i = Cx_i, \forall i \in \{1, \dots, N\}, then the probability we fail is at most (1/2)Np(1/2)^N \le p. Our runtime would be Θ(n2log(1/p))\Theta(n^2 \log(1/p)).

Karger’s Global Minimum Cut Algorithm (Contraction Algorithm)

Given a (possibly weighted) undirected graph $G$, we would like to find a cut of smallest size. A cut is a partition of the vertices $V$ into two non-empty sets $S, \bar{S}$.

The size of the cut is defined to be (u,v)E(S×Sˉ)w(u,v)\sum_{(u,v) \in E \cap (S \times \bar{S})} w(u,v). If unweighted it would be E(S×Sˉ)|E \cap (S \times \bar{S})|.

For the Contraction algorithm we will focus only on the unweighted case (but it is possible to adapt the algorithm to the weighted case as well). We assume the graph is connected ($e \ge n-1$), because otherwise we can find a cut of size $0$ using DFS or BFS, by letting $S$ be one connected component and $\bar{S}$ be everything else.

First note that maximum flow can be used to solve global min-cut deterministically. A global minimum cut has to separate vertex $1$ from some other vertex $t$, though we do not know a priori which one, so we set $s = 1$ and loop over all possibilities $t \in \{2, \dots, n\}$. For each possibility we compute the $s$-$t$ min-cut as discussed in earlier lectures. In an unweighted graph the capacities are all $1$, so the value of the max flow / min cut is at most $n-1$; thus each min-cut computation takes $O(mn)$ time using Ford-Fulkerson. Since we do $n$ such $s$-$t$ min-cut computations, the total time is $O(mn^2)$.

We now analyze Karger's simpler Contraction algorithm.

The contract operation on e=(u,v)e = (u,v)
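A sketch of the whole algorithm (our own code: vertices are $0, \dots, n-1$, contraction is tracked with union-find, and edges that become self-loops are discarded):

```python
import random

def contraction(edges, n):
    """Karger's Contraction sketch: contract a uniformly random remaining edge until
    only two super-vertices remain; the surviving edges form the output cut.
    Assumes a connected, simple, unweighted graph on vertices 0..n-1."""
    parent = list(range(n))                 # union-find: super-vertex of each vertex
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    current = list(edges)
    for _ in range(n - 2):                  # each contraction removes one vertex
        u, v = random.choice(current)       # contract a uniformly random edge
        parent[find(u)] = find(v)
        # drop edges that became self-loops inside the merged super-vertex
        current = [(a, b) for (a, b) in current if find(a) != find(b)]
    return current                          # the edges crossing the cut (S, S-bar) found
```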


Lemma: For any graph $G$, fix a particular global minimum cut $(S, \bar{S})$; then the probability that $Contraction(G)$ returns this particular cut is at least $1 / {n \choose 2}$.

Proof:

The Contraction algorithm will return $(S, \bar{S})$ iff no edge crossing the cut is ever contracted. Let $k$ be the number of edges crossing the cut, and let $e_i$ and $n_i$ denote the number of edges and vertices in the graph during the $i$th iteration of the for loop, $i = 1, 2, \dots, n-2$ ($n_i$ is just $n-i+1$, while $e_i$ is a random variable that depends on what has been contracted so far). Then the probability that we never contract an edge crossing this cut is precisely

i=1n2(1kei)\prod_{i=1}^{n-2} (1- \frac{k}{e_i})

Now we get a handle on the $e_i$: contracting an edge cannot decrease the minimum cut value. Thus for each intermediate graph throughout the run of the algorithm, the minimum cut value is always $\ge k$, and hence so is every vertex degree (a single vertex against the rest is itself a cut).

In the $i$th iteration, $\sum_{\nu = 1}^{n_i} d_{\nu} = 2e_i$, where $d_\nu$ is the degree of vertex $\nu$ in the current graph, but we just argued that $\sum_{\nu = 1}^{n_i} d_{\nu} \ge \sum_{\nu = 1}^{n_i} k = (n-i+1)k$

Thus

ei(ni+1)k/2e_i \ge (n-i+1)k/2

Therefore we can plug this bound into our product formula: the probability of returning the cut $(S, \bar{S})$ is at least

i=1n2(12ni+1)=(12n)(12n1)2413=(n2)!2!n!=1(n2)\prod_{i=1}^{n-2}(1- \frac{2}{n-i+1}) = (1-\frac{2}{n})(1-\frac{2}{n-1}) \cdots \frac{2}{4} \frac{1}{3} = \frac{(n-2)!2!}{n!}=\frac{1}{n \choose 2}

This also implies that any graph has at most (n2)n \choose 2 minimum cuts.

This is in fact tight: the cycle on nn vertices has exactly (n2)n \choose 2 global min-cuts: pick an arbitrary two edges to cut and let S,SˉS, \bar{S} be the two distinct arcs that are disconnected by removing those two edges.

Furthermore, it is possible to implement the Contraction algorithm in $O(m)$ time, but we will not give the details here.

However, the bound given by the Lemma seems a bit weak: we want a success probability much larger than $1/{n \choose 2}$, so can we do better than that?

We run the algorithm $N = \lceil {n \choose 2} \cdot \ln(1/p) \rceil$ times independently and return the smallest cut ever found. To analyze this, we make use of the following useful fact from calculus, based on the convexity of the function $e^x$.

Fact 5: For all xR,1+xexx \in \mathbb{R}, 1+x \le e^x

Proof: Define $f(x) = e^x$. Taylor's theorem gives $e^x = 1 + x + f''(y)y^2/2$ for some $y \in [0,x]$. But note that $f''(y) = e^y > 0$ for all $y \in \mathbb{R}$, so $f''(y)y^2/2 \ge 0$, thus $e^x = 1 + x + f''(y)y^2/2 \ge 1+x$.

The probability that we never find the global min cut, when repeating the algorithm $N$ times, is at most

(11/(n2))N(e1/(n2))N=eN/(n2)p\begin{split} (1 - 1/{n \choose 2})^N &\le (e^{-1/{n\choose 2}})^N \\ &=e^{-N/{n \choose 2}} \\ & \le p \end{split}

Overall runtime is $O(mN) = O(mn^2 \log(1/p))$ to succeed with probability $1-p$. This is comparable to the max-flow-based approach using Ford-Fulkerson (though we pick up a $\log(1/p)$ factor), but is much simpler.

Streaming Algorithms

  1. Make one pass over large dataset to support future queries
    1. We want a very low-memory data structure, ideally much smaller than $O(n)$
  1. Dynamic data structures

Types of problems

  1. Counting problems
  1. graph problems
    1. find a uvu \rightarrow v path
  1. frequency-based problems

Counter NN

init():
	N = 0

incr():
	N++

query():
	return N

Takes O(logN)O(\log N) bits

Can we do better?

Approximate Counting

🔥
Settle for approximate counting

Can we remember the exponent instead of the actual number?

2j1<N2j2^{j-1} < N \le 2^j

And now we only need to take O(logj)O(\log j) bits, which is O(loglogN)O(\log \log N)

What if instead of using a base of 2, we use a base of 1+ϵ1+\epsilon?

j = \lceil \log_{1+\epsilon} N \rceil, \quad \log j = \log \log_{1+\epsilon} N = \log \left(\frac{\log_2 N}{\log_{2} (1+\epsilon)}\right)

Note: If $\epsilon \ll 1$, then $\log_2 (1+\epsilon) \approx \epsilon / \ln 2 = \Theta(\epsilon)$

Therefore to remember jj, it takes around O(loglogN+log1ϵ)O(\log \log N + \log \frac{1}{\epsilon})

But how would we increment since we don’t know NN?

🔥
With each call to incr(), we increment the stored exponent $X$ with probability $1/(1+\epsilon)^X$

LEMMA: Let $X_n$ be the value of $X$ after $n$ calls to incr(); then $\mathbb{E}((1+\epsilon)^{X_n}) = \epsilon n + 1$ (for the base-2 counter, $\epsilon = 1$, this is $n+1$)

Proof (by induction on nn):

base case: $n=0$: $\mathbb{E}[(1+\epsilon)^{X_0}] = (1+\epsilon)^0 = 1 = \epsilon \cdot 0 + 1$

\begin{split} \mathbb{E}[(1+\epsilon)^{X_{n+1}}] &= \sum_{j=0}^\infin \mathbb{P}(X_n=j) \mathbb{E}[(1+\epsilon)^{X_{n+1}} |X_n = j] \\ &=\sum_{j=0}^\infin \mathbb{P}(X_n=j)\left[\frac{1}{(1+\epsilon)^j}(1+\epsilon)^{j+1}+\left(1-\frac{1}{(1+\epsilon)^j}\right)(1+\epsilon)^{j}\right] \\ &=\sum_{j=0}^\infin \mathbb{P}(X_n=j)[\epsilon + (1+\epsilon)^j] \\ &=\epsilon\underbrace{\sum_{j=0}^\infin \mathbb{P}(X_n=j)}_{1} + \underbrace{\sum_{j=0}^\infin \mathbb{P}(X_n=j) (1+\epsilon)^j}_{\mathbb{E}[(1+\epsilon)^{X_n}] = \epsilon n + 1 \text{ by induction}} \\ &=\epsilon(n+1)+1 \end{split}

So when query() is called we return $\hat{N}=\frac{(1+\epsilon)^X - 1}{\epsilon}$, which by the lemma is an unbiased estimate of $n$ (for $\epsilon = 1$ this is just $2^X - 1$).
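A sketch of this approximate counter (our own code, using the estimator just derived; for $\epsilon = 1$ it is the classic base-2 Morris counter):

```python
import random

class ApproxCounter:
    """Store only the exponent X; increment happens with probability (1+eps)^(-X),
    so E[(1+eps)^X] = eps*n + 1 after n calls to incr()."""
    def __init__(self, eps=1.0):
        self.eps = eps
        self.X = 0
    def incr(self):
        if random.random() < (1.0 + self.eps) ** (-self.X):
            self.X += 1
    def query(self):
        # unbiased estimate of the true count n
        return ((1.0 + self.eps) ** self.X - 1.0) / self.eps
```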

But what about the error? We can bound

\mathbb{P}(|\hat{N} - \mathbb{E}\hat{N}| \ge \lambda) \le \frac{Var(\hat{N})}{\lambda^2} = \frac{\frac{3}{2}N^2 + \frac{3}{2}N + 1 - (N+1)^2}{\lambda^2} = \frac{\frac{1}{2}N^2 - \frac{1}{2}N}{\lambda^2} \quad \text{(Chebyshev's inequality, shown for the base-2 case $\epsilon = 1$, where $\mathbb{E}[2^{2X_n}] = \frac{3}{2}n^2+\frac{3}{2}n+1$)}

Where λ=αEN^\lambda = \alpha \mathbb{E} \hat{N}

Optimal choice:

ϵα2log(P[N^EN^λ]1)\epsilon \approx \frac{\alpha^2}{\log(\mathbb{P}[|\hat{N} - \mathbb{E} \hat{N}| \ge \lambda]^{-1})}

Randomized Counting

init():
	X = 0

incr():
	if rand() is even:
		X++

query():
	return 2X

Distinct Counting

Return number of distinct elements in stream that consist of ints picked from {1,,n}\{ 1, \dots, n \}

Hashed Distinct Counting

  1. Pick random hash function h:{1,,n}[0,1]h: \{1, \dots, n\} \rightarrow [0,1]
    1. Assume hh is independent and uniformly random
  1. Initialize $X = 1$
    1. For each ii in stream,
      1. X=min(X,h(i))X = \min (X, h(i))
    1. $\mathbb{E}[X] = \mathbb{E}[\min_{i \in \text{stream}} h(i)] = \frac{1}{t+1}$, where $t$ is the number of distinct elements
  1. when query(): return 1X1\frac{1}{X} - 1
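A sketch of this scheme (our own code; the idealized random hash $h$ is simulated with a lazily filled table, whereas a real implementation would use a small pseudorandom hash function so that memory stays tiny):

```python
import random

class DistinctCounter:
    """Min-of-hashes sketch: track the smallest hash value seen; with t distinct
    elements, E[min] = 1/(t+1), so 1/min - 1 estimates t."""
    def __init__(self):
        self.h = {}          # stand-in for an idealized random hash {1..n} -> [0,1]
        self.X = 1.0         # smallest hash value seen so far
    def process(self, i):
        if i not in self.h:
            self.h[i] = random.random()
        self.X = min(self.X, self.h[i])
    def query(self):
        return 1.0 / self.X - 1.0
```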
🔥
But the smallest hash can be very sensitive to small turbulence

Industry standard: Hyperloglog

Lower bounds

Prove that any algorithm for a given problem has to take $\Omega(\cdot)$ time

Time Hierarchy Theorem (CS172)

For any $k > 0$, there exist problems requiring $\Omega(n^k)$ time.

e.g.

g(P,x)={Trueif P terminates on x in xk timeFalseOtherwiseg(P,x) = \begin{cases} \text{True} &\text{if $P$ terminates on $x$ in $|x|^k$ time} \\ \text{False} &\text{Otherwise} \end{cases}
🔥
For all the problems studied in this class, it is still possible, as far as we can prove, that they all have $O(n)$-time solutions

Circuit Complexity

Boolean circuit made up of AND, OR, NOT, each gate taking two inputs and having one output, and input bits x1,,xnx_1, \dots, x_n

Since an algorithm in the standard computation model can be converted into an efficient circuit,

lower bounds on circuit size ⇒ lower bounds on standard algorithms

But what does “efficient circuit” mean? Can it be quantified?

  1. circuit sizes ss: # of wires
    1. # circuits of size ss is 2O(slogs)2^{O(s \log s)}
    1. # functions f:{0,1}n{0,1}f: \{0,1\}^n \rightarrow \{0,1\} is 22n2^{2^n}
      1. The vast majority of problems need large circuits ⇒ large runtime
      1. There exists an explicit problem requiring circuits of size (3+186)n\ge (3+\frac{1}{86})n (2016)
  1. Depth dd of the circuit

Uniform Circuits:

There exists an algorithm $A$ s.t. $A(n)$ outputs the circuit $C_n$ efficiently

Multiplication

MULTn:{0,1}n×{0,1}n{0,1}2nMULT_n: \{0,1\}^n \times \{0,1\}^n \rightarrow \{0,1\}^{2n}

Spectral Graph Theory

Understanding properties of a graph by looking at the eigenvalues/eigenvectors of its associated matrices

example:

(1) adjacency matrix

Au,v={1if (u,v)E(G)0otherwiseA_{u,v} = \begin{cases} 1 &\text{if } (u,v) \in E(G) \\ 0 &\text{otherwise} \end{cases}

(2) Graph Laplacian LL

diag(L)=[d1d2dn]Rndiag(L) = \begin{bmatrix} d_1 &d_2 &\cdots &d_n \end{bmatrix}^\top \in \mathbb{R}^{n}

did_i ⇒ degree of vertex ii (# of edges connected to ii)

Lu,v=Au,vuvL_{u,v} = -A_{u,v} \forall u \ne v

"unnormalized Laplacian"

L=DAL = D-A

We also have “normalized laplacian”

Lˉ=D1/2LD1/2=ID1/2AD1/2\bar{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2}AD^{-1/2}

Random Walks

Do tt steps of random walk from vertex ii

Let ei\vec{e}_i be the standard basis vector for vertex ii

(ei)j={1if j=i0otherwise(\vec{e}_i)_j = \begin{cases} 1 &\text{if } j=i \\ 0 &\text{otherwise} \end{cases}

Basically, having x=ei\vec{x} = \vec{e}_i means we’re definitely at vertex ii

A random-walk step takes a probability distribution $x$ on $V$ to $Wx$; we write $x \rightarrow Wx$

We see:

W\vec{x} = \sum_i x_i \vec{w}_i, \quad \text{where $\vec{w}_i$ is the $i$-th column of $W$ (the distribution after one step starting from vertex $i$)}

So

W=AD1W = AD^{-1}
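A small numerical sketch of these definitions (our own example, a path graph on 3 vertices):

```python
import numpy as np

# Path graph 1 - 2 - 3 (vertices indexed 0, 1, 2)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                     # adjacency matrix
D = np.diag(A.sum(axis=1))                       # degree matrix
L = D - A                                        # unnormalized Laplacian
L_norm = np.linalg.inv(np.sqrt(D)) @ L @ np.linalg.inv(np.sqrt(D))   # normalized Laplacian
W = A @ np.linalg.inv(D)                         # random-walk matrix W = A D^{-1}

x = np.array([1., 0., 0.])                       # e_1: start the walk at vertex 1
print(W @ x)                                     # distribution after one step: [0, 1, 0]
```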

Note:

xLx=u,vxuxvLu,v=(u,v)E(xuxv)2\begin{split} \vec{x}^\top L \vec{x} &= \sum_{u,v} x_u x_v L_{u,v} \\ &=\sum_{(u,v) \in E} (x_u - x_v)^2 \\ \end{split}

We can use this to calculate the cost of having a cut

(xs)u={1if uS1if uS(\vec{x}_s)_u = \begin{cases} 1 &\text{if } u \in S \\ -1 &\text{if } u \notin S \end{cases}