CS 170 Class Notes

Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley CS170.

Reference Notice: Material highly and mostly derived from the DPV textbook

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Big-O Notation (DPV 0)

Sloppiness in the analysis of running times can lead to an unacceptable level of inaccuracy in the result
Also possible to be too precise - the time taken by one such step depends crucially on the particular processor and details such as caching strategy.

🔥
Instead of reporting that an algorithm takes, say, 5n^3+4n+3 steps on an input of size n, it is much simpler to leave out lower-order terms and just say that the algorithm takes time O(n^3) - big O of n^3.

Let f(n) and g(n) be functions from positive integers to positive reals. We say f = O(g) - f grows no faster than g - if there is a constant c>0 such that f(n) \le c \cdot g(n) for all n.

Just as O(\cdot) is an analog of \le, we can also define the analogs of \ge and = as follows:

f=\Omega(g) \text{ means } g = O(f) \\ f = \Theta(g) \text{ means } f = O(g) \text{ and } f = \Omega(g)

Reduction (DPV 7.1)

If any subroutine for task QQ can also be used to solve PP , we say PP reduces to QQ. Often, PP is solvable by a single call to QQ’s subroutine, which means any instance xx of PP can be transformed into an instance yy of QQ such that P(x)P (x) can be deduced from Q(y)Q(y)

Reductions enhance the power of algorithms: Once we have an algorithm for problem Q (which could be shortest path, for example) we can use it to solve other problems.

Graphs

How is a graph represented?

Adjacency matrix

We represent a graph by an adjacency matrix, let there be n=Vn = |V| vertices v1,,vnv_1, \dots, v_n

a_{i,j} = \begin{cases} 1 \quad \text{if there is an edge from $v_i$ to $v_j$} \\ 0 \quad \text{otherwise} \end{cases}

For an undirected graph the matrix is symmetric since an edge {u,v}\{u, v\} can be taken in either direction

The presence of a particular edge can be checked in constant time, but the matrix takes O(n^2) space.

Adjacency List

With size proportional to the number of edges, consists of V|V| linked lists, one per vertex. The linked list for vertex uu holds the names of vertices to which uu has an outgoing edge.

Total size of the data structure is O(|E|), but checking for an edge is no longer constant time (it requires sifting through u's adjacency list)
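Not from the notes, just a minimal Python sketch of the two representations on a small made-up directed graph, to make the space/lookup trade-off concrete:

```python
# Minimal sketch: two representations of the same directed graph on n vertices.
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # hypothetical example edges

# Adjacency matrix: O(n^2) space, O(1) edge lookup.
adj_matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    adj_matrix[u][v] = 1

# Adjacency list: O(|E|) space, edge lookup scans u's list.
adj_list = [[] for _ in range(n)]
for u, v in edges:
    adj_list[u].append(v)

print(adj_matrix[0][2])   # 1, constant-time check
print(2 in adj_list[0])   # True, but requires scanning the list
```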

Algorithms

Fibonacci(DPV 0)

F_n = \begin{cases} F_{n-1}+F_{n-2} &\text{if } n>1, \\ 1 &\text{if } n = 1, \\ 0 &\text{if } n = 0 \end{cases}
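A minimal iterative sketch of this recurrence (the naive recursive version takes exponential time; this is just one standard linear-number-of-additions alternative, not pseudocode from the notes):

```python
def fib(n: int) -> int:
    """Return F_n using the recurrence F_n = F_{n-1} + F_{n-2}."""
    a, b = 0, 1  # F_0, F_1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```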

Basic Arithmetic (DPV 1.1)

🔥
How much does the size of a number change when we change bases? Recall the rule for converting logarithms from base a to base b: \log_b N = (\log_a N)/(\log_a b). So the size of integer N in base a is the same as its size in base b, times a constant factor \log_a b. In big-O notation, therefore, the base is irrelevant, and we write the size simply as O(\log N).

Incidentally, this function \log N appears repeatedly in our subject, in many guises. Here's a sampling:

  1. \log N is the power to which you need to raise 2 in order to obtain N.
  1. Going backward, \lceil \log N \rceil can also be seen as the number of times you must halve N to get down to 1. This is useful when a number is halved at each iteration of an algorithm.
  1. \lceil \log (N+1) \rceil is the number of bits in the binary representation of N.
  1. \lfloor \log N \rfloor is also the depth of a complete binary tree with N nodes.
  1. It is even the sum 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{N}, to within a constant factor.

Addition

Basic property of numbers: The sum of any three single-digit numbers is at most two digits long (holds for any base b ≥ 2).

This simple rule gives us a way to add two numbers in any base: align their right-hand ends, and then perform a single right-to-left pass in which the sum is computed digit by digit, maintaining the overflow as a carry. Since we know each individual sum is a two-digit number, the carry is always a single digit, and so at any given step, three single-digit numbers are added.

We will not spell out the pseudocode because we are so familiar with it. We will go ahead and analyze its efficiency: we want the answer expressed as a function of the size of the input: the number of bits of x and y, the number of keystrokes needed to type them in.

Suppose x and y are each n bits long; in this chapter we will consistently use the letter n for the sizes of numbers. Then the sum of x and y is n+1 bits at most, and each individual bit of this sum gets computed in a fixed amount of time. The total running time for the addition algorithm is therefore of the form c_0 + c_1 n, where c_0 and c_1 are some constants; in other words, it is linear. Instead of worrying about the precise values of c_0 and c_1, we will focus on the big picture and denote the running time as O(n).
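Although the notes skip the pseudocode, here is a small illustrative sketch of the bit-by-bit addition being analyzed (binary strings, most significant bit first; the function name and representation are made up for illustration):

```python
def add_binary(x: str, y: str) -> str:
    """Grade-school addition of two binary strings, one bit at a time (O(n))."""
    n = max(len(x), len(y))
    x, y = x.zfill(n), y.zfill(n)
    carry = 0
    out = []
    for i in range(n - 1, -1, -1):          # single right-to-left pass
        s = int(x[i]) + int(y[i]) + carry   # sum of three single digits
        out.append(str(s % 2))
        carry = s // 2                       # carry is always a single digit
    if carry:
        out.append('1')
    return ''.join(reversed(out))

print(add_binary('1101', '1011'))  # 11000 (13 + 11 = 24)
```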

Some readers may be confused at this point: Why O(n)O(n) operations? Isn’t binary addition something that computers today perform by just one instruction? There are two answers.

First, it is certainly true that in a single instruction we can add integers whose size in bits is within the word length of today’s computers—32 perhaps. But, as will become apparent later in this chapter, it is often useful and necessary to handle numbers much larger than this, perhaps several thousand bits long. Adding and multiplying such large numbers on real computers is very much like performing the operations bit by bit. Second, when we want to understand algorithms, it makes sense to study even the basic algorithms that are encoded in the hardware of today’s computers. In doing so, we shall focus on the bit complexity of the algorithm, the number of elementary operations on individual bits—because this accounting reflects the amount of hardware, transistors and wires, necessary for implementing the algorithm.

Multiplication

The grade-school algorithm for multiplying two numbers x and y is to create an array of intermediate sums, each representing the product of x by a single digit of y. These values are appropriately left-shifted and then added up. Suppose for instance that we want to multiply 13 \times 11, or in binary notation, x = 1101 and y = 1011.

If xx and yy are both nn bits, then there are nn intermediate rows, with lengths of up to 2n2n bits (taking the shifting into account). The total time taken to add up these rows, doing two numbers at a time, is:

\underbrace{O(n)+O(n)+\cdots +O(n)}_\text{$n-1$ times} = O(n^2)

But there are better ways to do this…

Division

Modular Arithmetic (DPV 1.2, unfinished)

Just like multiplication, division works on computers with limited bits.

Modular arithmetic is a system for dealing with restricted ranges of integers. We define x \text{ modulo } N to be the remainder when x is divided by N; that is, if x = qN + r with 0 \le r < N, then x \text{ modulo } N is equal to r. This gives an enhanced notion of equivalence between numbers: x and y are congruent modulo N if they differ by a multiple of N, or in symbols

x \equiv y \mod N \longleftrightarrow \text{$N$ divides $(x-y)$}

Another interpretation is that modular arithmetic deals with all the integers, but divides them into N equivalence classes, each of the form \{i + kN : k \in \mathbb{Z}\} for some i between 0 and N-1

🔥
Two’s complement uses a lot of modular arithmetic.

It uses n bits to represent numbers in the range [-2^{n-1}, 2^{n-1} - 1] and is usually described as follows:

  1. Positive integers, in the range [0, 2^{n-1} - 1], are stored in regular binary and have a leading bit of 0.
  1. Negative integers -x, with 1 \le x \le 2^{n-1}, are stored by first constructing x in binary, then flipping all the bits, and finally adding 1. The leading bit in this case is 1.

Here's a much simpler way to think about it: any number in the range -2^{n-1} to 2^{n-1} - 1 is stored modulo 2^n. Negative numbers -x therefore end up as 2^n - x. Arithmetic operations like addition and subtraction can be performed directly in this format, ignoring any overflow bits that arise.

Hashing (DPV 1.5)

Hash Tables

We’re talking about IPV4 here, IP: 0.0.0.0 ~ 255.255.255.255

We will have an array of size n=250n = 250 and each key kk is hashed by hash function h(k):kxh(k): k \rightarrow x where x{0,1,,n1}x \in \{0, 1, \dots, n-1\}. Each array element is a linked list

📌
To ensure that hash table operations have good performance: 1. have the size of the hash table be about twice as large as the number of items 2. assume that the size of the domain of all data items is N = n^k, a power of n; then each data item can be considered as a k-tuple of integers modulo n, and H = \{h_a : a \in \{0, \dots, n-1\}^k\} is a universal family of hash functions

Hash Functions

Let the number of buckets be n = 257 instead of n = 250; note that 257 is a prime number

Write any IP address x as a quadruple x = (x_1, \dots, x_4); we will fix any four numbers a_1, a_2, a_3, a_4 modulo n and map the address x to h(x) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \mod 257.

We have defined the hash function

h_a(x_1, \dots, x_4) = \sum_{i=1}^4 a_i \cdot x_i \mod n

With this hash function,

Property: Consider any pair of distinct IP addresses x = (x_1, \dots, x_4), y = (y_1, \dots, y_4). If the coefficients a = (a_1, \dots, a_4) are chosen uniformly at random from \{0, 1, \dots, n-1\}, then

\mathbb{P}\{ h_a(x) = h_a(y) \} = \frac{1}{n}
This condition guarantees that the expected lookup time for any item is small.

Proof:

Since x = (x_1, \dots, x_4) and y = (y_1, \dots, y_4) are distinct, these quadruples must differ in some component; without loss of generality let us assume that x_4 \ne y_4. We wish to compute the probability

Pr[h_a(x_1, \dots, x_4) = h_a(y_1, \dots, y_4)]

that is, the probability that

\sum_{i=1}^4 a_i \cdot x_i \equiv \sum_{i=1}^4 a_i \cdot y_i \mod n

This last equation can be rewritten as

\sum_{i=1}^3 a_i \cdot (x_i - y_i) \equiv a_4 \cdot (y_4 - x_4) \mod n

Suppose that we draw a random hash function ha by picking a=(a1,a2,a3,a4)a = (a_1, a_2, a_3, a_4) at random. We start by drawing a1a_1, a2a_2, and a3a_3, and then we pause and think:

What is the probability that the last drawn number a4a_4 is such that equation above holds?

We know that once a_1, a_2, a_3 have been drawn, the left-hand side is some fixed value c modulo n, and since n is prime and x_4 \ne y_4, the quantity (y_4 - x_4) has a unique inverse modulo n.

Thus for the previous equation to hold,

a_4 = c \cdot (y_4 - x_4)^{-1} \mod n

The probability of this happening is 1/n1/n
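A small sketch of this hash family for IPv4-style keys (n = 257 buckets as above; the random draw of a is the only source of randomness, and the helper names are made up):

```python
import random

n = 257  # prime number of buckets

def draw_hash():
    """Draw h_a by picking a = (a1, ..., a4) uniformly from {0, ..., n-1}^4."""
    a = [random.randrange(n) for _ in range(4)]
    def h(x):  # x is an IP address given as a 4-tuple (x1, ..., x4)
        return sum(ai * xi for ai, xi in zip(a, x)) % n
    return h

h = draw_hash()
# any two distinct addresses collide with probability exactly 1/n over the draw of a
print(h((128, 32, 168, 80)), h((128, 32, 168, 81)))
```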

Universal Hash Functions

Family of hash functions that satisfy the following property:

For any two distinct data items xx and yy, exactly H/n|H|/n of all the hash functions in HH map xx and yy to the same bucket, where nn is the number of buckets

In our example, we used the family

H = \{h_a : a \in \{0,\dots,n-1\}^4\}

which is a universal family of hash functions.

In other words, for any two data items, the probability these items collide is 1/n if the hash function is randomly drawn from a universal family.

📌
This property implies that hash table operations have good performance in expectation

Divide and Conquer (DPV 2)

Recurrence Relationships (DPV 2.2)

Divide and conquer often follow a generic pattern of runtime:

T(n)=aT(\lceil n/b \rceil)+O(n^d)
🔥
Master Theorem: If for some constants a>0, b>1, d\ge 0, T(n)=aT(\lceil n/b \rceil)+O(n^d), then: T(n)=\begin{cases} O(n^d) &\text{if } d>\log_b a \\ O(n^d \log n) &\text{if } d = \log_b a \\ O(n^{\log_b a}) &\text{if } d < \log_b a \end{cases}

To prove master theorem,

we will assume that n is a power of b (this will not influence the final bound because n is at most a multiplicative factor of b away from some power of b)

Notice that the size of the subproblems decreases by a factor of b with each level of recursion, and therefore reaches the base case after \log_b n levels. The branching factor is a, so the k-th level of the tree is made up of a^k subproblems, each of size n/b^k. The total work done at this level is

a^k\times O(\frac{n}{b^k})^d=O(n^d)\times(\frac{a}{b^d})^k

As kk goes from 0 to logbn\log_b n, these numbers form a geometric series with ratio a/bda/b^d. Therefore we can find the sum by three cases:

  1. a/b^d < 1
    1. The series is decreasing, and its sum is just given by the first term O(n^d)
  1. a/b^d = 1
    1. We will have O(\log n) copies of the first term
    1. O(n^d \log n)
  1. a/b^d > 1
    1. The series is increasing, and its sum is given by its last term
    1. n^d (\frac{a}{b^d})^{\log_b n} = n^d (\frac{a^{\log_b n}}{(b^{\log_b n})^d}) = a^{\log_b n} = a^{(\log_a n)(\log_b a)} = n^{\log_b a}
    1. O(n^{\log_b a})

Multiplication (DPV 2.1)

Gauss’s trick:

Product of two complex numbers:

(a+bi)(c+di)=ac-bd+(bc+ad)i

Can in fact be done with just three multiplications ac, bd, (a+b)(c+d) because:

bc+ad=(a+b)(c+d)-ac-bd

Using the same idea, if xx and yy are two n-bit integers, we can represent them as left and right halves of bits

x = \fbox{ $x_L$ } \fbox{ $x_R$ } = 2^{n/2}x_L+x_R \\ y = \fbox{ $y_L$ } \fbox{ $y_R$ } = 2^{n/2}y_L+y_R
\begin{split} x\times y &= (2^{n/2}x_L+x_R)(2^{n/2}y_L+y_R) \\ &=2^nx_L y_L + 2^{n/2}(x_Ly_R+x_Ry_L)+x_Ry_R \end{split}

We have recursion expression of

T(n)=4T(n/2)+O(n) \in O(n^2)

for this multiplication relationship

But if we use Gauss's trick, we only need x_L y_L, x_R y_R, (x_L+x_R)(y_L+y_R) to compute the multiplication, since x_Ly_R+x_Ry_L = (x_L+x_R)(y_L+y_R)-x_Ly_L-x_Ry_R

(Karatsuba Algorithm)

T(n)=3T(n/2+1)+O(n) \in O(n^{\log_2 3}) \approx O(n^{1.59})
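A recursive Python sketch of this idea on nonnegative integers (using Python's built-in arithmetic for the base case and for shifts; a real implementation would work on digit arrays, and the cutoff value is arbitrary):

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply two nonnegative integers with three recursive half-size products."""
    if x < 16 or y < 16:          # small base case: fall back to hardware multiply
        return x * y
    n = max(x.bit_length(), y.bit_length())
    half = n // 2
    xL, xR = x >> half, x & ((1 << half) - 1)
    yL, yR = y >> half, y & ((1 << half) - 1)
    p1 = karatsuba(xL, yL)
    p2 = karatsuba(xR, yR)
    p3 = karatsuba(xL + xR, yL + yR)            # (xL+xR)(yL+yR)
    # xL*yR + xR*yL = p3 - p1 - p2, so only three recursive multiplications are needed
    return (p1 << (2 * half)) + ((p3 - p1 - p2) << half) + p2

print(karatsuba(1234567, 7654321) == 1234567 * 7654321)  # True
```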

MergeSort

“Split the list into two halves, recursively sort each half, and merge the two sorted sublists”

The merge procedure does a constant amount of work per recursive call, thus \text{merge}(\cdot) \in O(n), and the overall time taken by mergesort is:

T(n)=2T(n/2)+O(n)O(nlogn)T(n) = 2T(n/2)+O(n) \in O(n \log n)

We will also define an iterative version of mergesort, where \text{inject}(array, element) means to append an element to the end of an array and \text{eject}(array) means to remove the element at the front of the array and return it.
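A short sketch of the recursive version described above (merge written with explicit index scanning; not the notes' pseudocode):

```python
def merge(a, b):
    """Merge two sorted lists in O(len(a) + len(b)) time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def mergesort(arr):
    """Split in half, sort each half recursively, then merge: O(n log n)."""
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    return merge(mergesort(arr[:mid]), mergesort(arr[mid:]))

print(mergesort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```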

🔥
There's an \Omega(n \log n) lower bound for sorting, which means that merge sort is optimal!

The reason why comparison-based sorting must take at least \Omega(n \log n) time is that every permutation of the elements must appear as a leaf of the tree representing the sorting algorithm, so the tree has n! leaves and therefore \text{depth} \ge \log(n!) \propto n\log n

Median

When looking for a recursive solution, it is paradoxically often easier to work with a more general version of the problem—for the simple reason that this gives a more powerful step to recurse upon. In our case, the generalization we will consider is selection

Divide and Conquer for selection:

For any number vv, imagine splitting list SS into three categories: elements smaller than vv, those equal to vv (there might be duplicates), and those greater than vv.

For instance, if the array S:

is split on v=5, then

We can check which sub-array the k-th smallest element falls in by using the length of each sub-array

Worst case:

  1. We select v to be the smallest/largest element of the array each time, causing the array to only shrink by 1.
n+(n-1)+(n-2)+\cdots+\frac{n}{2}=\Theta(n^2)

We will define a good selection of v to be one that strips away at least 25% of the original array.

Therefore

T(n)\le T(3n/4)+O(n) \in O(n)
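A sketch of the selection procedure just described, with v chosen uniformly at random (which gives expected linear time; the deterministic median-of-medians pivot would guarantee it). The example data is made up:

```python
import random

def select(S, k):
    """Return the k-th smallest element of S (k is 1-indexed), expected O(n)."""
    v = random.choice(S)
    smaller = [x for x in S if x < v]
    equal   = [x for x in S if x == v]
    greater = [x for x in S if x > v]
    if k <= len(smaller):                      # k-th smallest lies left of v
        return select(smaller, k)
    if k <= len(smaller) + len(equal):         # it is v itself
        return v
    return select(greater, k - len(smaller) - len(equal))

data = [7, 3, 9, 1, 5, 2, 8]
print(select(data, 4))  # 5, the median
```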

Matrix Multiplication (DPV 2.5)

Z_{ij}=\sum_{k=1}^n X_{ik}Y_{kj}
👆
Formula implies an O(n^3) algorithm, which was widely believed to be the best runtime possible, but later the German mathematician Volker Strassen announced a significantly more efficient algorithm based upon divide and conquer.

Matrix Multiplication can be viewed as blocks

But it is still O(n^3), by:

T(n)=8T(n/2)+O(n^2)

Turns out XYXY can be computed from just seven n/2×n/2n/2 \times n/2 subproblems

XY = \begin{bmatrix} P_5+P_4-P_2+P_6 &P_1+P_2 \\ P_3+P_4 &P_1+P_5-P_3-P_7 \\ \end{bmatrix}

where

\begin{split} P_1 = A(F-H) \quad &P_5 = (A+D)(E+H) \\ P_2 = (A+B)H \quad &P_6 = (B-D)(G+H) \\ P_3 = (C+D)E \quad & P_7 = (A-C)(E+F) \\ P_4 = D(G-E) \quad & \\ \end{split}

The new runtime is

T(n)=7T(n/2)+O(n^2)\in O(n^{\log_2 7}) \approx O(n^{2.81})

Fast Fourier Transform (DPV 2.6)

e^(iπ) in 3.14 minutes, using dynamics | DE5
https://youtu.be/v0YEaeIClKY
But what is the Fourier Transform? A visual introduction.
https://youtu.be/spUNpyF58BY
The Fast Fourier Transform (FFT): Most Ingenious Algorithm Ever?
https://youtu.be/h7apO7q16V0

Product of two degree-d polynomials is a polynomial of degree 2d:

A(x)=a_0+a_1x+\cdots+a_dx^d, \\ B(x)=b_0+b_1x+\cdots+b_dx^d

Then

\begin{split} C(x)&=A(x)\cdot B(x)\\ &=c_0+c_1x+\cdots+c_{2d}x^{2d} \end{split}

and C(x)C(x) will have coefficients

c_k = a_0b_k+a_1b_{k-1}+\cdots+a_kb_0=\sum_{i=0}^k a_ib_{k-i}

where a_i = 0 \wedge b_i = 0 for all i > d

Computing each c_k takes O(k), resulting in an overall \Theta(d^2) steps since we have 2d+1 coefficients

So therefore do we have a faster algorithm for solving this problem?

🔥
Note: Coefficient is present in decreasing degree order

Here A holds the coefficients of the polynomial (describing the function), and the powers of \omega are the points at which the polynomial is evaluated.

So now that we have evaluated the polynomials at those representation points and multiplied the values to obtain 2n+1 values, how do we interpolate back?

The definitive FFT Algorithm:

The FFT takes as input a vector a = (a_0, \dots, a_{n-1}) and a complex number \omega whose powers 1, \omega, \dots, \omega^{n-1} are complex n-th roots of unity.

\begin{bmatrix} A(\omega^{0}) \\ A(\omega^{1}) \\ \vdots \\ A(\omega^{n-1}) \end{bmatrix} = \begin{bmatrix} 1 &1 &1 &\cdots &1 \\ 1 &\omega^{1} &\omega^2 &\cdots &\omega^{n-1} \\ \vdots &\vdots &\vdots &\ddots &\vdots \\ 1 &\omega^{n-1} &\omega^{2(n-1)} &\cdots &\omega^{(n-1)(n-1)} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{n-1} \end{bmatrix}

We can apply divide and conquer to the matrix multiplication steps

The final product is the vector
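A minimal recursive sketch of this divide and conquer (n assumed to be a power of two; it splits into even- and odd-indexed coefficients, which is the standard formulation, not the notes' circuit diagram):

```python
import cmath

def fft(a, omega):
    """Evaluate the polynomial with coefficients a at 1, omega, ..., omega^(n-1)."""
    n = len(a)                      # n assumed to be a power of 2
    if n == 1:
        return a
    even = fft(a[0::2], omega * omega)   # A_even at the (n/2)-th roots of unity
    odd  = fft(a[1::2], omega * omega)   # A_odd  at the (n/2)-th roots of unity
    out = [0] * n
    w = 1
    for j in range(n // 2):
        out[j]          = even[j] + w * odd[j]   # A(omega^j)
        out[j + n // 2] = even[j] - w * odd[j]   # A(omega^(j + n/2))
        w *= omega
    return out

a = [1, 2, 3, 4]                               # A(x) = 1 + 2x + 3x^2 + 4x^3
omega = cmath.exp(2j * cmath.pi / len(a))      # primitive 4th root of unity
print([round(v.real, 6) + round(v.imag, 6) * 1j for v in fft(a, omega)])
```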

Summary:

The divide-and-conquer step of the FFT can be drawn as a very simple circuit. Here is how a problem of size n is reduced to two subproblems of size n/2 (for clarity, one pair of outputs (j, j + n/2) is singled out):

the edges are wires carrying complex numbers from left to right. A weight of jj means “multiply the number on this wire by ωj\omega^j .

  1. For n inputs there are \log_2 n levels, each with n nodes, for a total of O(n \log n) operations
  1. the inputs are arranged by increasing last bit of the binary representation of their index, resolving ties by looking at the next more significant bit(s).
    1. For n=8, the resulting order in binary is 000, 100, 010, 110, 001, 101, 011, 111
  1. There is a unique path between each input a_j and each output A(\omega^k)
    1. This path is most easily described using the binary representations of j and k (shown in Figure 10.4 for convenience). There are two edges out of each node, one going up (the 0-edge) and one going down (the 1-edge).
    1. To get to A(\omega^k) from any input node, simply follow the edges specified in the bit representation of k, starting from the rightmost bit
  1. On the path between a_j and A(\omega^k), the labels add up to jk \mod 8.
    1. Contribution of input a_j to output A(\omega^k) is a_j\omega^{jk}
  1. Notice that the FFT circuit is a natural for parallel computation and direct implementation in hardware.

Graphs (DPV 3 + 4)

DFS (Depth-first Search) in undirected graphs (DPV 3.2)

Surprisingly versatile linear-time procedure that reveals a wealth of information about a graph

🔥
What parts of the graph are reachable from a given vertex?

Explore

explore(G,v) \in O(|E|)

Depth-First Search Algorithm (DFS)

First step:

Marking visited to false, O(V)O(|V|)

Second step:

Every edge is examined exactly twice (for an edge \{x,y\} \in E, we visit it from explore(x) and from explore(y)) ⇒ O(|E|)

Connected Components

We can also check what connected components exist in the graph by simply modifying a bit in the DFS algorithm

Where here ccnumccnum identifies the connect component to which vv belongs

cccc is initialized to zero and incremented each time DFSDFS calls exploreexplore

Previsit and postvisit ordering

We will also note down the times of two important events: the moment of first discovery and of final departure.
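A minimal sketch tying the pieces together: DFS over an adjacency-list dict, recording pre/post clock values and a connected-component number per vertex (graph and names are illustrative, not from the notes):

```python
def dfs(graph):
    """Depth-first search recording pre/post numbers and connected components."""
    visited, pre, post, ccnum = set(), {}, {}, {}
    clock = 0
    cc = 0

    def explore(u):
        nonlocal clock
        visited.add(u)
        ccnum[u] = cc
        pre[u] = clock; clock += 1        # previsit: moment of first discovery
        for v in graph[u]:
            if v not in visited:
                explore(v)
        post[u] = clock; clock += 1       # postvisit: moment of final departure

    for u in graph:
        if u not in visited:
            cc += 1                       # incremented each time DFS calls explore
            explore(u)
    return pre, post, ccnum

g = {'a': ['b', 'c'], 'b': ['a'], 'c': ['a'], 'd': []}   # undirected, arcs in both directions
print(dfs(g))
```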

Directed Acyclic graphs (DAGs) (DPV 3.3.2)

Good for modeling relationships like causalities, hierarchies, and temporal dependencies.

All DAGs can be linearized. ⇒ perform tasks in decreasing order of their post numbers

only edges (u,v) in a graph for which post(u) < post(v) are back edges ⇒ a DAG cannot have back edges ⇒ in a DAG, every edge leads to a vertex with a lower post number

So therefore

🔥
Acyclicity, linearizability, and the absence of back edges during a depth-first search - are in fact one and the same thing

Also, since every DAG has at least one source and one sink…We can do this linearization

  1. Find a source, output it, and delete it from graph
  1. Repeat until the graph is empty

Strongly Connected Components

Two nodes uu and vv of a directed graph are connected if there is a path from uu to vv and a path from vv to uu.
🔥
Every directed graph is a DAG of its strongly connected components

And… turns out the meta-graph can be found in linear time by making further use of DFS.

Property 1: If the exploreexplore subroutine is started at node uu, then it will terminate precisely when all nodes reachable from uu have been visited.

So if we call exploreexplore on a node that lies somewhere in a sinksink strongly connected component, then we will retrieve exactly that component.

Property 2: The node that receives the highest post number in a DFS must lie in a source strongly connected component.
Property 3: If C and C' are strongly connected components, and there is an edge from a node in C to a node in C', then the highest post number in C is bigger than the highest post number in C'.

Proof of property 3:

In proving Property 3, there are two cases to consider. If the depth-first search visits component C before component C', then clearly all of C and C' will be traversed before the procedure gets stuck (see Property 1). Therefore the first node visited in C will have a higher post number than any node of C'. On the other hand, if C' gets visited first, then the depth-first search will get stuck after seeing all of C' but before seeing any of C, in which case the property follows immediately.

We can use the reverse graph GRG^{R} and property 2 to find a node from the sinksink component of graph GG

Once we have found the first strongly connected component and deleted it from the graph, the node with the highest post number among those remaining will belong to a sink strongly connected component of whatever remains of GG. Therefore we can keep using the post numbering from our initial depth-first search on GRG^R to successively output the second strongly connected component, the third strongly connected component, and so on. The resulting algorithm is this.

  1. Run DFS on G^R
  1. Run the undirected connected components algorithm on G, and during the DFS, process the vertices in decreasing order of their post numbers from step 1.

This is linear time, only the constant is about twice of normal DFS search

Paths in Graphs: Distances (DPV 4.1)

The distance between two nodes is the length of the shortest path between them.

To get a concrete feel for this notion, consider a physical realization of a graph that has a ball for each vertex and a piece of string for each edge. If you lift the ball for vertex s high enough, the other balls that get pulled up along with it are precisely the vertices reachable from s. And to find their distances from s, you need only measure how far below s they hang.

Breadth-First Search (BFS) (DPV 4.2)

Initially the queue QQ consists only of the root, ss, the one node at distance 00. And for each subsequent distance d=1,2,3,d = 1, 2, 3, \dots, there is a point in time at which QQ contains all the nodes at distance dd and nothing else. As these nodes are processed (ejected off the front of the queue), their as-yet-unseen neighbors are injected into the end of the queue.
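A minimal sketch of this, with a deque playing the role of the queue (graph and names are made up):

```python
from collections import deque

def bfs(graph, s):
    """Distances (in number of edges) from s; unreachable nodes stay at infinity."""
    dist = {u: float('inf') for u in graph}
    dist[s] = 0
    Q = deque([s])                    # the queue holds the current frontier
    while Q:
        u = Q.popleft()               # eject from the front
        for v in graph[u]:
            if dist[v] == float('inf'):
                dist[v] = dist[u] + 1
                Q.append(v)           # inject at the end
    return dist

g = {'s': ['a', 'b'], 'a': ['c'], 'b': ['c'], 'c': [], 'd': []}
print(bfs(g, 's'))   # {'s': 0, 'a': 1, 'b': 1, 'c': 2, 'd': inf}
```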

Lengths on Edges (DPV 4.3)

We will annotate every edge e \in E with a length l_e.

Dijkstra’s Algorithm (DPV 4.4)

we can think of Dijkstra’s algorithm as just BFS, except it uses a priority queue instead of a regular queue, so as to prioritize nodes in a way that takes edge lengths into account.

Runtime Analysis:

Dijkstra's algorithm is structurally identical to breadth-first search. However, it is slower because the priority queue primitives are computationally more demanding than the constant-time eject's and inject's of BFS. Since makequeue takes at most as long as |V| insert operations, we get a total of |V| deletemin and |V| + |E| insert/decreasekey operations. The time needed for these varies by implementation; for instance, a binary heap gives an overall running time of O((|V| + |E|) \log |V|).
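A sketch using Python's heapq as the binary heap; heapq has no decreasekey, so stale heap entries are skipped instead ("lazy deletion"), which keeps the same asymptotic bound. The example graph is made up:

```python
import heapq

def dijkstra(graph, s):
    """Shortest distances from s; graph[u] is a list of (v, length) pairs."""
    dist = {u: float('inf') for u in graph}
    dist[s] = 0
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)            # deletemin
        if d > dist[u]:
            continue                          # stale entry; stands in for decreasekey
        for v, length in graph[u]:
            if dist[u] + length < dist[v]:
                dist[v] = dist[u] + length    # the "update" step
                heapq.heappush(heap, (dist[v], v))
    return dist

g = {'s': [('a', 4), ('b', 1)], 'a': [('c', 1)], 'b': [('a', 2)], 'c': []}
print(dijkstra(g, 's'))  # {'s': 0, 'a': 3, 'b': 1, 'c': 4}
```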

Priority Queue Implementation in Dijkstra (DPV 4.5)

Array

The simplest implementation of a priority queue is as an unordered array of key values for all potential elements (the vertices of the graph, in the case of Dijkstra’s algorithm).

Initially, these values are set to \infty. An insert or decreasekey is fast, because it just involves adjusting a key value, an O(1) operation. A deletemin, on the other hand, requires a linear-time scan of the list.

Binary Heap

Elements are stored in a complete binary tree, in which each level is filled in from left to right, and must be full before the next level is started.

And special ordering constraint: the key value of any node of the tree is less than or equal to that of its children.

To insert, place the new element at the bottom of the tree (in the first available position), and let it "bubble up." That is, if it is smaller than its parent, swap the two and repeat (Figure 4.11(b)–(d)). The number of swaps is at most the height of the tree, which is \lfloor \log_2 n \rfloor when there are n elements.

A decreasekey is similar, except that the element is already in the tree, so we let it bubble up from its current position.

To deletemin, return the root value. To then remove this element from the heap, take the last node in the tree (in the rightmost position in the bottom row) and place it at the root. Let it "sift down": if it is bigger than either child, swap it with the smaller child and repeat (Figure 4.11(e)–(g)). Again this takes O(\log n) time.

The regularity of a complete binary tree makes it easy to represent using an array. The tree nodes have a natural ordering: row by row, starting at the root and moving left to right within each row. If there are n nodes, this ordering specifies their positions 1,2,,n1, 2, \dots , n within the array. Moving up and down the tree is easily simulated on the array, using the fact that node number jj has parent j/2\lfloor j/2 \rfloor and children 2j2j and 2j+12j + 1

d-ary Heap

A d-ary heap is identical to a binary heap, except that nodes have d children instead of just two. This reduces the height of a tree with n elements to \Theta(\log_d n) = \Theta((\log n)/(\log d)).

Inserts are therefore sped up by a factor of \Theta(\log d). Deletemin operations, however, take a little longer, namely O(d \log_d n)

Negative-value edges? (DPV 4.6.1)

Dijkstra’s method can be thought as a sequence of “Update” operations

But Dijkstra's method marks off vertices once they have been used to calculate shorter distances; this cannot be done with negative-value edges, because an already-processed vertex might later receive a shorter distance via a negative edge.

🔥
So can we just simply update ALL the edges |V|-1 times? Bellman-Ford Algorithm \in O(|V||E|)
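A minimal sketch of Bellman-Ford on an edge list (the extra round at the end anticipates the negative-cycle check discussed in the next subsection; vertices and edges are made up):

```python
def bellman_ford(n, edges, s):
    """Single-source shortest paths with possibly negative edges.
    edges is a list of (u, v, length); vertices are 0..n-1. O(|V||E|)."""
    dist = [float('inf')] * n
    dist[s] = 0
    for _ in range(n - 1):                 # |V| - 1 rounds of updating ALL edges
        for u, v, length in edges:
            if dist[u] + length < dist[v]:
                dist[v] = dist[u] + length
    # One extra round: any further improvement means a reachable negative cycle.
    has_negative_cycle = any(dist[u] + l < dist[v] for u, v, l in edges)
    return dist, has_negative_cycle

edges = [(0, 1, 4), (0, 2, 2), (2, 1, -3), (1, 3, 1)]
print(bellman_ford(4, edges, 0))   # ([0, -1, 2, 0], False)
```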

Detecting Negative Cycles (DPV 4.6.2)

In such situations, it doesn’t make sense to even ask about shortest paths. There is a path of length 2 from A to E. But going round the cycle, there’s also a path of length 1, and going round multiple times, we find paths of lengths 0, −1, −2, and so on.

Fortunately, it is easy to automatically detect negative cycles and issue a warning. Such a cycle would allow us to endlessly apply rounds of update operations, reducing dist estimates every time.

🔥
So instead of stopping after |V| - 1 iterations (in Bellman-Ford), perform one extra round. There is a negative cycle if and only if some dist value is reduced during this final round.

Shortest Paths in DAGs, which cannot have negative cycles (DPV 4.7)

There are two subclasses of graphs that automatically exclude the possibility of negative cycles:

  1. graphs without negative edges
    1. Dijkstra
  1. graphs without cycles
    1. Can be solved in linear time!

Linearize (that is, topologically sort) the dag by depth-first search, and then visit the vertices in sorted order, updating the edges out of each.

Greedy Algorithms (DPV 5)

A player who is focused entirely on immediate advantage is easy to defeat. But in many other games such as Scrabble, it is possible to do quite well by simply making whichever move seems best at the moment

Minimum Spanning Trees

Suppose you are asked to network a collection of computers by linking selected pairs of them. Each link also has a maintenance cost, reflected in that edge’s weight. What is the cheapest possible network?

Trees

A tree is an undirected graph that is connected and acyclic

Property:

  1. A tree on n nodes has n-1 edges
  1. Any connected, undirected graph G=(V,E) with |E| = |V|-1 is a tree
  1. An undirected graph is a tree if and only if there is a unique path between any pair of nodes

🧙🏽‍♂️
Removing a cycle edge cannot disconnect a graph

The cut property:

Suppose edges X are part of a minimum spanning tree of G = (V, E). Pick any subset of nodes S for which X does not cross between S and V - S, and let e be the lightest edge across this partition. Then X \cup \{e\} is part of some MST.

Minimum spanning tree (Kruskal and Prim) algorithm animation demo - bilibili
An animated demonstration of how Kruskal's or Prim's algorithm builds a minimum-weight spanning tree of a connected weighted undirected graph.
https://www.bilibili.com/video/BV1Eb41177d1/

Data structure for disjoint sets - Union by rank

Faster find:

When we want to merge sets, if the representatives (roots) of the sets are rxr_x and ryr_y , do we make rxr_x point to ryr_y or the other way around? Since tree height is the main impediment to computational efficiency, a good strategy is to make the root of the shorter tree point to the root of the taller tree.

Properties:

  1. \forall x, \text{rank}(x) < \text{rank}(\pi(x))
  1. Any node of rank k has at least 2^k nodes in its tree
  1. If there are n elements overall, there can be at most n/2^k nodes of rank k

Kruskal’s Method

Sort the edges by their weights; every time, check if the smallest remaining edge would add a cycle to the graph. If it doesn't, then it gets added to the graph.

O(|E| \log |E|) \approx O(|E| \log |V|) for sorting edges and O(|E| \log |V|) for union and find operations

But what if the weights are small, so that sorting can be done in linear time? Can we perform union and find faster than \log n?

during each find, when a series of parent pointers is followed up to the root of a tree, we will change all these pointers so that they point directly to the root ⇒ this path compression gives the faster find

Now the amortized cost of find and union is O(1)
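A compact sketch combining the sorted-edge loop with a disjoint-set structure using union by rank and path compression (edge weights and names are made up):

```python
def kruskal(n, edges):
    """Minimum spanning tree by Kruskal's method; edges are (weight, u, v)."""
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:       # path compression: repoint everything at the root
            parent[x], x = root, parent[x]
        return root

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return False               # adding this edge would create a cycle
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx
        parent[ry] = rx                # shorter tree points to taller tree
        if rank[rx] == rank[ry]:
            rank[rx] += 1
        return True

    mst = []
    for w, u, v in sorted(edges):      # consider edges in increasing weight
        if union(u, v):
            mst.append((u, v, w))
    return mst

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(kruskal(4, edges))   # [(0, 1, 1), (1, 2, 2), (2, 3, 4)]
```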

Prim’s Method

Start from an arbitrary node; at each step we add the shortest edge connecting some node already in the tree to one that isn't yet.

In other words, maintain one group of already-selected vertices and another group containing the vertices not yet selected. Among the edges not yet chosen that connect the first group to the second, the one with the smallest weight is selected.

Huffman Encoding O(n \log n)

What is the most economic way of writing strings consisting of letters A,B,C,D?

The obvious choice is to use 2 bits per symbol—say codeword 00 for A, 01 for B, 10 for C, and 11 for D - in which case 260 megabits are needed in total. Can there possibly be a better encoding than this?

In search of inspiration, we take a closer look at our particular sequence and find that the four symbols are not equally abundant.

Is there some sort of variable-length encoding, in which just one bit is used for the frequently occurring symbol A, possibly at the expense of needing three or more bits for less common symbols?

A danger with having codewords of different lengths is that the resulting encoding may not be uniquely decipherable. For instance, if the codewords are \{0, 01, 11, 001\}, the decoding of strings like 001 is ambiguous. We will avoid this problem by insisting on the prefix-free property: no codeword can be a prefix of another codeword.

🔥
Any prefix-free encoding can be represented by a full binary tree ⇒ a binary tree in which every node has zero or two children ⇒ every codeword is generated by a path from root to leaf.

To measure how “good” a coding tree is, we want a tree whose leaves each correspond to a symbol and minimizes the overall length of the encoding

\text{cost of tree} = \sum_{i=1}^n \underbrace{f_i}_{\text{frequency of the $i$-th symbol}} \times \text{depth of $i$-th symbol in tree}
This tells us that the two symbols with the smallest frequencies MUST be at the bottom of the optimal tree, as children of the lowest internal node ⇒ because otherwise swapping either of those two with whatever is currently at the lowest level would improve the cost of the tree.

This suggests that we start constructing the tree greedily: find the two symbols with the smallest frequencies, say ii and jj, and make them children of a new node, which then has frequency fi+fjf_i + f_j

There is another way to write this cost function that is very helpful. Although we are only given frequencies for the leaves, we can define the frequency of any internal node to be the sum of the frequencies of its descendant leaves; this is, after all, the number of times the internal node is visited during encoding or decoding. During the encoding process, each time we move down the tree, one bit gets output for every nonroot node through which we pass.

And therefore

The cost of a tree is the sum of the frequencies of all leaves and internal nodes, except the root
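A heap-based sketch of the greedy construction (the frequencies loosely follow the A/B/C/D example above; the tree encoding as nested tuples and the walk helper are just one way to do it):

```python
import heapq

def huffman(freqs):
    """Build a Huffman code from {symbol: frequency}; returns {symbol: codeword}."""
    # heap entries are (frequency, tiebreak id, tree); trees are (symbol, left, right)
    heap = [(f, i, (sym, None, None)) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)        # the two smallest frequencies...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (None, t1, t2)))   # ...become siblings
        count += 1
    codes = {}
    def walk(node, prefix):
        sym, left, right = node
        if sym is not None:
            codes[sym] = prefix or '0'
        else:
            walk(left, prefix + '0')
            walk(right, prefix + '1')
    walk(heap[0][2], '')
    return codes

print(huffman({'A': 70, 'B': 3, 'C': 20, 'D': 37}))  # A gets the 1-bit codeword
```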


Entropy

Average number of bits needed to encode a single draw from the distribution
\text{number of bits needed to encode sequence} = \sum_{i=1}^n m p_i \log(1/p_i)

Measures the randomness of a set of mixed elements

\text{entropy} = \sum_{i=1}^n p_i \log(\frac{1}{p_i})

Horn Formula

A framework for having a computer do logical reasoning!

A literal is either: a variable or a variable’s negation

Then we have two kinds of clauses:

  1. Implications, whose left-hand side is an AND of any number of positive literals and whose right-hand side is a single positive literal: a \implies b
  1. Pure negative clauses, consisting of an OR of any number of negative literals: (\bar{u} \vee \bar{v} \vee \bar{y})

Given a set of clauses, the goal is to find a satisfying assignment: determine whether there is a consistent explanation, i.e., an assignment of true/false values to the variables that satisfies all the clauses.

Our approach:

  1. Start with all variables falsefalse
  1. and proceed to set some of them to truetrue
    1. one by one and only if we absolutely have to because an implication would be violated

Horn formulas lie at the heart of Prolog (“programming by logic”), a language in which you program by specifying desired properties of the output, using simple logical expressions

To see why the algorithm is correct, notice that if it returns an assignment, this assignment satisfies both the implications and the negative clauses, and so it is indeed a satisfying truth assignment of the input Horn formula. So we only have to convince ourselves that if the algorithm finds no satisfying assignment, then there really is none. This is so because our “stingy” rule maintains the following invariant:

If a certain set of variables is set to true, then they must be true in any satisfying assignment.
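A small sketch of this greedy "stingy" procedure; the clause representation and the example formula are made up for illustration:

```python
def horn_sat(implications, negative_clauses):
    """Greedy Horn-SAT: implications are (list_of_premises, conclusion);
    negative clauses are lists of variables, at least one of which must be false."""
    true_vars = set()
    changed = True
    while changed:                      # set a variable to true only when forced
        changed = False
        for premises, conclusion in implications:
            if all(p in true_vars for p in premises) and conclusion not in true_vars:
                true_vars.add(conclusion)
                changed = True
    for clause in negative_clauses:     # check the pure negative clauses at the end
        if all(v in true_vars for v in clause):
            return None                 # unsatisfiable
    return true_vars

# (True => w), (w AND y => x), plus the negative clause (not w OR not x)
imps = [([], 'w'), (['w', 'y'], 'x')]
print(horn_sat(imps, [['w', 'x']]))     # {'w'} is a satisfying assignment
```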

Set cover

County is in its early stages of planning and is deciding where to put schools. There are only two constraints: each school should be in a town, and no one should have to travel more than 30 miles to reach one of them.

What is the minimum number of schools needed?

Let S_x be the set of towns within 30 miles of town x; a school at x will essentially "cover" these other towns. Then how many sets S_x must be picked in order to cover all the towns in the county?

Greedy solution:

  1. Repeat until all elements of B are covered
    1. Pick the set S_i with the largest number of uncovered elements

But we will see that this is not optimal, but it’s not too far off, so let’s modify this.


Suppose B contains n elements and that the optimal cover consists of k sets; then the greedy algorithm will use at most k \ln n sets.

Let n_t be the number of elements still not covered after t iterations of the greedy algorithm. Since these remaining elements are covered by the optimal k sets,

there must be some set with at least n_t/k of them.

Therefore, the greedy strategy will ensure that

n_{t+1} \le n_t - \frac{n_t}{k}=n_t(1-\frac{1}{k})

Which implies n_t \le n_0(1-1/k)^t.

If we use an inequality

1-x \begin{cases} < e^{-x} &\quad \forall x \ne 0 \\ =e^{-x} &\quad \text{if } x =0 \end{cases}

Therefore

n_t \le n_0(1-\frac{1}{k})^t < n_0(e^{-1/k})^t = ne^{-t/k}
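A sketch of the greedy rule analyzed above; the towns and coverage sets are made up:

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most still-uncovered elements.
    Uses at most k ln n sets if the optimum uses k (see the analysis above)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            return None                 # some element cannot be covered at all
        chosen.append(best)
        uncovered -= set(best)
    return chosen

towns = range(1, 7)
S = [[1, 2, 3], [3, 4], [4, 5, 6], [1, 4]]   # hypothetical coverage sets
print(greedy_set_cover(towns, S))            # [[1, 2, 3], [4, 5, 6]]
```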

Dynamic Programming (DPV 6)

Shortest Paths in DAGS

👉
Every DAG can be expressed by linearized graph

In this particular graph, we see that in order for us to find shortest distance from SS to DD

dist(D) = \min\{dist(B)+1, dist(C)+3\}

We can compute all distances in a single pass:

Longest Increasing Subsequences

String Edit Distance O(nm)

Subproblems of string edit distance: E(i-1, j), E(i, j-1), E(i-1, j-1)
E(i, j) = \min\{1 + E(i-1, j), 1 + E(i, j-1), diff(i, j) + E(i-1, j-1)\}
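A direct sketch of this table-filling recurrence (the SNOWY/SUNNY pair is DPV's example, whose edit distance is 3):

```python
def edit_distance(x: str, y: str) -> int:
    """Classic O(nm) dynamic program; E[i][j] = edit distance of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i                      # delete all of x[:i]
    for j in range(m + 1):
        E[0][j] = j                      # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            E[i][j] = min(1 + E[i - 1][j],          # deletion
                          1 + E[i][j - 1],          # insertion
                          diff + E[i - 1][j - 1])   # match / substitution
    return E[n][m]

print(edit_distance('SNOWY', 'SUNNY'))   # 3
```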

Knapsack

During a robbery, a burglar finds much more loot than he had expected and has to decide what to take. His bag (or "knapsack") will hold a total weight of at most W pounds. There are n items to pick from, of weight w_1, \dots, w_n and dollar value v_1, \dots, v_n. What's the most valuable combination of items he can fit into his bag?

Knapsack with repetition O(nW)

If there are unlimited quantities of each item available,
K(w) = \text{maximum value achievable with a knapsack of capacity $w$}

If the optimal solution to K(w)K(w) includes item i, then removing this item from the knapsack leaves an optimal solution to K(wwi)K(w − w_i).

K(w) = \max_i \{ K(w - w_i) + v_i \}

Knapsack without repetition

K(w, j) = \text{maximum value achievable using a knapsack of capacity $w$ and items } 1, \dots, j
K(w, j) = \max\{K(w - w_j, j - 1) + v_j, K(w, j - 1)\}
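A sketch of the without-repetition table (the weights/values are DPV's four-item example with W = 10, where the optimum without repetition is 46):

```python
def knapsack(W, weights, values):
    """0/1 knapsack: K[w][j] = best value with capacity w and items 1..j. O(nW)."""
    n = len(weights)
    K = [[0] * (n + 1) for _ in range(W + 1)]
    for j in range(1, n + 1):
        for w in range(W + 1):
            K[w][j] = K[w][j - 1]                          # skip item j
            if weights[j - 1] <= w:                        # or take it (at most once)
                K[w][j] = max(K[w][j], K[w - weights[j - 1]][j - 1] + values[j - 1])
    return K[W][n]

print(knapsack(10, [6, 3, 4, 2], [30, 14, 16, 9]))   # 46 (items 1 and 3)
```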

(Chain) Matrix Multiplication

Observation: Order of multiplications make a big difference in the final runtime!
👉
Notice that a particular parenthesization can be represented very naturally by a binary tree in which the individual matrices correspond to the leaves, the root is the final product, and interior nodes are intermediate products.
For a tree to be optimal, its subtrees must also be optimal.

Let the subproblems corresponding to the subtrees be of the form A_i \times A_{i+1} \times \cdots \times A_j. For 1 \le i \le j \le n, we define a function

C(i,j) = \text{minimum cost of multiplying } A_i \times A_{i+1} \times \cdots \times A_j

When there’s nothing to multiply ⇒ C(i,i)=0C(i,i) = 0

We can split the subproblem into smaller subproblems

C(i,j) = \min_{i \le k < j} \{ C(i,k)+C(k+1,j) + m_{i-1} \cdot m_k \cdot m_j \}

The subproblems constitute a two-dimensional table, each of whose entries takes O(n) time to compute; the overall runtime is thus O(n^3).

Shortest Paths

We started this chapter with a dynamic programming algorithm for the elementary task of finding the shortest path in a dag. We now turn to more sophisticated shortest-path problems and see how these too can be accommodated by our powerful algorithmic technique.

Shortest Reliable Paths \in O(|V||E|)

In a communications network, for example, even if edge lengths faithfully reflect transmission delays, there may be other considerations involved in choosing a path. For instance, each extra edge in the path might be an extra "hop" fraught with uncertainties and dangers of packet loss.

dist(v,i) = \min_{(u,v) \in E} \{ dist(u,i-1) + l(u,v) \}

All-pairs shortest paths (Floyd-Warshall Algorithm) \in O(|V|^3)

What if we want to find the shortest path not just between s and t but between all pairs of vertices? One approach would be to execute our general shortest-path algorithm, Bellman-Ford (since there may be negative edges), |V| times, once for each starting node. The total running time would then be O(|V|^2|E|). We'll now see a better alternative, the O(|V|^3) dynamic programming-based Floyd-Warshall algorithm.

👉
If we disallow intermediate nodes, then the shortest path from u to v is simply the edge (u,v) ⇒ So can we just slowly expand the set of allowed intermediates, one node at a time, updating the shortest path lengths at each step; once we've expanded to the whole set of vertices, we have found the true shortest paths!

If we number the vertices V as \{1,2,\dots,n\}, we denote by dist(i,j,k) the length of the shortest path from i to j in which only nodes \{1,2,\dots,k\} can be used as intermediates.

What happens when we expand the intermediate set to include an extra node k? We must reexamine all pairs i, j and check whether using k as an intermediate point gives us a shorter path from i to j.

Then using kk gives us a shorter path from ii to jj iff

dist(i,k,k-1) + dist(k,j,k-1) < dist(i,j,k-1)

So the algorithm (Floyd-Warshall Algorithm) \in O(|V|^3):
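A sketch of the triple loop, with the k-index kept implicit by updating the table in place (edge list and example values are made up):

```python
def floyd_warshall(n, edges):
    """All-pairs shortest paths; edges are (u, v, length), vertices 0..n-1. O(|V|^3)."""
    INF = float('inf')
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, length in edges:
        dist[u][v] = min(dist[u][v], length)
    for k in range(n):                    # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

edges = [(0, 1, 3), (1, 2, -2), (0, 2, 4), (2, 3, 1)]
print(floyd_warshall(4, edges)[0])   # [0, 3, 1, 2]
```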

Traveling Salesman Problem (TSP) \in O(n^2 2^n)

A traveling salesman is getting ready for a big sales tour. Starting at his hometown, he will conduct a journey in which each of his target cities is visited exactly once before he returns home. Given the pairwise distances between cities, what is the best order in which to visit them, so as to minimize the overall distance traveled?

If we denote the cities by 1, \dots, n, the salesman's hometown being 1, and let D = (d_{ij}) be the matrix of intercity distances, the goal is to design a tour that starts and ends at 1, includes all other cities exactly once, and has minimum total length.

Brute force takes O(n!) time, and although DP improves this runtime a lot, it is still not possible to solve this in polynomial time.

Subproblems:

For a subset of cities S = \{1,j, \dots\} \subseteq \{1,2, \dots, n\} that includes 1, and j \in S, let C(S,j) be the length of the shortest path visiting each node in S exactly once, starting at 1 and ending at j.

When |S|>1, we define C(S,1) = \infty since the path cannot both start and end at 1.

C(S,j) = \min_{i\in S: i \ne j} C(S-\{j\},i)+d_{ij}

The subproblems are ordered by |S|, and the whole algorithm is:

There are at most 2^n \times n subproblems, and each one takes linear time to solve, so the overall complexity is O(n^2 2^n)

Independent Sets in Trees \in O(|V|+|E|)

A subset of nodes S \subseteq V is an independent set of graph G = (V,E) if there are no edges between them.

Like before, we can form this problem into a tree of nodes representing sub-problems

We will define the subproblem (u is a node in the tree, which also represents a node in the graph):

We will simply select any node of the graph's vertex set V as the root of the tree.
I(u) = \text{size of largest independent set of subtree hanging from $u$}

Our goal is to find I(r), where r represents the root of the tree

I(u) = \max \{1 + \sum_{w \in grandchildren(u)} I(w), \sum_{w \in children(u)} I(w) \}

If the independent set includes uu, then we get one point for it, but we aren’t allowed to include the children of uu (since they are directly connected to uu)—therefore we move on to the grandchildren.

The number of subproblems is exactly the number of vertices, and with a little care the runtime can be made linear, O(|V|+|E|)

Linear Programming and Reductions (DPV 7)

Linear Programming (DPV 7.1 + 7.4)

We are given a set of variables, and we want to assign real values to them so as to

  1. satisfy a set of linear equations / inequalities involving these variables
  1. maximize or minimize a given linear objective function
A typical linear program

An inequality designates a half-space, the region on one side of the line.

Thus the set of all feasible solutions of this linear program, that is, the points (x1, x2) which satisfy all constraints, is the intersection of five half-spaces, and is a convex polygon.

We want to find the point in this polygon at which the objective function—the profit—is maximized. The points with a profit of c dollars lie on the line x_1 + 6x_2 = c, which has a slope of -1/6.

Since the goal is to maximize c, we must move the line as far up as possible, while still touching the feasible region. The optimum solution will be the very last feasible point that the profit line sees and must therefore be a vertex (or edge) of the polygon, as shown in the figure.

Standard Form of LP (DPV 7.1)

  1. All variables are nonnegative
  1. Constraints are all equations
  1. objective function is to be minimized

Duality of LP (DPV 7.4)

Duality theorem: If a linear program has a bounded optimum, then so does its dual, and the two optimum values coincide
🤔
When the primal is the LP that expresses the max-flow problem, it is possible to assign interpretations to the dual variables that show the dual to be none other than the minimum-cut problem

Simplex Algorithm

Fact: each vertex is the unique point at which some subset of hyperplanes meet.

So to find a vertex:

Suppose we have a generic LP

\max \{c^\top x\} \\ Ax \le b \\ x \ge 0

For each iteration, we want to

  1. Check if the current vertex is optimal
  1. Determine where to move next

Note:

The origin is a vertex that we can start with.

  1. The current vertex (the origin) is optimal iff all c_i \le 0
  1. Otherwise we can increase some x_i for which c_i > 0 (release the tight constraint x_i \ge 0) until we hit some other constraint (loose ⇒ tight)

Now we arrived at some other vertex uu

Then we transform uu into the origin(shifting the coordinate system from the usual to the “local” view of uu)

These local coordinates consist of (appropriately scaled) distances y1,,yny_1, \dots , y_n to the nn hyperplanes (inequalities) that define and enclose uu:

The revised “local” LP has the following three properties:

  1. It includes the inequalities y \ge 0, which are simply the transformed versions of the inequalities defining u
  1. u is the origin in y-space
  1. The cost function becomes \max c_u + \tilde{c}^\top y, where c_u is the value of the objective function at u and \tilde{c} is a transformed cost vector

The simplex function moves from vertex to neighboring vertex, stopping when the objective function is locally optimal, that is, when the coordinates of the local cost vector are all zero or negative.


Runtime Analysis

Suppose there are nn variables and mm inequalities

At each iteration:

  1. Each neighbor shares n-1 inequalities with the current vertex, so there are at most n \cdot m neighbors
    1. Choose which inequality to drop (n choices) and which new one to add (m choices)
  1. Naive way to perform each iteration
    1. Check each potential neighbor to see whether it really is a vertex of the polyhedron and to determine its cost
      1. Finding cost is just a dot product
      1. Checking if it is a true vertex involves solving a system of n linear equations in n unknowns
        1. by Gaussian Elimination O(n^3)
    1. \in O(mn^4) per iteration
  1. Not-naive way to perform each iteration
    1. Turns out that the per-iteration overhead of rewriting the LP in terms of the current local coordinates is just O((m+n)n)O((m + n)n)
      1. exploits the fact that the local view changes only slightly between iterations, in just one of its defining inequalities
    1. To select the best neighbor
      1. we recall that the (local view of) the objective function is of the form maxcu+c~y\max c_u +\tilde{c} \cdot y where cuc_u is the value of the objective function at uu.
      1. We pick any c~i>0\tilde{c}_i > 0 (if there is none, then the current vertex is optimal and simplex halts)
    1. O(mn)\in O(mn) per iteration
  1. At most \binom{m+n}{n} iterations (vertices)
    1. Exponential in nn
    1. In practice, such exponential examples do not occur

Finding a starting vertex for simplex

It turns out that finding a starting vertex can be reduced to an LP and solved by simplex!

Recall standard form:

\min c^\top x \\ \text{s.t. } Ax=b, \; x \ge 0
  1. First make sure that the RHS of the equations are all non-negative
    1. If b_i < 0, just multiply both sides of the i-th equation by -1
    1. Create an LP as follows
      1. Create artificial variables z_1, \dots, z_m \ge 0, where m is the number of equations
      1. Add z_i to the LHS of the i-th equation
      1. Let the objective to be minimized be \sum_i z_i
    1. It's easy to come up with a starting vertex
      1. Set each z_i = b_i and all other variables to 0
    1. If the optimum value of z_1 + \cdots + z_m is zero, then all z_i's obtained by simplex are zero, and hence from the optimum vertex of the new LP we get a starting feasible vertex of the original LP, just by ignoring the z_i
    1. If \sum_i z_i is not zero ⇒ the original linear program is infeasible

Degeneracy in Simplex Algorithm

When a vertex has more than n (the number of variables) constraints intersecting at it, then choosing any one of the sets of n inequalities would give us the same solution

Unboundedness of LP

Network Flow (Maximum Flow \in O(|V| \cdot |E|^2)) (DPV 7.2)

The networks we are dealing with consist of a directed graph G = (V, E); two special nodes s, t \in V, which are, respectively, a source and sink of G; and capacities c_e > 0 on the edges.

We would like to send as much as possible from s to t without exceeding the capacities of any of the edges. A particular shipping scheme is called a flow and consists of a variable f_e for each edge e of the network, satisfying the following two properties:

  1. Doesn't violate edge capacities: 0 \le f_e \le c_e
  1. For all nodes u except s and t, the amount of flow entering and the amount of flow leaving u should be the same ⇒ flow is conserved
    1. \sum_{(w,u) \in E} f_{wu} = \sum_{(u,z) \in E} f_{uz}

The size of a flow is the total quantity sent from s to t and is the quantity leaving s

size(f) = \sum_{(s,u) \in E} f_{su}

We can solve max flow by Ford-Fulkerson

Ford-Fulkerson in 5 minutes
Step by step instructions showing how to run Ford-Fulkerson on a flow network. Sources: 1. http://www.win.tue.nl/~nikhil/courses/2WO08/07NetworkFlowI.pdf LinkedIn: https://www.linkedin.com/in/michael-sambol-076471ba
https://www.youtube.com/watch?v=Tl90tNtKvxs
  1. Find an augmenting path in the graph
    1. BFS(Edmonds–Karp), A* search, RL Model
  1. Construct the residual graph by augmenting the path based on the original graph and augmenting path
  1. Do this until no augmenting path can be found.
cs170-coding-notebooks-fa22/hw8.ipynb at master · ToiletCommander/cs170-coding-notebooks-fa22
Coding Jupyter Notebooks for Fall 2022 Iteration of CS 170 - cs170-coding-notebooks-fa22/hw8.ipynb at master · ToiletCommander/cs170-coding-notebooks-fa22
https://github.com/ToiletCommander/cs170-coding-notebooks-fa22/blob/master/hw8/hw8.ipynb
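A minimal sketch of the augmenting-path loop with BFS path selection (Edmonds-Karp, O(|V||E|^2)); the dict-of-dicts capacity format and the example network are made up, and this is separate from the notebook linked above:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Ford-Fulkerson with BFS augmenting paths; capacity is a dict-of-dicts,
    mutated in place so it doubles as the residual network."""
    for u in list(capacity):                          # ensure reverse residual edges exist
        for v in list(capacity[u]):
            capacity.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent = {s: None}
        Q = deque([s])
        while Q and t not in parent:                  # BFS for an augmenting path
            u = Q.popleft()
            for v, c in capacity[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    Q.append(v)
        if t not in parent:
            return flow                               # no augmenting path left
        path, v = [], t                               # recover the path s -> t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity[u][v] for u, v in path)
        for u, v in path:
            capacity[u][v] -= bottleneck              # use up forward residual capacity
            capacity[v][u] += bottleneck              # add reverse (cancellation) capacity
        flow += bottleneck

caps = {'s': {'a': 3, 'b': 2}, 'a': {'t': 2}, 'b': {'a': 1, 't': 2}, 't': {}}
print(max_flow(caps, 's', 't'))   # 4
```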

The max flow problem can also be reduced to a LP problem

Once we have studied the simplex algorithm in more detail (DPV 7.6), we will be in a position to understand exactly how it handles flow LPs, which is useful as a source of inspiration for designing direct max-flow algorithms. It turns out that in fact the behavior of simplex has an elementary interpretation:

in each iteration simplex looks for an sts \rightarrow t path whose edges (u,v)(u, v) can be of two types

  1. (u,v)(u,v) is in the original network, and is not yet at full capacity
  1. The reverse edge (v,u)(v,u) is in the original network, and there is some flow along it

If the current flow is f, then in the first case, edge (u, v) can handle up to c_{uv} - f_{uv} additional units of flow, and in the second case, up to f_{vu} additional units (canceling all or part of the existing flow on (v, u)). These flow-increasing opportunities can be captured in a residual network G^f = (V, E^f), which has exactly the two types of edges listed, with residual capacities c^f:

c^f_{uv} = \begin{cases} c_{uv} - f_{uv} &\quad \text{if $(u,v) \in E$ and $f_{uv} < c_{uv}$} \\ f_{vu} &\quad \text{if $(v,u) \in E$ and $f_{vu} > 0$} \end{cases}

Thus we can equivalently think of simplex as choosing an sts − t path in the residual network

🧙🏽‍♂️
Not only does simplex correctly compute a maximum flow, but it also generates a short proof of the optimality of this flow!

Partition the nodes of the oil network (Figure 7.4) into two groups, L={s,a,b}L = \{s, a, b\} and R={c,d,e,t}R = \{c, d, e, t\}:

Any oil transmitted must pass from LL to RR. Therefore, no flow can possibly exceed the total capacity of the edges from LL to RR, which is 7. But this means that the flow we found earlier, of size 7, must be optimal

🔥
More generally, an (s,t)(s, t)-cut partitions the vertices into two disjoint groups LL and RR such that ss is in LL and tt is in RR. Its capacity is the total capacity of the edges from LL to RR, and as argued previously, is an upper bound on any flow: Pick any flow ff and any (s,t)(s, t)-cut (L,R)(L, R). Then size(f)capacity(L,R)size(f ) \le capacity(L, R).

Some cuts are large and give loose upper bounds— cut({s,b,c},{a,d,e,t})cut(\{s, b, c\}, \{a, d, e, t\}) has a capacity of 19. But there is also a cut of capacity 7, which is effectively a certificate of optimality of the maximum flow. Such a cut always exists.

Max-flow min-cut theorem

The size of the maximum flow in a network equals the capacity of the smallest (s,t)(s, t)-cut. Moreover, our algorithm automatically finds this cut as a by-product!

Suppose ff is the final flow when the algorithm terminates. We know that node tt is no longer reachable from ss in the residual network GfG^f . Let LL be the nodes that are reachable from ss in GfG^f , and let R=VLR = V − L be the rest of the nodes. Then (L,R)(L, R) is a cut in the graph G:

We claim that:

size(f)=capacity(L,R)size(f) = capacity(L,R)

Runtime:

Each iteration requires $O(|E|)$ time if DFS or BFS is used to find an $s$-$t$ path.

Suppose all edges in the original network have integer capacities C\le C. Then an inductive argument shows that on each iteration of the algorithm, the flow is always an integer and increases by an integer amount. Therefore, since the maximum flow is at most CEC|E| ⇒ the number of iterations is at most this much.

But what if $C$ is big? If paths are chosen in a sensible manner, in particular by using BFS, which finds the augmenting path with the fewest edges, then the number of iterations is at most $O(|V||E|)$.

Bipartite Matching

There is an edge between a boy and girl if they like each other (for instance, Al likes all the girls). Is it possible to choose couples so that everyone has exactly one partner, and it is someone they like? In graph-theoretic jargon, is there a perfect matching?

Create a new source node, s, with outgoing edges to all the boys; a new sink node, t, with incoming edges from all the girls; and direct all the edges in the original bipartite graph from boy to girl (Figure 7.8). Finally, give every edge a capacity of 1. Then there is a perfect matching if and only if this network has a flow whose size equals the number of couples.

Nice property about max flow problem: if all edge capacities are integers, then the optimal flow found by our algorithm is integral

Zero-sum Games (DPV 7.5)

Rock-paper-scissors game’s payoff matrix illustrated as Row’s gain

Now suppose the two of them play this game repeatedly. If Row always makes the same move, Column will quickly catch on and will always play the countermove, winning every time. Therefore Row should mix things up: we can model this by allowing Row to have a mixed strategy, in which on each turn she plays $r$ with probability $x_1$, $p$ with probability $x_2$, and $s$ with probability $x_3$. This strategy is specified by the vector $x = (x_1, x_2, x_3)$. Similarly, Column's mixed strategy is some $y = (y_1, y_2, y_3)$.

On any given round of the game, there is an xiyjx_i \cdot y_j chance that Row and Column will play the i-th and j-th moves, respectively.

Therefore the expected (average) gain of Row is:

i,jGijxiyj\sum_{i,j} G_{ij} \cdot x_i y_j
📌
Notice that Column can force the expected gain to be zero by choosing $y = (1/3, 1/3, 1/3)$, no matter what strategy Row chooses to play, and Row can do the same. In short, the best each player can do is to play completely randomly, with an expected payoff of zero.

If we investigate with a nonsymmetric game ⇒ then the “best” each player can do might be different

Imagine a presidential election scenario in which there are two candidates for office, and the moves they make correspond to campaign issues on which they can focus (the initials stand for economy, society, morality, and tax cut). The payoff entries are millions of votes lost by Column.

Suppose now Row's strategy is $r = (x_1, x_2)$. Column $c = (y_1, y_2)$ will then respond so as to minimize Row's gain:

minc=(y1,y2){3x12x2,x1+x2}\min_{c = (y_1, y_2)} \{3x_1 - 2x_2, -x_1 + x_2\}

Since Row is forced to announce its strategy first, it should pick $r$ to maximize this worst case:

maxrminc{3x12x2,x1+x2}\max_{r} \min_{c} \{3x_1 - 2x_2, -x_1 + x_2\}

We can transform this problem into an LP:

Objective of Column
Objective of Row

Dual of this LP: if Column announces its strategy first and then Row chooses

Multiplicative Weights Algorithm / Hedge Algorithm

Per-step regret RTTO(lnnT)\frac{R_T}{T} \in O(\sqrt{\frac{\ln n}{T}})

Consider the following online optimization setup:

The algorithm:
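The setup and pseudocode are left blank in these notes; as a placeholder, here is a minimal sketch of the standard Hedge update over $n$ experts with losses in $[0,1]$ (the names and the choice $\eta \approx \sqrt{\ln n / T}$ are ours), which attains the per-step regret quoted above:

```python
import math

def hedge(losses, eta):
    """Hedge / multiplicative weights sketch: `losses` is a T x n list of per-expert
    losses in [0, 1]. Play the normalized weights, then exponentially downweight
    each expert by its loss. Returns the algorithm's total expected loss."""
    n = len(losses[0])
    w = [1.0] * n                                      # one weight per expert
    total = 0.0
    for loss in losses:
        Z = sum(w)
        p = [wi / Z for wi in w]                       # distribution played this round
        total += sum(pi * li for pi, li in zip(p, loss))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
    return total
```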

Multiplicative Weights in Minmax Games

Bounds:

Circuit Evaluation (DPV 7.7)

We are given a Boolean circuit, a DAG of gates of the following types

Additionally, one of the gates is designated as the output.

CIRCUIT VALUE problem: apply Boolean logic to the gates in topological order, does the output evaluate to true?

🔥
We can translate this problem into a linear program, by creating a variable xgx_g for each gate gg, with constraints 0xg10 \le x_g \le 1

⬆️Additional Constraints imposed by different gates

These constraints force each $x_g$ to be $0$ (false) or $1$ (true).

It turns out that CIRCUIT VALUE is, in a sense, the most general problem solvable in polynomial time: every algorithm eventually runs on a computer, and the computer is ultimately a Boolean combinational circuit implemented on a chip. If the algorithm runs in polynomial time, it can be rendered as a Boolean circuit consisting of polynomially many copies of the computer's circuit, one per unit of time.

Hence, the fact that CIRCUIT VALUE reduces to linear programming means that every problem solvable in polynomial time reduces to linear programming as well!

NP-Complete Problems (DPV 8)

Search Problems (DPV8.1)

A search problem is specified by an algorithm CC that takes two inputs, an instance II and a proposed solution SS, and runs in time polynomial in I|I|. We say SS is a solution to II if and only if C(I,S)=trueC(I,S) = \text{true}.

SAT(Satisfiability) problem

(xyz)(xyˉ)(yzˉ)(zxˉ)(xˉyˉzˉ)(x \vee y \vee z) (x \vee \bar{y})(y \vee \bar{z})(z \vee \bar{x})(\bar{x} \vee \bar{y} \vee \bar{z})

Given a Boolean formula in Conjunctive Normal Form (CNF), which is a collection of clauses, each consisting of a disjunction of several literals (a literal is either a Boolean variable or the negation of one), find a satisfying truth assignment.

Horn Formula:

2-literals case:

Traveling salesman problem (TSP)

Given nn vertices 1,,n1, \dots, n and all n(n1)/2n(n-1)/2 distances between them, as well as budget bb. We want to find a tour, a cycle that passes through every vertex exactly once, of total cost bb or less (or report that no such tour exists)

In other words (in search problem’s perspective), we want to find a permutation τ(1),,τ(n)\tau(1), \dots, \tau(n) such that when they are toured in this order, the total distance covered is at most bb:

dτ(1),τ(2)+dτ(2),τ(3)++dτ(n),τ(1)bd_{\tau(1), \tau(2)} + d_{\tau(2), \tau(3)} + \cdots + d_{\tau(n), \tau(1)} \le b

There's a budget so that this is a search problem ⇒ a proposed solution should be easy to recognize (checkable in polynomial time).

Euler and Rudrata

Euler’s Bridge problem: Is there a path that visits each edge exactly once?

Or in other words: When can a graph be drawn without lifting the pencil from the paper?

🤔
Euler's answer: iff (1) the graph is connected, and (2) every vertex, with the possible exception of two (the start and end vertices of the walk), has even degree

Rudrata’s Cycle problem: Can one visit all the squares of the chessboard, without repeating any square, in one long walk that ends at the starting square and at each step makes a legal knight move?

In other words: Given a graph, find a cycle that visits each vertex exactly once - or report that no such cycle exists ⇒ Oops! Looks similar to TSP!

Cuts and Bisections

Balanced Cut problem: Given a graph with nn vertices and a budget bb, partition the vertices into two sets SS and TT such that S,Tn/3|S|, |T| \ge n/3 and such that there are at most bb edges between SS and TT

Applications: clustering, e.g. the problem of segmenting an image into its constituent components (say, an elephant standing in a grassy field with a clear blue sky above). Put a node for each pixel, and put an edge between nodes whose corresponding pixels are spatially close together and also similar in color.

Integer Linear Programming

We mentioned in DPV 7 that there is a different, polynomial algorithm for linear programming. But the situation changes completely if, in addition to specifying a linear objective function and linear inequalities, we also constrain the solution (the values of the variables) to be integers.

In order to define a search problem, we define a goal $g$ (the counterpart, for maximization problems, of a budget constraint). Then we have:

\text{find integer } x \text{ such that } \quad Ax \le b, \quad c^\top x \ge g

Special case of ILP (Zero-One Equations [ZOE]): given an $m \times n$ matrix $A$ with 0-1 entries, find a 0-1 vector $x$ such that $Ax = \mathbf{1}$, the all-ones vector.

Three-Dimensional Matching

nn boys, nn girls and nn pets, and the compatibilities among them are specified by a set of triples {b,g,p}\{ b, g, p \}, each containing a boy, a girl, and a pet. We want to find nn disjoint triples and thereby create nn harmonious households

Each triple is shown as a triangular-shaped node joining boy, girl, and pet.

Independent set, vertex cover, and clique

Independent Set Problem: Given a graph and an integer gg, find gg vertices that are independent (no two of which have an edge between them)

Vertex Cover (Special Case of Set Cover): input is a graph and a budget bb, find bb vertices that cover every edge

Set Cover: Given a set $E$ and several subsets of it, $S_1, \dots, S_m$, along with a budget $b$. Select $b$ of the subsets so that their union is $E$.

CLIQUE problem: Given a graph and a goal gg, find a set of gg vertices such that all possible edges between them are present.

Longest Path: Given a graph $G$ with nonnegative edge weights and two distinguished vertices $s$ and $t$, along with a goal $g$, we want to find a path from $s$ to $t$ with total weight at least $g$.

Knapsack and subset sum

KNAPSACK problem (DPV 6.4, DP solution O(nW)\in O(nW)): Given integer weights w1,,wnw_1, \dots, w_n and integer values v1,,vnv_1, \dots, v_n for nn items. Also given a weight capacity WW and goal gg, find a set of items whose total weight is at most WW and total value is at least gg.

UNARY KNAPSACK: KNAPSACK with unary-coded integers (IIIIIIIIIIII=12IIIIIIIIIIII = 12)

SUBSET SUM: Each item’s value is equal to its weight (given in binary), goal gg is the same as the capacity WW ⇒ finding a subset of given set of integers that adds up to exactly WW

NP-Complete Problems (DPV 8.2)

The world is full of search problems, some can be solved efficiently, some seem to be very hard

| Hard Problems (NP-Complete) | Easy Problems (in P) |
| --- | --- |
| 3 SAT | 2 SAT, HORN SAT |
| Traveling Salesman Problem | Minimum Spanning Tree |
| Longest Path | Shortest Path |
| 3D Matching | Bipartite Matching |
| Knapsack | Unary Knapsack |
| Independent Set | Independent Set on Trees |
| Integer LP | Linear Programming |
| Rudrata Path | Euler Path |
| Balanced Cut | Minimum Cut |

P and NP

Search Problems: There is an efficient checking algorithm $C$ that takes as input an instance $I$ (the data specifying the problem) and a proposed solution $S$, and outputs $\text{true}$ iff $S$ is a solution to $I$, in time polynomial in $|I|$.

We denote the class of all search problems by NPNP (nondeterministic polynomial time).

We also denote the class of all search problems that can be solved in polynomial time by PNPP \sube NP

🔥
Is P=NPP = NP? Most researchers think PNPP \ne NP.

if $P = NP$, there's an efficient method to prove any mathematical statement, thus eliminating the need for mathematicians! ⇒ this gives a nice intuition for why $P \ne NP$; however, proving this is extremely difficult.

Reductions in search problems

The problems on the left side of the table are all, in some sense, exactly the same problem, except that they are stated in different languages. We will use reductions to show that those problems are the hardest problems in $NP$: if even one of them has a polynomial time algorithm, then every problem in $NP$ has a polynomial time algorithm.

A reduction from search problem AA to search problem BB is a polynomial-time algorithm ff that transforms any instance II of AA into an instance f(I)f(I) of BB, together with another polynomial-time algorithm hh that maps any solution SS of f(I)f(I) back into a solution h(S)h(S) of II. If f(I)f(I) has no solution, then neither does II.

Now let’s define the class of hardest search problems

A search problem is NP-complete if all other search problems reduce to it.

Nice properties about reduction:

A side note on factoring:

Factoring is the task of finding all prime factors of a given integer, but the difficulty of factoring is of a different nature, and nobody believes that factoring is NP-complete. One major difference is that a solution always exists (there is no scenario in which we must report that no solution exists), and the problem succumbs to the power of quantum computation.

The Reductions (DPV 8.3)

Coping with NP-Completeness (DPV 9)

If we have a hard problem

  1. Start by proving that it is NP-Complete
    1. A proof by generalization
    1. Simple reduction from another NP-Complete problem to this problem
  1. Once problem is proved NP-Complete
    1. Special cases of the problem might be solvable in linear time
    1. e.g. graph problems restricted to trees can often be solved by DP in linear time
    1. approximation algorithm
      1. falls short of the optimum but never by too much
      1. Greedy algorithm for set cover O(logn)O(\log n)
    1. heuristics
      1. algorithms with no guarantees on either the running time or the degree of approximation
      1. rely on ingenuity, intuition, a good understanding of the application, meticulous experimentation to tackle the problem

Intelligent Exhaustive Search (DPV 9.1)

Backtracking

It is often possible to reject a solution by looking at just a small portion of it

By quickly checking and discrediting some partial assignment, we are able to prune some portion of the search space.

Consider the Boolean formula ϕ(w,x,y,z)\phi(w,x,y,z) specified by the set of clauses

(wxyz),(wxˉ),(xyˉ),(yzˉ),(zwˉ),(wˉzˉ)(w \vee x \vee y \vee z), (w \vee \bar{x}), (x \vee \bar{y}), (y \vee \bar{z}), (z \vee \bar{w}), (\bar{w} \vee \bar{z})

We will start by branching on any one variable (ww)

No clause is immediately violated, thus neither of those partial assignments can be eliminated outright. Let’s keep branching on xx with w=0w=0

Now we see that w=0,x=1w=0, x=1 violates (wxˉ)(w \vee \bar{x}) and this helps us prune a good chunk of search space.

In the case of Boolean satisfiability, each node of the search tree can be described either by a partial assignment or by the clauses that remain when those values are plugged into the original formula.

For instance, if $w = 0$ and $x = 0$, then any clause containing $\bar{w}$ or $\bar{x}$ is instantly satisfied, while the literals $w$ and $x$ are false and can be removed from their clauses. What's left is

(yz),(yˉ),(yzˉ)(y \vee z), (\bar{y}), (y\vee\bar{z})

Likewise for w=0,x=1w=0, x=1

(),(yzˉ)(), (y \vee \bar{z})

The "empty clause" ( ) rules out satisfiability.

Thus the nodes of the search tree, representing partial assignments, are themselves SAT sub-problems. This alternative representation is helpful for making the two decisions that repeatedly arise: which subproblem to expand next, and which branching variable to use.

Since the benefit of backtracking lies in its ability to eliminate portions of the search space, and since this happens only when an empty clause is encountered, it makes sense to choose the subproblem that contains the smallest clause and to then branch on a variable in that clause.

More abstractly, a backtracking algorithm requires a test that looks at a subproblem and quickly declares one of three outcomes

  1. Failure: the subproblem has no solution
  1. Success: A solution to the subproblem is found
  1. Uncertainty

In the case of SAT, this test declares failure if there is an empty clause, success if there are no clauses, and uncertainty otherwise.
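A minimal Python sketch of this backtracking scheme for CNF-SAT; the representation and names are ours (a clause is a list of signed integers, $+v$ / $-v$ meaning variable $v$ appears positively / negated):

```python
def backtracking_sat(clauses, assignment=None):
    """Backtracking SAT sketch. Returns a satisfying (partial) assignment as a dict,
    or None if the formula is unsatisfiable."""
    if assignment is None:
        assignment = {}
    # Plug the partial assignment into the formula
    remaining = []
    for clause in clauses:
        kept, satisfied = [], False
        for lit in clause:
            var, want = abs(lit), lit > 0
            if var in assignment:
                if assignment[var] == want:
                    satisfied = True        # clause already satisfied: drop it
                    break
            else:
                kept.append(lit)            # literal still undetermined
        if not satisfied:
            if not kept:
                return None                 # empty clause => failure, prune this branch
            remaining.append(kept)
    if not remaining:
        return assignment                   # no clauses left => success
    # Uncertain: branch on a variable from the smallest remaining clause
    var = abs(min(remaining, key=len)[0])
    for value in (True, False):
        result = backtracking_sat(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None
```

On the example formula above, encoded as `[[1,2,3,4],[1,-2],[2,-3],[3,-4],[4,-1],[-1,-4]]` with $w,x,y,z$ as variables $1,2,3,4$, this returns `None`: every branch eventually produces an empty clause.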

Branch-and-bound

The same backtracking principle can be generalized from search problems to optimization problems.

Suppose we have a minimization problem

As before, we need a basis for eliminating partial solutions, since there is no other source of efficiency in our method. To reject a subproblem, we must be certain that its cost exceeds that of some other solution we have already encountered. But its exact cost is unknown to us and is generally not efficiently computable. So instead we use a quick lower bound on this cost.

Use this on TSP problem with graph G=(V,E)G = (V,E) with edge lengths de>0d_e > 0:

Approximation Algorithms (DPV 9.2)

Suppose we have an algorithm AA for our problem which given the instance II returns a solution with value A(I)A(I), the approximation ratio of algorithm AA is defined as

αA=maxIA(I)OPT(I)\alpha_A = \max_I \frac{A(I)}{OPT(I)}

$\alpha_A$ measures the factor by which the output of algorithm $A$ exceeds the optimal solution, on the worst-case input.

When faced with an NP-complete optimization problem, a reasonable goal is to look for an approximation algorithm AA whose αA\alpha_A is as small as possible.

Vertex Cover

Given an undirected graph G=(V,E)G = (V,E), find a subset of the vertices SVS \sube V that touches every edge and S|S| is minimized.

Matching: a subset of edges, no two of which share a vertex. A matching is maximal if no more edges can be added to it.

🧙🏽‍♂️
Maximal matchings will help us find good vertex covers, and moreover, they are easy to generate: repeatedly pick edges that are disjoint from the ones chosen already, until this is no longer possible.

Any vertex cover of a graph $G$ must be at least as large as the number of edges in any matching in $G$; that is, any matching provides a lower bound on OPT, simply because each edge of the matching must be covered by one of its endpoints in any vertex cover.

Let $S$ be the set that contains both endpoints of each edge in a maximal matching $M$. Then $S$ must be a vertex cover: if it isn't, that is, if it doesn't touch some edge $e \in E$, then $M$ could not possibly be maximal, since we could still add $e$ to it. But our cover $S$ has $2|M|$ vertices. And from the previous paragraph we know that any vertex cover must have size at least $|M|$, so we are done ($S$ is our approximation of the vertex cover)!

This simple procedure always returns a vertex cover whose size is at most twice optimal!
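A sketch of this procedure (our own naming; `edges` is a list of vertex pairs):

```python
def vertex_cover_2approx(edges):
    """Greedily build a maximal matching, then take both endpoints of every matched edge.
    The returned cover has size at most twice the minimum vertex cover."""
    matched = set()            # vertices already used by the matching
    cover = set()
    for u, v in edges:
        if u not in matched and v not in matched:   # edge disjoint from matching so far
            matched.update((u, v))
            cover.update((u, v))
    return cover
```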

Clustering

We have some data that we want to divide into groups ⇒ first define a notion of "distance", which has to satisfy the usual "metric" properties

  1. d(x,y)0d(x,y) \ge 0 for all x,yx,y
  1. d(x,y)=0d(x,y) = 0 iff x=yx=y
  1. d(x,y)=d(y,x)d(x,y) = d(y,x)
  1. d(x,y)d(x,z)+d(z,y)d(x,y) \le d(x,z) + d(z,y)
    1. Triangle inequality

The problem is NP-hard, but there’s an approximation algorithm:
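The algorithm itself is not reproduced in these notes; a sketch of the usual farthest-first-traversal heuristic that the guarantee below refers to (assuming `dist` is a metric; names are ours):

```python
def k_cluster_centers(points, k, dist):
    """Farthest-first traversal: pick an arbitrary first center, then repeatedly pick
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda x: min(dist(x, c) for c in centers))
        centers.append(farthest)
    return centers
```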

🧙🏽‍♂️
The resulting diameter is guaranteed to be at most twice optimal

Let $x \in X$ be the point farthest from $\mu_1, \dots, \mu_k$ (in other words the next center we would have chosen, if we wanted $k+1$ of them), and let $r$ be its distance to its closest center. Then every point in $X$ must be within distance $r$ of its cluster center. By the triangle inequality, this means every cluster has diameter at most $2r$.

we have identified k+1k + 1 points {μ1,μ2,,μk,x}\{\mu_1, \mu_2, \dots , \mu_k, x\} that are all at a distance at least rr from each other ⇒ Any partition into kk clusters must put two of these points in the same cluster and must therefore have diameter at least rr.

TSP

We saw that the triangle inequality helped make the k-CLUSTER problem approximable. In the TRAVELING SALESMAN PROBLEM, if the distances between cities satisfy the metric properties, then there is an algorithm that outputs a tour of length at most $1.5$ times optimal. We will look at a weaker result that achieves a factor of $2$.

Is there a structure that is easy to compute and plausibly related to the best traveling salesman tour? Turns out to be the minimum spanning tree.

Removing any edge from a traveling salesman tour leaves a path through all the vertices, which is a spanning tree, therefore

TSP costcost of this pathMST cost\text{TSP cost} \ge \text{cost of this path} \ge \text{MST cost}

Now an MST is not necessarily a tour. To turn it into one, use each edge twice: following the shape of the MST, we end up with a tour that visits all the cities, some of them more than once.

Therefore this tour has a length at most twice the MST cost

To fix the problem of visiting some cities more than once, the tour should simply skip any city it is about to revisit, and instead move directly to the next new city in its list (By the triangle inequality, these bypasses can only make the overall tour shorter).
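A sketch of this MST-doubling tour (using Prim's algorithm on a distance matrix `d`; names are ours):

```python
def tsp_2approx(d):
    """MST-doubling tour: build an MST with Prim's algorithm, then take a preorder walk
    of the tree (this is exactly "use each edge twice, then shortcut repeated cities").
    `d` is a symmetric matrix of metric distances; the tour returned costs <= 2 * OPT."""
    n = len(d)
    in_tree = [False] * n
    parent = [0] * n
    best = [float("inf")] * n
    best[0] = 0.0
    children = {v: [] for v in range(n)}
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if u != 0:
            children[parent[u]].append(u)   # record the MST edge parent[u] - u
        for v in range(n):
            if not in_tree[v] and d[u][v] < best[v]:
                best[v], parent[v] = d[u][v], u
    tour, stack = [], [0]                   # preorder DFS of the MST from the root
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour
```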

General TSP

The general TSP is much harder to approximate if the distances do not satisfy the triangle inequality!

Recall that we have a polynomial-time reduction from the RUDRATA problem that produces a TSP instance $I(G, C)$:

This means even an approximate solution to TSP would enable us to solve RUDRATA PATH!

So unless P=NPP = NP, there cannot exist an efficient approximation algorithm for the general TSP.

Knapsack

This approximation scheme: given any $\epsilon > 0$, it returns a solution of value at least $(1-\epsilon)$ times the optimal value, in time that scales only polynomially in the input size and in $1/\epsilon$.

We earlier had an $O(nW)$ dynamic-programming solution to the Knapsack problem. Using a similar technique, a running time of $O(nV)$ can also be achieved. Neither of these running times is polynomial, because $W$ and $V$ can be very large, exponential in the size of the input.

Let’s consider the O(nV)O(nV) algorithm. In the bad case when VV is large, what if we simply scale down all the values in some way?

Let’s see why this works. First of all, since the rescaled values v^i\hat{v}_i are all at most n/ϵn/\epsilon, the dynamic program is efficient, running in time O(n3/ϵ)O(n^3/\epsilon).

Now suppose the optimal solution to the original problem is to pick some subset of items SS, with total value KK^*, the rescaled value of this same assignment is

iSv^i=iSvinϵvmaxiS(vinϵvmax1)Knϵvmaxn\sum_{i \in S} \hat{v}_i = \sum_{i \in S} \lfloor v_i \cdot \frac{n}{\epsilon v_{\max}} \rfloor \ge \sum_{i \in S} (v_i \cdot \frac{n}{\epsilon v_{\max}} - 1) \ge K^* \cdot \frac{n}{\epsilon v_{\max}} -n

Therefore the optimal assignment for the shrunken problem, call it S^\hat{S}, has a rescaled value of at least this much. In terms of the original values, assignment S^\hat{S} has a value of at least

iS^viiS^v^iϵvmaxn(Knϵvmaxn)ϵvmaxn=KϵvmaxK(1ϵ)\sum_{i \in \hat{S}} v_i \ge \sum_{i \in \hat{S}} \hat{v}_i \cdot \frac{\epsilon v_{\max}}{n} \ge (K^* \cdot \frac{n}{\epsilon v_{\max}} - n) \cdot \frac{\epsilon v_{\max}}{n} = K^* - \epsilon v_{\max} \ge K^*(1-\epsilon)
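A sketch of the whole scheme (rescale, run the value-indexed DP on the rescaled instance, recover the set); the function and variable names are ours, and the DP is the straightforward $O(nV)$ table, giving $O(n^3/\epsilon)$ total time:

```python
import math

def knapsack_fptas(values, weights, W, eps):
    """Returns a set of item indices of total weight <= W and value >= (1 - eps) * OPT."""
    n = len(values)
    vmax = max(values)
    vhat = [math.floor(v * n / (eps * vmax)) for v in values]   # rescaled values <= n/eps
    V = sum(vhat)                                               # at most n^2 / eps
    INF = float("inf")
    # dp[i][v] = minimum weight of a subset of the first i items with rescaled value v
    dp = [[INF] * (V + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(1, n + 1):
        for v in range(V + 1):
            dp[i][v] = dp[i - 1][v]                             # skip item i
            if v >= vhat[i - 1]:
                take = dp[i - 1][v - vhat[i - 1]] + weights[i - 1]
                if take < dp[i][v]:
                    dp[i][v] = take                             # take item i
    best_v = max(v for v in range(V + 1) if dp[n][v] <= W)      # best feasible rescaled value
    chosen, v = [], best_v                                      # walk the table backwards
    for i in range(n, 0, -1):
        if dp[i][v] != dp[i - 1][v]:                            # item i must have been taken
            chosen.append(i - 1)
            v -= vhat[i - 1]
    return chosen[::-1]
```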

The approximability hierarchy

NP-Complete optimization problems are classified as follows (assuming PNPP \ne NP, if P=NPP = NP then all classes collapse into one):

Local Search Heuristics (DPV 9.3)

For a minimization problem, "try out" nearby solutions and keep them if they improve on the current one

On each iteration, the current solution is replaced by a better one close to it, in its neighborhood. This neighborhood structure is something we impose upon the problem and is the central design decision in local search.

TSP with local search

Assume we have all interpoint distances between nn cities, giving a search space of (n1)!(n − 1)! different tours. What is a good notion of neighborhood?
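One standard answer is the 2-change neighborhood: remove two edges of the tour and reconnect the two resulting paths the other way. A sketch of local search with this neighborhood (names are ours; `d` is a distance matrix and `tour` a starting permutation):

```python
def two_change_local_search(d, tour):
    """Keep replacing tour edges (a,b), (c,e) by (a,c), (b,e) whenever that is shorter."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                a, b = tour[i], tour[i + 1]
                c, e = tour[j], tour[(j + 1) % n]
                if a == e:                       # the two removed edges must be disjoint
                    continue
                if d[a][c] + d[b][e] < d[a][b] + d[c][e]:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])   # reconnect
                    improved = True
    return tour
```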

Graph Partitioning

Removing the restriction on the sizes of AA and BB would give the MINIMUM CUT problem, which can be efficiently solved using flow techniques.

However, this present variant is NP-hard.

Let (A,B)(A,B) with A=B|A| = |B| be a candidate solution; We will define its neighbors to be all solutions obtainable by swapping one pair of vertices across the cut: all solutions of the form (A{a}+{b},B{b}+{a})(A - \{a\} + \{b\}, B - \{b\} + \{a\}) where aA,bBa \in A, b \in B.

How do we improve local search procedures?

Dealing with Local Optima

Randomization and restarts

Randomization

When there are many local optima, randomization is a way of making sure that there is at least some probability of getting to the right one.

Simulated Annealing

As the problem size grows, the ratio of bad to good local optima often increases, sometimes to the point of being exponentially large. In such cases, simply repeating the local search a few times is ineffective.

Randomized Algorithms

We now study randomized algorithms instead of deterministic ones. A randomized algorithm $A$ can be seen as a regular algorithm that takes two inputs: the usual input $x$ and a uniformly random string $r \in \{0,1\}^R$ for some $R$. We generally categorize randomized algorithms into two classes: Las Vegas algorithms, which are always correct but whose running time is a random variable (e.g. QuickSort below), and Monte Carlo algorithms, which have bounded running time but may return a wrong answer with small probability (e.g. Freivalds' and Karger's algorithms below).

Primality Testing (DPV 1.3)

Is there some test that tells us whether a number is prime without factoring out the number?

Fermat’s Little Theorem (assuming pp is prime)

ap11modpa^{p-1} \equiv 1 \mod p

Proof:

a=3, p=7

Let SS be the nonzero integers modulo pp; that is, S={1,2,,p1}S = \{1, 2, \dots , p − 1\}. One crucial observation: the effect of multiplying these numbers by a (modulo p) is simply to permute them.

Let’s expand this equivalence to a set

{1,2,,6}={3×1mod7,3×2mod7,,3×6mod7}\{1,2,\dots,6\} = \{3\times 1 \mod 7, 3 \times 2 \mod 7, \dots, 3 \times 6 \mod 7\}

If we multiply everything in the set together,

6!366!mod76! \equiv 3^6 \cdot 6! \mod 7

Dividing each side by $6!$, we get exactly Fermat's Little Theorem

361mod73^6 \equiv 1 \mod 7

We’ll prove that when the elements of S are multiplied by aa modulo pp, the resulting numbers are all distinct and nonzero. And since they lie in the range [1, p − 1], they must simply be a permutation of S.

The numbers $a \cdot i \mod p$ are distinct, because if $a \cdot i \equiv a \cdot j \mod p$, then dividing both sides by $a$ gives $i \equiv j \mod p$.

They are nonzero because ai0a \cdot i \equiv 0 similarly implies i0i \equiv 0. (And we can divide by aa, because by assumption it is nonzero and therefore relatively prime to pp)

Therefore we can write SS

S={1,2,,p1}={a1modp,a2modp,,a(p1)modp}S = \{1, 2, \dots , p − 1\} = \{a \cdot 1 \mod p, a \cdot 2 \mod p, \dots , a \cdot (p − 1) \mod p\}

Multiply each element in SS

(p1)!ap1(p1)!modp(p-1)! \equiv a^{p-1}(p-1)! \mod p

Dividing by (p1)!(p − 1)! (which we can do because it is relatively prime to p, since p is assumed prime) then gives the theorem.

So Fermat’s theorem suggests a “factorless” test
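A sketch of that test (our own code; `trials` controls the error probability, and Carmichael numbers, discussed below, can still fool it):

```python
import random

def fermat_test(N, trials=20):
    """Pick random bases a and check a^(N-1) == 1 (mod N). A composite non-Carmichael N
    fails for at least half of all bases, so the error probability is <= 2^-trials."""
    if N < 4:
        return N in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, N - 1)
        if pow(a, N - 1, N) != 1:
            return False      # definitely composite
    return True               # probably prime
```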

The problem is that Fermat’s theorem is not an if-and-only-if condition; it doesn’t say what happens when N is not prime, so in these cases the preceding diagram is questionable. In fact, it is possible for a composite number N to pass Fermat’s test (that is, aN11modNa^{N −1} \equiv 1 \mod N) for certain choices of aa.

It turns out that certain extremely rare composite numbers N , called Carmichael numbers, pass Fermat’s test for all aa relatively prime to NN . On such numbers our algorithm will fail; but they are pathologically rare, and we will later see how to deal with them.

We will ignore those Carmichael numbers for now; we will show that a composite (non-Carmichael) $N$ fails Fermat's test for at least half of the possible values of $a$.

Lemma: If aN1≢1modNa^{N −1} \not\equiv 1 \mod N for some aa relatively prime to NN, then it must hold for at least half the choices of a<Na < N.

Proof: Fix some value of aa for which aN1≢1modNa^{N-1} \not\equiv 1 \mod N, the key is to notice that every element b<Nb < N that passes Fermat’s test with respect to NN has a twin, aba \cdot b that fails the test:

(ab)N1aN1bN1aN1≢1modN(a \cdot b)^{N-1} \equiv a^{N-1} \cdot b^{N-1} \equiv a^{N-1} \not\equiv 1 \mod N

Moreover, all these elements aba \cdot b, for fixed aa but different choices of bb, are distinct, for the same reason ai≢aja \cdot i \not\equiv a \cdot j in the proof of Fermat’s test: just divide by aa.

The one-to-one function babb \rightarrow a \cdot b shows that at least as many elements fail the test as pass it

Therefore the algorithm in Figure 1.7 has the following property

Generating Random Primes

Let π(x)\pi(x) be the number of primes x\le x, then π(x)xlnx\pi(x) \approx \frac{x}{\ln x}, more precisely

limxπ(x)x/ln(x)=1\lim_{x \to \infin} \frac{\pi(x)}{x / \ln(x)} = 1

Such abundance makes it simple to generate a random n-bit prime:
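A sketch of this generate-and-test loop (our own code, using the Fermat test inline; assumes `n_bits` is reasonably large):

```python
import random

def random_prime(n_bits):
    """A random n-bit number is prime with probability about 1/(n ln 2), so repeatedly
    sampling and primality-testing succeeds after O(n) tries in expectation."""
    while True:
        cand = random.getrandbits(n_bits) | (1 << (n_bits - 1)) | 1   # force n bits, odd
        if all(pow(random.randrange(2, cand - 1), cand - 1, cand) == 1 for _ in range(20)):
            return cand
```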

Carmichael numbers

The smallest Carmichael number is $561 = 3 \cdot 11 \cdot 17$; it is not a prime, yet it fools the Fermat test, because $a^{560} \equiv 1 \mod 561$ for all values of $a$ relatively prime to $561$. For a long time it was thought that there might be only finitely many numbers of this type, but we now know there are infinitely many of them, though they are exceedingly rare.

There is a way around Carmichael numbers, using a slightly more refined primality test due to Rabin and Miller.

Write $N-1$ in the form $2^t u$ with $u$ odd. We will choose a random base $a$ and check the value of $a^{N-1} \mod N$. Perform this computation by first determining $a^u \mod N$ and then repeatedly squaring, to get the sequence:

aumodN,a2umodN,,a2tu=aN1modNa^u \mod N, a^{2u} \mod N, \dots, a^{2^tu} = a^{N-1} \mod N

If aN1≢1modNa^{N-1} \not\equiv 1 \mod N, then NN is composite by Fermat’s little theorem, and we’re done. If aN11modNa^{N-1} \equiv 1 \mod N, we conduct a little follow-up test
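The follow-up test checks where the sequence first becomes $1$: for a prime $N$, the value just before the first $1$ must be $-1 \pmod N$. A sketch of one round (our own code; assumes $N$ is odd and greater than $4$):

```python
import random

def rabin_miller_round(N):
    """One round of the refined test: write N - 1 = 2^t * u with u odd, then examine
    the sequence a^u, a^(2u), ..., a^(N-1) mod N for a random base a."""
    u, t = N - 1, 0
    while u % 2 == 0:
        u //= 2
        t += 1
    a = random.randrange(2, N - 1)
    x = pow(a, u, N)                    # a^u mod N
    if x == 1 or x == N - 1:
        return True                     # N passes this round
    for _ in range(t - 1):
        x = pow(x, 2, N)                # repeated squaring toward a^(N-1)
        if x == N - 1:
            return True
    return False                        # N is definitely composite
```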

Quicksort
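The notes refer to "steps 1 through 3" of QuickSort without reproducing them; a minimal sketch of the randomized version being analyzed (uniformly random pivot, split into $L$ and $R$, recurse):

```python
import random

def quicksort(A):
    """Randomized QuickSort sketch: pick a uniformly random pivot, compare every other
    element to it to form L (smaller) and R (larger), and recurse on both sides."""
    if len(A) <= 1:
        return A
    pivot = random.choice(A)
    L = [x for x in A if x < pivot]
    E = [x for x in A if x == pivot]
    R = [x for x in A if x > pivot]
    return quicksort(L) + E + quicksort(R)
```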

The key insight is that the runtime t(A,r)t(A, r) is, up to a constant factor, equal to the number of comparisons performed by QuickSort.

This is because for every recursive call on some array of size, say, ll, the total time spent at that level of recursion (steps 1 through 3 above) is Θ(l)\Theta(l), whereas the number of comparisons is also l1=Θ(l)l − 1 = \Theta(l).

Furthermore, for any pair of indices ij{1,,n}i \ne j \in \{1, \dots , n\}, A[i]A[i] is compared to A[j]A[j] either 00 or 11 times throughout the entire run of the algorithm; this is because for two items to be compared, one of them had to have been the pivot, in which case the non-pivot item is then put into either L or R and never has a chance to be compared with the pivot again.

Thus if we define an indicator random variable Xi,jX_{i,j} which is 11 if the i-th smallest element is ever compared with the j-th smallest and 00 otherwise, we have that

t(A,r)Ci<jXi,j t(A,r) \le C \sum_{i<j} X_{i,j}

where CC is a constant

Applying linearity of expectation,

Ert(A,r)Er[Ci<jXi,j]=Ci<jErXi,j\mathbb{E}_r t(A,r) \le \mathbb{E}_r [C \sum_{i < j} X_{i,j}] = C\sum_{i<j} \mathbb{E}_r X_{i,j}

Let’s try to bound ErXi,j\mathbb{E}_r X_{i,j}

What is the probability that the ith and jth smallest elements (call them ai,aja_i, a_j) are ever compared? If we trace the QuickSort recursion tree from the root, both elements start off being in the same recursive subproblem, then when the root recursion node picks a random pivot, if that pivot is either <ai< a_i or >aj> a_j then aia_i and aja_j either both go to LL or both go to RR (i.e. stay in the same recursive subproblem). Otherwise, if the pivot is in S={ai,ai+1,,aj}S =\{a_i, a_{i+1}, \dots , a_j \}, then they do not go to the same recursive subproblem.

Thus what decides whether Xi,jX_{i,j} equals 0 or 1 is the following: it is equal to 1 iff the first time a pivot is picked in the set S, it is picked to be either aia_i or aja_j ; otherwise Xi,j=0X_{i,j} = 0 since ai,aja_i, a_j will be sent separately to LL and RR and never have a chance to be compared again. Thus P(Xi,j=1)=2/(ji+1)\mathbb{P}(X_{i,j} = 1) = 2/(j − i + 1).

Ert(A,r)Ci<j2ji+1\mathbb{E}_r t(A,r) \le C \sum_{i<j} \frac{2}{j-i+1}

Doing some algebraic manipulation

\begin{split} \mathbb{E}_r t(A,r) &\le C \sum_{i<j} \frac{2}{j-i+1} \\ &= C\sum_{i=1}^{n-1} \sum_{j=i+1}^n \frac{2}{j-i+1} \\ &= 2C \sum_{k=1}^{n-1} \frac{k}{n-k+1} \\ &\le 2C n \sum_{k=1}^{n-1} \frac{1}{n-k+1} \quad (\text{since $k\le n$}) \\ &= 2Cn\sum_{t=2}^n \frac{1}{t} \\ &\le 2C n \int_{1}^n \frac{1}{t} dt \\ &=2C n\ln n \end{split}

We proved expected runtime, but can we prove that the runtime is low with high probability?

Recall Markov’s inequality (With ZZ as a nonnegative random variable)

λ>0,P(Z>λ)<E[Z]λ\forall \lambda > 0, \mathbb{P}(Z > \lambda) < \frac{\mathbb{E} [Z]}{\lambda}

Therefore for the runtime $Z$ of QuickSort, $\mathbb{E} Z < C n \log n$; since $Z$ is nonnegative (algorithms cannot run in negative time), we can apply Markov's inequality: $\mathbb{P}(Z > 200\, C n \log n) < 1/100$. Thus with at least 99% probability the runtime is $O(n \log n)$.

Freivalds’ Algorithm

How do we verify computation correctness?

Freivalds’ algorithm specifically focuses on verifying matrix-matrix multiplication. That is, we have two matrices A,BRn×nA, B \in \mathbb{R}^{n \times n} and would like to know C=ABC = A B. We could of course multiply these matrices ourselves in Θ(n3)\Theta(n^3) time (or faster using Strassen or more recent algorithms), but if we upload A,BA, B to the cloud which returns some CC to us, is there a way to verify that C=A×BC = A \times B much faster than doing the multiplication ourselves from scratch?

Answer is yes using randomization: pick a uniformly random x{0,1}nx \in \{0, 1\}^n and check whether ABx=CxABx = Cx.

🧙🏽‍♂️
Note this check can be implemented in O(n2)O(n^2) time instead of O(n3)O(n^3) time via associativity, since ABx=A(Bx)ABx = A(Bx)
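A sketch of the check (our own code on plain Python lists; `rounds` repetitions drive the error probability down, as discussed below):

```python
import random

def freivalds_check(A, B, C, rounds=1):
    """Randomized check of C == A*B: pick x in {0,1}^n uniformly and compare A(Bx)
    with Cx, which takes only O(n^2) time per round."""
    n = len(A)
    def matvec(M, x):
        return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    for _ in range(rounds):
        x = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, x)) != matvec(C, x):
            return False          # definitely C != A*B
    return True                   # wrong with probability <= (1/2)^rounds when C != A*B
```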

Claim: If C=A×BC = A \times B, then Px(ABx=Cx)=1\mathbb{P}_x (ABx = Cx) = 1, otherwise Px(ABx=Cx)1/2\mathbb{P}_x (ABx = Cx) \le 1/2

Proof:

For the second half (assuming CA×BC \ne A \times B), let’s define D=ABCD = AB - C

Then D0D \ne 0 and in particular its ith column DiD_i is nonzero for some ii.

Now suppose $ABz = Cz$ for some particular $z \in \{0,1\}^n$, i.e. $Dz = 0$. Then if $z'$ is $z$ with the $i$-th entry flipped, we have $Dz' \ne 0$, since $Dz' = D(z \pm e_i) = Dz \pm D e_i = \pm D_i \ne 0$, where $e_i$ is the $i$-th standard basis vector.

Now let’s pair all the 2n2^n elements of {0,1}n\{0,1\}^n into 2n12^{n-1} pairs, where each pair differs in only the i-th entry. Then we see that in each pair (z,z)(z, z’) at most one of DzDz or DzDz’ can equal zero, but not both. Thus {z{0,1}n:Dz=0}2n1|\{z \in \{0,1\}^n : Dz = 0 \} | \le 2^{n-1}, leading to the probability for the second half.


If we want to reduce our probability of returning an incorrect answer to some desired failure probability pp, we can for example set N=log21/pN = \lceil \log_2 1/p \rceil and pick NN random vectors x1,,xN{0,1}nx_1, \dots, x_N \in \{0,1\}^n independently. We then declare A×B=CA \times B = C iff ABxi=Cxi,i{1,,N}ABx_i = Cx_i, \forall i \in \{1, \dots, N\}, then the probability we fail is at most (1/2)Np(1/2)^N \le p. Our runtime would be Θ(n2log(1/p))\Theta(n^2 \log(1/p)).

Karger’s Global Minimum Cut Algorithm (Contraction Algorithm)

Given a (possibly weighted) undirected graph $G$, we would like to find a cut of smallest size. A cut is a partition of the vertices $V$ into two non-empty sets $S, \bar{S}$.

The size of the cut is defined to be (u,v)E(S×Sˉ)w(u,v)\sum_{(u,v) \in E \cap (S \times \bar{S})} w(u,v). If unweighted it would be E(S×Sˉ)|E \cap (S \times \bar{S})|.

For the Contraction algorithm we will focus only on the unweighted case (but it is possible to adapt the algorithm to the weighted case as well). We assume the graph is connected ($e \ge n-1$), because otherwise we can find a cut of size $0$ using DFS or BFS, by letting $S$ be one connected component and $\bar{S}$ be everything else.

First note that maximum flow can be used to solve global min-cut deterministically. A global minimum cut has to separate vertex $1$ from some other vertex $t$, though we do not know a priori which one, so we set $s = 1$ and loop over all possibilities $t \in \{2, \dots, n\}$. For each possibility we compute the $s$-$t$ min-cut as discussed in earlier lectures. In an unweighted graph the capacities are all $1$, so the value of the max flow / min cut is at most $n-1$; thus each min-cut computation takes $O(mn)$ time using Ford-Fulkerson. Since we do $n$ such $s$-$t$ min-cut computations, the total time is $O(mn^2)$.

We now analyze Karger's simpler Contraction algorithm.

The contract operation on e=(u,v)e = (u,v)
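A sketch of the whole algorithm (our own code: vertices are $0, \dots, n-1$, contraction is tracked with union-find, and edges that become self-loops are discarded):

```python
import random

def contraction(edges, n):
    """Karger's Contraction sketch: contract a uniformly random remaining edge until
    only two super-vertices remain; the surviving edges form the output cut.
    Assumes a connected, simple, unweighted graph on vertices 0..n-1."""
    parent = list(range(n))                 # union-find: super-vertex of each vertex
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    current = list(edges)
    for _ in range(n - 2):                  # each contraction removes one vertex
        u, v = random.choice(current)       # contract a uniformly random edge
        parent[find(u)] = find(v)
        # drop edges that became self-loops inside the merged super-vertex
        current = [(a, b) for (a, b) in current if find(a) != find(b)]
    return current                          # the edges crossing the cut (S, S-bar) found
```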


Lemma: For any graph $G$, fix a particular global minimum cut $(S, \bar{S})$; then the probability that $Contraction(G)$ returns this particular cut is at least $1 / {n \choose 2}$.

Proof:

The Contraction algorithm will return $(S, \bar{S})$ iff no edge crossing the cut is ever contracted. Let $k$ be the number of edges crossing the cut, and let $e_i$ and $n_i$ denote the number of edges and vertices in the graph during the $i$th iteration of the for loop, $i = 1, 2, \dots, n-2$ ($n_i$ is just $n-i+1$, while $e_i$ is a random variable that depends on what has been contracted so far). Then the probability that we never contract an edge crossing this cut is precisely

i=1n2(1kei)\prod_{i=1}^{n-2} (1- \frac{k}{e_i})

Now we get a handle on the $e_i$: contracting an edge cannot decrease the minimum cut value. Thus for each intermediate graph throughout the run of the algorithm, the minimum cut value is always $\ge k$, and hence so is every vertex degree (a single vertex against the rest is itself a cut).

In the $i$th iteration, $\sum_{\nu = 1}^{n_i} d_{\nu} = 2e_i$, where $d_\nu$ is the degree of vertex $\nu$ in the current graph, but we just argued that $\sum_{\nu = 1}^{n_i} d_{\nu} \ge \sum_{\nu = 1}^{n_i} k = (n-i+1)k$

Thus

ei(ni+1)k/2e_i \ge (n-i+1)k/2

Therefore we can plug this bound into our product formula: the probability of returning the cut $(S, \bar{S})$ is at least

i=1n2(12ni+1)=(12n)(12n1)2413=(n2)!2!n!=1(n2)\prod_{i=1}^{n-2}(1- \frac{2}{n-i+1}) = (1-\frac{2}{n})(1-\frac{2}{n-1}) \cdots \frac{2}{4} \frac{1}{3} = \frac{(n-2)!2!}{n!}=\frac{1}{n \choose 2}

This also implies that any graph has at most (n2)n \choose 2 minimum cuts.

This is in fact tight: the cycle on nn vertices has exactly (n2)n \choose 2 global min-cuts: pick an arbitrary two edges to cut and let S,SˉS, \bar{S} be the two distinct arcs that are disconnected by removing those two edges.

Furthermore, it is possible to implement the Contraction algorithm in $O(m)$ time, but we will not give the details here.

However, the bound given by the Lemma seems a bit weak: we want a success probability much larger than $1/{n \choose 2}$, so can we do better than that?

We run the algorithm $N = \lceil {n \choose 2} \cdot \ln(1/p) \rceil$ times independently and return the smallest cut ever found. To analyze this, we make use of the following useful fact from calculus, based on the convexity of the function $e^x$.

Fact 5: For all xR,1+xexx \in \mathbb{R}, 1+x \le e^x

Proof: Define $f(x) = e^x$. Taylor's theorem gives $e^x = 1 + x + f''(y)y^2/2$ for some $y \in [0,x]$. But note that $f''(y) = e^y > 0$ for all $y \in \mathbb{R}$, so $f''(y)y^2/2 \ge 0$, thus $e^x = 1 + x + f''(y)y^2/2 \ge 1+x$.

The probability that we never find the global min cut, when repeating the algorithm $N$ times, is at most

(11/(n2))N(e1/(n2))N=eN/(n2)p\begin{split} (1 - 1/{n \choose 2})^N &\le (e^{-1/{n\choose 2}})^N \\ &=e^{-N/{n \choose 2}} \\ & \le p \end{split}

Overall runtime is $O(mN) = O(mn^2 \log(1/p))$ to succeed with probability $1-p$. This is comparable to the max-flow-based approach using Ford-Fulkerson (though we pick up a $\log(1/p)$ factor), but is much simpler.

Streaming Algorithms

  1. Make one pass over large dataset to support future queries
    1. We want a very low-memory data structure, ideally much smaller than $O(n)$
  1. Dynamic data structures

Types of problems

  1. Counting problems
  1. graph problems
    1. find a uvu \rightarrow v path
  1. frequency-based problems

Counter NN

init():
	N = 0

incr():
	N++

query():
	return N

Takes O(logN)O(\log N) bits

Can we do better?

Approximate Counting

🔥
Settle for approximate counting

Can we remember the exponent instead of the actual number?

2j1<N2j2^{j-1} < N \le 2^j

And now we only need to take O(logj)O(\log j) bits, which is O(loglogN)O(\log \log N)

What if instead of using a base of 2, we use a base of 1+ϵ1+\epsilon?

j = \lceil \log_{1+\epsilon} N \rceil, \quad \log j = \log \log_{1+\epsilon} N = \log \left(\frac{\log_2 N}{\log_{2} (1+\epsilon)}\right)

Note: If $\epsilon \ll 1$, then $\log_2 (1+\epsilon) \approx \epsilon / \ln 2 = \Theta(\epsilon)$

Therefore to remember jj, it takes around O(loglogN+log1ϵ)O(\log \log N + \log \frac{1}{\epsilon})

But how would we increment since we don’t know NN?

🔥
With each call to incr(), we increment the stored exponent $X$ with probability $1/(1+\epsilon)^X$

LEMMA: Let $X_n$ be the value of $X$ after $n$ calls to incr(); then $\mathbb{E}((1+\epsilon)^{X_n}) = \epsilon n + 1$ (for the base-2 counter, $\epsilon = 1$, this is $n+1$)

Proof (by induction on nn):

base case: $n=0$: $\mathbb{E}[(1+\epsilon)^{X_0}] = (1+\epsilon)^0 = 1 = \epsilon \cdot 0 + 1$

\begin{split} \mathbb{E}[(1+\epsilon)^{X_{n+1}}] &= \sum_{j=0}^\infin \mathbb{P}(X_n=j) \mathbb{E}[(1+\epsilon)^{X_{n+1}} |X_n = j] \\ &=\sum_{j=0}^\infin \mathbb{P}(X_n=j)\left[\frac{1}{(1+\epsilon)^j}(1+\epsilon)^{j+1}+\left(1-\frac{1}{(1+\epsilon)^j}\right)(1+\epsilon)^{j}\right] \\ &=\sum_{j=0}^\infin \mathbb{P}(X_n=j)[\epsilon + (1+\epsilon)^j] \\ &=\epsilon\underbrace{\sum_{j=0}^\infin \mathbb{P}(X_n=j)}_{1} + \underbrace{\sum_{j=0}^\infin \mathbb{P}(X_n=j) (1+\epsilon)^j}_{\mathbb{E}[(1+\epsilon)^{X_n}] = \epsilon n + 1 \text{ by induction}} \\ &=\epsilon(n+1)+1 \end{split}

So when query() is called we return $\hat{N}=\frac{(1+\epsilon)^X - 1}{\epsilon}$, which by the lemma is an unbiased estimate of $n$ (for $\epsilon = 1$ this is just $2^X - 1$).
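A sketch of this approximate counter (our own code, using the estimator just derived; for $\epsilon = 1$ it is the classic base-2 Morris counter):

```python
import random

class ApproxCounter:
    """Store only the exponent X; increment happens with probability (1+eps)^(-X),
    so E[(1+eps)^X] = eps*n + 1 after n calls to incr()."""
    def __init__(self, eps=1.0):
        self.eps = eps
        self.X = 0
    def incr(self):
        if random.random() < (1.0 + self.eps) ** (-self.X):
            self.X += 1
    def query(self):
        # unbiased estimate of the true count n
        return ((1.0 + self.eps) ** self.X - 1.0) / self.eps
```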

But what about the error? We can bound

\mathbb{P}(|\hat{N} - \mathbb{E}\hat{N}| \ge \lambda) \le \frac{Var(\hat{N})}{\lambda^2} = \frac{\frac{3}{2}N^2 + \frac{3}{2}N + 1 - (N+1)^2}{\lambda^2} = \frac{\frac{1}{2}N^2 - \frac{1}{2}N}{\lambda^2} \quad \text{(Chebyshev's inequality, shown for the base-2 case $\epsilon = 1$, where $\mathbb{E}[2^{2X_n}] = \frac{3}{2}n^2+\frac{3}{2}n+1$)}

Where λ=αEN^\lambda = \alpha \mathbb{E} \hat{N}

Optimal choice:

ϵα2log(P[N^EN^λ]1)\epsilon \approx \frac{\alpha^2}{\log(\mathbb{P}[|\hat{N} - \mathbb{E} \hat{N}| \ge \lambda]^{-1})}

Randomized Counting

init():
	X = 0

incr():
	if rand() is even:
		X++

query():
	return 2X

Distinct Counting

Return number of distinct elements in stream that consist of ints picked from {1,,n}\{ 1, \dots, n \}

Hashed Distinct Counting

  1. Pick random hash function h:{1,,n}[0,1]h: \{1, \dots, n\} \rightarrow [0,1]
    1. Assume hh is independent and uniformly random
  1. Initialize $X = 1$
    1. For each ii in stream,
      1. X=min(X,h(i))X = \min (X, h(i))
    1. $\mathbb{E}[X] = \mathbb{E}[\min_{i \in \text{stream}} h(i)] = \frac{1}{t+1}$, where $t$ is the number of distinct elements
  1. when query(): return 1X1\frac{1}{X} - 1
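A sketch of this scheme (our own code; the idealized random hash $h$ is simulated with a lazily filled table, whereas a real implementation would use a small pseudorandom hash function so that memory stays tiny):

```python
import random

class DistinctCounter:
    """Min-of-hashes sketch: track the smallest hash value seen; with t distinct
    elements, E[min] = 1/(t+1), so 1/min - 1 estimates t."""
    def __init__(self):
        self.h = {}          # stand-in for an idealized random hash {1..n} -> [0,1]
        self.X = 1.0         # smallest hash value seen so far
    def process(self, i):
        if i not in self.h:
            self.h[i] = random.random()
        self.X = min(self.X, self.h[i])
    def query(self):
        return 1.0 / self.X - 1.0
```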
🔥
But the smallest hash can be very sensitive to small turbulence

Industry standard: Hyperloglog

Lower bounds

Prove that any algorithm for a given problem has to take $\Omega(\cdot)$ time

Time Hierarchy Theorem (CS172)

For any $k > 0$, there exist problems requiring $\Omega(n^k)$ time.

e.g.

g(P,x)={Trueif P terminates on x in xk timeFalseOtherwiseg(P,x) = \begin{cases} \text{True} &\text{if $P$ terminates on $x$ in $|x|^k$ time} \\ \text{False} &\text{Otherwise} \end{cases}
🔥
For all the problems studied in this class, it is still possible, as far as we can prove, that they all have $O(n)$-time solutions

Circuit Complexity

Boolean circuit made up of AND, OR, NOT, each gate taking two inputs and having one output, and input bits x1,,xnx_1, \dots, x_n

Since an algorithm in the standard computation model can be converted into an efficient circuit,

lower bounds on circuit size ⇒ lower bounds on standard algorithms

But what does “efficient circuit” mean? Can it be quantified?

  1. circuit sizes ss: # of wires
    1. # circuits of size ss is 2O(slogs)2^{O(s \log s)}
    1. # functions f:{0,1}n{0,1}f: \{0,1\}^n \rightarrow \{0,1\} is 22n2^{2^n}
      1. The vast majority of problems need large circuits ⇒ large runtime
      1. There exists an explicit problem requiring circuits of size (3+186)n\ge (3+\frac{1}{86})n (2016)
  1. Depth dd of the circuit

Uniform Circuits:

There exists an algorithm $A$ s.t. $A(n)$ outputs the circuit $C_n$ efficiently

Multiplication

MULTn:{0,1}n×{0,1}n{0,1}2nMULT_n: \{0,1\}^n \times \{0,1\}^n \rightarrow \{0,1\}^{2n}

Spectral Graph Theory

Understanding properties of a graph by looking at the eigenvalues/eigenvectors of its associated matrices

example:

(1) adjacency matrix

Au,v={1if (u,v)E(G)0otherwiseA_{u,v} = \begin{cases} 1 &\text{if } (u,v) \in E(G) \\ 0 &\text{otherwise} \end{cases}

(2) Graph Laplacian LL

diag(L)=[d1d2dn]Rndiag(L) = \begin{bmatrix} d_1 &d_2 &\cdots &d_n \end{bmatrix}^\top \in \mathbb{R}^{n}

did_i ⇒ degree of vertex ii (# of edges connected to ii)

Lu,v=Au,vuvL_{u,v} = -A_{u,v} \forall u \ne v

"unnormalized Laplacian"

L=DAL = D-A

We also have “normalized laplacian”

Lˉ=D1/2LD1/2=ID1/2AD1/2\bar{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2}AD^{-1/2}

Random Walks

Do tt steps of random walk from vertex ii

Let ei\vec{e}_i be the standard basis vector for vertex ii

(ei)j={1if j=i0otherwise(\vec{e}_i)_j = \begin{cases} 1 &\text{if } j=i \\ 0 &\text{otherwise} \end{cases}

Basically, having x=ei\vec{x} = \vec{e}_i means we’re definitely at vertex ii

A random-walk step takes a probability distribution $x$ on $V$ to $Wx$; we write $x \rightarrow Wx$

We see:

W\vec{x} = \sum_i x_i \vec{w}_i, \quad \text{where $\vec{w}_i$ is the $i$-th column of $W$ (the distribution after one step starting from vertex $i$)}

So

W=AD1W = AD^{-1}
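A small numerical sketch of these definitions (our own example, a path graph on 3 vertices):

```python
import numpy as np

# Path graph 1 - 2 - 3 (vertices indexed 0, 1, 2)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                     # adjacency matrix
D = np.diag(A.sum(axis=1))                       # degree matrix
L = D - A                                        # unnormalized Laplacian
L_norm = np.linalg.inv(np.sqrt(D)) @ L @ np.linalg.inv(np.sqrt(D))   # normalized Laplacian
W = A @ np.linalg.inv(D)                         # random-walk matrix W = A D^{-1}

x = np.array([1., 0., 0.])                       # e_1: start the walk at vertex 1
print(W @ x)                                     # distribution after one step: [0, 1, 0]
```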

Note:

xLx=u,vxuxvLu,v=(u,v)E(xuxv)2\begin{split} \vec{x}^\top L \vec{x} &= \sum_{u,v} x_u x_v L_{u,v} \\ &=\sum_{(u,v) \in E} (x_u - x_v)^2 \\ \end{split}

We can use this to calculate the cost of having a cut

(xs)u={1if uS1if uS(\vec{x}_s)_u = \begin{cases} 1 &\text{if } u \in S \\ -1 &\text{if } u \notin S \end{cases}