Notes for 计算机与人工智能应用数学 (Applied Mathematics for Computer Science and Artificial Intelligence)
Lecture 1
A probability space \(P=(U,p)\) consists of
- Universe \(U\): a finite non-empty set.
- Probability function \(p:U\to[0,1]\) such that \(\sum\limits_{u\in U}p(u)=1\).
An event is a subset \(T\sube U\); the event happens if and only if the outcome falls inside \(T\). The probability of \(T\) is defined to be \(\text{Pr}(T)=\sum\limits_{u\in T}p(u)\).
Monty Hall problem: Switching gives \(\dfrac{2}{3}\) success probability.
Birthday paradox: Let \(U=\{(x_1,x_2,\cdots,x_n)|1\le x_k\le 365\},T=\{(x_1,x_2,\cdots,x_n)|x_j=x_k\text{ for some }j\ne k\}\), and let \(q(n)=\text{Pr}(T)\), then \(q(n)=1-(1-\dfrac{1}{365})(1-\dfrac{2}{365})\cdots(1-\dfrac{n-1}{365})\). Since \(e^{-x}\ge 1-x\) (with \(e^{-x}\approx 1-x\) when \(x\) is small), each factor \(1-\dfrac{i}{365}\le e^{-i/365}\), so \(q(n)\ge 1-\exp(-\sum\limits_{i=1}^{n-1}\dfrac{i}{365})=1-\exp(-\dfrac{n(n-1)}{730})\). So approximately when \(n>\sqrt{730\ln 2}\approx 22.5\), i.e. for \(n\ge 23\), \(q(n)>0.5\).
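A quick numerical check of the exact product formula against the exponential lower bound (a minimal sketch, my own; the exact crossover is \(n=23\)):

```python
import math

def q_exact(n, days=365):
    """Probability that some two of n people share a birthday (exact product formula)."""
    prod = 1.0
    for i in range(1, n):
        prod *= 1 - i / days
    return 1 - prod

def q_lower(n, days=365):
    """Lower bound 1 - exp(-n(n-1)/(2*days)), obtained from e^(-x) >= 1 - x."""
    return 1 - math.exp(-n * (n - 1) / (2 * days))

n = next(n for n in range(1, 366) if q_exact(n) > 0.5)
print(n, q_exact(n), q_lower(n))   # 23, ~0.507, ~0.500
```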
Online auction problem: Let strategy \(k\) be the strategy that skips the first \(k\) offers and accepts \(x_j\) for the first \(j\) satisfying \(x_j>\max\{x_1,x_2,\cdots,x_k\}\); consider the probability of success for strategy \(k\). Let \(T_j\) be the set of permutations satisfying \(x_j=n\) and \(\max\{x_1,x_2,\cdots,x_{j-1}\}=\max\{x_1,x_2,\cdots,x_k\}\), then \(\text{Pr}(T_j)=\dfrac{1}{n}·\dfrac{k}{j-1}\), so the probability of success for strategy \(k\) is \(\sum\limits_{j>k}\text{Pr}(T_j)=\dfrac{k}{n}(\dfrac{1}{k}+\dfrac{1}{k+1}+\cdots+\dfrac{1}{n-1})\). Since \(H_n=1+\dfrac{1}{2}+\dfrac{1}{3}+\cdots+\dfrac{1}{n}\approx\ln n+C\), we get \(\dfrac{k}{n}(\dfrac{1}{k}+\dfrac{1}{k+1}+\cdots+\dfrac{1}{n-1})=\dfrac{k}{n}(H_{n-1}-H_{k-1})\approx\dfrac{k}{n}\ln\dfrac{n-1}{k-1}\); choosing \(k=\lceil\dfrac{n}{e}\rceil\) makes the probability approximately \(\dfrac{1}{e}\), which is the maximum.
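A small simulation sketch of strategy \(k\) (my own; \(n\) and the trial count are illustrative):

```python
import math
import random

def success_rate(n, k, trials=20000):
    """Empirical success probability of 'skip the first k offers, then take the first record'."""
    wins = 0
    for _ in range(trials):
        x = list(range(1, n + 1))
        random.shuffle(x)
        best_seen = max(x[:k])
        chosen = None
        for j in range(k, n):
            if x[j] > best_seen:
                chosen = x[j]
                break
        wins += (chosen == n)
    return wins / trials

n = 100
k = math.ceil(n / math.e)
print(k, success_rate(n, k))   # k = 37, success rate close to 1/e ~ 0.368
```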
Union bound:
- Let \(T_1,T_2,\cdots,T_m\) be events, \(T\sube\cup_{i=1}^mT_i\), then \(\text{Pr}(T)\le\sum\limits_{i=1}^m\text{Pr}(T_i)\).
- If \(T_i\)'s are disjoint and \(T=\cup_{i=1}^mT_i\), then \(\text{Pr}(T)=\sum\limits_{i=1}^m\text{Pr}(T_i)\).
Ramsey numbers: Let \(R(k)\) be the smallest \(N\) such that among any \(N\) people, there exist either \(k\) mutual friends or \(k\) mutual strangers.
Theorem: For all \(k\ge 3\), \(R(k)\ge\lfloor 2^{k/2}\rfloor\).
Proof: Let \(n=\lfloor 2^{k/2}\rfloor\), let \(G\) be a random graph with \(n\) vertices, we prove that the probability that there exist neither \(k\) mutual friends nor \(k\) mutual strangers is larger than \(0\). Let \(T\) be this event; equivalently we show that \(\text{Pr}(\overline{T})<1\). By union bound, \(\text{Pr}(\overline{T})\le 2\dbinom{n}{k}·\dfrac{1}{2^{\binom{k}{2}}}\le 2\dfrac{n^k}{k!}·\dfrac{1}{2^{\binom{k}{2}}}\le 2\dfrac{2^{k(k/2)}}{k!}·\dfrac{1}{2^{(k^2-k)/2}}=\dfrac{2\cdot 2^{k/2}}{k!}<1\) for \(k\ge 3\). So \(\text{Pr}(T)>0\).
Conditional probability \(\text{Pr}(S|T)=\begin{cases}\dfrac{\text{Pr}(S\cap T)}{\text{Pr}(T)}&(\text{Pr}(T)>0)\\0&(\text{Pr}(T)=0)\end{cases}\).
- The chain rule: \(\text{Pr}(S_1\cap S_2\cap\cdots\cap S_m)=\prod\limits_{i=1}^m\text{Pr}(S_i|S_1\cap S_2\cap\cdots\cap S_{i-1})\).
- Distributive law: let \(T\sube W_1\cup W_2\cup\cdots\cup W_m\), then \(\text{Pr}(T)\le\sum\limits_{1\le j\le m}\text{Pr}(W_j)\text{Pr}(T|W_j)\).
Lecture 2
In a uniformly random permutation of \(\{1,2,\cdots,n\}\), the probability that \(1\) lies in a cycle of length \(s\) is \(\dfrac{1}{n}\) (for every \(1\le s\le n\)).
Proof: Let \(E_s\) be the event that \(L_1>s\), consider \(\text{Pr}(E_s|E_{s-1})\), let the first \(s-1\) elements in the cycle of \(1\) be \(i_1=1,i_2,i_3,\cdots,i_{s-1}\), then \(E_s\) happens if and only if the next element of \(i_{s-1}\) is not \(1\), so \(\text{Pr}(E_s|E_{s-1})=\dfrac{n-s}{n-s+1}\). By chain rule, \(\text{Pr}(L_1=s)=\dfrac{n-1}{n}·\dfrac{n-2}{n-1}·\cdots·\dfrac{n-(s-1)}{n-(s-2)}·\dfrac{1}{n-(s-1)}=\dfrac{1}{n}\).
Greedy clique problem: on a random graph \(G\sim G(n,\frac{1}{2})\), the greedy algorithm returns a clique \(A(G)\) with \(\log_2(n)-\log_2(\log_2(n))\le |A(G)|\le\log_2(n)+\log_2(\log_2(n))\) with probability \(1-o(1)\).
Upper bound: Let \(K=\log_2(n)+\log_2(\log_2(n))\), for \(2\le i\le n\), let \(T_i\) be the event such that the greedy algorithm selects \(i\) as the \(K\)-th vertex to join \(S\), then by distributive law, \(\text{Pr}(|S|>K)=\sum\limits_{2\le i\le n}\text{Pr}(T_i)·\text{Pr}(|S|>K|T_i)\), since \(\text{Pr}(|S|>K|T_i)\le\dfrac{n}{2^K}\), \(\text{Pr}(|S|>K)\le\dfrac{n}{2^K}=\dfrac{1}{\log n}=o(1)\).
Lower bound (Chebyshev inequality): Let \(K^-=\log_2(n)-\log_2(\log_2(n))\), let \(X_m(G)\) be the \(m\)-th vertex to join the clique, and let \(Y_m=X_{m+1}-X_m\), then the \(Y_j\) are independent geometric random variables with success probability \(b_j=\dfrac{1}{2^j}\), so \(\text{Pr}(Y_j=t)=(1-b_j)^{t-1}b_j\) for all \(t\ge 1\). Let \(X'=\sum\limits_{j=1}^{K^-}Y_j\), then we need to estimate \(\text{E}(X')\) and \(\text{Var}(X')\). Since \(\text{E}(Y_j)=2^j\), by linearity of expectation, \(\text{E}(X')=\sum\limits_{j=1}^{K^-}2^j=2^{1+K^-}-2\le\dfrac{2n}{\log_2(n)}\), \(\text{Var}(X')=\sum\limits_{j=1}^{K^-}2^{2j}(1-2^{-j})\le 2(\dfrac{n}{\log_2(n)})^2\). So for large \(n\), \(\text{Pr}(X'>n-1)\le\text{Pr}(X'-\text{E}(X')>\dfrac{n}{2})\), and by Chebyshev inequality, \(\text{Pr}(X'-\text{E}(X')>\dfrac{n}{2})\le\dfrac{\text{Var}(X')}{(\frac{n}{2})^2}\le\dfrac{8}{(\log_2(n))^2}=o(1)\).
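A small simulation sketch (my own, not from the lecture). It assumes the greedy algorithm scans the vertices in a fixed order and adds a vertex whenever it is adjacent to every vertex already selected, sampling the edges of \(G(n,\frac{1}{2})\) lazily:

```python
import math
import random

def greedy_clique_size(n, p=0.5):
    """Scan the vertices of G(n, p) in order; add a vertex to S if it is adjacent to all of S.
    Each candidate edge is examined at most once, so it can be sampled lazily on the fly."""
    size = 0
    for _ in range(n):
        if all(random.random() < p for _ in range(size)):
            size += 1
    return size

n = 1 << 14
sizes = [greedy_clique_size(n) for _ in range(20)]
print(sum(sizes) / len(sizes), math.log2(n))   # the average is close to log2(n) = 14
```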
The expectation of a random variable is defined as \(\text{E}(X)=\sum\limits_{u\in U}p(u)X(u)\).
The expected number of cycles in a permutation is \(H_n\).
Proof: For a permutation \(p\), let \(L_i\) be the length of the cycle containing \(i\), then the number of cycles in \(p\) equals \(\sum\limits_{i=1}^n\dfrac{1}{L_i}\), so the expected number of cycles in a random permutation is \(\text{E}(\sum\limits_{i=1}^n\dfrac{1}{L_i})=n\text{E}(\dfrac{1}{L_1})=n\sum\limits_{s=1}^n\dfrac{1}{s}\cdot\dfrac{1}{n}=H_n\), using that each \(L_i\) has the same distribution and \(\text{Pr}(L_1=s)=\dfrac{1}{n}\).
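A small simulation checking both facts above, i.e. \(\text{Pr}(L_1=s)=\frac{1}{n}\) for every \(s\) and the average number of cycles being \(H_n\) (my own sketch, illustrative parameters):

```python
import random

def cycle_stats(n, trials=20000):
    """Return (distribution of the length of the cycle containing 0, average number of cycles)."""
    len_counts = [0] * (n + 1)
    total_cycles = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        seen = [False] * n
        for start in range(n):
            if not seen[start]:
                total_cycles += 1
                length, j = 0, start
                while not seen[j]:
                    seen[j] = True
                    j = perm[j]
                    length += 1
                if start == 0:
                    len_counts[length] += 1
    return [c / trials for c in len_counts[1:]], total_cycles / trials

n = 10
dist, avg_cycles = cycle_stats(n)
print(dist)                                              # each entry is close to 1/n = 0.1
print(avg_cycles, sum(1 / i for i in range(1, n + 1)))   # both close to H_10 ~ 2.93
```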
Conditional expectation: \(\text{E}(X|T)=\sum\limits_{u\in T}\dfrac{p(u)X(u)}{\text{Pr}(T)}\).
Distributive law for expectation: Let \(W_1,W_2,\cdots,W_m\) be a partition of \(U\), then \(\text{E}(X)=\sum\limits_{i=1}^m\text{Pr}(W_i)E(X|W_i)\).
Mean of the geometric distribution: the expected time of the first head when flipping a \(p\)-biased coin is \(\dfrac{1}{p}\).
Proof: \(\text{E}(X)=p+(1-p)(1+\text{E}(X))\), so \(\text{E(X)}=\dfrac{1}{p}\).
Independent variables: \(X,Y\) are independent if \(\forall x,y\), \(\text{Pr}(X=x,Y=y)=\text{Pr}(X=x)\text{Pr}(Y=y)\).
Variance of \(X\): \(\text{Var}(X)=\text{E}((X-\text{E}(X))^2)=\mathbb{E}(X^2)-\mathbb{E}(X)^2\).
Standard deviation of \(X\): \(\sigma(X)=\sqrt{\text{Var}(X)}\).
If \(X\) and \(Y\) are independent, then \(\text{E}(XY)=\text{E}(X)\text{E}(Y),\text{Var}(X+Y)=\text{Var}(X)+\text{Var}(Y)\).
Tail estimates:
- Markov inequality: Let \(X\) be a non-negative random variable, then \(\text{Pr}(X\ge c\text{E}(X))\le\dfrac{1}{c}\).
- Chebyshev inequality: \(\text{Pr}(|X-\text{E}(X)|\ge c\sigma(X))\le\dfrac{1}{c^2}\). (not tight, but very general)
Lecture 3
i.i.d. coin flips: \(n\) independent tosses of a fair coin; let \(X_i\) be the indicator that the \(i\)-th toss is heads and \(X=\sum\limits_iX_i\). Then \(\mathbb{E}(X)=\dfrac{n}{2}\), \(\text{Var}(X)=\sum\limits_{i}\text{Var}(X_i)=\dfrac{n}{4}\), so according to Chebyshev inequality, \(\text{Pr}(|X-\mu|\ge c\cdot\sigma)\le\dfrac{1}{c^2}\), where \(\mu=\dfrac{n}{2},\sigma=\dfrac{\sqrt{n}}{2}\). However, this upper bound is not tight (when \(c=10\), \(\text{RHS}=\dfrac{1}{100}\), while the probability is actually \(\le e^{-20}\)).
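A quick numerical comparison of the exact binomial tail with the Chebyshev bound for the \(c=10\) example (my own sketch):

```python
import math

n, c = 10000, 10
mu, sigma = n / 2, math.sqrt(n) / 2
lo, hi = mu - c * sigma, mu + c * sigma

# Exact tail Pr(|X - mu| >= c*sigma) for X ~ Binomial(n, 1/2), with an incrementally updated C(n, k).
tail, coef = 0, 1                      # coef equals C(n, k) at the start of iteration k
for k in range(n + 1):
    if k <= lo or k >= hi:
        tail += coef
    coef = coef * (n - k) // (k + 1)
print(tail / 2 ** n, 1 / c ** 2)       # exact tail ~1e-23, vastly smaller than Chebyshev's 0.01
```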
A more powerful tool: Chernoff bound: Let \(X_1,X_2,\cdots,X_n\) be independent coin tosses where \(\text{Pr}(X_i=1)=b_i,\text{Pr}(X_i=0)=1-b_i\), \(X=\sum\limits_{i=1}^nX_i\), then \(\text{Pr}(X\ge(1+\delta)\mu)\le\exp(-\dfrac{\delta^2}{2+\delta}\mu),\text{Pr}(X\le(1-\delta)\mu)\le\exp(-\dfrac{\delta^2}{2}\mu)\).
Proof: First we prove that \(\exp(-\dfrac{\delta^2}{2+\delta}\mu)\ge(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\):
\[\begin{aligned} &\exp(-\dfrac{\delta^2}{2+\delta}\mu)\ge(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\\ \Leftrightarrow&-\dfrac{\delta^2}{2+\delta}\mu\ge\mu(\delta-(1+\delta)\ln(1+\delta))\\ \Leftrightarrow&(1+\delta)\ln(1+\delta)\ge\delta+\dfrac{\delta^2}{2+\delta}\\ \Leftrightarrow&\ln(1+\delta)\ge\dfrac{2\delta}{2+\delta} \end{aligned} \]Since when \(\delta=0\), \(\ln(1+\delta)=\dfrac{2\delta}{2+\delta}\), so we only need to prove \(\dfrac{\mathrm d}{\mathrm d\delta}(\ln(1+\delta))\ge\dfrac{\mathrm d}{\mathrm d\delta}(\dfrac{2\delta}{2+\delta})\). Since \(\dfrac{\mathrm d}{\mathrm d\delta}(\ln(1+\delta))=\dfrac{1}{1+\delta}\), \(\dfrac{\mathrm d}{\mathrm d\delta}(\dfrac{2\delta}{2+\delta})=\dfrac{4}{(2+\delta)^2}\), \(\dfrac{1}{1+\delta}-\dfrac{4}{(2+\delta)^2}=\dfrac{\delta^2}{(1+\delta)(2+\delta)^2}\ge 0\). So the inequality holds.
Next let's prove \(\text{Pr}(X\ge(1+\delta)\mu)\le(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\). Let \(t>0\) be a parameter, then \(\text{Pr}(X\ge(1+\delta)\mu)=\text{Pr}(e^{tX}\ge e^{t(1+\delta)\mu})\le\dfrac{\mathbb{E}(e^{tX})}{e^{t(1+\delta)\mu}}\) by Markov inequality. By independence and \(1+x\le e^x\), \(\mathbb{E}(e^{tX})=\prod\limits_{i=1}^n\mathbb{E}(e^{tX_i})=\prod\limits_{i=1}^n(1+b_i(e^t-1))\le\prod\limits_{i=1}^ne^{b_i(e^t-1)}\le e^{\mu(e^t-1)}\), so \(\text{Pr}(X\ge(1+\delta)\mu)\le e^{f(t)}\) where \(f(t)=\mu(e^t-1)-t(1+\delta)\mu\). The minimum of \(f(t)\) is \(f(\ln(1+\delta))=\mu\delta-\mu(1+\delta)\ln(1+\delta)\). So \(\text{Pr}(X\ge(1+\delta)\mu)\le(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\).
Chernoff bound for the mean: with \(\bar{X}=\dfrac{1}{n}\sum\limits_{i=1}^nX_i\) and \(\mu=\mathbb{E}(\bar{X})\), \(\text{Pr}(|\bar{X}-\mu|\ge\epsilon)\le 2\exp(-\dfrac{\epsilon^2}{2+\epsilon}·n)\).
Hoeffding inequality: let \(Z_1,Z_2,\cdots,Z_n\) be independent bounded random variables with \(Z_i\in[a,b]\), then for all \(t\ge 0\), \(\text{Pr}(|\dfrac{1}{n}\sum\limits_{i=1}^n(Z_i-\mathbb{E}(Z_i))|\ge t)\le 2\exp(-\dfrac{2nt^2}{(b-a)^2})\).
Hoeffding lemma: Let \(Z\) be a bounded random variable with \(Z\in[a,b]\), then \(\mathbb{E}(\exp(t(Z-\mathbb{E}(Z))))\le\exp(\dfrac{t^2(b-a)^2}{8})\).
Negatively associated random variables: \(X_1,X_2,\cdots,X_n\) are negatively associated if and only if \(\mathbb{E}(\exp(\sum X_i))\le\prod\mathbb{E}(\exp(X_i))\).
Martingale: \(Y_i=\mathbb{E}(f(X_1,X_2,\cdots,X_n)|X_1,X_2,\cdots,X_i)\) (the Doob martingale of \(f\)). Equivalent general definition: \((Y_i)\) is a martingale with respect to \((X_i)\) if \(Y_i\) is determined by \(X_1,X_2,\cdots,X_i\) and \(\mathbb{E}(Y_i|X_1,X_2,\cdots,X_{i-1})=Y_{i-1}\).
Azuma's inequality: If \(Y_i=\mathbb{E}(f(X_1,X_2,\cdots,X_n)|X_1,X_2,\cdots,X_i)\) is a martingale with \(|Y_i-Y_{i-1}|\le c_i\), then for any \(t\ge 0\), \(\text{Pr}(|f(X_1,X_2,\cdots,X_n)-Y_0|\ge t)\le 2\exp(-\dfrac{t^2}{2\sum c_i^2})\) (note that \(Y_n=f(X_1,X_2,\cdots,X_n)\)).
Picking the best robot: there are \(k\) robots with unknown accuracies \(Acc_1,\cdots,Acc_k\); test each robot on \(n\) independent examples, let \(\hat{Acc_i}\) be its empirical accuracy, and output the robot with the largest \(\hat{Acc_i}\). How large does \(n\) need to be?
Analysis: If \(\forall i,|\hat{Acc_i}-Acc_i|\le\dfrac{\epsilon}{2}\), then the robot with the best empirical accuracy has true accuracy within \(\epsilon\) of the best. So let \(E=\{\forall i,|\hat{Acc_i}-Acc_i|\le\dfrac{\epsilon}{2}\}\), \(D_i=\{|\hat{Acc_i}-Acc_i|>\dfrac{\epsilon}{2}\}\), then \(\bar{E}\sube D_1\cup D_2\cup\cdots\cup D_k\), so by union bound, \(\text{Pr}(\bar{E})\le\sum\limits_{i=1}^k\text{Pr}(D_i)\le k\max_i\text{Pr}(D_i)\). By Chernoff bound, \(\text{Pr}(D_i)\le\exp(-\dfrac{(\epsilon/2)^2}{\epsilon/2+2}·n)\le\exp(-\dfrac{\epsilon^2}{100}n)\) (assuming that \(2\epsilon+8\le 100\)). To get \(\text{Pr}(\bar{E})\le\epsilon\), we should let \(\exp(-\dfrac{\epsilon^2}{100}n)\le\dfrac{\epsilon}{k}\), i.e. \(n=O(\dfrac{\ln(\frac{k}{\epsilon})}{\epsilon^2})\).
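A small sketch of the resulting sample-size rule together with a simulation of the selection procedure (my own; the accuracies in `true_accs` are hypothetical):

```python
import math
import random

def samples_needed(k, eps):
    """Smallest n with exp(-eps^2 * n / 100) <= eps / k, i.e. the bound derived above."""
    return math.ceil(100 * math.log(k / eps) / eps ** 2)

def pick_best(true_accs, n):
    """Test every robot on n Bernoulli trials and return the index with the best empirical accuracy."""
    estimates = [sum(random.random() < acc for _ in range(n)) / n for acc in true_accs]
    return max(range(len(true_accs)), key=lambda i: estimates[i])

k, eps = 10, 0.1
n = samples_needed(k, eps)
true_accs = [0.5 + 0.03 * i for i in range(k)]     # hypothetical accuracies; the last robot is best
picked = pick_best(true_accs, n)
print(n, picked, true_accs[-1] - true_accs[picked] <= eps)   # True with probability >= 1 - eps
```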
Lecture 4
Entropy is a universal measure of randomness.
Definition of entropy: \(H(X)=-\sum\limits_{x}\text{Pr}(X=x)\log_2\text{Pr}(X=x)\)
For binary random variables, \(H(p)=-p\log_2p-(1-p)\log_2(1-p)\)
Lemma 1: if \(nq\) is an integer in \([0,n]\), then \(\dfrac{2^{nH(q)}}{n+1}\le\dbinom{n}{nq}\le 2^{nH(q)}\).
Proof: The statement is trivial if \(q=0\) or \(1\), so assume that \(0<q<1\).
To prove the upper bound, since \(\sum\limits_{k=0}^n\dbinom{n}{k}q^k(1-q)^{n-k}=(q+(1-q))^n=1\), so \(\dbinom{n}{nq}\le q^{-nq}(1-q)^{-(1-q)n}=2^{nH(q)}\).
To prove the lower bound, we know that \(\dbinom{n}{nq}q^{qn}(1-q)^{(1-q)n}\) is one term of the expression \(\sum\limits_{k=0}^n\dbinom{n}{k}q^k(1-q)^{n-k}\), and we show that it is the largest term. Consider the difference between two consecutive terms:
\[\begin{aligned} &\dbinom{n}{k}q^k(1-q)^{n-k}-\dbinom{n}{k+1}q^{k+1}(1-q)^{n-k-1}\\ =&\dbinom{n}{k}q^k(1-q)^{n-k}(1-\dfrac{q}{1-q}\cdot\dfrac{n-k}{k+1}) \end{aligned} \]The difference is non-negative if and only if \(1-\dfrac{q}{1-q}\cdot\dfrac{n-k}{k+1}\ge 0\), i.e. \(k\ge qn-1+q\); so the terms increase up to \(k=qn\) and decrease afterwards, and \(k=qn\) gives the largest term in the summation.
So \(\dbinom{n}{nq}\ge\dfrac{q^{-qn}(1-q)^{-(1-q)n}}{n+1}=\dfrac{2^{nH(q)}}{n+1}\).
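A quick numerical check of Lemma 1 (my own sketch; \(n\) and \(q\) are illustrative):

```python
import math

def H(q):
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

n, q = 100, 0.3
binom = math.comb(n, int(n * q))
upper = 2 ** (n * H(q))
print(upper / (n + 1) <= binom <= upper)   # True
```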
Actually the number of sequences with \(nq\) heads is very close to the upper bound: by Lemma 1 it is at least \(2^{nH(q)-\log_2(n+1)}\).
Entropy measures how many unbiased, independent bits can be extracted from a random variable.
Let \(X\) be a random variable supported on \(\mathcal X\). An extraction function \(\text{Ext}:\mathcal X\to\{0,1\}^*\) outputs a random sequence of bits that is uniform conditioned on its length: for every \(k\) with \(\text{Pr}(|\text{Ext}(X)|=k)>0\) and every \(y\in\{0,1\}^k\), \(\text{Pr}(\text{Ext}(X)=y\,\big|\,|\text{Ext}(X)|=k)=\dfrac{1}{2^k}\).
Theorem: Consider a coin that comes up heads with probability \(p>\dfrac{1}{2}\), then for any \(\delta>0\) and sufficiently large \(n\) we have:
- The average number of bits output by any deterministic extraction function on an input sequence of \(n\) independent flips is at most \(nH(p)\).
- There exists an extraction function \(\text{Ext}(\cdot)\) that outputs, on a sequence of \(n\) independent flips, an average of at least \((1-\delta)nH(p)\) independent random bits.
Proof:
Upper bound: First, if an input sequence \(x\) occurs with probability \(q\), then \(|\text{Ext}(x)|\le\log_2(\dfrac{1}{q})\). So if \(B\) is a random variable representing the number of bits our extraction function produces on input \(X\), then
\[\begin{aligned} \mathbb{E}[B]&=\sum\limits_{x}\text{Pr}(X=x)|\text{Ext}(x)|\\ &\le\sum\limits_{x}\text{Pr}(X=x)\log_2(\dfrac{1}{\text{Pr}(X=x)})\\ &=H(X)\\ &=nH(p) \end{aligned} \]Lower bound:
Lemma 2: Suppose that the value of \(X\) is chosen uniformly at random from the integers \(\{0,1,2,\cdots,m-1\}\), so that \(H(X)=\log_2m\). Then there is an extraction function for \(X\) that outputs on average at least \(\lfloor\log_2m\rfloor-1\) independent and unbiased bits.
So let \(Z\) be the number of heads and \(B\) be the number of bits our extraction function produces, then \(\mathbb{E}(B)=\sum\limits_{k=0}^n\text{Pr}(Z=k)\mathbb{E}(B|Z=k)\). If given \(Z=k\), then the sequence can be seen as uniformly random chosen from \(\dbinom{n}{k}\) possible outcomes, so by the lemma 2, \(\mathbb{E}(B|Z=k)\ge\lfloor\log_2\dbinom{n}{k}\rfloor-1\). By lemma \(1\), for some small \(\epsilon\) to be determined and \(n(p-\epsilon)\le k\le n(p+\epsilon)\), we have \(\dbinom{n}{k}\ge\dbinom{n}{\lfloor n(p+\epsilon)\rfloor}\ge\dfrac{2^{nH(p+\epsilon)}}{n+1}\), so
\[\begin{aligned} \mathbb{E}(B)\ge&\sum\limits_{k=\lfloor n(p-\epsilon)\rfloor}^{\lfloor n(p+\epsilon)\rfloor}\text{Pr}(Z=k)\mathbb{E}(B|Z=k)\\ \ge&\sum\limits_{k=\lfloor n(p-\epsilon)\rfloor}^{\lfloor n(p+\epsilon)\rfloor}\text{Pr}(Z=k)(nH(p+\epsilon)-\log_2(n+1)-1)\\ =&(nH(p+\epsilon)-\log_2(n+1)-1)\text{Pr}(|Z-np|\le\epsilon n) \end{aligned} \]Since \(\mathbb{E}(Z)=np\), thus by Chernoff bound, \(\text{Pr}(|Z-np|\le\epsilon n)\ge 1-2e^{-n\epsilon^2/3p}\). So for any \(\delta>0\), we can have \(\mathbb{E}(B)\ge(1-\delta)nH(p)\) by choosing \(\epsilon\) sufficiently small and \(n\) sufficiently large.
Compression: reduces the number of bits needed to represent data by exploiting its likelihood structure.
A compression function \(\text{Com}:\{0,1\}^*\to\{0,1\}^*\) takes a sequence of binary bits as input and outputs a sequence of binary bits such that \(\forall x\ne x'\), \(\text{Com}(x)\ne\text{Com}(x')\).
Theorem: Consider a coin that comes up heads with probability \(p>\dfrac{1}{2}\). For any constant \(\delta>0\), when \(n\) is sufficiently large:
- There exists a compression function \(\text{Com}(\cdot)\) such that \(\mathbb{E}(|\text{Com}(x)|)\le(1+\delta)nH(p)\).
- For any compression function, \(\mathbb{E}(|\text{Com}(x)|)\ge(1-\delta)nH(p)\).
Shannon Theorem: Through a noisy channel (each bit will be flipped with probability \(p\)), the sender can reliably send messages about \(k=n(1-H(p))\) bits within each block of \(n\) bits.
Background:
- The sender takes a \(k\)-bit message and encodes it into a block of \(n\ge k\) bits via the encoding function.
- These bits are sent over the noisy channel.
- The receiver attempts to determine the original \(k\)-bit message using the decoding function.
Formal description of Shannon Theorem: For a binary symmetric channel with parameter \(p<\dfrac{1}{2}\) and any \(\delta,\epsilon>0\), when \(n\) is sufficiently large:
- For any \(k\le n(1-H(p)-\delta)\), there exists a \((k,n)\) encoding and decoding functions such that the success probability \(\ge 1-\epsilon\).
- No \((k,n)\) encoding and decoding functions exist for \(k\ge n(1-H(p)+\delta)\) such that the success probability \(\ge 1-\epsilon\).
Proof: Let \(C=\{c_1,c_2,\cdots,c_M\}\) be some arbitrary collection of distinct codewords with length \(n\). If we send \(c_i\) through the channel, we will receive some \(\tilde{c_i}\). Since \(\mathbb{E}(d_H(c_i,\tilde{c_i}))=np\), by Chernoff bound, there is some \(\gamma\) such that with probability \(1-\dfrac{\epsilon}{2}\), \((p-\gamma)n\le d_H(c_i,\tilde{c_i})\le (p+\gamma)n\). Choose \(\gamma\) as small as possible and define \(\text{Ring}(c_i)=\{c:|d_H(c_i,c)-np|\le\gamma n\}\).
If \(\tilde{c_i}\in\text{Ring}(c_i)\), but \(\forall j\ne i\), \(\tilde{c_j}\notin\text{Ring}(c_i)\), then \(c_i\) can be successfully decoded. We denote this as \(\text{Success}(c_i)\). If \(\forall i\), \(\text{Success}(c_i)\) holds, then the decoding algorithm is successful.
How to choose a codebook such that the success probability is high?
First we should know how to calculate the volume of \(\text{Ring}(c_i)\). In fact, as \(n\to\infty\), the volume of \(\text{Ring}(c_i)\) is at most \(2^{(H(p)+\delta')n}\), with \(\delta'\to 0\).
Proof for this: By Chernoff bound, \(\text{Pr}(|d_H(c_i,\tilde{c_i})-\mu|>\delta\mu)<2\exp(-\dfrac{1}{3}\delta^2\mu)\), where \(\mu=np\) and \(\delta\) is the relative deviation. To make this probability less than \(\dfrac{\epsilon}{2}\), it suffices that \(2\exp(-\dfrac{1}{3}\delta^2\mu)\le\dfrac{\epsilon}{2}\), i.e. \(\delta\ge\sqrt{\dfrac{3\ln(\frac{4}{\epsilon})}{\mu}}=O(\dfrac{1}{\sqrt{n}})\), so \(\gamma=p\delta=O(\dfrac{1}{\sqrt{n}})\) when \(n\) gets large.
Let \(L=\lceil np-n\gamma\rceil,R=\lfloor np+n\gamma\rfloor\). Then \(\text{Vol}(\text{Ring}(c_i))=\sum\limits_{i=L}^R\dbinom{n}{i}\). Since \(p<\dfrac{1}{2}\), \(\gamma=O(\dfrac{1}{\sqrt{n}})\), so when \(n\) gets large enough, \(R<\dfrac{n}{2}\), \(\text{Vol}(\text{Ring}(c_i))\le (R-L+1)\dbinom{n}{R}\le n\dbinom{n}{R}\). On the other hand, \(\dbinom{n}{R}=\dbinom{n}{\lfloor n(p+\gamma)\rfloor}\le 2^{nH(p+\gamma)}\). And \(H(p+\gamma)=-(p+\gamma)\log_2(p+\gamma)-(1-p-\gamma)\log_2(1-p-\gamma)\); when \(n\to\infty\), \(\gamma\to 0\), so as \(n\) grows large, \(H(p+\gamma)=-(p+\gamma)(\log_2(p)+o(1))-(1-p-\gamma)(\log_2(1-p)-o(1))=-p\log_2(p)-(1-p)\log_2(1-p)+O(\dfrac{1}{\sqrt{n}})=H(p)+O(\dfrac{1}{\sqrt{n}})\). So \(\text{Vol}(\text{Ring}(c_i))\le n2^{n(H(p)+O(\frac{1}{\sqrt{n}}))}=2^{n(H(p)+O(\frac{1}{\sqrt{n}})+\frac{\log_2 n}{n})}=2^{n(H(p)+\delta')}\), where \(\delta'\to 0\) as \(n\to\infty\).
According to this, let's now choose the codebook \(C\) we desired. Let \(M=2\cdot 2^k\), let \(C\) be a codebook with \(M\) codewords uniformly chosen from \(\{0,1\}^n\). Fix some \(i\), consider the probability such that \(\text{Success}(c_i)\) does not happen. With probability \(1-\dfrac{\epsilon}{2}\), \(\tilde{c_i}\in\text{Ring}(c_i)\). And for some \(j\ne i\), \(\text{P}(\tilde{c_j}\in\text{Ring}(c_i))=\dfrac{\text{Vol}(\text{Ring}(c_i))}{2^n}=2^{(H(p)+\delta'-1)n}\). By union bound, \(\text{Pr}(\text{Success}(c_i)\text{ does not happen})\le\dfrac{\epsilon}{2}+M2^{(H(p)+\delta'-1)n}=\dfrac{\epsilon}{2}+2^{1+(k/n+H(p)-1+\delta')n}\). Since \(\dfrac{k}{n}<1-H(p)\), so the second term goes to \(0\) as \(n\to\infty\), so \(\text{Pr}(\text{Success}(c_i)\text{ does not happen})<\epsilon\) if \(C\) and \(i\) are both chosen uniformly random from all possible choices.
So \(\dfrac{1}{\text{#possible codebooks}}\sum\limits_{C}\dfrac{1}{M}\sum\limits_{i=1}^M\text{Pr}(\text{Success}(c_i)\text{ does not happen})\le\epsilon\). So there must be a choice of codebook \(C^*\) which does better than average. Let \(C'\) be the half of \(C^*\) consisting of the codewords with the smallest failure probability; by an averaging (Markov) argument, every codeword in \(C'\) has failure probability \(\le 2\epsilon\). So \(C'\) provides a good codebook with \(2^k\) codewords.
Lecture 5
Maximum clique problem: Let \(G\sim G(n,\frac{1}{2})\) be a random graph on \(n\) vertices, and let \(w(G)\) be the size of the largest clique in \(G\); we prove that when \(n\) is large enough, \(w(G)\) is very close to \(2\log_2(n)\) with high probability.
Proof:
Lower bound: Let \(m=(2-\epsilon)\log_2(n)\), and \(T\) be the event that \(w(G)\ge m\), \(M\) be the family of vertex subsets of size \(m\), \(A_V=[V\text{ is a clique}]\). So we only need to prove that \(\text{Pr}(T)=1-o(1)\). Now consider \(X=\sum\limits_{V\in M}A_V\), thus \(\text{Pr}(T)=\text{Pr}(X>0)\). To prove this, we prove the following two things:
- \(\mathbb{E}(X)\to\infty\) as \(n\to\infty\)
- \(\text{Var}(X)=(\mathbb{E}(X)^2)\cdot o(1)\) as \(n\to\infty\).
If these two statements are proved, then by Chebyshev inequality, \(\text{Pr}(X=0)\le\text{Pr}(|X-\mathbb{E}(X)|\ge\dfrac{1}{2}\mathbb{E}(X))\le\dfrac{\text{Var}(X)}{(\frac{1}{2}\mathbb{E}(X))^2}=o(1)\).
For (1). Apply Stirling's approximation: \(n!\sim\sqrt{2\pi n}(\dfrac{n}{e})^n\). So
\[\begin{aligned} &\mathbb{E}(X)\\ =&\dbinom{n}{m}\cdot\dfrac{1}{2^{\binom{m}{2}}}\\ =&\dfrac{n!}{m!(n-m)!}\cdot\dfrac{1}{2^{\binom{m}{2}}}\\ \approx&\dfrac{\sqrt{2\pi n}(\frac{n}{e})^n}{\sqrt{2\pi m}(\frac{m}{e})^m\cdot\sqrt{2\pi(n-m)}(\frac{n-m}{e})^{n-m}}\cdot\dfrac{1}{2^{\binom{m}{2}}}\\ \ge&\Omega(\dfrac{n^m}{\sqrt{2\pi m}(\frac{m}{e})^m}\cdot\dfrac{1}{2^{\binom{m}{2}}})\\ =&\Omega((\dfrac{en}{(2\pi m)^{\frac{1}{2m}}\cdot m}\cdot \dfrac{1}{2^{\frac{m-1}{2}}})^m)\\ \end{aligned} \]Since \(m=(2-\epsilon)\log_2(n)\), and \(\lim\limits_{m\to\infty}(2\pi m)^{\frac{1}{2m}}=1\), so
\[\begin{aligned} &\Omega((\dfrac{en}{(2\pi m)^{\frac{1}{2m}}\cdot m}\cdot \dfrac{1}{2^{\frac{m-1}{2}}})^m)\\ =&\Omega((\dfrac{0.01n}{\log_2n}\cdot \dfrac{1}{n^{1-\frac{1}{2}\epsilon}})^m)\\ =&\Omega((\dfrac{0.01n^{\frac{1}{2}\epsilon}}{\log_2n})^{\log_2n})\\ =&n^{\Omega(\log_2n)} \end{aligned} \]Since \(n^{\Omega(\log_2n)}\to\infty\) as \(n\to\infty\), \(\mathbb{E}(X)\to\infty\) as \(n\to\infty\).
For (2).
\[\begin{aligned} \text{Var}(X)&=\mathbb{E}\Big(\big(\sum\limits_{V\in M}A_V\big)^2\Big)-(\mathbb{E}(X))^2\\ &\le\mathbb{E}\Big(\sum\limits_{V}\sum\limits_{V'}A_VA_{V'}\Big)-\sum\limits_{V}\sum\limits_{|V\cap V'|\le 1}\mathbb{E}(A_V)\mathbb{E}(A_{V'})\\ &=\sum\limits_{V}\mathbb{E}(A_V)+\sum\limits_{V}\sum\limits_{|V\cap V'|\le 1}\mathbb{E}(A_VA_{V'})+\sum\limits_{V}\sum\limits_{V'\ne V,|V\cap V'|>1}\mathbb{E}(A_VA_{V'})-\sum\limits_{V}\sum\limits_{|V\cap V'|\le 1}\mathbb{E}(A_V)\mathbb{E}(A_{V'})\\ \end{aligned} \]Since when \(|V\cap V'|\le 1\) the two vertex sets share no potential edge, \(\mathbb{E}(A_VA_{V'})=\mathbb{E}(A_V)\mathbb{E}(A_{V'})\), thus
\[\begin{aligned} \text{Var}(X)&\le\mathbb{E}(X)+\sum\limits_{2\le k\le m}\sum\limits_{V}\sum\limits_{|V\cap V'|=k}\mathbb{E}(A_VA_{V'})\\ &=\mathbb{E}(X)+\sum\limits_{2\le k\le m}\sum\limits_{V}\sum\limits_{|V\cap V'|=k}\text{Pr}(A_V=1)\text{Pr}(A_{V'}=1|A_V=1)\\ &=\mathbb{E}(X)+\sum\limits_{2\le k\le m}\sum\limits_{V}\text{Pr}(A_V=1)\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\\ &=\mathbb{E}(X)+\mathbb{E}(X)\cdot\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}} \end{aligned} \]Now let's focus on \(\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\). We prove that \(\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\le\dfrac{m^5}{n-m+1}\mathbb{E}(X)\) as follows:
Since \(\mathbb{E}(X)=\dfrac{\dbinom{n}{m}}{2^{\binom{m}{2}}}\), thus first we have
\[ \begin{aligned} \sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}&\le\dfrac{m^5}{n-m+1}\mathbb{E}(X)\\ \Leftrightarrow\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}&\le\dfrac{m^5}{n-m+1}\dfrac{\dbinom{n}{m}}{2^{\binom{m}{2}}}\\ \Leftrightarrow\dfrac{1}{\dbinom{n}{m}}\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}2^{\binom{k}{2}}&\le\dfrac{m^5}{n-m+1} \end{aligned} \]Now let's focus on the term on the left side of the \(\le\) term. Let \(a_k=\dbinom{m}{k}\dbinom{n-m}{m-k}2^{\binom{k}{2}}\), then
\[\begin{aligned} &\dfrac{a_{k+1}}{a_k}\\ =&\dfrac{\binom{m}{k+1}\binom{n-m}{m-k-1}2^{\binom{k+1}{2}}}{\binom{m}{k}\binom{n-m}{m-k}2^{\binom{k}{2}}}\\ =&\dfrac{(m-k)^2\cdot 2^k}{(k+1)(n-2m+k+1)} \end{aligned} \]Roughly speaking, when \(n\) gets large enough, this ratio is \(>1\) for \(k>\log_2n\) and \(<1\) for \(k<\log_2(n)\), so the sequence \(a_k\) first decreases and then increases, and \(\max\limits_{2\le k\le m}a_k\) is reached at either \(k=2\) or \(k=m\). We only need to prove that \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_2\le\dfrac{m^5}{n-m+1}\) and \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_m\le\dfrac{m^5}{n-m+1}\).
For \(a_2\), we have
\[\begin{aligned}&\dfrac{1}{\dbinom{n}{m}}\cdot ma_2\\=&\dfrac{1}{\dbinom{n}{m}}\cdot m\dbinom{m}{2}\dbinom{n-m}{m-2}\cdot 2\\\le&m^3\cdot\dfrac{\dbinom{n-m}{m-2}}{\dbinom{n}{m}}\\=&m^3\dfrac{\frac{(n-m)(n-m-1)\cdots(n-2m+3)}{(m-2)(m-3)\cdots 1}}{\frac{n(n-1)\cdots(n-m+1)}{m(m-1)\cdots 1}}\\=&m^3\cdot m(m-1)\cdot\dfrac{(n-m)(n-m-1)\cdots(n-2m+3)}{n(n-1)\cdots(n-m+1)}\\\le&\dfrac{m^5}{n-m+1}\end{aligned} \]For \(a_m\), we have \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_m=\dfrac{m2^{\binom{m}{2}}}{\dbinom{n}{m}}=\dfrac{m}{\mathbb{E}(X)}\). Since \(\mathbb{E}(X)=n^{\Omega(\log_2 n)}\), apparently when \(n\) is large enough, \(\dfrac{m}{\mathbb{E}(X)}\le\dfrac{m^5}{n-m+1}\). So \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_m\le\dfrac{m^5}{n-m+1}\).
So \(\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\le\dfrac{m^5}{n-m+1}\mathbb{E}(X)\).
So since \(\mathbb{E}(X)\to\infty\) as \(n\to\infty\), replacing \(m\) with \((2-\epsilon)\log_2n\) we have \(\text{Var}(X)\le\mathbb{E}(X)+\dfrac{64(\log_2n)^5}{n}\mathbb{E}(X)^2\le\dfrac{128(\log_2n)^5}{n}\mathbb{E}(X)^2=\mathbb{E}(X)^2\cdot o(1)\), which completes the proof.
Network routing problem: Every node \(i\) on a hypercube sends a message \(M_i\) to some \(\sigma(i)\), however, only one message can pass through every directed edge at a time, how to get all messages successfully delivered within a reasonably short time?
Bit fixing algorithm: fix the bits from left to right, i.e., for \(i=1,2,\cdots,n\): if the \(i\)-th bit of the current node does not equal the \(i\)-th bit of the destination \(\sigma(\cdot)\), then flip the \(i\)-th bit and travel along the corresponding edge.
This routing algorithm has exponential delay in the worst case. Let \(n\) be odd and define \(\sigma(u0v)=v1u\), then the path from \(u0^{\frac{n+1}{2}}\) to \(0^{\frac{n-1}{2}}1u\) must contain the edge \((0^n,0^{\frac{n-1}{2}}10^{\frac{n-1}{2}})\), hence at least \(2^{\frac{n-1}{2}}\) messages need to travel through this edge.
Randomized BFA: generate a random \(v_i\), use BFA to route message \(M_i\) from \(i\) to \(v_i\), at time \(6n\) use BFA to route message \(M_i\) from \(v_i\) to \(\sigma(i)\).
Analysis: We prove that the success probability is larger than \(1-O(2^{-3n})\). It suffices to prove that in phase \(1\) and \(2\), the probability for any \(M_i\) not to reach the destination in time \(6n\) is \(O(2^{-3n})\). Let \(T_i\) be the arrival time for message \(M_i\) to reach \(v_i\) in phase \(1\), then fix some \(i\), if we can prove that \(\text{Pr}(T_i>6n)\) is \(O(2^{-4n})\), then by union bound, \(\text{Pr}(\exists i,T_i>6n)\le |V|\cdot O(2^{-4n})=O(2^{-3n})\).
Now our goal is to prove that \(\text{Pr}(T_i>6n)=O(2^{-4n})\). Let \(S=\{j|j\ne i,\text{Path}(j,v_j)\cap\text{Path}(i,v_i)\ne\varnothing\}\), notice that \(T_i\le d_H(i,v_i)+|S|\), so we only need to bound \(|S|\). Let \(Y_e\) be the number of paths that pass through \(e\), and assume that \(\text{Path}(i,v_i)=(e_1,e_2,\cdots,e_l)\), then \(|S|\le\sum\limits_{j=1}^lY_{e_j}\). Since \(\mathbb{E}(Y_e)=\dfrac{1}{2}\) for each directed edge (the expected total path length \(2^n\cdot\frac{n}{2}\) spread over the \(n2^n\) directed edges), \(\mathbb{E}(|S|)\le\dfrac{n}{2}\), so by Chernoff bound \(\text{Pr}(|S|>5n)<2^{-4n}\), which completes the proof.
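A minimal sketch of the bit-fixing path and the two-phase randomized route (my own illustration; node addresses are integers and the bit order is a representation choice):

```python
import random

def bit_fixing_path(src, dst, n):
    """Nodes visited when fixing the bits of src toward dst one position at a time."""
    path, cur = [src], src
    for i in range(n):
        if (cur >> i) & 1 != (dst >> i) & 1:
            cur ^= 1 << i
            path.append(cur)
    return path

def randomized_route(src, dst, n):
    """Phase 1: route src -> random intermediate v; phase 2: route v -> dst (both by bit fixing)."""
    v = random.randrange(1 << n)
    return bit_fixing_path(src, v, n), bit_fixing_path(v, dst, n)

n = 5
p1, p2 = randomized_route(0b10101, 0b00110, n)
print([format(x, '05b') for x in p1], [format(x, '05b') for x in p2])
```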
Monte Carlo methods: generate random samples \(X_1,X_2,\cdots,X_n\) from a distribution, and use the sample mean \(\dfrac{1}{n}\sum\limits_{i=1}^nf(X_i)\) to estimate \(\mathbb{E}(f(X))\). (main challenge: how to sample)
Importance sampling: to estimate \(\mathbb{E}_{x\sim p}[f(x)]\) when sampling from \(p\) directly is hard, use some proposal distribution \(q\) and the estimator \(\hat{I_n}=\dfrac{1}{n}\sum\limits_{i=1}^nf(y_i)\dfrac{p(y_i)}{q(y_i)}\), where \(y_i\sim q\).
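A minimal importance-sampling sketch (my own; the target \(p=N(0,1)\), proposal \(q=N(0,2^2)\) and \(f(x)=x^2\) are illustrative, so the true value is \(1\)):

```python
import math
import random

def p_pdf(x):   # target density: standard normal
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def q_pdf(x):   # proposal density: normal with standard deviation 2
    return math.exp(-x * x / 8) / (2 * math.sqrt(2 * math.pi))

f = lambda x: x * x
n = 100000
ys = [random.gauss(0, 2) for _ in range(n)]
estimate = sum(f(y) * p_pdf(y) / q_pdf(y) for y in ys) / n
print(estimate)   # close to E_p[x^2] = 1
```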
Rejection sampling: suppose \(f(x)\le Cg(x)\); we can sample from the density \(f\) using samples from \(g\) as follows: generate \(X\sim g\) and accept \(X\) with probability \(\dfrac{f(X)}{Cg(X)}\); if accepted, output \(X\), otherwise repeat this process.
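A minimal rejection-sampling sketch with the density \(f(x)=2x\) on \([0,1]\), uniform proposal \(g\) and \(C=2\) (illustrative choices, my own):

```python
import random

def rejection_sample(f, g_sampler, g_pdf, C):
    """Draw X ~ g and accept with probability f(X) / (C * g(X)); repeat until accepted."""
    while True:
        x = g_sampler()
        if random.random() < f(x) / (C * g_pdf(x)):
            return x

f = lambda x: 2 * x
samples = [rejection_sample(f, random.random, lambda x: 1.0, 2.0) for _ in range(100000)]
print(sum(samples) / len(samples))   # close to E[X] = 2/3
```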
Lecture 6
Generating functions: let \(\lang a_k\rang\) be an infinite sequence of complex numbers, define its generating function as: \(A(x)=\sum\limits_{k=0}^{\infty}a_kx^k\).
Let \(X\) be a random variable with range \(\{0,1,2,\cdots\}\) and let \(p_k=\text{Pr}(X=k)\). Let \(A(x)=\sum\limits_{k\ge 0}p_kx^k\). Then \(\mathbb{E}(X)=A'(1),\text{Var}(X)=A''(1)+A'(1)-(A'(1))^2\).
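A worked example: for \(X\sim\text{Bin}(n,p)\),
\[A(x)=\sum\limits_{k=0}^{n}\dbinom{n}{k}p^k(1-p)^{n-k}x^k=(1-p+px)^n,\qquad A'(1)=np,\qquad A''(1)=n(n-1)p^2,\]so \(\mathbb{E}(X)=np\) and \(\text{Var}(X)=A''(1)+A'(1)-(A'(1))^2=n(n-1)p^2+np-n^2p^2=np(1-p)\).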
Use generating function to calculate the closed form of a sequence:
- Fibonacci numbers: \(a_0=a_1=1,a_n=a_{n-1}+a_{n-2}\), let \(A(x)=\sum\limits_{i=0}^{+\infty}a_ix^i\), so \(A(x)=a_0+a_1x+\sum\limits_{i=2}^{+\infty}(a_{i-1}+a_{i-2})x^i=1+x+x(A(x)-1)+x^2A(x)\). So \(A(x)=\dfrac{1}{1-x-x^2}\). Let \(\alpha=\dfrac{1+\sqrt{5}}{2},\beta=\dfrac{1-\sqrt{5}}{2}\), so \(1-x-x^2=(1-\alpha x)(1-\beta x)\) and \(A(x)=\dfrac{\alpha}{\alpha-\beta}\cdot\dfrac{1}{1-\alpha x}-\dfrac{\beta}{\alpha-\beta}\cdot\dfrac{1}{1-\beta x}\). So \(a_n=\dfrac{\alpha^{n+1}-\beta^{n+1}}{\alpha-\beta}=\dfrac{\alpha^{n+1}-\beta^{n+1}}{\sqrt{5}}\).
- Number of triangulations for a convex \(n\)-gon: let \(a_n\) be the number of triangulations of a convex \((n+2)\)-gon, then \(a_n=\sum\limits_{0\le k\le n-1}a_ka_{n-k-1}\). So \(A(x)=1+xA(x)^2\), \(A(x)=\dfrac{1\pm\sqrt{1-4x}}{2x}\). To ensure \(A(0)\) is finite, \(A(x)=\dfrac{1-\sqrt{1-4x}}{2x}\). Since \((1+x)^z=\sum\limits_{k\ge 0}\dbinom{z}{k}x^k\), we have \(\sqrt{1-4x}=\sum\limits_{k\ge 0}\dbinom{\frac{1}{2}}{k}(-4x)^k\). So \(a_n=\dfrac{1}{2}(-1)^n\dbinom{\frac{1}{2}}{n+1}4^{n+1}=\dfrac{1}{2n+1}\dbinom{2n+1}{n}\).
- Up-down permutations: a permutation \(\pi\) of \(\{1,\cdots,n\}\) is up-down (alternating) if \(\pi_1<\pi_2>\pi_3<\pi_4>\cdots\); let \(a_n\) be the number of up-down permutations of length \(n\). Then \(a_1=1\) and \(a_n=\sum\limits_{k=\text{odd}}\dbinom{n-1}{k}a_ka_{n-1-k}\) for odd \(n\ge 3\), so \(n\cdot\dfrac{a_n}{n!}=\sum\limits_{k=\text{odd}}\dfrac{a_k}{k!}\cdot\dfrac{a_{n-1-k}}{(n-1-k)!}\). Let \(b_n=\dfrac{a_n}{n!}\), then \(b_1=1\), \(nb_n=\sum\limits_{k=\text{odd}}b_kb_{n-1-k}\). So let \(B(x)=\sum\limits_{n=\text{odd}}b_nx^n\), then \(B'(x)=1+B(x)^2\), \(B(x)=\tan x\).
How to find the power series coefficients of \(\tan x\)? Consider complex analysis:
Three representations of complex numbers:
- \(z=a+ib\)
- \(z=r(\cos\theta+i\sin\theta)\)
- \(z=re^{i\theta}\)
Let \(f:\mathbb{C}\to\mathbb{C}\) be a complex function, let \(\Gamma\) be a path from \(a\) to \(b\), let \(\int_{\Gamma}f(z)\mathrm dz=\lim\limits_{m\to\infty}\sum\limits_{0\le k\le m-1}f(z_k)(z_{k+1}-z_k)\), where \(z_0,z_1,\cdots,z_m\) divide \(\Gamma\) evenly.
Complex integral on a closed curve:
- Cauchy's integral theorem: if \(f(z)\) is analytic (complex differentiable) on and inside a simple closed contour \(\gamma\), then \(\oint_{\gamma}f(z)\mathrm dz=0\).
- Cauchy's integral formula: let \(f(z)\) be analytic inside and on a simple closed contour \(\gamma\); if \(z_0\) is any point inside \(\gamma\), then \(\oint_{\gamma}\dfrac{f(z)}{z-z_0}\mathrm dz=2\pi if(z_0)\).
Proof: Consider a small circle \(C_{\epsilon}\) around \(z_0\) with radius \(\epsilon\), then \(\oint_{\gamma}\dfrac{f(z)}{z-z_0}\mathrm dz=\oint_{C_{\epsilon}}\dfrac{f(z)}{z-z_0}\mathrm dz\). Parameterize \(C_{\epsilon}\) as \(z=z_0+\epsilon e^{i\theta}\), then \(\mathrm dz=i\epsilon e^{i\theta}\mathrm d\theta\), so as \(\epsilon\to 0,f(z)\approx f(z_0)\) by the continuity of \(f\), so \(\oint_{C_{\epsilon}}\dfrac{f(z)}{z-z_0}\mathrm dz\approx f(z_0)\int_0^{2\pi}\dfrac{i\epsilon e^{i\theta}}{\epsilon e^{i\theta}}\mathrm d\theta=2\pi i f(z_0)\).
- Cauchy's residue theorem: a complex function \(f\) has a pole of order \(m\) at \(z_0\) if \((z-z_0)^mf(z)\) is holomorphic and non-zero at \(z_0\). If \(f(z)\) has a pole of order \(m\) at \(z_0\), it can be represented by a Laurent series in a neighborhood of \(z_0\): \(f(z)=\sum\limits_{n=-m}^{\infty}a_n(z-z_0)^n\) with non-zero \(a_{-m}\) (this generalizes the Taylor series to functions with singularities). For \(f(z)=\sum\limits_{n=-\infty}^{\infty}a_n(z-z_0)^n\), we call \(a_{-1}\) the residue of \(f(z)\) at \(z_0\).
(Cauchy's residue theorem) Let \(f\) be analytic inside and on a simple closed contour \(\gamma\) except for finitely many singularities \(z_1,z_2,\cdots,z_n\) inside \(\gamma\), then \(\oint_{\gamma}f(z)\mathrm dz=2\pi i\sum\limits_{k=1}^n\text{Res}(f,z_k)\).
Lecture 7
Methods for finding residues:
- Simple pole: \(\text{Res}(f,z_0)=\lim\limits_{z\to z_0}(z-z_0)f(z)\).
- Pole of order \(m\): \(\text{Res}(f,z_0)=\dfrac{1}{(m-1)!}\lim\limits_{z\to z_0}\dfrac{\mathrm d^{m-1}}{\mathrm dz^{m-1}}\big((z-z_0)^mf(z)\big)\). (A worked example follows below.)
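A worked example of both rules for \(f(z)=\dfrac{1}{z(z-1)^2}\) (a simple pole at \(0\) and a pole of order \(2\) at \(1\)):
\[\text{Res}(f,0)=\lim\limits_{z\to 0}zf(z)=\dfrac{1}{(0-1)^2}=1,\qquad \text{Res}(f,1)=\lim\limits_{z\to 1}\dfrac{\mathrm d}{\mathrm dz}\Big((z-1)^2f(z)\Big)=\lim\limits_{z\to 1}\Big(-\dfrac{1}{z^2}\Big)=-1.\]The two residues sum to \(0\), consistent with the residue theorem applied to a circle of radius \(R\to\infty\), on which the integral vanishes.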
How to calculate \(\tan x\) using the tools in complex analysis?
We first extend \(\tan x\) to be a function over the complex plane. \(\forall z\in\mathbb{C}\), let \(\tan z=\dfrac{e^{iz}-e^{-iz}}{i(e^{iz}+e^{-iz})}\).
Let \(\beta_n\) be the residue of \(\dfrac{\tan z}{z^{n+1}}\) at \(z=0\). We can prove that \(b_n=\beta_n\). This is because \(\dfrac{\tan z}{z^{n+1}}=\dfrac{1}{z^{n+1}}\sum\limits_{k\ge 0}b_kz^k=\dfrac{b_0}{z^{n+1}}+\dfrac{b_1}{z^n}+\cdots+\dfrac{b_n}{z}+b_{n+1}+b_{n+2}z+\cdots\).
For each odd \(n\), complex function \(\dfrac{\tan z}{z^{n+1}}\) has singularities at
- \(z_m=(m-\dfrac{1}{2})\pi\) for integer \(m\).
- \(z=0\).
Construct a closed curve \(\Gamma_m\) traversing the boundary of the square \([-2m\pi,2m\pi]\times[-2m\pi,2m\pi]\) (so that exactly the poles \((i-\frac{1}{2})\pi\) with \(-2m+1\le i\le 2m\), together with \(0\), lie inside). By Cauchy's Residue Theorem, with \(f=\dfrac{\tan z}{z^{n+1}}\),
\[\dfrac{1}{2\pi i}\oint_{\Gamma_m}\dfrac{\tan z}{z^{n+1}}\mathrm dz=\text{Res}(f,0)+\sum\limits_{i=-2m+1}^{2m}\text{Res}(f,(i-\dfrac{1}{2})\pi) \]Since \(|\tan z|\le 10\) for all \(z\in\Gamma_m\), the \(\text{LHS}\) has absolute value \(\le 8m\cdot\max\limits_{z\in\Gamma_m}\left|\dfrac{\tan z}{z^{n+1}}\right|\le\dfrac{80}{m^{n}}\); as \(m\to\infty\), \(\text{LHS}\to 0\). So \(b_n=-\sum\limits_{m}\text{Res}(\dfrac{\tan z}{z^{n+1}},z_m)\).
A spanning tree of \(G\) is a subgraph \(G'=(V,E')\) such that \(G'\) is connected and \(|E'|=|V|-1\).
Cayley's formula: \(\text{#sp}(K_n)=n^{n-2}\).
Matrix Tree Theorem: The Laplacian of \(G\) is an \(n\times n\) matrix \(L_G=(l_{i,j})\) where \(l_{i,i}=\text{degree}(v_i)\), \(l_{i,j}=-1\) if \((i,j)\in E\) and \(l_{i,j}=0\) otherwise. Then \(\text{#sp}(G)=\det(L_G^{(i)})\) for any \(1\le i\le |V|\), where \(A^{(i)}\) is the matrix \(A\) with the \(i\)-th row and \(i\)-th column deleted.
Lemma (Cauchy-Binet Formula): Let \(n\le m\) and \(A=(a_{i,j}),B=(b_{i,j})\) be \(n\times m\) real matrices. For any \(S\sube\{1,2,3,\cdots,m\}\) with \(|S|=n\), let \(A_S\) denote the \(n\times n\) submatrix of \(A\) with columns indexed by \(S\), similarly for \(B_S\). Then \(\det(A\cdot B^T)=\sum\limits_{|S|=n}\det(A_S)\det(B_S)\).
Proof of Matrix Tree Theorem with the lemma: Let \(A\) be the \(n\times m\) matrix where for each \(e_j=(v_{j_1},v_{j_2})\), let \(a_{j_1,j}=1,a_{j_2,j}=-1\) and other \(a_{i,j}=0\). Then \(AA^T=L_G\). Fix some \(1\le i\le n\), let \(A'\) be the \((n-1)\times m\) matrix obtained from \(A\) by deleting its \(i\)-th row, then \(A'(A')^T=L_G^{(i)}\). For \(S\sube\{1,2,3,\cdots,m\}\) with \(|S|=n-1\), \(|\det(A'_S)|=1\) if \(\{e_k|k\in S\}\) forms a spanning tree of \(G\) and \(\det(A'_S)=0\) otherwise, so by the Cauchy-Binet formula \(\det(L_G^{(i)})=\sum\limits_{|S|=n-1}\det(A'_S)^2=\text{#sp}(G)\), which proves the theorem.
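A small numerical check of the theorem (my own sketch; the helper and the choice of \(K_5\) are illustrative):

```python
import numpy as np
from itertools import combinations

def spanning_tree_count(n, edges):
    """Matrix Tree Theorem: #sp(G) = det of the Laplacian with one row and column removed."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    return round(np.linalg.det(L[1:, 1:]))

n = 5
edges = list(combinations(range(n), 2))   # complete graph K_5
print(spanning_tree_count(n, edges))      # 125 = 5^(5-2), matching Cayley's formula
```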
Lecture 9
Inequalities recap:
- Union bound: \(\mathbb{P}(X_1\lor X_2\lor\cdots\lor X_n)\le\sum\limits_{i}\mathbb{P}(X_i)\).
- Markov's inequality: \(\mathbb{P}(X\ge a)\le\dfrac{\mathbb{E}(X)}{a}\), where \(X\) is non-negative.
- Chebyshev's inequality: let \(\mu=\mathbb{E}(X),\sigma^2=\text{Var}(X)\), then \(\mathbb{P}(|X-\mu|\ge k\sigma)\le\dfrac{1}{k^2}\).
Moment generating function: \(M_X(t)=\mathbb{E}(\exp(tX))\).
Another representation of Chernoff bound: for any \(t>0\), \(\mathbb{P}(X\ge a)\le\dfrac{M_X(t)}{e^{ta}}\).
Proof:
\[\mathbb{P}(X\ge a)=\int_a^{\infty}f(x)\mathrm dx\le\int_a^{\infty}\dfrac{e^{tx}}{e^{ta}}f(x)\mathrm dx\le\dfrac{M_X(t)}{e^{ta}} \]
Hoeffding bound: let \(Z_1,Z_2,\cdots,Z_n\) be independent random variables such that \(Z_i\in[a_i,b_i]\), then for all \(t\ge 0\), \(\mathbb{P}(|\sum\limits_{i=1}^n(Z_i-\mathbb{E}(Z_i))|\ge t)\le 2\exp(-\dfrac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2})\).
High-dimensional geometry.
For high-dimensional objects, most of the volume is near the surface. For an object \(A\) in \(\mathbb{R}^d\), shrink \(A\) by a small amount \(\epsilon\) to produce a new object \((1-\epsilon)A\), then we have \(\text{volume}((1-\epsilon)A)=(1-\epsilon)^d\text{volume}(A)\). Since \((1-\epsilon)^d\le e^{-\epsilon d}\), fix \(\epsilon\) and as \(d\to\infty\), \((1-\epsilon)^d\) will rapidly approach \(0\).
For unit ball in \(d\)-dimensional space:
- Surface area \(A(d)=\dfrac{2\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\).
- Volume \(V(d)=\dfrac{2\pi^{\frac{d}{2}}}{d\Gamma(\frac{d}{2})}\).
Where \(\Gamma(n+1)=n\Gamma(n),\Gamma(1)=1,\Gamma(\dfrac{1}{2})=\sqrt{\pi}\).
Properties of high-dimensional ball:
Volume lies near equator:
For \(c\ge 1\) and \(d\ge 3\), at least a \(1-\dfrac{2}{c}e^{-\frac{c^2}{2}}\) fraction of the volume of the \(d\)-dimensional unit ball has \(|x_1|\le\dfrac{c}{\sqrt{d-1}}\).
Proof: By symmetry, we just need to prove that at most a \(\dfrac{2}{c}e^{-\frac{c^2}{2}}\) fraction of the half of the ball with \(x_1\ge 0\) has \(x_1\ge\dfrac{c}{\sqrt{d-1}}\), let \(A\) denote the portion of the ball with \(x_1\ge\dfrac{c}{\sqrt{d-1}}\) and \(H\) denote the upper hemisphere, then
\[\text{volume}(A)=\int_{\frac{c}{\sqrt{d-1}}}^1(1-x_1^2)^{\frac{d-1}{2}}V(d-1)\mathrm dx_1 \]To get the upper bound of \(\text{volume}(A)\), use \(1-x\le e^{-x}\) and integrate to infinite, thus
\[\begin{aligned} \text{volume}(A)&\le\int_{\frac{c}{\sqrt{d-1}}}^{\infty}\dfrac{x_1\sqrt{d-1}}{c}e^{-\frac{d-1}{2}x_1^2}V(d-1)\mathrm dx_1\\ &=V(d-1)\dfrac{\sqrt{d-1}}{c}\int_{\frac{c}{\sqrt{d-1}}}^{\infty}x_1e^{-\frac{d-1}{2}x_1^2}\mathrm dx_1\\ &=V(d-1)\dfrac{\sqrt{d-1}}{c}(-\dfrac{1}{d-1}e^{-\frac{d-1}{2}x_1^2})|_{\frac{c}{\sqrt{d-1}}}^{\infty}\\ &=\dfrac{V(d-1)}{c\sqrt{d-1}}e^{-\frac{c^2}{2}} \end{aligned} \]Since \(\text{volume}(H)\ge\dfrac{V(d-1)}{2\sqrt{d-1}}\), we have \(\dfrac{\text{volume}(A)}{\text{volume}(H)}\le\dfrac{2}{c}e^{-\frac{c^2}{2}}\).
Volume lies on the shell, random vectors are orthogonal:
Consider drawing \(n\) points \(x_1,x_2,\cdots,x_n\) at random from the unit ball, with probability \(1-O(n^{-1})\):
- \(|x_i|\ge 1-\dfrac{2\ln n}{d}\) for all \(i\).
- \(|x_i\cdot x_j|\le\dfrac{\sqrt{6\ln n}}{\sqrt{d-1}}\) for all \(i\ne j\).
Proof:
For the first part, since \(\text{Pr}(|x_i|<1-\dfrac{2\ln n}{d})\le e^{-\frac{2\ln n}{d}\cdot d}=\dfrac{1}{n^2}\), by union bound, \(\text{Pr}(\exists i:|x_i|< 1-\dfrac{2\ln n}{d})\le\dfrac{1}{n}\).
For the second part, fix some pair \(i\ne j\): \(\text{Pr}(|x_i\cdot x_j|>\dfrac{\sqrt{6\ln n}}{\sqrt{d-1}})\le O(e^{-\frac{6\ln n}{2}})=O(n^{-3})\), so by union bound, \(\text{Pr}(\exists i\ne j:|x_i\cdot x_j|>\dfrac{\sqrt{6\ln n}}{\sqrt{d-1}})\le O(\dbinom{n}{2}n^{-3})=O(\dfrac{1}{n})\).
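A quick numerical illustration of both properties (my own sketch; the points are drawn uniformly from the unit ball by normalizing a Gaussian direction and scaling the radius by \(U^{1/d}\), a standard sampling trick not covered above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 50

g = rng.standard_normal((n, d))
x = g / np.linalg.norm(g, axis=1, keepdims=True)    # uniform directions
x *= rng.random((n, 1)) ** (1 / d)                  # radii for uniform volume

norms = np.linalg.norm(x, axis=1)
dots = np.abs(x @ x.T - np.diag(norms ** 2))        # off-diagonal |x_i . x_j|
print(norms.min(), 1 - 2 * np.log(n) / d)           # norms typically at least 1 - 2 ln(n)/d
print(dots.max(), np.sqrt(6 * np.log(n) / (d - 1))) # pairwise dot products typically below the bound
```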
Gaussian Annulus Theorem: For a \(d\)-dimensional spherical Gaussian with unit variance in each direction, for any \(\beta\le\sqrt{d}\), all but at most \(3e^{-c\beta^2}\) of the probability mass lies within the annulus \(\sqrt{d}-\beta\le|x|\le\sqrt{d}+\beta\), where \(c\) is a fixed positive constant.
Lemma: Let \(X_1,X_2,\cdots,X_n\) be independent \(\sigma\)-subgaussian random variables (\(\mathbb{E}(e^{\lambda(X_i-\mu_i)})\le e^{\lambda^2\sigma^2/2}\)). Let \(S=\sum X_i\), then for any \(t>0\): \(\mathbb{P}(|S-\mathbb{E}(S)|\ge t)\le 2\exp(-\dfrac{t^2}{2n\sigma^2})\).
Proof: \(\mathbb{E}(e^{\lambda(S-n\mu)})=\prod\limits_{i}\mathbb{E}(e^{\lambda(X_i-\mu)})\le e^{n\sigma^2\lambda^2/2}\). So by Chernoff bound, \(\mathbb{P}(S-n\mu\ge t)\le e^{-\lambda t}\mathbb{E}(e^{\lambda(S-n\mu)})\le e^{-\lambda t+n\sigma^2\lambda^2/2}\). Plug \(\lambda=\dfrac{t}{n\sigma^2}\) into the bound we get \(\mathbb{P}(S-n\mu\ge t)\le\exp(-\dfrac{t^2}{2n\sigma^2})\).
For sub-exponential random variables, we have:
Let \(X_1,X_2,\cdots,X_n\) be i.i.d sub-exponential random variables with parameters \(\nu,b\) (\(\mathbb{E}(e^{\lambda(X_i-\mu_i)})\le e^{\lambda^2\nu^2/2}\) for all \(|\lambda|<\dfrac{1}{b}\)). Let \(S=\sum X_i\), then for any \(t>0\): \(\mathbb{P}(|S-\mathbb{E}(S)|\ge t)\le 2\exp(-\min(\dfrac{t^2}{2n\nu^2},\dfrac{t}{2b}))\).
Since \(X_i^2\) is sub-exponential with parameters \((2,4)\), and \(||x|-\sqrt{d}|\ge\beta\) implies \(|S-d|\ge\beta\sqrt{d}\) (where \(S=\sum X_i^2=|x|^2\)), applying the sub-exponential tail we get \(\mathbb{P}(||x|-\sqrt{d}|\ge\beta)\le\mathbb{P}(|S-d|\ge\beta\sqrt{d})\le 2\exp(-\min(\dfrac{(\beta\sqrt{d})^2}{8d},\dfrac{\beta\sqrt{d}}{8}))\). Since \(\beta\le\sqrt{d}\), \(\dfrac{\beta^2}{8}\le\dfrac{\beta\sqrt{d}}{8}\), so \(\mathbb{P}(||x|-\sqrt{d}|\ge\beta)\le 2e^{-\beta^2/8}\), which gives the Gaussian Annulus Theorem with \(c=\frac{1}{8}\).
Lecture 10
Application of GAT: Random Projection.
Consider random projection \(f:\mathbb{R}^d\to\mathbb{R}^k\), given \(v\), let \(u_1,u_2,\cdots,u_k\) be \(k\) Gaussian vectors in \(\mathbb{R}^d\), then define \(f(v)=(u_1\cdot v,u_2\cdot v,\cdots,u_k\cdot v)\).
Then \(f\) preserves norms under scaling: Let \(v\) be a fixed vector, then there exists \(c>0\) such that \(\forall\epsilon\in(0,1)\), \(\text{Pr}(||f(v)|-\sqrt{k}|v||\ge\epsilon\sqrt{k}|v|)\le 3e^{-ck\epsilon^2}\).
Proof:
Assume that \(|v|=1\) and recall that each \(u_i\) has i.i.d. \(N(0,1)\) coordinates; then \(u_i\cdot v\) is Gaussian with \(\text{Var}(u_i\cdot v)=\text{Var}(\sum\limits_{j=1}^du_{i,j}v_j)=\sum\limits_{j=1}^dv_j^2\text{Var}(u_{i,j})=1\), so \(f(v)\) is a \(k\)-dimensional spherical Gaussian with unit variance in each direction. So by the Gaussian Annulus Theorem (with \(\beta=\epsilon\sqrt{k}\)), \(\text{Pr}(||f(v)|-\sqrt{k}|v||\ge\epsilon\sqrt{k}|v|)\le 3e^{-ck\epsilon^2}\).
\(f\) preserves pairwise distance as well. (JL Lemma) For any \(0<\epsilon<1\) and any integer \(n\), let \(k\ge\dfrac{3}{c\epsilon^2}\ln n\) with \(c\) as in Gaussian Annulus Theorem. For any set of \(n\) points in \(\mathbb{R}^d\), the random projection \(\mathbb{R}^d\to\mathbb{R}^k\) defined above has the property that for all pairs of points \(v_i\) and \(v_j\), with probability at least \(1-\dfrac{3}{2n}\), \((1-\epsilon)\sqrt{k}|v_i-v_j|\le|f(v_i)-f(v_j)|\le(1+\epsilon)\sqrt{k}|v_i-v_j|\).
The compressed dimension is \(d\)-independent!!!
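A quick numerical illustration of the random projection and the JL property (my own sketch; the sizes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 10000, 500, 30

points = rng.standard_normal((n, d))   # n arbitrary points in R^d
U = rng.standard_normal((k, d))        # rows are the Gaussian vectors u_1, ..., u_k
projected = points @ U.T               # f(v) = (u_1 . v, ..., u_k . v) for every point at once

ratios = []
for i in range(n):
    for j in range(i + 1, n):
        orig = np.linalg.norm(points[i] - points[j])
        proj = np.linalg.norm(projected[i] - projected[j])
        ratios.append(proj / (np.sqrt(k) * orig))
print(min(ratios), max(ratios))   # both close to 1: distances are preserved up to the sqrt(k) factor
```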
Singular value decomposition
For a matrix \(n\times m\) matrix \(M\) with rank \(r\):
- There exist an \(n\times r\) matrix \(U\) and an \(r\times m\) matrix \(V\) such that \(M=UV\).
- \(MM^T\) has \(n-r\) eigenvectors with eigenvalue \(0\).
Background: given \(n\) data points in a \(d\)-dimensional space, find a line such that the sum of the squares of the distances from all data points to the line is minimal.
Let \(v\) be the unit vector along the line (through the origin), then the square of the length of the projection of point \(x\) onto the line is \(\lang x,v\rang^2\). By the Pythagorean theorem, the square of the distance of point \(x\) to the line is \(|x|^2-\lang x,v\rang^2\). Since \(\sum|x|^2\) is a fixed value, if \(A\) is the \(n\times d\) matrix whose \(i\)-th row is the \(i\)-th data point, then the optimal \(v\) is \(\arg\max\limits_{|v|=1}|Av|\).
Let \(v_1=\arg\max\limits_{|v|=1}|Av|\), \(v_2=\arg\max\limits_{|v|=1,v\perp v_1}|Av|,v_3=\arg\max\limits_{|v|=1,v\perp v_1,v\perp v_2}|Av|\) and so on, and let \(\sigma_i=|Av_i|\). We call \(v_1,v_2,\cdots,v_r\) the right singular vectors and \(\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_r\) the singular values, and define the left singular vectors \(u_i=\dfrac{1}{\sigma_i}Av_i\); then for different \(i,j\), \(u_i\) and \(u_j\) are orthogonal.
Proof: Assume that for some \(i<j\), \(u_i^Tu_j=\delta>0\) (if \(\delta<0\), replace \(v_j\) by \(-v_j\)). Let \(v'_i=\dfrac{v_i+\epsilon v_j}{|v_i+\epsilon v_j|}\), then \(|Av'_i|\ge u_i^TAv'_i=u_i^T(\dfrac{\sigma_i u_i+\epsilon\sigma_j u_j}{\sqrt{1+\epsilon^2}})\ge(\sigma_i+\epsilon\sigma_j\delta)(1-\dfrac{\epsilon^2}{2})=\sigma_i-\dfrac{\epsilon^2}{2}\sigma_i+\epsilon\sigma_j\delta-\dfrac{\epsilon^3}{2}\sigma_j\delta\), which is \(>\sigma_i\) when \(\epsilon\) is small enough. Since \(v'_i\) is orthogonal to all \(v_k\) with \(k<i\), this contradicts the maximality of \(v_i\). So \(u_i^Tu_j=0\) for all \(i<j\).
Let \(U\) be the \(n\times r\) matrix such that the \(i\)-th column vector is \(u_i\), \(D\) be the \(r\times r\) diagonal matrix with \(D_{i,i}=\sigma_i\), \(V^T\) be the \(r\times d\) matrix such that the \(i\)-th row vector is \(v_i\), then \(A=UDV^T\). In other words, \(A=\sum\limits_{i=1}^r\sigma_iu_iv_i^T\).
Proof: \(\forall j\le r\), \((\sum\limits_{i=1}^r\sigma_iu_iv_i^T)v_j=\sigma_ju_j=Av_j\), and both sides map any vector orthogonal to \(v_1,\cdots,v_r\) to \(0\) (otherwise there would be an \((r+1)\)-st singular vector with \(\sigma_{r+1}>0\)), so \(A=\sum\limits_{i=1}^r\sigma_iu_iv_i^T=UDV^T\).
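A quick numerical check of the decomposition with NumPy (my own sketch; `np.linalg.svd` is used as the reference):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
print(np.allclose(A, U @ np.diag(s) @ Vt))                                           # True
print(np.allclose(A, sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))))   # A = sum sigma_i u_i v_i^T
print(np.allclose(np.linalg.norm(A, 'fro') ** 2, np.sum(s ** 2)))                    # |A|_F^2 = sum sigma_i^2
```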
Low rank approximation.
What if we project the data points onto a \(k\)-dimensional space?
Let \(A\) be an \(n\times d\) matrix with singular vectors \(v_1,v_2,\cdots,v_r\), for \(1\le k\le r\), let \(V_k\) be the subspace spanned by \(v_1,v_2,\cdots,v_k\). For each \(k\), \(V_k\) is the best-fit \(k\)-dimensional subspace for \(A\).
Define the Frobenius norm of matrix \(A\) as \(|A|_F=\sqrt{\sum\limits_{i=1}^n\sum\limits_{j=1}^dA_{i,j}^2}\), then \(|A|_F^2=\sum\limits_{i=1}^r\sigma_i^2\).
Proof: \(\sum\limits_{i=1}^n|a_i|^2=\sum\limits_{i=1}^n\sum\limits_{j=1}^r(a_i\cdot v_j)^2=\sum\limits_{j=1}^r\sum\limits_{i=1}^n(a_i\cdot v_j)^2=\sum\limits_{j=1}^r|Av_j|^2=\sum\limits_{j=1}^r\sigma_j^2\), where \(a_i\) is the \(i\)-th row of \(A\) (which lies in the span of \(v_1,\cdots,v_r\)).
Low rank approximation for the Frobenius norm: given matrix \(A\), find a matrix \(B\) with \(\text{rank}(B)\le k\) such that \(|A-B|_F\) is minimal. Here \(A_k=\sum\limits_{i=1}^k\sigma_iu_iv_i^T\) denotes the rank-\(k\) truncation of the SVD.
Analysis: For any \(B\) of rank at most \(k\), \(|A-A_k|_F\le|A-B|_F\). So the minimum value of \(|A-B|_F^2\) is \(|A-A_k|_F^2=\sum\limits_{i=k+1}^r\sigma_i^2\).
Proof (sketch): extend \(U\) and \(V\) to square orthogonal matrices; since multiplying by orthogonal matrices preserves the Frobenius norm, \(|A-B|_F=|U^T(A-B)V|_F=|D-U^TBV|_F\). Let \(C=U^TBV\); since \(\text{rank}(B)\le k\), \(\text{rank}(C)\le k\). Since \(D\) is diagonal, the best rank-\(k\) approximation of \(D\) is \(C^*=\text{diag}(\sigma_1,\sigma_2,\cdots,\sigma_k,0,\cdots,0)\), which corresponds to \(B^*=A_k\).
Define the L2-norm of matrix \(A\) as \(|A|_2=\max\limits_{|x|\le 1}|Ax|\).
Low rank approximation for L2-norm:
Lemma 3.8. \(|A-A_k|_2^2=\sigma^2_{k+1}\).
Proof: consider an arbitrary vector \(v=\sum\limits_{j=1}^rc_jv_j\), then
\[\begin{aligned} |(A-A_k)v|&=|\sum\limits_{i=k+1}^r\sigma_iu_iv_i^T\sum\limits_{j=1}^rc_jv_j|\\&=|\sum\limits_{i=k+1}^rc_i\sigma_iu_iv_i^Tv_i|\\ &=|\sum\limits_{i=k+1}^rc_i\sigma_iu_i|\\ &=\sqrt{\sum\limits_{i=k+1}^rc_i^2\sigma_i^2} \end{aligned} \]Since \(|v|\le 1\), \(\sum\limits_{i=k+1}^{r}c_i^2\le 1\), so the maximum value of \(\sqrt{\sum\limits_{i=k+1}^rc_i^2\sigma_i^2}\) is \(\sigma_{k+1}\).
Theorem 3.9. Let \(A\) be an \(n\times d\) matrix. For any matrix \(B\) of rank at most \(k\), \(|A-A_k|_2\le|A-B|_2\).
Proof: Since \(\dim\text{Null}(B)\ge d-k\) and \(\dim\text{Span}(v_1,v_2,\cdots,v_{k+1})=k+1\), we can find \(|z|=1\) in \(\text{Null}(B)\cap\text{Span}(v_1,v_2,\cdots,v_{k+1})\), then
\[\begin{aligned} |Az|^2&=|\sum\limits_{i=1}^r\sigma_iu_iv_i^Tz|^2\\&=\sum\limits_{i=1}^r\sigma_i^2(v_i^Tz)^2\\&=\sum\limits_{i=1}^{k+1}\sigma_i^2(v_i^Tz)^2\\&\ge\sigma_{k+1}^2\sum\limits_{i=1}^{k+1}(v_i^Tz)^2\\&=\sigma_{k+1}^2 \end{aligned} \]Since \(Bz=0\), \(|A-B|_2^2\ge|(A-B)z|^2=|Az|^2\ge\sigma_{k+1}^2=|A-A_k|_2^2\), so \(B\) is no better than \(A_k\).
Lecture 12
Power methods for computing SVD:
Let \(B=A^TA=\sum\limits_{i}\sigma_i^2v_iv_i^T\), and raise it to power \(k\): \(B^k=\sum\limits_{i}\sigma_i^{2k}v_iv_i^T\).
Then, if \(x=\sum\limits_{i}c_iv_i\) with \(c_1\ne 0\) and \(\sigma_1>\sigma_2\), the \(\sigma_1^{2k}\) term grows exponentially faster than the others, so when \(k\) is large enough, \(B^kx\approx\sigma_1^{2k}c_1v_1\), and normalizing \(B^kx\) recovers \(v_1\).
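A minimal power-iteration sketch (my own; the matrix and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 30))
B = A.T @ A                         # B = sum_i sigma_i^2 v_i v_i^T

x = rng.standard_normal(30)         # random start: c_1 != 0 with probability 1
for _ in range(100):
    x = B @ x
    x /= np.linalg.norm(x)          # keep the iterate normalized

sigma1 = np.linalg.norm(A @ x)      # |A v_1| = sigma_1
print(sigma1, np.linalg.svd(A, compute_uv=False)[0])   # the two agree (approximately)
```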
PageRank / hubs-and-authorities: \(u\): hub score, \(v\): authority score. \(A_{i,j}=1\) means there is a hyperlink from hub \(i\) to authority \(j\).
Then \(v_j\propto\sum\limits_{i=1}^du_iA_{i,j}\).
Lemma 10. \(Av_i=\sigma_iu_i,A^Tu_i=\sigma_iv_i\).
Solving for \(u,v\) (the top singular vectors) gives the fixed-point iteration \(u=\dfrac{Av}{|Av|}\), \(v=\dfrac{A^Tu}{|A^Tu|}\).
Community detection: there are two groups in a graph; the probability of an edge between two nodes in the same group is \(p\), and between different groups it is \(q\). Classify the nodes.
Spectral clustering algorithm: compute the second eigenvector \(U_2\) of \(A\) and cluster nodes based on the sign of the entries of \(U_2\). Total error count: \(\text{err}\le 2\dfrac{(p\lor q)\log n}{(p-q)^2n}\), where \(p\lor q=\max(p,q)\).
Lecture 13
A Markov chain consists of:
- State space: \(S=\{1,2,3,\cdots,m\}\)
- A distribution over states: \(\mathbf{p}(t)=(p_1(t),p_2(t),\cdots,p_m(t))\in[0,1]^m\) such that \(\sum\limits_{i=1}^mp_i(t)=1\).
- State transition: \(P=[P_{i,j}]_{i,j=1}^m\in[0,1]^{m\times m}\) such that \(\sum\limits_{j=1}^mP_{i,j}=1\).
State evolution: \(\mathbf{p}(t+1)=\mathbf{p}(t)P\).
2D random walk. A drunk man walking in Manhattan, can the drunk man find his way home?
Analysis: Define some notations:
- First return time (starting from \(X_0=i\)): \(T_i=\inf\{n\ge 1:X_n=i\}\)
- Return probability: \(f_i=\text{Pr}(T_i<\infty)\).
- Number of visits (including the visit at time \(0\)): \(N_i=\sum\limits_{n=0}^{\infty}[X_n=i]\)
- Recurrent state: a state \(i\) is recurrent if \(f_i=1\).
Lemma: \(\mathbb{E}(N_i)=\dfrac{1}{1-f_i}\). Proof: By the Markov property, \(N_i=1+1_{T_i<\infty}N^*_i\), where \(N^*_i\) is an independent copy of \(N_i\), so \(\mathbb{E}(N_i)=1+\text{Pr}(T_i<\infty)\mathbb{E}(N_i)\).
For the 1D version, we have \(\text{Pr}[X_{2n}=0]=\dbinom{2n}{n}4^{-n}\); using the Stirling approximation \(\dbinom{2m}{m}\sim\dfrac{4^m}{\sqrt{\pi m}}\), we have \(\text{Pr}[X_{2n}=0]=\dfrac{1}{\sqrt{\pi n}}(1+O(n^{-1}))\). So for the 2D version, \(p_{2n}=\text{Pr}[X_{2n}=0]^2=\dfrac{1}{\pi n}(1+O(n^{-1}))\). The expected number of visits to \((0,0)\) is \(\sum\limits_{n=0}^{\infty}p_n=\infty\), so by the lemma \((0,0)\) is recurrent.
“A Drunk Man Will Find His Way Home but a Drunk Bird May Get Lost Forever”: \(f_{(0,0,0)}<1\) in 3D random walk.
Definition: A Markov chain is connected (irreducible) if for every pair of states \(i,j\), the probability of eventually reaching \(j\) from \(i\) is non-zero.
Lemma 4.1: Let \(P\) be the transition matrix of a connected Markov chain. The \(n\times(n+1)\) matrix \(A=\begin{bmatrix}P-I&\bold{1}\end{bmatrix}\) obtained by augmenting \(P-I\) with an additional column of all ones has rank \(n\).
Proof: Apparently \((1,1,1,\cdots,1,0)\) is in the null space of \(A\). Assume that some other vector \(v=(x,\alpha)\) orthogonal to \((1,1,1,\cdots,1,0)\) is in the null space, then \(Av=0\), so \((P-I)x+\alpha\bold{1}=0\), which means that \(x_i=\sum\limits_{j}p_{i,j}x_j+\alpha\). Since \(v\) is orthogonal to \((1,1,1,\cdots,1,0)\) and non-zero, the \(x_i\) are not all equal. By the connectedness, there exists neighboring \((i,j)\) such that \(x_i>x_j\) and \(x_i\ge\) all neighbors of \(i\), so \(\alpha>0\). Similarly there exists neighboring \((i,j)\) such that \(x_i<x_j\) and \(x_i\le\) all neighbors of \(i\), so \(\alpha<0\). A contradiction, so the null space has dimension \(1\) and \(\text{rank}(A)=n\).
Theorem 4.2: For a connected Markov chain, there is a unique probability vector \(\pi\) satisfying \(\pi P=\pi\). Moreover, let \(a(t)=\dfrac{1}{t}(p(0)+p(1)+\cdots+p(t-1))\), then \(\lim\limits_{t\to\infty}a(t)\) exists and equals \(\pi\).
Lecture 14
Detailed balance equality: For a random walk on a connected Markov chain, if the vector \(\pi\) satisfies \(\pi_xp_{x,y}=\pi_yp_{y,x}\) for all \(x,y\) and \(\sum\limits_{x}\pi_x=1\), then \(\pi\) is the stationary distribution of the Markov chain.
If the stationary distribution \(\pi\) satisfies \(\pi_xp_{x,y}=\pi_yp_{y,x}\), then we call the Markov chain "reversible".
MCMC algorithms: If we want to estimate \(\mathbb{E}(f)=\sum\limits_{x}f(x)\pi(x)\), we design a Markov chain \(P\) such that \(\pi P=\pi\) (i.e., \(\pi\) is its stationary distribution), run the chain to obtain \(x_1,x_2,\cdots,x_t\), and consider \(\dfrac{1}{t}\sum\limits_{i=1}^tf(x_i)\); as \(t\to\infty\), this value converges to \(\mathbb{E}(f)\).
Metropolis-Hastings: given a target distribution \(p\) and a graph on the states with maximum degree at most \(r\), for neighboring \(j\ne i\) let \(P_{i,j}=\min(\dfrac{p_j}{p_i},1)\cdot\dfrac{1}{r}\), and \(P_{i,i}=1-\sum\limits_{j\ne i}P_{i,j}\). Then \(p_iP_{i,j}=\dfrac{\min(p_i,p_j)}{r}=p_jP_{j,i}\), so by detailed balance \(p\) is the stationary distribution of \(P\).
Advantage: The normalizing coefficient \(Z\) in some cases will cancel out so that the ratio is easy to compute.
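A minimal sketch of the Metropolis chain on a path of \(10\) states with \(r=2\) (my own; the target weights are illustrative, and only weight ratios are used, so the normalizer cancels):

```python
import random
from collections import Counter

def metropolis_chain(weight, m, steps, r=2):
    """Walk on the path 0..m-1: propose a neighbor with probability 1/r each, accept with min(1, p_j/p_i)."""
    x, counts = 0, Counter()
    for _ in range(steps):
        j = x + random.choice([-1, 1])
        if 0 <= j < m and random.random() < min(1.0, weight(j) / weight(x)):
            x = j                       # only the ratio of weights is needed, so Z cancels out
        counts[x] += 1
    return counts

weight = lambda i: i + 1                # target pi(i) proportional to i+1 on {0,...,9}
counts = metropolis_chain(weight, 10, 200000)
print(counts[9] / 200000, 10 / 55)      # empirical frequency of state 9 vs. its target probability
```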
Gibbs sampling: repeatedly pick a random coordinate \(i\) and resample \(x_i\) from the conditional distribution \(\mathbb{P}(x_i|\{x_j\}_{j\ne i})\).
Mixing time: For \(\epsilon>0\), the \(\epsilon\)-mixing time of a Markov chain is the minimum integer \(t\) such that for any starting distribution \(p\), the L1-norm difference between the \(t\)-step running average probability distribution and the stationary distribution is \(\le\epsilon\), i.e., \(|a(t)-\pi|\le\epsilon\).
Normalized conductance: For a subset \(S\) of vertices, let \(\pi(S)=\sum\limits_{s\in S}\pi_s\), the normalized conductance of \(S\) is \(\Phi(S)=\dfrac{\sum\limits_{(x,y)\in (S,\bar{S})}\pi_xp_{x,y}}{\min(\pi(S),\pi(\bar{S}))}\), the normalized conductance of the Markov chain is \(\min\limits_{S}\Phi(S)\).
Theorem 4.5: The \(\epsilon\)-mixing time of a random walk on an undirected graph is \(O(\dfrac{\ln(1/\pi_{min})}{\Phi^2\epsilon^3})\) where \(\pi_{min}\) is the minimum stationary probability over all states.
Applications of mixing time:
- Random walk over 1D lattice of size \(n\) with loop: \(\Phi(S)=\Omega(\dfrac{1}{n})\), total mixing time is \(O(n^2\log n/\epsilon^3)\).
- 2D lattice: \(\Phi(S)=\Omega(\dfrac{1}{n})\).
- Clique: \(\Phi(S)=\Omega(1)\), total mixing time is \(O(\log n/\epsilon^3)\).
- A connected graph with \(m\) edges: worst case bound for mixing time is \(O(m^2\ln n/\epsilon^3)\).