Notes for 计算机与人工智能应用数学 (Applied Mathematics for Computer Science and Artificial Intelligence)
Lecture 1
A probability space \(P=(U,p)\) consists of
- Universe \(U\): a finite non-empty set.
- Probability function \(p:U\to[0,1]\) such that \(\sum\limits_{u\in U}p(u)=1\).
An event is a subset \(T\sube U\); the event happens if and only if the outcome falls inside \(T\). The probability of \(T\) is defined to be \(\text{Pr}(T)=\sum\limits_{u\in T}p(u)\).
Monty Hall problem: Switching gives \(\dfrac{2}{3}\) success probability.
Birthday paradox: Let \(U=\{(x_1,x_2,\cdots,x_n)|1\le x_k\le 365\},T=\{(x_1,x_2,\cdots,x_n)|x_j=x_k\text{ for some }j\ne k\}\), and let \(q(n)=\text{Pr}(T)\), then \(q(n)=1-(1-\dfrac{1}{365})(1-\dfrac{2}{365})\cdots(1-\dfrac{n-1}{365})\). Since \(e^{-x}\ge 1-x\) (with \(e^{-x}\approx 1-x\) when \(x\) is small), each factor \(1-\dfrac{i}{365}\le e^{-i/365}\), so \(q(n)\ge 1-\exp(-\sum\limits_{i=1}^{n-1}\dfrac{i}{365})=1-\exp(-\dfrac{n(n-1)}{730})\). So approximately when \(n>\sqrt{730\ln 2}\approx 22.5\), i.e. for \(n\ge 23\), \(q(n)>0.5\).
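A quick numerical check of the exact product formula against the exponential lower bound (a minimal sketch, my own; the exact crossover is \(n=23\)):

```python
import math

def q_exact(n, days=365):
    """Probability that some two of n people share a birthday (exact product formula)."""
    prod = 1.0
    for i in range(1, n):
        prod *= 1 - i / days
    return 1 - prod

def q_lower(n, days=365):
    """Lower bound 1 - exp(-n(n-1)/(2*days)), obtained from e^(-x) >= 1 - x."""
    return 1 - math.exp(-n * (n - 1) / (2 * days))

n = next(n for n in range(1, 366) if q_exact(n) > 0.5)
print(n, q_exact(n), q_lower(n))   # 23, ~0.507, ~0.500
```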
Online auction problem: Let strategy \(k\) be the strategy that skips the first \(k\) offers and accepts \(x_j\) for the first \(j\) satisfying \(x_j>\max\{x_1,x_2,\cdots,x_k\}\); consider the probability of success for strategy \(k\). Let \(T_j\) be the set of permutations satisfying \(x_j=n\) and \(\max\{x_1,x_2,\cdots,x_{j-1}\}=\max\{x_1,x_2,\cdots,x_k\}\), then \(\text{Pr}(T_j)=\dfrac{1}{n}·\dfrac{k}{j-1}\), so the probability of success for strategy \(k\) is \(\sum\limits_{j>k}\text{Pr}(T_j)=\dfrac{k}{n}(\dfrac{1}{k}+\dfrac{1}{k+1}+\cdots+\dfrac{1}{n-1})\). Since \(H_n=1+\dfrac{1}{2}+\dfrac{1}{3}+\cdots+\dfrac{1}{n}\approx\ln n+C\), we get \(\dfrac{k}{n}(\dfrac{1}{k}+\dfrac{1}{k+1}+\cdots+\dfrac{1}{n-1})=\dfrac{k}{n}(H_{n-1}-H_{k-1})\approx\dfrac{k}{n}\ln\dfrac{n-1}{k-1}\); choosing \(k=\lceil\dfrac{n}{e}\rceil\) makes the probability approximately \(\dfrac{1}{e}\), which is the maximum.
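A small simulation sketch of strategy \(k\) (my own; \(n\) and the trial count are illustrative):

```python
import math
import random

def success_rate(n, k, trials=20000):
    """Empirical success probability of 'skip the first k offers, then take the first record'."""
    wins = 0
    for _ in range(trials):
        x = list(range(1, n + 1))
        random.shuffle(x)
        best_seen = max(x[:k])
        chosen = None
        for j in range(k, n):
            if x[j] > best_seen:
                chosen = x[j]
                break
        wins += (chosen == n)
    return wins / trials

n = 100
k = math.ceil(n / math.e)
print(k, success_rate(n, k))   # k = 37, success rate close to 1/e ~ 0.368
```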
Union bound:
- Let \(T_1,T_2,\cdots,T_m\) be events, \(T\sube\cup_{i=1}^mT_i\), then \(\text{Pr}(T)\le\sum\limits_{i=1}^m\text{Pr}(T_i)\).
- If \(T_i\)'s are disjoint and \(T=\cup_{i=1}^mT_i\), then \(\text{Pr}(T)=\sum\limits_{i=1}^m\text{Pr}(T_i)\).
Ramsey numbers: Let \(R(k)\) be the smallest \(N\) such that among any \(N\) people, there exist either \(k\) mutual friends or \(k\) mutual strangers.
Theorem: For all \(k\ge 3\), \(R(k)\ge\lfloor 2^{k/2}\rfloor\).
Proof: Let \(n=\lfloor 2^{k/2}\rfloor\), let \(G\) be a random graph with \(n\) vertices, we prove that the probability that there exist neither \(k\) mutual friends nor \(k\) mutual strangers is larger than \(0\). Let \(T\) be this event; equivalently we show that \(\text{Pr}(\overline{T})<1\). By union bound, \(\text{Pr}(\overline{T})\le 2\dbinom{n}{k}·\dfrac{1}{2^{\binom{k}{2}}}\le 2\dfrac{n^k}{k!}·\dfrac{1}{2^{\binom{k}{2}}}\le 2\dfrac{2^{k(k/2)}}{k!}·\dfrac{1}{2^{(k^2-k)/2}}=\dfrac{2\cdot 2^{k/2}}{k!}<1\) for \(k\ge 3\). So \(\text{Pr}(T)>0\).
Conditional probability \(\text{Pr}(S|T)=\begin{cases}\dfrac{\text{Pr}(S\cap T)}{\text{Pr}(T)}&(\text{Pr}(T)>0)\\0&(\text{Pr}(T)=0)\end{cases}\).
- The chain rule: \(\text{Pr}(S_1\cap S_2\cap\cdots\cap S_m)=\prod\limits_{i=1}^m\text{Pr}(S_i|S_1\cap S_2\cap\cdots\cap S_{i-1})\).
- Distributive law: let \(T\sube W_1\cup W_2\cup\cdots\cup W_m\), then \(\text{Pr}(T)\le\sum\limits_{1\le j\le m}\text{Pr}(W_j)\text{Pr}(T|W_j)\).
Lecture 2
In a uniformly random permutation of \(\{1,2,\cdots,n\}\), the probability that \(1\) lies in a cycle of length \(s\) is \(\dfrac{1}{n}\) (for every \(1\le s\le n\)).
Proof: Let \(E_s\) be the event that \(L_1>s\), consider \(\text{Pr}(E_s|E_{s-1})\), let the first \(s-1\) elements in the cycle of \(1\) be \(i_1=1,i_2,i_3,\cdots,i_{s-1}\), then \(E_s\) happens if and only if the next element of \(i_{s-1}\) is not \(1\), so \(\text{Pr}(E_s|E_{s-1})=\dfrac{n-s}{n-s+1}\). By chain rule, \(\text{Pr}(L_1=s)=\dfrac{n-1}{n}·\dfrac{n-2}{n-1}·\cdots·\dfrac{n-(s-1)}{n-(s-2)}·\dfrac{1}{n-(s-1)}=\dfrac{1}{n}\).
Greedy clique problem: on a random graph \(G\sim G(n,\frac{1}{2})\), the greedy algorithm returns a clique \(A(G)\) with \(\log_2(n)-\log_2(\log_2(n))\le |A(G)|\le\log_2(n)+\log_2(\log_2(n))\) with probability \(1-o(1)\).
Upper bound: Let \(K=\log_2(n)+\log_2(\log_2(n))\), for \(2\le i\le n\), let \(T_i\) be the event such that the greedy algorithm selects \(i\) as the \(K\)-th vertex to join \(S\), then by distributive law, \(\text{Pr}(|S|>K)=\sum\limits_{2\le i\le n}\text{Pr}(T_i)·\text{Pr}(|S|>K|T_i)\), since \(\text{Pr}(|S|>K|T_i)\le\dfrac{n}{2^K}\), \(\text{Pr}(|S|>K)\le\dfrac{n}{2^K}=\dfrac{1}{\log n}=o(1)\).
Lower bound (Chebyshev inequality): Let \(K^-=\log_2(n)-\log_2(\log_2(n))\), let \(X_m(G)\) be the \(m\)-th vertex to join the clique, and let \(Y_m=X_{m+1}-X_m\), then the \(Y_j\) are independent geometric random variables with success probability \(b_j=\dfrac{1}{2^j}\), so \(\text{Pr}(Y_j=t)=(1-b_j)^{t-1}b_j\) for all \(t\ge 1\). Let \(X'=\sum\limits_{j=1}^{K^-}Y_j\), then we need to estimate \(\text{E}(X')\) and \(\text{Var}(X')\). Since \(\text{E}(Y_j)=2^j\), by linearity of expectation, \(\text{E}(X')=\sum\limits_{j=1}^{K^-}2^j=2^{1+K^-}-2\le\dfrac{2n}{\log_2(n)}\), \(\text{Var}(X')=\sum\limits_{j=1}^{K^-}2^{2j}(1-2^{-j})\le 2(\dfrac{n}{\log_2(n)})^2\). So for large \(n\), \(\text{Pr}(X'>n-1)\le\text{Pr}(X'-\text{E}(X')>\dfrac{n}{2})\), and by Chebyshev inequality, \(\text{Pr}(X'-\text{E}(X')>\dfrac{n}{2})\le\dfrac{\text{Var}(X')}{(\frac{n}{2})^2}\le\dfrac{8}{(\log_2(n))^2}=o(1)\).
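A small simulation sketch (my own, not from the lecture). It assumes the greedy algorithm scans the vertices in a fixed order and adds a vertex whenever it is adjacent to every vertex already selected, sampling the edges of \(G(n,\frac{1}{2})\) lazily:

```python
import math
import random

def greedy_clique_size(n, p=0.5):
    """Scan the vertices of G(n, p) in order; add a vertex to S if it is adjacent to all of S.
    Each candidate edge is examined at most once, so it can be sampled lazily on the fly."""
    size = 0
    for _ in range(n):
        if all(random.random() < p for _ in range(size)):
            size += 1
    return size

n = 1 << 14
sizes = [greedy_clique_size(n) for _ in range(20)]
print(sum(sizes) / len(sizes), math.log2(n))   # the average is close to log2(n) = 14
```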
The expectation of a random variable is defined as \(\text{E}(X)=\sum\limits_{u\in U}p(u)X(u)\).
The expected number of cycles in a permutation is \(H_n\).
Proof: For a permutation \(p\), let \(L_i\) be the length of the cycle containing \(i\), then the number of cycles in \(p\) equals \(\sum\limits_{i=1}^n\dfrac{1}{L_i}\), so the expected number of cycles in a random permutation is \(\text{E}(\sum\limits_{i=1}^n\dfrac{1}{L_i})=n\text{E}(\dfrac{1}{L_1})=n\sum\limits_{s=1}^n\dfrac{1}{s}\cdot\dfrac{1}{n}=H_n\), using that each \(L_i\) has the same distribution and \(\text{Pr}(L_1=s)=\dfrac{1}{n}\).
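A small simulation checking both facts above, i.e. \(\text{Pr}(L_1=s)=\frac{1}{n}\) for every \(s\) and the average number of cycles being \(H_n\) (my own sketch, illustrative parameters):

```python
import random

def cycle_stats(n, trials=20000):
    """Return (distribution of the length of the cycle containing 0, average number of cycles)."""
    len_counts = [0] * (n + 1)
    total_cycles = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        seen = [False] * n
        for start in range(n):
            if not seen[start]:
                total_cycles += 1
                length, j = 0, start
                while not seen[j]:
                    seen[j] = True
                    j = perm[j]
                    length += 1
                if start == 0:
                    len_counts[length] += 1
    return [c / trials for c in len_counts[1:]], total_cycles / trials

n = 10
dist, avg_cycles = cycle_stats(n)
print(dist)                                              # each entry is close to 1/n = 0.1
print(avg_cycles, sum(1 / i for i in range(1, n + 1)))   # both close to H_10 ~ 2.93
```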
Conditional expectation: \(\text{E}(X|T)=\sum\limits_{u\in T}\dfrac{p(u)X(u)}{\text{Pr}(T)}\).
Distributive law for expectation: Let \(W_1,W_2,\cdots,W_m\) be a partition of \(U\), then \(\text{E}(X)=\sum\limits_{i=1}^m\text{Pr}(W_i)E(X|W_i)\).
Mean of the geometric distribution: the expected time of the first head when flipping a \(p\)-biased coin is \(\dfrac{1}{p}\).
Proof: \(\text{E}(X)=p+(1-p)(1+\text{E}(X))\), so \(\text{E(X)}=\dfrac{1}{p}\).
Independent variables: \(X,Y\) are independent if \(\forall x,y\), \(\text{Pr}(X=x,Y=y)=\text{Pr}(X=x)\text{Pr}(Y=y)\).
Variance of \(X\): \(\text{Var}(X)=\text{E}((X-\text{E}(X))^2)=\mathbb{E}(X^2)-\mathbb{E}(X)^2\).
Standard deviation of \(X\): \(\sigma(X)=\sqrt{\text{Var}(X)}\).
If \(X\) and \(Y\) are independent, then \(\text{E}(XY)=\text{E}(X)\text{E}(Y),\text{Var}(X+Y)=\text{Var}(X)+\text{Var}(Y)\).
Tail estimates:
- Markov inequality: Let \(X\) be a non-negative random variable, then \(\text{Pr}(X\ge c\text{E}(X))\le\dfrac{1}{c}\).
- Chebyshev inequality: \(\text{Pr}(|X-\text{E}(X)|\ge c\sigma(X))\le\dfrac{1}{c^2}\). (not tight, but very general)
Lecture 3
i.i.d. coin flips: \(n\) independent tosses of a fair coin; let \(X_i\) be the indicator that the \(i\)-th toss is heads and \(X=\sum\limits_iX_i\). Then \(\mathbb{E}(X)=\dfrac{n}{2}\), \(\text{Var}(X)=\sum\limits_{i}\text{Var}(X_i)=\dfrac{n}{4}\), so according to Chebyshev inequality, \(\text{Pr}(|X-\mu|\ge c\cdot\sigma)\le\dfrac{1}{c^2}\), where \(\mu=\dfrac{n}{2},\sigma=\dfrac{\sqrt{n}}{2}\). However, this upper bound is not tight (when \(c=10\), \(\text{RHS}=\dfrac{1}{100}\), while the probability is actually \(\le e^{-20}\)).
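A quick numerical comparison of the exact binomial tail with the Chebyshev bound for the \(c=10\) example (my own sketch):

```python
import math

n, c = 10000, 10
mu, sigma = n / 2, math.sqrt(n) / 2
lo, hi = mu - c * sigma, mu + c * sigma

# Exact tail Pr(|X - mu| >= c*sigma) for X ~ Binomial(n, 1/2), with an incrementally updated C(n, k).
tail, coef = 0, 1                      # coef equals C(n, k) at the start of iteration k
for k in range(n + 1):
    if k <= lo or k >= hi:
        tail += coef
    coef = coef * (n - k) // (k + 1)
print(tail / 2 ** n, 1 / c ** 2)       # exact tail ~1e-23, vastly smaller than Chebyshev's 0.01
```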
A more powerful tool: Chernoff bound: Let \(X_1,X_2,\cdots,X_n\) be independent coin tosses where \(\text{Pr}(X_i=1)=b_i,\text{Pr}(X_i=0)=1-b_i\), \(X=\sum\limits_{i=1}^nX_i\), then \(\text{Pr}(X\ge(1+\delta)\mu)\le\exp(-\dfrac{\delta^2}{2+\delta}\mu),\text{Pr}(X\le(1-\delta)\mu)\le\exp(-\dfrac{\delta^2}{2}\mu)\).
Proof: First we prove that \(\exp(-\dfrac{\delta^2}{2+\delta}\mu)\ge(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\):
\[\begin{aligned} &\exp(-\dfrac{\delta^2}{2+\delta}\mu)\ge(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\\ \Leftrightarrow&-\dfrac{\delta^2}{2+\delta}\mu\ge\mu(\delta-(1+\delta)\ln(1+\delta))\\ \Leftrightarrow&(1+\delta)\ln(1+\delta)\ge\delta+\dfrac{\delta^2}{2+\delta}\\ \Leftrightarrow&\ln(1+\delta)\ge\dfrac{2\delta}{2+\delta} \end{aligned} \]Since when \(\delta=0\), \(\ln(1+\delta)=\dfrac{2\delta}{2+\delta}\), so we only need to prove \(\dfrac{\mathrm d}{\mathrm d\delta}(\ln(1+\delta))\ge\dfrac{\mathrm d}{\mathrm d\delta}(\dfrac{2\delta}{2+\delta})\). Since \(\dfrac{\mathrm d}{\mathrm d\delta}(\ln(1+\delta))=\dfrac{1}{1+\delta}\), \(\dfrac{\mathrm d}{\mathrm d\delta}(\dfrac{2\delta}{2+\delta})=\dfrac{4}{(2+\delta)^2}\), \(\dfrac{1}{1+\delta}-\dfrac{4}{(2+\delta)^2}=\dfrac{\delta^2}{(1+\delta)(2+\delta)^2}\ge 0\). So the inequality holds.
Next let's prove \(\text{Pr}(X\ge(1+\delta)\mu)\le(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\). Let \(t>0\) be a parameter, then \(\text{Pr}(X\ge(1+\delta)\mu)=\text{Pr}(e^{tX}\ge e^{t(1+\delta)\mu})\le\dfrac{\mathbb{E}(e^{tX})}{e^{t(1+\delta)\mu}}\) by Markov inequality. By independence and \(1+x\le e^x\), \(\mathbb{E}(e^{tX})=\prod\limits_{i=1}^n\mathbb{E}(e^{tX_i})=\prod\limits_{i=1}^n(1+b_i(e^t-1))\le\prod\limits_{i=1}^ne^{b_i(e^t-1)}\le e^{\mu(e^t-1)}\), so \(\text{Pr}(X\ge(1+\delta)\mu)\le e^{f(t)}\) where \(f(t)=\mu(e^t-1)-t(1+\delta)\mu\). The minimum of \(f(t)\) is \(f(\ln(1+\delta))=\mu\delta-\mu(1+\delta)\ln(1+\delta)\). So \(\text{Pr}(X\ge(1+\delta)\mu)\le(\dfrac{e^{\delta}}{(1+\delta)^{1+\delta}})^{\mu}\).
Chernoff bound for the mean: with \(\bar{X}=\dfrac{1}{n}\sum\limits_{i=1}^nX_i\) and \(\mu=\mathbb{E}(\bar{X})\), \(\text{Pr}(|\bar{X}-\mu|\ge\epsilon)\le 2\exp(-\dfrac{\epsilon^2}{2+\epsilon}·n)\).
Hoeffding inequality: let \(Z_1,Z_2,\cdots,Z_n\) be independent bounded random variables with \(Z_i\in[a,b]\), then for all \(t\ge 0\), \(\text{Pr}(|\dfrac{1}{n}\sum\limits_{i=1}^n(Z_i-\mathbb{E}(Z_i))|\ge t)\le 2\exp(-\dfrac{2nt^2}{(b-a)^2})\).
Hoeffding lemma: Let \(Z\) be a bounded random variable with \(Z\in[a,b]\), then \(\mathbb{E}(\exp(t(Z-\mathbb{E}(Z))))\le\exp(\dfrac{t^2(b-a)^2}{8})\).
Negatively associated random variables: \(X_1,X_2,\cdots,X_n\) are negatively associated if and only if \(\mathbb{E}(\exp(\sum X_i))\le\prod\mathbb{E}(\exp(X_i))\).
Martingale: \(Y_i=\mathbb{E}(f(X_1,X_2,\cdots,X_n)|X_1,X_2,\cdots,X_i)\) (the Doob martingale of \(f\)). Equivalent general definition: \((Y_i)\) is a martingale with respect to \((X_i)\) if \(Y_i\) is determined by \(X_1,X_2,\cdots,X_i\) and \(\mathbb{E}(Y_i|X_1,X_2,\cdots,X_{i-1})=Y_{i-1}\).
Azuma's inequality: If \(Y_i=\mathbb{E}(f(X_1,X_2,\cdots,X_n)|X_1,X_2,\cdots,X_i)\) is a martingale with \(|Y_i-Y_{i-1}|\le c_i\), then for any \(t\ge 0\), \(\text{Pr}(|f(X_1,X_2,\cdots,X_n)-Y_0|\ge t)\le 2\exp(-\dfrac{t^2}{2\sum c_i^2})\) (note that \(Y_n=f(X_1,X_2,\cdots,X_n)\)).
Picking the best robot: there are \(k\) robots with unknown accuracies \(Acc_1,\cdots,Acc_k\); test each robot on \(n\) independent examples, let \(\hat{Acc_i}\) be its empirical accuracy, and output the robot with the largest \(\hat{Acc_i}\). How large does \(n\) need to be?
Analysis: If \(\forall i,|\hat{Acc_i}-Acc_i|\le\dfrac{\epsilon}{2}\), then the robot with the best empirical accuracy has true accuracy within \(\epsilon\) of the best. So let \(E=\{\forall i,|\hat{Acc_i}-Acc_i|\le\dfrac{\epsilon}{2}\}\), \(D_i=\{|\hat{Acc_i}-Acc_i|>\dfrac{\epsilon}{2}\}\), then \(\bar{E}\sube D_1\cup D_2\cup\cdots\cup D_k\), so by union bound, \(\text{Pr}(\bar{E})\le\sum\limits_{i=1}^k\text{Pr}(D_i)\le k\max_i\text{Pr}(D_i)\). By Chernoff bound, \(\text{Pr}(D_i)\le\exp(-\dfrac{(\epsilon/2)^2}{\epsilon/2+2}·n)\le\exp(-\dfrac{\epsilon^2}{100}n)\) (assuming that \(2\epsilon+8\le 100\)). To get \(\text{Pr}(\bar{E})\le\epsilon\), we should let \(\exp(-\dfrac{\epsilon^2}{100}n)\le\dfrac{\epsilon}{k}\), i.e. \(n=O(\dfrac{\ln(\frac{k}{\epsilon})}{\epsilon^2})\).
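A small sketch of the resulting sample-size rule together with a simulation of the selection procedure (my own; the accuracies in `true_accs` are hypothetical):

```python
import math
import random

def samples_needed(k, eps):
    """Smallest n with exp(-eps^2 * n / 100) <= eps / k, i.e. the bound derived above."""
    return math.ceil(100 * math.log(k / eps) / eps ** 2)

def pick_best(true_accs, n):
    """Test every robot on n Bernoulli trials and return the index with the best empirical accuracy."""
    estimates = [sum(random.random() < acc for _ in range(n)) / n for acc in true_accs]
    return max(range(len(true_accs)), key=lambda i: estimates[i])

k, eps = 10, 0.1
n = samples_needed(k, eps)
true_accs = [0.5 + 0.03 * i for i in range(k)]     # hypothetical accuracies; the last robot is best
picked = pick_best(true_accs, n)
print(n, picked, true_accs[-1] - true_accs[picked] <= eps)   # True with probability >= 1 - eps
```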
Lecture 4
Entropy is a universal measure of randomness.
Definition of entropy: \(H(X)=-\sum\limits_{x}\text{Pr}(X=x)\log_2\text{Pr}(X=x)\)
For binary random variables, \(H(p)=-p\log_2p-(1-p)\log_2(1-p)\)
Lemma 1: if \(nq\) is an integer in \([0,n]\), then \(\dfrac{2^{nH(q)}}{n+1}\le\dbinom{n}{nq}\le 2^{nH(q)}\).
Proof: The statement is trivial if \(q=0\) or \(1\), so assume that \(0<q<1\).
To prove the upper bound, since \(\sum\limits_{k=0}^n\dbinom{n}{k}q^k(1-q)^{n-k}=(q+(1-q))^n=1\), so \(\dbinom{n}{nq}\le q^{-nq}(1-q)^{-(1-q)n}=2^{nH(q)}\).
To prove the lower bound, we know that \(\dbinom{n}{nq}q^{qn}(1-q)^{(1-q)n}\) is one term of the expression \(\sum\limits_{k=0}^n\dbinom{n}{k}q^k(1-q)^{n-k}\), and we show that it is the largest term. Consider the difference between two consecutive terms:
\[\begin{aligned} &\dbinom{n}{k}q^k(1-q)^{n-k}-\dbinom{n}{k+1}q^{k+1}(1-q)^{n-k-1}\\ =&\dbinom{n}{k}q^k(1-q)^{n-k}(1-\dfrac{q}{1-q}\cdot\dfrac{n-k}{k+1}) \end{aligned} \]The difference is non-negative if and only if \(1-\dfrac{q}{1-q}\cdot\dfrac{n-k}{k+1}\ge 0\), i.e. \(k\ge qn-1+q\); so the terms increase up to \(k=qn\) and decrease afterwards, and \(k=qn\) gives the largest term in the summation.
So \(\dbinom{n}{nq}\ge\dfrac{q^{-qn}(1-q)^{-(1-q)n}}{n+1}=\dfrac{2^{nH(q)}}{n+1}\).
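A quick numerical check of Lemma 1 (my own sketch; \(n\) and \(q\) are illustrative):

```python
import math

def H(q):
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

n, q = 100, 0.3
binom = math.comb(n, int(n * q))
upper = 2 ** (n * H(q))
print(upper / (n + 1) <= binom <= upper)   # True
```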
Actually the number of sequences with \(nq\) heads is very close to the upper bound: by Lemma 1 it is at least \(2^{nH(q)-\log_2(n+1)}\).
Entropy measures how many unbiased, independent bits can be extracted from a random variable.
Let \(X\) be a random variable supported on \(\mathcal X\). An extraction function \(\text{Ext}:\mathcal X\to\{0,1\}^*\) outputs a random sequence of bits that is uniform conditioned on its length: for every \(k\) with \(\text{Pr}(|\text{Ext}(X)|=k)>0\) and every \(y\in\{0,1\}^k\), \(\text{Pr}(\text{Ext}(X)=y\,\big|\,|\text{Ext}(X)|=k)=\dfrac{1}{2^k}\).
Theorem: Consider a coin that comes up heads with probability \(p>\dfrac{1}{2}\), then for any \(\delta>0\) and sufficiently large \(n\) we have:
- The average number of bits output by any deterministic extraction function on an input sequence of \(n\) independent flips is at most \(nH(p)\).
- There exists an extraction function \(\text{Ext}(\cdot)\) that outputs, on a sequence of \(n\) independent flips, an average of at least \((1-\delta)nH(p)\) independent random bits.
Proof:
Upper bound: First, if an input sequence \(x\) occurs with probability \(q\), then \(|\text{Ext}(x)|\le\log_2(\dfrac{1}{q})\). So if \(B\) is a random variable representing the number of bits our extraction function produces on input \(X\), then
\[\begin{aligned} \mathbb{E}[B]&=\sum\limits_{x}\text{Pr}(X=x)|\text{Ext}(x)|\\ &\le\sum\limits_{x}\text{Pr}(X=x)\log_2(\dfrac{1}{\text{Pr}(X=x)})\\ &=H(X)\\ &=nH(p) \end{aligned} \]Lower bound:
Lemma 2: Suppose that the value of \(X\) is chosen uniformly at random from the integers \(\{0,1,2,\cdots,m-1\}\), so that \(H(X)=\log_2m\). Then there is an extraction function for \(X\) that outputs on average at least \(\lfloor\log_2m\rfloor-1\) independent and unbiased bits.
So let \(Z\) be the number of heads and \(B\) be the number of bits our extraction function produces, then \(\mathbb{E}(B)=\sum\limits_{k=0}^n\text{Pr}(Z=k)\mathbb{E}(B|Z=k)\). If given \(Z=k\), then the sequence can be seen as uniformly random chosen from \(\dbinom{n}{k}\) possible outcomes, so by the lemma 2, \(\mathbb{E}(B|Z=k)\ge\lfloor\log_2\dbinom{n}{k}\rfloor-1\). By lemma \(1\), for some small \(\epsilon\) to be determined and \(n(p-\epsilon)\le k\le n(p+\epsilon)\), we have \(\dbinom{n}{k}\ge\dbinom{n}{\lfloor n(p+\epsilon)\rfloor}\ge\dfrac{2^{nH(p+\epsilon)}}{n+1}\), so
\[\begin{aligned} \mathbb{E}(B)\ge&\sum\limits_{k=\lfloor n(p-\epsilon)\rfloor}^{\lfloor n(p+\epsilon)\rfloor}\text{Pr}(Z=k)\mathbb{E}(B|Z=k)\\ \ge&\sum\limits_{k=\lfloor n(p-\epsilon)\rfloor}^{\lfloor n(p+\epsilon)\rfloor}\text{Pr}(Z=k)(nH(p+\epsilon)-\log_2(n+1)-1)\\ =&(nH(p+\epsilon)-\log_2(n+1)-1)\text{Pr}(|Z-np|\le\epsilon n) \end{aligned} \]Since \(\mathbb{E}(Z)=np\), thus by Chernoff bound, \(\text{Pr}(|Z-np|\le\epsilon n)\ge 1-2e^{-n\epsilon^2/3p}\). So for any \(\delta>0\), we can have \(\mathbb{E}(B)\ge(1-\delta)nH(p)\) by choosing \(\epsilon\) sufficiently small and \(n\) sufficiently large.
Compression: reduces the number of bits needed to represent data by exploiting its likelihood structure.
A compression function \(\text{Com}:\{0,1\}^*\to\{0,1\}^*\) takes a sequence of binary bits as input and outputs a sequence of binary bits such that \(\forall x\ne x'\), \(\text{Com}(x)\ne\text{Com}(x')\).
Theorem: Consider a coin that comes up heads with probability \(p>\dfrac{1}{2}\). For any constant \(\delta>0\), when \(n\) is sufficiently large:
- There exists a compression function \(\text{Com}(\cdot)\) such that \(\mathbb{E}(|\text{Com}(x)|)\le(1+\delta)nH(p)\).
- For any compression function, \(\mathbb{E}(|\text{Com}(x)|)\ge(1-\delta)nH(p)\).
Shannon Theorem: Through a noisy channel (each bit will be flipped with probability \(p\)), the sender can reliably send messages about \(k=n(1-H(p))\) bits within each block of \(n\) bits.
Background:
- The sender takes a \(k\)-bit message and encodes it into a block of \(n\ge k\) bits via the encoding function.
- These bits are sent over the noisy channel.
- The receiver attempts to determine the original \(k\)-bit message using the decoding function.
Formal description of Shannon Theorem: For a binary symmetric channel with parameter \(p<\dfrac{1}{2}\) and any \(\delta,\epsilon>0\), when \(n\) is sufficiently large:
- For any \(k\le n(1-H(p)-\delta)\), there exists a \((k,n)\) encoding and decoding functions such that the success probability \(\ge 1-\epsilon\).
- No \((k,n)\) encoding and decoding functions exist for \(k\ge n(1-H(p)+\delta)\) such that the success probability \(\ge 1-\epsilon\).
Proof: Let \(C=\{c_1,c_2,\cdots,c_M\}\) be some arbitrary collection of distinct codewords with length \(n\). If we send \(c_i\) through the channel, we will receive some \(\tilde{c_i}\). Since \(\mathbb{E}(d_H(c_i,\tilde{c_i}))=np\), by Chernoff bound, there is some \(\gamma\) such that with probability \(1-\dfrac{\epsilon}{2}\), \((p-\gamma)n\le d_H(c_i,\tilde{c_i})\le (p+\gamma)n\). Choose \(\gamma\) as small as possible and define \(\text{Ring}(c_i)=\{c:|d_H(c_i,c)-np|\le\gamma n\}\).
If \(\tilde{c_i}\in\text{Ring}(c_i)\), but \(\forall j\ne i\), \(\tilde{c_j}\notin\text{Ring}(c_i)\), then \(c_i\) can be successfully decoded. We denote this as \(\text{Success}(c_i)\). If \(\forall i\), \(\text{Success}(c_i)\) holds, then the decoding algorithm is successful.
How to choose a codebook such that the success probability is high?
First we should know how to calculate the volume of \(\text{Ring}(c_i)\). In fact, as \(n\to\infty\), the volume of \(\text{Ring}(c_i)\) is at most \(2^{(H(p)+\delta')n}\), with \(\delta'\to 0\).
Proof for this: By Chernoff bound, \(\text{Pr}(|d_H(c_i,\tilde{c_i})-\mu|>\delta\mu)<2\exp(-\dfrac{1}{3}\delta^2\mu)\), where \(\mu=np\) and \(\delta\) is the relative deviation. To make this probability less than \(\dfrac{\epsilon}{2}\), it suffices that \(2\exp(-\dfrac{1}{3}\delta^2\mu)\le\dfrac{\epsilon}{2}\), i.e. \(\delta\ge\sqrt{\dfrac{3\ln(\frac{4}{\epsilon})}{\mu}}=O(\dfrac{1}{\sqrt{n}})\), so \(\gamma=p\delta=O(\dfrac{1}{\sqrt{n}})\) when \(n\) gets large.
Let \(L=\lceil np-n\gamma\rceil,R=\lfloor np+n\gamma\rfloor\). Then \(\text{Vol}(\text{Ring}(c_i))=\sum\limits_{i=L}^R\dbinom{n}{i}\). Since \(p<\dfrac{1}{2}\), \(\gamma=O(\dfrac{1}{\sqrt{n}})\), so when \(n\) gets large enough, \(R<\dfrac{n}{2}\), \(\text{Vol}(\text{Ring}(c_i))\le (R-L+1)\dbinom{n}{R}\le n\dbinom{n}{R}\). On the other hand, \(\dbinom{n}{R}=\dbinom{n}{\lfloor n(p+\gamma)\rfloor}\le 2^{nH(p+\gamma)}\). And \(H(p+\gamma)=-(p+\gamma)\log_2(p+\gamma)-(1-p-\gamma)\log_2(1-p-\gamma)\); when \(n\to\infty\), \(\gamma\to 0\), so as \(n\) grows large, \(H(p+\gamma)=-(p+\gamma)(\log_2(p)+o(1))-(1-p-\gamma)(\log_2(1-p)-o(1))=-p\log_2(p)-(1-p)\log_2(1-p)+O(\dfrac{1}{\sqrt{n}})=H(p)+O(\dfrac{1}{\sqrt{n}})\). So \(\text{Vol}(\text{Ring}(c_i))\le n2^{n(H(p)+O(\frac{1}{\sqrt{n}}))}=2^{n(H(p)+O(\frac{1}{\sqrt{n}})+\frac{\log_2 n}{n})}=2^{n(H(p)+\delta')}\), where \(\delta'\to 0\) as \(n\to\infty\).
According to this, let's now choose the codebook \(C\) we desired. Let \(M=2\cdot 2^k\), let \(C\) be a codebook with \(M\) codewords uniformly chosen from \(\{0,1\}^n\). Fix some \(i\), consider the probability such that \(\text{Success}(c_i)\) does not happen. With probability \(1-\dfrac{\epsilon}{2}\), \(\tilde{c_i}\in\text{Ring}(c_i)\). And for some \(j\ne i\), \(\text{P}(\tilde{c_j}\in\text{Ring}(c_i))=\dfrac{\text{Vol}(\text{Ring}(c_i))}{2^n}=2^{(H(p)+\delta'-1)n}\). By union bound, \(\text{Pr}(\text{Success}(c_i)\text{ does not happen})\le\dfrac{\epsilon}{2}+M2^{(H(p)+\delta'-1)n}=\dfrac{\epsilon}{2}+2^{1+(k/n+H(p)-1+\delta')n}\). Since \(\dfrac{k}{n}<1-H(p)\), so the second term goes to \(0\) as \(n\to\infty\), so \(\text{Pr}(\text{Success}(c_i)\text{ does not happen})<\epsilon\) if \(C\) and \(i\) are both chosen uniformly random from all possible choices.
So \(\dfrac{1}{\text{#possible codebooks}}\sum\limits_{C}\dfrac{1}{M}\sum\limits_{i=1}^M\text{Pr}(\text{Success}(c_i)\text{ does not happen})\le\epsilon\). So there must be a choice of codebook \(C^*\) which does better than average. Let \(C'\) be the half of \(C^*\) consisting of the codewords with the smallest failure probability; by an averaging (Markov) argument, every codeword in \(C'\) has failure probability \(\le 2\epsilon\). So \(C'\) provides a good codebook with \(2^k\) codewords.
Lecture 5
Maximum clique problem: Let \(G\sim G(n,\frac{1}{2})\) be a random graph on \(n\) vertices, and let \(w(G)\) be the size of the largest clique in \(G\); we prove that when \(n\) is large enough, \(w(G)\) is very close to \(2\log_2(n)\) with high probability.
Proof:
Lower bound: Let \(m=(2-\epsilon)\log_2(n)\), and \(T\) be the event that \(w(G)\ge m\), \(M\) be the family of vertex subsets of size \(m\), \(A_V=[V\text{ is a clique}]\). So we only need to prove that \(\text{Pr}(T)=1-o(1)\). Now consider \(X=\sum\limits_{V\in M}A_V\), thus \(\text{Pr}(T)=\text{Pr}(X>0)\). To prove this, we prove the following two things:
- \(\mathbb{E}(X)\to\infty\) as \(n\to\infty\)
- \(\text{Var}(X)=(\mathbb{E}(X)^2)\cdot o(1)\) as \(n\to\infty\).
If these two statements are proved, then by Chebyshev inequality, \(\text{Pr}(X=0)\le\text{Pr}(|X-\mathbb{E}(X)|\ge\dfrac{1}{2}\mathbb{E}(X))\le\dfrac{\text{Var}(X)}{(\frac{1}{2}\mathbb{E}(X))^2}=o(1)\).
For (1). Apply Stirling's approximation: \(n!\sim\sqrt{2\pi n}(\dfrac{n}{e})^n\). So
\[\begin{aligned} &\mathbb{E}(X)\\ =&\dbinom{n}{m}\cdot\dfrac{1}{2^{\binom{m}{2}}}\\ =&\dfrac{n!}{m!(n-m)!}\cdot\dfrac{1}{2^{\binom{m}{2}}}\\ \approx&\dfrac{\sqrt{2\pi n}(\frac{n}{e})^n}{\sqrt{2\pi m}(\frac{m}{e})^m\cdot\sqrt{2\pi(n-m)}(\frac{n-m}{e})^{n-m}}\cdot\dfrac{1}{2^{\binom{m}{2}}}\\ \ge&\Omega(\dfrac{n^m}{\sqrt{2\pi m}(\frac{m}{e})^m}\cdot\dfrac{1}{2^{\binom{m}{2}}})\\ =&\Omega((\dfrac{en}{(2\pi m)^{\frac{1}{2m}}\cdot m}\cdot \dfrac{1}{2^{\frac{m-1}{2}}})^m)\\ \end{aligned} \]Since \(m=(2-\epsilon)\log_2(n)\), and \(\lim\limits_{m\to\infty}(2\pi m)^{\frac{1}{2m}}=1\), so
\[\begin{aligned} &\Omega((\dfrac{en}{(2\pi m)^{\frac{1}{2m}}\cdot m}\cdot \dfrac{1}{2^{\frac{m-1}{2}}})^m)\\ =&\Omega((\dfrac{0.01n}{\log_2n}\cdot \dfrac{1}{n^{1-\frac{1}{2}\epsilon}})^m)\\ =&\Omega((\dfrac{0.01n^{\frac{1}{2}\epsilon}}{\log_2n})^{\log_2n})\\ =&n^{\Omega(\log_2n)} \end{aligned} \]Since \(n^{\Omega(\log_2n)}\to\infty\) as \(n\to\infty\), \(\mathbb{E}(X)\to\infty\) as \(n\to\infty\).
For (2).
\[\begin{aligned} \text{Var}(X)&=\mathbb{E}\Big(\big(\sum\limits_{V\in M}A_V\big)^2\Big)-(\mathbb{E}(X))^2\\ &\le\mathbb{E}\Big(\sum\limits_{V}\sum\limits_{V'}A_VA_{V'}\Big)-\sum\limits_{V}\sum\limits_{|V\cap V'|\le 1}\mathbb{E}(A_V)\mathbb{E}(A_{V'})\\ &=\sum\limits_{V}\mathbb{E}(A_V)+\sum\limits_{V}\sum\limits_{|V\cap V'|\le 1}\mathbb{E}(A_VA_{V'})+\sum\limits_{V}\sum\limits_{V'\ne V,|V\cap V'|>1}\mathbb{E}(A_VA_{V'})-\sum\limits_{V}\sum\limits_{|V\cap V'|\le 1}\mathbb{E}(A_V)\mathbb{E}(A_{V'})\\ \end{aligned} \]Since when \(|V\cap V'|\le 1\) the two vertex sets share no potential edge, \(\mathbb{E}(A_VA_{V'})=\mathbb{E}(A_V)\mathbb{E}(A_{V'})\), thus
\[\begin{aligned} \text{Var}(X)&\le\mathbb{E}(X)+\sum\limits_{2\le k\le m}\sum\limits_{V}\sum\limits_{|V\cap V'|=k}\mathbb{E}(A_VA_{V'})\\ &=\mathbb{E}(X)+\sum\limits_{2\le k\le m}\sum\limits_{V}\sum\limits_{|V\cap V'|=k}\text{Pr}(A_V=1)\text{Pr}(A_{V'}=1|A_V=1)\\ &=\mathbb{E}(X)+\sum\limits_{2\le k\le m}\sum\limits_{V}\text{Pr}(A_V=1)\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\\ &=\mathbb{E}(X)+\mathbb{E}(X)\cdot\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}} \end{aligned} \]Now let's focus on \(\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\). We prove that \(\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\le\dfrac{m^5}{n-m+1}\mathbb{E}(X)\) as follows:
Since \(\mathbb{E}(X)=\dfrac{\dbinom{n}{m}}{2^{\binom{m}{2}}}\), thus first we have
\[ \begin{aligned} \sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}&\le\dfrac{m^5}{n-m+1}\mathbb{E}(X)\\ \Leftrightarrow\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}&\le\dfrac{m^5}{n-m+1}\dfrac{\dbinom{n}{m}}{2^{\binom{m}{2}}}\\ \Leftrightarrow\dfrac{1}{\dbinom{n}{m}}\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}2^{\binom{k}{2}}&\le\dfrac{m^5}{n-m+1} \end{aligned} \]Now let's focus on the term on the left side of the \(\le\) term. Let \(a_k=\dbinom{m}{k}\dbinom{n-m}{m-k}2^{\binom{k}{2}}\), then
\[\begin{aligned} &\dfrac{a_{k+1}}{a_k}\\ =&\dfrac{\binom{m}{k+1}\binom{n-m}{m-k-1}2^{\binom{k+1}{2}}}{\binom{m}{k}\binom{n-m}{m-k}2^{\binom{k}{2}}}\\ =&\dfrac{(m-k)^2\cdot 2^k}{(k+1)(n-2m+k+1)} \end{aligned} \]Roughly speaking, when \(n\) gets large enough, this ratio is \(>1\) for \(k>\log_2n\) and \(<1\) for \(k<\log_2(n)\), so the sequence \(a_k\) first decreases and then increases, and \(\max\limits_{2\le k\le m}a_k\) is reached at either \(k=2\) or \(k=m\). We only need to prove that \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_2\le\dfrac{m^5}{n-m+1}\) and \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_m\le\dfrac{m^5}{n-m+1}\).
For \(a_2\), we have
\[\begin{aligned}&\dfrac{1}{\dbinom{n}{m}}\cdot ma_2\\=&\dfrac{1}{\dbinom{n}{m}}\cdot m\dbinom{m}{2}\dbinom{n-m}{m-2}\cdot 2\\\le&m^3\cdot\dfrac{\dbinom{n-m}{m-2}}{\dbinom{n}{m}}\\=&m^3\dfrac{\frac{(n-m)(n-m-1)\cdots(n-2m+3)}{(m-2)(m-3)\cdots 1}}{\frac{n(n-1)\cdots(n-m+1)}{m(m-1)\cdots 1}}\\=&m^3\cdot m(m-1)\cdot\dfrac{(n-m)(n-m-1)\cdots(n-2m+3)}{n(n-1)\cdots(n-m+1)}\\\le&\dfrac{m^5}{n-m+1}\end{aligned} \]For \(a_m\), we have \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_m=\dfrac{m2^{\binom{m}{2}}}{\dbinom{n}{m}}=\dfrac{m}{\mathbb{E}(X)}\). Since \(\mathbb{E}(X)=n^{\Omega(\log_2 n)}\), apparently when \(n\) is large enough, \(\dfrac{m}{\mathbb{E}(X)}\le\dfrac{m^5}{n-m+1}\). So \(\dfrac{1}{\dbinom{n}{m}}\cdot ma_m\le\dfrac{m^5}{n-m+1}\).
So \(\sum\limits_{2\le k\le m}\dbinom{m}{k}\dbinom{n-m}{m-k}\dfrac{1}{2^{\binom{m}{2}-\binom{k}{2}}}\le\dfrac{m^5}{n-m+1}\mathbb{E}(X)\).
So since \(\mathbb{E}(X)\to\infty\) as \(n\to\infty\), replacing \(m\) with \((2-\epsilon)\log_2n\) we have \(\text{Var}(X)\le\mathbb{E}(X)+\dfrac{64(\log_2n)^5}{n}\mathbb{E}(X)^2\le\dfrac{128(\log_2n)^5}{n}\mathbb{E}(X)^2=\mathbb{E}(X)^2\cdot o(1)\), which completes the proof.
Network routing problem: Every node \(i\) on a hypercube sends a message \(M_i\) to some \(\sigma(i)\), however, only one message can pass through every directed edge at a time, how to get all messages successfully delivered within a reasonably short time?
Bit fixing algorithm: fix the bits from left to right, i.e., for \(i=1,2,\cdots,n\): if the \(i\)-th bit of the current node does not equal the \(i\)-th bit of the destination \(\sigma(\cdot)\), then flip the \(i\)-th bit and travel along the corresponding edge.
This routing algorithm has exponential delay in the worst case. Let \(n\) be odd and define \(\sigma(u0v)=v1u\), then the path from \(u0^{\frac{n+1}{2}}\) to \(0^{\frac{n-1}{2}}1u\) must contain the edge \((0^n,0^{\frac{n-1}{2}}10^{\frac{n-1}{2}})\), hence at least \(2^{\frac{n-1}{2}}\) messages need to travel through this edge.
Randomized BFA: generate a random \(v_i\), use BFA to route message \(M_i\) from \(i\) to \(v_i\), at time \(6n\) use BFA to route message \(M_i\) from \(v_i\) to \(\sigma(i)\).
Analysis: We prove that the success probability is larger than \(1-O(2^{-3n})\). It suffices to prove that in phase \(1\) and \(2\), the probability for any \(M_i\) not to reach the destination in time \(6n\) is \(O(2^{-3n})\). Let \(T_i\) be the arrival time for message \(M_i\) to reach \(v_i\) in phase \(1\), then fix some \(i\), if we can prove that \(\text{Pr}(T_i>6n)\) is \(O(2^{-4n})\), then by union bound, \(\text{Pr}(\exists i,T_i>6n)\le |V|\cdot O(2^{-4n})=O(2^{-3n})\).
Now our goal is to prove that \(\text{Pr}(T_i>6n)=O(2^{-4n})\). Let \(S=\{j|j\ne i,\text{Path}(j,v_j)\cap\text{Path}(i,v_i)\ne\varnothing\}\), notice that \(T_i\le d_H(i,v_i)+|S|\), so we only need to bound \(|S|\). Let \(Y_e\) be the number of paths that pass through \(e\), and assume that \(\text{Path}(i,v_i)=(e_1,e_2,\cdots,e_l)\), then \(|S|\le\sum\limits_{j=1}^lY_{e_j}\). Since \(\mathbb{E}(Y_e)=\dfrac{1}{2}\) for each directed edge (the expected total path length \(2^n\cdot\frac{n}{2}\) spread over the \(n2^n\) directed edges), \(\mathbb{E}(|S|)\le\dfrac{n}{2}\), so by Chernoff bound \(\text{Pr}(|S|>5n)<2^{-4n}\), which completes the proof.
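A minimal sketch of the bit-fixing path and the two-phase randomized route (my own illustration; node addresses are integers and the bit order is a representation choice):

```python
import random

def bit_fixing_path(src, dst, n):
    """Nodes visited when fixing the bits of src toward dst one position at a time."""
    path, cur = [src], src
    for i in range(n):
        if (cur >> i) & 1 != (dst >> i) & 1:
            cur ^= 1 << i
            path.append(cur)
    return path

def randomized_route(src, dst, n):
    """Phase 1: route src -> random intermediate v; phase 2: route v -> dst (both by bit fixing)."""
    v = random.randrange(1 << n)
    return bit_fixing_path(src, v, n), bit_fixing_path(v, dst, n)

n = 5
p1, p2 = randomized_route(0b10101, 0b00110, n)
print([format(x, '05b') for x in p1], [format(x, '05b') for x in p2])
```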
Monte Carlo methods: generate random samples \(X_1,X_2,\cdots,X_n\) from a distribution, and use the sample mean \(\dfrac{1}{n}\sum\limits_{i=1}^nf(X_i)\) to estimate \(\mathbb{E}(f(X))\). (main challenge: how to sample)
Importance sampling: to estimate \(\mathbb{E}_{x\sim p}[f(x)]\) when sampling from \(p\) directly is hard, use some proposal distribution \(q\) and the estimator \(\hat{I_n}=\dfrac{1}{n}\sum\limits_{i=1}^nf(y_i)\dfrac{p(y_i)}{q(y_i)}\), where \(y_i\sim q\).
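A minimal importance-sampling sketch (my own; the target \(p=N(0,1)\), proposal \(q=N(0,2^2)\) and \(f(x)=x^2\) are illustrative, so the true value is \(1\)):

```python
import math
import random

def p_pdf(x):   # target density: standard normal
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def q_pdf(x):   # proposal density: normal with standard deviation 2
    return math.exp(-x * x / 8) / (2 * math.sqrt(2 * math.pi))

f = lambda x: x * x
n = 100000
ys = [random.gauss(0, 2) for _ in range(n)]
estimate = sum(f(y) * p_pdf(y) / q_pdf(y) for y in ys) / n
print(estimate)   # close to E_p[x^2] = 1
```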
Rejection sampling: suppose \(f(x)\le Cg(x)\); we can sample from the density \(f\) using samples from \(g\) as follows: generate \(X\sim g\) and accept \(X\) with probability \(\dfrac{f(X)}{Cg(X)}\); if accepted, output \(X\), otherwise repeat this process.
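A minimal rejection-sampling sketch with the density \(f(x)=2x\) on \([0,1]\), uniform proposal \(g\) and \(C=2\) (illustrative choices, my own):

```python
import random

def rejection_sample(f, g_sampler, g_pdf, C):
    """Draw X ~ g and accept with probability f(X) / (C * g(X)); repeat until accepted."""
    while True:
        x = g_sampler()
        if random.random() < f(x) / (C * g_pdf(x)):
            return x

f = lambda x: 2 * x
samples = [rejection_sample(f, random.random, lambda x: 1.0, 2.0) for _ in range(100000)]
print(sum(samples) / len(samples))   # close to E[X] = 2/3
```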
Lecture 6
Generating functions: let \(\lang a_k\rang\) be an infinite sequence of complex numbers, define its generating function as: \(A(x)=\sum\limits_{k=0}^{\infty}a_kx^k\).
Let \(X\) be a random variable with range \(\{0,1,2,\cdots\}\) and let \(p_k=\text{Pr}(X=k)\). Let \(A(x)=\sum\limits_{k\ge 0}p_kx^k\). Then \(\mathbb{E}(X)=A'(1),\text{Var}(X)=A''(1)+A'(1)-(A'(1))^2\).
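A worked example: for \(X\sim\text{Bin}(n,p)\),
\[A(x)=\sum\limits_{k=0}^{n}\dbinom{n}{k}p^k(1-p)^{n-k}x^k=(1-p+px)^n,\qquad A'(1)=np,\qquad A''(1)=n(n-1)p^2,\]so \(\mathbb{E}(X)=np\) and \(\text{Var}(X)=A''(1)+A'(1)-(A'(1))^2=n(n-1)p^2+np-n^2p^2=np(1-p)\).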
Use generating function to calculate the closed form of a sequence:
- Fibonacci numbers: \(a_0=a_1=1,a_n=a_{n-1}+a_{n-2}\), let \(A(x)=\sum\limits_{i=0}^{+\infty}a_ix^i\), so \(A(x)=a_0+a_1x+\sum\limits_{i=2}^{+\infty}(a_{i-1}+a_{i-2})x^i=1+x+x(A(x)-1)+x^2A(x)\). So \(A(x)=\dfrac{1}{1-x-x^2}\). Let \(\alpha=\dfrac{1+\sqrt{5}}{2},\beta=\dfrac{1-\sqrt{5}}{2}\), so \(1-x-x^2=(1-\alpha x)(1-\beta x)\) and \(A(x)=\dfrac{\alpha}{\alpha-\beta}\cdot\dfrac{1}{1-\alpha x}-\dfrac{\beta}{\alpha-\beta}\cdot\dfrac{1}{1-\beta x}\). So \(a_n=\dfrac{\alpha^{n+1}-\beta^{n+1}}{\alpha-\beta}=\dfrac{\alpha^{n+1}-\beta^{n+1}}{\sqrt{5}}\).
- Number of triangulations for a convex \(n\)-gon: let \(a_n\) be the number of triangulations of a convex \((n+2)\)-gon, then \(a_n=\sum\limits_{0\le k\le n-1}a_ka_{n-k-1}\). So \(A(x)=1+xA(x)^2\), \(A(x)=\dfrac{1\pm\sqrt{1-4x}}{2x}\). To ensure \(A(0)\) is finite, \(A(x)=\dfrac{1-\sqrt{1-4x}}{2x}\). Since \((1+x)^z=\sum\limits_{k\ge 0}\dbinom{z}{k}x^k\), we have \(\sqrt{1-4x}=\sum\limits_{k\ge 0}\dbinom{\frac{1}{2}}{k}(-4x)^k\). So \(a_n=\dfrac{1}{2}(-1)^n\dbinom{\frac{1}{2}}{n+1}4^{n+1}=\dfrac{1}{2n+1}\dbinom{2n+1}{n}\).
- Up-down permutations: a permutation \(\pi\) of \(\{1,\cdots,n\}\) is up-down (alternating) if \(\pi_1<\pi_2>\pi_3<\pi_4>\cdots\); let \(a_n\) be the number of up-down permutations of length \(n\). Then \(a_1=1\) and \(a_n=\sum\limits_{k=\text{odd}}\dbinom{n-1}{k}a_ka_{n-1-k}\) for odd \(n\ge 3\), so \(n\cdot\dfrac{a_n}{n!}=\sum\limits_{k=\text{odd}}\dfrac{a_k}{k!}\cdot\dfrac{a_{n-1-k}}{(n-1-k)!}\). Let \(b_n=\dfrac{a_n}{n!}\), then \(b_1=1\), \(nb_n=\sum\limits_{k=\text{odd}}b_kb_{n-1-k}\). So let \(B(x)=\sum\limits_{n=\text{odd}}b_nx^n\), then \(B'(x)=1+B(x)^2\), \(B(x)=\tan x\).
How to find the power series coefficients of \(\tan x\)? Consider complex analysis:
Three representations of complex numbers:
- \(z=a+ib\)
- \(z=r(\cos\theta+i\sin\theta)\)
- \(z=re^{i\theta}\)
Let \(f:\mathbb{C}\to\mathbb{C}\) be a complex function, let \(\Gamma\) be a path from \(a\) to \(b\), let \(\int_{\Gamma}f(z)\mathrm dz=\lim\limits_{m\to\infty}\sum\limits_{0\le k\le m-1}f(z_k)(z_{k+1}-z_k)\), where \(z_0,z_1,\cdots,z_m\) divide \(\Gamma\) evenly.
Complex integral on a closed curve:
- Cauchy's integral theorem: if \(f(z)\) is analytic (complex differentiable) on and inside a simple closed contour \(\gamma\), then \(\oint_{\gamma}f(z)\mathrm dz=0\).
- Cauchy's integral formula: let \(f(z)\) be analytic inside and on a simple closed contour \(\gamma\); if \(z_0\) is any point inside \(\gamma\), then \(\oint_{\gamma}\dfrac{f(z)}{z-z_0}\mathrm dz=2\pi if(z_0)\).
Proof: Consider a small circle \(C_{\epsilon}\) around \(z_0\) with radius \(\epsilon\), then \(\oint_{\gamma}\dfrac{f(z)}{z-z_0}\mathrm dz=\oint_{C_{\epsilon}}\dfrac{f(z)}{z-z_0}\mathrm dz\). Parameterize \(C_{\epsilon}\) as \(z=z_0+\epsilon e^{i\theta}\), then \(\mathrm dz=i\epsilon e^{i\theta}\mathrm d\theta\), so as \(\epsilon\to 0,f(z)\approx f(z_0)\) by the continuity of \(f\), so \(\oint_{C_{\epsilon}}\dfrac{f(z)}{z-z_0}\mathrm dz\approx f(z_0)\int_0^{2\pi}\dfrac{i\epsilon e^{i\theta}}{\epsilon e^{i\theta}}\mathrm d\theta=2\pi i f(z_0)\).
- Cauchy's residue theorem: a complex function \(f\) has a pole of order \(m\) at \(z_0\) if \((z-z_0)^mf(z)\) is holomorphic and non-zero at \(z_0\). If \(f(z)\) has a pole of order \(m\) at \(z_0\), it can be represented by a Laurent series in a neighborhood of \(z_0\): \(f(z)=\sum\limits_{n=-m}^{\infty}a_n(z-z_0)^n\) with non-zero \(a_{-m}\) (this generalizes the Taylor series to functions with singularities). For \(f(z)=\sum\limits_{n=-\infty}^{\infty}a_n(z-z_0)^n\), we call \(a_{-1}\) the residue of \(f(z)\) at \(z_0\).
(Cauchy's residue theorem) Let \(f\) be analytic inside and on a simple closed contour \(\gamma\) except for finitely many singularities \(z_1,z_2,\cdots,z_n\) inside \(\gamma\), then \(\oint_{\gamma}f(z)\mathrm dz=2\pi i\sum\limits_{k=1}^n\text{Res}(f,z_k)\).
Lecture 7
Methods for finding residues:
- Simple pole: \(\text{Res}(f,z_0)=\lim\limits_{z\to z_0}(z-z_0)f(z)\).
- Pole of order \(m\): \(\text{Res}(f,z_0)=\dfrac{1}{(m-1)!}\lim\limits_{z\to z_0}\dfrac{\mathrm d^{m-1}}{\mathrm dz^{m-1}}\big((z-z_0)^mf(z)\big)\). (A worked example follows below.)
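A worked example of both rules for \(f(z)=\dfrac{1}{z(z-1)^2}\) (a simple pole at \(0\) and a pole of order \(2\) at \(1\)):
\[\text{Res}(f,0)=\lim\limits_{z\to 0}zf(z)=\dfrac{1}{(0-1)^2}=1,\qquad \text{Res}(f,1)=\lim\limits_{z\to 1}\dfrac{\mathrm d}{\mathrm dz}\Big((z-1)^2f(z)\Big)=\lim\limits_{z\to 1}\Big(-\dfrac{1}{z^2}\Big)=-1.\]The two residues sum to \(0\), consistent with the residue theorem applied to a circle of radius \(R\to\infty\), on which the integral vanishes.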
How to calculate \(\tan x\) using the tools in complex analysis?
We first extend \(\tan x\) to be a function over the complex plane. \(\forall z\in\mathbb{C}\), let \(\tan z=\dfrac{e^{iz}-e^{-iz}}{i(e^{iz}+e^{-iz})}\).
Let \(\beta_n\) be the residue of \(\dfrac{\tan z}{z^{n+1}}\) at \(z=0\). We can prove that \(b_n=\beta_n\). This is because \(\dfrac{\tan z}{z^{n+1}}=\dfrac{1}{z^{n+1}}\sum\limits_{k\ge 0}b_kz^k=\dfrac{b_0}{z^{n+1}}+\dfrac{b_1}{z^n}+\cdots+\dfrac{b_n}{z}+b_{n+1}+b_{n+2}z+\cdots\).
For each odd \(n\), complex function \(\dfrac{\tan z}{z^{n+1}}\) has singularities at
- \(z_m=(m-\dfrac{1}{2})\pi\) for integer \(m\).
- \(z=0\).
Construct a closed curve \(\Gamma_m\) traversing the boundary of the square \([-2m\pi,2m\pi]\times[-2m\pi,2m\pi]\) (so that exactly the poles \((i-\frac{1}{2})\pi\) with \(-2m+1\le i\le 2m\), together with \(0\), lie inside). By Cauchy's Residue Theorem, with \(f=\dfrac{\tan z}{z^{n+1}}\),
\[\dfrac{1}{2\pi i}\oint_{\Gamma_m}\dfrac{\tan z}{z^{n+1}}\mathrm dz=\text{Res}(f,0)+\sum\limits_{i=-2m+1}^{2m}\text{Res}(f,(i-\dfrac{1}{2})\pi) \]Since \(|\tan z|\le 10\) for all \(z\in\Gamma_m\), the \(\text{LHS}\) has absolute value \(\le 8m\cdot\max\limits_{z\in\Gamma_m}\left|\dfrac{\tan z}{z^{n+1}}\right|\le\dfrac{80}{m^{n}}\); as \(m\to\infty\), \(\text{LHS}\to 0\). So \(b_n=-\sum\limits_{m}\text{Res}(\dfrac{\tan z}{z^{n+1}},z_m)\).
A spanning tree of \(G\) is a subgraph \(G'=(V,E')\) such that \(G'\) is connected and \(|E'|=|V|-1\).
Cayley's formula: \(\text{#sp}(K_n)=n^{n-2}\).
Matrix Tree Theorem: The Laplacian of \(G\) is an \(n\times n\) matrix \(L_G=(l_{i,j})\) where \(l_{i,i}=\text{degree}(v_i)\), \(l_{i,j}=-1\) if \((i,j)\in E\) and \(l_{i,j}=0\) otherwise. Then \(\text{#sp}(G)=\det(L_G^{(i)})\) for any \(1\le i\le |V|\), where \(A^{(i)}\) is the matrix \(A\) with the \(i\)-th row and \(i\)-th column deleted.
Lemma (Cauchy-Binet Formula): Let \(n\le m\) and \(A=(a_{i,j}),B=(b_{i,j})\) be \(n\times m\) real matrices. For any \(S\sube\{1,2,3,\cdots,m\}\) with \(|S|=n\), let \(A_S\) denote the \(n\times n\) submatrix of \(A\) with columns indexed by \(S\), similarly for \(B_S\). Then \(\det(A\cdot B^T)=\sum\limits_{|S|=n}\det(A_S)\det(B_S)\).
Proof of Matrix Tree Theorem with the lemma: Let \(A\) be the \(n\times m\) matrix where for each \(e_j=(v_{j_1},v_{j_2})\), let \(a_{j_1,j}=1,a_{j_2,j}=-1\) and other \(a_{i,j}=0\). Then \(AA^T=L_G\). Fix some \(1\le i\le n\), let \(A'\) be the \((n-1)\times m\) matrix obtained from \(A\) by deleting its \(i\)-th row, then \(A'(A')^T=L_G^{(i)}\). For \(S\sube\{1,2,3,\cdots,m\}\) with \(|S|=n-1\), \(|\det(A'_S)|=1\) if \(\{e_k|k\in S\}\) forms a spanning tree of \(G\) and \(\det(A'_S)=0\) otherwise, so by the Cauchy-Binet formula \(\det(L_G^{(i)})=\sum\limits_{|S|=n-1}\det(A'_S)^2=\text{#sp}(G)\), which proves the theorem.
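A small numerical check of the theorem (my own sketch; the helper and the choice of \(K_5\) are illustrative):

```python
import numpy as np
from itertools import combinations

def spanning_tree_count(n, edges):
    """Matrix Tree Theorem: #sp(G) = det of the Laplacian with one row and column removed."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    return round(np.linalg.det(L[1:, 1:]))

n = 5
edges = list(combinations(range(n), 2))   # complete graph K_5
print(spanning_tree_count(n, edges))      # 125 = 5^(5-2), matching Cayley's formula
```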
Lecture 9
Inequalities recap:
- Union bound: \(\mathbb{P}(X_1\lor X_2\lor\cdots\lor X_n)\le\sum\limits_{i}\mathbb{P}(X_i)\).
- Markov's inequality: \(\mathbb{P}(X\ge a)\le\dfrac{\mathbb{E}(X)}{a}\), where \(X\) is non-negative.
- Chebyshev's inequality: let \(\mu=\mathbb{E}(X),\sigma^2=\text{Var}(X)\), then \(\mathbb{P}(|X-\mu|\ge k\sigma)\le\dfrac{1}{k^2}\).
Moment generating function: \(M_X(t)=\mathbb{E}(\exp(tX))\).
Another representation of Chernoff bound: for any \(t>0\), \(\mathbb{P}(X\ge a)\le\dfrac{M_X(t)}{e^{ta}}\).
Proof:
\[\mathbb{P}(X\ge a)=\int_a^{\infty}f(x)\mathrm dx\le\int_a^{\infty}\dfrac{e^{tx}}{e^{ta}}f(x)\mathrm dx\le\dfrac{M_X(t)}{e^{ta}} \]
Hoeffding bound: let \(Z_1,Z_2,\cdots,Z_n\) be independent random variables such that \(Z_i\in[a_i,b_i]\), then for all \(t\ge 0\), \(\mathbb{P}(|\sum\limits_{i=1}^n(Z_i-\mathbb{E}(Z_i))|\ge t)\le 2\exp(-\dfrac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2})\).
High-dimensional geometry.
For high-dimensional objects, most of the volume is near the surface. For an object \(A\) in \(\mathbb{R}^d\), shrink \(A\) by a small amount \(\epsilon\) to produce a new object \((1-\epsilon)A\), then we have \(\text{volume}((1-\epsilon)A)=(1-\epsilon)^d\text{volume}(A)\). Since \((1-\epsilon)^d\le e^{-\epsilon d}\), fix \(\epsilon\) and as \(d\to\infty\), \((1-\epsilon)^d\) will rapidly approach \(0\).
For unit ball in \(d\)-dimensional space:
- Surface area \(A(d)=\dfrac{2\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}\).
- Volume \(V(d)=\dfrac{2\pi^{\frac{d}{2}}}{d\Gamma(\frac{d}{2})}\).
Where \(\Gamma(n+1)=n\Gamma(n),\Gamma(1)=1,\Gamma(\dfrac{1}{2})=\sqrt{\pi}\).
Properties of high-dimensional ball:
Volume lies near equator:
For \(c\ge 1\) and \(d\ge 3\), at least a \(1-\dfrac{2}{c}e^{-\frac{c^2}{2}}\) fraction of the volume of the \(d\)-dimensional unit ball has \(|x_1|\le\dfrac{c}{\sqrt{d-1}}\).
Proof: By symmetry, we just need to prove that at most a \(\dfrac{2}{c}e^{-\frac{c^2}{2}}\) fraction of the half of the ball with \(x_1\ge 0\) has \(x_1\ge\dfrac{c}{\sqrt{d-1}}\), let \(A\) denote the portion of the ball with \(x_1\ge\dfrac{c}{\sqrt{d-1}}\) and \(H\) denote the upper hemisphere, then
\[\text{volume}(A)=\int_{\frac{c}{\sqrt{d-1}}}^1(1-x_1^2)^{\frac{d-1}{2}}V(d-1)\mathrm dx_1 \]To get the upper bound of \(\text{volume}(A)\), use \(1-x\le e^{-x}\) and integrate to infinite, thus
\[\begin{aligned} \text{volume}(A)&\le\int_{\frac{c}{\sqrt{d-1}}}^{\infty}\dfrac{x_1\sqrt{d-1}}{c}e^{-\frac{d-1}{2}x_1^2}V(d-1)\mathrm dx_1\\ &=V(d-1)\dfrac{\sqrt{d-1}}{c}\int_{\frac{c}{\sqrt{d-1}}}^{\infty}x_1e^{-\frac{d-1}{2}x_1^2}\mathrm dx_1\\ &=V(d-1)\dfrac{\sqrt{d-1}}{c}(-\dfrac{1}{d-1}e^{-\frac{d-1}{2}x_1^2})|_{\frac{c}{\sqrt{d-1}}}^{\infty}\\ &=\dfrac{V(d-1)}{c\sqrt{d-1}}e^{-\frac{c^2}{2}} \end{aligned} \]Since \(\text{volume}(H)\ge\dfrac{V(d-1)}{2\sqrt{d-1}}\), we have \(\dfrac{\text{volume}(A)}{\text{volume}(H)}\le\dfrac{2}{c}e^{-\frac{c^2}{2}}\).
Volume lies on the shell, random vectors are orthogonal:
Consider drawing \(n\) points \(x_1,x_2,\cdots,x_n\) at random from the unit ball, with probability \(1-O(n^{-1})\):
- \(|x_i|\ge 1-\dfrac{2\ln n}{d}\) for all \(i\).
- \(|x_i\cdot x_j|\le\dfrac{\sqrt{6\ln n}}{\sqrt{d-1}}\) for all \(i\ne j\).
Proof:
For the first part, since \(\text{Pr}(|x_i|<1-\dfrac{2\ln n}{d})\le e^{-\frac{2\ln n}{d}\cdot d}=\dfrac{1}{n^2}\), by union bound, \(\text{Pr}(\exists i:|x_i|< 1-\dfrac{2\ln n}{d})\le\dfrac{1}{n}\).
For the second part, fix some pair \(i\ne j\): \(\text{Pr}(|x_i\cdot x_j|>\dfrac{\sqrt{6\ln n}}{\sqrt{d-1}})\le O(e^{-\frac{6\ln n}{2}})=O(n^{-3})\), so by union bound, \(\text{Pr}(\exists i\ne j:|x_i\cdot x_j|>\dfrac{\sqrt{6\ln n}}{\sqrt{d-1}})\le O(\dbinom{n}{2}n^{-3})=O(\dfrac{1}{n})\).
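A quick numerical illustration of both properties (my own sketch; the points are drawn uniformly from the unit ball by normalizing a Gaussian direction and scaling the radius by \(U^{1/d}\), a standard sampling trick not covered above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 50

g = rng.standard_normal((n, d))
x = g / np.linalg.norm(g, axis=1, keepdims=True)    # uniform directions
x *= rng.random((n, 1)) ** (1 / d)                  # radii for uniform volume

norms = np.linalg.norm(x, axis=1)
dots = np.abs(x @ x.T - np.diag(norms ** 2))        # off-diagonal |x_i . x_j|
print(norms.min(), 1 - 2 * np.log(n) / d)           # norms typically at least 1 - 2 ln(n)/d
print(dots.max(), np.sqrt(6 * np.log(n) / (d - 1))) # pairwise dot products typically below the bound
```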
Gaussian Annulus Theorem: For a \(d\)-dimensional spherical Gaussian with unit variance in each direction, for any \(\beta\le\sqrt{d}\), all but at most \(3e^{-c\beta^2}\) of the probability mass lies within the annulus \(\sqrt{d}-\beta\le|x|\le\sqrt{d}+\beta\), where \(c\) is a fixed positive constant.
Lemma: Let \(X_1,X_2,\cdots,X_n\) be independent \(\sigma\)-subgaussian random variables (\(\mathbb{E}(e^{\lambda(X_i-\mu_i)})\le e^{\lambda^2\sigma^2/2}\)). Let \(S=\sum X_i\), then for any \(t>0\): \(\mathbb{P}(|S-\mathbb{E}(S)|\ge t)\le 2\exp(-\dfrac{t^2}{2n\sigma^2})\).
Proof: \(\mathbb{E}(e^{\lambda(S-n\mu)})=\prod\limits_{i}\mathbb{E}(e^{\lambda(X_i-\mu)})\le e^{n\sigma^2\lambda^2/2}\). So by Chernoff bound, \(\mathbb{P}(S-n\mu\ge t)\le e^{-\lambda t}\mathbb{E}(e^{\lambda(S-n\mu)})\le e^{-\lambda t+n\sigma^2\lambda^2/2}\). Plug \(\lambda=\dfrac{t}{n\sigma^2}\) into the bound we get \(\mathbb{P}(S-n\mu\ge t)\le\exp(-\dfrac{t^2}{2n\sigma^2})\).
For sub-exponential random variables, we have:
Let \(X_1,X_2,\cdots,X_n\) be i.i.d sub-exponential random variables with parameters \(\nu,b\) (\(\mathbb{E}(e^{\lambda(X_i-\mu_i)})\le e^{\lambda^2\nu^2/2}\) for all \(|\lambda|<\dfrac{1}{b}\)). Let \(S=\sum X_i\), then for any \(t>0\): \(\mathbb{P}(|S-\mathbb{E}(S)|\ge t)\le 2\exp(-\min(\dfrac{t^2}{2n\nu^2},\dfrac{t}{2b}))\).
Since \(X_i^2\) is sub-exponential with parameters \((2,4)\), and \(||x|-\sqrt{d}|\ge\beta\) implies \(|S-d|\ge\beta\sqrt{d}\) (where \(S=\sum X_i^2=|x|^2\)), applying the sub-exponential tail we get \(\mathbb{P}(||x|-\sqrt{d}|\ge\beta)\le\mathbb{P}(|S-d|\ge\beta\sqrt{d})\le 2\exp(-\min(\dfrac{(\beta\sqrt{d})^2}{8d},\dfrac{\beta\sqrt{d}}{8}))\). Since \(\beta\le\sqrt{d}\), \(\dfrac{\beta^2}{8}\le\dfrac{\beta\sqrt{d}}{8}\), so \(\mathbb{P}(||x|-\sqrt{d}|\ge\beta)\le 2e^{-\beta^2/8}\), which gives the Gaussian Annulus Theorem with \(c=\frac{1}{8}\).
Lecture 10
Application of GAT: Random Projection.
Consider random projection \(f:\mathbb{R}^d\to\mathbb{R}^k\), given \(v\), let \(u_1,u_2,\cdots,u_k\) be \(k\) Gaussian vectors in \(\mathbb{R}^d\), then define \(f(v)=(u_1\cdot v,u_2\cdot v,\cdots,u_k\cdot v)\).
Then \(f\) preserves norms under scaling: Let \(v\) be a fixed vector, then there exists \(c>0\) such that \(\forall\epsilon\in(0,1)\), \(\text{Pr}(||f(v)|-\sqrt{k}|v||\ge\epsilon\sqrt{k}|v|)\le 3e^{-ck\epsilon^2}\).
Proof:
Assume that \(|v|=1\) and recall that each \(u_i\) has i.i.d. \(N(0,1)\) coordinates; then \(u_i\cdot v\) is Gaussian with \(\text{Var}(u_i\cdot v)=\text{Var}(\sum\limits_{j=1}^du_{i,j}v_j)=\sum\limits_{j=1}^dv_j^2\text{Var}(u_{i,j})=1\), so \(f(v)\) is a \(k\)-dimensional spherical Gaussian with unit variance in each direction. So by the Gaussian Annulus Theorem (with \(\beta=\epsilon\sqrt{k}\)), \(\text{Pr}(||f(v)|-\sqrt{k}|v||\ge\epsilon\sqrt{k}|v|)\le 3e^{-ck\epsilon^2}\).
\(f\) preserves pairwise distance as well. (JL Lemma) For any \(0<\epsilon<1\) and any integer \(n\), let \(k\ge\dfrac{3}{c\epsilon^2}\ln n\) with \(c\) as in Gaussian Annulus Theorem. For any set of \(n\) points in \(\mathbb{R}^d\), the random projection \(\mathbb{R}^d\to\mathbb{R}^k\) defined above has the property that for all pairs of points \(v_i\) and \(v_j\), with probability at least \(1-\dfrac{3}{2n}\), \((1-\epsilon)\sqrt{k}|v_i-v_j|\le|f(v_i)-f(v_j)|\le(1+\epsilon)\sqrt{k}|v_i-v_j|\).
The compressed dimension is \(d\)-independent!!!
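A quick numerical illustration of the random projection and the JL property (my own sketch; the sizes and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 10000, 500, 30

points = rng.standard_normal((n, d))   # n arbitrary points in R^d
U = rng.standard_normal((k, d))        # rows are the Gaussian vectors u_1, ..., u_k
projected = points @ U.T               # f(v) = (u_1 . v, ..., u_k . v) for every point at once

ratios = []
for i in range(n):
    for j in range(i + 1, n):
        orig = np.linalg.norm(points[i] - points[j])
        proj = np.linalg.norm(projected[i] - projected[j])
        ratios.append(proj / (np.sqrt(k) * orig))
print(min(ratios), max(ratios))   # both close to 1: distances are preserved up to the sqrt(k) factor
```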
Singular value decomposition
For a matrix \(n\times m\) matrix \(M\) with rank \(r\):
- There exist an \(n\times r\) matrix \(U\) and an \(r\times m\) matrix \(V\) such that \(M=UV\).
- \(MM^T\) has \(n-r\) eigenvectors with eigenvalue \(0\).
Background: given \(n\) data points in a \(d\)-dimensional space, find a line such that the sum of the squares of the distances from all data points to the line is minimal.
Let \(v\) be the unit vector along the line (through the origin), then the square of the length of the projection of point \(x\) onto the line is \(\lang x,v\rang^2\). By the Pythagorean theorem, the square of the distance of point \(x\) to the line is \(|x|^2-\lang x,v\rang^2\). Since \(\sum|x|^2\) is a fixed value, if \(A\) is the \(n\times d\) matrix whose \(i\)-th row is the \(i\)-th data point, then the optimal \(v\) is \(\arg\max\limits_{|v|=1}|Av|\).
Let \(v_1=\arg\max\limits_{|v|=1}|Av|\), \(v_2=\arg\max\limits_{|v|=1,v\perp v_1}|Av|,v_3=\arg\max\limits_{|v|=1,v\perp v_1,v\perp v_2}|Av|\) and so on, and let \(\sigma_i=|Av_i|\). We call \(v_1,v_2,\cdots,v_r\) the right singular vectors and \(\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_r\) the singular values, and define the left singular vectors \(u_i=\dfrac{1}{\sigma_i}Av_i\); then for different \(i,j\), \(u_i\) and \(u_j\) are orthogonal.
Proof: Assume that for some \(i<j\), \(u_i^Tu_j=\delta>0\) (if \(\delta<0\), replace \(v_j\) by \(-v_j\)). Let \(v'_i=\dfrac{v_i+\epsilon v_j}{|v_i+\epsilon v_j|}\), then \(|Av'_i|\ge u_i^TAv'_i=u_i^T(\dfrac{\sigma_i u_i+\epsilon\sigma_j u_j}{\sqrt{1+\epsilon^2}})\ge(\sigma_i+\epsilon\sigma_j\delta)(1-\dfrac{\epsilon^2}{2})=\sigma_i-\dfrac{\epsilon^2}{2}\sigma_i+\epsilon\sigma_j\delta-\dfrac{\epsilon^3}{2}\sigma_j\delta\), which is \(>\sigma_i\) when \(\epsilon\) is small enough. Since \(v'_i\) is orthogonal to all \(v_k\) with \(k<i\), this contradicts the maximality of \(v_i\). So \(u_i^Tu_j=0\) for all \(i<j\).
Let \(U\) be the \(n\times r\) matrix such that the \(i\)-th column vector is \(u_i\), \(D\) be the \(r\times r\) diagonal matrix with \(D_{i,i}=\sigma_i\), \(V^T\) be the \(r\times d\) matrix such that the \(i\)-th row vector is \(v_i\), then \(A=UDV^T\). In other words, \(A=\sum\limits_{i=1}^r\sigma_iu_iv_i^T\).
Proof: \(\forall j\le r\), \((\sum\limits_{i=1}^r\sigma_iu_iv_i^T)v_j=\sigma_ju_j=Av_j\), and both sides map any vector orthogonal to \(v_1,\cdots,v_r\) to \(0\) (otherwise there would be an \((r+1)\)-st singular vector with \(\sigma_{r+1}>0\)), so \(A=\sum\limits_{i=1}^r\sigma_iu_iv_i^T=UDV^T\).
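A quick numerical check of the decomposition with NumPy (my own sketch; `np.linalg.svd` is used as the reference):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
print(np.allclose(A, U @ np.diag(s) @ Vt))                                           # True
print(np.allclose(A, sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))))   # A = sum sigma_i u_i v_i^T
print(np.allclose(np.linalg.norm(A, 'fro') ** 2, np.sum(s ** 2)))                    # |A|_F^2 = sum sigma_i^2
```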
Low rank approximation.
What if we project the data points onto a \(k\)-dimensional space?
Let \(A\) be an \(n\times d\) matrix with singular vectors \(v_1,v_2,\cdots,v_r\), for \(1\le k\le r\), let \(V_k\) be the subspace spanned by \(v_1,v_2,\cdots,v_k\). For each \(k\), \(V_k\) is the best-fit \(k\)-dimensional subspace for \(A\).
Define the Frobenius norm of matrix \(A\) as \(|A|_F=\sqrt{\sum\limits_{i=1}^n\sum\limits_{j=1}^dA_{i,j}^2}\), then \(|A|_F^2=\sum\limits_{i=1}^r\sigma_i^2\).
Proof: \(\sum\limits_{i=1}^n|a_i|^2=\sum\limits_{i=1}^n\sum\limits_{j=1}^r(a_i\cdot v_j)^2=\sum\limits_{j=1}^r\sum\limits_{i=1}^n(a_i\cdot v_j)^2=\sum\limits_{j=1}^r|Av_j|^2=\sum\limits_{j=1}^r\sigma_j^2\), where \(a_i\) is the \(i\)-th row of \(A\) (which lies in the span of \(v_1,\cdots,v_r\)).
Low rank approximation for the Frobenius norm: given matrix \(A\), find a matrix \(B\) with \(\text{rank}(B)\le k\) such that \(|A-B|_F\) is minimal. Here \(A_k=\sum\limits_{i=1}^k\sigma_iu_iv_i^T\) denotes the rank-\(k\) truncation of the SVD.
Analysis: For any \(B\) of rank at most \(k\), \(|A-A_k|_F\le|A-B|_F\). So the minimum value of \(|A-B|_F^2\) is \(|A-A_k|_F^2=\sum\limits_{i=k+1}^r\sigma_i^2\).
Proof (sketch): extend \(U\) and \(V\) to square orthogonal matrices; since multiplying by orthogonal matrices preserves the Frobenius norm, \(|A-B|_F=|U^T(A-B)V|_F=|D-U^TBV|_F\). Let \(C=U^TBV\); since \(\text{rank}(B)\le k\), \(\text{rank}(C)\le k\). Since \(D\) is diagonal, the best rank-\(k\) approximation of \(D\) is \(C^*=\text{diag}(\sigma_1,\sigma_2,\cdots,\sigma_k,0,\cdots,0)\), which corresponds to \(B^*=A_k\).
Define the L2-norm of matrix \(A\) as \(|A|_2=\max\limits_{|x|\le 1}|Ax|\).
Low rank approximation for L2-norm:
Lemma 3.8. \(|A-A_k|_2^2=\sigma^2_{k+1}\).
Proof: consider an arbitrary vector \(v=\sum\limits_{j=1}^rc_jv_j\), then
\[\begin{aligned} |(A-A_k)v|&=|\sum\limits_{i=k+1}^r\sigma_iu_iv_i^T\sum\limits_{j=1}^rc_jv_j|\\&=|\sum\limits_{i=k+1}^rc_i\sigma_iu_iv_i^Tv_i|\\ &=|\sum\limits_{i=k+1}^rc_i\sigma_iu_i|\\ &=\sqrt{\sum\limits_{i=k+1}^rc_i^2\sigma_i^2} \end{aligned} \]Since \(|v|\le 1\), \(\sum\limits_{i=k+1}^{r}c_i^2\le 1\), so the maximum value of \(\sqrt{\sum\limits_{i=k+1}^rc_i^2\sigma_i^2}\) is \(\sigma_{k+1}\).
Theorem 3.9. Let \(A\) be an \(n\times d\) matrix. For any matrix \(B\) of rank at most \(k\), \(|A-A_k|_2\le|A-B|_2\).
Proof: Since \(\dim\text{Null}(B)\ge d-k\) and \(\dim\text{Span}(v_1,v_2,\cdots,v_{k+1})=k+1\), we can find \(|z|=1\) in \(\text{Null}(B)\cap\text{Span}(v_1,v_2,\cdots,v_{k+1})\), then
\[\begin{aligned} |Az|^2&=|\sum\limits_{i=1}^r\sigma_iu_iv_i^Tz|^2\\&=\sum\limits_{i=1}^r\sigma_i^2(v_i^Tz)^2\\&=\sum\limits_{i=1}^{k+1}\sigma_i^2(v_i^Tz)^2\\&\ge\sigma_{k+1}^2\sum\limits_{i=1}^{k+1}(v_i^Tz)^2\\&=\sigma_{k+1}^2 \end{aligned} \]Since \(Bz=0\), \(|A-B|_2^2\ge|(A-B)z|^2=|Az|^2\ge\sigma_{k+1}^2=|A-A_k|_2^2\), so \(B\) is no better than \(A_k\).
Lecture 12
Power methods for computing SVD:
Let \(B=A^TA=\sum\limits_{i}\sigma_i^2v_iv_i^T\), and raise it to power \(k\): \(B^k=\sum\limits_{i}\sigma_i^{2k}v_iv_i^T\).
Then, if \(x=\sum\limits_{i}c_iv_i\) with \(c_1\ne 0\) and \(\sigma_1>\sigma_2\), the \(\sigma_1^{2k}\) term grows exponentially faster than the others, so when \(k\) is large enough, \(B^kx\approx\sigma_1^{2k}c_1v_1\), and normalizing \(B^kx\) recovers \(v_1\).
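A minimal power-iteration sketch (my own; the matrix and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 30))
B = A.T @ A                         # B = sum_i sigma_i^2 v_i v_i^T

x = rng.standard_normal(30)         # random start: c_1 != 0 with probability 1
for _ in range(100):
    x = B @ x
    x /= np.linalg.norm(x)          # keep the iterate normalized

sigma1 = np.linalg.norm(A @ x)      # |A v_1| = sigma_1
print(sigma1, np.linalg.svd(A, compute_uv=False)[0])   # the two agree (approximately)
```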
PageRank / hubs-and-authorities: \(u\): hub score, \(v\): authority score. \(A_{i,j}=1\) means there is a hyperlink from hub \(i\) to authority \(j\).
Then \(v_j\propto\sum\limits_{i=1}^du_iA_{i,j}\).
Lemma 10. \(Av_i=\sigma_iu_i,A^Tu_i=\sigma_iv_i\).
Solving for \(u,v\) (the top singular vectors) gives the fixed-point iteration \(u=\dfrac{Av}{|Av|}\), \(v=\dfrac{A^Tu}{|A^Tu|}\).
Community detection: there are two groups in a graph; the probability of an edge between two nodes in the same group is \(p\), and between different groups it is \(q\). Classify the nodes.
Spectral clustering algorithm: compute the second eigenvector \(U_2\) of \(A\) and cluster nodes based on the sign of the entries of \(U_2\). Total error count: \(\text{err}\le 2\dfrac{(p\lor q)\log n}{(p-q)^2n}\), where \(p\lor q=\max(p,q)\).
Lecture 13
A Markov chain consists of:
- State space: \(S=\{1,2,3,\cdots,m\}\)
- A distribution over states: \(\mathbf{p}(t)=(p_1(t),p_2(t),\cdots,p_m(t))\in[0,1]^m\) such that \(\sum\limits_{i=1}^mp_i(t)=1\).
- State transition: \(P=[P_{i,j}]_{i,j=1}^m\in[0,1]^{m\times m}\) such that \(\sum\limits_{j=1}^mP_{i,j}=1\).
State evolution: \(\mathbf{p}(t+1)=\mathbf{p}(t)P\).
2D random walk. A drunk man walking in Manhattan, can the drunk man find his way home?
Analysis: Define some notations:
- First return time (starting from \(X_0=i\)): \(T_i=\inf\{n\ge 1:X_n=i\}\)
- Return probability: \(f_i=\text{Pr}(T_i<\infty)\).
- Number of visits (including the visit at time \(0\)): \(N_i=\sum\limits_{n=0}^{\infty}[X_n=i]\)
- Recurrent state: a state \(i\) is recurrent if \(f_i=1\).
Lemma: \(\mathbb{E}(N_i)=\dfrac{1}{1-f_i}\). Proof: By the Markov property, \(N_i=1+1_{T_i<\infty}N^*_i\), where \(N^*_i\) is an independent copy of \(N_i\), so \(\mathbb{E}(N_i)=1+\text{Pr}(T_i<\infty)\mathbb{E}(N_i)\).
For the 1D version, we have \(\text{Pr}[X_{2n}=0]=\dbinom{2n}{n}4^{-n}\); using the Stirling approximation \(\dbinom{2m}{m}\sim\dfrac{4^m}{\sqrt{\pi m}}\), we have \(\text{Pr}[X_{2n}=0]=\dfrac{1}{\sqrt{\pi n}}(1+O(n^{-1}))\). So for the 2D version, \(p_{2n}=\text{Pr}[X_{2n}=0]^2=\dfrac{1}{\pi n}(1+O(n^{-1}))\). The expected number of visits to \((0,0)\) is \(\sum\limits_{n=0}^{\infty}p_n=\infty\), so by the lemma \((0,0)\) is recurrent.
“A Drunk Man Will Find His Way Home but a Drunk Bird May Get Lost Forever”: \(f_{(0,0,0)}<1\) in 3D random walk.
Definition: A Markov chain is connected (irreducible) if for every pair of states \(i,j\), the probability of eventually reaching \(j\) from \(i\) is non-zero.
Lemma 4.1: Let \(P\) be the transition matrix of a connected Markov chain. The \(n\times(n+1)\) matrix \(A=\begin{bmatrix}P-I&\bold{1}\end{bmatrix}\) obtained by augmenting \(P-I\) with an additional column of all ones has rank \(n\).
Proof: Apparently \((1,1,1,\cdots,1,0)\) is in the null space of \(A\). Assume that some other vector \(v=(x,\alpha)\) orthogonal to \((1,1,1,\cdots,1,0)\) is in the null space, then \(Av=0\), so \((P-I)x+\alpha\bold{1}=0\), which means that \(x_i=\sum\limits_{j}p_{i,j}x_j+\alpha\). Since \(v\) is orthogonal to \((1,1,1,\cdots,1,0)\) and non-zero, the \(x_i\) are not all equal. By the connectedness, there exists neighboring \((i,j)\) such that \(x_i>x_j\) and \(x_i\ge\) all neighbors of \(i\), so \(\alpha>0\). Similarly there exists neighboring \((i,j)\) such that \(x_i<x_j\) and \(x_i\le\) all neighbors of \(i\), so \(\alpha<0\). A contradiction, so the null space has dimension \(1\) and \(\text{rank}(A)=n\).
Theorem 4.2: For a connected Markov chain, there is a unique probability vector \(\pi\) satisfying \(\pi P=\pi\). Moreover, let \(a(t)=\dfrac{1}{t}(p(0)+p(1)+\cdots+p(t-1))\), then \(\lim\limits_{t\to\infty}a(t)\) exists and equals \(\pi\).
Lecture 14
Detailed balance equality: For a random walk on a connected Markov chain, if the vector \(\pi\) satisfies \(\pi_xp_{x,y}=\pi_yp_{y,x}\) for all \(x,y\) and \(\sum\limits_{x}\pi_x=1\), then \(\pi\) is the stationary distribution of the Markov chain.
If the stationary distribution \(\pi\) satisfies \(\pi_xp_{x,y}=\pi_yp_{y,x}\), then we call the Markov chain "reversible".
MCMC algorithms: If we want to estimate \(\mathbb{E}(f)=\sum\limits_{x}f(x)\pi(x)\), we design a Markov chain \(P\) such that \(\pi P=\pi\) (i.e., \(\pi\) is its stationary distribution), run the chain to obtain \(x_1,x_2,\cdots,x_t\), and consider \(\dfrac{1}{t}\sum\limits_{i=1}^tf(x_i)\); as \(t\to\infty\), this value converges to \(\mathbb{E}(f)\).
Metropolis-Hastings: given a target distribution \(p\) and a graph on the states with maximum degree at most \(r\), for neighboring \(j\ne i\) let \(P_{i,j}=\min(\dfrac{p_j}{p_i},1)\cdot\dfrac{1}{r}\), and \(P_{i,i}=1-\sum\limits_{j\ne i}P_{i,j}\). Then \(p_iP_{i,j}=\dfrac{\min(p_i,p_j)}{r}=p_jP_{j,i}\), so by detailed balance \(p\) is the stationary distribution of \(P\).
Advantage: The normalizing coefficient \(Z\) in some cases will cancel out so that the ratio is easy to compute.
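A minimal sketch of the Metropolis chain on a path of \(10\) states with \(r=2\) (my own; the target weights are illustrative, and only weight ratios are used, so the normalizer cancels):

```python
import random
from collections import Counter

def metropolis_chain(weight, m, steps, r=2):
    """Walk on the path 0..m-1: propose a neighbor with probability 1/r each, accept with min(1, p_j/p_i)."""
    x, counts = 0, Counter()
    for _ in range(steps):
        j = x + random.choice([-1, 1])
        if 0 <= j < m and random.random() < min(1.0, weight(j) / weight(x)):
            x = j                       # only the ratio of weights is needed, so Z cancels out
        counts[x] += 1
    return counts

weight = lambda i: i + 1                # target pi(i) proportional to i+1 on {0,...,9}
counts = metropolis_chain(weight, 10, 200000)
print(counts[9] / 200000, 10 / 55)      # empirical frequency of state 9 vs. its target probability
```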
Gibbs sampling: repeatedly pick a random coordinate \(i\) and resample \(x_i\) from the conditional distribution \(\mathbb{P}(x_i|\{x_j\}_{j\ne i})\).
Mixing time: For \(\epsilon>0\), the \(\epsilon\)-mixing time of a Markov chain is the minimum integer \(t\) such that for any starting distribution \(p\), the L1-norm difference between the \(t\)-step running average probability distribution and the stationary distribution is \(\le\epsilon\), i.e., \(|a(t)-\pi|\le\epsilon\).
Normalized conductance: For a subset \(S\) of vertices, let \(\pi(S)=\sum\limits_{s\in S}\pi_s\), the normalized conductance of \(S\) is \(\Phi(S)=\dfrac{\sum\limits_{(x,y)\in (S,\bar{S})}\pi_xp_{x,y}}{\min(\pi(S),\pi(\bar{S}))}\), the normalized conductance of the Markov chain is \(\min\limits_{S}\Phi(S)\).
Theorem 4.5: The \(\epsilon\)-mixing time of a random walk on an undirected graph is \(O(\dfrac{\ln(1/\pi_{min})}{\Phi^2\epsilon^3})\) where \(\pi_{min}\) is the minimum stationary probability over all states.
Applications of mixing time:
- Random walk over 1D lattice of size \(n\) with loop: \(\Phi(S)=\Omega(\dfrac{1}{n})\), total mixing time is \(O(n^2\log n/\epsilon^3)\).
- 2D lattice: \(\Phi(S)=\Omega(\dfrac{1}{n})\).
- Clique: \(\Phi(S)=\Omega(1)\), total mixing time is \(O(\log n/\epsilon^3)\).
- A connected graph with \(m\) edges: worst case bound for mixing time is \(O(m^2\ln n/\epsilon^3)\).