【智应数】Markov chains
Markov Chain & Stationary Distribution
Def(Finite Markov Chain). Let \(\Omega\) be a finite set of states and let \(P(x,y)\ge 0\), \(x,y\in\Omega\), be a transition function, i.e., \(\sum\limits_{y\in\Omega} P(x,y)=1\) for every \(x\). A finite Markov chain is a sequence of random variables \(X_0,X_1,X_2,\dots\) over \(\Omega\) such that for all \(t\) and all \(x_0,x_1,\dots,x_t,y\in\Omega\) we have $$\Pr(X_{t+1}=y\mid X_0=x_0,\dots,X_t=x_t)=P(x_t,y).$$
Given the initial distribution \(\mu_0\) over \(\Omega\) (as a row vector) and the transition matrix \(P\), the distribution after \(t\) transitions is $$\mu_t=\mu_{t-1}P=\mu_0P^t.$$
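A minimal numpy sketch of this computation (the 2-state chain below is a made-up example, not from the notes):

```python
import numpy as np

# A made-up 2-state chain: rows index the current state, columns the next state.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])   # start deterministically in state 0

def distribution_after(mu0, P, t):
    """Return mu_t = mu_0 P^t by repeated right-multiplication."""
    mu = mu0.copy()
    for _ in range(t):
        mu = mu @ P
    return mu

print(distribution_after(mu0, P, 50))  # approaches the stationary distribution
```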
Def. A distribution \(\pi\) is called stationary if \(\pi P=\pi\).
Lem(Brouwer's fixed point theorem). Every continuous map \(f\) from a convex, closed, bounded subset of \(\mathbb{R}^n\) to itself has a fixed point, i.e., a point \(x\) with \(f(x)=x\).
Thm(Perron-Frobenius thm). Any Markov chain over finite states has a stationary distribution.
Pf. Let \(|\Omega|=n,\Delta^{n-1}=\{\mu\in\mathbb{R}^n\mid \mu_i\ge 0,\sum\mu_i=1\}\). One can check that \(\Delta^{n-1}\) is convex, closed, and bounded. Since \(\mu\mapsto\mu P\) is a continuous linear map from \(\Delta^{n-1}\) to itself, by Brouwer's fixed point theorem there is a stationary distribution \(\pi\in\Delta^{n-1}\) s.t. \(\pi P=\pi\).
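As an illustration of the theorem (a sketch, reusing the made-up 2-state chain), a stationary distribution can be computed numerically as a left eigenvector of \(P\) for eigenvalue 1:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# pi P = pi  <=>  P^T pi^T = pi^T: take the eigenvector of P^T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()           # normalize to a probability vector
print(pi, pi @ P)            # pi @ P should equal pi
```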
Def. A finite Markov chain is irreducible (connected) if \(\forall x,y\in\Omega,\exists t>0\) s.t. \(P^t(x,y)>0\).
Thm(Fundamental Theorem of Markov Chains). Any connected Markov chain has a unique stationary distribution \(\pi\).
Pf. Suppose \(Ph=h\) for some \(h\in\mathbb{R}^n\). Let \(i_0=\arg\max\limits_i h(i)\) and \(M=h(i_0)\). Since \(h(i_0)=\sum_j P(i_0,j)h(j)\) and every \(h(j)\le M\), each \(j\) with \(P(i_0,j)>0\) must satisfy \(h(j)=M\). Repeating this argument along paths, irreducibility gives \(h(i)=M\) for all \(i\). Thus \(Ph=h\) implies \(h=M\vec{1}\), so the eigenspace of \(P\) for eigenvalue \(1\) is one-dimensional, i.e., \(\text{rank}(P-I)=n-1\), and the stationary distribution is unique.
Def(Reversed Markov Chain). Given a Markov chain \((P,\Omega)\) with a stationary distribution \(\pi\) (assume \(\pi(x)>0\) for all \(x\)), let $$\hat{P}(x,y)=\frac{\pi(y)P(y,x)}{\pi(x)}.$$ Then \((\hat{P},\Omega)\) is the reversed Markov chain.
Def. If \(\hat{P}=P\), then \((P,\Omega)\) is reversible.
Markov Chain Monte Carlo
Given a distribution \(\pi\) and a function \(f\) over \(\Omega\) (usually \(\Omega=\{0,1,\dots,n-1\}^d\)), estimate $$\underset{\pi}{\mathbb{E}}(f)=\sum\limits_{x\in\Omega}\pi(x)f(x).$$
MCMC method:
- Design \(P\) s.t. \(\pi P=\pi\).
- Generate \(X_0,X_1,...,X_T\), where \(X_t\) is sampled from \(X_{t-1}\) according to \(P\).
- Output \(\frac{1}{T}\sum\limits_{t=1}^Tf(X_t)\) as an estimate of \(\underset{\pi}{\mathbb{E}}(f)\).
Why it works: let \(\mu_t\) be the distribution of \(X_t\) and \(\hat{\mu}=\frac{1}{T}\sum\limits_{t=1}^T\mu_t\). If \(\lim\limits_{t\rightarrow\infty}\mu_t=\pi\), then \(\lim\limits_{T\rightarrow\infty}\underset{\hat{\mu}}{\mathbb{E}}(f)=\underset{\pi}{\mathbb{E}}(f)\). Since \(X_t\sim\mu_t\), the estimator \(\frac{1}{T}\sum_t f(X_t)\) has expectation \(\underset{\hat{\mu}}{\mathbb{E}}(f)\), so the estimate converges.
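A generic sketch of the method (the `step` function and the toy lazy walk on \(\{0,\dots,9\}\) are assumptions for illustration):

```python
import numpy as np
rng = np.random.default_rng(0)

def mcmc_estimate(x0, step, f, T):
    """Generic MCMC estimate of E_pi[f]: run the chain and average f along the path.
    `step(x)` is assumed to sample the next state from P(x, .)."""
    x, total = x0, 0.0
    for _ in range(T):
        x = step(x)
        total += f(x)
    return total / T

# Toy usage: lazy random walk on {0,...,9}; pi is uniform, so E_pi[f] = mean of f.
def step(x):
    move = rng.choice([-1, 0, 0, 1])          # stay w.p. 1/2, move +-1 w.p. 1/4 each
    return (x + move) % 10

print(mcmc_estimate(0, step, lambda x: x, T=100_000))   # ~ 4.5
```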
Consider a graph \(G=(\Omega, E)\), where \(E=\{(x,y)\mid P(x,y)>0\}.\)
To keep the cost of each transition small, the degree of the vertices in \(G\) should be small.
To ensure a fast convergence rate, the diameter of \(G\) should be small (or \(G\) should satisfy some similar connectivity property).
For example, if \(\Omega=\{0,1\}^d\), a possible way is to let \(G\) be the hypercube.
Metropolis-Hastings Algorithm
Given a target distribution \(\pi\) (which may be known only up to normalization) and a proposal transition matrix \(\Phi\), the chain proposes \(y\sim\Phi(x,\cdot)\) and accepts with probability \(\min\left\{1,\frac{\pi(y)}{\pi(x)}\right\}\), i.e., $$P(x,y)=\Phi(x,y)\min\left\{1,\frac{\pi(y)}{\pi(x)}\right\}\ (y\neq x),\qquad P(x,x)=1-\sum\limits_{y\neq x}P(x,y);$$ here \(\Phi\) is a symmetric transition matrix.
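A sketch of one Metropolis step under these assumptions (symmetric proposal, possibly unnormalized \(\pi\)); the toy target and ring proposal below are made up for illustration:

```python
import numpy as np
rng = np.random.default_rng(1)

def metropolis_step(x, propose, pi):
    """One Metropolis step: propose y ~ Phi(x, .) with Phi symmetric,
    accept with probability min(1, pi(y)/pi(x)); pi may be unnormalized."""
    y = propose(x)
    if rng.random() < min(1.0, pi(y) / pi(x)):
        return y
    return x

# Toy usage: target pi(x) proportional to x+1 on {0,...,9};
# the proposal moves to a uniformly random ring neighbour (symmetric).
pi = lambda x: x + 1
propose = lambda x: (x + rng.choice([-1, 1])) % 10

x, counts = 0, np.zeros(10)
for _ in range(200_000):
    x = metropolis_step(x, propose, pi)
    counts[x] += 1
print(counts / counts.sum())   # roughly proportional to 1, 2, ..., 10
```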
Lem(Detailed balance equation). For a given transition matrix \(P\), if there exists a distribution \(\pi\) s.t. $$\pi(x)P(x,y)=\pi(y)P(y,x)\quad\forall x,y\in\Omega,$$ then \(\pi\) is a stationary distribution and \(P\) is reversible. In particular, the Metropolis-Hastings chain above satisfies detailed balance with respect to \(\pi\), so \(\pi\) is its stationary distribution.
Ex. Given a graph \(G=(V,E)\), let \(f:\{0,1\}^{|V|}\rightarrow\mathbb{Z}_{\ge 0}\) map a bipartition \(x\) to the size of the corresponding cut. Find \(\max\limits_x f(x).\)
Solution: Define \(\pi_{\lambda}(x)\propto\lambda^{f(x)}\). Apply the MH algorithm with the proposal \(\Phi(x,y)=\frac{1}{|V|}\left[\sum\limits_i[x(i)\neq y(i)]=1\right]\), i.e., flip one uniformly random coordinate. Gradually increase \(\lambda\) from \(1\) toward \(\infty\) (simulated annealing).
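A hedged sketch of this simulated-annealing scheme (the tiny graph and the linear \(\lambda\) schedule are arbitrary choices for illustration):

```python
import numpy as np
rng = np.random.default_rng(2)

def cut_size(x, edges):
    return sum(1 for (u, v) in edges if x[u] != x[v])

def simulated_annealing_maxcut(n, edges, T=20_000):
    """MH on pi_lambda(x) ~ lambda^{cut(x)} with single-bit-flip proposals,
    slowly increasing lambda (the schedule below is an arbitrary choice)."""
    x = rng.integers(0, 2, size=n)
    best, best_val = x.copy(), cut_size(x, edges)
    for t in range(T):
        lam = 1.0 + 5.0 * t / T                     # lambda grows from 1 to ~6
        i = rng.integers(n)
        y = x.copy(); y[i] ^= 1                     # flip one random coordinate
        # acceptance ratio pi(y)/pi(x) = lam^(cut(y) - cut(x))
        if rng.random() < min(1.0, lam ** (cut_size(y, edges) - cut_size(x, edges))):
            x = y
        if cut_size(x, edges) > best_val:
            best, best_val = x.copy(), cut_size(x, edges)
    return best, best_val

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]    # small made-up graph
print(simulated_annealing_maxcut(4, edges))
```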
Gibbs Sampling
Intuitively, choose a coordinate \(x_i\) uniformly at random and resample it according to the conditional distribution \(\pi(\cdot\mid x_{-i})\) (random scan). An alternative scheme is to choose \(x_i\) by sequentially scanning from \(x_1\) to \(x_d\) (systematic scan).
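A sketch of a random-scan Gibbs step for binary vectors, assuming \(\pi\) is given in unnormalized form (the toy target is made up):

```python
import numpy as np
rng = np.random.default_rng(3)

def gibbs_step(x, pi):
    """One random-scan Gibbs step for a binary vector x:
    pick a coordinate i uniformly and resample x_i from pi(. | x_{-i});
    pi may be unnormalized."""
    i = rng.integers(len(x))
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0, 1
    p1 = pi(x1) / (pi(x0) + pi(x1))   # conditional probability of x_i = 1
    x[i] = 1 if rng.random() < p1 else 0
    return x

# Toy usage: pi(x) proportional to 2^{sum(x)} over {0,1}^4.
pi = lambda x: 2.0 ** x.sum()
x = np.zeros(4, dtype=int)
for _ in range(10_000):
    x = gibbs_step(x, pi)
print(x)
```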
Mixing time
Def. The total variation distance between two distributions \(\mu,\nu\) over \(\Omega\) is $$d_{TV}(\mu,\nu)=\Vert\mu-\nu\Vert_{TV}=\max\limits_{A\subseteq\Omega}|\mu(A)-\nu(A)|.$$
Prop. \(d_{TV}(\mu,\nu)=\frac{1}{2}\sum\limits_{x\in\Omega}|\mu_x-\nu_x|.\)
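The proposition gives a direct way to compute the distance (a small illustrative sketch):

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance via the proposition: half the L1 distance."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

print(tv_distance([0.5, 0.5, 0.0], [1/3, 1/3, 1/3]))   # = 1/3
```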
Let \(d(t)=\max\limits_{x\in\Omega}\Vert P^t(x,\cdot)-\pi\Vert_{TV},\overline{d}(t)=\max\limits_{x,y\in\Omega}\Vert P^t(x,\cdot)-P^t(y,\cdot)\Vert_{TV}.\)
Lem 1. \(d(t)\le\overline{d}(t)\le 2d(t)\).
Lem 2. \(\overline{d}(s+t)\le\overline{d}(s)\overline{d}(t).\)
Pf. $$P^{s+t}(x,w)=\sum\limits_{z\in\Omega}P^s(x,z)P^t(z,w)=\mathbb{E}_{X_s}\left(P^t(X_s,w)\right),$$ where \(X_s\sim P^s(x,\cdot)\). Coupling \(X_s\sim P^s(x,\cdot)\) and \(Y_s\sim P^s(y,\cdot)\) optimally so that \(\Pr(X_s\neq Y_s)\le\overline{d}(s)\), and noting that \(\Vert P^t(X_s,\cdot)-P^t(Y_s,\cdot)\Vert_{TV}\le\overline{d}(t)\) when \(X_s\neq Y_s\) (and \(=0\) otherwise), gives \(\overline{d}(s+t)\le\overline{d}(s)\overline{d}(t)\).
Cor. For any positive integer \(c\), $$d(ct)\le\overline{d}(ct)\le\overline{d}(t)^c\le\left(2d(t)\right)^c.$$
Def. The mixing time of a Markov chain is $$t_{mix}(\varepsilon)=\min\{t:d(t)\le\varepsilon\},$$ and the average (Cesàro) mixing time is $$t_{ave}(\varepsilon)=\min\left\{t:\max\limits_{x\in\Omega}\left\Vert\frac{1}{t}\sum\limits_{s=1}^{t}P^s(x,\cdot)-\pi\right\Vert_{TV}\le\varepsilon\right\}.$$
Remark: \(t_{ave}\) is finite even without the assumption that the chain is aperiodic. For aperiodic chains, \(t_{mix}(\varepsilon)<t_{ave}(\varepsilon)\).
Random Walks on Undirected Graphs
Given an edge-weighted undirected graph, let \(w_{x,y}\) denote the weight of the edge between nodes \(x\) and \(y\). Let \(w_x=\sum\limits_y w_{x,y}\).
Let \(\Omega =V,P_{x,y}=\frac{w_{x,y}}{w_x}\). One can check that \((\Omega,P)\) is reversible and \(\pi(x)\propto w_x\) is the stationary distribution.
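A quick numerical check of these claims on a made-up weighted graph (a sketch, not part of the notes):

```python
import numpy as np

# Made-up symmetric weight matrix (W[x, y] = w_{x,y}, zero diagonal).
W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
w = W.sum(axis=1)                 # w_x
P = W / w[:, None]                # P[x, y] = w_{x,y} / w_x
pi = w / w.sum()                  # claimed stationary distribution

print(np.allclose(pi @ P, pi))                            # stationarity
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))  # detailed balance (reversibility)
```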
Thm. If the Markov chain is reversible, finite, and aperiodic, then for \(\pi_*=\min\limits_{x\in\Omega}\pi(x)>0\), $$t_{mix}(\varepsilon)=O\!\left(\frac{1}{\delta}\ln\frac{1}{\pi_*\varepsilon}\right),$$
where \(\delta=\lambda_1-\lambda_2\) is the eigen gap of \(P\).
Takeaway: \(\frac{1}{\delta}\) (the relaxation time) is governed by the "conductance" of the chain, made precise below.
Def 4.2. For a subset \(S\) of vertices, the normalized conductance of \(S\) is $$\phi(S)=\frac{\sum\limits_{x\in S,\,y\notin S}\pi(x)P(x,y)}{\min\left(\pi(S),\pi(\bar S)\right)}.$$
Def 4.3. The normalized conductance of a Markov chain is $$\phi=\min\limits_{\emptyset\neq S\subsetneq\Omega}\phi(S).$$
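A sketch computing \(\phi(S)\) directly from the definitions above (the lazy walk on an 8-state ring is an arbitrary test case):

```python
import numpy as np

def conductance(P, pi, S):
    """Normalized conductance phi(S): probability flow pi(x) P(x,y) across the cut,
    divided by min(pi(S), pi(S complement))."""
    n = len(pi)
    S = set(S)
    Sbar = [y for y in range(n) if y not in S]
    flow = sum(pi[x] * P[x, y] for x in S for y in Sbar)
    return flow / min(pi[list(S)].sum(), pi[Sbar].sum())

# Lazy walk on a ring of 8 states; S = one half of the ring.
n = 8
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x - 1) % n] = 0.25
    P[x, (x + 1) % n] = 0.25
pi = np.full(n, 1 / n)
print(conductance(P, pi, range(n // 2)))   # flow 2*(1/8)*(1/4)=1/16, so phi(S)=1/8
```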
Thm(Cheeger's Inequality). Let \(\delta\) be the eigen gap of \(P\); then $$\frac{\phi^2}{2}\le\delta\le 2\phi.$$
Combined with the previous theorem, this yields a conductance-based bound on the mixing time:
Thm 4.5.$$t_{ave}(\varepsilon)=O\left(\frac{\ln(1/\pi_* )}{\phi^2 \varepsilon^3 }\right).$$
Ex(1-D lattice). Let \(G\) be the ring with \(n\) vertices, i.e., \(i\) is adjacent to \(i-1\) and \(i+1\) (mod \(n\)), and \(P_{x,y}=\frac{1}{2}\) for all edges.
Since this chain is periodic (for even \(n\)), we can make the process aperiodic by being "lazy": \(P'(x,y)=\begin{cases} \frac{1}{2}P(x,y) & x\neq y\\ \frac{1}{2} & x =y. \end{cases} \) For constant \(\varepsilon\), \(t_{mix}\) of the lazy process is at least \(c\cdot t_{ave}\) of the original process.
Let \(S_*\) be half of the ring; then \(\phi=\phi(S_*)=\frac{2}{n}\). Thus \(t_{ave}=O(n^2)\).
Ex(2-D lattice). Let \(G\) be the \(n\times n\) cyclic grid (torus), with \(P_{x,y}=\frac{1}{4}\) for all edges.
\(\phi(S)=\frac{\#(\text{crossing edges})/(4n^2)}{|S|/n^2}\ge \Omega\!\left(\frac{\sqrt{|S|}/(4n^2)}{|S|/n^2}\right)\ge\Omega\!\left(\frac{1}{n}\right)\) for \(|S|\le n^2/2\). Thus \(t_{ave}=O(n^2)\).
Remark: relative to its number of states, the 2-D lattice mixes faster than the 1-D lattice because it is better connected.
Ex(clique). \(t_{mix}=O(1)\).
Coupling
Def. A coupling of a Markov chain with transition matrix \(P\) is a pair of processes \((X_0,\dots,X_T)\) and \((Y_0,\dots,Y_T)\) on a common probability space s.t.
- \(X_{t+1}\sim P(X_t,\cdot)\) and \(Y_{t+1}\sim P(Y_t,\cdot)\);
- if \(X_s=Y_s\), then \(X_t=Y_t\) for all \(t\ge s\).
Thm. Assume \(X_0=x,Y_0=y\). Let \(\tau_{couple}=\min\{t:X_t=Y_t\}\); then $$\Vert P^t(x,\cdot)-P^t(y,\cdot)\Vert_{TV}\le\mathbb{P}_{(x,y)}(\tau_{couple}>t).$$
Remark: by Lem 1, \(d(t)\le\overline{d}(t)\le \max\limits_{x,y\in\Omega}\mathbb{P}_{(x,y)}(\tau_{couple}>t).\)
Ex(1-D lattice). \(P(x,y)=\begin{cases} \frac{1}{2} & y=x\\ \frac{1}{4} & |y-x|=1 \end{cases}.\)
The coupling constructed: w.p. \(\frac{1}{2}\), \(X_t\) moves (to a uniformly random neighbor) and \(Y_t\) stays; w.p. \(\frac{1}{2}\), \(Y_t\) moves and \(X_t\) stays; once the chains meet, they move together. The distance between \(X_t\) and \(Y_t\) then performs a simple random walk, so \(\mathbb{E}(\tau_{couple})=k(n-k)\), where \(k=|x-y|\). By Markov's inequality, \(d(t)\le \frac{n^2}{4t}\).
Ex(hypercube). \(\Omega=\{0,1\}^n\), \(|\Omega|=2^n\), \(P(x,y)=\begin{cases} \frac{1}{2} & y=x\\ \frac{1}{2n} & \Vert y-x\Vert_1=1 \end{cases}.\)
The coupling constructed: pick a random coordinate \(i\in[n]\) and a random bit \(b_t\in \{0,1\}\), and set \(X_{t+1}^{(i)}=Y_{t+1}^{(i)}=b_t\) (all other coordinates stay unchanged in both chains). Then \(\mathbb{E}(\tau_{couple})=\mathbb{E}(\text{time until every coordinate has been selected at least once})=nH_n\approx n\ln n\) by the coupon collector argument.
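A simulation sketch of this coupling-time argument (the parameters are arbitrary):

```python
import numpy as np
rng = np.random.default_rng(4)

def coupling_time_hypercube(n):
    """Simulate the coupling: each step picks a uniform coordinate and sets it to the
    same random bit in both chains, so the chains agree once every coordinate has
    been picked at least once (coupon collector)."""
    seen = np.zeros(n, dtype=bool)
    t = 0
    while not seen.all():
        seen[rng.integers(n)] = True
        t += 1
    return t

n = 20
samples = [coupling_time_hypercube(n) for _ in range(5_000)]
print(np.mean(samples), n * sum(1 / k for k in range(1, n + 1)))  # empirical vs n*H_n
```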