【智应数】Markov chains

Markov Chain & Stationary Distribution

Def(Finite Markov Chain). Let \(\Omega\) be a finite set of states and let \(P\) be a transition function, i.e., \(P(x,y)\ge 0\) for all \(x,y\in\Omega\) and \(\sum\limits_{y\in\Omega} P(x,y)=1\) for every \(x\in\Omega\). A finite Markov chain is a sequence of random variables

\[(X_0,X_1,X_2,...) \]

if \(\forall t\), for any \(x_0,x_1,...,x_t\in\Omega\) we have

\[\mathbb{P}(X_{t+1}=x_{t+1}\mid X_0=x_0,X_1=x_1,...,X_t=x_t)=\mathbb{P}(X_{t+1}=x_{t+1}\mid X_t=x_t)=P(x_t,x_{t+1}). \]

Given the initial distribution \(\mu_0\) over \(\Omega\) and the transition matrix \(P\), we can calculate the distribution after \(t\) transitions

\[\mu_t=\mu_{t-1}P=\mu_0P^t. \]
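As a quick sanity check, here is a minimal NumPy sketch of this computation; the 3-state matrix \(P\) below is a made-up example, not from the notes.

```python
import numpy as np

# A made-up 3-state transition matrix (each row sums to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])   # initial distribution mu_0
for _ in range(100):             # mu_t = mu_{t-1} P = mu_0 P^t
    mu = mu @ P

# A stationary distribution is a left eigenvector of P with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

print(mu)   # after many steps, mu_t is numerically close to pi
print(pi)
```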

Def. A distribution \(\pi\) is called stationary if \(\pi P=\pi\).

Lem(Brouwer's fixed point theorem). Every continuous map \(f\) from a nonempty convex, closed, bounded subset of \(\mathbb{R}^n\) to itself has a fixed point \(x\) with \(f(x)=x\).

Thm(Perron-Frobenius thm). Any Markov chain over finitely many states has a stationary distribution.

Pf. Let \(|\Omega|=n\) and \(\Delta^{n-1}=\{\mu\in\mathbb{R}^n\mid \mu_i\ge 0,\sum_i\mu_i=1\}\). The set \(\Delta^{n-1}\) is convex, closed, and bounded. The map \(\mu\mapsto\mu P\) is continuous (it is linear) and sends \(\Delta^{n-1}\) to itself, since the rows of \(P\) sum to \(1\). By Brouwer's fixed point theorem there is a \(\pi\in\Delta^{n-1}\) with \(\pi P=\pi\), i.e., a stationary distribution.

Def. A finite Markov chain is irreducible (connected) if \(\forall x,y\in\Omega,\exists t>0\) s.t. \(P^t(x,y)>0\).

Thm(Fundamental Theorem of Markov Chains). Any connected Markov chain has a unique stationary distribution \(\pi\).

Pf. Suppose \(Ph=h\) for some \(h\in\mathbb{R}^n\), and let \(i_0=\arg\max\limits_i h(i)\) with \(M=h(i_0)\). Since \(h(i_0)=\sum_j P(i_0,j)h(j)\) is a convex combination of values at most \(M\), we must have \(h(j)=M\) for every \(j\) with \(P(i_0,j)>0\). Propagating this along paths, irreducibility gives \(h(i)=M\) for all \(i\). Thus \(Ph=h\) implies \(h=M\vec{1}\), so \(\text{rank}(P-I)=n-1\) and the space of row vectors \(\pi\) with \(\pi(P-I)=0\) is one-dimensional; combined with existence, the stationary distribution is unique.

Def(Reversed Markov Chain). Given a Markov chain \((P,\Omega)\) with stationary distribution \(\pi\), let $$\hat{P}(x,y)=\frac{\pi(y)P(y,x)}{\pi(x)}.$$ Then \((\hat{P},\Omega)\) is the reversed Markov chain.

Def. If \(\hat{P}=P\), then \((P,\Omega)\) is reversible.

Markov Chain Monte Carlo

Given a distribution \(\pi\) and a function \(f\) over \(\Omega\) (usually \(\Omega=\{0,1...,n-1\}^d\)), estimate

\[\underset{\pi}{\mathbb{E}}(f)=\sum\limits_{x\in\Omega}f(x)\pi(x). \]

MCMC method:

  • Design \(P\) s.t. \(\pi P=\pi\).
  • Generate \(X_0,X_1,...,X_T\), where \(X_t\) is sampled from \(X_{t-1}\) according to \(P\).
  • Calculate \(\frac{1}{T}\sum\limits_{t=1}^Tf(X_t)\) as an estimate of \(\underset{\pi}{\mathbb{E}}(f)\).

Let \(\hat{\mu}=\frac{1}{T}\sum\limits_{t=1}^T\mu_t\). If \(\lim\limits_{t\rightarrow\infty}\mu_t=\pi\), then \(\lim\limits_{T\rightarrow\infty}\underset{\hat{\mu}}{\mathbb{E}}(f)=\underset{\pi}{\mathbb{E}}(f)\). Since \(X_t\sim \mu_t\), the time average \(\frac{1}{T}\sum_{t=1}^T f(X_t)\) has expectation \(\underset{\hat{\mu}}{\mathbb{E}}(f)\), so the estimator works.
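In code, the whole recipe is just an average along one sampled trajectory. Below is a minimal sketch, assuming a user-supplied `step` function that simulates one transition of a chain with stationary distribution \(\pi\) (the names `step` and `burn_in` are illustrative, not from the notes).

```python
def mcmc_estimate(x0, step, f, T, burn_in=0):
    """Estimate E_pi[f] by averaging f along one trajectory of the chain.

    step(x) is assumed to sample X_t from X_{t-1} = x according to P,
    where P has stationary distribution pi.
    """
    x, total = x0, 0.0
    for t in range(burn_in + T):
        x = step(x)
        if t >= burn_in:
            total += f(x)
    return total / T
```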

Consider a graph \(G=(\Omega, E)\), where \(E=\{(x,y)\mid P(x,y)>0\}.\)

To keep each transition cheap, the degree of the vertices in \(G\) should be small.

To get a fast convergence rate, the diameter of \(G\) should be small (or \(G\) should satisfy some other connectivity property).

For example, if \(\Omega=\{0,1\}^d\), a possible way is to let \(G\) be the hypercube.

Metropolis-Hastings Algorithm

\[P(x,y)= \begin{cases} \Phi(x,y)(1\wedge\frac{\pi(y)}{\pi(x)}) & x\neq y\\ 1-\sum\limits_{z\neq x}\Phi(x,z)(1\wedge\frac{\pi(z)}{\pi(x)}) & x=y \end{cases}\]

here \(\Phi\) is a symmetric proposal (transition) matrix, i.e., \(\Phi(x,y)=\Phi(y,x)\), and \(a\wedge b\) denotes \(\min(a,b)\).
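A minimal sketch of one Metropolis-Hastings transition, assuming a sampler `propose` for the symmetric proposal \(\Phi\) and an unnormalized density `pi` (both hypothetical names; \(\pi\) only needs to be known up to a constant, since only the ratio \(\pi(y)/\pi(x)\) appears):

```python
import random

def mh_step(x, propose, pi):
    """One Metropolis-Hastings step with a symmetric proposal."""
    y = propose(x)                            # sample y ~ Phi(x, .)
    if random.random() < min(1.0, pi(y) / pi(x)):
        return y                              # accept w.p. 1 ∧ pi(y)/pi(x)
    return x                                  # reject: stay at x
```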

Lem(Detailed balance equation). For a given transition matrix \(P\), if there exists a distribution \(\pi\) s.t.

\[\pi (x) P(x,y)=\pi (y) P(y,x),\quad\forall x,y \in \Omega , \]

then \(\pi\) is a stationary distribution and \(P\) is reversible.
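Indeed, summing the detailed balance equation over \(x\) gives stationarity directly:

\[(\pi P)(y)=\sum\limits_{x\in\Omega}\pi(x)P(x,y)=\sum\limits_{x\in\Omega}\pi(y)P(y,x)=\pi(y)\sum\limits_{x\in\Omega}P(y,x)=\pi(y). \]

Reversibility then follows from the definition of \(\hat{P}\): \(\hat{P}(x,y)=\frac{\pi(y)P(y,x)}{\pi(x)}=P(x,y).\)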

Ex. Given a graph \(G=(V,E)\), let \(f:\{0,1\}^{|V|}\rightarrow\mathbb{Z}_{\ge 0}\) map a bipartition \(x\) of \(V\) to the size of the corresponding cut. Find \(\max\limits_x f(x).\)

Solution: Define \(\pi_{\lambda}(x)\propto\lambda^{f(x)}\). Apply the MH algorithm with the proposal \(\Phi(x,y)=\frac{1}{|V|}\left[\sum\limits_i[x(i)\neq y(i)]=1\right]\), i.e., flip a uniformly random coordinate. Increase \(\lambda\) from \(1\) to \(\infty\) during the run (simulated annealing); see the sketch below.
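A rough Python sketch of this annealing procedure; the adjacency-list format, the linear \(\lambda\)-schedule, and the helper names are arbitrary choices for illustration, not part of the notes.

```python
import random

def cut_size(adj, x):
    """Size of the cut defined by the 0/1 labeling x."""
    return sum(1 for u in adj for v in adj[u] if u < v and x[u] != x[v])

def sa_max_cut(adj, T=100000):
    """Simulated annealing for max cut via Metropolis-Hastings moves."""
    V = list(adj)
    x = {v: random.randint(0, 1) for v in V}   # random initial bipartition
    val = cut_size(adj, x)
    best, best_val = dict(x), val
    for t in range(T):
        lam = 1.0 + t / 1000.0                 # slowly increase lambda
        v = random.choice(V)                   # proposal: flip one vertex
        # change in cut size if v switches sides
        delta = sum(1 if x[u] == x[v] else -1 for u in adj[v])
        # accept w.p. 1 ∧ lambda^delta = 1 ∧ pi(y)/pi(x)
        if delta >= 0 or random.random() < lam ** delta:
            x[v] ^= 1
            val += delta
            if val > best_val:
                best, best_val = dict(x), val
    return best, best_val

# example: the 4-cycle, whose maximum cut has size 4
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(sa_max_cut(adj, T=10000))
```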

Gibbs Sampling

\[P(x,y)= \begin{cases} \frac{1}{d}\cdot\frac{\pi(y)}{\sum\limits_a\pi(x_1,...,x_{i-1},a,x_{i+1},...,x_d)} & \{j\mid x_j\neq y_j\}=\{i\}\\ \frac{1}{d}\sum\limits_{i=1}^d\frac{\pi(x)}{\sum\limits_a\pi(x_1,...,x_{i-1},a,x_{i+1},...,x_d)} & y=x\\ 0 & \text{otherwise} \end{cases}\]

Intuitively, choose a coordinate \(x_i\) uniformly at random and resample it from its conditional distribution under \(\pi\), keeping the other coordinates fixed (see the sketch below). An alternative scheme is to choose \(x_i\) by sequentially scanning from \(x_1\) to \(x_d\).
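A minimal random-scan Gibbs step, assuming each coordinate ranges over a finite set `values` and `pi` is an unnormalized target density (hypothetical names):

```python
import random

def gibbs_step(x, values, pi):
    """One random-scan Gibbs update of the list x (coordinates x[0..d-1])."""
    d = len(x)
    i = random.randrange(d)                  # pick a coordinate uniformly
    # conditional weights pi(x_1,...,x_{i-1}, a, x_{i+1},...,x_d) for each a
    weights = []
    for a in values:
        y = list(x)
        y[i] = a
        weights.append(pi(y))
    # resample x_i from its conditional distribution under pi
    x[i] = random.choices(values, weights=weights)[0]
    return x
```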

Mixing Time

Def. The total variation distance between two distributions \(\mu,\nu\) over \(\Omega\) is

\[d_{TV}(\mu,\nu)=\max\limits_{A\subseteq\Omega}|\mu(A)-\nu(A)|. \]

Prop. \(d_{TV}(\mu,\nu)=\frac{1}{2}\sum\limits_{x\in\Omega}|\mu_x-\nu_x|.\)

Let \(d(t)=\max\limits_{x\in\Omega}\Vert P^t(x,\cdot)-\pi\Vert_{TV},\overline{d}(t)=\max\limits_{x,y\in\Omega}\Vert P^t(x,\cdot)-P^t(y,\cdot)\Vert_{TV}.\)

Lem 1. \(d(t)\le\overline{d}(t)\le 2d(t)\).

Lem 2. \(\overline{d}(s+t)\le\overline{d}(s)\overline{d}(t).\)

Pf. $$P^{s+t}(x,w)=\sum\limits_{z\in \Omega}P^s(x,z)P^t(z,w)=\mathbb{E}_{X_s}(P^t(X_s,w)),$$ where \(X_s\sim P^s(x,\cdot)\). Let \((X_s,Y_s)\) be an optimal coupling of \(P^s(x,\cdot)\) and \(P^s(y,\cdot)\), so that \(\mathbb{P}(X_s\neq Y_s)=\Vert P^s(x,\cdot)-P^s(y,\cdot)\Vert_{TV}\le\overline{d}(s)\). Then

\[\Vert P^{s+t}(x,\cdot)-P^{s+t}(y,\cdot)\Vert_{TV}=\frac{1}{2}\sum\limits_{w}\left|\mathbb{E}\left(P^t(X_s,w)-P^t(Y_s,w)\right)\right| \]

\[\le \mathbb{E}\left(\frac{1}{2}\sum\limits_{w}\left|P^t(X_s,w)-P^t(Y_s,w)\right|\right)\le \mathbb{P}(X_s\neq Y_s)\,\overline{d}(t)\le \overline{d}(s)\overline{d}(t). \]

Cor. For any positive integer \(c\),

\[d(ct)\le\overline{d}(ct)\le\overline{d}(t)^c\le (2d(t))^c. \]

Def. The mixing time of a Markov chain is

\[t_{mix}(\varepsilon)=\min\{t:d(t)\le \varepsilon\}. \]

\[t_{mix}=t_{mix}(\tfrac{1}{4}),\qquad t_{mix}(\varepsilon)\le \left\lceil\log_2\tfrac{1}{\varepsilon}\right\rceil t_{mix}. \]

\[t_{ave}(\varepsilon)=\min\{t:\max\limits_{x\in\Omega}\Vert a_t-\pi\Vert_1\le\varepsilon\}, \]

where \(a_t=\frac{1}{t}\sum\limits_{s=0}^{t-1}P^s(x,\cdot)\) is the average of the first \(t\) distributions of the chain started at \(x\).

Remark: \(t_{ave}\) exists without the assumption that the chain is aperiodic. For aperiodic chains, \(t_{mix}(\varepsilon)<t_{ave}(\varepsilon)\).

Random Walks on Undirected Graphs

Given an edge-weighted undirected graph, let \(w_{x,y}\) denote the weight of the edge between nodes \(x\) and \(y\). Let \(w_x=\sum\limits_y w_{x,y}\).

Let \(\Omega =V,P_{x,y}=\frac{w_{x,y}}{w_x}\). One can check that \((\Omega,P)\) is reversible and \(\pi(x)\propto w_x\) is the stationary distribution.
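To check this, write \(W=\sum\limits_z w_z\) and \(\pi(x)=\frac{w_x}{W}\); then detailed balance holds:

\[\pi(x)P_{x,y}=\frac{w_x}{W}\cdot\frac{w_{x,y}}{w_x}=\frac{w_{x,y}}{W}=\pi(y)P_{y,x}, \]

so \(\pi\) is stationary and the chain is reversible by the detailed balance lemma.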

Thm. If a Markov chain is reversible, finite, and aperiodic, then, writing \(\pi_*=\min\limits_{x\in\Omega}\pi(x)>0\),

\[\frac{1-\delta}{\delta}\log\left(\frac{1}{2\varepsilon}\right)\le t_{mix}(\varepsilon)\le \frac{1}{2\delta}\log\left(\frac{1}{\pi_*\varepsilon}\right), \]

where \(\delta=\lambda_1-\lambda_2=1-\lambda_2\) is the eigengap (spectral gap) of \(P\), with \(1=\lambda_1\ge\lambda_2\ge\dots\) the eigenvalues of \(P\).

Takeaway: \(\frac{1}{\delta}\) governs the mixing time, and \(\delta\) itself is captured by the "conductance" of the chain, defined next.

Def 4.2. For a subset \(S\) of vertices, the normalized conductance of \(S\) is

\[\Phi(S)=\frac{\sum\limits_{x\in S,y\in\overline{S}} \pi_x p_{x,y}}{\min(\pi(S),\pi(\overline{S}))}. \]

Def 4.3. The normalized conductance of a Markov chain is

\[\Phi=\min\limits_{S\subset \Omega, S\neq \emptyset}\Phi(S) \]

Thm(Cheeger's Inequality). Let \(\delta\) be the eigen gap of \(P\),

\[\frac{\delta}{2}\le \Phi\le \sqrt{2\delta}. \]

Combined with the previous theorem,

\[\frac{c}{\Phi }\log\frac{1}{\varepsilon} \le t_{mix} (\varepsilon) \le \frac{c}{\Phi^2 } \log\frac{1}{\pi_* \varepsilon} \]

Thm 4.5. $$t_{ave}(\varepsilon)=O\left(\frac{\ln(1/\pi_* )}{\Phi^2 \varepsilon^3 }\right).$$

Ex(1-D lattice). Let \(G\) be the ring (cycle) with \(n\) vertices, i.e., \(i\) is adjacent to \(i-1\) and \(i+1\) (mod \(n\)), and \(P_{x,y}=\frac{1}{2}\) for all edges.

Since the walk is periodic (when \(n\) is even), we can make it aperiodic by being "lazy": \(P'(x,y)=\begin{cases} \frac{1}{2}P(x,y) & x\neq y\\ \frac{1}{2} & x =y. \end{cases}\) For constant \(\varepsilon\), \(t_{mix}\) of the lazy chain is at least \(c\cdot t_{ave}\) of the original chain.

Let \(S_*\) be half of the ring; then \(\Phi=\Phi(S_*)=\frac{2}{n}.\) Thus Thm 4.5 gives \(t_{ave}=O(n^2\log n)\) for constant \(\varepsilon\).

Ex(2-D lattice). Let \(G\) be the \(n\times n\) grid with wraparound (a cyclic chessboard), with \(P_{x,y}=\frac{1}{4}\) for all edges.

For \(|S|\le n^2/2\), \(\Phi(S)=\frac{\#(\text{crossing edges})/4n^2}{|S|/n^2}\ge \Omega\left(\frac{\sqrt{|S|}/4n^2}{|S|/n^2}\right)=\Omega\left(\frac{1}{\sqrt{|S|}}\right)\ge\Omega\left(\frac{1}{n}\right).\) Thus \(t_{mix}=O(n^2\log n)\).

Remark: relative to the size of its state space, the 2-D lattice mixes faster than the 1-D lattice, since it is better connected.

Ex(clique). \(t_{mix}=O(1)\).

Coupling

Def. A coupling of a Markov chain with transition matrix \(P\) is a pair of sequences \(\{X_0,X_1,...,X_T\},\{Y_0,Y_1,...,Y_T\}\) s.t.

  • \(X_{t+1}\sim P(X_t,\cdot), Y_{t+1}\sim P(Y_t,\cdot).\)
  • If \(X_s=Y_s\), then \(X_t=Y_t\) for all \(t\ge s\).

Thm. Assume \(X_0=x,Y_0=y\). Let \(\tau_{couple}=\min\{t:X_t=Y_t\}\),

\[\Vert P^t (x,\cdot)-P^t (y,\cdot)\Vert_{TV}\le \mathbb{P}(X_t \neq Y_t )= {\mathbb{P} _ {(x,y)}} ( \tau_{couple} > t). \]

Remark: \(d(t)\le \max\limits_{x,y\in\Omega}{\mathbb{P} _ {(x,y)}} ( \tau_{couple} > t).\)

Ex(1-D lattice). \(P(x,y)=\begin{cases} \frac{1}{2} & y=x\\ \frac{1}{4} & y\equiv x\pm 1\pmod n \end{cases}.\)

The coupling constructed: with probability \(\frac{1}{2}\), \(X_t\) moves to a uniform neighbor and \(Y_t\) stays; with probability \(\frac{1}{2}\), \(Y_t\) moves and \(X_t\) stays. Then the distance between the two copies performs a simple random walk on \(\{0,...,n\}\), so \(\mathbb{E}(\tau)=k(n-k)\), where \(k\) is the initial distance between \(x\) and \(y\). By Markov's inequality, \(d(t)\le \max\limits_k\frac{k(n-k)}{t}\le \frac{n^2}{4t}\).
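A small simulation of this coupling, just to sanity-check \(\mathbb{E}(\tau)=k(n-k)\); function and parameter names are illustrative.

```python
import random

def coupling_time(n, k, trials=2000):
    """Average coupling time for the lazy walk on the n-cycle,
    starting the two copies at distance k."""
    total = 0
    for _ in range(trials):
        x, y, t = 0, k, 0
        while x != y:
            # one copy moves to a uniform neighbor, the other stays (w.p. 1/2 each)
            if random.random() < 0.5:
                x = (x + random.choice((-1, 1))) % n
            else:
                y = (y + random.choice((-1, 1))) % n
            t += 1
        total += t
    return total / trials

print(coupling_time(20, 5))   # should be close to 5 * (20 - 5) = 75
```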

Ex(hypercube). \(\Omega=\{0,1\}^n,|\Omega|=2^n,P(x,y)=\begin{cases} \frac{1}{2} & y=x\\ \frac{1}{2n} & \Vert y-x\Vert_1=1 \end{cases}.\)

The coupling constructed: pick a random coordinate \(i\in[n]\) and a random bit \(b_t\in \{0,1\}\), and set \(X_{t+1}^{(i)}=Y_{t+1}^{(i)}=b_t\), leaving the other coordinates unchanged. Then \(\tau\) is at most the first time every coordinate has been picked at least once, so by the coupon collector bound \(\mathbb{E}(\tau)\le nH_n\approx n\ln n.\)
