【智应数】Markov chains
Markov Chain & Stationary Distribution
Def(Finite Markov Chain). Let \(\Omega\) be a finite set of states and let \(P(x,y)\ge 0\), \(x,y\in\Omega\), be a transition function, i.e., \(\sum\limits_{y\in\Omega} P(x,y)=1\) for every \(x\). A finite Markov chain is a sequence of random variables \(X_0,X_1,X_2,\dots\) over \(\Omega\) such that for all \(t\) and all \(x_0,x_1,\dots,x_t,y\in\Omega\) we have $$\Pr(X_{t+1}=y\mid X_0=x_0,\dots,X_t=x_t)=P(x_t,y).$$
Given the initial distribution \(\mu_0\) over \(\Omega\) (as a row vector) and the transition matrix \(P\), the distribution after \(t\) transitions is $$\mu_t=\mu_{t-1}P=\mu_0P^t.$$
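A minimal numpy sketch of this computation (the 2-state chain below is a made-up example, not from the notes):

```python
import numpy as np

# A made-up 2-state chain: rows index the current state, columns the next state.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])   # start deterministically in state 0

def distribution_after(mu0, P, t):
    """Return mu_t = mu_0 P^t by repeated right-multiplication."""
    mu = mu0.copy()
    for _ in range(t):
        mu = mu @ P
    return mu

print(distribution_after(mu0, P, 50))  # approaches the stationary distribution
```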
Def. A distribution \(\pi\) is called stationary if \(\pi P=\pi\).
Lem(Brouwer's fixed point theorem). Every continuous map \(f\) from a convex, closed, bounded subset of \(\mathbb{R}^n\) to itself has a fixed point, i.e., a point \(x\) with \(f(x)=x\).
Thm(Perron-Frobenius thm). Any Markov chain over finite states has a stationary distribution.
Pf. Let \(|\Omega|=n,\Delta^{n-1}=\{\mu\in\mathbb{R}^n\mid \mu_i\ge 0,\sum\mu_i=1\}\). One can check that \(\Delta^{n-1}\) is convex, closed, and bounded. Since \(\mu\mapsto\mu P\) is a continuous linear map from \(\Delta^{n-1}\) to itself, by Brouwer's fixed point theorem there is a stationary distribution \(\pi\in\Delta^{n-1}\) s.t. \(\pi P=\pi\).
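As an illustration of the theorem (a sketch, reusing the made-up 2-state chain), a stationary distribution can be computed numerically as a left eigenvector of \(P\) for eigenvalue 1:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# pi P = pi  <=>  P^T pi^T = pi^T: take the eigenvector of P^T for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()           # normalize to a probability vector
print(pi, pi @ P)            # pi @ P should equal pi
```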
Def. A finite Markov chain is irreducible (connected) if \(\forall x,y\in\Omega,\exists t>0\) s.t. \(P^t(x,y)>0\).
Thm(Fundamental Theorem of Markov Chains). Any connected Markov chain has a unique stationary distribution \(\pi\).
Pf. Suppose \(Ph=h\) for some \(h\in\mathbb{R}^n\). Let \(i_0=\arg\max\limits_i h(i)\) and \(M=h(i_0)\). Since \(h(i_0)=\sum_j P(i_0,j)h(j)\) and every \(h(j)\le M\), each \(j\) with \(P(i_0,j)>0\) must satisfy \(h(j)=M\). Repeating this argument along paths, irreducibility gives \(h(i)=M\) for all \(i\). Thus \(Ph=h\) implies \(h=M\vec{1}\), so the eigenspace of \(P\) for eigenvalue \(1\) is one-dimensional, i.e., \(\text{rank}(P-I)=n-1\), and the stationary distribution is unique.
Def(Reversed Markov Chain). Given a Markov chain \((P,\Omega)\) with a stationary distribution \(\pi\) (assume \(\pi(x)>0\) for all \(x\)), let $$\hat{P}(x,y)=\frac{\pi(y)P(y,x)}{\pi(x)}.$$ Then \((\hat{P},\Omega)\) is the reversed Markov chain.
Def. If \(\hat{P}=P\), then \((P,\Omega)\) is reversible.
Markov Chain Monte Carlo
Given a distribution \(\pi\) and a function \(f\) over \(\Omega\) (usually \(\Omega=\{0,1,\dots,n-1\}^d\)), estimate $$\underset{\pi}{\mathbb{E}}(f)=\sum\limits_{x\in\Omega}\pi(x)f(x).$$
MCMC method:
- Design \(P\) s.t. \(\pi P=\pi\).
- Generate \(X_0,X_1,...,X_T\), where \(X_t\) is sampled from \(X_{t-1}\) according to \(P\).
- Output \(\frac{1}{T}\sum\limits_{t=1}^Tf(X_t)\) as an estimate of \(\underset{\pi}{\mathbb{E}}(f)\).
Why it works: let \(\mu_t\) be the distribution of \(X_t\) and \(\hat{\mu}=\frac{1}{T}\sum\limits_{t=1}^T\mu_t\). If \(\lim\limits_{t\rightarrow\infty}\mu_t=\pi\), then \(\lim\limits_{T\rightarrow\infty}\underset{\hat{\mu}}{\mathbb{E}}(f)=\underset{\pi}{\mathbb{E}}(f)\). Since \(X_t\sim\mu_t\), the estimator \(\frac{1}{T}\sum_t f(X_t)\) has expectation \(\underset{\hat{\mu}}{\mathbb{E}}(f)\), so the estimate converges.
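A generic sketch of the method (the `step` function and the toy lazy walk on \(\{0,\dots,9\}\) are assumptions for illustration):

```python
import numpy as np
rng = np.random.default_rng(0)

def mcmc_estimate(x0, step, f, T):
    """Generic MCMC estimate of E_pi[f]: run the chain and average f along the path.
    `step(x)` is assumed to sample the next state from P(x, .)."""
    x, total = x0, 0.0
    for _ in range(T):
        x = step(x)
        total += f(x)
    return total / T

# Toy usage: lazy random walk on {0,...,9}; pi is uniform, so E_pi[f] = mean of f.
def step(x):
    move = rng.choice([-1, 0, 0, 1])          # stay w.p. 1/2, move +-1 w.p. 1/4 each
    return (x + move) % 10

print(mcmc_estimate(0, step, lambda x: x, T=100_000))   # ~ 4.5
```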
Consider a graph \(G=(\Omega, E)\), where \(E=\{(x,y)\mid P(x,y)>0\}.\)
To keep the cost of each transition small, the degree of the vertices in \(G\) should be small.
To ensure a fast convergence rate, the diameter of \(G\) should be small (or \(G\) should satisfy some similar connectivity property).
For example, if \(\Omega=\{0,1\}^d\), a possible way is to let \(G\) be the hypercube.
Metropolis-Hastings Algorithm
Given a target distribution \(\pi\) (which may be known only up to normalization) and a proposal transition matrix \(\Phi\), the chain proposes \(y\sim\Phi(x,\cdot)\) and accepts with probability \(\min\left\{1,\frac{\pi(y)}{\pi(x)}\right\}\), i.e., $$P(x,y)=\Phi(x,y)\min\left\{1,\frac{\pi(y)}{\pi(x)}\right\}\ (y\neq x),\qquad P(x,x)=1-\sum\limits_{y\neq x}P(x,y);$$ here \(\Phi\) is a symmetric transition matrix.
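A sketch of one Metropolis step under these assumptions (symmetric proposal, possibly unnormalized \(\pi\)); the toy target and ring proposal below are made up for illustration:

```python
import numpy as np
rng = np.random.default_rng(1)

def metropolis_step(x, propose, pi):
    """One Metropolis step: propose y ~ Phi(x, .) with Phi symmetric,
    accept with probability min(1, pi(y)/pi(x)); pi may be unnormalized."""
    y = propose(x)
    if rng.random() < min(1.0, pi(y) / pi(x)):
        return y
    return x

# Toy usage: target pi(x) proportional to x+1 on {0,...,9};
# the proposal moves to a uniformly random ring neighbour (symmetric).
pi = lambda x: x + 1
propose = lambda x: (x + rng.choice([-1, 1])) % 10

x, counts = 0, np.zeros(10)
for _ in range(200_000):
    x = metropolis_step(x, propose, pi)
    counts[x] += 1
print(counts / counts.sum())   # roughly proportional to 1, 2, ..., 10
```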
Lem(Detailed balance equation). For a given transition matrix \(P\), if there exists a distribution \(\pi\) s.t. $$\pi(x)P(x,y)=\pi(y)P(y,x)\quad\forall x,y\in\Omega,$$ then \(\pi\) is a stationary distribution and \(P\) is reversible. In particular, the Metropolis-Hastings chain above satisfies detailed balance with respect to \(\pi\), so \(\pi\) is its stationary distribution.
Ex. Given a graph \(G=(V,E)\), let \(f:\{0,1\}^{|V|}\rightarrow\mathbb{Z}_{\ge 0}\) map a bipartition \(x\) to the size of the corresponding cut. Find \(\max\limits_x f(x).\)
Solution: Define \(\pi_{\lambda}(x)\propto\lambda^{f(x)}\). Apply the MH algorithm with the proposal \(\Phi(x,y)=\frac{1}{|V|}\left[\sum\limits_i[x(i)\neq y(i)]=1\right]\), i.e., flip one uniformly random coordinate. Gradually increase \(\lambda\) from \(1\) toward \(\infty\) (simulated annealing).
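A hedged sketch of this simulated-annealing scheme (the tiny graph and the linear \(\lambda\) schedule are arbitrary choices for illustration):

```python
import numpy as np
rng = np.random.default_rng(2)

def cut_size(x, edges):
    return sum(1 for (u, v) in edges if x[u] != x[v])

def simulated_annealing_maxcut(n, edges, T=20_000):
    """MH on pi_lambda(x) ~ lambda^{cut(x)} with single-bit-flip proposals,
    slowly increasing lambda (the schedule below is an arbitrary choice)."""
    x = rng.integers(0, 2, size=n)
    best, best_val = x.copy(), cut_size(x, edges)
    for t in range(T):
        lam = 1.0 + 5.0 * t / T                     # lambda grows from 1 to ~6
        i = rng.integers(n)
        y = x.copy(); y[i] ^= 1                     # flip one random coordinate
        # acceptance ratio pi(y)/pi(x) = lam^(cut(y) - cut(x))
        if rng.random() < min(1.0, lam ** (cut_size(y, edges) - cut_size(x, edges))):
            x = y
        if cut_size(x, edges) > best_val:
            best, best_val = x.copy(), cut_size(x, edges)
    return best, best_val

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]    # small made-up graph
print(simulated_annealing_maxcut(4, edges))
```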
Gibbs Sampling
Intuitively, choose a coordinate \(x_i\) uniformly at random and resample it according to the conditional distribution \(\pi(\cdot\mid x_{-i})\) (random scan). An alternative scheme is to choose \(x_i\) by sequentially scanning from \(x_1\) to \(x_d\) (systematic scan).
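A sketch of a random-scan Gibbs step for binary vectors, assuming \(\pi\) is given in unnormalized form (the toy target is made up):

```python
import numpy as np
rng = np.random.default_rng(3)

def gibbs_step(x, pi):
    """One random-scan Gibbs step for a binary vector x:
    pick a coordinate i uniformly and resample x_i from pi(. | x_{-i});
    pi may be unnormalized."""
    i = rng.integers(len(x))
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0, 1
    p1 = pi(x1) / (pi(x0) + pi(x1))   # conditional probability of x_i = 1
    x[i] = 1 if rng.random() < p1 else 0
    return x

# Toy usage: pi(x) proportional to 2^{sum(x)} over {0,1}^4.
pi = lambda x: 2.0 ** x.sum()
x = np.zeros(4, dtype=int)
for _ in range(10_000):
    x = gibbs_step(x, pi)
print(x)
```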
Mixing time
Def. The total variation distance between two distributions \(\mu,\nu\) over \(\Omega\) is $$d_{TV}(\mu,\nu)=\Vert\mu-\nu\Vert_{TV}=\max\limits_{A\subseteq\Omega}|\mu(A)-\nu(A)|.$$
Prop. \(d_{TV}(\mu,\nu)=\frac{1}{2}\sum\limits_{x\in\Omega}|\mu_x-\nu_x|.\)
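The proposition gives a direct way to compute the distance (a small illustrative sketch):

```python
import numpy as np

def tv_distance(mu, nu):
    """Total variation distance via the proposition: half the L1 distance."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

print(tv_distance([0.5, 0.5, 0.0], [1/3, 1/3, 1/3]))   # = 1/3
```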
Let \(d(t)=\max\limits_{x\in\Omega}\Vert P^t(x,\cdot)-\pi\Vert_{TV},\overline{d}(t)=\max\limits_{x,y\in\Omega}\Vert P^t(x,\cdot)-P^t(y,\cdot)\Vert_{TV}.\)
Lem 1. \(d(t)\le\overline{d}(t)\le 2d(t)\).
Lem 2. \(\overline{d}(s+t)\le\overline{d}(s)\overline{d}(t).\)
Pf. $$P^{s+t}(x,w)=\sum\limits_{z\in\Omega}P^s(x,z)P^t(z,w)=\mathbb{E}_{X_s}\left(P^t(X_s,w)\right),$$ where \(X_s\sim P^s(x,\cdot)\). Coupling \(X_s\sim P^s(x,\cdot)\) and \(Y_s\sim P^s(y,\cdot)\) optimally so that \(\Pr(X_s\neq Y_s)\le\overline{d}(s)\), and noting that \(\Vert P^t(X_s,\cdot)-P^t(Y_s,\cdot)\Vert_{TV}\le\overline{d}(t)\) when \(X_s\neq Y_s\) (and \(=0\) otherwise), gives \(\overline{d}(s+t)\le\overline{d}(s)\overline{d}(t)\).
Cor. For any positive integer \(c\), $$d(ct)\le\overline{d}(ct)\le\overline{d}(t)^c\le\left(2d(t)\right)^c.$$
Def. The mixing time of a Markov chain is $$t_{mix}(\varepsilon)=\min\{t:d(t)\le\varepsilon\},$$ and the average (Cesàro) mixing time is $$t_{ave}(\varepsilon)=\min\left\{t:\max\limits_{x\in\Omega}\left\Vert\frac{1}{t}\sum\limits_{s=1}^{t}P^s(x,\cdot)-\pi\right\Vert_{TV}\le\varepsilon\right\}.$$
Remark: \(t_{ave}\) is finite even without the assumption that the chain is aperiodic. For aperiodic chains, \(t_{mix}(\varepsilon)<t_{ave}(\varepsilon)\).
Random Walks on Undirected Graphs
Given an edge-weighted undirected graph, let \(w_{x,y}\) denote the weight of the edge between nodes \(x\) and \(y\). Let \(w_x=\sum\limits_y w_{x,y}\).
Let \(\Omega =V,P_{x,y}=\frac{w_{x,y}}{w_x}\). One can check that \((\Omega,P)\) is reversible and \(\pi(x)\propto w_x\) is the stationary distribution.
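A quick numerical check of these claims on a made-up weighted graph (a sketch, not part of the notes):

```python
import numpy as np

# Made-up symmetric weight matrix (W[x, y] = w_{x,y}, zero diagonal).
W = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
w = W.sum(axis=1)                 # w_x
P = W / w[:, None]                # P[x, y] = w_{x,y} / w_x
pi = w / w.sum()                  # claimed stationary distribution

print(np.allclose(pi @ P, pi))                            # stationarity
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))  # detailed balance (reversibility)
```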
Thm. If the Markov chain is reversible, finite, and aperiodic, then for \(\pi_*=\min\limits_{x\in\Omega}\pi(x)>0\), $$t_{mix}(\varepsilon)=O\!\left(\frac{1}{\delta}\ln\frac{1}{\pi_*\varepsilon}\right),$$
where \(\delta=\lambda_1-\lambda_2\) is the eigen gap of \(P\).
Takeaway: \(\frac{1}{\delta}\) (the relaxation time) is governed by the "conductance" of the chain, made precise below.
Def 4.2. For a subset \(S\) of vertices, the normalized conductance of \(S\) is $$\phi(S)=\frac{\sum\limits_{x\in S,\,y\notin S}\pi(x)P(x,y)}{\min\left(\pi(S),\pi(\bar S)\right)}.$$
Def 4.3. The normalized conductance of a Markov chain is $$\phi=\min\limits_{\emptyset\neq S\subsetneq\Omega}\phi(S).$$
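A sketch computing \(\phi(S)\) directly from the definitions above (the lazy walk on an 8-state ring is an arbitrary test case):

```python
import numpy as np

def conductance(P, pi, S):
    """Normalized conductance phi(S): probability flow pi(x) P(x,y) across the cut,
    divided by min(pi(S), pi(S complement))."""
    n = len(pi)
    S = set(S)
    Sbar = [y for y in range(n) if y not in S]
    flow = sum(pi[x] * P[x, y] for x in S for y in Sbar)
    return flow / min(pi[list(S)].sum(), pi[Sbar].sum())

# Lazy walk on a ring of 8 states; S = one half of the ring.
n = 8
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x - 1) % n] = 0.25
    P[x, (x + 1) % n] = 0.25
pi = np.full(n, 1 / n)
print(conductance(P, pi, range(n // 2)))   # flow 2*(1/8)*(1/4)=1/16, so phi(S)=1/8
```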
Thm(Cheeger's Inequality). Let \(\delta\) be the eigen gap of \(P\); then $$\frac{\phi^2}{2}\le\delta\le 2\phi.$$
Combined with the previous theorem, this yields a conductance-based bound on the mixing time:
Thm 4.5.$$t_{ave}(\varepsilon)=O\left(\frac{\ln(1/\pi_* )}{\phi^2 \varepsilon^3 }\right).$$
Ex(1-D lattice). Let \(G\) be the ring with \(n\) vertices, i.e., \(i\) is adjacent to \(i-1\) and \(i+1\) (mod \(n\)), and \(P_{x,y}=\frac{1}{2}\) for all edges.
Since this chain is periodic (for even \(n\)), we can make the process aperiodic by being "lazy": \(P'(x,y)=\begin{cases} \frac{1}{2}P(x,y) & x\neq y\\ \frac{1}{2} & x =y. \end{cases} \) For constant \(\varepsilon\), \(t_{mix}\) of the lazy process is at least \(c\cdot t_{ave}\) of the original process.
Let \(S_*\) be half of the ring; then \(\phi=\phi(S_*)=\frac{2}{n}\). Thus \(t_{ave}=O(n^2)\).
Ex(2-D lattice). Let \(G\) be the \(n\times n\) cyclic grid (torus), with \(P_{x,y}=\frac{1}{4}\) for all edges.
\(\phi(S)=\frac{\#(\text{crossing edges})/(4n^2)}{|S|/n^2}\ge \Omega\!\left(\frac{\sqrt{|S|}/(4n^2)}{|S|/n^2}\right)\ge\Omega\!\left(\frac{1}{n}\right)\) for \(|S|\le n^2/2\). Thus \(t_{ave}=O(n^2)\).
Remark: relative to its number of states, the 2-D lattice mixes faster than the 1-D lattice because it is better connected.
Ex(clique). \(t_{mix}=O(1)\).
Coupling
Def. A coupling of a Markov chain with transition matrix \(P\) is a pair of processes \((X_0,\dots,X_T)\) and \((Y_0,\dots,Y_T)\) on a common probability space s.t.
- \(X_{t+1}\sim P(X_t,\cdot)\) and \(Y_{t+1}\sim P(Y_t,\cdot)\);
- if \(X_s=Y_s\), then \(X_t=Y_t\) for all \(t\ge s\).
Thm. Assume \(X_0=x,Y_0=y\). Let \(\tau_{couple}=\min\{t:X_t=Y_t\}\); then $$\Vert P^t(x,\cdot)-P^t(y,\cdot)\Vert_{TV}\le\mathbb{P}_{(x,y)}(\tau_{couple}>t).$$
Remark: by Lem 1, \(d(t)\le\overline{d}(t)\le \max\limits_{x,y\in\Omega}\mathbb{P}_{(x,y)}(\tau_{couple}>t).\)
Ex(1-D lattice). \(P(x,y)=\begin{cases} \frac{1}{2} & y=x\\ \frac{1}{4} & |y-x|=1 \end{cases}.\)
The coupling constructed: w.p. \(\frac{1}{2}\), \(X_t\) moves (to a uniformly random neighbor) and \(Y_t\) stays; w.p. \(\frac{1}{2}\), \(Y_t\) moves and \(X_t\) stays; once the chains meet, they move together. The distance between \(X_t\) and \(Y_t\) then performs a simple random walk, so \(\mathbb{E}(\tau_{couple})=k(n-k)\), where \(k=|x-y|\). By Markov's inequality, \(d(t)\le \frac{n^2}{4t}\).
Ex(hypercube). \(\Omega=\{0,1\}^n\), \(|\Omega|=2^n\), \(P(x,y)=\begin{cases} \frac{1}{2} & y=x\\ \frac{1}{2n} & \Vert y-x\Vert_1=1 \end{cases}.\)
The coupling constructed: pick a random coordinate \(i\in[n]\) and a random bit \(b_t\in \{0,1\}\), and set \(X_{t+1}^{(i)}=Y_{t+1}^{(i)}=b_t\) (all other coordinates stay unchanged in both chains). Then \(\mathbb{E}(\tau_{couple})=\mathbb{E}(\text{time until every coordinate has been selected at least once})=nH_n\approx n\ln n\) by the coupon collector argument.
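A simulation sketch of this coupling-time argument (the parameters are arbitrary):

```python
import numpy as np
rng = np.random.default_rng(4)

def coupling_time_hypercube(n):
    """Simulate the coupling: each step picks a uniform coordinate and sets it to the
    same random bit in both chains, so the chains agree once every coordinate has
    been picked at least once (coupon collector)."""
    seen = np.zeros(n, dtype=bool)
    t = 0
    while not seen.all():
        seen[rng.integers(n)] = True
        t += 1
    return t

n = 20
samples = [coupling_time_hypercube(n) for _ in range(5_000)]
print(np.mean(samples), n * sum(1 / k for k in range(1, n + 1)))  # empirical vs n*H_n
```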