【智应数】Markov chains

Markov Chain & Stationary Distribution

Def (Finite Markov Chain). Let $\Omega$ be a finite set of states and let $P(x,y)\ge 0$, $x,y\in\Omega$, be a transition function, i.e., $\sum_{y\in\Omega}P(x,y)=1$ for every $x$. A finite Markov chain is a sequence of random variables

$$(X_0, X_1, X_2, \dots)$$

such that for all $t$ and any $x_0,x_1,\dots,x_{t+1}\in\Omega$ we have

$$\Pr(X_{t+1}=x_{t+1}\mid X_0=x_0, X_1=x_1,\dots,X_t=x_t)=\Pr(X_{t+1}=x_{t+1}\mid X_t=x_t)=P(x_t,x_{t+1}).$$

Given an initial distribution $\mu_0$ over $\Omega$ and the transition matrix $P$, the distribution after $t$ transitions is

$$\mu_t=\mu_{t-1}P=\mu_0 P^t.$$

Def. A distribution $\pi$ is called stationary if $\pi P=\pi$.
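
As a quick sanity check, here is a minimal NumPy sketch (the 3-state matrix `P` is a made-up example, not from the notes) that iterates $\mu_t=\mu_0P^t$ and compares the limit with the left eigenvector of $P$ for eigenvalue $1$:

```python
import numpy as np

# A made-up 3-state transition matrix; rows sum to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])   # initial distribution mu_0
for _ in range(200):             # mu_t = mu_{t-1} P = mu_0 P^t
    mu = mu @ P

print("mu_200          :", mu)
print("mu_200 P        :", mu @ P)   # approximately equal: pi P = pi

# The stationary distribution is the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi = pi / pi.sum()
print("left eigenvector:", pi)
```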

Lem (Brouwer's fixed point theorem). Every continuous map $f$ from a convex, closed, bounded subset of $\mathbb{R}^n$ to itself has a fixed point, i.e., a point $x$ with $f(x)=x$.

Thm (Perron-Frobenius theorem). Any Markov chain over finitely many states has a stationary distribution.

Pf. Let $|\Omega|=n$ and $\Delta_{n-1}=\{\mu\in\mathbb{R}^n \mid \mu_i\ge 0,\ \sum_i\mu_i=1\}$. The simplex $\Delta_{n-1}$ is convex, closed, and bounded. Since $\mu\mapsto\mu P$ is a continuous (linear) map from $\Delta_{n-1}$ to itself, Brouwer's fixed point theorem yields a stationary distribution $\pi\in\Delta_{n-1}$ with $\pi P=\pi$.

Def. A finite Markov chain is irreducible (connected) if $\forall x,y\in\Omega$, $\exists t>0$ s.t. $P^t(x,y)>0$.

Thm (Fundamental Theorem of Markov Chains). Any connected Markov chain has a unique stationary distribution $\pi$.

Pf. Suppose $Ph=h$ for some $h\in\mathbb{R}^n$. Let $i_0=\arg\max_i h(i)$ and $M=h(i_0)$. For every $j\in\Omega$ with $P(i_0,j)>0$ we must have $h(j)=M$, since otherwise the average $\sum_j P(i_0,j)h(j)$ would be strictly smaller than $M$. By irreducibility, $h(i)=M$ for all $i$. Thus $Ph=h$ implies $h=M\mathbf{1}$, so $\operatorname{rank}(P-I)=n-1$; hence the solution space of $\pi(P-I)=0$ is one-dimensional and the stationary distribution (normalized to sum to $1$) is unique.

Def (Reversed Markov Chain). Given a Markov chain $(P,\Omega)$ with a stationary distribution $\pi$, let $\hat P(x,y)=\frac{\pi(y)P(y,x)}{\pi(x)}$. Then $(\hat P,\Omega)$ is the reversed Markov chain.

Def. If $\hat P=P$, then $(P,\Omega)$ is reversible.

Markov Chain Monte Carlo

Given a distribution $\pi$ and a function $f$ over $\Omega$ (usually $\Omega=\{0,1,\dots,n-1\}^d$), estimate

$$\mathbb{E}_\pi(f)=\sum_{x\in\Omega}f(x)\pi(x).$$

MCMC method:

  • Design $P$ s.t. $\pi P=\pi$.
  • Generate $X_0,X_1,\dots,X_T$, where $X_t$ is obtained from $X_{t-1}$ by one transition according to $P$.
  • Output $\frac{1}{T}\sum_{t=1}^{T}f(X_t)$ as an estimate of $\mathbb{E}_\pi(f)$.

Let $\hat\mu=\frac{1}{T}\sum_{t=1}^{T}\mu_t$. If $\lim_{t\to\infty}\mu_t=\pi$, then $\lim_{T\to\infty}\mathbb{E}_{\hat\mu}(f)=\mathbb{E}_\pi(f)$. Since $X_t\sim\mu_t$, the empirical average above has expectation $\mathbb{E}_{\hat\mu}(f)$, so the estimator works.
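
A toy sketch of the three steps above, again with a hypothetical 3-state chain and a hypothetical $f$ (neither comes from the notes); the empirical average $\frac1T\sum_t f(X_t)$ is compared against the exact $\mathbb{E}_\pi(f)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain and function for illustration only.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
f = np.array([1.0, 4.0, 9.0])

T = 100_000
x = 0                                   # X_0
total = 0.0
for _ in range(T):
    x = rng.choice(3, p=P[x])           # X_t ~ P(X_{t-1}, .)
    total += f[x]
print("MCMC estimate:", total / T)

# Compare with the exact E_pi(f) computed from the stationary distribution.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi /= pi.sum()
print("exact E_pi(f):", float(pi @ f))
```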

Consider a graph $G=(\Omega,E)$, where $E=\{(x,y)\mid P(x,y)>0\}$.

To keep each transition cheap to simulate, the degree of the vertices in $G$ should be small.

To ensure a fast convergence rate, the diameter of $G$ should be small (or $G$ should satisfy some other expansion-type property).

For example, if $\Omega=\{0,1\}^d$, a natural choice is to let $G$ be the hypercube.

Metropolis-Hastings Algorithm

$$P(x,y)=\begin{cases}\Phi(x,y)\,\min\!\left(1,\dfrac{\pi(y)}{\pi(x)}\right) & x\ne y\\ 1-\sum_{z\ne x}\Phi(x,z)\,\min\!\left(1,\dfrac{\pi(z)}{\pi(x)}\right) & x=y\end{cases}$$

where $\Phi$ is a symmetric proposal transition matrix.

Lem (Detailed balance equations). For a given transition matrix $P$, if there exists a distribution $\pi$ s.t.

$$\pi(x)P(x,y)=\pi(y)P(y,x),\quad \forall x,y\in\Omega,$$

then $\pi$ is a stationary distribution and $P$ is reversible.
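
A minimal Metropolis-Hastings sketch under these definitions; the state space $\{0,\dots,n-1\}^d$, the one-coordinate proposal $\Phi$, and the unnormalized target `weight` are illustrative assumptions, not anything fixed by the notes:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 8, 3                                      # assumed Omega = {0,...,n-1}^d

def weight(x):
    # unnormalized pi(x); any positive function works here (made-up example)
    return np.exp(-0.5 * np.sum((x - n / 2) ** 2))

def mh_step(x):
    y = x.copy()
    i = rng.integers(d)
    y[i] = (y[i] + rng.choice([-1, 1])) % n      # symmetric proposal Phi
    # accept with probability min(1, pi(y)/pi(x))
    if rng.random() < min(1.0, weight(y) / weight(x)):
        return y
    return x

x = rng.integers(n, size=d)
samples = []
for t in range(50_000):
    x = mh_step(x)
    samples.append(x.copy())
print("empirical mean of each coordinate:", np.mean(samples, axis=0))
```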

Ex. Given a graph $G=(V,E)$, let $f:\{0,1\}^{|V|}\to\mathbb{Z}_{\ge 0}$ map a bipartition of the vertices to the size of the corresponding cut. Find $\max_x f(x)$.

Solution: Define $\pi_\lambda(x)\propto\lambda^{f(x)}$. Apply the MH algorithm with the proposal $\Phi(x,y)=\frac{1}{|V|}\,\mathbb{1}\!\left[\sum_i\mathbb{1}[x(i)\ne y(i)]=1\right]$, i.e., flip one uniformly random coordinate. Increase $\lambda$ from $1$ to $\infty$ (simulated annealing).
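
A rough simulated-annealing sketch along the lines of this solution; the random graph, the $\lambda$ schedule, and the step counts are arbitrary illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random graph on num_v vertices (illustrative).
num_v = 20
adj = np.triu(rng.random((num_v, num_v)) < 0.3, 1)
edges = np.argwhere(adj)                      # list of (u, v) with u < v

def cut_size(x):
    return int(np.sum(x[edges[:, 0]] != x[edges[:, 1]]))

x = rng.integers(0, 2, size=num_v)            # random bipartition in {0,1}^V
cur = cut_size(x)
best_cut = cur
for lam in np.linspace(1.0, 20.0, 50):        # slowly increase lambda
    for _ in range(2000):
        i = rng.integers(num_v)               # proposal: flip one vertex
        y = x.copy(); y[i] ^= 1
        new = cut_size(y)
        # MH acceptance for pi_lambda(x) proportional to lambda^f(x)
        if rng.random() < min(1.0, lam ** (new - cur)):
            x, cur = y, new
            best_cut = max(best_cut, cur)
print("best cut found:", best_cut, "of", len(edges), "edges")
```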

Gibbs Sampling

$$P(x,y)=\begin{cases}\dfrac{1}{d}\cdot\dfrac{\pi(y)}{\sum_a \pi(x_1,\dots,x_{i-1},a,x_{i+1},\dots,x_d)} & \{j\mid x_j\ne y_j\}=\{i\}\\ 0 & \text{otherwise, for } y\ne x,\end{cases}$$

with the diagonal entry $P(x,x)$ determined by normalization.

Intuitively, choose a coordinate $x_i$ uniformly at random and resample it from the conditional distribution of $\pi$ given the remaining coordinates. An alternative scheme is to choose $x_i$ by scanning sequentially from $x_1$ to $x_d$.
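
A small random-scan Gibbs sketch over $\{0,1\}^d$; the unnormalized target `weight` is a made-up example, and the point is only the single-coordinate conditional resampling:

```python
import numpy as np

rng = np.random.default_rng(3)

d = 4

def weight(x):
    # unnormalized pi(x): favours configurations with many adjacent equal bits
    return np.exp(2.0 * np.sum(x[:-1] == x[1:]))

def gibbs_step(x):
    i = rng.integers(d)                       # choose a coordinate uniformly
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0, 1
    w0, w1 = weight(x0), weight(x1)
    x[i] = rng.random() < w1 / (w0 + w1)      # resample x_i from pi(. | x_{-i})
    return x

x = rng.integers(0, 2, size=d)
counts = {}
for t in range(100_000):
    x = gibbs_step(x)
    counts[tuple(x)] = counts.get(tuple(x), 0) + 1
# empirical frequencies should approximate pi after mixing
top = sorted(counts.items(), key=lambda kv: -kv[1])[:3]
print("most frequent states:", top)
```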

Mixing time

Def. The total variation distance between two distributions $\mu,\nu$ is

$$d_{TV}(\mu,\nu)=\max_{A\subseteq\Omega}|\mu(A)-\nu(A)|.$$

Prop. $d_{TV}(\mu,\nu)=\frac{1}{2}\sum_{x\in\Omega}|\mu_x-\nu_x|$.
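
A quick numerical check that the two expressions for $d_{TV}$ agree, on two arbitrary distributions over a 4-element state space:

```python
import numpy as np
from itertools import chain, combinations

mu = np.array([0.1, 0.4, 0.2, 0.3])
nu = np.array([0.25, 0.25, 0.25, 0.25])

# Half the L1 distance.
tv_l1 = 0.5 * np.sum(np.abs(mu - nu))

# Max over all subsets A of |mu(A) - nu(A)|.
idx = range(len(mu))
subsets = chain.from_iterable(combinations(idx, r) for r in range(len(mu) + 1))
tv_max = max(abs(mu[list(A)].sum() - nu[list(A)].sum()) for A in subsets)

print(tv_l1, tv_max)   # the two values agree
```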

Let $d(t)=\max_{x\in\Omega}\|P^t(x,\cdot)-\pi\|_{TV}$ and $\bar d(t)=\max_{x,y\in\Omega}\|P^t(x,\cdot)-P^t(y,\cdot)\|_{TV}$.

Lem 1. $d(t)\le\bar d(t)\le 2d(t)$.

Lem 2. $\bar d(s+t)\le\bar d(s)\,\bar d(t)$.

Pf. $P^{s+t}(x,w)=\sum_{z\in\Omega}P^s(x,z)P^t(z,w)=\mathbb{E}_{X_s}\!\left[P^t(X_s,w)\right]$, where $X_s\sim P^s(x,\cdot)$. Let $(X_s,Y_s)$ be an optimal coupling of $P^s(x,\cdot)$ and $P^s(y,\cdot)$, so that $\Pr(X_s\ne Y_s)=\|P^s(x,\cdot)-P^s(y,\cdot)\|_{TV}\le\bar d(s)$. Then

$$\|P^{s+t}(x,\cdot)-P^{s+t}(y,\cdot)\|_{TV}=\frac{1}{2}\sum_w\left|\mathbb{E}_{X_s}\!\left[P^t(X_s,w)\right]-\mathbb{E}_{Y_s}\!\left[P^t(Y_s,w)\right]\right|\le \Pr(X_s\ne Y_s)\,\bar d(t)\le\bar d(s)\,\bar d(t).$$

Cor. For any positive integer $c$,

$$d(ct)\le\bar d(ct)\le\bar d(t)^c\le(2d(t))^c.$$

Def. The mixing time of a Markov chain is

$$t_{mix}(\varepsilon)=\min\{t: d(t)\le\varepsilon\}.$$

We write $t_{mix}=t_{mix}(\tfrac14)$; then $t_{mix}(\varepsilon)\le\left\lceil\log_2\tfrac{1}{\varepsilon}\right\rceil t_{mix}$.

$$t_{ave}(\varepsilon)=\min\Big\{t: \max_{x\in\Omega}\big\|a_t(x,\cdot)-\pi\big\|_1\le\varepsilon\Big\},\quad\text{where } a_t(x,\cdot)=\frac{1}{t}\sum_{s=1}^{t}P^s(x,\cdot).$$

Remark: $t_{ave}$ exists without assuming that the chain is aperiodic. For aperiodic chains, $t_{mix}(\varepsilon)<t_{ave}(\varepsilon)$.

Random Walks on Undirected Graphs

Given an edge-weighted undirected graph, let $w_{x,y}$ denote the weight of the edge between nodes $x$ and $y$, and let $w_x=\sum_y w_{x,y}$.

Let $\Omega=V$ and $P(x,y)=\frac{w_{x,y}}{w_x}$. One can check that $(\Omega,P)$ is reversible and that $\pi(x)\propto w_x$ is the stationary distribution.
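
A short sketch checking, on a random symmetric weight matrix (an assumption made only for illustration), that $\pi(x)\propto w_x$ satisfies $\pi P=\pi$ and the detailed balance equations:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 6
W = rng.random((n, n)); W = (W + W.T) / 2     # symmetric edge weights w_{x,y}
np.fill_diagonal(W, 0.0)

w = W.sum(axis=1)                             # w_x = sum_y w_{x,y}
P = W / w[:, None]                            # P(x,y) = w_{x,y} / w_x
pi = w / w.sum()                              # candidate stationary distribution

print("pi P == pi ?", np.allclose(pi @ P, pi))
# detailed balance: pi(x) P(x,y) == pi(y) P(y,x)
flow = pi[:, None] * P
print("reversible? ", np.allclose(flow, flow.T))
```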

Thm. If a Markov chain is reversible, finite, and aperiodic, then with $\pi_{\min}=\min_{x\in\Omega}\pi(x)>0$,

$$\frac{1-\delta}{\delta}\log\frac{1}{2\varepsilon}\;\le\; t_{mix}(\varepsilon)\;\le\;\frac{1}{\delta}\log\frac{1}{\pi_{\min}\varepsilon},$$

where $\delta=\lambda_1-\lambda_2$ is the eigenvalue gap of $P$ (here $\lambda_1=1$).

Takeaway: $\frac{1}{\delta}$ (the relaxation time) reflects the "conductance" of the chain.

Def 4.2. For a subset $S$ of vertices, the normalized conductance of $S$ is

$$\Phi(S)=\frac{\sum_{x\in S,\,y\notin S}\pi(x)P(x,y)}{\min\big(\pi(S),\pi(\bar S)\big)}.$$

Def 4.3. The normalized conductance of a Markov chain is

$$\Phi=\min_{\emptyset\ne S\subsetneq\Omega}\Phi(S).$$

Thm (Cheeger's Inequality). Let $\delta$ be the eigenvalue gap of $P$. Then

$$\frac{\delta}{2}\le\Phi\le\sqrt{2\delta}.$$

Combined with the previous theorem, for suitable constants $c,c'>0$,

$$\frac{c}{\Phi}\log\frac{1}{\varepsilon}\;\le\; t_{mix}(\varepsilon)\;\le\;\frac{c'}{\Phi^2}\log\frac{1}{\pi_{\min}\varepsilon}.$$

Thm 4.5. $t_{ave}(\varepsilon)=O\!\left(\dfrac{\ln(1/\pi_{\min})}{\Phi^2\varepsilon^3}\right)$.
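
A numerical spot-check of Cheeger's inequality $\frac{\delta}{2}\le\Phi\le\sqrt{2\delta}$ on a small reversible chain, namely the random walk $P(x,y)=w_{x,y}/w_x$ on a random weighted graph (the construction and the small $n$, chosen so that all subsets $S$ can be enumerated, are assumptions for illustration only):

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(6)
n = 5
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
w = W.sum(axis=1); P = W / w[:, None]; pi = w / w.sum()

# eigen gap delta = 1 - lambda_2; for reversible P, D^{1/2} P D^{-1/2} is symmetric
A = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
lam = np.sort(np.linalg.eigvalsh((A + A.T) / 2))   # real eigenvalues, ascending
delta = 1.0 - lam[-2]

def conductance(S):
    mask = np.zeros(n, dtype=bool); mask[list(S)] = True
    flow = np.sum(pi[mask][:, None] * P[mask][:, ~mask])   # sum pi(x) P(x,y), x in S, y not in S
    return flow / min(pi[mask].sum(), pi[~mask].sum())

subsets = chain.from_iterable(combinations(range(n), r) for r in range(1, n))
Phi = min(conductance(S) for S in subsets)
print(delta / 2, "<=", Phi, "<=", np.sqrt(2 * delta))
```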

Ex (1-D lattice). Let $G$ be the ring with $n$ vertices, i.e., $i$ is adjacent to $i-1$ and $i+1$ (mod $n$), and $P(x,y)=\frac12$ for all edges.

Since this walk is periodic (for even $n$), we can make it aperiodic by being "lazy":

$$P'(x,y)=\begin{cases}\frac12 P(x,y) & x\ne y\\ \frac12 & x=y.\end{cases}$$

For constant $\varepsilon$, $t_{mix}$ of the lazy process is at most $c\cdot t_{ave}$ of the original process.

Let $S$ be half of the ring; then $\Phi=\Phi(S)=\frac{2}{n}$. Thus $t_{ave}=O(n^2)$ (ignoring the $\log$ and $\varepsilon$ factors).

Ex (2-D lattice). Let $G$ be the cyclic chessboard (torus) of size $n\times n$, with $P(x,y)=\frac14$ for all edges.

For any $S$ with $\pi(S)\le\frac12$,

$$\Phi(S)=\frac{\#\{\text{crossing edges}\}/(4n^2)}{|S|/n^2}\ \ge\ \Omega\!\left(\frac{\sqrt{|S|}/(4n^2)}{|S|/n^2}\right)\ \ge\ \Omega\!\left(\frac{1}{n}\right).$$

Thus $t_{mix}=O(n^2)$.

Remark: relative to its number of states, the 2-D lattice mixes faster than the 1-D lattice; it is better connected.

Ex (clique). For the complete graph, $t_{mix}=O(1)$.

Coupling

Def. A coupling of a Markov chain with transition matrix $P$ is a pair of processes $(X_0,\dots,X_T)$, $(Y_0,\dots,Y_T)$ s.t.

  • $X_{t+1}\sim P(X_t,\cdot)$ and $Y_{t+1}\sim P(Y_t,\cdot)$.
  • If $X_s=Y_s$, then $X_t=Y_t$ for all $t\ge s$.

Thm. Assume $X_0=x$ and $Y_0=y$. Let $\tau_{couple}=\min\{t: X_t=Y_t\}$. Then

$$\|P^t(x,\cdot)-P^t(y,\cdot)\|_{TV}\le \Pr(X_t\ne Y_t)=\Pr_{(x,y)}(\tau_{couple}>t).$$

Remark: $d(t)\le\max_{x,y\in\Omega}\Pr_{(x,y)}(\tau_{couple}>t)$.

Ex (1-D lattice). Consider the lazy walk on the ring:

$$P(x,y)=\begin{cases}\frac12 & y=x\\ \frac14 & y\equiv x\pm 1\pmod n.\end{cases}$$

The coupling: with probability $\frac12$, $X_t$ moves (to a uniformly random neighbour) and $Y_t$ stays; with probability $\frac12$, $Y_t$ moves and $X_t$ stays. Once the two chains meet, they move together. The (clockwise) distance between them then performs a simple random walk, so $\mathbb{E}(\tau_{couple})=k(n-k)$, where $k$ is the initial distance between $x$ and $y$. By Markov's inequality, $d(t)\le\frac{n^2}{4t}$.

Ex (hypercube). $\Omega=\{0,1\}^n$, $|\Omega|=2^n$, and

$$P(x,y)=\begin{cases}\frac12 & y=x\\ \frac{1}{2n} & \|y-x\|_1=1.\end{cases}$$

The coupling: pick a uniformly random coordinate $i\in[n]$ and a uniformly random bit $b_t\in\{0,1\}$, and set $X_{t+1}(i)=Y_{t+1}(i)=b_t$ (the other coordinates are unchanged). The chains agree once every coordinate has been selected at least once, so by the coupon collector argument $\mathbb{E}(\tau_{couple})\le n\ln n+O(n)$.
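
A simulation sketch of this hypercube coupling (starting the two chains from complementary states, an illustrative choice); the mean coupling time should land near the coupon-collector value $n\ln n$:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 10

def coupling_time():
    x = rng.integers(0, 2, size=n)            # X_0
    y = 1 - x                                 # Y_0 is the complement of X_0
    t = 0
    while np.any(x != y):
        i = rng.integers(n)                   # same coordinate for both chains
        b = rng.integers(2)                   # same bit for both chains
        x[i] = y[i] = b
        t += 1
    return t

times = [coupling_time() for _ in range(2000)]
print("mean coupling time:", np.mean(times), " n ln n =", n * np.log(n))
```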
