Markov Chains
1. Introduction
Let \(\{X_n,n=0,1,2,...\}\) be a stochastic process that takes on a finite number of possible values. If \(X_n=i\), then the process is said to be in state \(i\) at time \(n\).
We suppose that whenever the process is in state \(i\), there is a fixed probability \(P_{ij}\) that it will next be in state \(j\).
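In symbols, this assumption is the Markov property together with time-homogeneous transition probabilities (the \(P_{ij}\) forming a probability distribution over \(j\)):
\[P\{X_{n+1}=j\mid X_n=i,X_{n-1}=i_{n-1},\ldots,X_1=i_1,X_0=i_0\}=P_{ij},\qquad P_{ij}\ge0,\quad\sum_jP_{ij}=1.\]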
2. Chapman-Kolmogorov Equations
We now define the \(n\)-step transition probabilities \(P_{ij}^n\) to be the probability that a process in state \(i\) will be in state \(j\) after \(n\) additional transitions. That is,
\[P_{ij}^n=P\{X_{n+m}=j\mid X_m=i\},\qquad n\ge0,\ i,j\ge0.\]
The Chapman-Kolmogorov equations tell us that
\[P_{ij}^{n+m}=\sum_kP_{ik}^nP_{kj}^m\qquad\text{for all }n,m\ge0\text{ and all }i,j.\]
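In matrix terms, the \(n\)-step probabilities are entries of the matrix power \(P^n\), and the Chapman-Kolmogorov identity is just matrix multiplication. A minimal sketch with a made-up two-state transition matrix:

```python
import numpy as np

# A made-up two-state transition matrix (rows sum to 1).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

n, m = 3, 5
P_n = np.linalg.matrix_power(P, n)       # n-step transition probabilities
P_m = np.linalg.matrix_power(P, m)       # m-step transition probabilities
P_nm = np.linalg.matrix_power(P, n + m)  # (n+m)-step transition probabilities

# Chapman-Kolmogorov: P^(n+m) = P^n P^m
assert np.allclose(P_nm, P_n @ P_m)
print(P_nm)
```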
3. Classification of States
State \(j\) is said to be accessible from state \(i\) if \(P_{ij}^n>0\) for some \(n\ge0\). Two states \(i\) and \(j\) that are accessible to each other are said to communicate, and we write \(i\lrarr j\).
Obviously, the relation of communication has the following properties:
- if \(i\lrarr j\), then \(j\lrarr i\)
- if \(i\lrarr j\) and \(j\lrarr k\), then \(i\lrarr k\).
Two states that communicate are said to be in the same class. Any two classes of states are either identical or disjoint.
The concept of communication divides the state space into a number of separate classes. The Markov chain is said to be irreducible if there is only one class, that is, if all states communicate with each other. In other words, it is possible to move from any state to any other state.
For any state \(i\), we let \(f_i\) denote the probability that, starting in state \(i\), the process will ever reenter state \(i\). State \(i\) is said to be recurrent if \(f_i=1\) and transient if \(f_i<1\).
Each time the process enters a transient state \(i\), there will be a positive probability \(1-f_i\) that the process will never enter state \(i\) again. Therefore, starting in state \(i\), the probability that the process will be in state \(i\) for exactly \(n\) time periods equals \(f_i^{n-1}(1-f_i),n\ge1\). In other words, starting in state \(i\), the number of time periods that the process will be in state \(i\) has a geometric distribution with finite mean \(1/(1-f_i)\).
A transient state will only be visited a finite number of times. In a finite-state Markov chain, not all states can be transient.
Corollary 3.1 If state \(i\) is recurrent, and state \(i\) communicates with state \(j\), then state \(j\) is recurrent. Conversely, if \(i\) is transient and communicates with state \(j\), then state \(j\) must also be transient.
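For a finite chain this classification can be automated: the communicating classes are the groups of mutually accessible states, and a class is recurrent exactly when it is closed (no transitions leave it). A minimal sketch with a made-up four-state matrix:

```python
import numpy as np

# Hypothetical 4-state chain: states 0 and 1 form a closed (hence recurrent) class,
# while states 2 and 3 can escape to that class and are therefore transient.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.3, 0.7, 0.0, 0.0],
              [0.2, 0.0, 0.5, 0.3],
              [0.0, 0.1, 0.4, 0.5]])

n = len(P)
reach = (P > 0) | np.eye(n, dtype=bool)      # i can reach j in 0 or 1 steps
for _ in range(n):                           # transitive closure: accessibility
    reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)

classes = []
for i in range(n):
    cls = frozenset(j for j in range(n) if reach[i, j] and reach[j, i])
    if cls not in classes:
        classes.append(cls)

for cls in classes:
    # A class is recurrent iff it is closed: no positive transition leaves it.
    closed = all(P[i, j] == 0 for i in cls for j in range(n) if j not in cls)
    print(sorted(cls), "recurrent" if closed else "transient")
```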
4. Limiting Probabilities
State \(i\) is said to have period \(d\) if \(P_{ii}^n=0\) whenever \(n\) is not divisible by \(d\), and \(d\) is the largest integer with this property. A state with period 1 is said to be aperiodic. Periodicity is a class property; that is, if state \(i\) has period \(d\), and \(i\) communicates with \(j\), then \(j\) also has period \(d\).
A second, equivalent characterization of period: the states in a recurrent class are periodic if they can be grouped into \(d>1\) groups so that all transitions from one group lead to the next group (cyclically).
If state \(i\) is recurrent, then it is said to be positive recurrent if, starting in \(i\), the expected time until the process returns to state \(i\) is finite. Positive recurrent, aperiodic states are called ergodic.
Theorem 4.1 For an irreducible ergodic Markov chain, \(\lim_{n\rarr\infin}P_{ij}^n\) exists and is independent of \(i\). Furthermore, letting
\[\pi_j=\lim_{n\rarr\infin}P_{ij}^n,\qquad j\ge0,\]
then \(\pi_j\) is the unique nonnegative solution of
\[\pi_j=\sum_i\pi_iP_{ij},\qquad \sum_j\pi_j=1.\tag{4.1}\]
The equation can be interpreted as follows: the probability of being in state \(j\) is the sum, over all states \(i\), of the probability of being in state \(i\) weighted by the probability of a transition from \(i\) to \(j\).
\(\pi_j\) can also be interpreted as the long-run proportion of time that state \(j\) is visited as the total number of transitions \(N\rarr\infin\).
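Numerically, the \(\pi_j\) can be obtained by solving the linear system of Theorem 4.1; a minimal sketch with a made-up three-state matrix:

```python
import numpy as np

# Made-up irreducible, aperiodic transition matrix.
P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

n = len(P)
# Solve pi P = pi together with sum(pi) = 1 by replacing one balance equation
# with the normalization constraint.
A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
b = np.zeros(n); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi)                                  # stationary probabilities
print(np.linalg.matrix_power(P, 50)[0])    # rows of P^n converge to pi
```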
The workers problem
Consider an organization whose workers are of \(r\) distinct types. Suppose that a worker who is currently type \(i\) will in the next period become type \(j\) with probability \(q_{ij}\) for \(j=1,...,r\) or will leave the organization with probability \(1-\displaystyle\sum_{j=1}^rq_{ij}\). In addition, suppose that new workers are hired each period, and that the numbers of types \(1,...,r\) workers hired are independent Poisson random variables with means \(\lambda_1,...,\lambda_r\). If we let \(\mathrm{\textbf{X}}_n=(X_n(1),...,X_n(r))\), where \(X_n(i)\) is the number of type \(i\) workers in the organization at the beginning of period \(n\), then \(\mathrm{\textbf{X}}_n,n\ge0\) is a Markov chain.
To compute its stationary probability distribution, suppose that the initial state is chosen so that the numbers of workers of the different types are independent Poisson random variables, with \(\alpha_i\) being the mean number of type \(i\) workers. Also, let \(N_j,j=1,...,r\) be the number of new type \(j\) workers hired during the initial period. Now, fix \(i\), and for \(j=1,...,r\), let \(M_i(j)\) be the number of the \(X_0(i)\) type \(i\) workers who become type \(j\) in the next period. Then, by the splitting property of the Poisson distribution, \(N_j\) and \(M_i(j)\), \(i,j=1,\ldots,r\),
are independent Poisson random variables with means
\[E[N_j]=\lambda_j,\qquad E[M_i(j)]=\alpha_iq_{ij}.\]
Consequently, the number of type \(j\) workers at the beginning of the next period, \(N_j+\sum_{i=1}^rM_i(j)\), is Poisson distributed with mean \(\lambda_j+\sum_{i=1}^r\alpha_iq_{ij}\). Hence, if we let \(\alpha_1^o,\ldots,\alpha_r^o\) be the solution of the equations
\[\alpha_j^o=\lambda_j+\sum_{i=1}^r\alpha_i^oq_{ij},\qquad j=1,\ldots,r,\]
then the stationary distribution of the Markov chain is the distribution that takes the number of workers in each type to be independent Poisson random variables with means \(\alpha_1^o,...,\alpha_r^o\). That is,
\[\lim_{n\rarr\infin}P\{\mathrm{\textbf{X}}_n=(k_1,\ldots,k_r)\}=\prod_{j=1}^re^{-\alpha_j^o}\frac{(\alpha_j^o)^{k_j}}{k_j!}.\]
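A short simulation sketch (all parameter values below are made up) that checks the fixed-point means \(\alpha_j^o=\lambda_j+\sum_i\alpha_i^oq_{ij}\) against long-run empirical averages of the chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters: r = 2 worker types.
q = np.array([[0.6, 0.2],     # q[i][j]: prob a type-i worker becomes type j
              [0.1, 0.7]])    # leaving probability is 1 - row sum
lam = np.array([3.0, 5.0])    # Poisson means of new hires per period

# Fixed point: alpha = lam + alpha q  =>  alpha = lam (I - q)^{-1}
alpha = lam @ np.linalg.inv(np.eye(2) - q)
print("fixed-point means:", alpha)

# Simulate the chain and compare long-run average counts with alpha.
x = rng.poisson(alpha)        # start from (approximately) stationarity
totals = np.zeros(2)
periods = 20000
for _ in range(periods):
    next_x = rng.poisson(lam)
    for i in range(2):
        # Each current type-i worker independently becomes type 0, type 1, or leaves.
        moves = rng.multinomial(x[i], [q[i, 0], q[i, 1], 1 - q[i].sum()])
        next_x = next_x + moves[:2]
    x = next_x
    totals += x
print("empirical means: ", totals / periods)
```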
Mean Pattern Times in Markov Chain Generated Data
Consider an irreducible Markov chain \(\{X_n,n\ge0\}\) with transition probabilities \(P_{ij}\) and stationary probabilities \(\pi_j,j\ge0\). Starting in state \(r\), we are interested in determining the expected number of transitions until the pattern \(i_1,i_2,...,i_k\) appears. That is, with
\[N(i_1,i_2,\ldots,i_k)=\min\{n\ge k:X_{n-k+1}=i_1,\ldots,X_n=i_k\},\]
we are interested in
\[E\bigl[N(i_1,i_2,\ldots,i_k)\mid X_0=r\bigr].\]
Let \(\mu(i,i_1)\) be the mean number of transitions for the chain to enter state \(i_1\), given that the initial state is \(i,i\ge0\). The quantities \(\mu(i,i_1)\) can be determined as the solution of the following set of equations, obtained by conditioning on the first transition out of state \(i\):
\[\mu(i,i_1)=1+\sum_{j\ne i_1}P_{ij}\,\mu(j,i_1),\qquad i\ge0.\]
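Since the equations for \(\mu(i,i_1)\) are linear, they are easy to solve numerically; a minimal sketch for a hypothetical three-state chain with target state \(i_1=0\):

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],   # made-up transition matrix
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
target = 0                       # the state i_1 we want to enter

n = len(P)
P_masked = P.copy()
P_masked[:, target] = 0.0        # a transition into the target ends the wait

# mu(i) = 1 + sum_{j != target} P[i][j] * mu(j)   for every starting state i
mu = np.linalg.solve(np.eye(n) - P_masked, np.ones(n))
print(mu)   # mu[target] is the mean recurrence time of the target state
```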
For the Markov chain \(\{X_n,n\ge0\}\) associate a corresponding Markov chain, which we will refer to as the \(k\)-chain, whose state at any time is the sequence of the most recent \(k\) states of the original chain. Let \(\pi(j_1,...,j_k)\) be the stationary probabilities for the \(k\)-chain. Because \(\pi(j_1,...,j_k)\) is the proportion of time that the original Markov chain \(k\) units ago was in state \(j_1\) and the following \(k-1\) states, in sequence, were \(j_2,...,j_k\), we can conclude that
\[\pi(j_1,\ldots,j_k)=\pi_{j_1}P_{j_1,j_2}\cdots P_{j_{k-1},j_k}.\]
Moreover, because the mean number of transitions between successive visits of the \(k\)-chain to the state \(i_1,i_2,...,i_k\) is equal to the inverse of the stationary probability of that state, we have that
\[E[\text{number of transitions between successive appearances of the pattern }i_1,\ldots,i_k]=\frac1{\pi(i_1,\ldots,i_k)}.\]
Let \(A(i_1,...,i_m)\) be the additional number of transitions needed until the pattern appears, given that the first \(m\) transitions have taken the chain into states \(X_1=i_1,...,X_m=i_m\).
Now we consider whether the pattern has overlaps, where we say that the pattern \(i_1,i_2,...,i_k\) has an overlap of size \(j,j<k\), if the sequence of its final \(j\) elements is the same as that of its first \(j\) elements.
Case 1
Later
Proposition 4.1 Let \(\{X_n,n\ge1\}\) be an irreducible Markov chain with stationary probabilities \(\pi_j,j\ge0\), and let \(r\) be a bounded function on the state space. Then, with probability 1,
\[\lim_{N\rarr\infin}\frac{\sum_{n=1}^Nr(X_n)}{N}=\sum_jr(j)\pi_j.\]
Proof If we let \(a_j(N)\) be the amount of time the Markov chain spends in state \(j\) during periods \(1,...,N\), then
\[\sum_{n=1}^Nr(X_n)=\sum_ja_j(N)\,r(j).\]
Since \(a_j(N)/N\rarr\pi_j\), we can divide both sides by \(N\) and let \(N\rarr\infin\).
5. Some Applications
The Gambler's Ruin Problem
A Model for Algorithmic Efficiency
6. Mean Time Spent in Transient States
For transient states \(i\) and \(j\), let \(s_{ij}\) denote the expected number of time periods that the Markov chain is in state \(j\), given that it starts in state \(i\). Let \(\delta_{ij}=1\) when \(i=j\) and let it be 0 otherwise. Conditioning on the initial transition gives
\[s_{ij}=\delta_{ij}+\sum_kP_{ik}s_{kj}=\delta_{ij}+\sum_{k=1}^tP_{ik}s_{kj},\]
where the final equality follows since it is impossible to go from a recurrent to a transient state, implying that \(s_{kj}=0\) when \(k\) is a recurrent state.
Let \(S\) denote the matrix of values \(s_{ij},i,j=1,...,t\), and let \(P_T\) denote the matrix of transition probabilities restricted to the transient states. In matrix form the above equation can be written as
\[S=I+P_TS,\qquad\text{so}\qquad S=(I-P_T)^{-1}.\]
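As a numeric sketch, consider the gambler's ruin chain on \(\{0,1,2,3,4\}\) with win probability \(p=0.4\); states 1, 2, 3 are transient, and \(S=(I-P_T)^{-1}\) gives the expected numbers of visits:

```python
import numpy as np

# Gambler's ruin on {0,1,2,3,4} with win probability p = 0.4;
# states 1, 2, 3 are transient, while 0 and 4 are absorbing (recurrent).
p = 0.4
P_T = np.array([[0.0,   p, 0.0],   # transitions among the transient states 1, 2, 3
                [1 - p, 0.0, p],
                [0.0, 1 - p, 0.0]])

S = np.linalg.inv(np.eye(3) - P_T)
# S[i][j] = expected number of periods spent in transient state j+1,
#           given the chain starts in transient state i+1.
print(S)
```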
7. Branching Processes
Consider a population consisting of individuals able to produce offspring of the same kind. Suppose that each individual will have produced \(j\) new offspring with probability \(P_j,j\ge0\), independently of the numbers produced by other individuals. We suppose that \(P_j<1\) for all \(j\ge0\). The size of the \(n\)th generation is denoted by \(X_n,n=0,1,...\). It follows that \(\{X_n,n=0,1,...\}\) is a Markov chain having as its state space the set of nonnegative integers.
Note that state \(0\) is a recurrent state, since clearly \(P_{00}=1\). Also, if \(P_0>0\), all other states are transient, since \(P_{i0}=P_0^i\), which implies that starting with \(i\) individuals there is a positive probability of at least \(P_0^i\) that no later generation will ever consist of \(i\) individuals. This leads to the important conclusion that, if \(P_0>0\), the population will either die out or its size will converge to infinity.
Let \(\mu=\displaystyle\sum_{j=0}^\infin jP_j\) denote the mean number of offspring of a single individual, and let \(\sigma^2=\displaystyle\sum_{j=0}^\infin(j-\mu)^2P_j\) be the variance of the number of offspring produced by a single individual.
\(X_n\) can be written as \(X_n=\displaystyle\sum_{i=1}^{X_{n-1}}Z_i\), where \(Z_i\) represents the number of offspring produced by the \(i\)th individual of the \((n-1)\)st generation. By conditioning on \(X_{n-1}\), we obtain
\[E[X_n]=E\bigl[E[X_n\mid X_{n-1}]\bigr]=E[\mu X_{n-1}]=\mu E[X_{n-1}].\]
Since \(E[X_0]=1\), the equation yields that \(E[X_n]=\mu^n\).
Similarly, \(\mathrm{Var}(X_n)\) may be obtained by using the conditional variance formula
The law of total variance, or conditional variance formula, states that if \(X\) and \(Y\) are random variables on the same probability space, and the variance of \(Y\) is finite, then
\[\mathrm{Var}(Y) = E[\mathrm{Var}(Y|X)] + \mathrm{Var}(E[Y|X]) \]
Given \(X_{n-1}\), \(X_n\) is just the sum of \(X_{n-1}\) independent random variables each having the distribution \(\{P_j,j\ge0\}\). Hence,
\[E[X_n\mid X_{n-1}]=\mu X_{n-1},\qquad \mathrm{Var}(X_n\mid X_{n-1})=\sigma^2X_{n-1}.\]
Therefore,
\[\mathrm{Var}(X_n)=E\bigl[\sigma^2X_{n-1}\bigr]+\mathrm{Var}(\mu X_{n-1})=\sigma^2\mu^{n-1}+\mu^2\,\mathrm{Var}(X_{n-1}),\]
which, iterated with \(\mathrm{Var}(X_0)=0\), gives
\[\mathrm{Var}(X_n)=\begin{cases}\sigma^2\mu^{n-1}\dfrac{1-\mu^n}{1-\mu},&\mu\ne1\\ n\sigma^2,&\mu=1.\end{cases}\]
Let \(\pi_0\) denote the probability that the population will die out (under the assumption that \(X_0=1\)).
Suppose that \(X_1=j\). Since the families of these \(j\) individuals evolve independently, the probability that all of them eventually die out is \(\pi_0^j\); that is,
\[P\{\text{population dies out}\mid X_1=j\}=\pi_0^j.\]
Thus, conditioning on \(X_1\), the following equation holds:
\[\pi_0=\sum_{j=0}^\infin\pi_0^jP_j.\]
In fact, when \(\mu>1\), it can be shown that \(\pi_0\) is the smallest positive number satisfying this equation.
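Numerically, \(\pi_0\) can be obtained by iterating \(s\mapsto\sum_js^jP_j\) starting from 0, which converges to the smallest positive root; a minimal sketch with a made-up offspring distribution:

```python
# Made-up offspring distribution: P_0 = 0.25, P_1 = 0.25, P_2 = 0.5, so mu = 1.25 > 1.
P = [0.25, 0.25, 0.5]

def g(s):
    """Probability generating function of the offspring distribution."""
    return sum(p * s**j for j, p in enumerate(P))

pi0 = 0.0
for _ in range(1000):          # iterate pi0 <- g(pi0); converges to the smallest root
    pi0 = g(pi0)
print(pi0)                     # extinction probability (here 0.5)
```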
8. Time Reversible Markov Chains
Consider a stationary ergodic Markov chain having transition probabilities \(P_{ij}\) and stationary probabilities \(\pi_i\). Suppose that starting at some time we trace the sequence of states going backward in time. That is, starting at time \(n\), consider the sequence \(X_n,X_{n-1},...\). It turns out that this sequence of states is itself a Markov chain with transition probabilities \(Q_{ij}\) defined by
\[Q_{ij}=P\{X_m=j\mid X_{m+1}=i\}=\frac{P\{X_m=j,X_{m+1}=i\}}{P\{X_{m+1}=i\}}=\frac{\pi_jP_{ji}}{\pi_i}.\]
To prove that the reversed process is indeed a Markov chain, we must verify that
\[P\{X_m=j\mid X_{m+1}=i,X_{m+2},X_{m+3},\ldots\}=P\{X_m=j\mid X_{m+1}=i\}.\]
Since independence is a mutual relationship, saying that the future states \(X_{m+1},X_{m+2},...\), given the present state \(X_m\), are independent of the past state \(X_{m-1}\) is the same as saying that the past states \(X_{m-1},X_{m-2},...\), given the present state \(X_m\), are independent of the future state \(X_{m+1}\).
Thus, the reversed process is also a Markov chain with transition probabilities given by \(Q_{ij}=\displaystyle\frac{P_{ji}\pi_j}{\pi_i}\). If \(Q_{ij}=P_{ij}\) for all \(i,j\), the Markov chain is said to be time reversible, which means that
\[\pi_iP_{ij}=\pi_jP_{ji}\qquad\text{for all }i,j.\tag{8.1}\]
This condition can be interpreted as for all states \(i\) and \(j\), the rate at which the process goes from \(i\) to \(j\) is equal to the rate from \(j\) to \(i\).
Theorem 8.1 A Markov chain with transition probabilities \(P_{ij}\) is said to be time reversible if for any two states \(i\) and \(j\), the equation \(P_{ij}\pi_i=P_{ji}\pi_j\) holds.
If we can find nonnegative numbers, summing to one, that satisfy Equation (8.1), then the Markov chain is time reversible and those numbers are its stationary probabilities. Indeed, summing (8.1) over \(i\) yields
\[\sum_i\pi_iP_{ij}=\pi_j\sum_iP_{ji}=\pi_j,\qquad\sum_j\pi_j=1,\]
which is equivalent to Equation (4.1).
Theorem 8.2 An ergodic Markov chain for which \(P_{ij}=0\) whenever \(P_{ji}=0\) is time reversible if and only if starting in state \(i\), any path back to \(i\) has the same probability as the reversed path. That is, if
\[P_{i,i_1}P_{i_1,i_2}\cdots P_{i_k,i}=P_{i,i_k}P_{i_k,i_{k-1}}\cdots P_{i_1,i}\]
for all states \(i,i_1,...,i_k\).
Proposition 8.1 Consider an irreducible Markov chain with transition probabilities \(P_{ij}\). If we can find positive numbers \(\pi_i,i\ge0\), summing to one, and a transition probability matrix \(Q=[Q_{ij}]\) such that
\[\pi_iP_{ij}=\pi_jQ_{ji}\qquad\text{for all }i,j,\]
then the \(Q_{ij}\) are the transition probabilities of the reversed chain and the \(\pi_i\) are the stationary probabilities both for the original and reversed chain.
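A small sketch that builds the reversed-chain transition matrix \(Q_{ij}=\pi_jP_{ji}/\pi_i\) for a made-up \(P\) and checks the time-reversibility condition \(Q=P\):

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],    # made-up transition matrix
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Stationary probabilities: solve pi P = pi, sum(pi) = 1.
n = len(P)
A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
pi = np.linalg.solve(A, np.r_[np.zeros(n - 1), 1.0])

Q = (pi[None, :] * P.T) / pi[:, None]   # Q[i][j] = pi[j] * P[j][i] / pi[i]
print(Q)
print("time reversible:", np.allclose(Q, P))   # detailed balance: pi_i P_ij = pi_j P_ji
```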
9. Markov Chain Monte Carlo Methods
A Monte Carlo method usually follows the following pattern:
- Define a domain of possible inputs
- Generate inputs randomly from a probability distribution over the domain
- Perform a deterministic computation on the inputs
- Aggregate the results
However, when the inputs are vectors, it is difficult to generate a random vector having the specified probability mass function. In these cases, we can generate a sequence, not of independent random vectors, but of the successive states of a vector-valued Markov chain \(\mathrm{\textbf X}_1,\mathrm{\textbf X}_2,...\) whose stationary probabilities are \(P\{\mathrm{\textbf X}=\mathrm{\textbf x}_j\},j\ge1\).
Let \(b(j),j=1,2,...\) be positive numbers whose sum \(B=\sum_{j=1}^\infin b(j)\) is finite. The following, known as the Hastings-Metropolis algorithm, can be used to generate a time reversible Markov chain whose stationary probabilities are
\[\pi(j)=\frac{b(j)}{B},\qquad j=1,2,\ldots.\]
To begin, let \(Q\) be any specified irreducible Markov transition probability matrix on the integers, with \(q(i,j)\) representing the row \(i\) column \(j\) element of \(Q\). Now define a Markov chain \(\{X_n,n\ge0\}\) as follows. When \(X_n=i\), generate a random variable \(Y\) such that \(P\{Y=j\}=q(i,j),j=1,2,...\). If \(Y=j\), then set \(X_{n+1}\) equal to \(j\) with probability \(\alpha(i,j)\), and set it equal to \(i\) with probability \(1-\alpha(i,j)\). Under these conditions, it is easy to see that the sequence of states constitutes a Markov chain with transition probabilities \(P_{ij}\) given by
\[P_{ij}=q(i,j)\alpha(i,j),\quad j\ne i,\qquad P_{ii}=q(i,i)+\sum_{k\ne i}q(i,k)\bigl(1-\alpha(i,k)\bigr).\]
This Markov chain will be time reversible and have stationary probabilities \(\pi(j)\) if
\[\pi(i)P_{ij}=\pi(j)P_{ji}\qquad\text{for }j\ne i,\]
which is equivalent to
\[\pi(i)q(i,j)\alpha(i,j)=\pi(j)q(j,i)\alpha(j,i).\]
If we take \(\pi(j)=b(j)/B\) and set
\[\alpha(i,j)=\min\!\left(\frac{\pi(j)q(j,i)}{\pi(i)q(i,j)},\,1\right),\]
then the preceding equation is satisfied.
For if \(\alpha(i,j)=\displaystyle\frac{\pi(j)q(j,i)}{\pi(i)q(i,j)}\), then \(\alpha(j,i)=1\); if \(\alpha(i,j)=1\), then \(\alpha(j,i)=\displaystyle\frac{\pi(i)q(i,j)}{\pi(j)q(j,i)}\).
Since \(\pi(j)=b(j)/B\), we have
\[\alpha(i,j)=\min\!\left(\frac{b(j)q(j,i)}{b(i)q(i,j)},\,1\right),\]
which shows that the value of \(B\) is not needed to define the Markov chain.
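A minimal sketch of the Hastings-Metropolis chain for a target on a finite set with made-up unnormalized weights \(b(j)\) and a symmetric random-walk proposal \(q\) (so the \(q\) terms cancel in \(\alpha\)):

```python
import numpy as np

rng = np.random.default_rng(1)

b = np.array([1.0, 4.0, 2.0, 3.0])     # made-up unnormalized weights b(j)
m = len(b)

def propose(i):
    """Symmetric random-walk proposal q(i, j): step to a uniformly random neighbour."""
    return (i + rng.choice([-1, 1])) % m

x = 0
counts = np.zeros(m)
for _ in range(200_000):
    y = propose(x)
    # Acceptance probability alpha(i, j) = min(b(j) q(j, i) / (b(i) q(i, j)), 1);
    # the proposal is symmetric, so the q's cancel and B is never needed.
    if rng.random() < min(b[y] / b[x], 1.0):
        x = y
    counts[x] += 1

print(counts / counts.sum())    # should approach b / b.sum()
print(b / b.sum())
```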
10. Markov Decision Processes
Consider a process with \(M\) possible states; after every transition, an action must be chosen, and the next state of the system is determined according to the transition probabilities \(P_{ij}(a)\). Let \(X_n\) denote the state of the process at time \(n\), and let \(a_n\) denote the action chosen at time \(n\). Then
\[P\{X_{n+1}=j\mid X_0,a_0,X_1,a_1,\ldots,X_n=i,a_n=a\}=P_{ij}(a).\]
By a policy, we mean a rule for choosing actions. We restrict attention to rules for which the action chosen at time \(n\) depends only on \(n\) and the state of the process at time \(n\). On the other hand, we allow the policy to be "randomized"; that is, it may choose actions according to a probability distribution.
Under any given policy \(\beta\), the sequence of states constitutes a Markov chain with transition probabilities \(P_{ij}(\beta)\) defined by
\[P_{ij}(\beta)=\sum_a\beta_i(a)P_{ij}(a),\]
where \(\beta_i(a)\) denotes the probability that the policy chooses action \(a\) when the state is \(i\).
Let us suppose for every choice of a policy \(\beta\), the resultant Markov chain is irreducible.
For any policy \(\beta\), let \(\pi_{ia}\) denote the stationary probability that the process will be in state \(i\) and action \(a\) will be chosen if policy \(\beta\) is employed.
The vector \(\pi=(\pi_{ia})\) must satisfy
- \(\pi_{ia}\ge0\) for all \(i,a\)
- \(\sum_i\sum_a\pi_{ia}=1\)
- \(\sum_a\pi_{ja}=\sum_i\sum_a\pi_{ia}P_{ij}(a)\) for all \(j\)
It turns out that for any policy \(\beta\), a vector \(\pi=(\pi_{ia})\) satisfying these three conditions exists. The converse is also true: suppose that \(\pi=(\pi_{ia})\) is a vector that satisfies the three conditions, and define the policy \(\beta=(\beta_i(a))\) by
\[\beta_i(a)=\frac{\pi_{ia}}{\sum_a\pi_{ia}}.\]
If \(P_{ia}\) denotes the steady-state probability of being in state \(i\) and choosing action \(a\) when policy \(\beta\) is employed, we see that \((P_{ia})\) is the unique solution of
\[P_{ia}\ge0,\qquad\sum_i\sum_aP_{ia}=1,\qquad P_{ja}=\beta_j(a)\sum_i\sum_{a'}P_{ia'}P_{ij}(a').\]
Hence, to show that \(P_{ia}=\pi_{ia}\), we need show that
\[\pi_{ia}\ge0,\qquad\sum_i\sum_a\pi_{ia}=1,\qquad \pi_{ja}=\beta_j(a)\sum_i\sum_{a'}\pi_{ia'}P_{ij}(a').\]
The first two of these follow from the first two conditions above. Since \(\beta_j(a)=\pi_{ja}/\sum_a\pi_{ja}\), the third equation is equivalent to
\[\sum_a\pi_{ja}=\sum_i\sum_{a'}\pi_{ia'}P_{ij}(a'),\]
which is just the third condition. Thus we have shown that a vector \(\pmb\pi=(\pi_{ia})\) satisfies the three conditions if and only if there exists a policy \(\pmb\beta\) such that \(\pi_{ia}\) is equal to the steady-state probability of being in state \(i\) and choosing action \(a\) when \(\pmb\beta\) is used. In fact, that policy is defined by \(\beta_i(a)=\pi_{ia}/\sum_a\pi_{ia}\).
The above conclusion is quite important in the determination of "optimal" policies. Suppose that a reward \(R(i,a)\) is earned whenever action \(a\) is chosen in state \(i\), and let \(R(X_i,a_i)\) denote the reward earned at time \(i\). The expected average reward per unit time under policy \(\pmb\beta\) can be expressed as
\[\lim_{n\rarr\infin}E\!\left[\frac{\sum_{i=1}^nR(X_i,a_i)}{n}\right].\]
The limiting expected reward at time \(n\) equals
\[\lim_{n\rarr\infin}E[R(X_n,a_n)]=\sum_i\sum_a\pi_{ia}R(i,a),\]
which implies that the expected average reward under \(\pmb\beta\) is \(\sum_i\sum_a\pi_{ia}R(i,a)\). Hence, the problem of determining the policy that maximizes the expected average reward is
\[\begin{aligned}\text{maximize}\quad&\sum_i\sum_a\pi_{ia}R(i,a)\\ \text{subject to}\quad&\pi_{ia}\ge0\ \text{for all }i,a,\\ &\sum_i\sum_a\pi_{ia}=1,\\ &\sum_a\pi_{ja}=\sum_i\sum_a\pi_{ia}P_{ij}(a)\ \text{for all }j.\end{aligned}\]
This is a linear program in the variables \(\pi_{ia}\).
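A minimal sketch of solving this linear program with scipy.optimize.linprog, using a made-up two-state, two-action example (all transition probabilities and rewards below are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Made-up MDP with 2 states and 2 actions.
# P[a][i][j] = transition probability from i to j when action a is chosen.
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.9, 0.1]]])
# R[i][a] = reward for choosing action a in state i.
R = np.array([[1.0, 2.0],
              [0.5, 3.0]])

n_states, n_actions = 2, 2
# Decision variables pi_{ia}, flattened with index i * n_actions + a.
c = -R.flatten()                          # linprog minimizes, so negate the reward

A_eq = [np.ones(n_states * n_actions)]    # sum_{i,a} pi_{ia} = 1
b_eq = [1.0]
for j in range(n_states):                 # sum_a pi_{ja} = sum_{i,a} pi_{ia} P_{ij}(a)
    row = np.zeros(n_states * n_actions)
    for i in range(n_states):
        for a in range(n_actions):
            row[i * n_actions + a] -= P[a][i][j]
            if i == j:
                row[i * n_actions + a] += 1.0
    A_eq.append(row)
    b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
pi = res.x.reshape(n_states, n_actions)
beta = pi / pi.sum(axis=1, keepdims=True)  # optimal policy beta_i(a)
print("optimal average reward:", -res.fun)
print("optimal policy:\n", beta)
```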
11. Hidden Markov Chains
Let \(\{X_n,n=1,2,...\}\) be a Markov chain with transition probabilities \(P_{i,j}\) and initial state probabilities \(p_i=P(X_1=i),i\ge0\). Suppose that there is a finite set \(\mathscr I\) of signals, and that a signal from \(\mathscr I\) is emitted each time the Markov chain enters a state. Further, suppose that when the Markov chain enters state \(j\) then, independently of previous Markov chain states, the signal emitted is \(s\) with probability \(p(s|j),\displaystyle\sum_{s\in\mathscr I}p(s|j)=1\).
A model of the preceding type in which the sequence of signals \(S_1,S_2,...\) is observed, while the sequence of the underlying Markov chain states \(X_1,X_2,...\) is unobserved, is called a hidden Markov chain. That is, if \(S_n\) represents the \(n\)th signal emitted, then
\[P\{S_1=s\mid X_1=j\}=p(s|j),\qquad P\{S_n=s\mid X_1,S_1,\ldots,X_{n-1},S_{n-1},X_n=j\}=p(s|j).\]
Let \(S^n=(S_1,S_2,...,S_n)\) denote the random vector of the first \(n\) signals. For a fixed sequence of signals \(s_1,s_2,...,s_n\), let \(\mathrm{\pmb s}_k=(s_1,...,s_k),k\le n\). To begin, we determine the conditional probability of the Markov chain state at time \(n\) given that \(S^n=\mathrm{\pmb s}_n\). Let
\[F_n(j)=P\{S^n=\mathrm{\pmb s}_n,X_n=j\}.\]
Then
\[P\{X_n=j\mid S^n=\mathrm{\pmb s}_n\}=\frac{P\{S^n=\mathrm{\pmb s}_n,X_n=j\}}{P\{S^n=\mathrm{\pmb s}_n\}}=\frac{F_n(j)}{\sum_iF_n(i)}.\]
To compute \(F_n(j)\), condition on the state at time \(n-1\):
\[F_n(j)=\sum_iP\{S^{n-1}=\mathrm{\pmb s}_{n-1},X_{n-1}=i,X_n=j,S_n=s_n\}.\]
Here we used that, given \(X_{n-1}=i\) and the earlier states and signals, the chain next moves to state \(j\) and emits signal \(s_n\) with probability \(P_{ij}\,p(s_n|j)\). So
\[F_n(j)=p(s_n|j)\sum_iF_{n-1}(i)P_{ij}.\]
Thus, starting with \(F_1(i)=P\{X_1=i,S_1=s_1\}=p_i\,p(s_1|i)\), we can recursively determine \(F_2(i),F_3(i),\ldots\), up to \(F_n(i)\).
This computation of \(P\{S^n=\mathrm{\pmb s}_n\}\) by recursively determining the functions \(F_k(i)\) is known as the forward approach. There is also a backward approach, which is based on the quantities \(B_k(i)\) defined by
\[B_k(i)=P\{S_{k+1}=s_{k+1},\ldots,S_n=s_n\mid X_k=i\}.\]
A recursive formula for \(B_k(i)\) can be obtained by conditioning on \(X_{k+1}\):
\[B_k(i)=\sum_jP_{ij}\,p(s_{k+1}|j)B_{k+1}(j).\tag{11.2}\]
Starting with
\[B_{n-1}(i)=P\{S_n=s_n\mid X_{n-1}=i\}=\sum_jP_{ij}\,p(s_n|j),\]
we would then use Equation (11.2) to determine the function \(B_{n-2}(i)\), then recursively down to \(B_1(i)\). This would then yield \(P\{S^n=\mathrm{\pmb s}_n\}\) via
\[P\{S^n=\mathrm{\pmb s}_n\}=\sum_ip_i\,p(s_1|i)B_1(i).\]
Another approach to obtaining \(P\{S^n=\mathrm{\pmb s}_n\}\) is to combine the forward and backward approaches. Suppose that for some \(k\) we have computed both functions \(F_k(j)\) and \(B_k(j)\). Because
\[F_k(j)B_k(j)=P\{S^n=\mathrm{\pmb s}_n,X_k=j\},\]
we see that
\[P\{S^n=\mathrm{\pmb s}_n\}=\sum_jF_k(j)B_k(j).\]
Now we may simultaneously compute the sequence of forward functions, starting with \(F_1\), as well as the sequence of backward functions, starting at \(B_{n-1}\). The parallel computations can then be stopped once we have computed both \(F_{k}\) and \(B_k\) at some \(k\).
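A compact sketch of the forward and backward recursions for a made-up two-state chain emitting binary signals; both passes give the same value of \(P\{S^n=\mathrm{\pmb s}_n\}\):

```python
import numpy as np

# Made-up hidden Markov chain: 2 states, signals {0, 1}.
P  = np.array([[0.7, 0.3],    # P[i][j]: state transition probabilities
               [0.2, 0.8]])
E  = np.array([[0.9, 0.1],    # E[j][s] = p(s | j): signal emission probabilities
               [0.3, 0.7]])
p0 = np.array([0.5, 0.5])     # initial state probabilities p_i
s  = [0, 1, 1, 0, 1]          # observed signal sequence s_1, ..., s_n
n  = len(s)

# Forward: F_k(j) = P{S^k = s_k, X_k = j}
F = np.zeros((n, 2))
F[0] = p0 * E[:, s[0]]
for k in range(1, n):
    F[k] = (F[k - 1] @ P) * E[:, s[k]]

# Backward: B_k(i) = P{S_{k+1} = s_{k+1}, ..., S_n = s_n | X_k = i}
B = np.ones((n, 2))
for k in range(n - 2, -1, -1):
    B[k] = P @ (E[:, s[k + 1]] * B[k + 1])

print(F[-1].sum())                        # P{S^n = s_n} from the forward pass
print((p0 * E[:, s[0]] * B[0]).sum())     # same value from the backward pass
print(F[2] * B[2] / F[-1].sum())          # P{X_3 = j | S^n = s_n}, combining both
```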
11.1 Predicting the States
Suppose that the first \(n\) observed signals are \(\mathrm{\pmb s}_n=(s_1,\ldots,s_n)\), and that, given this data, we want to predict the first \(n\) states of the Markov chain. The best predictor depends on what we are trying to accomplish. If our objective is to maximize the expected number of correctly predicted states, then for each \(k=1,\ldots,n\) we compute \(P(X_k=j|S^n=\mathrm{\pmb s}_n)\) for every state \(j\) and let the value of \(j\) that maximizes this quantity be the predictor of \(X_k\).
If we consider a sequence of states as an entity, then our objective is to choose the sequence of states whose conditional probability, given the sequence of signals, is maximal. Letting \(\mathrm{\pmb X}_k=(X_1,...,X_k)\) be the vector of the first \(k\) states, the problem of interest is to find the sequence of states \(i_1,...,i_n\) that maximizes \(P\{\mathrm{\pmb X}_n=(i_1,...,i_n)|S^n=\mathrm{\pmb s}_n\}\). Because
\[P\{\mathrm{\pmb X}_n=(i_1,\ldots,i_n)\mid S^n=\mathrm{\pmb s}_n\}=\frac{P\{\mathrm{\pmb X}_n=(i_1,\ldots,i_n),S^n=\mathrm{\pmb s}_n\}}{P\{S^n=\mathrm{\pmb s}_n\}},\]
and the denominator does not depend on the state sequence, this is equivalent to finding the sequence \(i_1,...,i_n\) that maximizes \(P\{\mathrm{\pmb X}_n=(i_1,...,i_n),S^n=\mathrm{\pmb s}_n\}\).
Let, for \(k\le n\),
\[V_k(j)=\max_{i_1,\ldots,i_{k-1}}P\{\mathrm{\pmb X}_{k-1}=(i_1,\ldots,i_{k-1}),X_k=j,S^k=\mathrm{\pmb s}_k\}.\]
To solve for \(V_k(j)\) recursively, note that
\[V_k(j)=p(s_k|j)\max_iP_{ij}V_{k-1}(i),\qquad k>1,\]
starting with \(V_1(j)=P\{X_1=j,S_1=s_1\}=p_j\,p(s_1|j)\).
To obtain the maximizing sequence of states, we work in the reverse direction. Let \(j_n\) be the value that maximizes \(V_n(j)\). Also, for \(k<n\), let \(i_k(j)\) be a value of \(i\) that maximizes \(P_{ij}V_k(i)\). Then
\[\max_{i_1,\ldots,i_n}P\{\mathrm{\pmb X}_n=(i_1,\ldots,i_n),S^n=\mathrm{\pmb s}_n\}=V_n(j_n)=p(s_n|j_n)\,P_{i_{n-1}(j_n),\,j_n}V_{n-1}(i_{n-1}(j_n)),\]
so \(j_n\) is the final state of a maximizing sequence.
Thus, \(i_{n-1}(j_n)\) is the next-to-last state of the maximizing sequence. Continuing in this manner, the state before that is \(i_{n-2}(i_{n-1}(j_n))\), and so on.
The preceding approach to finding the most likely sequence of states given a prescribed sequence of signals is known as the Viterbi Algorithm.
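A minimal sketch of the Viterbi recursion \(V_k(j)=p(s_k|j)\max_iP_{ij}V_{k-1}(i)\) with backtracking, using the same made-up model as in the forward-backward sketch above:

```python
import numpy as np

P  = np.array([[0.7, 0.3],    # state transition probabilities
               [0.2, 0.8]])
E  = np.array([[0.9, 0.1],    # E[j][s] = p(s | j)
               [0.3, 0.7]])
p0 = np.array([0.5, 0.5])     # initial state probabilities
s  = [0, 1, 1, 0, 1]          # observed signals
n  = len(s)

V = np.zeros((n, 2))
argmax_prev = np.zeros((n, 2), dtype=int)   # i_k(j): best previous state
V[0] = p0 * E[:, s[0]]
for k in range(1, n):
    for j in range(2):
        scores = P[:, j] * V[k - 1]          # P_{ij} V_{k-1}(i) for each i
        argmax_prev[k, j] = np.argmax(scores)
        V[k, j] = E[j, s[k]] * scores.max()

# Backtrack from the best final state to recover the most likely state sequence.
path = [int(np.argmax(V[-1]))]
for k in range(n - 1, 0, -1):
    path.append(int(argmax_prev[k, path[-1]]))
path.reverse()
print(path, V[-1].max())
```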