Markov chains are an essential starting point in the study of stochastic processes. A typical example is a random walk, where the transition probabilities of going from one state to the next depend only on the current state (location), with no dependence on the past. That is, we can discard the past history as we continue to take steps forward along the time horizon.

Let's look at some examples in action and build some intuition and theory.

## Risk Classification of Insured Drivers

Suppose an automobile insurance company classifies its insureds into three levels of risk: high, medium, low (where low-risk drivers are preferred). This insurer categorizes drivers yearly, and drivers only move between adjacent risk levels. A stochastic model finds that 40% of insureds improve from high risk to medium, 30% at medium risk improve to low risk and 10% move to high risk, and 20% of low-risk insureds move down to medium risk. Find the distribution of drivers in the three risk categories after 20 years.

For simplicity, let's label $A$ as low risk, $B$ as medium risk, and $C$ as high risk. Then our transition probabilities, where the row label indicates the current state and the column label indicates next year's state, are:

\begin{align*}
\begin{matrix}
& A & B & C \\
A & 0.8 & 0.2 & 0 \\
B & 0.3 & 0.6 & 0.1 \\
C & 0 & 0.4 & 0.6
\end{matrix}
\end{align*}

where this reflects that the probabilities in each row sum to 1, all entries are nonnegative, and the probability of skipping over risk level $B$ is zero as given in the scenario.

Then we can compactly put this into a "transition matrix" as:

$$P := \begin{pmatrix} 0.8 & 0.2 & 0 \\ 0.3 & 0.6 & 0.1 \\ 0 & 0.4 & 0.6 \end{pmatrix}$$

where $P(i,j)$, the matrix element of the $i$th row and $j$th column, is the transition probability of going from state $i$ to $j$ under this 3-state Markov chain of driver risk classification.

For projection purposes, we may want to know the long-run behavior of drivers and how they place into these three risk classes. Using this stochastic model, per the Chapman-Kolmogorov equations, we can simply put this into R and compute high powers of this matrix $P$.

We define a quick recursive matrix power function `matpow` to do this correctly (note that `P^n` in R raises entries elementwise, so we need `%*%` for true matrix powers):

```r
matpow <- function(P, n) {
  if (n == 1) {
    return(P)
  } else {
    return(P %*% matpow(P, n - 1))  # %*% denotes matrix multiplication
  }
}
```


Now we can work with our matrix:

```r
> P <- matrix(data = c(
+   c(0.8, 0.2, 0),
+   c(0.3, 0.6, 0.1),
+   c(0, 0.4, 0.6)
+ ), byrow = TRUE, nrow = 3, ncol = 3, dimnames = list(c('A','B','C'), c('A','B','C')))
> matpow(P, 20)
       A      B       C
A 0.5456 0.3636 0.09084
B 0.5454 0.3637 0.09095
C 0.5451 0.3638 0.09113
> matpow(P, 21)
       A      B       C
A 0.5455 0.3636 0.09087
B 0.5454 0.3637 0.09094
C 0.5452 0.3637 0.09106
> matpow(P, 100) == matpow(P, 101)
     A    B    C
A TRUE TRUE TRUE
B TRUE TRUE TRUE
C TRUE TRUE TRUE
```


Of course, the rows sum to 1, but notice that we seem to have convergence. In fact, the probabilities after 20 years and those after 21 years are nearly equal. For example,
$$P^{20}(A,A) = 0.5456 \approx 0.5455 = P ^{21}(A,A)$$
And comparing 100 and 101 years, we have exact floating-point equality (see my article on Numerical Analysis, iterations, and machine accuracy). The result we get from `matpow(P, 20)` is certainly sufficient as a reference for modeling how drivers distribute across the risk classifications. To understand this result and this Markov chain a bit better, we consider the stationary distribution $\pi$, a row vector of probabilities for each state. That is,
$$\pi = \begin{pmatrix} \pi(A) & \pi(B) & \pi(C) \end{pmatrix}$$ so that $\pi P = \pi$ gives:
\begin{align*}
\begin{pmatrix} \pi(A) & \pi(B) &  \pi(C)  \end{pmatrix} \begin{pmatrix}  0.8 & 0.2 & 0 \\
0.3 & 0.6 & 0.1 \\
0 & 0.4 & 0.6 \end{pmatrix} &= \begin{pmatrix} \pi(A) & \pi(B) &  \pi(C)  \end{pmatrix}
\end{align*}

Then this gives rise to a system of linear equations:
\begin{align*}
\pi(A) &= 0.8 \pi(A) + 0.3 \pi(B) \\
\pi(B) &= 0.2 \pi(A) + 0.6 \pi(B) + 0.4 \pi(C) \\
\pi(C) &= 0.1 \pi(B) + 0.6 \pi (C) \\
1 &= \pi(A) + \pi(B) + \pi(C),
\end{align*}
where the last equation is from normalization (probabilities in the stationary distribution $\pi$ sum to 1). Solving this system numerically is an alternative to computing high matrix powers, and is as simple as the backslash operator in MATLAB or `solve` in R, or inverting a $3\times 3$ matrix. There are many ways we can use a computer to arrive at the stationary distribution.
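As a sketch of that computational route (my own code; it appends the normalization row and uses base R's `qr.solve` for a least-squares solve of the resulting overdetermined system):

```r
# Stationary distribution: solve pi P = pi together with sum(pi) = 1.
# Transposing pi (I - P) = 0 gives t(I - P) pi^T = 0, plus a normalization row.
P <- matrix(c(0.8, 0.2, 0,
              0.3, 0.6, 0.1,
              0,   0.4, 0.6),
            byrow = TRUE, nrow = 3,
            dimnames = list(c('A', 'B', 'C'), c('A', 'B', 'C')))

A <- rbind(t(diag(3) - P), rep(1, 3))  # 4 equations in 3 unknowns
b <- c(0, 0, 0, 1)
pi_stat <- qr.solve(A, b)              # least-squares solve; exact here since the system is consistent
pi_stat                                # A = 6/11, B = 4/11, C = 1/11
```

The answer matches the limit we observed from `matpow(P, 20)` above: roughly 54.5% low risk, 36.4% medium, 9.1% high.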

## A Golden Staircase

Now let's consider a different example of a stochastic problem.

Suppose we're climbing a staircase with $N$ steps in total, and we can go up $1$ or $2$ steps at a time. We may ask, how many ways can we climb all the steps?

This problem seems innocuous a priori, and it truly isn't bad. But a naive attempt would be met with quite some difficulty. Suppose we have $N := 20$ stairs in total. Obviously, we could take all single steps. Alternatively, we could take all double steps. On the other hand, we could take a mix of single and double steps, and this gets complicated quite quickly. We might find ourselves trying to count all of these! Instead, we'll break down the problem.

Let $w_m$ be the number of ways we can climb $m$ steps (the total $N$ is specified and fixed; we want $w_N$). Then we reason that for any $m \in [2, N]$, just before reaching step $m$, we were either at step $m-1$ or at step $m-2$. Then we can write:

$$w_m = w_{m-1} + w_{m-2}, \forall_{m \geq 2}.$$

Let $m=0$ denote the bottom floor, before ascending the first step. Then we have:

\begin{align*}
w_0 &= 1 \\
w_1 &= 1 \\
w_2 &= w_1 + w_0 = 2 \\
w_3 &= w_2 + w_1 = 3 \\
w_4 &= w_3 + w_2 = 5 \\
w_5 &= w_4 + w_3 = 8 \\
w_6 &= w_5 + w_4 = 13,
\end{align*}

and onwards. Look familiar? Surely we can say that the number of ways to climb these $n$ steps is the $(n+1)$th Fibonacci number, but in a general setting, we may not have such an elegant pattern. Or perhaps we want a stronger, immediate closed-form formula (as opposed to the recursive nature of the Fibonacci numbers). We'll want the help of a generating function.

Let
$$W(s) := \sum_{m \geq 0} w_m s^m$$ be the generating function for $w_m$ with $m \geq 0$. Then the generating function of $w_{m+1}$ for $m \geq 0$ is
$$\sum_{m \geq 0} w_{m+1} s^m = s^{-1}[W(s) - w_0]$$
and the generating function of $w_{m+2}$ for $m \geq 0$ is
$$\sum_{m \geq 0} w_{m+2} s^m = s^{-2}[W(s) - w_0 - s w_1].$$

Then the recursive formula (that of Fibonacci's)
$$w_{m+2} = w_{m+1} + w_m$$ gives that the generating function of $w_{m+2}$ equals the sum of generating functions for $w_{m+1}$ and $w_m$. That is,

$$s^{-2} [ W(s) - w_0 - s w_1 ] = s ^{-1} [ W(s) - w_0] + W(s).$$

Now because we know $w_0 = w_1 = 1$, we can then solve for $W(s)$ to get:

$$W(s) = \frac{-1}{s^2 + s - 1}.$$

At this point, we then ask which sequence we've found that has this generating function. First, we notice that the polynomial $s^2 + s - 1$ in the denominator has two roots: $\frac{\sqrt{5} - 1}{2}$ and $\frac{-\sqrt{5} - 1}{2}$. Recognizing the golden ratio $\varphi$, these two roots are $\{ 1/\varphi, -\varphi \}$, or equivalently $\{ \varphi - 1, -\varphi \}$ since $1/\varphi = \varphi - 1$. After some algebra, we have:

$$w_n = \frac{(1 + \sqrt{5} )^{n+1} - (1 - \sqrt{5}) ^{n+1} }{ 2 ^{n+1} \sqrt{5}},$$

which happens to give the $(n+1)$th Fibonacci number exactly (the form above always returns an integer for natural number input $n$).
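To double-check the closed form against the recursion, here is a quick R sketch (the function names are my own):

```r
# w_m: number of ways to climb m steps taking 1 or 2 steps at a time.
ways_recursive <- function(n) {
  w <- c(1, 1)                     # w_0 = w_1 = 1
  if (n >= 2) for (m in 2:n) w[m + 1] <- w[m] + w[m - 1]
  w[n + 1]
}

# The closed form derived above via the generating function.
ways_closed <- function(n) {
  ((1 + sqrt(5))^(n + 1) - (1 - sqrt(5))^(n + 1)) / (2^(n + 1) * sqrt(5))
}

ways_recursive(6)        # 13, matching the table above
round(ways_closed(20))   # agrees with ways_recursive(20), up to floating point
```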

In this staircase example, we took generating functions for granted. Let's take a deeper look into probability generating functions and how they work.

## Generating Functions

Consider a random variable $X$ over state space $\Omega := \{0,1,2,\dots\} \cup \{\infty \}$, and generating function $\varphi(z) = \mathbb{E} (z^X)$.

Notice that

\begin{align*} \varphi(0) &= \varphi(z) \Big|_{z = 0} \\
&= \sum_{k=0}^{\infty} \mathbb{P} (X = k) z ^k \Big|_{z = 0} \\
&= \mathbb{P}(X = 0),
\end{align*}
so then $\mathbb{P}(X = 0) = \varphi(0)$.

Consider then that

\begin{align*}
\underbrace{\sum_{k = 0}^\infty \mathbb{P}(X = k)}_{= \varphi(1)} + \mathbb{P}(X = \infty) = 1 \end{align*} so then
$$\mathbb{P}(X = \infty) = 1 - \varphi(1).$$

### Expectation ($\mathbb{E}(X)$)

Now we have

\begin{align*}
\mathbb{E} (X) &= \sum_{k=0}^{\infty} k \, \mathbb{P} (X = k), \\
\varphi ' (z) &= \sum_{k=0} ^ \infty k \, \mathbb{P} (X = k) z ^{k-1},
\end{align*}
where we notice that setting $z = 1$ in the second line leaves exactly the sum defining $\mathbb{E} (X)$. So we conclude that

$$\mathbb{E} (X) = \varphi' (1).$$

### Variance ($\text{Var}(X)$)

Now, let's look further into the second derivative of $\varphi(z)$. Writing $p_k := \mathbb{P}(X = k)$, we have
$$\varphi'' (z) = \sum_{k=2}^\infty k (k-1) p_k z ^{k-2},$$
so that taking $z := 1$ gives:
\begin{align*}
\varphi''(1) &= \sum_{k=2} ^\infty (k ^2 p_k - k p_k) \\
&= \sum_{k=2} ^\infty k ^2 p_k - \sum_{k=2} ^\infty k p_k \\
&= \sum_{k =0} ^\infty k ^2 p_k - \sum_{k=0} ^\infty k p_k \\
&= \mathbb{E}(X ^2) - \mathbb{E} (X) \\
&= \mathbb{E} (X ^2) - \varphi'(1),
\end{align*}
so then rearranging gives:
$$\mathbb{E} (X ^2) = \varphi' (1) + \varphi'' (1) .$$

Then our familiar expression for variance gives:
\begin{align*}
\text{Var}(X) &= \mathbb{E} (X ^2) - [\mathbb{E}(X)] ^2 \\
&= \varphi' (1) + \varphi'' (1) - [\varphi'(1) ] ^2
\end{align*}
in terms of our generating function $\varphi$.
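As a quick sanity check of these two formulas (an example of my own choosing), take $X$ to be a constant $c$, so $\varphi(z) = \mathbb{E}(z^c) = z^c$. Then

\begin{align*}
\varphi'(1) &= c = \mathbb{E}(X), \\
\varphi''(1) &= c(c-1), \\
\text{Var}(X) &= c + c(c-1) - c^2 = 0,
\end{align*}

which is exactly what we expect for a constant random variable.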

## Coin Tosses

Let's look at a simple example to practice finding generating functions, from which we can use the above expressions to get the expected value and variance and other measures.

Consider $N$ (independent, identically distributed) coin tosses with probability $p$ of landing heads. Let $X$ be the total number of heads in these $N$ tosses. Also, let $Y$ be the random variable with
$$\mathbb{P} (Y = k) = (1 - p) ^{k-1} p,$$
for $k = 1,2,3,\dots$. Find the generating function of $X$ and that of $Y$.

For $X$, the number of heads in the $N$ tosses, notice that each individual toss is i.i.d. Bernoulli$(p)$, so the total number $X$ is then Binomial$(N,p)$. This follows directly from the counting argument behind the binomial theorem. Then we have:

\begin{align*}
\varphi(z) &:= \mathbb{E} (z ^X) \\
&= \sum_{k=0} ^N \mathbb{P} (X = k) z ^k \\
&= \sum_{k=0} ^N \binom{N}{k} (1 - p) ^{N-k} (pz) ^k \\
&= [ (1 - p) + z p ] ^N \\
&= (q + zp) ^N,
\end{align*}
where $q := 1 - p$ is the probability of landing tails.

Now for $Y$, notice that $Y$ models the number of tosses needed to (finally) get the first head; that is, $Y$ is geometric with parameter $p$. We're given that
$$\mathbb{P} (Y = k) = (1-p) ^{k-1} p$$
so our generating function for $Y$ is

\begin{align*}
\varphi(z) &:= \mathbb{E} (z ^Y) \\
&= \sum_{k=1} ^\infty \mathbb{P} (Y =k ) z ^k \\
&= \sum_{k=1} ^\infty ( 1 - p) ^{k-1} p \cdot z ^k \\
&= \frac{p}{1-p} \left[ \frac{ (1- p) z } {1 - z(1 - p) } \right] \\
&= \frac{pz}{1 - zq}.
\end{align*}
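We can sanity-check both generating functions against the earlier moment formulas by differentiating them numerically at $z = 1$ (a sketch in R; the helper names and step sizes are my own choices):

```r
# Central-difference approximations to the first and second derivatives.
num_d1 <- function(f, x, h = 1e-5) (f(x + h) - f(x - h)) / (2 * h)
num_d2 <- function(f, x, h = 1e-4) (f(x + h) - 2 * f(x) + f(x - h)) / h^2

N <- 10; p <- 0.3; q <- 1 - p
phi_X <- function(z) (q + z * p)^N        # Binomial(N, p) pgf from above
phi_Y <- function(z) p * z / (1 - z * q)  # Geometric(p) pgf from above

mean_X <- num_d1(phi_X, 1)                      # ~ N p
var_X  <- mean_X + num_d2(phi_X, 1) - mean_X^2  # ~ N p q
mean_Y <- num_d1(phi_Y, 1)                      # ~ 1/p
var_Y  <- mean_Y + num_d2(phi_Y, 1) - mean_Y^2  # ~ q / p^2
```

The numerical values agree with the known binomial mean $Np$ and variance $Npq$, and the geometric mean $1/p$ and variance $q/p^2$.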

## Galton-Watson Branching Processes and Extinction Probability

Consider a branching process with $Z_0 := 1$ (where $Z_n$ denotes the size of the $n$th generation) with the branching mechanism:
$$p_0 = \frac{1}{10}, \quad p_1 = \frac{7}{10}, \quad p_2 = \frac{2}{10} .$$
What is the probability of ultimate extinction? What is the expected (mean) size of the $n$th generation? What is the variance of the size of the $n$th generation?

Here, from any given population $Z_i$, the probability of each node extending to (giving birth to) 0, 1, or 2 children is given by $p_0, p_1, p_2$, independently.

Then the generating function of the Galton-Watson branching mechanism is

$$\varphi(z) = \frac{1}{10} z ^0 + \frac{7}{10} z ^1 + \frac{2}{10} z ^2 = \frac{1}{10} ( 1 + 7 z + 2 z ^2 ).$$

The ultimate extinction probability (the probability that the population eventually dies out, with no individuals remaining), denoted by $\varepsilon$, is the smallest solution in $[0, 1]$ of
$$\varphi (z) = z .$$

Then we simply solve $1 + 7z +2 z ^2 = 10 z$, which gives $z = \{ \frac{1}{2}, 1 \}$, and hence the probability of ultimate extinction is $$\varepsilon = \frac{1}{2}.$$
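Numerically, the same answer falls out of the standard fixed-point iteration $z \mapsto \varphi(z)$ started at $0$, since the iterates are exactly $\mathbb{P}(Z_n = 0)$ and increase to $\varepsilon$. A quick R sketch:

```r
phi <- function(z) (1 + 7 * z + 2 * z^2) / 10  # pgf of the branching mechanism

z <- 0
for (i in 1:300) z <- phi(z)  # z is now P(Z_n = 0), which increases to epsilon
z                             # converges to 1/2
```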

Recall that above, we found that $\mathbb{E}(X) = \varphi'(1)$. Then the mean (expected) number $m$ of offspring of a given individual is given by

\begin{align*}
\varphi(z) &= \frac{1}{10} ( 1 + 7z + 2 z ^2 ) \\
\varphi'(z) &= \frac{7}{10} + \frac{4z}{10} \\
m := \varphi'(1) &= \frac{11}{10}
\end{align*}

Then the mean size of the entire $n$th generation (as this is a stochastic process) is

$$\mathbb{E} (Z_n) = m^n = \left( \frac{11}{10} \right) ^ n.$$

We can find the variance similarly.

Now suppose we start with $Z_0 := N$ individuals at the start, where $N \in \mathbb{Z} ^+$. What is the extinction probability, and what is the expected population size at time (generation) $n$?

We note that this variant of the branching process behaves as the superposition of $N$ independent and identically distributed copies of the earlier individual branching process with $Z_0 := 1$. This variant becomes extinct if and only if every one of the $N$ copies becomes extinct. Then by independence, we have that the overall ultimate extinction probability is

$$\varepsilon ^N = \left( \frac{1}{2} \right) ^N .$$

Similarly, the $n$th generation of this variant process is the sum of the populations of the independent generations of the individual $N$ single branching processes. Then the mean size of the $n$th generation here is simply

$$N \left( \frac{11}{10} \right) ^ n .$$

By the same line of reasoning, we have that the variance of the size of the $n$th generation is $N \cdot \sigma_n ^2$, where $\sigma_n ^2$ is the variance of the individual branching process from earlier.
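A Monte Carlo sanity check of these formulas (my own simulation sketch; the horizon $n = 40$ and sample size are arbitrary choices, so agreement is only approximate):

```r
set.seed(1)

# Total offspring of k parents under p0 = 0.1, p1 = 0.7, p2 = 0.2.
offspring <- function(k) sum(sample(0:2, k, replace = TRUE, prob = c(1, 7, 2) / 10))

# Size of generation n starting from Z_0 = N.
simulate_Zn <- function(N, n) {
  Z <- N
  for (i in seq_len(n)) { if (Z == 0) break; Z <- offspring(Z) }
  Z
}

N <- 3; n <- 40
Z <- replicate(2000, simulate_Zn(N, n))
mean(Z == 0)  # compare with (1/2)^3 = 0.125 (for large n)
mean(Z)       # compare with 3 * (11/10)^40
```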

## Stationary Distributions in the Branching Process

Earlier in the driver risk classification example, we found that taking successive powers of the transition probability matrix converged to a stationary distribution $\pi$. This had interesting properties and uses in computation, so it is natural to ask: does a branching process also have a stationary distribution?

If the expected number of offspring is $m \leq 1$ (excluding the degenerate case $p_1 = 1$, where each individual has exactly one child), then we know that the process will enter the absorbing state 0 (the population will eventually go extinct with probability 1).

Then the only stationary distribution $\pi$ is the point mass at the absorbing state, with the properties:

$$\pi(0) = 1, \quad \pi(i) = 0, \quad \forall_{i \geq 1}.$$

On the other hand, if the mean number $m$ of offspring is greater than 1, then we know that its extinction probability is $\varepsilon < 1$. That is,

$$\mathbb{P}_1 (\tau_0 = \infty) = 1 - \varepsilon > 0.$$

However, we know that for all $i$,

$$\mathbb{P}_i ( \tau _0 = \infty ) = 1 - \varepsilon ^ i > 0,$$

and so every nonzero state of this branching process is transient. Hence no stationary distribution can place mass on the states $i \geq 1$, and the point mass at $0$ remains the only stationary distribution.