---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.17.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
---
```{code-cell} ipython3
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import brentq, minimize_scalar
from scipy.stats import beta as beta_dist
import pandas as pd
from IPython.display import display, Math
import quantecon as qe
```

## Likelihood Ratio Process

**Remark:** Think about how laws of large numbers are applied to compute error probabilities for the model selection problem and the classification problem.

## Special case: Markov chain models

Consider two $n$-state irreducible and aperiodic Markov chain models on the same state space $\{1, 2, \ldots, n\}$ with positive transition matrices $P^{(f)}$, $P^{(g)}$ and initial distributions $\pi_0^{(f)}$, $\pi_0^{(g)}$.

In this section, we assume nature chooses $f$.

For a sample path $(x_0, x_1, \ldots, x_T)$, let $N_{ij}$ count transitions from state $i$ to $j$.

The likelihood under model $m \in \{f, g\}$ is

$$
L_T^{(m)} = \pi_{0,x_0}^{(m)} \prod_{i=1}^n \prod_{j=1}^n \left(P_{ij}^{(m)}\right)^{N_{ij}}
$$

Hence,

$$
\log L_T^{(m)} =\log\pi_{0,x_0}^{(m)} +\sum_{i,j}N_{ij}\log P_{ij}^{(m)}
$$

The log-likelihood ratio is

$$
\log \frac{L_T^{(f)}}{L_T^{(g)}} = \log \frac{\pi_{0,x_0}^{(f)}}{\pi_{0,x_0}^{(g)}} + \sum_{i,j}N_{ij}\log \frac{P_{ij}^{(f)}}{P_{ij}^{(g)}}
$$ (eq:llr_markov)
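As a concrete sketch, equation {eq}`eq:llr_markov` can be evaluated by tallying the transition counts $N_{ij}$ along a path (the 2-state matrices `Q_f`, `Q_g` below are illustrative, not taken from this lecture):

```{code-cell} ipython3
import numpy as np

def llr_from_counts(path, Q_f, Q_g, q0_f, q0_g):
    """Evaluate log(L_T^f / L_T^g) via the transition counts N_ij."""
    n = Q_f.shape[0]
    N = np.zeros((n, n))
    for i, j in zip(path[:-1], path[1:]):
        N[i, j] += 1
    # initial-distribution term plus the transition-count term
    return (np.log(q0_f[path[0]] / q0_g[path[0]])
            + np.sum(N * np.log(Q_f / Q_g)))

Q_f = np.array([[0.8, 0.2], [0.4, 0.6]])
Q_g = np.array([[0.6, 0.4], [0.5, 0.5]])
q0 = np.array([0.5, 0.5])
path = np.array([0, 0, 1, 0, 1, 1])
print(llr_from_counts(path, Q_f, Q_g, q0, q0))
```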

### KL divergence rate

By the ergodic theorem for irreducible, aperiodic Markov chains, we have

$$
\frac{N_{ij}}{T} \xrightarrow{a.s.} \pi_i^{(f)}P_{ij}^{(f)} \quad \text{as } T \to \infty
$$

where $\boldsymbol{\pi}^{(f)}$ is the stationary distribution satisfying $\boldsymbol{\pi}^{(f)} = \boldsymbol{\pi}^{(f)} P^{(f)}$.

Therefore,

$$
\frac{1}{T}\log \frac{L_T^{(f)}}{L_T^{(g)}} = \frac{1}{T}\log \frac{\pi_{0,x_0}^{(f)}}{\pi_{0,x_0}^{(g)}} + \frac{1}{T}\sum_{i,j}N_{ij}\log \frac{P_{ij}^{(f)}}{P_{ij}^{(g)}}
$$

Taking the limit as $T \to \infty$, we have
- The first term: $\frac{1}{T}\log \frac{\pi_{0,x_0}^{(f)}}{\pi_{0,x_0}^{(g)}} \to 0$
- The second term: $\frac{1}{T}\sum_{i,j}N_{ij}\log \frac{P_{ij}^{(f)}}{P_{ij}^{(g)}} \xrightarrow{a.s.} \sum_{i,j}\pi_i^{(f)}P_{ij}^{(f)}\log \frac{P_{ij}^{(f)}}{P_{ij}^{(g)}}$
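The almost-sure convergence of $N_{ij}/T$ can be checked numerically: simulate one long path and compare the empirical transition frequencies with $\pi_i P_{ij}$. A quick sketch with an illustrative 2-state chain (the matrix `Q` is ours, not from the lecture):

```{code-cell} ipython3
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# stationary distribution: solve π(Q - I) = 0 together with Σ_i π_i = 1
A = np.vstack([(Q.T - np.eye(2))[:-1], np.ones(2)])
π_Q = np.linalg.solve(A, np.array([0.0, 1.0]))

# simulate a long path and tally the transition counts N_ij
T_long = 100_000
path = np.zeros(T_long + 1, dtype=int)
for t in range(T_long):
    path[t + 1] = rng.choice(2, p=Q[path[t]])
N = np.zeros((2, 2))
for i, j in zip(path[:-1], path[1:]):
    N[i, j] += 1

print(N / T_long)        # empirical transition frequencies
print(π_Q[:, None] * Q)  # theoretical limit π_i Q_ij
```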

Define the **KL divergence rate** as

$$
h_{KL}(f, g) = \sum_{i=1}^n \pi_i^{(f)} \underbrace{\sum_{j=1}^n P_{ij}^{(f)} \log \frac{P_{ij}^{(f)}}{P_{ij}^{(g)}}}_{=: KL(P_{i\cdot}^{(f)}, P_{i\cdot}^{(g)})}
$$

where $KL(P_{i\cdot}^{(f)}, P_{i\cdot}^{(g)})$ is the row-wise KL divergence.
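To make the definition concrete, here is $h_{KL}$ assembled term by term for a pair of illustrative 2-state chains (the matrices `Q_f`, `Q_g` are ours, not the lecture's example):

```{code-cell} ipython3
import numpy as np

Q_f = np.array([[0.8, 0.2], [0.4, 0.6]])
Q_g = np.array([[0.6, 0.4], [0.5, 0.5]])

# stationary distribution of Q_f: solve π = πQ_f with Σ_i π_i = 1
A = np.vstack([(Q_f.T - np.eye(2))[:-1], np.ones(2)])
π_f2 = np.linalg.solve(A, np.array([0.0, 1.0]))

# row-wise divergences KL(Q_f[i, :], Q_g[i, :])
row_kl = np.sum(Q_f * np.log(Q_f / Q_g), axis=1)

# weight each row by its stationary probability
h_kl = π_f2 @ row_kl
print(π_f2, row_kl, h_kl)
```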


By the ergodic theorem, we have

$$
\frac{1}{T}\log \frac{L_T^{(f)}}{L_T^{(g)}} \xrightarrow{a.s.} h_{KL}(f, g) \quad \text{as } T \to \infty
$$

Taking expectations and applying the dominated convergence theorem, we obtain

$$
\frac{1}{T}E_f\left[\log \frac{L_T^{(f)}}{L_T^{(g)}}\right] \to h_{KL}(f, g) \quad \text{as } T \to \infty
$$

Here we invite readers to pause and compare this result with {eq}`eq:kl_likelihood_link`.

Let's confirm this in the simulation below.

### Simulations

Let's implement simulations to illustrate these concepts with a three-state Markov chain.

We start by writing functions to compute the stationary distribution and the KL divergence rate for Markov chain models.

```{code-cell} ipython3
def compute_stationary_dist(P):
    """
    Compute the stationary distribution of transition matrix P
    """
    # left eigenvector of P associated with the unit eigenvalue
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    idx = np.argmax(np.abs(eigenvalues))
    stationary = np.real(eigenvectors[:, idx])
    return stationary / stationary.sum()

def markov_kl_divergence(P_f, P_g, pi_f):
    """
    Compute the KL divergence rate between two Markov chains
    """
    # the rate is infinite if P_f puts mass where P_g does not
    if np.any((P_f > 0) & (P_g == 0)):
        return np.inf

    valid_mask = (P_f > 0) & (P_g > 0)

    log_ratios = np.zeros_like(P_f)
    log_ratios[valid_mask] = np.log(P_f[valid_mask] / P_g[valid_mask])

    # weight by stationary probabilities and sum
    kl_rate = np.sum(pi_f[:, np.newaxis] * P_f * log_ratios)

    return kl_rate

def simulate_markov_chain(P, pi_0, T, N_paths=1000):
    """
    Simulate N_paths sample paths from a Markov chain
    """
    mc = qe.MarkovChain(P, state_values=None)

    # draw initial states from the initial distribution pi_0
    initial_states = np.random.choice(len(P), size=N_paths, p=pi_0)

    paths = np.zeros((N_paths, T+1), dtype=int)

    for i in range(N_paths):
        paths[i, :] = mc.simulate(T+1, init=initial_states[i])

    return paths


def compute_likelihood_ratio_markov(paths, P_f, P_g, π_0_f, π_0_g):
    """
    Compute the likelihood ratio process for Markov chain paths
    """
    N_paths, T_plus_1 = paths.shape
    T = T_plus_1 - 1
    L_ratios = np.ones((N_paths, T+1))

    # ratio of initial distributions at the starting state
    L_ratios[:, 0] = π_0_f[paths[:, 0]] / π_0_g[paths[:, 0]]

    # update with the ratio of transition probabilities at each step
    for t in range(1, T+1):
        prev_states = paths[:, t-1]
        curr_states = paths[:, t]

        transition_ratios = (P_f[prev_states, curr_states]
                             / P_g[prev_states, curr_states])
        L_ratios[:, t] = L_ratios[:, t-1] * transition_ratios

    return L_ratios
```

Now let's create an example with two different 3-state Markov chains

```{code-cell} ipython3
P_f = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.5, 0.2],
                [0.1, 0.3, 0.6]])

P_g = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.6, 0.2],
                [0.2, 0.2, 0.6]])

# Compute stationary distributions
π_f = compute_stationary_dist(P_f)
π_g = compute_stationary_dist(P_g)

print(f"Stationary distribution (f): {π_f}")
print(f"Stationary distribution (g): {π_g}")

# Compute KL divergence rates
kl_rate_fg = markov_kl_divergence(P_f, P_g, π_f)
kl_rate_gf = markov_kl_divergence(P_g, P_f, π_g)

print(f"\nKL divergence rate h(f, g): {kl_rate_fg:.4f}")
print(f"KL divergence rate h(g, f): {kl_rate_gf:.4f}")
```
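As a cross-check on `compute_stationary_dist` (a sketch using an independent method), the stationary distribution can also be obtained by solving the linear system $\boldsymbol{\pi}(P - I) = 0$ together with $\sum_i \pi_i = 1$, and the fixed-point property can then be confirmed directly:

```{code-cell} ipython3
import numpy as np

P_f = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.5, 0.2],
                [0.1, 0.3, 0.6]])
P_g = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.6, 0.2],
                [0.2, 0.2, 0.6]])

def stationary_linear(P):
    """Solve π(P - I) = 0 with Σ_i π_i = 1 as a linear system."""
    n = len(P)
    A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

for P in (P_f, P_g):
    π = stationary_linear(P)
    # π is a fixed point of the chain and a valid distribution
    print(π, np.allclose(π @ P, π))
```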

We are now ready to simulate paths and visualize how likelihood ratios evolve.

We'll verify that $\frac{1}{T}E_f\left[\log \frac{L_T^{(f)}}{L_T^{(g)}}\right] \to h_{KL}(f, g)$ when the chain starts from the stationary distribution by plotting both the empirical average and the line predicted by the theory

```{code-cell} ipython3
T = 500
N_paths = 1000
paths_from_f = simulate_markov_chain(P_f, π_f, T, N_paths)

L_ratios_f = compute_likelihood_ratio_markov(paths_from_f,
                                             P_f, P_g,
                                             π_f, π_g)

plt.figure(figsize=(10, 6))

# Plot individual paths
n_show = 50
for i in range(n_show):
    plt.plot(np.log(L_ratios_f[i, :]),
             alpha=0.3, color='blue', lw=0.8)

# Compute theoretical expectation
theory_line = kl_rate_fg * np.arange(T+1)
plt.plot(theory_line, 'k--', linewidth=2.5,
         label=r'$T \times h_{KL}(f,g)$')

# Compute empirical mean
avg_log_L = np.mean(np.log(L_ratios_f), axis=0)
plt.plot(avg_log_L, 'r-', linewidth=2.5,
         label='empirical average', alpha=0.5)

plt.axhline(y=0, color='gray',
            linestyle='--', alpha=0.5)
plt.xlabel(r'$T$')
plt.ylabel(r'$\log L_T$')
plt.title('nature = $f$')
plt.legend()
plt.show()
```

Let's examine how the model selection error probability depends on the sample size, using the same simulation strategy as in the previous section

```{code-cell} ipython3
def compute_selection_error(T_values, P_f, P_g, π_0_f, π_0_g, N_sim=1000):
    """
    Compute model selection error probability for different sample sizes
    """
    errors = []

    for T in T_values:
        # Simulate from both models
        paths_f = simulate_markov_chain(P_f, π_0_f, T, N_sim//2)
        paths_g = simulate_markov_chain(P_g, π_0_g, T, N_sim//2)

        # Compute likelihood ratios
        L_f = compute_likelihood_ratio_markov(paths_f,
                                              P_f, P_g,
                                              π_0_f, π_0_g)
        L_g = compute_likelihood_ratio_markov(paths_g,
                                              P_f, P_g,
                                              π_0_f, π_0_g)

        # Decision rule: choose f if L_T >= 1
        error_f = np.mean(L_f[:, -1] < 1)   # P(choose g | nature = f)
        error_g = np.mean(L_g[:, -1] >= 1)  # P(choose f | nature = g)

        total_error = 0.5 * (error_f + error_g)
        errors.append(total_error)

    return np.array(errors)

# Compute error probabilities
T_values = np.arange(10, 201, 10)
errors = compute_selection_error(T_values,
                                 P_f, P_g,
                                 π_f, π_g)

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(T_values, errors, linewidth=2)
plt.xlabel('$T$')
plt.ylabel('error probability')
plt.show()
```

## Measuring discrepancies between distributions

A plausible guess is that the ability of a likelihood ratio to distinguish distributions $f$ and $g$ depends on how "different" they are.



