From 23cbc2303f923198984b40e509f5f87f7780a95c Mon Sep 17 00:00:00 2001 From: thomassargent30 Date: Tue, 29 Jul 2025 17:36:05 -0600 Subject: [PATCH 1/6] Tom's second edits of June 29 --- lectures/likelihood_bayes.md | 131 ++++++++++++--------------- lectures/likelihood_ratio_process.md | 2 +- 2 files changed, 59 insertions(+), 74 deletions(-) diff --git a/lectures/likelihood_bayes.md b/lectures/likelihood_bayes.md index 0f2211f06..5e346b715 100644 --- a/lectures/likelihood_bayes.md +++ b/lectures/likelihood_bayes.md @@ -194,6 +194,8 @@ l_arr_f = simulate(F_a, F_b, N=50000) l_seq_f = np.cumprod(l_arr_f, axis=1) ``` + + ## Likelihood Ratio Process and Bayes’ Law Let $\pi_{t+1}$ be a Bayesian posterior probability defined as @@ -205,7 +207,54 @@ $$ (eq:defbayesposterior) The likelihood ratio process is a principal actor in the formula that governs the evolution of the posterior probability $\pi_t$, an instance of **Bayes' Law**. -Bayes' law is just the following application of the standardformula for conditional probability: +Let's derive a couple of formulas for $\pi_{t+1}$, one in terms of likelihood ratio $l(w_t)$, the other in terms of +$L(w^t)$. 
+ +To begin, define + +Let + +* $f(w^{t+1}) \equiv f(w_1) f(w_2) \cdots f(w_{t+1})$ +* $g(w^{t+1}) \equiv g(w_1) g(w_2) \cdots g(w_{t+1})$ +* $\pi_0 ={\rm Prob}(q=f |{\rm no \ data})$ +* $\pi_t = {\rm Prob}(q=f |w^t)$ + +Then + + +$$ +{\rm Prob}(w^{t+1} |{\rm no \ data}) = \pi_o f(w^{t+1})+ (1 - \pi_0) +$$ + +Probability laws connecting joint and conditional distributions imply that + +$$ +{\rm Prob}(q=f |w^{t+1}) {\rm Prob}(w^{t+1} |{\rm no \ data}) = {\rm Prob}(w^{t+1} |q=f) {\rm Prob}(q=f | {\rm no \ data}) +$$ + +or + +$$ +\pi_{t+1} [\pi_0 f(w^{t+1}) + (1- \pi_0) g(w^{t+1})] = f(w^{t+1}) \pi_0 +$$ + +or + +$$ +\pi_{t+1} = \frac{ f(w^{t+1}) \pi_0 }{\pi_0 f(w^{t+1}) + (1- \pi_0) g(w^{t+1})} +$$ + +Dividing both the numerator and the denominator on the right side of the above equation by $g(w^{t+1})$ implies + +```{math} +:label: eq_Bayeslaw1033 + +\pi_{t+1}=\frac{\pi_{0}L\left(w^{t+1}\right)}{\pi_{0}L\left(w^{t+1}\right)+1-\pi_{0}} . +``` + +We can use a similar line of argument to get a recursive version of formula {eq}`eq_Bayeslaw1033`. + +The laws of probability imply $$ {\rm Prob}(q=f|w^{t+1}) = \frac { {\rm Prob}(q=f|w^{t} ) f(w_{t+1})}{ {\rm Prob}(q=f|w^{t} ) f(w_{t+1}) + (1 - {\rm Prob}(q=f|w^{t} )) g(w_{t+1})} @@ -249,81 +298,17 @@ def update(π, l): return π ``` -Formula {eq}`eq_recur1` can be generalized by iterating on it and thereby deriving an -expression for the time $t$ posterior $\pi_{t+1}$ as a function -of the time $0$ prior $\pi_0$ and the likelihood ratio process -$L(w^{t+1})$ at time $t$. -To begin, notice that the updating rule +Formula {eq}`eq_Bayeslaw1033` generalizes formula {eq}`eq_recur1`. 
-$$ -\pi_{t+1} -=\frac{\pi_{t}\ell \left(w_{t+1}\right)} -{\pi_{t}\ell \left(w_{t+1}\right)+\left(1-\pi_{t}\right)} -$$ - -implies - -$$ -\begin{aligned} -\frac{1}{\pi_{t+1}} - &=\frac{\pi_{t}\ell \left(w_{t+1}\right) - +\left(1-\pi_{t}\right)}{\pi_{t}\ell \left(w_{t+1}\right)} \\ - &=1-\frac{1}{\ell \left(w_{t+1}\right)} - +\frac{1}{\ell \left(w_{t+1}\right)}\frac{1}{\pi_{t}}. -\end{aligned} -$$ - -$$ -\Rightarrow -\frac{1}{\pi_{t+1}}-1 -=\frac{1}{\ell \left(w_{t+1}\right)}\left(\frac{1}{\pi_{t}}-1\right). -$$ - -Therefore - -$$ -\begin{aligned} - \frac{1}{\pi_{t+1}}-1 - =\frac{1}{\prod_{i=1}^{t+1}\ell \left(w_{i}\right)} - \left(\frac{1}{\pi_{0}}-1\right) - =\frac{1}{L\left(w^{t+1}\right)}\left(\frac{1}{\pi_{0}}-1\right). -\end{aligned} -$$ - -Since $\pi_{0}\in\left(0,1\right)$ and -$L\left(w^{t+1}\right)>0$, we can verify that -$\pi_{t+1}\in\left(0,1\right)$. - -After rearranging the preceding equation, we can express $\pi_{t+1}$ as a -function of $L\left(w^{t+1}\right)$, the likelihood ratio process at $t+1$, -and the initial prior $\pi_{0}$ - -```{math} -:label: eq_Bayeslaw103 - -\pi_{t+1}=\frac{\pi_{0}L\left(w^{t+1}\right)}{\pi_{0}L\left(w^{t+1}\right)+1-\pi_{0}} . -``` - -Formula {eq}`eq_Bayeslaw103` generalizes formula {eq}`eq_recur1`. - -```{note} -Fomula {eq}`eq_Bayeslaw103` can also be derived by starting from the formula for conditional probability - -$$ -\pi_{t+1} \equiv {\rm Prob}(q=f|w^{t+1}) = \frac { \pi_0 f(w^{t+1})}{ \pi_0 f(w^{t+1}) + (1 - \pi_0) g(w^{t+1})} -$$ - -and then dividing the numerator and the denominator on the right side by $g(w^{t+1})$. -``` -Formula {eq}`eq_Bayeslaw103` can be regarded as a one step revision of prior probability $\pi_0$ after seeing +Formula {eq}`eq_Bayeslaw1033` can be regarded as a one step revision of prior probability $\pi_0$ after seeing the batch of data $\left\{ w_{i}\right\} _{i=1}^{t+1}$. 
-Formula {eq}`eq_Bayeslaw103` shows the key role that the likelihood ratio process $L\left(w^{t+1}\right)$ plays in determining +Formula {eq}`eq_Bayeslaw1033` shows the key role that the likelihood ratio process $L\left(w^{t+1}\right)$ plays in determining the posterior probability $\pi_{t+1}$. -Formula {eq}`eq_Bayeslaw103` is the foundation for the insight that, because of how the likelihood ratio process behaves +Formula {eq}`eq_Bayeslaw1033` is the foundation for the insight that, because of how the likelihood ratio process behaves as $t \rightarrow + \infty$, the likelihood ratio process dominates the initial prior $\pi_0$ in determining the limiting behavior of $\pi_t$. @@ -417,7 +402,7 @@ for i in range(2): np.abs(π_seq - π_seq_f).max() < 1e-10 ``` -We thus conclude that the likelihood ratio process is a key ingredient of the formula {eq}`eq_Bayeslaw103` for +We thus conclude that the likelihood ratio process is a key ingredient of the formula {eq}`eq_Bayeslaw1033` for a Bayesian's posterior probabilty that nature has drawn history $w^t$ as repeated draws from density $f$. @@ -442,7 +427,7 @@ $x \in (0,1)$ -Thus, the Bayesian prior $\pi_0$ and the sequence of posterior probabilities described by equation {eq}`eq_Bayeslaw103` should **not** be interpreted as the statistician's opinion about the mixing parameter $x$ under the alternative timing protocol in which nature draws from an $x$-mixture of $f$ and $g$. +Thus, the Bayesian prior $\pi_0$ and the sequence of posterior probabilities described by equation {eq}`eq_Bayeslaw1033` should **not** be interpreted as the statistician's opinion about the mixing parameter $x$ under the alternative timing protocol in which nature draws from an $x$-mixture of $f$ and $g$. This is clear when we remember the definition of $\pi_t$ in equation {eq}`eq:defbayesposterior`, which for convenience we repeat here: @@ -574,7 +559,7 @@ Since $KL(m, f) < KL(m, g)$, $f$ is "closer" to the mixture distribution $m$. 
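That ranking of divergences can be spot-checked numerically. The sketch below uses hypothetical Beta parameters as stand-ins for `F_a, F_b, G_a, G_b`, which are set elsewhere in the lecture.

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

# Hypothetical stand-ins for f and g; m is the 50-50 mixture
f = beta(1, 1).pdf
g = beta(3, 1.2).pdf
m = lambda w: 0.5 * f(w) + 0.5 * g(w)

def kl(p, q, eps=1e-8):
    "KL(p, q) = ∫ p(w) log(p(w)/q(w)) dw, trimmed away from the endpoints."
    val, _ = quad(lambda w: p(w) * np.log(p(w) / q(w)),
                  eps, 1 - eps, limit=200)
    return val

kl_mf, kl_mg = kl(m, f), kl(m, g)
print(f"KL(m, f) = {kl_mf:.4f}, KL(m, g) = {kl_mg:.4f}")
```

Under these parameter choices the computed $KL(m, f)$ comes out below $KL(m, g)$, consistent with $f$ being the closer model to the mixture.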
Hence by our discussion on KL divergence and likelihood ratio process in {doc}`likelihood_ratio_process`, $log(L_t) \to \infty$ as $t \to \infty$. -Now looking back to the key equation {eq}`eq_Bayeslaw103`. +Now looking back to the key equation {eq}`eq_Bayeslaw1033`. Consider the function @@ -716,7 +701,7 @@ We can see that the posterior mean of $x$ converges to the true value $x=0.5$. ```{solution-end} ``` -## Behavior of posterior probability $\{\pi_t\}$ under the subjective probability distribution +## Behavior of Posterior Probability $\{\pi_t\}$ Under Subjective Probability Distribution We'll end this lecture by briefly studying what our Baysian learner expects to learn under the subjective beliefs $\pi_t$ cranked out by Bayes' law. diff --git a/lectures/likelihood_ratio_process.md b/lectures/likelihood_ratio_process.md index 624bb8b3c..681a4a504 100644 --- a/lectures/likelihood_ratio_process.md +++ b/lectures/likelihood_ratio_process.md @@ -1464,7 +1464,7 @@ def protocol_2(π_minus_1, T, N=1000): return sequences, true_models_F ``` -**Remark:** Under timing protocol 2, the $\{w_t\}_{t=1}^T$ is a sequence of IID draws from $h(w)$. Under timing protocol 1, the the $\{w_t\}_{t=1}^T$ is +**Remark:** Under timing protocol 2, the $\{w_t\}_{t=1}^T$ is a sequence of IID draws from $h(w)$. Under timing protocol 1, the $\{w_t\}_{t=1}^T$ is not IID. It is **conditionally IID** -- meaning that with probability $\pi_{-1}$ it is a sequence of IID draws from $f(w)$ and with probability $1-\pi_{-1}$ it is a sequence of IID draws from $g(w)$. For more about this, see {doc}`this lecture about exchangeability `. 
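The remark above can be illustrated with a short simulation: under timing protocol 1, nature flips the model coin once per path and then draws IID from the chosen model, while under timing protocol 2 it flips a fresh coin every period, so draws are IID from $h$. The Beta parameters below are hypothetical stand-ins for $f$ and $g$.

```python
import numpy as np

rng = np.random.default_rng(1)
π_minus_1, T, N = 0.5, 200, 2000

# Hypothetical stand-ins for f and g
draws_f = rng.beta(1, 1, size=(N, T))
draws_g = rng.beta(3, 1.2, size=(N, T))

# Protocol 1: one coin flip per path, then IID draws from the chosen model
model_f = rng.random(N) < π_minus_1
paths_1 = np.where(model_f[:, None], draws_f, draws_g)

# Protocol 2: a fresh coin flip every period, i.e. IID draws from h
coin = rng.random((N, T)) < π_minus_1
paths_2 = np.where(coin, draws_f, draws_g)

# Same unconditional mean under both protocols, but path averages are far
# more dispersed under protocol 1 because the hidden model choice persists
print(paths_1.mean(), paths_2.mean())
print(paths_1.mean(axis=1).var(), paths_2.mean(axis=1).var())
```

The near-equal grand means and the much larger cross-path variance of time averages under protocol 1 are exactly the conditional-IID versus IID distinction drawn in the remark.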
We again deploy a **likelihood ratio process** with time $t$ component being the likelihood ratio From 3b1b4f5d05d5b05df4c0a451afad457a22e10099 Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Wed, 30 Jul 2025 16:42:50 +1000 Subject: [PATCH 2/6] updates --- lectures/likelihood_bayes.md | 303 ++--------------------------------- lectures/mix_model.md | 224 +++++++++++++++++++------- 2 files changed, 179 insertions(+), 348 deletions(-) diff --git a/lectures/likelihood_bayes.md b/lectures/likelihood_bayes.md index 5e346b715..262a1f3b8 100644 --- a/lectures/likelihood_bayes.md +++ b/lectures/likelihood_bayes.md @@ -216,20 +216,20 @@ Let * $f(w^{t+1}) \equiv f(w_1) f(w_2) \cdots f(w_{t+1})$ * $g(w^{t+1}) \equiv g(w_1) g(w_2) \cdots g(w_{t+1})$ -* $\pi_0 ={\rm Prob}(q=f |{\rm no \ data})$ +* $\pi_0 ={\rm Prob}(q=f |\emptyset)$ * $\pi_t = {\rm Prob}(q=f |w^t)$ Then $$ -{\rm Prob}(w^{t+1} |{\rm no \ data}) = \pi_o f(w^{t+1})+ (1 - \pi_0) +{\rm Prob}(w^{t+1} |\emptyset) = \pi_0 f(w^{t+1})+ (1 - \pi_0) $$ Probability laws connecting joint and conditional distributions imply that $$ -{\rm Prob}(q=f |w^{t+1}) {\rm Prob}(w^{t+1} |{\rm no \ data}) = {\rm Prob}(w^{t+1} |q=f) {\rm Prob}(q=f | {\rm no \ data}) +{\rm Prob}(q=f |w^{t+1}) {\rm Prob}(w^{t+1} |\emptyset) = {\rm Prob}(w^{t+1} |q=f) {\rm Prob}(q=f | \emptyset) $$ or @@ -410,7 +410,7 @@ $f$. ## Another timing protocol Let's study how the posterior probability $\pi_t = {\rm Prob}(q=f|w^{t}) $ behaves when nature generates the -history $w^t = w_1, w_2, \ldots, w_t$ under a different timing protocol. +history $w^t = \{w_1, w_2, \dots, w_t\}$ under a different timing protocol. Until now we assumed that before time $1$ nature somehow chose to draw $w^t$ as an iid sequence from **either** $f$ **or** $g$. @@ -524,10 +524,7 @@ Why does this happen? Given $x = 0.5$, the data generating process is a mixture of $f$ and $g$: $m(w) = \frac{1}{2}f(w) + \frac{1}{2}g(w)$. 
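A quick way to sanity-check this two-stage description of the data generating process is to sample from it and compare the empirical distribution with the mixture c.d.f. $H = \frac{1}{2}F + \frac{1}{2}G$. The Beta parameters below are illustrative assumptions standing in for `F_a, F_b, G_a, G_b`.

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(2)
n = 100_000

# Two-stage draw with hypothetical component densities and x = 0.5
from_f = rng.random(n) < 0.5
draws = np.where(from_f, rng.beta(1, 1, size=n), rng.beta(3, 1.2, size=n))

# Mixture c.d.f. H(w) = 0.5 F(w) + 0.5 G(w)
H = lambda w: 0.5 * beta(1, 1).cdf(w) + 0.5 * beta(3, 1.2).cdf(w)

stat, _ = kstest(draws, H)   # KS statistic should be tiny for a correct sampler
```

A Kolmogorov-Smirnov statistic near zero confirms that flipping the coin each period and then drawing from the selected Beta component is the same as drawing directly from $m$.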
-A widely used measure of "closeness" between two distributions is the Kullback-Leibler (KL) divergence. - - -Let's check the KL divergence of the mixture distribution $m$ from both $f$ and $g$. +Let's check the [KL divergence](rel_entropy) of the mixture distribution $m$ from both $f$ and $g$. ```{code-cell} ipython3 def compute_KL(f, g): @@ -573,137 +570,16 @@ Hence $\pi_t \to 1$ as $t \to \infty$ for any $\pi_0 \in (0,1)$. This explains what we observed in the plot above. - - But how can we learn the true mixing parameter $x$? -This topic is taken up in {doc}`this lecture `. - -We also explore this topic in the exrecise below. - -```{exercise} -:label: likelihood_bayes_ex1 - - -The true data generating process is a mixture, and one of the parameters to be learned is the mixing proportion $x$. - -A correct Bayesian approach should directly model the uncertainty about $x$ and update beliefs about it as new data arrives. - -Here is the algorithm: - -First we specify a prior distribution for $x$ given by $x \sim \text{Beta}(\alpha_0, \beta_0)$ with sexpectation $\mathbb{E}[x] = \frac{\alpha_0}{\alpha_0 + \beta_0}$. - -The likelihood for a single observation $w_t$ is $p(w_t|x) = x f(w_t) + (1-x) g(w_t)$. - -For a sequence $w^t = (w_1, \dots, w_t)$, the likelihood is $p(w^t|x) = \prod_{i=1}^t p(w_i|x)$. - -The posterior distribution is updated using $p(x|w^t) \propto p(w^t|x) p(x)$. - -Recursively, the posterior after $w_t$ is $p(x|w^t) \propto p(w_t|x) p(x|w^{t-1})$. - -Without a conjugate prior, we can approximate the posterior by discretizing $x$ into a grid. - -Your task is to implement this algorithm in Python. - -You can verify your implementation by checking that the posterior mean converges to the true value of $x$ -as $t$ increases. 
-``` - -```{solution-start} likelihood_bayes_ex1 -:class: dropdown -``` - -Here is one solution: - -```{code-cell} ipython3 -@jit -def learn_x_bayesian(observations, α0, β0, grid_size=2000): - """ - Sequential Bayesian learning of the mixing probability x - using a grid approximation. - """ - w = np.asarray(observations) - T = w.size - - x_grid = np.linspace(1e-3, 1 - 1e-3, grid_size) - - # Log prior - log_prior = (α0 - 1) * np.log(x_grid) + (β0 - 1) * np.log1p(-x_grid) - - μ_path = np.empty(T + 1) - μ_path[0] = α0 / (α0 + β0) - - log_post = log_prior.copy() - - for t in range(T): - wt = w[t] - # P(w_t | x) = x f(w_t) + (1 - x) g(w_t) - like = x_grid * f(wt) + (1 - x_grid) * g(wt) - log_post += np.log(like) - - # normalize - log_post -= log_post.max() - post = np.exp(log_post) - post /= post.sum() - - μ_path[t + 1] = np.sum(x_grid * post) - - return μ_path - -x_posterior_means = [learn_x_bayesian(w_mix, α0, β0) for α0, β0 in prior_params] -``` - -Let's visualize how the posterior mean of $x$ evolves over time, starting from three different prior beliefs. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(10, 6)) - -for i, (x_means, mean0) in enumerate(zip(x_posterior_means, prior_means)): - ax.plot(range(T_mix + 1), x_means, - label=fr'Prior mean = ${mean0:.2f}$', - color=colors[i], linewidth=2) - -ax.axhline(y=x_true, color='black', linestyle='--', - label=f'True x = {x_true}', linewidth=2) -ax.set_xlabel('$t$') -ax.set_ylabel('Posterior mean of $x$') -ax.legend() -plt.show() -``` - -The plot shows that regardless of the initial prior belief, all three posterior means eventually converge towards the true value of $x=0.5$. - -Next, let's look at multiple simulations with a longer time horizon, all starting from a uniform prior. 
- -```{code-cell} ipython3 -set_seed() -n_paths = 20 -T_long = 10_000 - -fig, ax = plt.subplots(figsize=(10, 5)) - -for j in range(n_paths): - w_path = simulate_mixture_path(x_true, T_long) - x_means = learn_x_bayesian(w_path, 1, 1) # Uniform prior - ax.plot(range(T_long + 1), x_means, alpha=0.5, linewidth=1) - -ax.axhline(y=x_true, color='red', linestyle='--', - label=f'True x = {x_true}', linewidth=2) -ax.set_ylabel('Posterior mean of $x$') -ax.set_xlabel('$t$') -ax.legend() -plt.tight_layout() -plt.show() -``` - -We can see that the posterior mean of $x$ converges to the true value $x=0.5$. +This topic is taken up in {doc}`mix_model`. -```{solution-end} -``` +We explore how to learn the true mixing parameter $x$ in the exercise +of {doc}`mix_model`. ## Behavior of Posterior Probability $\{\pi_t\}$ Under Subjective Probability Distribution -We'll end this lecture by briefly studying what our Baysian learner expects to learn under the +We'll end this lecture by briefly studying what our Bayesian learner expects to learn under the subjective beliefs $\pi_t$ cranked out by Bayes' law. This will provide us with some perspective on our application of Bayes's law as a theory of learning. @@ -1118,163 +994,4 @@ The conditional variance is nearly zero only when the agent is almost sure that ## Related Lectures This lecture has been devoted to building some useful infrastructure that will help us understand inferences that are the foundations of -results described in {doc}`this lecture ` and {doc}`this lecture ` and {doc}`this lecture `. - -```{exercise} -:label: likelihood_bayes_ex2 - -In the [first exercise](likelihood_bayes_ex1), we implemented a Bayesian learning algorithm to estimate the mixing parameter $x$ in a mixture model using a grid approximation method. - -In this exercise, we will explore sequential Bayesian updating using NumPyro. - -Please follow these steps: - -1. Generate a dataset from a mixture model with known mixing parameter $x_{true} = 0.5$. -2. 
Process the data in chunks of 100 observations, updating the posterior sequentially. -3. For each chunk, use the posterior from the previous chunk as the prior for the next chunk (using moment matching to fit a Beta distribution). -4. Create a visualization showing how the posterior distribution evolves, using gradually darker shades to represent later time periods. - -In the exercise, set $\alpha_0 = 1$ and $\beta_0 = 2$. -``` - -```{solution-start} likelihood_bayes_ex2 -:class: dropdown -``` - -First, let's install and import the necessary packages: - -```{code-cell} ipython3 -:tags: [hide-output] - -!pip install numpyro jax -``` - -```{code-cell} ipython3 -import jax -import jax.numpy as jnp -import numpyro -import numpyro.distributions as dist -from numpyro.infer import NUTS, MCMC -import seaborn as sns -``` - -Define the mixture model and helper functions that helps us -to fit a Beta distribution to the posterior and obtain the parameter for the next chunk - -```{code-cell} ipython3 -def mixture_model(w, α0=1.0, β0=1.0): - x = numpyro.sample("x", dist.Beta(α0, β0)) - f_dist = dist.Beta(F_a, F_b) - g_dist = dist.Beta(G_a, G_b) - mix = dist.Mixture( - dist.Categorical(probs=jnp.array([x, 1 - x])), - [f_dist, g_dist] - ) - with numpyro.plate("data", w.shape[0]): - numpyro.sample("w", mix, obs=w) - -def β_moment_match(samples, eps=1e-12): - m = float(samples.mean()) - v = float(samples.var()) - v = max(v, eps) - t = m * (1 - m) / v - 1.0 - α = max(m * t, eps) - β = max((1 - m) * t, eps) - return α, β -``` - -Now we implement sequential Bayesian updating - -```{code-cell} ipython3 -def sequential_update(w_all, α0=1.0, β0=1.0, - chunk_size=100, - num_warmup=1000, - num_samples=2000, - seed=0): - n = len(w_all) - - # Create chunks - chunks = [slice(i, min(i + chunk_size, n)) - for i in range(0, n, chunk_size)] - α, β = α0, β0 - - keys = jax.random.split( - jax.random.PRNGKey(seed), len(chunks)) - means = [α / (α + β)] - posts = [] - - # Run MCMC for each chunk - for 
i, sl in enumerate(chunks): - kernel = NUTS(mixture_model) - mcmc = MCMC(kernel, - num_warmup=num_warmup, - num_samples=num_samples) - mcmc.run(keys[i], w_all[sl], α, β) - xs = mcmc.get_samples()["x"] - - posts.append(xs) - - # Posterior becomes prior for next chunk - α, β = β_moment_match(xs) - means.append(xs.mean()) - - return np.array(means), posts, chunks -``` - -Generate data and run sequential updates: - -```{code-cell} ipython3 -:tags: [hide-output] - -x_true = 0.5 -T_total = 2000 - -set_seed() -w_mix = simulate_mixture_path(x_true, T_total) - -means, posts, chunks = sequential_update( - w_mix, α0=1.0, β0=2.0, chunk_size=200, - num_warmup=2000, num_samples=1000, seed=123) -``` - -Create visualization with gradually darker lines: - -```{code-cell} ipython3 -fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5)) - -# Plot posterior means over time -n_seen = np.cumsum([0] + [c.stop - c.start for c in chunks]) - -# Posterior mean trajectory -ax1.plot(n_seen, means, 'o-', - color='darkblue', label='Posterior mean of $x$') -ax1.axhline(x_true, ls="--", - color="red", alpha=0.7, label=f'True $x$ = {x_true}') -ax1.set_xlabel('$t$') -ax1.set_ylabel('Posterior mean of $x$') -ax1.legend() - -# Posterior densities at each chunk -n_chunks = len(posts) -colors = plt.cm.Blues(np.linspace(0.01, 0.99, n_chunks)) - -for i, (xs, color) in enumerate(zip(posts, colors)): - sns.kdeplot(xs, color=color, ax=ax2, - alpha=0.7, label=f'n={chunks[i].stop}') - -ax2.axvline(x_true, ls="--", color="red", - alpha=0.7, label=f'True $x$ = {x_true}') -ax2.set_xlabel('$x$') -ax2.set_ylabel('density') -ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left') - -plt.tight_layout() -plt.show() -``` - -The left panel shows how the posterior mean converges to the true value as more data is observed. - -The right panel shows that the distribution of $x$ becomes more concentrated around the true value with more observations. 
- -```{solution-end} -``` +results described in {doc}`this lecture ` and {doc}`this lecture ` and {doc}`this lecture `. \ No newline at end of file diff --git a/lectures/mix_model.md b/lectures/mix_model.md index 0cb249cbc..2f306c6c5 100644 --- a/lectures/mix_model.md +++ b/lectures/mix_model.md @@ -65,7 +65,7 @@ Thus, we change the {doc}`quantecon lecture ` specification in Now, **each period** $t \geq 0$, nature flips a possibly unfair coin that comes up $f$ with probability $\alpha$ and $g$ with probability $1 -\alpha$. -Thus, naturally perpetually draws from the **mixture distribution** with c.d.f. +Thus, nature perpetually draws from the **mixture distribution** with c.d.f. $$ H(w ) = \alpha F(w) + (1-\alpha) G(w), \quad \alpha \in (0,1) @@ -261,7 +261,6 @@ def draw_lottery(p, N): draws.append(np.random.beta(F_a, F_b)) else: draws.append(np.random.beta(G_a, G_b)) - return np.array(draws) def draw_lottery_MC(p, N): @@ -294,18 +293,6 @@ plt.legend() plt.show() ``` -```{code-cell} ipython3 -# %%timeit # compare speed -# sample1 = draw_lottery(α, N=int(1e6)) -``` - -```{code-cell} ipython3 -# %%timeit -# sample2 = draw_lottery_MC(α, N=int(1e6)) -``` - -**Note:** With numba acceleration the first method is actually only slightly slower than the second when we generated 1,000,000 samples. 
- ## Type 1 Agent We'll now study what our type 1 agent learns @@ -462,7 +449,7 @@ def plot_π_seq(α, π1=0.2, π2=0.8, T=200): ax1.set_ylabel(r"$\pi_t$") ax1.set_xlabel("t") ax1.legend() - ax1.set_title("when $\\alpha G + (1-\\alpha)$ F governs data") + ax1.set_title("when $\\alpha F + (1-\\alpha)G$ F governs data") ax2 = ax1.twinx() ax2.plot(range(1, T+1), np.log(l_seq_mixed[0, :]), '--', color='b') @@ -496,13 +483,13 @@ $$ h(w) \equiv h(w | \alpha) = \alpha f(w) + (1-\alpha) g(w) $$ we shall compute the following two Kullback-Leibler divergences $$ -KL_g (\alpha) = \int \log\left(\frac{g(w)}{h(w)}\right) h(w) d w +KL_g (\alpha) = \int \log\left(\frac{h(w)}{g(w)}\right) h(w) d w $$ and $$ -KL_f (\alpha) = \int \log\left(\frac{f(w)}{h(w)}\right) h(w) d w +KL_f (\alpha) = \int \log\left(\frac{h(w)}{f(w)}\right) h(w) d w $$ We shall plot both of these functions against $\alpha$ as we use $\alpha$ to vary @@ -514,38 +501,38 @@ $$ \min_{f,g} \{KL_g, KL_f\} $$ The only possible limits are $0$ and $1$. -As $\rightarrow +\infty$, $\pi_t$ goes to one if and only if $KL_f < KL_g$ +As $t \rightarrow +\infty$, $\pi_t$ goes to one if and only if $KL_f < KL_g$ ```{code-cell} ipython3 @vectorize def KL_g(α): - "Compute the KL divergence between g and h." + "Compute the KL divergence KL(h, g)." err = 1e-8 # to avoid 0 at end points ws = np.linspace(err, 1-err, 10000) gs, fs = g(ws), f(ws) hs = α*fs + (1-α)*gs - return np.sum(np.log(gs/hs)*hs)/10000 + return np.sum(np.log(hs/gs)*hs)/10000 @vectorize def KL_f(α): - "Compute the KL divergence between f and h." + "Compute the KL divergence KL(h, f)." err = 1e-8 # to avoid 0 at end points ws = np.linspace(err, 1-err, 10000) gs, fs = g(ws), f(ws) hs = α*fs + (1-α)*gs - return np.sum(np.log(fs/hs)*hs)/10000 + return np.sum(np.log(hs/fs)*hs)/10000 # compute KL using quad in Scipy def KL_g_quad(α): - "Compute the KL divergence between g and h using scipy.integrate." + "Compute the KL divergence KL(h, g) using scipy.integrate." 
h = lambda x: α*f(x) + (1-α)*g(x) - return quad(lambda x: np.log(g(x)/h(x))*h(x), 0, 1)[0] + return quad(lambda x: h(x) * np.log(h(x)/g(x)), 0, 1)[0] def KL_f_quad(α): - "Compute the KL divergence between f and h using scipy.integrate." + "Compute the KL divergence KL(h, f) using scipy.integrate." h = lambda x: α*f(x) + (1-α)*g(x) - return quad(lambda x: np.log(f(x)/h(x))*h(x), 0, 1)[0] + return quad(lambda x: h(x) * np.log(h(x)/f(x)), 0, 1)[0] # vectorize KL_g_quad_v = np.vectorize(KL_g_quad) @@ -575,37 +562,20 @@ KL_f_arr = KL_f(α_arr) fig, ax = plt.subplots(1, figsize=[10, 6]) -ax.plot(α_arr, KL_g_arr, label='KL(g, h)') -ax.plot(α_arr, KL_f_arr, label='KL(f, h)') -ax.set_ylabel('K-L divergence') +ax.plot(α_arr, KL_g_arr, label='KL(h, g)') +ax.plot(α_arr, KL_f_arr, label='KL(h, f)') +ax.set_ylabel('KL divergence') ax.set_xlabel(r'$\alpha$') ax.legend(loc='upper right') plt.show() ``` -```{code-cell} ipython3 -# # using Scipy to compute KL divergence - -# α_arr = np.linspace(0, 1, 100) -# KL_g_arr = KL_g_quad_v(α_arr) -# KL_f_arr = KL_f_quad_v(α_arr) - -# fig, ax = plt.subplots(1, figsize=[10, 6]) - -# ax.plot(α_arr, KL_g_arr, label='KL(g, h)') -# ax.plot(α_arr, KL_f_arr, label='KL(f, h)') -# ax.set_ylabel('K-L divergence') - -# ax.legend(loc='upper right') -# plt.show() -``` - Let's compute an $\alpha$ for which the KL divergence between $h$ and $g$ is the same as that between $h$ and $f$. ```{code-cell} ipython3 # where KL_f = KL_g -α_arr[np.argmin(np.abs(KL_g_arr-KL_f_arr))] +discretion = α_arr[np.argmin(np.abs(KL_g_arr-KL_f_arr))] ``` We can compute and plot the convergence point $\pi_{\infty}$ for each $\alpha$ to verify that the convergence is indeed governed by the KL divergence. 
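A direct, simulation-based way to approximate each convergence point is to run the Bayes recursion for a long horizon on draws from $h(\cdot \mid \alpha)$ and read off the terminal value; `π_lim_v` used below plays a similar role. The sketch here uses hypothetical Beta parameters as stand-ins for $f$ and $g$.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical stand-ins for f and g
f_pdf = beta(1, 1).pdf
g_pdf = beta(3, 1.2).pdf

def π_limit(α, T=5_000, π0=0.5, seed=3):
    """Run the Bayes recursion on T draws from h = α f + (1-α) g
    and return π_T as a proxy for the limit point π_∞."""
    rng = np.random.default_rng(seed)
    from_f = rng.random(T) < α
    w = np.where(from_f, rng.beta(1, 1, size=T), rng.beta(3, 1.2, size=T))
    l = f_pdf(w) / g_pdf(w)
    π = π0
    for lt in l:
        π = π * lt / (π * lt + 1 - π)
    return π

# should land near 1 and near 0 respectively under these parameter choices
print(π_limit(0.9), π_limit(0.1))
```

For $\alpha$ away from the crossing point of the two KL curves, the simulated $\pi_T$ pins down which of the two limits, $0$ or $1$, the posterior selects.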
@@ -617,15 +587,15 @@ Thus, the graph below confirms how a minimum KL divergence governs what our typ ```{code-cell} ipython3 -α_arr_x = α_arr[(α_arr<0.28)|(α_arr>0.38)] +α_arr_x = α_arr[(α_arrdiscretion)] π_lim_arr = π_lim_v(α_arr_x) # plot fig, ax = plt.subplots(1, figsize=[10, 6]) -ax.plot(α_arr, KL_g_arr, label='KL(g, h)') -ax.plot(α_arr, KL_f_arr, label='KL(f, h)') -ax.set_ylabel('K-L divergence') +ax.plot(α_arr, KL_g_arr, label='KL(h, g)') +ax.plot(α_arr, KL_f_arr, label='KL(h, f)') +ax.set_ylabel('KL divergence') ax.set_xlabel(r'$\alpha$') # plot KL @@ -640,13 +610,17 @@ plt.show() ``` Evidently, our type 1 learner who applies Bayes' law to his misspecified set of statistical models eventually learns an approximating model that is as close as possible to the true model, as measured by its -Kullback-Leibler divergence. +Kullback-Leibler divergence: + +- When $\alpha$ is small, the $KL_g < KL_f$ meaning the divergence of $g$ from $h$ is smaller than that of $f$ and so the limit point of $\pi_t$ is close to $0$. + +- When $\alpha$ is large, the $KL_f < KL_g$ meaning the divergence of $f$ from $h$ is smaller than that of $g$ and so the limit point of $\pi_t$ is close to $1$. ## Type 2 Agent We now describe how our type 2 agent formulates his learning problem and what he eventually learns. -Our type 2 agent understands the correct statistical model but acknowledges does not know $\alpha$. +Our type 2 agent understands the correct statistical model but does not know $\alpha$. 
We apply Bayes law to deduce an algorithm for learning $\alpha$ under the assumption that the agent knows that @@ -699,8 +673,8 @@ def model(w): def MCMC_run(ws): "Compute posterior using MCMC with observed ws" - kernal = NUTS(model) - mcmc = MCMC(kernal, num_samples=5000, num_warmup=1000, progress_bar=False) + kernel = NUTS(model) + mcmc = MCMC(kernel, num_samples=5000, num_warmup=1000, progress_bar=False) mcmc.run(rng_key=random.PRNGKey(142857), w=jnp.array(ws)) sample = mcmc.get_samples() @@ -759,3 +733,143 @@ $s(x | \theta)$ to infer $\theta$ from $x$. But the scientist's model is misspecified, being only an approximation to an unknown model $h$ that nature uses to generate $X$. If the scientist uses Bayes' law or a related likelihood-based method to infer $\theta$, it occurs quite generally that for large sample sizes the inverse problem infers a $\theta$ that minimizes the KL divergence of the scientist's model $s$ relative to nature's model $h$. + + +## Exercises + +```{exercise} +:label: mix_model_ex1 + +In {doc}`likelihood_bayes`, we studied the consequence of applying likelihood ratio +and Bayes' law to a misspecified statistical model. + +In that lecture, we used a model selection algorithm to study the case where the true data generating process is a mixture. + +In this lecture, we studied how to correctly "learn" a model generated by a mixing process using a Bayesian approach. + +To fix the algorithm we used in {doc}`likelihood_bayes`. A correct Bayesian approach should directly model the uncertainty about $x$ and update beliefs about it as new data arrives. + +Here is the algorithm: + +First we specify a prior distribution for $x$ given by $x \sim \text{Beta}(\alpha_0, \beta_0)$ with expectation $\mathbb{E}[x] = \frac{\alpha_0}{\alpha_0 + \beta_0}$. + +The likelihood for a single observation $w_t$ is $p(w_t|x) = x f(w_t) + (1-x) g(w_t)$. + +For a sequence $w^t = (w_1, \dots, w_t)$, the likelihood is $p(w^t|x) = \prod_{i=1}^t p(w_i|x)$. 
+ +The posterior distribution is updated using $p(x|w^t) \propto p(w^t|x) p(x)$. + +Recursively, the posterior after $w_t$ is $p(x|w^t) \propto p(w_t|x) p(x|w^{t-1})$. + +Without a conjugate prior, we can approximate the posterior by discretizing $x$ into a grid. + +Your task is to implement this algorithm in Python. + +You can verify your implementation by checking that the posterior mean converges to the true value of $x$ as $t$ increases in {doc}`likelihood_bayes`. +``` + +```{solution-start} mix_model_ex1 +:class: dropdown +``` + +Here is one solution: + +First we define the mixture probability +and parameters of prior distributions + +```{code-cell} ipython3 +x_true = 0.5 +T_mix = 200 + +# Three different priors with means 0.25, 0.5, 0.75 +prior_params = [(1, 3), (1, 1), (3, 1)] +prior_means = [a/(a+b) for a, b in prior_params] + +w_mix = draw_lottery(x_true, T_mix) +``` + +```{code-cell} ipython3 +@jit +def learn_x_bayesian(observations, α0, β0, grid_size=2000): + """ + Sequential Bayesian learning of the mixing probability x + using a grid approximation. + """ + w = np.asarray(observations) + T = w.size + + x_grid = np.linspace(1e-3, 1 - 1e-3, grid_size) + + # Log prior + log_prior = (α0 - 1) * np.log(x_grid) + (β0 - 1) * np.log1p(-x_grid) + + μ_path = np.empty(T + 1) + μ_path[0] = α0 / (α0 + β0) + + log_post = log_prior.copy() + + for t in range(T): + wt = w[t] + # P(w_t | x) = x f(w_t) + (1 - x) g(w_t) + like = x_grid * f(wt) + (1 - x_grid) * g(wt) + log_post += np.log(like) + + # normalize + log_post -= log_post.max() + post = np.exp(log_post) + post /= post.sum() + + μ_path[t + 1] = np.sum(x_grid * post) + + return μ_path + +x_posterior_means = [learn_x_bayesian(w_mix, α0, β0) for α0, β0 in prior_params] +``` + +Let's visualize how the posterior mean of $x$ evolves over time, starting from three different prior beliefs. 
+ +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(10, 6)) + +for i, (x_means, mean0) in enumerate(zip(x_posterior_means, prior_means)): + ax.plot(range(T_mix + 1), x_means, + label=fr'Prior mean = ${mean0:.2f}$', + color=colors[i], linewidth=2) + +ax.axhline(y=x_true, color='black', linestyle='--', + label=f'True x = {x_true}', linewidth=2) +ax.set_xlabel('$t$') +ax.set_ylabel('Posterior mean of $x$') +ax.legend() +plt.show() +``` + +The plot shows that regardless of the initial prior belief, all three posterior means eventually converge towards the true value of $x=0.5$. + +Next, let's look at multiple simulations with a longer time horizon, all starting from a uniform prior. + +```{code-cell} ipython3 +set_seed() +n_paths = 20 +T_long = 10_000 + +fig, ax = plt.subplots(figsize=(10, 5)) + +for j in range(n_paths): + w_path = draw_lottery(x_true, T_long) + x_means = learn_x_bayesian(w_path, 1, 1) # Uniform prior + ax.plot(range(T_long + 1), x_means, alpha=0.5, linewidth=1) + +ax.axhline(y=x_true, color='red', linestyle='--', + label=f'True x = {x_true}', linewidth=2) +ax.set_ylabel('Posterior mean of $x$') +ax.set_xlabel('$t$') +ax.legend() +plt.tight_layout() +plt.show() +``` + +We can see that the posterior mean of $x$ converges to the true value $x=0.5$. + +```{solution-end} +``` From 79f12784d451debd1420a66fd3bd06bbd49d66eb Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Wed, 30 Jul 2025 22:30:39 +1000 Subject: [PATCH 3/6] update new exercise for lrp and minor update to mix_model --- lectures/likelihood_ratio_process.md | 133 +++++++++++++++++++++++++++ lectures/mix_model.md | 7 +- 2 files changed, 138 insertions(+), 2 deletions(-) diff --git a/lectures/likelihood_ratio_process.md b/lectures/likelihood_ratio_process.md index 681a4a504..daa23ed6d 100644 --- a/lectures/likelihood_ratio_process.md +++ b/lectures/likelihood_ratio_process.md @@ -2139,3 +2139,136 @@ and as applied in {doc}`odu`. 
Likelihood ratio processes appear again in {doc}`advanced:additive_functionals`, which
contains another illustration of the **peculiar property** of likelihood ratio
processes described above.


## Exercises

```{exercise}
:label: lr_ex1

Consider the setting where nature generates data from a third density $h$.

Let $\{w_t\}_{t=1}^T$ be IID draws from $h$, and let $L_t = L(w^t)$ be the likelihood ratio process defined as in the lecture.

Show that

$$
\frac{1}{t} E_h[\log L_t] = K_g - K_f ,
$$

where $K_f = KL(h, f)$ and $K_g = KL(h, g)$ are the Kullback-Leibler divergences of $h$ from $f$ and from $g$, and where we assume that $K_f, K_g < \infty$, $E_h |\log f(W)| < \infty$, and $E_h |\log g(W)| < \infty$.

*Hint:* Start by expressing $\log L_t$ as a sum of $\log \ell(w_i)$ terms and compare the result with the definitions of $K_f$ and $K_g$.
```

```{solution-start} lr_ex1
:class: dropdown
```

Since $w_1, \ldots, w_t$ are IID draws from $h$, we can write

$$
\log L_t = \log \prod_{i=1}^t \ell(w_i) = \sum_{i=1}^t \log \ell(w_i) = \sum_{i=1}^t \log \frac{f(w_i)}{g(w_i)}
$$

Taking expectations under $h$ and using linearity gives

$$
E_h[\log L_t]
= E_h\left[\sum_{i=1}^t \log \frac{f(w_i)}{g(w_i)}\right]
= \sum_{i=1}^t E_h\left[\log \frac{f(w_i)}{g(w_i)}\right]
$$

Since the $w_i$ are identically distributed

$$
E_h[\log L_t] = t \cdot E_h\left[\log \frac{f(w)}{g(w)}\right]
$$

where $w \sim h$.
Therefore

$$
\frac{1}{t} E_h[\log L_t] = E_h\left[\log \frac{f(w)}{g(w)}\right] = E_h[\log f(w)] - E_h[\log g(w)]
$$

Now, from the definition of the Kullback-Leibler divergence

$$
K_f = KL(h, f) = \int h(w) \log \frac{h(w)}{f(w)} \, dw = E_h[\log h(w)] - E_h[\log f(w)]
$$

This gives us

$$
E_h[\log f(w)] = E_h[\log h(w)] - K_f
$$

Similarly

$$
E_h[\log g(w)] = E_h[\log h(w)] - K_g
$$

Substituting back

$$
\begin{aligned}
\frac{1}{t} E_h[\log L_t] &= E_h[\log f(w)] - E_h[\log g(w)] \\
&= [E_h[\log h(w)] - K_f] - [E_h[\log h(w)] - K_g] \\
&= K_g - K_f
\end{aligned}
$$

```{solution-end}
```

```{exercise}
:label: lr_ex2

Building on {ref}`lr_ex1`, use the result to explain what happens to $L_t$ as $t \to \infty$ in the following cases:

1. When $K_g > K_f$ (i.e., $f$ is "closer" to $h$ than $g$ is)
2. When $K_g < K_f$ (i.e., $g$ is "closer" to $h$ than $f$ is)

Relate your answer to the simulation results shown in the {ref}`Kullback-Leibler Divergence ` section.
```

```{solution-start} lr_ex2
:class: dropdown
```

From {ref}`lr_ex1`, we know that

$$
\frac{1}{t} E_h[\log L_t] = K_g - K_f
$$

**Case 1:** When $K_g > K_f$

Here, $f$ is "closer" to $h$ than $g$ is. Since $K_g - K_f > 0$

$$
E_h[\log L_t] = t \cdot (K_g - K_f) \to +\infty \text{ as } t \to \infty
$$

By the Law of Large Numbers, $\frac{1}{t} \log L_t \to K_g - K_f > 0$ almost surely.

Therefore $L_t \to +\infty$ almost surely.

**Case 2:** When $K_g < K_f$

Here, $g$ is "closer" to $h$ than $f$ is. Since $K_g - K_f < 0$

$$
E_h[\log L_t] = t \cdot (K_g - K_f) \to -\infty \text{ as } t \to \infty
$$

By the same Law of Large Numbers argument, $\frac{1}{t} \log L_t \to K_g - K_f < 0$ almost surely, so that $\log L_t \to -\infty$.

Therefore $L_t \to 0$ almost surely.
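A quick numerical check of these two cases is possible. The sketch below uses Beta densities as stand-ins for $f$, $g$, and $h$ (the specific parameters are assumptions made for illustration, not values taken from the lecture): it computes $K_f$ and $K_g$ by numerical integration and compares $K_g - K_f$ with a Monte Carlo estimate of $\frac{1}{t} E_h[\log L_t]$.

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

# Assumed stand-in densities (illustrative parameters only)
f = beta(1, 1).pdf      # f = Beta(1, 1)
g = beta(3, 1.2).pdf    # g = Beta(3, 1.2)
h_dist = beta(2, 2)     # nature draws from h = Beta(2, 2)
h = h_dist.pdf

def kl(p, q):
    "KL divergence: integral of p(w) log(p(w)/q(w)) over (0, 1)"
    val, _ = quad(lambda w: p(w) * np.log(p(w) / q(w)), 1e-6, 1 - 1e-6)
    return val

K_f, K_g = kl(h, f), kl(h, g)

rng = np.random.default_rng(0)
w = h_dist.rvs(100_000, random_state=rng)

# Monte Carlo estimate of (1/t) E_h[log L_t] = E_h[log f(w) - log g(w)]
slope = np.mean(np.log(f(w)) - np.log(g(w)))

print(K_g - K_f, slope)  # the two numbers should be close
```

With these assumed parameters $K_g > K_f$, so the sample mean of $\log \ell(w)$ is positive and $L_t$ diverges, as in Case 1.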
+ +```{solution-end} +``` + + + diff --git a/lectures/mix_model.md b/lectures/mix_model.md index 2f306c6c5..92bfaa055 100644 --- a/lectures/mix_model.md +++ b/lectures/mix_model.md @@ -449,7 +449,7 @@ def plot_π_seq(α, π1=0.2, π2=0.8, T=200): ax1.set_ylabel(r"$\pi_t$") ax1.set_xlabel("t") ax1.legend() - ax1.set_title("when $\\alpha F + (1-\\alpha)G$ F governs data") + ax1.set_title("when $\\alpha F + (1-\\alpha)G$ governs data") ax2 = ax1.twinx() ax2.plot(range(1, T+1), np.log(l_seq_mixed[0, :]), '--', color='b') @@ -601,7 +601,10 @@ ax.set_xlabel(r'$\alpha$') # plot KL ax2 = ax.twinx() # plot limit point -ax2.scatter(α_arr_x, π_lim_arr, facecolors='none', edgecolors='tab:blue', label=r'$\pi$ lim') +ax2.scatter(α_arr_x, π_lim_arr, + facecolors='none', + edgecolors='tab:blue', + label=r'$\pi$ lim') ax2.set_ylabel('π lim') ax.legend(loc=[0.85, 0.8]) From 96954637f54c6e81ec6df831da211dfb7490630b Mon Sep 17 00:00:00 2001 From: Humphrey Yang Date: Wed, 30 Jul 2025 22:34:45 +1000 Subject: [PATCH 4/6] hide output of the pip install --- lectures/mix_model.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lectures/mix_model.md b/lectures/mix_model.md index 92bfaa055..a7eaeaaa3 100644 --- a/lectures/mix_model.md +++ b/lectures/mix_model.md @@ -18,7 +18,7 @@ kernelspec: ``` ```{code-cell} ipython3 -:tags: [no-execute] +:tags: [no-execute, hide-output] !pip install numpyro jax ``` From 58d88394b22fdbdbca25671ebcacc1fe0aa7cb7b Mon Sep 17 00:00:00 2001 From: thomassargent30 Date: Wed, 30 Jul 2025 10:50:45 -0600 Subject: [PATCH 5/6] Tom's first July 30 eduts if likelihood_bayes lecture --- lectures/likelihood_bayes.md | 41 ++++++++++++++++++++++-------------- 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/lectures/likelihood_bayes.md b/lectures/likelihood_bayes.md index 262a1f3b8..d85aaae6a 100644 --- a/lectures/likelihood_bayes.md +++ b/lectures/likelihood_bayes.md @@ -210,23 +210,25 @@ of the posterior probability $\pi_t$, an instance of 
**Bayes' Law**.

Let's derive a couple of formulas for $\pi_{t+1}$, one in terms of the one-period likelihood ratio $l(w_{t+1})$, the other in terms of
the likelihood ratio process $L(w^{t+1})$.

To begin, we use the notational conventions

* $f(w^{t+1}) \equiv f(w_1) f(w_2) \cdots f(w_{t+1})$
* $g(w^{t+1}) \equiv g(w_1) g(w_2) \cdots g(w_{t+1})$
* $\pi_0 ={\rm Prob}(q=f |\emptyset)$
* $\pi_t = {\rm Prob}(q=f |w^t)$

Here the symbol $\emptyset$ means "empty set" or "no data".

With no data in hand, our Bayesian statistician thinks that the probability density of the sequence $w^{t+1}$ is

$$
{\rm Prob}(w^{t+1} |\emptyset) = \pi_0 f(w^{t+1}) + (1 - \pi_0) g(w^{t+1})
$$

Probability laws connecting joint probability distributions and conditional probability distributions imply that

$$
{\rm Prob}(q=f |w^{t+1}) {\rm Prob}(w^{t+1} |\emptyset) = {\rm Prob}(w^{t+1} |q=f) {\rm Prob}(q=f | \emptyset)
$$

or

$$
\pi_{t+1} \left[\pi_0 f(w^{t+1}) + (1- \pi_0) g(w^{t+1})\right] = f(w^{t+1}) \pi_0
$$

or

$$
\pi_{t+1} = \frac{ f(w^{t+1}) \pi_0 }{\pi_0 f(w^{t+1}) + (1- \pi_0) g(w^{t+1})}
$$

Dividing both the numerator and the denominator on the right side of the above equation by $g(w^{t+1})$ implies

```{math}
:label: eq_Bayeslaw1033

\pi_{t+1}=\frac{\pi_{0}L\left(w^{t+1}\right)}{\pi_{0}L\left(w^{t+1}\right)+1-\pi_{0}} .
```

Formula {eq}`eq_Bayeslaw1033` can be regarded as a one-step revision of the prior probability $\pi_0$ after seeing
the batch of data $\left\{ w_{i}\right\} _{i=1}^{t+1}$.

Formula {eq}`eq_Bayeslaw1033` shows the key role that the likelihood ratio process $L\left(w^{t+1}\right)$ plays in determining
the posterior probability $\pi_{t+1}$.

Formula {eq}`eq_Bayeslaw1033` is the foundation for the insight that, because of how the likelihood ratio process behaves
as $t \rightarrow + \infty$, the likelihood ratio process dominates the initial prior $\pi_0$ in determining the
limiting behavior of $\pi_t$.
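Formula {eq}`eq_Bayeslaw1033` is easy to check numerically. The sketch below uses two assumed Beta densities as stand-ins for $f$ and $g$ (illustrative choices, not the lecture's own parameters): it builds the likelihood ratio process with a cumulative product and maps it through the batch formula.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)

# Assumed stand-ins for f and g (illustrative parameters only)
f_dist, g_dist = beta(1, 1), beta(3, 1.2)

# Nature draws the data from f
w = f_dist.rvs(100, random_state=rng)

# Likelihood ratio process L(w^t) = product over i <= t of f(w_i) / g(w_i)
L = np.cumprod(f_dist.pdf(w) / g_dist.pdf(w))

# Batch Bayes' law, formula (eq_Bayeslaw1033)
π0 = 0.4
π = π0 * L / (π0 * L + 1 - π0)

print(π[-1])  # posterior probability that q = f after 100 draws
```

Because the data really are generated by $f$ here, the posterior path typically climbs toward $1$, in line with the dominance of the likelihood ratio process described above.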
+ +### A recursive formula + We can use a similar line of argument to get a recursive version of formula {eq}`eq_Bayeslaw1033`. The laws of probability imply @@ -284,6 +299,8 @@ Dividing both the numerator and the denominator on the right side of the equat with $\pi_{0}$ being a Bayesian prior probability that $q = f$, i.e., a personal or subjective belief about $q$ based on our having seen no data. +Formula {eq}`eq_Bayeslaw1033` can be deduced by iterating on equation {eq}`eq_recur1`. + Below we define a Python function that updates belief $\pi$ using likelihood ratio $\ell$ according to recursion {eq}`eq_recur1` @@ -298,20 +315,12 @@ def update(π, l): return π ``` +As mentioned above, formula {eq}`eq_Bayeslaw1033` shows the key role that the likelihood ratio process $L\left(w^{t+1}\right)$ plays in determining the posterior probability $\pi_{t+1}$. -Formula {eq}`eq_Bayeslaw1033` generalizes formula {eq}`eq_recur1`. - - -Formula {eq}`eq_Bayeslaw1033` can be regarded as a one step revision of prior probability $\pi_0$ after seeing -the batch of data $\left\{ w_{i}\right\} _{i=1}^{t+1}$. - -Formula {eq}`eq_Bayeslaw1033` shows the key role that the likelihood ratio process $L\left(w^{t+1}\right)$ plays in determining -the posterior probability $\pi_{t+1}$. - -Formula {eq}`eq_Bayeslaw1033` is the foundation for the insight that, because of how the likelihood ratio process behaves -as $t \rightarrow + \infty$, the likelihood ratio process dominates the initial prior $\pi_0$ in determining the +As $t \rightarrow + \infty$, the likelihood ratio process dominates the initial prior $\pi_0$ in determining the limiting behavior of $\pi_t$. + To illustrate this insight, below we will plot graphs showing **one** simulated path of the likelihood ratio process $L_t$ along with two paths of $\pi_t$ that are associated with the *same* realization of the likelihood ratio process but *different* initial prior probabilities $\pi_{0}$. 
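The equivalence between the batch formula {eq}`eq_Bayeslaw1033` and the recursion {eq}`eq_recur1` can also be verified directly. The sketch below (with assumed normal densities standing in for $f$ and $g$, and a plain-Python `update` mirroring the jitted one above) iterates the recursion and compares the result with one application of the batch formula.

```python
import numpy as np
from scipy.stats import norm

def update(π, l):
    "One step of Bayes' law, recursion (eq_recur1)"
    return π * l / (π * l + 1 - π)

rng = np.random.default_rng(0)

# Assumed stand-ins: f = N(0, 1), g = N(1, 1); nature draws from f
f_dist, g_dist = norm(0, 1), norm(1, 1)
w = f_dist.rvs(50, random_state=rng)

l = f_dist.pdf(w) / g_dist.pdf(w)  # one-period likelihood ratios
L = np.cumprod(l)                  # likelihood ratio process

π0 = 0.3

# Iterate the recursion (eq_recur1)
π = π0
for lt in l:
    π = update(π, lt)

# Batch formula (eq_Bayeslaw1033) applied once to L(w^t)
π_batch = π0 * L[-1] / (π0 * L[-1] + 1 - π0)

print(π, π_batch)  # the two agree up to floating-point error
```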
From 7d2230c1c23cd82f16b7d6fa888c981f83d22eab Mon Sep 17 00:00:00 2001
From: thomassargent30
Date: Wed, 30 Jul 2025 15:56:47 -0600
Subject: [PATCH 6/6] Tom's July 30 edits of wald_Friedman lecture

---
 lectures/wald_friedman.md | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/lectures/wald_friedman.md b/lectures/wald_friedman.md
index 6f2f25682..dab4a54dd 100644
--- a/lectures/wald_friedman.md
+++ b/lectures/wald_friedman.md

Friedman and W. Allen Wallis during World War II when they were analysts at the

This problem led Abraham Wald {cite}`Wald47` to formulate **sequential analysis**,
an approach to statistical decision problems that is intimately related to dynamic programming.

In the spirit of {doc}`this earlier lecture `, the present lecture and its {doc}`sequel ` approach the problem from two distinct points of view, one frequentist, the other Bayesian.

In this lecture, we describe Wald's formulation of the problem from the perspective of a frequentist statistician working within the Neyman-Pearson tradition, one who thinks about testing hypotheses and consequently uses laws of large numbers to investigate limiting properties of particular statistics under a given
**hypothesis**, i.e., a vector of **parameters** that pins down a particular member of a manifold of statistical models that interest the statistician.

* From {doc}`this earlier lecture on frequentist and bayesian statistics`, please remember that a frequentist statistician routinely calculates functions of sequences of random variables, conditioning on a vector of parameters.
-In {doc}`this sequel ` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views -parameter vectors as vectors of random variables that are jointly distributed with observable variables that he is concerned about. +In {doc}`this sequel ` we'll discuss another formulation that adopts the perspective of a **Bayesian statistician** who views parameters as vectors of random variables that are jointly distributed with observable variables that he is concerned about. Because we are taking a frequentist perspective that is concerned about relative frequencies conditioned on alternative parameter values, i.e., alternative **hypotheses**, key ideas in this lecture @@ -55,12 +54,12 @@ alternative **hypotheses**, key ideas in this lecture - Type I and type II statistical errors - a type I error occurs when you reject a null hypothesis that is true - a type II error occures when you accept a null hypothesis that is false -- Abraham Wald's **sequential probability ratio test** - The **power** of a frequentist statistical test - The **size** of a frequentist statistical test - The **critical region** of a statistical test - A **uniformly most powerful test** - The role of a Law of Large Numbers (LLN) in interpreting **power** and **size** of a frequentist statistical test +- Abraham Wald's **sequential probability ratio test** We'll begin with some imports: @@ -119,7 +118,7 @@ Let's listen to Milton Friedman tell us what happened > $\ldots$. Friedman and Wallis worked on the problem but, after realizing that -they were not able to solve it, they described the problem to Abraham Wald. +they were not able to solve it, they told Abraham Wald about the problem. That started Wald on the path that led him to *Sequential Analysis* {cite}`Wald47`. @@ -236,7 +235,7 @@ Wald notes that > one critical region $W$ is more desirable than another if it > has smaller values of $\alpha$ and $\beta$. 
Although > either $\alpha$ or $\beta$ can be made arbitrarily small -> by a proper choice of the critical region $W$, it is possible +> by a proper choice of the critical region $W$, it is impossible > to make both $\alpha$ and $\beta$ arbitrarily small for a > fixed value of $n$, i.e., a fixed sample size. @@ -333,10 +332,9 @@ random variables is also independently and identically distributed (IID). But the observer does not know which of the two distributions generated the sequence. -For reasons explained in [Exchangeability and Bayesian Updating](https://python.quantecon.org/exchangeable.html), this means that the sequence is not -IID. +For reasons explained in [Exchangeability and Bayesian Updating](https://python.quantecon.org/exchangeable.html), this means that the observer thinks that sequence is not IID. -The observer has something to learn, namely, whether the observations are drawn from $f_0$ or from $f_1$. +Consequently, the observer has something to learn, namely, whether the observations are drawn from $f_0$ or from $f_1$. The decision maker wants to decide which of the two distributions is generating outcomes. @@ -356,12 +354,12 @@ To repeat ourselves ### Choices -After observing $z_k, z_{k-1}, \ldots, z_0$, the decision-maker +After observing $z_k, z_{k-1}, \ldots, z_1$, the decision-maker chooses among three distinct actions: - He decides that $f = f_0$ and draws no more $z$'s - He decides that $f = f_1$ and draws no more $z$'s -- He postpones deciding now and instead chooses to draw a +- He postpones deciding and instead chooses to draw $z_{k+1}$ @@ -369,8 +367,8 @@ Wald proceeds as follows. He defines -- $p_{0m} = f_0(z_0) \cdots f_0(z_m)$ -- $p_{1m} = f_1(z_0) \cdots f_1(z_m)$ +- $p_{0m} = f_0(z_1) \cdots f_0(z_m)$ +- $p_{1m} = f_1(z_1) \cdots f_1(z_m)$ - $L_{m} = \frac{p_{1m}}{p_{0m}}$ Here $\{L_m\}_{m=0}^\infty$ is a **likelihood ratio process**. @@ -477,11 +475,11 @@ Below is the algorithm for the simulation. 
for each distribution, compute the empirical type I error $\hat{\alpha}$ and type II error $\hat{\beta}$ with $$ -\hat{\alpha} = \frac{\text{\# of times reject } H_0 \text{ when } f_0 \text{ is true}}{\text{\# of replications with } f_0 \text{ true}} +\hat{\alpha} = \frac{\text{$\#$ of times reject } H_0 \text{ when } f_0 \text{ is true}}{\text{$\#$ of replications with } f_0 \text{ true}} $$ $$ -\hat{\beta} = \frac{\text{\# of times accept } H_0 \text{ when } f_1 \text{ is true}}{\text{\# of replications with } f_1 \text{ true}} +\hat{\beta} = \frac{\text{$\#$ of times accept } H_0 \text{ when } f_1 \text{ is true}}{\text{$\#$ of replications with } f_1 \text{ true}} $$ ```{code-cell} ipython3 @@ -1079,4 +1077,4 @@ We'll dig deeper into some of the ideas used here in the following earlier and l * The concept of **exchangeability**, which underlies much of statistical learning, is explored in depth in our {doc}`lecture on exchangeable random variables `. * For a deeper understanding of likelihood ratio processes and their role in frequentist and Bayesian statistical theories, see {doc}`likelihood_ratio_process`. * Building on that foundation, {doc}`likelihood_bayes` examines the role of likelihood ratio processes in **Bayesian learning**. -* Finally, {doc}`this later lecture ` revisits the subject discussed here and examines whether the frequentist decision rule that the Navy ordered the captain to use would perform better or worse than the sequential decision rule we've developed. +* Finally, {doc}`this later lecture ` revisits the subject discussed here and examines whether the frequentist decision rule that the Navy ordered the captain to use would perform better or worse than Abraham Wald's sequential decision rule.
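As a compact illustration of the empirical error rates $\hat{\alpha}$ and $\hat{\beta}$ defined above, here is a self-contained sketch of a sequential probability ratio test simulation; the Beta densities and the thresholds $A$ and $B$ are assumptions chosen for illustration, not the lecture's own settings.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1234)

# Assumed stand-ins for f0, f1 and for Wald's thresholds A > 1 > B
f0, f1 = beta(1, 1), beta(3, 1.2)
A, B = 20.0, 1 / 20.0

def sprt_accepts_H0(true_dist):
    "Run one SPRT path; return True if the test accepts H0 (f = f0)."
    L = 1.0
    while B < L < A:                        # keep sampling while inconclusive
        z = true_dist.rvs(random_state=rng)
        L *= f1.pdf(z) / f0.pdf(z)
    return L <= B                           # small L favors f0

N = 2_000
α_hat = sum(not sprt_accepts_H0(f0) for _ in range(N)) / N  # reject H0 when f0 true
β_hat = sum(sprt_accepts_H0(f1) for _ in range(N)) / N      # accept H0 when f1 true

print(α_hat, β_hat)  # both should be small, roughly bounded by 1/A and B
```

The while-loop is the "postpone deciding and draw another $z$" branch of the decision problem described above; the replication counts implement the $\hat{\alpha}$ and $\hat{\beta}$ formulas directly.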