From 9f22608486ae49a8aca02209f62b4af3f5604619 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 7 Aug 2025 04:13:18 +0000 Subject: [PATCH 1/5] Initial plan From 990bef466a7da1f70fe7c8fc522dc83086906060 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 7 Aug 2025 04:19:29 +0000 Subject: [PATCH 2/5] Fix heading capitalization in all intermediate lecture files according to style guide Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com> --- lectures/aiyagari.md | 2 +- lectures/ak2.md | 14 +++---- lectures/ar1_bayes.md | 4 +- lectures/ar1_turningpts.md | 14 +++---- lectures/back_prop.md | 12 +++--- lectures/bayes_nonconj.md | 22 +++++------ lectures/cake_eating_numerical.md | 12 +++--- lectures/cake_eating_problem.md | 20 +++++----- lectures/career.md | 2 +- lectures/cass_fiscal.md | 30 +++++++------- lectures/cass_fiscal_2.md | 8 ++-- lectures/cass_koopmans_1.md | 22 +++++------ lectures/cass_koopmans_2.md | 26 ++++++------ lectures/coleman_policy_iter.md | 8 ++-- lectures/cross_product_trick.md | 4 +- lectures/egm_policy_iter.md | 8 ++-- lectures/eig_circulant.md | 10 ++--- lectures/exchangeable.md | 16 ++++---- lectures/finite_markov.md | 34 ++++++++-------- lectures/ge_arrow.md | 30 +++++++------- lectures/harrison_kreps.md | 22 +++++------ lectures/hoist_failure.md | 18 ++++----- lectures/house_auction.md | 34 ++++++++-------- lectures/ifp.md | 14 +++---- lectures/ifp_advanced.md | 18 ++++----- lectures/imp_sample.md | 10 ++--- lectures/inventory_dynamics.md | 4 +- lectures/jv.md | 6 +-- lectures/kalman.md | 8 ++-- lectures/kalman_2.md | 8 ++-- lectures/kesten_processes.md | 20 +++++----- lectures/lagrangian_lqdp.md | 14 +++---- lectures/lake_model.md | 24 ++++++------ lectures/likelihood_bayes.md | 12 +++--- lectures/likelihood_ratio_process.md | 28 ++++++------- lectures/linear_algebra.md | 46 +++++++++++----------- 
lectures/linear_models.md | 50 ++++++++++++------------ lectures/lln_clt.md | 6 +-- lectures/lq_inventories.md | 4 +- lectures/lqcontrol.md | 26 ++++++------ lectures/markov_asset.md | 38 +++++++++--------- lectures/markov_perf.md | 16 ++++---- lectures/mccall_correlated.md | 6 +-- lectures/mccall_fitted_vfi.md | 6 +-- lectures/mccall_model.md | 18 ++++----- lectures/mccall_model_with_separation.md | 28 ++++++------- lectures/mccall_q.md | 12 +++--- lectures/mix_model.md | 12 +++--- lectures/mle.md | 14 +++---- lectures/multi_hyper.md | 10 ++--- lectures/multivariate_normal.md | 36 ++++++++--------- lectures/navy_captain.md | 14 +++---- lectures/newton_method.md | 28 ++++++------- lectures/odu.md | 22 +++++------ lectures/ols.md | 4 +- lectures/opt_transport.md | 18 ++++----- lectures/optgrowth.md | 28 ++++++------- lectures/optgrowth_fast.md | 4 +- lectures/pandas_panel.md | 8 ++-- lectures/perm_income.md | 30 +++++++------- lectures/perm_income_cons.md | 24 ++++++------ lectures/prob_matrix.md | 30 +++++++------- lectures/prob_meaning.md | 6 +-- lectures/qr_decomp.md | 14 +++---- lectures/rand_resp.md | 6 +-- lectures/rational_expectations.md | 40 +++++++++---------- lectures/re_with_feedback.md | 26 ++++++------ lectures/samuelson.md | 46 +++++++++++----------- lectures/sir_model.md | 10 ++--- lectures/stats_examples.md | 10 ++--- lectures/svd_intro.md | 18 ++++----- lectures/troubleshooting.md | 4 +- lectures/two_auctions.md | 22 +++++------ lectures/uncertainty_traps.md | 4 +- lectures/util_rand_resp.md | 28 ++++++------- lectures/var_dmd.md | 12 +++--- lectures/von_neumann_model.md | 14 +++---- lectures/wald_friedman.md | 12 +++--- lectures/wald_friedman_2.md | 12 +++--- lectures/wealth_dynamics.md | 14 +++---- 80 files changed, 687 insertions(+), 687 deletions(-) diff --git a/lectures/aiyagari.md b/lectures/aiyagari.md index e0e6a8dbd..32643f5e5 100644 --- a/lectures/aiyagari.md +++ b/lectures/aiyagari.md @@ -71,7 +71,7 @@ A textbook treatment is available 
in chapter 18 of {cite}`Ljungqvist2012`. A continuous time version of the model by SeHyoun Ahn and Benjamin Moll can be found [here](https://nbviewer.org/github/QuantEcon/QuantEcon.notebooks/blob/master/aiyagari_continuous_time.ipynb). -## The Economy +## The economy ### Households diff --git a/lectures/ak2.md b/lectures/ak2.md index a28398cfc..d1d9667e1 100644 --- a/lectures/ak2.md +++ b/lectures/ak2.md @@ -173,7 +173,7 @@ $$ -## Activities in Factor Markets +## Activities in factor markets **Old people:** At each $t \geq 0$, a representative old person @@ -196,7 +196,7 @@ If a lump-sum tax is negative, it means that the government pays the person a su ``` -## Representative firm's problem +## Representative firm's problem The representative firm hires labor services from young people at competitive wage rate $W_t$ and hires capital from old people at competitive rental rate $r_t$. @@ -319,7 +319,7 @@ $$ (eq:optsavingsplan) (sec-equilibrium)= -## Equilbrium +## Equilibrium **Definition:** An equilibrium is an allocation, a government policy, and a price system with the properties that * given the price system and the government policy, the allocation solves @@ -687,7 +687,7 @@ closed = ClosedFormTrans(α, β) ``` (exp-tax-cut)= -### Experiment 1: Tax cut +### Experiment 1: tax cut To illustrate the power of `ClosedFormTrans`, let's first experiment with the following fiscal policy change: @@ -788,7 +788,7 @@ for i, name in enumerate(['τ', 'D', 'G']): The economy with lower tax cut rate at $t=0$ has the same transitional pattern, but is less distorted, and it converges to a new steady state with higher physical capital stock. (exp-expen-cut)= -### Experiment 2: Government asset accumulation +### Experiment 2: government asset accumulation Assume that the economy is initially in the same steady state.
@@ -832,7 +832,7 @@ Although the consumptions in the new steady state are strictly higher, it is at ``` -### Experiment 3: Temporary expenditure cut +### Experiment 3: temporary expenditure cut Let's now investigate a scenario in which the government also cuts its spending by half and accumulates the asset. @@ -1207,7 +1207,7 @@ for i, name in enumerate(['τ', 'D', 'G']): Comparing to {ref}`exp-tax-cut`, the government raises lump-sum taxes to finance the increasing debt interest payment, which is less distortionary comparing to raising the capital income tax rate. -### Experiment 4: Unfunded Social Security System +### Experiment 4: unfunded social security system In this experiment, lump-sum taxes are of equal magnitudes for old and the young, but of opposite signs. diff --git a/lectures/ar1_bayes.md b/lectures/ar1_bayes.md index e553b38c8..441316d83 100644 --- a/lectures/ar1_bayes.md +++ b/lectures/ar1_bayes.md @@ -178,7 +178,7 @@ Now we shall use Bayes' law to construct a posterior distribution, conditioning First we'll use **pymc4**. -## PyMC Implementation +## PyMC implementation For a normal distribution in `pymc`, $var = 1/\tau = \sigma^{2}$. @@ -292,7 +292,7 @@ We'll return to this issue after we use `numpyro` to compute posteriors under ou We'll now repeat the calculations using `numpyro`. 
-## Numpyro Implementation +## Numpyro implementation ```{code-cell} ipython3 diff --git a/lectures/ar1_turningpts.md b/lectures/ar1_turningpts.md index 3aa55a9df..b2410100c 100644 --- a/lectures/ar1_turningpts.md +++ b/lectures/ar1_turningpts.md @@ -57,7 +57,7 @@ logger = logging.getLogger('pymc') logger.setLevel(logging.CRITICAL) ``` -## A Univariate First-Order Autoregressive Process +## A univariate first-order autoregressive process Consider the univariate AR(1) model: @@ -185,7 +185,7 @@ As functions of forecast horizon, the coverage intervals have shapes like those https://python.quantecon.org/perm_income_cons.html -## Predictive Distributions of Path Properties +## Predictive distributions of path properties Wecker {cite}`wecker1979predicting` proposed using simulation techniques to characterize predictive distribution of some statistics that are non-linear functions of $y$. @@ -280,7 +280,7 @@ This is designed to express the event Following {cite}`wecker1979predicting`, we can use simulations to calculate probabilities of $P_t$ and $N_t$ for each period $t$. -## A Wecker-Like Algorithm +## A Wecker-like algorithm The procedure consists of the following steps: @@ -297,7 +297,7 @@ $$ * consider the sets $\{W_t(\omega_i)\}^{T}_{i=1}, \ \{W_{t+1}(\omega_i)\}^{T}_{i=1}, \ \dots, \ \{W_{t+N}(\omega_i)\}^{T}_{i=1}$ as samples from the predictive distributions $f(W_{t+1} \mid \mathcal y_t, \dots)$, $f(W_{t+2} \mid y_t, y_{t-1}, \dots)$, $\dots$, $f(W_{t+N} \mid y_t, y_{t-1}, \dots)$. -## Using Simulations to Approximate a Posterior Distribution +## Using simulations to approximate a posterior distribution The next code cells use `pymc` to compute the time $t$ posterior distribution of $\rho, \sigma$. @@ -345,7 +345,7 @@ post_samples = draw_from_posterior(initial_path) The graphs on the left portray posterior marginal distributions.
-## Calculating Sample Path Statistics +## Calculating sample path statistics Our next step is to prepare Python code to compute our sample path statistics. @@ -404,7 +404,7 @@ def next_turning_point(omega): return up_turn, down_turn ``` -## Original Wecker Method +## Original Wecker method Now we apply Wecker's original method by simulating future paths and compute predictive distributions, conditioning on the true parameters associated with the data-generating model. @@ -470,7 +470,7 @@ plot_Wecker(initial_path, 1000, ax) plt.show() ``` -## Extended Wecker Method +## Extended Wecker method Now we apply we apply our "extended" Wecker method based on predictive densities of $y$ defined by {eq}`ar1-tp-eq4` that acknowledge posterior uncertainty in the parameters $\rho, \sigma$. diff --git a/lectures/back_prop.md b/lectures/back_prop.md index 6ce9a6dab..935e87095 100644 --- a/lectures/back_prop.md +++ b/lectures/back_prop.md @@ -24,7 +24,7 @@ kernelspec: ```{code-cell} ipython3 import jax -## to check that gpu is activated in environment +## To check that gpu is activated in environment print(f"JAX backend: {jax.devices()[0].platform}") ``` @@ -64,7 +64,7 @@ We'll describe the following concepts that are brick and mortar for neural netwo * back-propagation and its relationship to the chain rule of differential calculus -## A Deep (but not Wide) Artificial Neural Network +## A deep (but not wide) artificial neural network We describe a "deep" neural network of "width" one. @@ -145,7 +145,7 @@ starting from $x_1 = \tilde x$. The value of $x_{N+1}$ that emerges from this iterative scheme equals $\hat f(\tilde x)$. -## Calibrating Parameters +## Calibrating parameters We now consider a neural network like the one describe above with width 1, depth $N$, and activation functions $h_{i}$ for $1\leqslant i\leqslant N$ that map $\mathbb{R}$ into itself. 
@@ -203,7 +203,7 @@ To implement one step of this parameter update rule, we want the vector of deri In the neural network literature, this step is accomplished by what is known as **back propagation**. -## Back Propagation and the Chain Rule +## Back propagation and the chain rule Thanks to properties of @@ -304,7 +304,7 @@ We can then solve the above problem by applying our update for $p$ multiple time -## Training Set +## Training set Choosing a training set amounts to a choice of measure $\mu$ in the above formulation of our function approximation problem as a minimization problem. @@ -530,7 +530,7 @@ Image(fig.to_image(format="png")) # notebook locally ``` -## How Deep? +## How deep? It is fun to think about how deepening the neural net for the above example affects the quality of approximation diff --git a/lectures/bayes_nonconj.md b/lectures/bayes_nonconj.md index 6d586e4e3..ed8a8b837 100644 --- a/lectures/bayes_nonconj.md +++ b/lectures/bayes_nonconj.md @@ -83,7 +83,7 @@ from numpyro.infer import Trace_ELBO as nTrace_ELBO from numpyro.optim import Adam as nAdam ``` -## Unleashing MCMC on a Binomial Likelihood +## Unleashing MCMC on a binomial likelihood This lecture begins with the binomial example in the {doc}`quantecon lecture `. @@ -103,7 +103,7 @@ We use several alternative prior distributions We compare computed posteriors with ones associated with a conjugate prior as described in {doc}`the quantecon lecture ` -### Analytical Posterior +### Analytical posterior Assume that the random variable $X\sim Binom\left(n,\theta\right)$. @@ -183,7 +183,7 @@ def analytical_beta_posterior(data, alpha0, beta0): return st.beta(alpha0 + up_num, beta0 + down_num) ``` -### Two Ways to Approximate Posteriors +### Two ways to approximate posteriors Suppose that we don't have a conjugate prior. 
@@ -215,7 +215,7 @@ a Kullback-Leibler (KL) divergence between true posterior and the putatitive pos - minimizing the KL divergence is equivalent with maximizing a criterion called the **Evidence Lower Bound** (ELBO), as we shall verify soon. -## Prior Distributions +## Prior distributions In order to be able to apply MCMC sampling or VI, `Pyro` and `Numpyro` require that a prior distribution satisfy special properties: @@ -323,7 +323,7 @@ class TruncatedvonMises(dist.Rejector): return constraints.interval(self.low, self.upp) ``` -### Variational Inference +### Variational inference Instead of directly sampling from the posterior, the **variational inference** methodw approximates an unknown posterior distribution with a family of tractable distributions/densities. @@ -683,7 +683,7 @@ class BayesianInference: return params, losses ``` -## Alternative Prior Distributions +## Alternative prior distributions Let's see how well our sampling algorithm does in approximating @@ -731,7 +731,7 @@ exampleLP.show_prior(size=100000,bins=40) Having assured ourselves that our sampler seems to do a good job, let's put it to work in using MCMC to compute posterior probabilities. -## Posteriors Via MCMC and VI +## Posteriors via MCMC and VI We construct a class `BayesianInferencePlot` to implement MCMC or VI algorithms and plot multiple posteriors for different updating data sizes and different possible prior. @@ -884,7 +884,7 @@ SVI_num_steps = 5000 true_theta = 0.8 ``` -### Beta Prior and Posteriors: +### Beta prior and posteriors: Let's compare outcomes when we use a Beta prior. @@ -953,7 +953,7 @@ will be more accurate, as we shall see next. 
BayesianInferencePlot(true_theta, num_list, BETA_numpyro).SVI_plot(guide_dist='beta', n_steps=100000) ``` -## Non-conjugate Prior Distributions +## Non-conjugate prior distributions Having assured ourselves that our MCMC and VI methods can work well when we have conjugate prior and so can also compute analytically, we next proceed to situations in which our prior is not a beta distribution, so we don't have a conjugate prior. @@ -1040,7 +1040,7 @@ To get more accuracy we will now increase the number of steps for Variational In SVI_num_steps = 50000 ``` -#### VI with a Truncated Normal Guide +#### VI with a truncated normal guide ```{code-cell} ipython3 # Uniform @@ -1071,7 +1071,7 @@ print(f'=======INFO=======\nParameters: {example_CLASS.param}\nPrior Dist: {exam BayesianInferencePlot(true_theta, num_list, example_CLASS).SVI_plot(guide_dist='normal', n_steps=SVI_num_steps) ``` -#### Variational Inference with a Beta Guide Distribution +#### Variational inference with a Beta guide distribution ```{code-cell} ipython3 # Uniform diff --git a/lectures/cake_eating_numerical.md b/lectures/cake_eating_numerical.md index 3a0c2c3aa..39917114e 100644 --- a/lectures/cake_eating_numerical.md +++ b/lectures/cake_eating_numerical.md @@ -42,7 +42,7 @@ import numpy as np from scipy.optimize import minimize_scalar, bisect ``` -## Reviewing the Model +## Reviewing the model You might like to {doc}`review the details ` before we start. @@ -66,7 +66,7 @@ to be as follows. Our first aim is to obtain these analytical solutions numerically. -## Value Function Iteration +## Value function iteration The first approach we will take is **value function iteration**. @@ -86,7 +86,7 @@ The basic idea is: Let's write this a bit more mathematically. 
-### The Bellman Operator +### The Bellman operator We introduce the **Bellman operator** $T$ that takes a function v as an argument and returns a new function $Tv$ defined by @@ -105,7 +105,7 @@ As we discuss in more detail in later lectures, one can use Banach's contraction mapping theorem to prove that the sequence of functions $T^n v$ converges to the solution to the Bellman equation. -### Fitted Value Function Iteration +### Fitted value function iteration Both consumption $c$ and the state variable $x$ are continuous. @@ -338,7 +338,7 @@ less so near the lower boundary. The reason is that the utility function and hence value function is very steep near the lower boundary, and hence hard to approximate. -### Policy Function +### Policy function Let's see how this plays out in terms of computing the optimal policy. @@ -419,7 +419,7 @@ possibility of faster compute time and, at the same time, more accuracy. We explore this next. -## Time Iteration +## Time iteration Now let's look at a different strategy to compute the optimal policy. diff --git a/lectures/cake_eating_problem.md b/lectures/cake_eating_problem.md index a4627f706..92021e21b 100644 --- a/lectures/cake_eating_problem.md +++ b/lectures/cake_eating_problem.md @@ -45,7 +45,7 @@ plt.rcParams["figure.figsize"] = (11, 5) #set default figure size import numpy as np ``` -## The Model +## The model We consider an infinite time horizon $t=0, 1, 2, 3..$ @@ -115,7 +115,7 @@ In this problem, the following terminology is standard: * $c_t$ is called the **control variable** or the **action** * $\beta$ and $\gamma$ are **parameters** -### Trade-Off +### Trade-off The key trade-off in the cake-eating problem is this: @@ -145,14 +145,14 @@ parameters*. Let's see if this is true. -## The Value Function +## The value function The first step of our dynamic programming treatment is to obtain the Bellman equation. The next step is to use it to calculate the solution. 
-### The Bellman Equation +### The Bellman equation To this end, we let $v(x)$ be maximum lifetime utility attainable from the current time when $x$ units of cake are left. @@ -199,7 +199,7 @@ If $c$ is chosen optimally using this trade off strategy, then we obtain maximal Hence, $v(x)$ equals the right hand side of {eq}`bellman-cep`, as claimed. -### An Analytical Solution +### An analytical solution It has been shown that, with $u$ as the CRRA utility function in {eq}`crra_utility`, the function @@ -249,7 +249,7 @@ ax.legend(fontsize=12) plt.show() ``` -## The Optimal Policy +## The optimal policy Now that we have the value function, it is straightforward to calculate the optimal action at each state. @@ -309,7 +309,7 @@ ax.legend() plt.show() ``` -## The Euler Equation +## The Euler equation In the discussion above we have provided a complete solution to the cake eating problem in the case of CRRA utility. @@ -323,7 +323,7 @@ Euler equation. This is because, for more difficult problems, this equation provides key insights that are hard to obtain by other methods. -### Statement and Implications +### Statement and implications The Euler equation for the present problem can be stated as @@ -376,7 +376,7 @@ see proposition 2.2 of {cite}`ma2020income`. The following arguments focus on necessity, explaining why an optimal path or policy should satisfy the Euler equation. -### Derivation I: A Perturbation Approach +### Derivation I: a perturbation approach Let's write $c$ as a shorthand for consumption path $\{c_t\}_{t=0}^\infty$. @@ -444,7 +444,7 @@ $$ This is just the Euler equation. -### Derivation II: Using the Bellman Equation +### Derivation II: using the Bellman equation Another way to derive the Euler equation is to use the Bellman equation {eq}`bellman-cep`. 
diff --git a/lectures/career.md b/lectures/career.md index e2446cad1..0bce55500 100644 --- a/lectures/career.md +++ b/lectures/career.md @@ -58,7 +58,7 @@ from mpl_toolkits.mplot3d.axes3d import Axes3D from matplotlib import cm ``` -### Model Features +### Model features * Career and job within career both chosen to maximize expected discounted wage flow. * Infinite horizon dynamic programming with two state variables. diff --git a/lectures/cass_fiscal.md b/lectures/cass_fiscal.md index fdf5c274d..bda741dce 100644 --- a/lectures/cass_fiscal.md +++ b/lectures/cass_fiscal.md @@ -36,7 +36,7 @@ We present two ways to approximate an equilibrium: (cs_fs_model)= -## The Economy +## The economy ### Technology @@ -109,7 +109,7 @@ In the [experiment section](cf:experiments), we shall see how variations in gove the transition path and equilibrium. -### Representative Household +### Representative household A representative household has preferences over nonnegative streams of a single consumption good $c_t$ and leisure $1-n_t$ that are ordered by: @@ -135,7 +135,7 @@ Here we have assumed that the government gives a depreciation allowance $\delta from the gross rentals on capital $\eta_t k_t$ and so collects taxes $\tau_{kt} (\eta_t - \delta) k_t$ on rentals from capital. -### Government +### Government Government plans $\{ g_t \}_{t=0}^\infty$ for government purchases and taxes $\{\tau_{ct}, \tau_{kt}, \tau_{nt}, \tau_{ht}\}_{t=0}^\infty$ must respect the budget constraint @@ -166,7 +166,7 @@ A **competitive equilibrium with distorting taxes** is a **budget-feasible gover policy, the allocation solves the household's problem and the firm's problem. ``` -## No-arbitrage Condition +## No-arbitrage condition A no-arbitrage argument implies a restriction on prices and tax rates across time. @@ -229,7 +229,7 @@ $$ \eta_t = F_{kt}, \quad w_t = F_{nt}. 
$$(eq:no_arb_firms) -## Household's First Order Condition +## Household's first order condition Household maximize {eq}`eq:utility` under {eq}`eq:house_budget`. @@ -272,7 +272,7 @@ $$ -\lim_{T \to \infty} \beta^T \frac{U_{1T}}{(1 + \tau_{cT})} k_{T+1} = 0. $$ (eq:terminal_final) -## Computing Equilibria +## Computing equilibria To compute an equilibrium, we seek a price system $\{q_t, \eta_t, w_t\}$, a budget feasible government policy $\{g_t, \tau_t\} \equiv \{g_t, \tau_{ct}, \tau_{nt}, \tau_{kt}, \tau_{ht}\}$, and an allocation $\{c_t, n_t, k_{t+1}\}$ that solve a system of nonlinear difference equations consisting of @@ -280,7 +280,7 @@ To compute an equilibrium, we seek a price system $\{q_t, \eta_t, w_t\}$, a bu - an initial condition $k_0$ and a terminal condition {eq}`eq:terminal_final`. (cass_fiscal_shooting)= -## Python Code +## Python code We require the following imports @@ -328,7 +328,7 @@ model = create_model() S = 100 ``` -### Inelastic Labor Supply +### Inelastic labor supply In this lecture, we consider the special case where $U(c, 1-n) = u(c)$ and $f(k) := F(k, 1)$. @@ -595,7 +595,7 @@ We describe two ways to compute an equilibrium: * a shooting algorithm * a residual-minimization method that focuses on imposing Euler equation {eq}`eq:diff_second` and the feasibility condition {eq}`eq:feasi_capital`. -### Shooting Algorithm +### Shooting algorithm This algorithm deploys the following steps. 
@@ -1205,7 +1205,7 @@ The figure indicates how: +++ -### Method 2: Residual Minimization +### Method 2: residual minimization The second method involves minimizing residuals (i.e., deviations from equalities) of the following equations: @@ -1522,7 +1522,7 @@ def compute_A_path(A0, shocks, S=100): return A_path ``` -### Inelastic Labor Supply +### Inelastic labor supply By linear homogeneity, the production function can be expressed as @@ -1580,7 +1580,7 @@ $$ c_{t+1} = c_t \left[ \beta \bar{R}_{t+1} \right]^{\frac{1}{\gamma}}\mu_{t+1}^{-1} $$ (eq:consume_r_mod) -### Steady State +### Steady state In a steady state, $c_{t+1} = c_t$. Then {eq}`eq:diff_mod` becomes @@ -1609,7 +1609,7 @@ $$ Since the algorithm and plotting routines are the same as before, we include the steady-state calculations and shooting routine in the section {ref}`cass_fiscal_shooting`. -### Shooting Algorithm +### Shooting algorithm Now we can apply the shooting algorithm to compute equilibrium. We augment the vector of shock variables by including $\mu_t$, then proceed as before. @@ -1622,7 +1622,7 @@ Let's run some experiments: +++ -#### Experiment 1: A foreseen increase in $\mu$ from 1.02 to 1.025 at t=10 +#### Experiment 1: a foreseen increase in $\mu$ from 1.02 to 1.025 at t=10 The figures below show the effects of a permanent increase in productivity growth $\mu$ from 1.02 to 1.025 at t=10. @@ -1679,7 +1679,7 @@ $\bar R$. - Perfect foresight makes the effects of the increase in the growth of capital precede it, with the effect visible at $t=0$. -#### Experiment 2: An unforeseen increase in $\mu$ from 1.02 to 1.025 at t=0 +#### Experiment 2: an unforeseen increase in $\mu$ from 1.02 to 1.025 at t=0 The figures below show the effects of an immediate jump in $\mu$ to 1.025 at t=0. 
diff --git a/lectures/cass_fiscal_2.md b/lectures/cass_fiscal_2.md index 3f396e500..b340037b0 100644 --- a/lectures/cass_fiscal_2.md +++ b/lectures/cass_fiscal_2.md @@ -41,7 +41,7 @@ mp.dps = 40 mp.pretty = True ``` -## A Two-Country Cass-Koopmans Model +## A two-country Cass-Koopmans model This section describes a two-country version of the basic model of {ref}`cs_fs_model`. @@ -76,7 +76,7 @@ Later, we will use this constraint as a global feasibility constraint in our com To connect the two countries, we need to specify how capital flows across borders and how taxes are levied in different jurisdictions. -### Capital Mobility and Taxation +### Capital mobility and taxation A consumer in country one can hold capital in either country but pays taxes on rentals from foreign holdings of capital at the rate set by the foreign country. @@ -430,7 +430,7 @@ def compute_η_path(k_path, model, S=100, A_path=None): return η_path ``` -#### Experiment 1: A foreseen increase in $g$ from 0.2 to 0.4 at t=10 +#### Experiment 1: a foreseen increase in $g$ from 0.2 to 0.4 at t=10 The figure below presents transition dynamics after an increase in $g$ in the domestic economy from 0.2 to 0.4 that is announced ten periods in advance. @@ -494,7 +494,7 @@ The domestic economy, in turn, starts running current-account deficits partially This means that foreign households begin repaying part of their external debt by reducing their capital stock. -#### Experiment 2: A foreseen increase in $g$ from 0.2 to 0.4 at t=10 +#### Experiment 2: a foreseen increase in $g$ from 0.2 to 0.4 at t=10 We now explore the impact of an increase in capital taxation in the domestic economy $10$ periods after its announcement at $t = 1$.
diff --git a/lectures/cass_koopmans_1.md b/lectures/cass_koopmans_1.md index 30b9cbdb0..a23e9dd77 100644 --- a/lectures/cass_koopmans_1.md +++ b/lectures/cass_koopmans_1.md @@ -79,7 +79,7 @@ import numpy as np from quantecon.optimize import brentq ``` -## The Model +## The model Time is discrete and takes values $t = 0, 1 , \ldots, T$ where $T$ is finite. @@ -99,7 +99,7 @@ Let $K_t$ be the stock of physical capital at time $t$. Let $\vec{C}$ = $\{C_0,\dots, C_T\}$ and $\vec{K}$ = $\{K_0,\dots,K_{T+1}\}$. -### Digression: Aggregation Theory +### Digression: aggregation theory We use a concept of a representative consumer to be thought of as follows. @@ -151,7 +151,7 @@ It appears often in aggregate economics. We shall use this aggregation theory here and also in this lecture {doc}`Cass-Koopmans Competitive Equilibrium `. -#### An Economy +#### An economy A representative household is endowed with one unit of labor at each @@ -213,7 +213,7 @@ C_t + K_{t+1} \leq F(K_t,N_t) + (1-\delta) K_t \quad \text{for all } t \in \{0, where $\delta \in (0,1)$ is a depreciation rate of capital. -## Planning Problem +## Planning problem A planner chooses an allocation $\{\vec{C},\vec{K}\}$ to maximize {eq}`utility-functional` subject to {eq}`allocation`. @@ -247,7 +247,7 @@ and pose the following min-max problem: Before computing first-order conditions, we present some handy formulas. -### Useful Properties of Linearly Homogeneous Production Function +### Useful properties of linearly homogeneous production function The following technicalities will help us. 
@@ -474,7 +474,7 @@ We can construct an economy with the Python code: pp = PlanningProblem() ``` -## Shooting Algorithm +## Shooting algorithm We use **shooting** to compute an optimal allocation $\vec{C}, \vec{K}$ and an associated Lagrange multiplier sequence @@ -688,7 +688,7 @@ Now we can solve the model and plot the paths of consumption, capital, and Lagra plot_paths(pp, 0.3, 0.3, [10]); ``` -## Setting Initial Capital to Steady State Capital +## Setting initial capital to steady state capital When $T \rightarrow +\infty$, the optimal allocation converges to steady state values of $C_t$ and $K_t$. @@ -782,7 +782,7 @@ The following graphs compare optimal outcomes as we vary $T$. plot_paths(pp, 0.3, k_ss/3, [150, 75, 50, 25], k_ss=k_ss); ``` -## A Turnpike Property +## A turnpike property The following calculation indicates that when $T$ is very large, the optimal capital stock stays close to @@ -910,7 +910,7 @@ def plot_saving_rate(pp, c0, k0, T_arr, k_ter=0, k_ss=None, s_ss=None): plot_saving_rate(pp, 0.3, k_ss/3, [250, 150, 75, 50], k_ss=k_ss) ``` -## A Limiting Infinite Horizon Economy +## A limiting infinite horizon economy We want to set $T = +\infty$. @@ -964,7 +964,7 @@ The planner slowly lowers the saving rate until reaching a steady state in which $f'(K)=\rho +\delta$. -## Stable Manifold and Phase Diagram +## Stable manifold and phase diagram We now describe a classic diagram that describes an optimal $(K_{t+1}, C_t)$ path. @@ -1132,7 +1132,7 @@ ax.set_ylabel('$C$') plt.show() ``` -## Concluding Remarks +## Concluding remarks In {doc}`Cass-Koopmans Competitive Equilibrium `, we study a decentralized version of an economy with exactly the same technology and preference structure as deployed here. 
diff --git a/lectures/cass_koopmans_2.md b/lectures/cass_koopmans_2.md index b8ce49d48..53a9c41dc 100644 --- a/lectures/cass_koopmans_2.md +++ b/lectures/cass_koopmans_2.md @@ -73,7 +73,7 @@ from numba.experimental import jitclass import numpy as np ``` -## Review of Cass-Koopmans Model +## Review of Cass-Koopmans model The physical setting is identical with that in {doc}`Cass-Koopmans Planning Model `. @@ -125,14 +125,14 @@ $$ where $\delta \in (0,1)$ is a depreciation rate of capital. -### Planning Problem +### Planning problem In this lecture {doc}`Cass-Koopmans Planning Model `, we studied a problem in which a planner chooses an allocation $\{\vec{C},\vec{K}\}$ to maximize {eq}`utility-functional` subject to {eq}`allocation`. The allocation that solves the planning problem reappears in a competitive equilibrium, as we shall see below. -## Competitive Equilibrium +## Competitive equilibrium We now study a decentralized version of the economy. @@ -178,7 +178,7 @@ Again, we can think of there being unit measures of identical representative co identical representative firms. ``` -## Market Structure +## Market structure The representative household and the representative firm are both price takers. @@ -219,7 +219,7 @@ $$ In this case, we would be taking the time $0$ consumption good to be the **numeraire**. -## Firm Problem +## Firm problem At time $t$ a representative firm hires labor $\tilde n_t$ and capital $\tilde k_t$. @@ -239,7 +239,7 @@ $$ F(\tilde k_t, \tilde n_t) = A \tilde k_t^\alpha \tilde n_t^{1-\alpha} $$ -### Zero Profit Conditions +### Zero profit conditions Zero-profits conditions for capital and labor are @@ -316,7 +316,7 @@ the firm would want to set $\tilde k_t$ to zero, which is not feasible. It is convenient to define $\vec{w} =\{w_0, \dots,w_T\}$ and $\vec{\eta}= \{\eta_0, \dots, \eta_T\}$. -## Household Problem +## Household problem A representative household lives at $t=0,1,\dots, T$.
@@ -402,7 +402,7 @@ The vision here is that an equilibrium price system and allocation are determine In effect, we imagine that all trades occur just before time $0$. -## Computing a Competitive Equilibrium +## Computing a competitive equilibrium We compute a competitive equilibrium by using a **guess and verify** approach. @@ -412,7 +412,7 @@ verify** approach. - We then **verify** that at those prices, the household and the firm choose the same allocation. -### Guess for Price System +### Guess for price system In this lecture {doc}`Cass-Koopmans Planning Model `, we computed an allocation $\{\vec{C}, \vec{K}, \vec{N}\}$ that solves a planning problem. @@ -500,7 +500,7 @@ the planning problem: k^*_t = \tilde k^*_t=K_t, \tilde n_t=1, c^*_t=C_t ``` -### Verification Procedure +### Verification procedure Our approach is firsts to stare at first-order necessary conditions for optimization problems of the household and the firm. @@ -625,7 +625,7 @@ Thus, at our guess of the equilibrium price system, the allocation that solves the planning problem also solves the problem faced by a representative household living in a competitive equilibrium. -### Representative Firm's Problem +### Representative firm's problem We now turn to the problem faced by a firm in a competitive equilibrium: @@ -880,7 +880,7 @@ plt.tight_layout() plt.show() ``` -#### Varying Curvature +#### Varying curvature Now we see how our results change if we keep $T$ constant, but allow the curvature parameter, $\gamma$ to vary, starting @@ -926,7 +926,7 @@ resulting in slower convergence to a steady state allocation. Lower $\gamma$ means individuals prefer to smooth less, resulting in faster convergence to a steady state allocation. -## Yield Curves and Hicks-Arrow Prices +## Yield curves and Hicks-Arrow prices We return to Hicks-Arrow prices and calculate how they are related to **yields** on loans of alternative maturities.
diff --git a/lectures/coleman_policy_iter.md b/lectures/coleman_policy_iter.md index 7bec1c5e0..2198e0d01 100644 --- a/lectures/coleman_policy_iter.md +++ b/lectures/coleman_policy_iter.md @@ -66,7 +66,7 @@ from quantecon.optimize import brentq from numba import jit ``` -## The Euler Equation +## The Euler equation Our first step is to derive the Euler equation, which is a generalization of the Euler equation we obtained in the {doc}`lecture on cake eating `. @@ -157,7 +157,7 @@ over interior consumption policies $\sigma$, one solution of which is the optima Our aim is to solve the functional equation {eq}`cpi_euler_func` and hence obtain $\sigma^*$. -### The Coleman-Reffett Operator +### The Coleman-Reffett operator Recall the Bellman operator @@ -211,7 +211,7 @@ $$ In view of the Euler equation, this is exactly $\sigma^*(y)$. -### Is the Coleman-Reffett Operator Well Defined? +### Is the Coleman-Reffett operator well defined? In particular, is there always a unique $c \in (0, y)$ that solves {eq}`cpi_coledef`? @@ -233,7 +233,7 @@ Sketching these curves and using the information above will convince you that th With a bit more analysis, one can show in addition that $K \sigma \in \mathscr P$ whenever $\sigma \in \mathscr P$. -### Comparison with VFI (Theory) +### Comparison with VFI (theory) It is possible to prove that there is a tight relationship between iterates of $K$ and iterates of the Bellman operator. diff --git a/lectures/cross_product_trick.md b/lectures/cross_product_trick.md index 7824c6c3a..9728fe6bd 100644 --- a/lectures/cross_product_trick.md +++ b/lectures/cross_product_trick.md @@ -30,7 +30,7 @@ For a linear-quadratic dynamic programming problem, the idea involves these step +++ -## Undiscounted Dynamic Programming Problem +## Undiscounted dynamic programming problem Here is a nonstochastic undiscounted LQ dynamic programming with cross products between states and controls in the objective function. @@ -89,7 +89,7 @@ F & = F^* + Q^{-1} H.
+++ -## Kalman Filter +## Kalman filter The **duality** that prevails between a linear-quadratic optimal control and a Kalman filtering problem means that there is an analogous transformation that allows us to transform a Kalman filtering problem with non-zero covariance matrix between between shocks to states and shocks to measurements to an equivalent Kalman filtering problem with zero covariance between shocks to states and measurments. diff --git a/lectures/egm_policy_iter.md b/lectures/egm_policy_iter.md index 2250cb520..a78f354d6 100644 --- a/lectures/egm_policy_iter.md +++ b/lectures/egm_policy_iter.md @@ -47,7 +47,7 @@ import numpy as np from numba import jit ``` -## Key Idea +## Key idea Let's start by reminding ourselves of the theory and then see how the numerics fit in. @@ -77,7 +77,7 @@ u'(c) = \beta \int (u' \circ \sigma) (f(y - c) z ) f'(y - c) z \phi(dz) ``` -### Exogenous Grid +### Exogenous grid As discussed in {doc}`the lecture on time iteration `, to implement the method on a computer, we need a numerical approximation. @@ -97,7 +97,7 @@ Thus, with the points $\{y_i, c_i\}$ in hand, we can reconstruct $K \sigma$ via Iteration then continues... -### Endogenous Grid +### Endogenous grid The method discussed above requires a root-finding routine to find the $c_i$ corresponding to a given income value $y_i$. @@ -156,7 +156,7 @@ We reuse the `OptimalGrowthModel` class :load: _static/lecture_specific/optgrowth_fast/ogm.py ``` -### The Operator +### The operator Here's an implementation of $K$ using EGM as described above. 
diff --git a/lectures/eig_circulant.md b/lectures/eig_circulant.md index 270aa6046..06e9c7bee 100644 --- a/lectures/eig_circulant.md +++ b/lectures/eig_circulant.md @@ -39,7 +39,7 @@ import matplotlib.pyplot as plt np.set_printoptions(precision=3, suppress=True) ``` -## Constructing a Circulant Matrix +## Constructing a circulant matrix To construct an $N \times N$ circulant matrix, we need only the first row, say, @@ -86,7 +86,7 @@ def construct_cirlulant(row): construct_cirlulant(np.array([1., 2., 3.])) ``` -### Some Properties of Circulant Matrices +### Some properties of circulant matrices Here are some useful properties: @@ -126,7 +126,7 @@ where $C^T$ is the transpose of the circulant matrix defined in equation {eq}`e -## Connection to Permutation Matrix +## Connection to permutation matrix A good way to construct a circulant matrix is to use a **permutation matrix**. @@ -346,7 +346,7 @@ for j in range(8): diff_arr ``` -## Associated Permutation Matrix +## Associated permutation matrix Next, we execute calculations to verify that the circulant matrix $C$ defined in equation {eq}`eqn:circulant` can be written as @@ -426,7 +426,7 @@ for j in range(8): print(diff) ``` -## Discrete Fourier Transform +## Discrete Fourier transform The **Discrete Fourier Transform** (DFT) allows us to represent a discrete time sequence as a weighted sum of complex sinusoids. diff --git a/lectures/exchangeable.md b/lectures/exchangeable.md index 4a1762f4a..946e958a1 100644 --- a/lectures/exchangeable.md +++ b/lectures/exchangeable.md @@ -79,7 +79,7 @@ from scipy.integrate import quad import numpy as np ``` -## Independently and Identically Distributed +## Independently and identically distributed We begin by looking at the notion of an **independently and identically distributed sequence** of random variables. @@ -108,7 +108,7 @@ $$ so that the joint density is the product of a sequence of identical marginal densities.
-### IID Means Past Observations Don't Tell Us Anything About Future Observations +### IID means past observations don't tell us anything about future observations If a sequence is random variables is IID, past information provides no information about future realizations. @@ -154,7 +154,7 @@ We turn next to an instance of the general case in which the sequence is not IID Please watch for what can be learned from the past and when. -## A Setting in Which Past Observations Are Informative +## A setting in which past observations are informative Let $\{W_t\}_{t=0}^\infty$ be a sequence of nonnegative scalar random variables with a joint probability distribution @@ -201,7 +201,7 @@ To proceed, we want to know the decision maker's belief about the joint distribu We'll discuss that next and in the process describe the concept of **exchangeability**. -## Relationship Between IID and Exchangeable +## Relationship between IID and exchangeable Conditional on nature selecting $F$, the joint density of the sequence $W_0, W_1, \ldots$ is @@ -288,7 +288,7 @@ sequences of IID Bernoulli random variables with parameter $\theta \in (0,1)$ an Bernoulli parameter $\theta$. ``` -## Bayes' Law +## Bayes' law We noted above that in our example model there is something to learn about about the future from past data drawn from our particular instance of a process that is exchangeable but not IID.
@@ -357,7 +357,7 @@ $$ \mathbb{P}\{W = w\} = \sum_{a \in \{f, g\}} \mathbb{P}\{W = w \,|\, q = a \} \mathbb{P}\{q = a \} $$ -## More Details about Bayesian Updating +## More details about Bayesian updating Let's stare at and rearrange Bayes' Law as represented in equation {eq}`eq_Bayes102` with the aim of understanding how the **posterior** probability $\pi_{t+1}$ is influenced by the **prior** probability $\pi_t$ and the **likelihood ratio** @@ -540,7 +540,7 @@ Notice how the likelihood ratio, the middle graph, and the arrows compare with t ## Appendix -### Sample Paths of $\pi_t$ +### Sample paths of $\pi_t$ Now we'll have some fun by plotting multiple realizations of sample paths of $\pi_t$ under two possible assumptions about nature's choice of distribution, namely @@ -657,7 +657,7 @@ plt.title("convergence"); From the above graph, rates of convergence appear not to depend on whether $F$ or $G$ generates the data. -### Graph of Ensemble Dynamics of $\pi_t$ +### Graph of ensemble dynamics of $\pi_t$ More insights about the dynamics of $\{\pi_t\}$ can be gleaned by computing conditional expectations of $\frac{\pi_{t+1}}{\pi_{t}}$ as functions of $\pi_t$ via integration with respect diff --git a/lectures/finite_markov.md b/lectures/finite_markov.md index 924358069..53e5aa8d4 100644 --- a/lectures/finite_markov.md +++ b/lectures/finite_markov.md @@ -64,7 +64,7 @@ from mpl_toolkits.mplot3d import Axes3D The following concepts are fundamental. (finite_dp_stoch_mat)= -### {index}`Stochastic Matrices ` +### {index}`Stochastic matrices ` ```{index} single: Finite Markov Chains; Stochastic Matrices ``` @@ -79,7 +79,7 @@ Each row of $P$ can be regarded as a probability mass function over $n$ possible It is too not difficult to check [^pm] that if $P$ is a stochastic matrix, then so is the $k$-th power $P^k$ for all $k \in \mathbb N$.
-### {index}`Markov Chains ` +### {index}`Markov chains ` ```{index} single: Finite Markov Chains ``` @@ -221,7 +221,7 @@ However, it's also a good exercise to roll our own routines --- let's do that fi In these exercises, we'll take the state space to be $S = 0,\ldots, n-1$. -### Rolling Our Own +### Rolling our own To simulate a Markov chain, we need its stochastic matrix $P$ and a marginal probability distribution $\psi$ from which to draw a realization of $X_0$. @@ -293,7 +293,7 @@ np.mean(X == 0) You can try changing the initial distribution to confirm that the output is always close to 0.25, at least for the `P` matrix above. -### Using QuantEcon's Routines +### Using QuantEcon's routines As discussed above, [QuantEcon.py](http://quantecon.org/quantecon-py) has routines for handling Markov chains, including simulation. @@ -317,7 +317,7 @@ The [QuantEcon.py](http://quantecon.org/quantecon-py) routine is [JIT compiled]( %time mc.simulate(ts_length=1_000_000) # qe code version ``` -#### Adding State Values and Initial Conditions +#### Adding state values and initial conditions If we wish to, we can provide a specification of state values to `MarkovChain`. @@ -345,7 +345,7 @@ mc.simulate_indices(ts_length=4) ``` (mc_md)= -## {index}`Marginal Distributions ` +## {index}`Marginal distributions ` ```{index} single: Markov Chains; Marginal Distributions ``` @@ -417,7 +417,7 @@ X_t \sim \psi_t \quad \implies \quad X_{t+m} \sim \psi_t P^m ``` (finite_mc_mstp)= -### Multiple Step Transition Probabilities +### Multiple step transition probabilities We know that the probability of transitioning from $x$ to $y$ in one step is $P(x,y)$.
@@ -438,7 +438,7 @@ $$ \mathbb P \{X_{t+m} = y \,|\, X_t = x \} = P^m(x, y) = (x, y) \text{-th element of } P^m $$ -### Example: Probability of Recession +### Example: probability of recession ```{index} single: Markov Chains; Future Probabilities ``` @@ -464,7 +464,7 @@ $$ $$ (mc_eg1-1)= -### Example 2: Cross-Sectional Distributions +### Example 2: cross-sectional distributions ```{index} single: Markov Chains; Cross-Sectional Distributions ``` @@ -501,7 +501,7 @@ each state. This is exactly the cross-sectional distribution. -## {index}`Irreducibility and Aperiodicity ` +## {index}`Irreducibility and aperiodicity ` ```{index} single: Markov Chains; Irreducibility, Aperiodicity ``` @@ -653,7 +653,7 @@ mc.period mc.is_aperiodic ``` -## {index}`Stationary Distributions ` +## {index}`Stationary distributions ` ```{index} single: Markov Chains; Stationary Distributions ``` @@ -740,7 +740,7 @@ This is, in some sense, a steady state probability of unemployment --- more abou Not surprisingly it tends to zero as $\beta \to 0$, and to one as $\alpha \to 0$. -### Calculating Stationary Distributions +### Calculating stationary distributions ```{index} single: Markov Chains; Calculating Stationary Distributions ``` @@ -788,7 +788,7 @@ mc = qe.MarkovChain(P) mc.stationary_distributions # Show all stationary distributions ``` -### Convergence to Stationarity +### Convergence to stationarity ```{index} single: Markov Chains; Convergence to Stationarity ``` @@ -842,7 +842,7 @@ Here You might like to try experimenting with different initial conditions. (ergodicity)= -## {index}`Ergodicity ` +## {index}`Ergodicity ` ```{index} single: Markov Chains; Ergodicity ``` @@ -891,7 +891,7 @@ Thus, in the long-run, cross-sectional averages for a population and time-series This is one aspect of the concept of ergodicity.
(finite_mc_expec)= -## Computing Expectations +## Computing expectations ```{index} single: Markov Chains; Forecasting Future Values ``` @@ -963,7 +963,7 @@ We already know that this is $P^k(x, \cdot)$, so The vector $P^k h$ stores the conditional expectation $\mathbb E [ h(X_{t + k}) \mid X_t = x]$ over all $x$. -### Iterated Expectations +### Iterated expectations The **law of iterated expectations** states that @@ -982,7 +982,7 @@ $$ and note $\psi_t P^k h = \psi_{t+k} h = \mathbb E [ h(X_{t + k}) ] $. -### Expectations of Geometric Sums +### Expectations of geometric sums Sometimes we want to compute the mathematical expectation of a geometric sum, such as $\sum_t \beta^t h(X_t)$. diff --git a/lectures/ge_arrow.md b/lectures/ge_arrow.md index 3886d6ad8..adb58f6ba 100644 --- a/lectures/ge_arrow.md +++ b/lectures/ge_arrow.md @@ -145,7 +145,7 @@ $$ for all $t$ and for all $s^t$. -## Recursive Formulation +## Recursive formulation Following descriptions in section 9.3.3 of Ljungqvist and Sargent {cite}`Ljungqvist2012` chapter 9, we set up a competitive equilibrium of a pure exchange economy with complete markets in one-period Arrow securities. @@ -239,7 +239,7 @@ are zero net aggregate claims. 
-## State Variable Degeneracy +## State variable degeneracy Please see Ljungqvist and Sargent {cite}`Ljungqvist2012` for a description of timing protocol for trades consistent with an Arrow-Debreu vision in which @@ -284,7 +284,7 @@ This outcome depends critically on there being complete markets in Arrow securit For example, it does not prevail in the incomplete markets setting of this lecture {doc}`The Aiyagari Model ` -## Markov Asset Prices +## Markov asset prices Let's start with a brief summary of formulas for computing asset prices in @@ -315,7 +315,7 @@ $$ * The gross rate of return on a one-period risk-free bond Markov state $\bar s_i$ is $R_i = (\sum_j Q_{i,j})^{-1}$ -### Exogenous Pricing Kernel +### Exogenous pricing kernel At this point, we'll take the pricing kernel $Q$ as exogenous, i.e., determined outside the model @@ -352,7 +352,7 @@ Below, we describe an equilibrium model with trading of one-period Arrow securit In constructing our model, we'll repeatedly encounter formulas that remind us of our asset pricing formulas. -### Multi-Step-Forward Transition Probabilities and Pricing Kernels +### Multi-step-forward transition probabilities and pricing kernels The $(i,j)$ component of the $\ell$-step ahead transition probability $P^\ell$ is @@ -370,7 +370,7 @@ $$ We'll use these objects to state a useful property in asset pricing theory. -### Laws of Iterated Expectations and Iterated Values +### Laws of iterated expectations and iterated values A **law of iterated values** has a mathematical structure that parallels a **law of iterated expectations** @@ -432,7 +432,7 @@ V \left[ V ( d(s_{t+j}) | s_{t+1} ) \right] | s_t \end{aligned} $$ -## General Equilibrium +## General equilibrium Now we are ready to do some fun calculations. 
@@ -483,7 +483,7 @@ $$ * A collection of $n \times 1$ vectors of individual $k$ consumptions: $c^k\left(s\right), k=1,\ldots, K$ -### $Q$ is the Pricing Kernel +### $Q$ is the pricing kernel For any agent $k \in \left[1, \ldots, K\right]$, at the equilibrium allocation, @@ -585,7 +585,7 @@ be nonnegative, then in a **finite horizon** economy with sequential trading of -### Continuation Wealth +### Continuation wealth Continuation wealth plays an important role in Bellmanizing a competitive equilibrium with sequential trading of a complete set of one-period Arrow securities. @@ -640,7 +640,7 @@ the economy begins with all agents being debt-free and financial-asset-free at **Remark:** Note that all agents' continuation wealths recurrently return to zero when the Markov state returns to whatever value $s_0$ it had at time $0$. -### Optimal Portfolios +### Optimal portfolios A nifty feature of the model is that an optimal portfolio of a type $k$ agent equals the continuation wealth that we just computed. @@ -651,7 +651,7 @@ $$ a_k(s) = \psi^k(s), \quad s \in \left[\bar s_1, \ldots, \bar s_n \right] $$ (eqn:optport) -### Equilibrium Wealth Distribution $\alpha$ +### Equilibrium wealth distribution $\alpha$ With the initial state being a particular state $s_0 \in \left[\bar{s}_1, \ldots, \bar{s}_n\right]$, @@ -698,7 +698,7 @@ $$ J^k = (I - \beta P)^{-1} u(\alpha_k y) , \quad u(c) = \frac{c^{1-\gamma}}{1- where it is understood that $ u(\alpha_k y)$ is a vector. -## Finite Horizon +## Finite horizon We now describe a finite-horizon version of the economy that operates for $T+1$ periods $t \in {\bf T} = \{ 0, 1, \ldots, T\}$. @@ -712,7 +712,7 @@ one-period utility function $u(c)$ satisfies an Inada condition that sets the ma limits borrowing.
-### Continuation Wealths +### Continuation wealths We denote a $K \times 1$ vector of state-dependent continuation wealths in Markov state $s$ at time $t$ as @@ -825,7 +825,7 @@ where it is understood that $ u(\alpha_k y)$ is a vector. -## Python Code +## Python code We are ready to dive into some Python code. @@ -1303,7 +1303,7 @@ for i in range(1, 4): ``` -### Finite Horizon Example +### Finite horizon example We now revisit the economy defined in example 1, but set the time horizon to be $T=10$. diff --git a/lectures/harrison_kreps.md b/lectures/harrison_kreps.md index ee94938fe..58401d716 100644 --- a/lectures/harrison_kreps.md +++ b/lectures/harrison_kreps.md @@ -72,7 +72,7 @@ The Harrison-Kreps model illustrates the following notion of a bubble that attra > *A component of an asset price can be interpreted as a bubble when all investors agree that the current price of the asset exceeds what they believe the asset's underlying dividend stream justifies*. -## Structure of the Model +## Structure of the model The model simplifies things by ignoring alterations in the distribution of wealth among investors who have hard-wired different beliefs about the fundamentals that determine @@ -149,7 +149,7 @@ The stationary distribution of $P_b$ is approximately $\pi_b = \begin{bmatrix} . Thus, a type $a$ investor is more pessimistic on average. -### Ownership Rights +### Ownership rights An owner of the asset at the end of time $t$ is entitled to the dividend at time $t+1$ and also has the right to sell the asset at time $t+1$. @@ -166,7 +166,7 @@ Case 1 is the case studied in Harrison and Kreps. In case 2, both types of investors always hold at least some of the asset. -### Short Sales Prohibited +### Short sales prohibited No short sales are allowed. @@ -175,7 +175,7 @@ This matters because it limits how pessimists can express their opinions. * They **can** express themselves by selling their shares. 
* They **cannot** express themsevles more loudly by artificially "manufacturing shares" -- that is, they cannot borrow shares from more optimistic investors and then immediately sell them. -### Optimism and Pessimism +### Optimism and pessimism The above specifications of the perceived transition matrices $P_a$ and $P_b$, taken directly from Harrison and Kreps, build in stochastically alternating temporary optimism and pessimism. @@ -194,7 +194,7 @@ This price function is endogenous and to be determined below. When investors choose whether to purchase or sell the asset at $t$, they also know $s_t$. -## Solving the Model +## Solving the model Now let's turn to solving the model. @@ -207,7 +207,7 @@ assumptions about beliefs: 1. There are two types of agents differentiated only by their beliefs. Each type of agent has sufficient resources to purchase all of the asset (Harrison and Kreps's setting). 1. There are two types of agents with different beliefs, but because of limited wealth and/or limited leverage, both types of investors hold the asset each period. -### Summary Table +### Summary table The following table gives a summary of the findings obtained in the remainder of the lecture (in an exercise you will be asked to recreate the table and also reinterpret parts of it). @@ -241,7 +241,7 @@ The row corresponding to $p_p$ would apply if neither type of investor has enoug The row corresponding to $p_p$ would also apply if both types have enough resources to buy the entire stock of the asset but short sales are also possible so that temporarily pessimistic investors price the asset. -### Single Belief Prices +### Single belief prices We’ll start by pricing the asset under homogeneous beliefs. 
@@ -284,7 +284,7 @@ def price_single_beliefs(transition, dividend_payoff, β=.75): return prices ``` -#### Single Belief Prices as Benchmarks +#### Single belief prices as benchmarks These equilibrium prices under homogeneous beliefs are important benchmarks for the subsequent analysis. @@ -293,7 +293,7 @@ These equilibrium prices under homogeneous beliefs are important benchmarks for We will compare these fundamental values of the asset with equilibrium values when traders have different beliefs. -### Pricing under Heterogeneous Beliefs +### Pricing under heterogeneous beliefs There are several cases to consider. @@ -430,7 +430,7 @@ def price_optimistic_beliefs(transitions, dividend_payoff, β=.75, return p_new, phat_a, phat_b ``` -### Insufficient Funds +### Insufficient funds Outcomes differ when the more optimistic type of investor has insufficient wealth --- or insufficient ability to borrow enough --- to hold the entire stock of the asset. @@ -491,7 +491,7 @@ def price_pessimistic_beliefs(transitions, dividend_payoff, β=.75, return p_new ``` -### Further Interpretation +### Further interpretation Jose Scheinkman {cite}`Scheinkman2014` interprets the Harrison-Kreps model as a model of a bubble --- a situation in which an asset price exceeds what every investor thinks is merited by his or her beliefs about the value of the asset's underlying dividend stream. diff --git a/lectures/hoist_failure.md b/lectures/hoist_failure.md index 32731b590..f8dd2f32f 100644 --- a/lectures/hoist_failure.md +++ b/lectures/hoist_failure.md @@ -129,7 +129,7 @@ This observation sets the stage for challenge that confronts us in this lecture, To compute the probability distribution of the sum of two log normal distributions, we can use the following convolution property of a probability distribution that is a sum of independent random variables. -## The Convolution Property +## The convolution property Let $x$ be a random variable with probability density $f(x)$, where $x \in {\bf R}$. 
@@ -206,7 +206,7 @@ They provide the same answers but `scipy.signal.ftconvolve` is much faster. That's why we rely on it later in this lecture. -## Approximating Distributions +## Approximating distributions We'll construct an example to verify that discretized distributions can do a good job of approximating samples drawn from underlying continuous distributions. @@ -216,7 +216,7 @@ We'll start by generating samples of size 25000 of three independent log normal Then we'll plot histograms and compare them with convolutions of appropriate discretized log normal distributions. ```{code-cell} python3 -## create sums of two and three log normal random variates ssum2 = s1 + s2 and ssum3 = s1 + s2 + s3 +## Create sums of two and three log normal random variates ssum2 = s1 + s2 and ssum3 = s1 + s2 + s3 mu1, sigma1 = 5., 1. # mean and standard deviation @@ -292,10 +292,10 @@ m = .1 # increment size ```{code-cell} python3 ## Cell to check -- note what happens when don't normalize! -## things match up without adjustment. Compare with above +## Things match up without adjustment. Compare with above p1,p1_norm,x = pdf_seq(mu1,sigma1,I,m) -## compute number of points to evaluate the probability mass function +## Compute number of points to evaluate the probability mass function NT = x.size plt.figure(figsize = (8,8)) @@ -316,7 +316,7 @@ mean, meantheory ``` -## Convolving Probability Mass Functions +## Convolving probability mass functions Now let's use the convolution theorem to compute the probability distribution of a sum of the two log normal random variables we have parameterized above. @@ -450,7 +450,7 @@ mean, 3*meantheory ``` -## Failure Tree Analysis +## Failure tree analysis We shall soon apply the convolution theorem to compute the probability of a **top event** in a failure tree analysis. @@ -508,7 +508,7 @@ $$ (eq:probtop) Probabilities for each event are recorded as failure rates per year.
-## Failure Rates Unknown +## Failure rates unknown Now we come to the problem that really interests us, following {cite}`Ardron_2018` and Greenfield and Sargent {cite}`Greenfield_Sargent_1993` in the spirit of Apostolakis {cite}`apostolakis1990`. @@ -551,7 +551,7 @@ The analyst calculates the probability mass function for the **top event** $F$, -## Waste Hoist Failure Rate +## Waste hoist failure rate We'll take close to a real world example by assuming that $n = 14$. diff --git a/lectures/house_auction.md b/lectures/house_auction.md index 16c7a365e..05b2dc331 100644 --- a/lectures/house_auction.md +++ b/lectures/house_auction.md @@ -49,7 +49,7 @@ In 1994, the multiple rounds, ascending bid auction was actually used by Stanfor We begin with overviews of the two mechanisms. -## Ascending Bids Auction for Multiple Goods +## Ascending bids auction for multiple goods An auction is administered by an **auctioneer** @@ -84,7 +84,7 @@ In this auction, person $j$ never tells anyone else his/her private values $v_{ -## A Benevolent Planner +## A benevolent planner This mechanism is designed so that all prospective buyers voluntarily choose to reveal their private values to a **social planner** who uses them to construct a socially optimal allocation. @@ -99,7 +99,7 @@ After the planner receives everyone's vector of private values, the planner depl -## Equivalence of Allocations +## Equivalence of allocations Remarkably, these two mechanisms can produce virtually identical allocations. @@ -111,10 +111,10 @@ We also work out some examples by hand or almost by hand. Next, let's dive down into the details. -## Ascending Bid Auction +## Ascending bid auction -### Basic Setting +### Basic setting We start with a more detailed description of the setting. 
@@ -238,7 +238,7 @@ np.random.seed(100) np.set_printoptions(precision=3, suppress=True) ``` -## An Example +## An example +++ @@ -348,7 +348,7 @@ def check_kick_off_condition(v, r, ϵ): check_kick_off_condition(v, r, ϵ) ``` -### round 1 +### Round 1 +++ @@ -491,7 +491,7 @@ winner_list loser_list ``` -### round 2 +### Round 2 +++ @@ -574,7 +574,7 @@ allocation,winner_list,loser_list = check_terminal_condition(bid_info, p, v) present_dict(allocation) ``` -### round 3 +### Round 3 ```{code-cell} ipython3 p,bid_info = submit_bid(loser_list, p, ϵ, v, bid_info) @@ -596,7 +596,7 @@ allocation,winner_list,loser_list = check_terminal_condition(bid_info, p, v) present_dict(allocation) ``` -### round 4 +### Round 4 ```{code-cell} ipython3 p,bid_info = submit_bid(loser_list, p, ϵ, v, bid_info) @@ -620,7 +620,7 @@ allocation,winner_list,loser_list = check_terminal_condition(bid_info, p, v) present_dict(allocation) ``` -### round 5 +### Round 5 ```{code-cell} ipython3 p,bid_info = submit_bid(loser_list, p, ϵ, v, bid_info) @@ -656,7 +656,7 @@ total_revenue = p[list(allocation.keys())].sum() total_revenue ``` -## A Python Class +## A Python class +++ @@ -957,7 +957,7 @@ auction_1.S auction_1.Q ``` -## Robustness Checks +## Robustness checks Let's do some stress testing of our code by applying it to auctions characterized by different matrices of private values. @@ -1017,7 +1017,7 @@ auction_6.start_auction() +++ -## A Groves-Clarke Mechanism +## A Groves-Clarke mechanism +++ @@ -1061,7 +1061,7 @@ Our mechanims works like this. +++ -## An Example Solved by Hand +## An example solved by hand +++ @@ -1206,7 +1206,7 @@ S = V_orig*Q - np.diag(p)@Q p, Q, V, S ``` -## Another Python Class +## Another Python class It is efficient to assemble our calculations in a single Python Class.
@@ -1346,7 +1346,7 @@ We want to compute $\check t_j$ for $j = 1, \ldots, m$ and compare with $p_j$ fr +++ -### Social Cost +### Social cost Using the GC_Mechanism class, we can calculate the social cost of each buyer. diff --git a/lectures/ifp.md b/lectures/ifp.md index 6b4906db3..a15f6b771 100644 --- a/lectures/ifp.md +++ b/lectures/ifp.md @@ -74,14 +74,14 @@ Other references include {cite}`Deaton1991`, {cite}`DenHaan2010`, {cite}`Kuhn2013`, {cite}`Rabault2002`, {cite}`Reiter2009` and {cite}`SchechtmanEscudero1977`. -## The Optimal Savings Problem +## The optimal savings problem ```{index} single: Optimal Savings; Problem ``` Let's write down the model and then discuss how to solve it. -### Set-Up +### Set-up Consider a household that chooses a state-contingent consumption plan $\{c_t\}_{t \geq 0}$ to maximize @@ -147,7 +147,7 @@ be contingent only on the current state. Optimality is defined below. -### Value Function and Euler Equation +### Value function and Euler equation The *value function* $V \colon \mathsf S \to \mathbb{R}$ is defined by @@ -204,7 +204,7 @@ u' (c_t) \right\} ``` -### Optimality Results +### Optimality results As shown in {cite}`ma2020income`, @@ -251,7 +251,7 @@ model suggests that time iteration will be faster and more accurate. This is the approach that we apply below. -### Time Iteration +### Time iteration We can rewrite {eq}`eqeul0` to make it a statement about functions rather than random variables. @@ -321,7 +321,7 @@ It is shown in {cite}`ma2020income` that the unique optimal policy can be computed by picking any $\sigma \in \mathscr{C}$ and iterating with the operator $K$ defined in {eq}`eqsifc`. -### Some Technical Details +### Some technical details The proof of the last statement is somewhat technical but here is a quick summary: @@ -503,7 +503,7 @@ plt.show() The following exercises walk you through several applications where policy functions are computed. 
-### A Sanity Check +### A sanity check One way to check our results is to diff --git a/lectures/ifp_advanced.md b/lectures/ifp_advanced.md index 6c8757388..496a61ee2 100644 --- a/lectures/ifp_advanced.md +++ b/lectures/ifp_advanced.md @@ -62,11 +62,11 @@ from numba.experimental import jitclass from quantecon import MarkovChain ``` -## The Savings Problem +## The savings problem In this section we review the household problem and optimality results. -### Set Up +### Set up A household chooses a consumption-asset path $\{(c_t, a_t)\}$ to maximize @@ -189,12 +189,12 @@ We again solve the Euler equation using time iteration, iterating with a Coleman--Reffett operator $K$ defined to match the Euler equation {eq}`ifpa_euler`. -## Solution Algorithm +## Solution algorithm ```{index} single: Optimal Savings; Computation ``` -### A Time Iteration Operator +### A time iteration operator Our definition of the candidate class $\sigma \in \mathscr C$ of consumption policies is the same as in our {doc}`earlier lecture ` on the income @@ -223,7 +223,7 @@ if and only if $K\sigma(a, z) = \sigma(a, z)$ for all $(a, z) \in This means that fixed points of $K$ in $\mathscr C$ and optimal consumption policies exactly coincide (see {cite}`ma2020income` for more details). -### Convergence Properties +### Convergence properties As before, we pair $\mathscr C$ with the distance @@ -248,7 +248,7 @@ We now have a clear path to successfully approximating the optimal policy: choose some $\sigma \in \mathscr C$ and then iterate with $K$ until convergence (as measured by the distance $\rho$). -### Using an Endogenous Grid +### Using an endogenous grid In the study of that model we found that it was possible to further accelerate time iteration via the {doc}`endogenous grid method `. @@ -262,7 +262,7 @@ interior. In particular, optimal consumption can be equal to assets when the level of assets is low. 
-#### Finding Optimal Consumption +#### Finding optimal consumption The endogenous grid method (EGM) calls for us to take a grid of *savings* values $s_i$, where each such $s$ is interpreted as $s = a - @@ -310,7 +310,7 @@ obtained by interpolating $\{a_i, c_i\}$ at each $z$. In what follows, we use linear interpolation. -### Testing the Assumptions +### Testing the assumptions Convergence of time iteration is dependent on the condition $\beta G_R < 1$ being satisfied. @@ -540,7 +540,7 @@ This is because we anticipate income $Y_{t+1}$ tomorrow, which makes the need to Can you explain why consuming all assets ends earlier (for lower values of assets) when $z=0$? -### Law of Motion +### Law of motion Let's try to get some idea of what will happen to assets over the long run under this consumption policy. diff --git a/lectures/imp_sample.md b/lectures/imp_sample.md index 11c7258f9..6718b01cb 100644 --- a/lectures/imp_sample.md +++ b/lectures/imp_sample.md @@ -36,7 +36,7 @@ import matplotlib.pyplot as plt from math import gamma ``` -## Mathematical Expectation of Likelihood Ratio +## Mathematical expectation of likelihood ratio In {doc}`this lecture `, we studied a likelihood ratio $\ell \left(\omega_t\right)$ @@ -156,7 +156,7 @@ $$ E^g\left[\ell\left(\omega\right)\right] = \int_\Omega \ell(\omega) g(\omega) d\omega = \int_\Omega \ell(\omega) \frac{g(\omega)}{h(\omega)} h(\omega) d\omega = E^h\left[\ell\left(\omega\right) \frac{g(\omega)}{h(\omega)}\right] $$ -## Selecting a Sampling Distribution +## Selecting a sampling distribution Since we must use an $h$ that has larger mass in parts of the distribution to which $g$ puts low mass, we use $h=Beta(0.5, 0.5)$ as our importance distribution. 
@@ -178,7 +178,7 @@ plt.ylim([0., 3.]) plt.show() ``` -## Approximating a Cumulative Likelihood Ratio +## Approximating a cumulative likelihood ratio We now study how to use importance sampling to approximate ${E} \left[L(\omega^t)\right] = \left[\prod_{i=1}^T \ell \left(\omega_i\right)\right]$. @@ -248,7 +248,7 @@ estimate(g_a, g_b, g_a, g_b, T=10, N=10000) estimate(g_a, g_b, h_a, h_b, T=10, N=10000) ``` -## Distribution of Sample Mean +## Distribution of sample mean We next study the bias and efficiency of the Monte Carlo and importance sampling approaches. @@ -323,7 +323,7 @@ The simulation exercises above show that the importance sampling estimates are u Evidently, the bias increases with increases in $T$. -## Choosing a Sampling Distribution +## Choosing a sampling distribution +++ diff --git a/lectures/inventory_dynamics.md b/lectures/inventory_dynamics.md index 7a80b9929..1706bab09 100644 --- a/lectures/inventory_dynamics.md +++ b/lectures/inventory_dynamics.md @@ -54,7 +54,7 @@ from numba import jit, float64, prange from numba.experimental import jitclass ``` -## Sample Paths +## Sample paths Consider a firm with inventory $X_t$. @@ -167,7 +167,7 @@ for i in range(400): plt.show() ``` -## Marginal Distributions +## Marginal distributions Now let’s look at the marginal distribution $\psi_T$ of $X_T$ for some fixed $T$. diff --git a/lectures/jv.md b/lectures/jv.md index 8312ab065..d9910915b 100644 --- a/lectures/jv.md +++ b/lectures/jv.md @@ -42,7 +42,7 @@ import scipy.stats as stats from numba import jit, prange ``` -### Model Features +### Model features ```{index} single: On-the-Job Search; Model Features ``` @@ -127,7 +127,7 @@ with default parameter values The $\text{Beta}(2,2)$ distribution is supported on $(0,1)$ - it has a unimodal, symmetric density peaked at 0.5. 
(jvboecalc)= -### Back-of-the-Envelope Calculations +### Back-of-the-envelope calculations Before we solve the model, let's make some quick calculations that provide intuition on what the solution should look like. @@ -356,7 +356,7 @@ def solve_model(jv, return v_new ``` -## Solving for Policies +## Solving for policies ```{index} single: On-the-Job Search; Solving for Policies ``` diff --git a/lectures/kalman.md b/lectures/kalman.md index cc468acc0..12f6839e4 100644 --- a/lectures/kalman.md +++ b/lectures/kalman.md @@ -66,7 +66,7 @@ from scipy.integrate import quad from scipy.linalg import eigvals ``` -## The Basic Idea +## The basic idea The Kalman filter has many applications in economics, but for now let's pretend that we are rocket scientists. @@ -203,7 +203,7 @@ ax.clabel(cs, inline=1, fontsize=10) plt.show() ``` -### The Filtering Step +### The filtering step We are now presented with some good news and some bad news. @@ -308,7 +308,7 @@ information $y - G \hat x$. In generating the figure, we set $G$ to the identity matrix and $R = 0.5 \Sigma$ for $\Sigma$ defined in {eq}`kalman_dhxs`. (kl_forecase_step)= -### The Forecast Step +### The forecast step What have we achieved so far? @@ -419,7 +419,7 @@ ax.text(float(y[0].item()), float(y[1].item()), "$y$", fontsize=20, color="black plt.show() ``` -### The Recursive Procedure +### The recursive procedure ```{index} single: Kalman Filter; Recursive Procedure ``` diff --git a/lectures/kalman_2.md b/lectures/kalman_2.md index c1049dc36..1893a86eb 100644 --- a/lectures/kalman_2.md +++ b/lectures/kalman_2.md @@ -61,7 +61,7 @@ mpl.rcParams['text.usetex'] = True mpl.rcParams['text.latex.preamble'] = r'\usepackage{{amsmath}}' ``` -## A worker's output +## A worker's output A representative worker is permanently employed at a firm. @@ -208,7 +208,7 @@ we use the Kalman filter described in this quantecon lecture {doc}`A First Look In particular, we want to compute all of the objects in an "innovation representation". 
-## An Innovations Representation
+## An innovations representation

We have all the objects in hand required to form an innovations
representation for the output
process $\{y_t\}_{t=0}^T$ for a worker.
@@ -273,7 +273,7 @@ fig.tight_layout()
plt.show()
```

-## Some Computational Experiments
+## Some computational experiments

Let's look at $\Sigma_0$ and $\Sigma_T$ in order to see how much the firm learns about the hidden state during the horizon we have set.
@@ -585,7 +585,7 @@ ax.legend(bbox_to_anchor=(1, 0.5))
plt.show()
```

-## Future Extensions
+## Future extensions

We can do lots of enlightening experiments by creating new types of workers and letting the firm learn about their hidden (to the firm) states by observing just their output histories.
diff --git a/lectures/kesten_processes.md b/lectures/kesten_processes.md
index 1fe6e921b..a72eb1655 100644
--- a/lectures/kesten_processes.md
+++ b/lectures/kesten_processes.md
@@ -71,7 +71,7 @@ register_matplotlib_converters()
Additional technical background related to this lecture can be found in
the monograph of {cite}`buraczewski2016stochastic`.

-## Kesten Processes
+## Kesten processes

```{index} single: Kesten processes; heavy tails
```
@@ -97,7 +97,7 @@ In particular, we will assume that
* $\{a_t\}_{t \geq 1}$ is a nonnegative IID stochastic process and
* $\{\eta_t\}_{t \geq 1}$ is another nonnegative IID stochastic process, independent of the first.

-### Example: GARCH Volatility
+### Example: GARCH volatility

The GARCH model is common in financial applications, where time series such as asset returns exhibit time varying volatility.
@@ -150,7 +150,7 @@ where $\{\zeta_t\}$ is again IID and independent of $\{\xi_t\}$.

The volatility sequence $\{\sigma_t^2 \}$, which drives the dynamics of returns, is a Kesten process.

-### Example: Wealth Dynamics
+### Example: wealth dynamics

Suppose that a given household saves a fixed fraction $s$ of its current wealth in every period.
@@ -203,7 +203,7 @@ current state is drawn from $F^*$.

The equality in {eq}`kp_stationary` states that this distribution is unchanged.

-### Cross-Sectional Interpretation
+### Cross-sectional interpretation

There is an important cross-sectional interpretation of stationary distributions, discussed previously but worth repeating here.
@@ -241,7 +241,7 @@ next period as it is this period.

Since $y$ was chosen arbitrarily, the distribution is unchanged.

-### Conditions for Stationarity
+### Conditions for stationarity

The Kesten process $X_{t+1} = a_{t+1} X_t + \eta_{t+1}$ does not always
have a stationary distribution.
@@ -270,7 +270,7 @@ As one application of this result, we see that the wealth process
{eq}`wealth_dynam` will have a unique stationary distribution whenever
labor income has finite mean and $\mathbb E \ln R_t + \ln s < 0$.

-## Heavy Tails
+## Heavy tails

Under certain conditions, the stationary distribution of a Kesten process has
a Pareto tail.
@@ -279,7 +279,7 @@ a Pareto tail.

This fact is significant for economics because of the prevalence of Pareto-tailed distributions.

-### The Kesten--Goldie Theorem
+### The Kesten--Goldie theorem

To state the conditions under which the stationary distribution of a Kesten process has a Pareto tail, we first recall that a random variable is called **nonarithmetic** if its distribution is not concentrated on $\{\dots, -2t, -t, 0, t, 2t, \ldots \}$ for any $t \geq 0$.
@@ -359,13 +359,13 @@ ax.set(xlabel='time', ylabel='$X_t$')
plt.show()
```

-## Application: Firm Dynamics
+## Application: firm dynamics

As noted in our {doc}`lecture on heavy tails `, for common measures of firm size such as revenue or employment, the US firm size
distribution exhibits a Pareto tail (see, e.g., {cite}`axtell2001zipf`, {cite}`gabaix2016power`).

Let us try to explain this rather striking fact using the Kesten--Goldie Theorem.
-### Gibrat's Law
+### Gibrat's law

It was postulated many years ago by Robert Gibrat {cite}`gibrat1931inegalites` that firm size evolves according to a simple rule whereby size next period is proportional to current size.
@@ -412,7 +412,7 @@ In the exercises you are asked to show that {eq}`firm_dynam` is more consistent with the empirical findings presented above
than Gibrat's law in {eq}`firm_dynam_gb`.

-### Heavy Tails
+### Heavy tails

So what has this to do with Pareto tails?
diff --git a/lectures/lagrangian_lqdp.md b/lectures/lagrangian_lqdp.md
index 248f0f2fa..8b4d22f71 100644
--- a/lectures/lagrangian_lqdp.md
+++ b/lectures/lagrangian_lqdp.md
@@ -67,7 +67,7 @@ The techniques in this lecture will prove useful when we study Stackelberg and R



-## Undiscounted LQ DP Problem
+## Undiscounted LQ DP problem

The problem is to choose a sequence of controls $\{u_t\}_{t=0}^\infty$
to maximize the criterion
@@ -233,7 +233,7 @@ $$ (Mdefn)

+++

-## State-Costate Dynamics
+## State-costate dynamics

We seek to solve the difference equation system {eq}`eq4orig` for a sequence $\{x_t\}_{t=0}^\infty$
@@ -255,7 +255,7 @@ which requires that $x_t' R x_t$ converge to zero as $t \rightarrow + \infty$.

+++

-## Reciprocal Pairs Property
+## Reciprocal pairs property

To proceed, we study properties of the $(2n \times 2n)$ matrix $M$ defined in {eq}`Mdefn`.
@@ -666,7 +666,7 @@ lq.stationary_values()
```

-## Other Applications
+## Other applications

The preceding approach to imposing stability on a system of potentially unstable linear difference equations is not limited to linear quadratic dynamic optimization problems.
@@ -693,13 +693,13 @@ W, V, P = stable_solution(H)
P
```

-## Discounted Problems
+## Discounted problems

+++

-### Transforming States and Controls to Eliminate Discounting
+### Transforming states and controls to eliminate discounting

A pair of useful transformations allows us to convert a discounted problem into an undiscounted one.
@@ -777,7 +777,7 @@ lq.stationary_values() ``` -### Lagrangian for Discounted Problem +### Lagrangian for discounted problem For several purposes, it is useful explicitly briefly to describe a Lagrangian for a discounted problem. diff --git a/lectures/lake_model.md b/lectures/lake_model.md index c9b3c5715..e9a5c4b94 100644 --- a/lectures/lake_model.md +++ b/lectures/lake_model.md @@ -85,7 +85,7 @@ Before working through what follows, we recommend you read the You will also need some basic {doc}`linear algebra ` and probability. -## The Model +## The model The economy is inhabited by a very large number of ex-ante identical workers. @@ -100,7 +100,7 @@ Their rates of transition between employment and unemployment are governed by The growth rate of the labor force evidently equals $g=b-d$. -### Aggregate Variables +### Aggregate variables We want to derive the dynamics of the following aggregates @@ -115,7 +115,7 @@ We also want to know the values of the following objects (Here and below, capital letters represent aggregates and lowercase letters represent rates) -### Laws of Motion for Stock Variables +### Laws of motion for stock variables We begin by constructing laws of motion for the aggregate variables $E_t,U_t, N_t$. @@ -163,7 +163,7 @@ $$ This law tells us how total employment and unemployment evolve over time. -### Laws of Motion for Rates +### Laws of motion for rates Now let's derive the law of motion for rates. @@ -330,7 +330,7 @@ lm = LakeModel(α = 0.03) lm.A ``` -### Aggregate Dynamics +### Aggregate dynamics Let's run a simulation under the default parameters (see above) starting from $X_0 = (12, 138)$ @@ -411,7 +411,7 @@ plt.show() ``` (dynamics_workers)= -## Dynamics of an Individual Worker +## Dynamics of an individual worker An individual worker's employment dynamics are governed by a {doc}`finite state Markov process `. 
@@ -492,7 +492,7 @@ Inspection tells us that $P$ is exactly the transpose of $\hat A$ under the assu

Thus, the percentages of time that an infinitely lived worker spends employed and unemployed equal the fractions of workers employed and unemployed in the steady state distribution.

-### Convergence Rate
+### Convergence rate

How long does it take for time series sample averages to converge to cross-sectional averages?
@@ -538,7 +538,7 @@ In this case it takes much of the sample for these two objects to converge.

This is largely due to the high persistence in the Markov chain.

-## Endogenous Job Finding Rate
+## Endogenous job finding rate

We now make the hiring rate endogenous.

@@ -546,7 +546,7 @@ The transition rate from unemployment to employment will be determined by the Mc

All details relevant to the following discussion can be found in {doc}`our treatment ` of that model.

-### Reservation Wage
+### Reservation wage

The most important thing to remember about the model is that optimal decisions
are characterized by a reservation wage $\bar w$
@@ -561,7 +561,7 @@ As we saw in {doc}`our discussion of the model `, the reservation

* $\gamma$, the offer arrival rate
* $c$, unemployment compensation

-### Linking the McCall Search Model to the Lake Model
+### Linking the McCall search model to the lake model

Suppose that all workers inside a lake model behave according to the McCall search model.
@@ -579,7 +579,7 @@ This is now
= \gamma \sum_{w' \geq \bar w} p(w')
```

-### Fiscal Policy
+### Fiscal policy

We can use the McCall search version of the Lake Model to find an optimal level of unemployment insurance.
@@ -636,7 +636,7 @@ Following {cite}`davis2006flow`, we set $\alpha$, the hazard rate of leaving emp * $\alpha = 0.013$ -### Fiscal Policy Code +### Fiscal policy code We will make use of techniques from the {doc}`McCall model lecture ` diff --git a/lectures/likelihood_bayes.md b/lectures/likelihood_bayes.md index 2787dbe1b..f31b76329 100644 --- a/lectures/likelihood_bayes.md +++ b/lectures/likelihood_bayes.md @@ -69,7 +69,7 @@ def set_seed(): set_seed() ``` -## The Setting +## The setting We begin by reviewing the setting in {doc}`this lecture `, which we adopt here too. @@ -196,7 +196,7 @@ l_seq_f = np.cumprod(l_arr_f, axis=1) -## Likelihood Ratio Processes and Bayes’ Law +## Likelihood ratio processes and Bayes’ law Let $\pi_0 \in [0,1]$ be a Bayesian statistician's prior probability that nature generates $w^t$ as a sequence of i.i.d. draws from distribution $f$. @@ -610,7 +610,7 @@ This topic is taken up in {doc}`mix_model`. We explore how to learn the true mixing parameter $x$ in the exercise of {doc}`mix_model`. -## Behavior of Posterior Probability $\{\pi_t\}$ Under Subjective Probability Distribution +## Behavior of posterior probability $\{\pi_t\}$ under subjective probability distribution We'll end this lecture by briefly studying what our Bayesian learner expects to learn under the subjective beliefs $\pi_t$ cranked out by Bayes' law. @@ -949,7 +949,7 @@ ax2.set_ylabel("$w_t$") plt.show() ``` -## Initial Prior is Verified by Paths Drawn from Subjective Conditional Densities +## Initial prior is verified by paths drawn from subjective conditional densities @@ -973,7 +973,7 @@ table The fraction of simulations for which $\pi_{t}$ had converged to $1$ is indeed always close to $\pi_{-1}$, as anticipated. -## Drilling Down a Little Bit +## Drilling down a little bit To understand how the local dynamics of $\pi_t$ behaves, it is enlightening to consult the variance of $\pi_{t}$ conditional on $\pi_{t-1}$. 
@@ -1024,7 +1024,7 @@ Notice how the conditional variance approaches $0$ for $\pi_{t-1}$ near either The conditional variance is nearly zero only when the agent is almost sure that $w_t$ is drawn from $F$, or is almost sure it is drawn from $G$. -## Related Lectures +## Related lectures This lecture has been devoted to building some useful infrastructure that will help us understand inferences that are the foundations of results described in {doc}`this lecture ` and {doc}`this lecture ` and {doc}`this lecture `. \ No newline at end of file diff --git a/lectures/likelihood_ratio_process.md b/lectures/likelihood_ratio_process.md index a2037eae6..23e826ac1 100644 --- a/lectures/likelihood_ratio_process.md +++ b/lectures/likelihood_ratio_process.md @@ -59,7 +59,7 @@ import pandas as pd from IPython.display import display, Math ``` -## Likelihood Ratio Process +## Likelihood ratio process A nonnegative random variable $W$ has one of two probability density functions, either $f$ or $g$. @@ -159,7 +159,7 @@ def simulate(a, b, T=50, N=500): ``` (nature_likeli)= -## Nature Permanently Draws from Density g +## Nature permanently draws from density g We first simulate the likelihood ratio process when nature permanently draws from $g$. @@ -236,7 +236,7 @@ Mathematical induction implies $E\left[L\left(w^{t}\right)\bigm|q=g\right]=1$ for all $t \geq 1$. -## Peculiar Property +## Peculiar property How can $E\left[L\left(w^{t}\right)\bigm|q=g\right]=1$ possibly be true when most probability mass of the likelihood ratio process is piling up near $0$ as @@ -272,7 +272,7 @@ We explain the problem in more detail in {doc}`this lecture `. There we describe an alternative way to compute the mean of a likelihood ratio by computing the mean of a _different_ random variable by sampling from a _different_ probability distribution. 
-## Nature Permanently Draws from Density f
+## Nature permanently draws from density f

Now suppose that before time $0$ nature permanently decided to draw repeatedly from density $f$.
@@ -319,7 +319,7 @@ plt.plot(range(T), np.sum(l_seq_f > 10000, axis=0) / N)
plt.show()
```

-## Likelihood Ratio Test
+## Likelihood ratio test

We now describe how to employ the machinery of Neyman and Pearson {cite}`Neyman_Pearson` to test the hypothesis that history $w^t$ is generated by repeated
@@ -590,7 +590,7 @@ presented to Milton Friedman, as we describe in {doc}`this lecture  KL(g,f)$, we see faster convergence in the first panel at the

This ties in nicely with {eq}`eq:kl_likelihood_link`.

-## Hypothesis Testing and Classification
+## Hypothesis testing and classification

This section discusses another application of likelihood ratio processes.
@@ -1526,7 +1526,7 @@ $$

For shorthand we'll write $L_t = L(w^t)$.

-### Model Selection Mistake Probability
+### Model selection mistake probability

We first study a problem that assumes timing protocol 1.
@@ -1855,7 +1855,7 @@ plt.show()

Evidently, $e^{-C(f,g)T}$ is an upper bound on the error rate.

-### Jensen-Shannon divergence
+### Jensen-Shannon divergence

The [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) is another divergence measure.
@@ -2177,7 +2177,7 @@ Evidently, Chernoff entropy and Jensen-Shannon entropy each covary tightly with

We'll encounter related ideas in {doc}`wald_friedman` very soon.

-## Related Lectures
+## Related lectures

Likelihood processes play an important role in Bayesian learning, as described in {doc}`likelihood_bayes`
and as applied in {doc}`odu`.
diff --git a/lectures/linear_algebra.md b/lectures/linear_algebra.md
index d7131dfc4..2326b42c8 100644
--- a/lectures/linear_algebra.md
+++ b/lectures/linear_algebra.md
@@ -81,7 +81,7 @@ from mpl_toolkits.mplot3d import Axes3D
from scipy.linalg import inv, solve, det, eig
```

-## {index}`Vectors `
+## {index}`Vectors `

```{index} single: Linear Algebra; Vectors
```
@@ -122,7 +122,7 @@ for v in vecs:
plt.show()
```

-### Vector Operations
+### Vector operations

```{index} single: Vectors; Operations
```
@@ -218,7 +218,7 @@ x + y
4 * x
```

-### Inner Product and Norm
+### Inner product and norm

```{index} single: Vectors; Inner Product
```
@@ -379,7 +379,7 @@ If $y = (y_1, y_2, y_3)$ is any linear combination of these vectors, then $y_3 =

Hence $A_0$ fails to span all of $\mathbb R ^3$.

(la_li)=
-### Linear Independence
+### Linear independence

```{index} single: Vectors; Linear Independence
```
@@ -414,7 +414,7 @@ The following statements are equivalent to linear independence of $A := \{a_1, \

(The zero in the first expression is the origin of $\mathbb R ^n$)

(la_unique_reps)=
-### Unique Representations
+### Unique representations

Another nice thing about sets of linearly independent vectors is that each element in the span has a unique representation as a linear combination of these vectors.
@@ -474,7 +474,7 @@ $A$ is called *diagonal* if the only nonzero entries are on the principal diagon

If, in addition to being diagonal, each element along the principal diagonal is equal to 1, then $A$ is called the *identity matrix* and denoted by $I$.

-### Matrix Operations
+### Matrix operations

```{index} single: Matrix; Operations
```
@@ -625,7 +625,7 @@ In particular, `A @ B` is matrix multiplication, whereas `A * B` is element-by-e

See [here](https://python-programming.quantecon.org/numpy.html#matrix-multiplication) for more discussion.
(la_linear_map)= -### Matrices as Maps +### Matrices as maps ```{index} single: Matrix; Maps ``` @@ -644,7 +644,7 @@ You can check that this holds for the function $f(x) = A x + b$ when $b$ is the In fact, it's [known](https://en.wikipedia.org/wiki/Linear_map#Matrices) that $f$ is linear if and *only if* there exists a matrix $A$ such that $f(x) = Ax$ for all $x$. -## Solving Systems of Equations +## Solving systems of equations ```{index} single: Matrix; Solving Systems of Equations ``` @@ -743,7 +743,7 @@ A happy fact is that linear independence of the columns of $A$ also gives us uni Indeed, it follows from our {ref}`earlier discussion ` that if $\{a_1, \ldots, a_k\}$ are linearly independent and $y = Ax = x_1 a_1 + \cdots + x_k a_k$, then no $z \not= x$ satisfies $y = Az$. -### The Square Matrix Case +### The square matrix case Let's discuss some more details, starting with the case where $A$ is $n \times n$. @@ -766,7 +766,7 @@ In particular, the following are equivalent The property of having linearly independent columns is sometimes expressed as having *full column rank*. -#### Inverse Matrices +#### Inverse matrices ```{index} single: Matrix; Inverse ``` @@ -802,7 +802,7 @@ Perhaps the most important fact about determinants is that $A$ is nonsingular if This gives us a useful one-number summary of whether or not a square matrix can be inverted. -### More Rows than Columns +### More rows than columns This is the $n \times k$ case with $n > k$. @@ -837,7 +837,7 @@ projections. The solution is known to be $\hat x = (A'A)^{-1}A'y$ --- see for example chapter 3 of [these notes](https://python.quantecon.org/_static/lecture_specific/linear_algebra/course_notes.pdf). -### More Columns than Rows +### More columns than rows This is the $n \times k$ case with $n < k$, so there are fewer equations than unknowns. @@ -867,7 +867,7 @@ $$ In other words, uniqueness fails. 
-### Linear Equations with SciPy
+### Linear equations with SciPy

```{index} single: Linear Algebra; SciPy
```
@@ -904,7 +904,7 @@ The latter method uses a different algorithm (LU decomposition) that is numerica

To obtain the least-squares solution $\hat x = (A'A)^{-1}A'y$, use `scipy.linalg.lstsq(A, y)`.

(la_eigen)=
-## {index}`Eigenvalues ` and {index}`Eigenvectors `
+## {index}`Eigenvalues ` and {index}`eigenvectors `

```{index} single: Linear Algebra; Eigenvalues
```
@@ -1023,7 +1023,7 @@ Since any scalar multiple of an eigenvector is an eigenvector with the same
eigenvalue (check it), the eig routine normalizes the length of each eigenvector
to one.

-### Generalized Eigenvalues
+### Generalized eigenvalues

It is sometimes useful to consider the *generalized eigenvalue problem*, which, for given
matrices $A$ and $B$, seeks generalized eigenvalues
@@ -1039,12 +1039,12 @@ Of course, if $B$ is square and invertible, then we can treat the
generalized eigenvalue problem as an ordinary eigenvalue problem $B^{-1}
A v = \lambda v$, but this is not always the case.

-## Further Topics
+## Further topics

We round out our discussion by briefly mentioning several other important
topics.

-### Series Expansions
+### Series expansions

```{index} single: Linear Algebra; Series Expansions
```
@@ -1055,7 +1055,7 @@ that if $|a| < 1$, then $\sum_{k=0}^{\infty} a^k = (1 - a)^{-1}$.

A generalization of this idea exists in the matrix setting.

(la_mn)=
-#### Matrix Norms
+#### Matrix norms

```{index} single: Linear Algebra; Matrix Norms
```
@@ -1073,7 +1073,7 @@ the left-hand side is a *matrix norm* --- in this case, the so-called

For example, for a square matrix $S$, the condition $\| S \| < 1$ means that $S$ is *contractive*, in the sense that it pulls all vectors towards the origin [^cfn].
(la_neumann)=
-#### {index}`Neumann's Theorem `
+#### {index}`Neumann's theorem `

```{index} single: Linear Algebra; Neumann's Theorem
```
@@ -1092,7 +1092,7 @@ $k \in \mathbb{N}$, then $I - A$ is invertible, and
```

(la_neumann_remarks)=
-#### {index}`Spectral Radius `
+#### {index}`Spectral radius `

```{index} single: Linear Algebra; Spectral Radius
```
@@ -1110,7 +1110,7 @@ there exists a $k$ with $\| A^k \| < 1$.

In which case {eq}`la_neumann` is valid.

-### {index}`Positive Definite Matrices `
+### {index}`Positive definite matrices `

```{index} single: Linear Algebra; Positive Definite Matrices
```
@@ -1129,7 +1129,7 @@ are strictly positive, and hence
$A$ is invertible (with positive definite inverse).

(la_mcalc)=
-### Differentiating Linear and Quadratic Forms
+### Differentiating linear and quadratic forms

```{index} single: Linear Algebra; Differentiating Linear and Quadratic Forms
```
@@ -1150,7 +1150,7 @@ Then {ref}`la_ex1` below asks you to apply these formulas.



-### Further Reading
+### Further reading

The documentation of the `scipy.linalg` submodule can be found
[here](https://docs.scipy.org/doc/scipy/reference/linalg.html).
diff --git a/lectures/linear_models.md b/lectures/linear_models.md
index c887aa5db..15964fa3d 100644
--- a/lectures/linear_models.md
+++ b/lectures/linear_models.md
@@ -74,7 +74,7 @@ from scipy.stats import norm
import random
```

-## The Linear State Space Model
+## The linear state space model

```{index} single: Models; Linear State Space
```
@@ -116,7 +116,7 @@ Even without these draws, the primitives 1--3 pin down the *probability distribu

Later we'll see how to compute these distributions and their moments.

-#### Martingale Difference Shocks
+#### Martingale difference shocks

```{index} single: Linear State Space Models; Martingale Difference Shocks
```
@@ -144,7 +144,7 @@ The following examples help to highlight this point.

They also illustrate the wise dictum *finding the state is an art*.
(lss_sode)= -#### Second-order Difference Equation +#### Second-order difference equation Let $\{y_t\}$ be a deterministic sequence that satisfies @@ -221,7 +221,7 @@ plot_lss(A, C, G) Later you'll be asked to recreate this figure. -#### Univariate Autoregressive Processes +#### Univariate autoregressive processes ```{index} single: Linear State Space Models; Univariate Autoregressive Processes ``` @@ -290,7 +290,7 @@ G_1 = [1, 0, 0, 0] plot_lss(A_1, C_1, G_1, n=4, ts_length=200) ``` -#### Vector Autoregressions +#### Vector autoregressions ```{index} single: Linear State Space Models; Vector Autoregressions ``` @@ -371,7 +371,7 @@ Such an $x_t$ process can be used to model deterministic seasonals in quarterly The *indeterministic* seasonal produces recurrent, but aperiodic, seasonal fluctuations. -#### Time Trends +#### Time trends ```{index} single: Linear State Space Models; Time Trends ``` @@ -443,7 +443,7 @@ $$ Then $x_t^\prime = \begin{bmatrix} t(t-1)/2 &t & 1 \end{bmatrix}$. You can now confirm that $y_t = G x_t$ has the correct form. -### Moving Average Representations +### Moving average representations ```{index} single: Linear State Space Models; Moving Average Representations ``` @@ -505,7 +505,7 @@ The second term is a translated linear function of time. For this reason, $x_{1t}$ is called a *martingale with drift*. -## Distributions and Moments +## Distributions and moments ```{index} single: Linear State Space Models; Distributions ``` @@ -513,7 +513,7 @@ For this reason, $x_{1t}$ is called a *martingale with drift*. ```{index} single: Linear State Space Models; Moments ``` -### Unconditional Moments +### Unconditional moments Using {eq}`st_space_rep`, it's easy to obtain expressions for the (unconditional) means of $x_t$ and $y_t$. @@ -557,7 +557,7 @@ information, to be defined below. However, you should be aware that these "unconditional" moments do depend on the initial distribution $N(\mu_0, \Sigma_0)$. 
-#### Moments of the Observables +#### Moments of the observables Using linearity of expectations again we have @@ -635,7 +635,7 @@ By similar reasoning combined with {eq}`lss_umy` and {eq}`lss_uvy`, y_t \sim N(G \mu_t, G \Sigma_t G') ``` -### Ensemble Interpretations +### Ensemble interpretations How should we interpret the distributions defined by {eq}`lss_mgs_x`--{eq}`lss_mgs_y`? @@ -755,7 +755,7 @@ The histogram and population distribution are close, as expected. By looking at the figures and experimenting with parameters, you will gain a feel for how the population distribution depends on the model primitives {ref}`listed above `, as intermediated by the distribution's parameters. -#### Ensemble Means +#### Ensemble means In the preceding figure, we approximated the population distribution of $y_T$ by @@ -831,7 +831,7 @@ $$ \qquad (I \to \infty) $$ -### Joint Distributions +### Joint distributions In the preceding discussion, we looked at the distributions of $x_t$ and $y_t$ in isolation. @@ -868,7 +868,7 @@ $$ p(x_{t+1} \,|\, x_t) = N(Ax_t, C C') $$ -#### Autocovariance Functions +#### Autocovariance functions An important object related to the joint distribution is the *autocovariance function* @@ -888,7 +888,7 @@ Elementary calculations show that Notice that $\Sigma_{t+j,t}$ in general depends on both $j$, the gap between the two dates, and $t$, the earlier date. -## Stationarity and Ergodicity +## Stationarity and ergodicity ```{index} single: Linear State Space Models; Stationarity ``` @@ -900,7 +900,7 @@ Stationarity and ergodicity are two properties that, when they hold, greatly a Let's start with the intuition. -### Visualizing Stability +### Visualizing stability Let's look at some more time series from the same model that we analyzed above. @@ -960,7 +960,7 @@ distribution as $t \to \infty$. When such a distribution exists it is called a *stationary distribution*. 
-### Stationary Distributions +### Stationary distributions In our setting, a distribution $\psi_{\infty}$ is said to be *stationary* for $x_t$ if @@ -986,7 +986,7 @@ $$ where $\mu_{\infty}$ and $\Sigma_{\infty}$ are fixed points of {eq}`lss_mut_linear_models` and {eq}`eqsigmalaw_linear_models` respectively. -### Covariance Stationary Processes +### Covariance stationary processes Let's see what happens to the preceding figure if we start $x_0$ at the stationary distribution. @@ -1023,9 +1023,9 @@ A process $\{x_t\}$ is said to be *covariance stationary* if In our setting, $\{x_t\}$ will be covariance stationary if $\mu_0, \Sigma_0, A, C$ assume values that imply that none of $\mu_t, \Sigma_t, \Sigma_{t+j,t}$ depends on $t$. -### Conditions for Stationarity +### Conditions for stationarity -#### The Globally Stable Case +#### The globally stable case The difference equation $\mu_{t+1} = A \mu_t$ is known to have *unique* fixed point $\mu_{\infty} = 0$ if all eigenvalues of $A$ have moduli strictly less than unity. @@ -1055,7 +1055,7 @@ Because of the constant first component in the state vector, we will never have How can we find stationary solutions that respect a constant state component? -#### Processes with a Constant State Component +#### Processes with a constant state component To investigate such a process, suppose that $A$ and $C$ take the form @@ -1142,7 +1142,7 @@ Let's suppose that we're working with a covariance stationary process. In this case, we know that the ensemble mean will converge to $\mu_{\infty}$ as the sample size $I$ approaches infinity. -#### Averages over Time +#### Averages over time Ensemble averages across simulations are interesting theoretically, but in real life, we usually observe only a *single* realization $\{x_t, y_t\}_{t=0}^T$. @@ -1171,7 +1171,7 @@ In particular, In our linear Gaussian setting, any covariance stationary process is also ergodic. 
-## Noisy Observations +## Noisy observations In some settings, the observation equation $y_t = Gx_t$ is modified to include an error term. @@ -1233,7 +1233,7 @@ The theory of prediction for linear state space systems is elegant and simple. (ff_cm)= -### Forecasting Formulas -- Conditional Means +### Forecasting formulas -- conditional means The natural way to predict variables is to use conditional distributions. @@ -1287,7 +1287,7 @@ $$ = G A^j x_t $$ -### Covariance of Prediction Errors +### Covariance of prediction errors It is useful to obtain the covariance matrix of the vector of $j$-step-ahead prediction errors diff --git a/lectures/lln_clt.md b/lectures/lln_clt.md index 0dc910dfd..0380bc363 100644 --- a/lectures/lln_clt.md +++ b/lectures/lln_clt.md @@ -80,7 +80,7 @@ We begin with the law of large numbers, which tells us when sample averages will converge to their population means. (lln_ksl)= -### The Classical LLN +### The classical LLN The classical law of large numbers concerns independent and identically distributed (IID) random variables. @@ -281,7 +281,7 @@ The three distributions are chosen at random from a selection stored in the dict Next, we turn to the central limit theorem, which tells us about the distribution of the deviation between sample averages and population means. -### Statement of the Theorem +### Statement of the theorem The central limit theorem is one of the most remarkable results in all of mathematics. @@ -514,7 +514,7 @@ window that you can rotate with your mouse, giving different views on the density sequence. (multivariate_clt)= -### The Multivariate Case +### The multivariate case ```{index} single: Law of Large Numbers; Multivariate Case ``` diff --git a/lectures/lq_inventories.md b/lectures/lq_inventories.md index e1ed768e2..c31a1938f 100644 --- a/lectures/lq_inventories.md +++ b/lectures/lq_inventories.md @@ -415,7 +415,7 @@ These two concepts correspond to these distinct altered firm problems. 
We use these two alternative production concepts in order to shed light on the baseline model. -## Inventories Not Useful +## Inventories not useful Let’s turn first to the setting in which inventories aren’t needed. @@ -446,7 +446,7 @@ $$ Q_{t}^{ni}=\frac{a_{0}+\nu_{t}-c_{1}}{c_{2}+a_{1}}. $$ -## Inventories Useful but are Hardwired to be Zero Always +## Inventories useful but are hardwired to be zero always Next, we turn to a distinct problem in which inventories are useful – meaning that there are costs of $d_2 (I_t - S_t)^2$ associated diff --git a/lectures/lqcontrol.md b/lectures/lqcontrol.md index 041c91584..d92e58769 100644 --- a/lectures/lqcontrol.md +++ b/lectures/lqcontrol.md @@ -82,7 +82,7 @@ The "linear" part of LQ is a linear law of motion for the state, while the "quad Let's begin with the former, move on to the latter, and then put them together into an optimization problem. -### The Law of Motion +### The law of motion Let $x_t$ be a vector describing the state of some economic system. @@ -296,14 +296,14 @@ $$ Under this specification, the household's current loss is the squared deviation of consumption from the ideal level $\bar c$. -## Optimality -- Finite Horizon +## Optimality -- finite horizon ```{index} single: LQ Control; Optimality (Finite Horizon) ``` Let's now be precise about the optimization problem we wish to consider, and look at how to solve it. -### The Objective +### The objective We will begin with the finite horizon case, with terminal time $T \in \mathbb N$. @@ -575,7 +575,7 @@ are wrapped in a class called `LQ`, which includes * `compute_sequence` ---- simulates the dynamics of $x_t, u_t, w_t$ given $x_0$ and assuming standard normal shocks (lq_mfpa)= -### An Application +### An application Early Keynesian models assumed that households have a constant marginal propensity to consume from current income. @@ -779,11 +779,11 @@ of assets in the middle periods to fund rising consumption. 
However, the essential features are the same: consumption is smooth relative to income, and assets are strongly positively correlated with cumulative unanticipated income. -## Extensions and Comments +## Extensions and comments Let's now consider a number of standard extensions to the LQ problem treated above. -### Time-Varying Parameters +### Time-varying parameters In some settings, it can be desirable to allow $A, B, C, R$ and $Q$ to depend on $t$. @@ -798,7 +798,7 @@ One illustration is given {ref}`below `. For further examples and a more systematic treatment, see {cite}`HansenSargent2013`, section 2.4. (lq_cpt)= -### Adding a Cross-Product Term +### Adding a cross-product term In some LQ problems, preferences include a cross-product term $u_t' N x_t$, so that the objective function becomes @@ -840,7 +840,7 @@ The sequence $\{d_t\}$ is unchanged from {eq}`lq_dd`. We leave interested readers to confirm these results (the calculations are long but not overly difficult). (lq_ih)= -### Infinite Horizon +### Infinite horizon ```{index} single: LQ Control; Infinite Horizon ``` @@ -908,7 +908,7 @@ The state evolves according to the time-homogeneous process $x_{t+1} = (A - BF) An example infinite horizon problem is treated {ref}`below `. (lq_cert_eq)= -### Certainty Equivalence +### Certainty equivalence Linear quadratic control problems of the class discussed above have the property of *certainty equivalence*. @@ -918,10 +918,10 @@ This can be confirmed by inspecting {eq}`lq_oc_ih` or {eq}`lq_oc_cp`. It follows that we can ignore uncertainty when solving for optimal behavior, and plug it back in when examining optimal state dynamics. -## Further Applications +## Further applications (lq_nsi)= -### Application 1: Age-Dependent Income Process +### Application 1: age-dependent income process {ref}`Previously ` we studied a permanent income model that generated consumption smoothing. 
@@ -1060,7 +1060,7 @@ The asset path exhibits dynamics consistent with standard life cycle theory.

{ref}`lqc_ex1` gives the full set of parameters used here and asks you to replicate the figure.

(lq_nsi2)=
-### Application 2: A Permanent Income Model with Retirement
+### Application 2: a permanent income model with retirement

In the {ref}`previous application `, we generated income dynamics with an inverted U shape using polynomials and placed them in an LQ framework.

@@ -1134,7 +1134,7 @@ in life followed by later saving.

Assets peak at retirement and subsequently decline.

(lqc_mwac)=
-### Application 3: Monopoly with Adjustment Costs
+### Application 3: monopoly with adjustment costs

Consider a monopolist facing stochastic inverse demand function

diff --git a/lectures/markov_asset.md b/lectures/markov_asset.md
index 842cc4bae..778ce1e13 100644
--- a/lectures/markov_asset.md
+++ b/lectures/markov_asset.md
@@ -79,7 +79,7 @@ import quantecon as qe
from numpy.linalg import eigvals, solve
```

-## {index}`Pricing Models `
+## {index}`Pricing models `

```{index} single: Models; Pricing
```

@@ -92,7 +92,7 @@ Let $\{d_t\}_{t \geq 0}$ be a stream of dividends

Let's look at some equations that we expect to hold for prices of assets under ex-dividend contracts (we will consider cum-dividend pricing in the exercises).

-### Risk-Neutral Pricing
+### Risk-neutral pricing

```{index} single: Pricing Models; Risk-Neutral
```

@@ -117,7 +117,7 @@ Here ${\mathbb E}_t [y]$ denotes the best forecast of $y$, conditioned on inform

More precisely, ${\mathbb E}_t [y]$ is the mathematical expectation of $y$ conditional on information available at time $t$.

-### Pricing with Random Discount Factor
+### Pricing with random discount factor

```{index} single: Pricing Models; Risk Aversion
```

@@ -146,7 +146,7 @@ This is because such assets pay well when funds are more urgently wanted.

We give examples of how the stochastic discount factor has been modeled below.
-### Asset Pricing and Covariances +### Asset pricing and covariances Recall that, from the definition of a conditional covariance ${\rm cov}_t (x_{t+1}, y_{t+1})$, we have @@ -175,7 +175,7 @@ Equation {eq}`lteeqs102` asserts that the covariance of the stochastic discount We give examples of some models of stochastic discount factors that have been proposed later in this lecture and also in a [later lecture](https://python-advanced.quantecon.org/lucas_model.html). -### The Price-Dividend Ratio +### The price-dividend ratio Aside from prices, another quantity of interest is the **price-dividend ratio** $v_t := p_t / d_t$. @@ -191,7 +191,7 @@ v_t = {\mathbb E}_t \left[ m_{t+1} \frac{d_{t+1}}{d_t} (1 + v_{t+1}) \right] Below we'll discuss the implication of this equation. -## Prices in the Risk-Neutral Case +## Prices in the risk-neutral case What can we say about price dynamics on the basis of the models described above? @@ -204,7 +204,7 @@ For now we'll study the risk-neutral case in which the stochastic discount fac We'll focus on how an asset price depends on a dividend process. -### Example 1: Constant Dividends +### Example 1: constant dividends The simplest case is risk-neutral price of a constant, non-random dividend stream $d_t = d > 0$. @@ -235,7 +235,7 @@ This is the equilibrium price in the constant dividend case. Indeed, simple algebra shows that setting $p_t = \bar p$ for all $t$ satisfies the difference equation $p_t = \beta (d + p_{t+1})$. -### Example 2: Dividends with Deterministic Growth Paths +### Example 2: dividends with deterministic growth paths Consider a growing, non-random dividend process $d_{t+1} = g d_t$ where $0 < g \beta < 1$. @@ -268,7 +268,7 @@ $$ This is called the *Gordon formula*. 
(mass_mg)=
-### Example 3: Markov Growth, Risk-Neutral Pricing
+### Example 3: Markov growth, risk-neutral pricing

Next, we consider a dividend process

@@ -331,7 +331,7 @@ plt.tight_layout()
plt.show()
```

-#### Pricing Formula
+#### Pricing formula

To obtain asset prices in this setting, let's adapt our analysis from the case of deterministic growth.

@@ -431,7 +431,7 @@ Moreover, dividend growth is increasing in the state.

The anticipation of high future dividend growth leads to a high price-dividend ratio.

-## Risk Aversion and Asset Prices
+## Risk aversion and asset prices

Now let's turn to the case where agents are risk averse.

We'll price several distinct assets, including

* A consol (a type of bond issued by the UK government in the 19th century)
* Call options on a consol

-### Pricing a Lucas Tree
+### Pricing a Lucas tree

```{index} single: Finite Markov Asset Pricing; Lucas Tree
```

@@ -641,7 +641,7 @@ This is because, with a positively correlated state process, higher states indic

With the stochastic discount factor {eq}`lucsdf2`, higher growth decreases the discount factor, lowering the weight placed on future dividends.

-#### Special Cases
+#### Special cases

In the special case $\gamma =1$, we have $J = P$.

@@ -660,7 +660,7 @@ risk-neutral solution {eq}`rned`.

This is as expected, since $\gamma = 0$ implies $u(c) = c$ (and hence agents are risk-neutral).

-### A Risk-Free Consol
+### A risk-free consol

Consider the same pure exchange representative agent economy.

@@ -741,13 +741,13 @@ def consol_price(ap, ζ):
    return p
```

-### Pricing an Option to Purchase the Consol
+### Pricing an option to purchase the consol

Let's now price options of various maturities.

We'll study an option that gives the owner the right to purchase a consol at a price $p_S$.

-#### An Infinite Horizon Call Option
+#### An infinite horizon call option

We want to price an *infinite horizon* option to purchase a consol at a price $p_S$.
@@ -885,11 +885,11 @@ where the consol prices are high --- will be visited recurrently. The reason for low valuations in high Markov growth states is that $\beta=0.9$, so future payoffs are discounted substantially. -### Risk-Free Rates +### Risk-free rates Let's look at risk-free interest rates over different periods. -#### The One-period Risk-free Interest Rate +#### The one-period risk-free interest rate As before, the stochastic discount factor is $m_{t+1} = \beta g_{t+1}^{-\gamma}$. @@ -907,7 +907,7 @@ $$ where the $i$-th element of $m_1$ is the reciprocal of the one-period gross risk-free interest rate in state $x_i$. -#### Other Terms +#### Other terms Let $m_j$ be an $n \times 1$ vector whose $i$ th component is the reciprocal of the $j$ -period gross risk-free interest rate in state $x_i$. diff --git a/lectures/markov_perf.md b/lectures/markov_perf.md index f4e0a4f24..a31921db5 100644 --- a/lectures/markov_perf.md +++ b/lectures/markov_perf.md @@ -88,7 +88,7 @@ Well known examples include Let's examine a model of the first type. -### Example: A Duopoly Model +### Example: a duopoly model Two firms are the only producers of a good, the demand for which is governed by a linear inverse demand function @@ -170,7 +170,7 @@ These iterations can be challenging to implement computationally. However, they simplify for the case in which one-period payoff functions are quadratic and transition laws are linear --- which takes us to our next topic. -## Linear Markov Perfect Equilibria +## Linear Markov perfect equilibria ```{index} single: Linear Markov Perfect Equilibria ``` @@ -181,7 +181,7 @@ In linear-quadratic dynamic games, these "stacked Bellman equations" become "sta We'll lay out that structure in a general setup and then apply it to some simple problems. -### Coupled Linear Regulator Problems +### Coupled linear regulator problems We consider a general linear-quadratic regulator game with two players. 
@@ -222,7 +222,7 @@ Here * $A$ is $n \times n$ * $B_i$ is $n \times k_i$ -### Computing Equilibrium +### Computing equilibrium We formulate a linear Markov perfect equilibrium as follows. @@ -319,13 +319,13 @@ Moreover, since we need to solve these $k_1 + k_2$ equations simultaneously. -#### Key Insight +#### Key insight A key insight is that equations {eq}`orig-3` and {eq}`orig-5` are linear in $F_{1t}$ and $F_{2t}$. After these equations have been solved, we can take $F_{it}$ and solve for $P_{it}$ in {eq}`orig-4` and {eq}`orig-6`. -#### Infinite Horizon +#### Infinite horizon We often want to compute the solutions of such games for infinite horizons, in the hope that the decision rules $F_{it}$ settle down to be time-invariant as $t_1 \rightarrow +\infty$. @@ -344,7 +344,7 @@ We use the function [nnash](https://github.com/QuantEcon/QuantEcon.py/blob/maste Let's use these procedures to treat some applications, starting with the duopoly model. -### A Duopoly Model +### A duopoly model To map the duopoly model into coupled linear-quadratic dynamic programming problems, define the state and controls as @@ -420,7 +420,7 @@ The optimal decision rule of firm $i$ will take the form $u_{it} = - F_i x_t$, i x_{t+1} = (A - B_1 F_1 -B_1 F_2 ) x_t ``` -### Parameters and Solution +### Parameters and solution Consider the previously presented duopoly model with parameter values of: diff --git a/lectures/mccall_correlated.md b/lectures/mccall_correlated.md index 813d8a0d5..f240da9cf 100644 --- a/lectures/mccall_correlated.md +++ b/lectures/mccall_correlated.md @@ -54,7 +54,7 @@ from numba import jit, prange, float64 from numba.experimental import jitclass ``` -## The Model +## The model Wages at each point in time are given by @@ -93,7 +93,7 @@ In this express, $u$ is a utility function and $\mathbb E_z$ is expectation of n The variable $z$ enters as a state in the Bellman equation because its current value helps predict future wages. 
-### A Simplification +### A simplification There is a way that we can reduce dimensionality in this problem, which greatly accelerates computation. @@ -334,7 +334,7 @@ plt.show() As expected, higher unemployment compensation shifts the reservation wage up at all state values. -## Unemployment Duration +## Unemployment duration Next we study how mean unemployment duration varies with unemployment compensation. diff --git a/lectures/mccall_fitted_vfi.md b/lectures/mccall_fitted_vfi.md index 2253aaf8e..da7b8ae4a 100644 --- a/lectures/mccall_fitted_vfi.md +++ b/lectures/mccall_fitted_vfi.md @@ -55,7 +55,7 @@ from numba import jit, float64 from numba.experimental import jitclass ``` -## The Algorithm +## The algorithm The model is the same as the McCall model with job separation we {doc}`studied before `, except that the wage offer distribution is continuous. @@ -91,7 +91,7 @@ The function $q$ in {eq}`bell1mcmc` is the density of the wage offer distributio Its support is taken as equal to $\mathbb R_+$. -### Value Function Iteration +### Value function iteration In theory, we should now proceed as follows: @@ -111,7 +111,7 @@ is to record its value $v'(w)$ for every $w \in \mathbb R_+$. Clearly, this is impossible. -### Fitted Value Function Iteration +### Fitted value function iteration What we will do instead is use **fitted value function iteration**. 
diff --git a/lectures/mccall_model.md b/lectures/mccall_model.md
index 3298eec65..b9694a9f6 100644
--- a/lectures/mccall_model.md
+++ b/lectures/mccall_model.md
@@ -69,7 +69,7 @@ import quantecon as qe
from quantecon.distributions import BetaBinomial
```

-## The McCall Model
+## The McCall model

```{index} single: Models; McCall
```

@@ -106,7 +106,7 @@ The variable $y_t$ is income, equal to

* unemployment compensation $c$ when unemployed

-### A Trade-Off
+### A trade-off

The worker faces a trade-off:

@@ -122,7 +122,7 @@ Dynamic programming can be thought of as a two-step procedure that

We'll go through these steps in turn.

-### The Value Function
+### The value function

In order to optimally trade-off current and future rewards, we need to think about two things:

@@ -182,7 +182,7 @@ If we optimize and pick the best of these two options, we obtain maximal lifetim

But this is precisely $v^*(w)$, which is the left-hand side of {eq}`odu_pv`.

-### The Optimal Policy
+### The optimal policy

Suppose for now that we are able to solve {eq}`odu_pv` for the unknown function $v^*$.

@@ -233,7 +233,7 @@ The agent should accept if and only if the current wage offer exceeds the reserv

In view of {eq}`reswage`, we can compute this reservation wage if we can compute the value function.

-## Computing the Optimal Policy: Take 1
+## Computing the optimal policy: take 1

To put the above ideas into action, we need to compute the value function at each possible state $w \in \mathbb W$.

@@ -265,7 +265,7 @@ v^*(i)

-### The Algorithm
+### The algorithm

To compute this vector, we use successive approximations:

@@ -295,7 +295,7 @@ For a small tolerance, the returned function $v$ is a close approximation to the

The theory below elaborates on this point.

-### Fixed Point Theory
+### Fixed point theory

What's the mathematics behind these ideas?
@@ -509,7 +509,7 @@ The next line computes the reservation wage at default parameters compute_reservation_wage(mcm) ``` -### Comparative Statics +### Comparative statics Now that we know how to compute the reservation wage, let's see how it varies with parameters. @@ -553,7 +553,7 @@ As expected, the reservation wage increases both with patience and with unemployment compensation. (mm_op2)= -## Computing an Optimal Policy: Take 2 +## Computing an optimal policy: take 2 The approach to dynamic programming just described is standard and broadly applicable. diff --git a/lectures/mccall_model_with_separation.md b/lectures/mccall_model_with_separation.md index fe7d2c48a..d766d8e81 100644 --- a/lectures/mccall_model_with_separation.md +++ b/lectures/mccall_model_with_separation.md @@ -64,7 +64,7 @@ from typing import NamedTuple from quantecon.distributions import BetaBinomial ``` -## The Model +## The model The model is similar to the {doc}`baseline McCall job search model `. @@ -89,7 +89,7 @@ introducing a utility function $u$. It satisfies $u'> 0$ and $u'' < 0$. -### The Wage Process +### The wage process For now we will drop the separation of state process and wage process that we maintained for the {doc}`baseline model `. @@ -102,7 +102,7 @@ The set of possible wage values is denoted by $\mathbb W$. driving random outcomes, since this formulation is usually convenient in more sophisticated models.) -### Timing and Decisions +### Timing and decisions At the start of each period, the agent can be either @@ -128,7 +128,7 @@ The process then repeats. We do not allow for job search while employed---this topic is taken up in a {doc}`later lecture `. ``` -## Solving the Model +## Solving the model We drop time subscripts in what follows and primes denote next period values. @@ -142,7 +142,7 @@ Here *value* means the value of the objective function {eq}`objective` when the Our first aim is to obtain these functions. 
-### The Bellman Equations +### The Bellman equations Suppose for now that the worker can calculate the functions $v$ and $h$ and use them in his decision making. @@ -183,7 +183,7 @@ Equations {eq}`bell1_mccall` and {eq}`bell2_mccall` are the Bellman equations fo They provide enough information to solve for both $v$ and $h$. (ast_mcm)= -### A Simplifying Transformation +### A simplifying transformation Rather than jumping straight into solving these equations, let's see if we can simplify them somewhat. @@ -236,7 +236,7 @@ v(w) = u(w) + \beta In the last expression, we wrote $w_e$ as $w$ to make the notation simpler. -### The Reservation Wage +### The reservation wage Suppose we can use {eq}`bell02_mccall` and {eq}`bell01_mccall` to solve for $d$ and $v$. @@ -260,7 +260,7 @@ w \geq \bar w \bar w \text{ solves } v(\bar w) = u(c) + \beta d $$ -### Solving the Bellman Equations +### Solving the Bellman equations We'll use the same iterative approach to solving the Bellman equations that we adopted in the {doc}`first job search lecture `. @@ -377,7 +377,7 @@ def solve_model(model, tol=1e-5, max_iter=2000): return v_final, d_final ``` -### The Reservation Wage: First Pass +### The reservation wage: first pass The optimal choice of the agent is summarized by the reservation wage. @@ -405,7 +405,7 @@ plt.show() The value $v$ is increasing because higher $w$ generates a higher wage flow conditional on staying employed. -### The Reservation Wage: Computation +### The reservation wage: computation Here's a function `compute_reservation_wage` that takes an instance of `Model` and returns the associated reservation wage. @@ -428,11 +428,11 @@ def compute_reservation_wage(model): Next we will investigate how the reservation wage varies with parameters. -## Impact of Parameters +## Impact of parameters In each instance below, we'll show you a figure and then ask you to reproduce it in the exercises. 
-### The Reservation Wage and Unemployment Compensation
+### The reservation wage and unemployment compensation

First, let's look at how $\bar w$ varies with unemployment compensation.

@@ -447,7 +447,7 @@ As expected, higher unemployment compensation causes the worker to hold out for

In effect, the cost of continuing job search is reduced.

-### The Reservation Wage and Discounting
+### The reservation wage and discounting

Next, let's investigate how $\bar w$ varies with the discount factor.

@@ -460,7 +460,7 @@ $\beta$

Again, the results are intuitive: More patient workers will hold out for higher wages.

-### The Reservation Wage and Job Destruction
+### The reservation wage and job destruction

Finally, let's look at how $\bar w$ varies with the job separation rate $\alpha$.

diff --git a/lectures/mccall_q.md b/lectures/mccall_q.md
index 4d7b23d5d..4b840ef05 100644
--- a/lectures/mccall_q.md
+++ b/lectures/mccall_q.md
@@ -82,7 +82,7 @@ import matplotlib.pyplot as plt

np.random.seed(123)
```

-## Review of McCall Model
+## Review of McCall model

We begin by reviewing the McCall model described in {doc}`this quantecon lecture `.

@@ -239,7 +239,7 @@ We'll use this value function as a benchmark later after we have done some Q-lea

print(valfunc_VFI)
```

-## Implied Quality Function $Q$
+## Implied quality function $Q$

A **quality function** $Q$ map state-action pairs into optimal values.

@@ -313,7 +313,7 @@ $$

+++

-## From Probabilities to Samples
+## From probabilities to samples

We noted above that the optimal Q function for our McCall worker satisfies the Bellman equations

@@ -370,7 +370,7 @@ to objects in equation system {eq}`eq:old105`.

This informal argument takes us to the threshold of Q-learning.

-## Q-Learning
+## Q-learning

Let's first describe a $Q$-learning algorithm precisely.
@@ -704,7 +704,7 @@ The above graphs indicates that

* the quality of approximation to the "true" value function computed by value function iteration improves for longer epochs

-## Employed Worker Can't Quit
+## Employed worker can't quit

The preceding version of temporal difference Q-learning described in equation system {eq}`eq:old4` lets an employed worker quit, i.e., reject her wage as an incumbent and instead receive unemployment compensation this period

@@ -744,7 +744,7 @@ We illustrate these possibilities with the following code and graph.

plot_epochs(epochs_to_plot=[100, 1000, 10000, 100000, 200000], quit_allowed=0)
```

-## Possible Extensions
+## Possible extensions

To extend the algorthm to handle problems with continuous state spaces, a typical approach is to restrict Q-functions and policy functions to take particular

diff --git a/lectures/mix_model.md b/lectures/mix_model.md
index 635061a6d..ee3e7b09d 100644
--- a/lectures/mix_model.md
+++ b/lectures/mix_model.md
@@ -207,7 +207,7 @@ l_arr_f = simulate(F_a, F_b, N=50000)
l_seq_f = np.cumprod(l_arr_f, axis=1)
```

-## Sampling from Compound Lottery $H$
+## Sampling from compound lottery $H$

We implement two methods to draw samples from our mixture model $\alpha F + (1-\alpha) G$.

@@ -293,7 +293,7 @@ plt.legend()
plt.show()
```

-## Type 1 Agent
+## Type 1 agent

We'll now study what our type 1 agent learns

@@ -396,7 +396,7 @@ Formula {eq}`eq:bayeslaw103` generalizes formula {eq}`eq:recur1`.

Formula {eq}`eq:bayeslaw103` can be regarded as a one step revision of prior probability $ \pi_0 $ after seeing the batch of data $ \left\{ w_{i}\right\} _{i=1}^{t+1} $.

-## What a type 1 Agent Learns when Mixture $H$ Generates Data
+## What a type 1 agent learns when mixture $H$ generates data

We now study what happens when the mixture distribution $h;\alpha$ truly generated the data each period.
@@ -472,7 +472,7 @@ plot_π_seq(α = 0.2)

Evidently, $\alpha$ is having a big effect on the destination of $\pi_t$ as $t \rightarrow + \infty$

-## Kullback-Leibler Divergence Governs Limit of $\pi_t$
+## Kullback-Leibler divergence governs limit of $\pi_t$

To understand what determines whether the limit point of $\pi_t$ is $0$ or $1$ and how the answer depends on the true value of the mixing probability $\alpha \in (0,1) $ that generates

@@ -617,7 +617,7 @@ Kullback-Leibler divergence:

- When $\alpha$ is large, $KL_f < KL_g$ meaning the divergence of $f$ from $h$ is smaller than that of $g$ and so the limit point of $\pi_t$ is close to $1$.

-## Type 2 Agent
+## Type 2 agent

We now describe how our type 2 agent formulates his learning problem and what he eventually learns.

@@ -702,7 +702,7 @@ plt.show()

Evidently, the Bayesian posterior narrows in on the true value $\alpha = .8$ of the mixing parameter as the length of a history of observations grows.

-## Concluding Remarks
+## Concluding remarks

Our type 1 person deploys an incorrect statistical model.

diff --git a/lectures/mle.md b/lectures/mle.md
index e71e1da44..91718cec8 100644
--- a/lectures/mle.md
+++ b/lectures/mle.md
@@ -60,11 +60,11 @@ from statsmodels.iolib.summary2 import summary_col

We assume familiarity with basic probability and multivariate calculus.

-## Set Up and Assumptions
+## Set up and assumptions

Let's consider the steps we need to go through in maximum likelihood estimation and how they pertain to this study.

-### Flow of Ideas
+### Flow of ideas

The first step with maximum likelihood estimation is to choose the probability distribution believed to be generating the data.

@@ -81,7 +81,7 @@ We'll let the data pick out a particular element of the class by pinning down th

The parameter estimates so produced will be called **maximum likelihood estimates**.
-### Counting Billionaires +### Counting billionaires Treisman {cite}`Treisman2016` is interested in estimating the number of billionaires in different countries. @@ -163,7 +163,7 @@ plt.show() From the histogram, it appears that the Poisson assumption is not unreasonable (albeit with a very low $\mu$ and some outliers). -## Conditional Distributions +## Conditional distributions In Treisman's paper, the dependent variable --- the number of billionaires $y_i$ in country $i$ --- is modeled as a function of GDP per capita, population size, and years membership in GATT and WTO. @@ -227,7 +227,7 @@ plt.show() We can see that the distribution of $y_i$ is conditional on $\mathbf{x}_i$ ($\mu_i$ is no longer constant). -## Maximum Likelihood Estimation +## Maximum likelihood estimation In our model for number of billionaires, the conditional distribution contains 4 ($k = 4$) parameters that we need to estimate. @@ -350,7 +350,7 @@ $$ However, no analytical solution exists to the above problem -- to find the MLE we need to use numerical methods. -## MLE with Numerical Methods +## MLE with numerical methods Many distributions do not have nice, analytical solutions and therefore require numerical methods to solve for parameter estimates. @@ -607,7 +607,7 @@ Note that our implementation of the Newton-Raphson algorithm is rather basic --- for more robust implementations see, for example, [scipy.optimize](https://docs.scipy.org/doc/scipy/reference/optimize.html). -## Maximum Likelihood Estimation with `statsmodels` +## Maximum likelihood estimation with `statsmodels` Now that we know what's going on under the hood, we can apply MLE to an interesting application. 
diff --git a/lectures/multi_hyper.md b/lectures/multi_hyper.md index b25b14571..ba6d73d0f 100644 --- a/lectures/multi_hyper.md +++ b/lectures/multi_hyper.md @@ -35,7 +35,7 @@ In the lecture we'll learn about * using a Monte Carlo simulation of a multivariate normal distribution to evaluate the quality of a normal approximation * the administrator's problem and why the multivariate hypergeometric distribution is the right tool -## The Administrator's Problem +## The administrator's problem An administrator in charge of allocating research grants is in the following situation. @@ -62,7 +62,7 @@ The $n$ balls drawn represent successful proposals and are awarded research fu The remaining $N-n$ balls receive no research funds. -### Details of the Awards Procedure Under Study +### Details of the awards procedure under study Let $k_i$ be the number of balls of color $i$ that are drawn. @@ -106,7 +106,7 @@ the population of $N$ balls. The right tool for the administrator's job is the **multivariate hypergeometric distribution**. -### Multivariate Hypergeometric Distribution +### Multivariate hypergeometric distribution Let's start with some imports. @@ -304,7 +304,7 @@ n = 6 Σ ``` -### Back to The Administrator's Problem +### Back to the administrator's problem Now let's turn to the grant administrator's problem. @@ -368,7 +368,7 @@ np.cov(sample.T) Evidently, the sample means and covariances approximate their population counterparts well. -### Quality of Normal Approximation +### Quality of normal approximation To judge the quality of a multivariate normal approximation to the multivariate hypergeometric distribution, we draw a large sample from a multivariate normal distribution with the mean vector and covariance matrix for the corresponding multivariate hypergeometric distribution and compare the simulated distribution with the population multivariate hypergeometric distribution. 
diff --git a/lectures/multivariate_normal.md b/lectures/multivariate_normal.md
index 6e7af55ee..8e8aa59a8 100644
--- a/lectures/multivariate_normal.md
+++ b/lectures/multivariate_normal.md
@@ -44,7 +44,7 @@ We will use the multivariate normal distribution to formulate some useful model

* time series generated by linear stochastic difference equations
* optimal linear filtering theory

-## The Multivariate Normal Distribution
+## The multivariate normal distribution

This lecture defines a Python class `MultivariateNormal` to be used to generate **marginal** and **conditional** distributions associated

@@ -263,7 +263,7 @@ squares regressions.

We’ll compare those linear least squares regressions for the simulated data to their population counterparts.

-## Bivariate Example
+## Bivariate example

We start with a bivariate normal distribution pinned down by

@@ -505,7 +505,7 @@ closely approximate their population counterparts.

A Law of Large Numbers explains why sample analogues approximate population objects.

-## Trivariate Example
+## Trivariate example

Let’s apply our code to a trivariate example.

@@ -569,7 +569,7 @@ multi_normal.βs[0], results.params

Once again, sample analogues do a good job of approximating their populations counterparts.

-## One Dimensional Intelligence (IQ)
+## One dimensional intelligence (IQ)

Let’s move closer to a real-life example, namely, inferring a one-dimensional measure of intelligence called IQ from a list of test

@@ -812,7 +812,7 @@ If we were to drive the number of tests $n \rightarrow + \infty$, the

conditional standard deviation $\hat{\sigma}_{\theta}$ would converge to $0$ at rate $\frac{1}{n^{.5}}$.

-## Information as Surprise
+## Information as surprise

By using a different representation, let’s look at things from a different perspective.
@@ -927,7 +927,7 @@ np.max(np.abs(μθ_hat_arr - μθ_hat_arr_C)) < 1e-10

np.max(np.abs(Σθ_hat_arr - Σθ_hat_arr_C)) < 1e-10
```

-## Cholesky Factor Magic
+## Cholesky factor magic

Evidently, the Cholesky factorizations automatically computes the

population **regression coefficients** and associated statistics

@@ -943,7 +943,7 @@ Indeed, in formula {eq}`mnv_1`,

- the coefficient $c_i$ is the simple population regression coefficient of $\theta - \mu_\theta$ on $\epsilon_i$

-## Math and Verbal Intelligence
+## Math and verbal intelligence

We can alter the preceding example to be more realistic.

@@ -1097,7 +1097,7 @@ for indices, IQ, conditions in [([*range(2*n), 2*n], 'θ', 'y1, y2, y3, y4'),

Evidently, math tests provide no information about $\mu$ and language tests provide no information about $\eta$.

-## Univariate Time Series Analysis
+## Univariate time series analysis

We can use the multivariate normal distribution and a little matrix algebra to present foundations of univariate linear time series

@@ -1262,7 +1262,7 @@ x = z[:T+1]

y = z[T+1:]
```

-### Smoothing Example
+### Smoothing example

This is an instance of a classic `smoothing` calculation whose purpose is to compute $E X \mid Y$.

@@ -1296,7 +1296,7 @@ print(" E [ X | Y] = ", )

multi_normal_ex1.cond_dist(0, y)
```

-### Filtering Exercise
+### Filtering exercise

Compute $E\left[x_{t} \mid y_{t-1}, y_{t-2}, \dots, y_{0}\right]$.

@@ -1339,7 +1339,7 @@ sub_y = y[:t]

multi_normal_ex2.cond_dist(0, sub_y)
```

-### Prediction Exercise
+### Prediction exercise

Compute $E\left[y_{t} \mid y_{t-j}, \dots, y_{0} \right]$.

@@ -1379,7 +1379,7 @@ sub_y = y[:t-j+1]

multi_normal_ex3.cond_dist(0, sub_y)
```

-### Constructing a Wold Representation
+### Constructing a Wold representation

Now we’ll apply Cholesky decomposition to decompose $\Sigma_{y}=H H^{\prime}$ and form

@@ -1413,7 +1413,7 @@ y

This example is an instance of what is known as a **Wold representation** in time series analysis.
-## Stochastic Difference Equation +## Stochastic difference equation Consider the stochastic second-order linear difference equation @@ -1566,7 +1566,7 @@ C = np.array([[𝛼2, 𝛼1], [0, 𝛼2]]) Σy = A_inv @ (Σb + Σu) @ A_inv.T ``` -## Application to Stock Price Model +## Application to stock price model Let @@ -1694,7 +1694,7 @@ be if people did not have perfect foresight but were optimally predicting future dividends on the basis of the information $y_t, y_{t-1}$ at time $t$. -## Filtering Foundations +## Filtering foundations Assume that $x_0$ is an $n \times 1$ random vector and that $y_0$ is a $p \times 1$ random vector determined by the @@ -1930,7 +1930,7 @@ x1_cond = A @ μ1_hat x1_cond, Σ1_cond ``` -### Code for Iterating +### Code for iterating Here is code for solving a dynamic filtering problem by iterating on our equations, followed by an example. @@ -1974,7 +1974,7 @@ The iterative algorithm just described is a version of the celebrated **Kalman f We describe the Kalman filter and some applications of it in {doc}`A First Look at the Kalman Filter ` -## Classic Factor Analysis Model +## Classic factor analysis model The factor analysis model widely used in psychology and other fields can be represented as @@ -2135,7 +2135,7 @@ $\Lambda I^{-1} f = \Lambda f$. Λ @ f ``` -## PCA and Factor Analysis +## PCA and factor analysis To learn about Principal Components Analysis (PCA), please see this lecture {doc}`Singular Value Decompositions `. diff --git a/lectures/navy_captain.md b/lectures/navy_captain.md index 56431f556..f7eb6d821 100644 --- a/lectures/navy_captain.md +++ b/lectures/navy_captain.md @@ -204,7 +204,7 @@ plt.show() Above, we plot the two possible probability densities $f_0$ and $f_1$ -## Frequentist Decision Rule +## Frequentist decision rule The Navy told the Captain to use a frequentist decision rule.
@@ -458,7 +458,7 @@ axs[1].set_title(r'optimal PFA and PD given $\pi^*$') plt.show() ``` -## Bayesian Decision Rule +## Bayesian decision rule In {doc}`A Problem that Stumped Milton Friedman `, we learned how Abraham Wald confirmed the Navy @@ -776,7 +776,7 @@ axs[1].legend() plt.show() ``` -## Was the Navy Captain’s Hunch Correct? +## Was the Navy captain’s hunch correct? We now compare average (i.e., frequentist) losses obtained by the frequentist and Bayesian decision rules. @@ -832,7 +832,7 @@ $\bar{V}_{fre}-\bar{V}_{Bayes}$. It is always positive. -## More Details +## More details We can provide more insights by focusing on the case in which $\pi^{*}=0.5=\pi_{0}$. @@ -857,7 +857,7 @@ corresponding to `t_optimal` sample size. t_idx = t_optimal - 1 ``` -## Distribution of Bayesian Decision Rule’s Time to Decide +## Distribution of Bayesian decision rule’s time to decide We use simulations to compute the frequency distribution of the time to decide for the Bayesian decision rule and compare that time to the @@ -992,7 +992,7 @@ plt.title('Unconditional distribution of times') plt.show() ``` -## Probability of Making Correct Decision +## Probability of making correct decision Now we use simulations to compute the fraction of samples in which the Bayesian and the frequentist decision rules decide correctly. @@ -1051,7 +1051,7 @@ plt.title('Uncond. probability of making correct decisions before t') plt.show() ``` -## Distribution of Likelihood Ratios at Frequentist’s $t$ +## Distribution of likelihood ratios at frequentist’s $t$ Next we use simulations to construct distributions of likelihood ratios after $t$ draws.
diff --git a/lectures/newton_method.md b/lectures/newton_method.md index 3af3b5f62..89e5eef9d 100644 --- a/lectures/newton_method.md +++ b/lectures/newton_method.md @@ -94,7 +94,7 @@ import autograd.numpy as np plt.rcParams["figure.figsize"] = (10, 5.7) ``` -## Fixed Point Computation Using Newton's Method +## Fixed point computation using Newton's method In this section we solve the fixed point of the law of motion for capital in the setting of the [Solow growth @@ -104,7 +104,7 @@ We will inspect the fixed point visually, solve it by successive approximation, and then apply Newton's method to achieve faster convergence. (solow)= -### The Solow Model +### The Solow model In the Solow growth model, assuming Cobb-Douglas production technology and zero population growth, the law of motion for capital is @@ -214,7 +214,7 @@ plt.show() We see that $k^*$ is indeed the unique positive fixed point. -#### Successive Approximation +#### Successive approximation First let's compute the fixed point using successive approximation. @@ -263,7 +263,7 @@ This is close to the true value. k_star ``` -#### Newton's Method +#### Newton's method In general, when applying Newton's fixed point method to some function $g$, we start with a guess $x_0$ of the fixed @@ -363,7 +363,7 @@ plot_trajectories(params) We can see that Newton's method converges faster than successive approximation. -## Root-Finding in One Dimension +## Root-finding in one dimension In the previous section we computed fixed points. @@ -375,7 +375,7 @@ the problem of finding fixed points. -### Newton's Method for Zeros +### Newton's method for zeros Let's suppose we want to find an $x$ such that $f(x)=0$ for some smooth function $f$ mapping real numbers to real numbers. @@ -438,7 +438,7 @@ automatic differentiation or GPU acceleration, it will be helpful to know how to implement Newton's method ourselves.)
-### Application to Finding Fixed Points +### Application to finding fixed points Now consider again the Solow fixed-point calculation, where we solve for $k$ satisfying $g(k) = k$. @@ -464,7 +464,7 @@ The result confirms the descent we saw in the graphs above: a very accurate resu -## Multivariate Newton’s Method +## Multivariate Newton’s method In this section, we introduce a two-good problem, present a visualization of the problem, and solve for the equilibrium of the two-good market @@ -477,7 +477,7 @@ We will see a significant performance gain when using Netwon's method. (two_goods_market)= -### A Two Goods Market Equilibrium +### A two goods market equilibrium Let's start by computing the market equilibrium of a two-good problem. @@ -531,7 +531,7 @@ $$ for this particular question. -#### A Graphical Exploration +#### A graphical exploration Since our problem is only two-dimensional, we can use graphical analysis to visualize and help understand the problem. @@ -648,7 +648,7 @@ plt.show() It seems there is an equilibrium close to $p = (1.6, 1.5)$. -#### Using a Multidimensional Root Finder +#### Using a multidimensional root finder To solve for $p^*$ more precisely, we use a zero-finding algorithm from `scipy.optimize`. @@ -681,7 +681,7 @@ np.max(np.abs(e(p, A, b, c))) This is indeed a very small error. -#### Adding Gradient Information +#### Adding gradient information In many cases, for zero-finding algorithms applied to smooth functions, supplying the [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant) of the function leads to better convergence properties. @@ -724,7 +724,7 @@ p = solution.x np.max(np.abs(e(p, A, b, c))) ``` -#### Using Newton's Method +#### Using Newton's method Now let's use Newton's method to compute the equilibrium price using the multivariate version of Newton's method @@ -785,7 +785,7 @@ The result is very accurate. With the larger overhead, the speed is not better than the optimized `scipy` function.
-### A High-Dimensional Problem +### A high-dimensional problem Our next step is to investigate a large market with 3,000 goods. diff --git a/lectures/odu.md b/lectures/odu.md index e359c36ab..dec3b6c4e 100644 --- a/lectures/odu.md +++ b/lectures/odu.md @@ -67,7 +67,7 @@ import scipy.optimize as op from scipy.stats import cumfreq, beta ``` -### Model Features +### Model features - Infinite horizon dynamic programming with two states and one binary control. @@ -79,7 +79,7 @@ Let’s first review the basic McCall model {cite}`McCall1970` and then add the variation we want to consider. -### The Basic McCall Model +### The basic mccall model Recall that, {doc}`in the baseline model `, an unemployed worker is presented in each period with a permanent job offer @@ -113,7 +113,7 @@ v(w) The optimal policy has the form $\mathbf{1}\{w \geq \bar w\}$, where $\bar w$ is a constant called the *reservation wage*. -### Offer Distribution Unknown +### Offer distribution unknown Now let’s extend the model by considering the variation presented in {cite}`Ljungqvist2012`, section 6.6. @@ -239,7 +239,7 @@ plt.show() ``` (looking-forward)= -### Looking Forward +### Looking forward What kind of optimal policy might result from {eq}`odu_mvf` and the parameterization specified above? @@ -266,7 +266,7 @@ $\mathbb 1{w\geq \bar w(\pi) }$ for some decreasing function $\bar w$. (take-1-solution-by-vfi)= -## Take 1: Solution by VFI +## Take 1: solution by VFI Let’s set about solving the model and see how our results match with our intuition. @@ -481,7 +481,7 @@ forward](looking-forward). $\bar w(\pi)$ introduced there. - It is decreasing as expected. -## Take 2: A More Efficient Method +## Take 2: a more efficient method Let’s consider another method to solve for the optimal policy. @@ -496,7 +496,7 @@ As a consequence, the algorithm is orders of magnitude faster than VFI. This section illustrates the point that when it comes to programming, a bit of mathematical analysis goes a long way. 
-## Another Functional Equation +## Another functional equation To begin, note that when $w = \bar w(\pi)$, the worker is indifferent between accepting and rejecting. @@ -548,7 +548,7 @@ Equation {eq}`odu_mvf4` can be understood as a functional equation, where $\bar * Let's call it the *reservation wage functional equation* (RWFE). * The solution $\bar w$ to the RWFE is the object that we wish to compute. -## Solving the RWFE +## Solving the RWFE To solve the RWFE, we will first show that its solution is the fixed point of a [contraction mapping](https://en.wikipedia.org/wiki/Contraction_mapping). @@ -766,7 +766,7 @@ plt.show() ```{solution-end} ``` -## Appendix A +## Appendix A The next piece of code generates a fun simulation to see what the effect of a change in the underlying distribution on the unemployment rate is. @@ -852,7 +852,7 @@ ax.legend() plt.show() ``` -## Appendix B +## Appendix B In this appendix we provide more details about how Bayes' Law contributes to the workings of the model. @@ -1061,7 +1061,7 @@ We now provide some examples that provide insights about how the model works. ## Examples -### Example 1 (Baseline) +### Example 1 (baseline) $F$ ~ Beta(1, 1), $G$ ~ Beta(3, 1.2), $c$=0.3. diff --git a/lectures/ols.md b/lectures/ols.md index 51497a664..305e8df3a 100644 --- a/lectures/ols.md +++ b/lectures/ols.md @@ -76,7 +76,7 @@ This lecture assumes you are familiar with basic econometrics. For an introductory text covering these topics, see, for example, {cite}`Wooldridge2015`. -## Simple Linear Regression +## Simple linear regression {cite}`Acemoglu2001` wish to determine whether or not differences in institutions can help to explain observed economic outcomes.
@@ -302,7 +302,7 @@ ax.set_ylabel('logpgp95') plt.show() ``` -## Extending the Linear Regression Model +## Extending the linear regression model So far we have only accounted for institutions affecting economic performance - almost certainly there are numerous other factors diff --git a/lectures/opt_transport.md b/lectures/opt_transport.md index d8e19120f..e5111c23e 100644 --- a/lectures/opt_transport.md +++ b/lectures/opt_transport.md @@ -57,7 +57,7 @@ from scipy.stats import betabinom import networkx as nx ``` -## The Optimal Transport Problem +## The optimal transport problem Suppose that $m$ factories produce goods that must be sent to $n$ locations. @@ -128,13 +128,13 @@ More about this later. -## The Linear Programming Approach +## The linear programming approach In this section we discuss using using standard linear programming solvers to tackle the optimal transport problem. -### Vectorizing a Matrix of Decision Variables +### Vectorizing a matrix of decision variables A *matrix* of decision variables $x_{ij}$ appears in problem {eq}`plannerproblem`. @@ -255,7 +255,7 @@ $$ $$ -### An Application +### An application We now provide an example that takes the form {eq}`decisionvars` that we'll @@ -476,7 +476,7 @@ The vector $z$ evidently equals $\operatorname{vec}(X)$. The minimized cost from the optimal transport plan is given by the $fun$ variable. -### Using a Just-in-Time Compiler +### Using a just-in-time compiler We can also solve optimal transportation problems using a powerful tool from QuantEcon, namely, `quantecon.optimize.linprog_simplex`. @@ -542,7 +542,7 @@ As you can see, the `quantecon.optimize.linprog_simplex` is much faster. QuantEcon version, having been tested more extensively over a longer period of time.) -## The Dual Problem +## The dual problem Let $u, v$ denotes vectors of dual decision variables with entries $(u_i), (v_j)$. 
@@ -642,7 +642,7 @@ This equality is assured by **complementary slackness** conditions that state -## The Python Optimal Transport Package +## The Python optimal transport package There is an excellent [Python package](https://pythonot.github.io/) for optimal transport that simplifies some of the steps we took above. @@ -654,7 +654,7 @@ passing the data out to a linear programming routine. since we want to understand what happens under the hood.) -### Replicating Previous Results +### Replicating previous results The following line of code solves the example application discussed above using linear programming. @@ -673,7 +673,7 @@ total_cost Here we use [np.vdot](https://numpy.org/doc/stable/reference/generated/numpy.vdot.html) for the trace inner product of X and C -### A Larger Application +### A larger application Now let's try using the same package on a slightly larger application. diff --git a/lectures/optgrowth.md b/lectures/optgrowth.md index 6f4a4dd00..38f2d7c7f 100644 --- a/lectures/optgrowth.md +++ b/lectures/optgrowth.md @@ -67,7 +67,7 @@ from scipy.interpolate import interp1d from scipy.optimize import minimize_scalar ``` -## The Model +## The model ```{index} single: Optimal Growth; Model ``` @@ -100,7 +100,7 @@ k_{t+1} + c_t \leq y_t and all variables are required to be nonnegative. -### Assumptions and Comments +### Assumptions and comments In what follows, @@ -156,7 +156,7 @@ In the present context * $y_t$ is called the *state* variable --- it summarizes the "state of the world" at the start of each period. * $c_t$ is called the *control* variable --- a value chosen by the agent each period after observing the state. 
-### The Policy Function Approach +### The policy function approach ```{index} single: Optimal Growth; Policy Function Approach ``` @@ -258,7 +258,7 @@ The value function gives the maximal value that can be obtained from state $y$, A policy $\sigma \in \Sigma$ is called **optimal** if it attains the supremum in {eq}`vfcsdp0` for all $y \in \mathbb R_+$. -### The Bellman Equation +### The Bellman equation With our assumptions on utility and production functions, the value function as defined in {eq}`vfcsdp0` also satisfies a **Bellman equation**. @@ -297,7 +297,7 @@ The Bellman equation is important because it gives us more information about the It also suggests a way of computing the value function, which we discuss below. -### Greedy Policies +### Greedy policies The primary importance of the value function is that we can use it to compute optimal policies. @@ -336,7 +336,7 @@ Hence, once we have a good approximation to $v^*$, we can compute the The advantage is that we are now solving a much lower dimensional optimization problem. -### The Bellman Operator +### The Bellman operator How, then, should we compute the value function? @@ -377,7 +377,7 @@ which says precisely that $v$ is a solution to the Bellman equation. It follows that $v^*$ is a fixed point of $T$. -### Review of Theoretical Results +### Review of theoretical results ```{index} single: Dynamic Programming; Theory ``` @@ -410,7 +410,7 @@ Hence, at least one optimal policy exists. Our problem now is how to compute it. -### {index}`Unbounded Utility ` +### {index}`Unbounded utility ` ```{index} single: Dynamic Programming; Unbounded Utility ``` @@ -461,7 +461,7 @@ The algorithm will be 1. Unless some stopping condition is satisfied, set $\{ v_1, \ldots, v_I \} = \{ T \hat v(y_1), \ldots, T \hat v(y_I) \}$ and go to step 2. -### Scalar Maximization +### Scalar maximization To maximize the right hand side of the Bellman equation {eq}`fpb30`, we are going to use the `minimize_scalar` routine from SciPy.
@@ -491,7 +491,7 @@ def maximize(g, a, b, args): return maximizer, maximum ``` -### Optimal Growth Model +### Optimal growth model We will assume for now that $\phi$ is the distribution of $\xi := \exp(\mu + s \zeta)$ where @@ -555,7 +555,7 @@ but it does have some theoretical advantages in the present setting. (For example, it preserves the contraction mapping property of the Bellman operator --- see, e.g., {cite}`pal2013`.) -### The Bellman Operator +### The Bellman operator The next function implements the Bellman operator. @@ -588,7 +588,7 @@ def T(v, og): ``` (benchmark_growth_mod)= -### An Example +### An example Let's suppose now that @@ -695,7 +695,7 @@ The sequence of iterates converges towards $v^*$. We are clearly getting closer. -### Iterating to Convergence +### Iterating to convergence We can write a function that iterates until the difference is below a particular tolerance level. @@ -728,7 +728,7 @@ plt.show() The figure shows that we are pretty much on the money. -### The Policy Function +### The policy function ```{index} single: Optimal Growth; Policy Function ``` diff --git a/lectures/optgrowth_fast.md b/lectures/optgrowth_fast.md index 514fa12b6..b3545125d 100644 --- a/lectures/optgrowth_fast.md +++ b/lectures/optgrowth_fast.md @@ -69,7 +69,7 @@ The function `brent_max` is also designed for embedding in JIT-compiled code. These are alternatives to similar functions in SciPy (which, unfortunately, are not JIT-aware). -## The Model +## The model ```{index} single: Optimal Growth; Model ``` @@ -124,7 +124,7 @@ This is where we sacrifice flexibility in order to gain more speed. The class includes some methods such as `u_prime` that we do not need now but will use in later lectures. -### The Bellman Operator +### The Bellman operator We will use JIT compilation to accelerate the Bellman operator. 
diff --git a/lectures/pandas_panel.md b/lectures/pandas_panel.md index 664545b81..629fad0e7 100644 --- a/lectures/pandas_panel.md +++ b/lectures/pandas_panel.md @@ -57,7 +57,7 @@ Additional detail will be added to our `DataFrame` using pandas' `merge` function, and data will be summarized with the `groupby` function. -## Slicing and Reshaping Data +## Slicing and reshaping data We will read in a dataset from the OECD of real minimum wages in 32 countries and assign it to `realwage`. @@ -172,7 +172,7 @@ realwage_f = realwage.xs(('Hourly', 'In 2015 constant prices at 2015 USD exchang realwage_f.head() ``` -## Merging Dataframes and Filling NaNs +## Merging dataframes and filling NaNs Similar to relational databases like SQL, pandas has built in methods to merge datasets together. @@ -341,7 +341,7 @@ merged = merged.transpose() merged.head() ``` -## Grouping and Summarizing Data +## Grouping and summarizing data Grouping and summarizing data can be particularly useful for understanding large panel datasets. @@ -481,7 +481,7 @@ plt.legend() plt.show() ``` -## Final Remarks +## Final remarks This lecture has provided an introduction to some of pandas' more advanced features, including multiindices, merging, grouping and diff --git a/lectures/perm_income.md b/lectures/perm_income.md index 5806cacdf..50b0791b0 100644 --- a/lectures/perm_income.md +++ b/lectures/perm_income.md @@ -54,7 +54,7 @@ import random from numba import jit ``` -## The Savings Problem +## The savings problem ```{index} single: Permanent Income Model; Savings Problem ``` @@ -105,7 +105,7 @@ $$ Not every martingale arises as a random walk (see, for example, [Wald's martingale](https://en.wikipedia.org/wiki/Wald%27s_martingale)).
-### The Decision Problem +### The decision problem A consumer has preferences over consumption streams that are ordered by the utility functional @@ -184,7 +184,7 @@ Finally, we impose the *no Ponzi scheme* condition This condition rules out an always-borrow scheme that would allow the consumer to enjoy bliss consumption forever. -### First-Order Conditions +### First-order conditions First-order conditions for maximizing {eq}`sprob1` subject to {eq}`sprob2` are @@ -215,7 +215,7 @@ One way to interpret {eq}`sprob5` is that consumption will change only when These ideas will be clarified below. (odr_pi)= -### The Optimal Decision Rule +### The optimal decision rule Now let's deduce the optimal decision rule [^fod]. @@ -272,7 +272,7 @@ These last two equations assert that consumption equals *economic income* * a constant marginal propensity to consume times the sum of non-financial wealth and financial wealth * the amount the consumer can consume while leaving its wealth intact -#### Responding to the State +#### Responding to the state The *state* vector confronting the consumer at $t$ is $\begin{bmatrix} b_t & z_t \end{bmatrix}$. @@ -329,7 +329,7 @@ A key is to use the fact that $(1 + r) \beta = 1$ and $(I - \beta A)^{-1} = \sum We've now successfully written $c_t$ and $b_{t+1}$ as functions of $b_t$ and $z_t$. -#### A State-Space Representation +#### A state-space representation We can summarize our dynamics in the form of a linear state-space system governing consumption, debt and income: @@ -419,7 +419,7 @@ We can then compute the mean and covariance of $\tilde y_t$ from \end{aligned} ``` -#### A Simple Example with IID Income +#### A simple example with IID income To gain some preliminary intuition on the implications of {eq}`pi_ssr`, let's look at a highly stylized example where income is just IID.
@@ -523,12 +523,12 @@ ax.set(xlabel='Time', ylabel='Consumption') plt.show() ``` -## Alternative Representations +## Alternative representations In this section, we shed more light on the evolution of savings, debt and consumption by representing their dynamics in several different ways. -### Hall's Representation +### Hall's representation ```{index} single: Permanent Income Model; Hall's Representation ``` @@ -633,7 +633,7 @@ Equation {eq}`pi_spr` can be rearranged to take the form Equation {eq}`sprob77` asserts that the *cointegrating residual* on the left side equals the conditional expectation of the geometric sum of future incomes on the right [^f8]. -### Cross-Sectional Implications +### Cross-sectional implications Consider again {eq}`sprob16abcd`, this time in light of our discussion of distribution dynamics in the {doc}`lecture on linear systems `. @@ -681,7 +681,7 @@ Equation {eq}`pi_vt` tells us that the variance of $c_t$ increases over time at A number of different studies have investigated this prediction and found some support for it (see, e.g., {cite}`DeatonPaxton1994`, {cite}`STY2004`). -### Impulse Response Functions +### Impulse response functions Impulse response functions measure responses to various impulses (i.e., temporary shocks). @@ -689,7 +689,7 @@ The impulse response function of $\{c_t\}$ to the innovation $\{w_t\}$ is a box. In particular, the response of $c_{t+j}$ to a unit increase in the innovation $w_{t+1}$ is $(1-\beta) U (I -\beta A)^{-1} C$ for all $j \geq 1$. -### Moving Average Representation +### Moving average representation It's useful to express the innovation to the expected present value of the endowment process in terms of a moving average representation for income $y_t$. @@ -731,7 +731,7 @@ c_{t+1} - c_t = (1-\beta) d(\beta) w_{t+1} The object $d(\beta)$ is the **present value of the moving average coefficients** in the representation for the endowment process $y_t$. 
(sub_classic_consumption)= -## Two Classic Examples +## Two classic examples We illustrate some of the preceding ideas with two examples. @@ -943,7 +943,7 @@ b_{t+1} - b_t = (K-1) a_t This indicates how the fraction $K$ of the innovation to $y_t$ that is regarded as permanent influences the fraction of the innovation that is saved. -## Further Reading +## Further reading The model described above significantly changed how economists think about consumption. @@ -955,7 +955,7 @@ For example, liquidity constraints and precautionary savings appear to be presen Further discussion can be found in, e.g., {cite}`HallMishkin1982`, {cite}`Parker1999`, {cite}`Deaton1991`, {cite}`Carroll2001`. (perm_income_appendix)= -## Appendix: The Euler Equation +## Appendix: the Euler equation Where does the first-order condition {eq}`sprob4` come from? diff --git a/lectures/perm_income_cons.md b/lectures/perm_income_cons.md index 1732f6ba0..45bd3ebc2 100644 --- a/lectures/perm_income_cons.md +++ b/lectures/perm_income_cons.md @@ -134,7 +134,7 @@ The dynamics of $\{y_t\}$ again follow the linear state space model The restrictions on the shock process and parameters are the same as in our {doc}`previous lecture `. -### Digression on a Useful Isomorphism +### Digression on a useful isomorphism The LQ permanent income model of consumption is mathematically isomorphic with a version of Barro's {cite}`Barro1979` model of tax smoothing. @@ -162,7 +162,7 @@ All characterizations of a $\{c_t, y_t, b_t\}$ in the LQ permanent income model See [consumption and tax smoothing models](https://python-advanced.quantecon.org/smoothing.html) for further exploitation of an isomorphism between consumption and tax smoothing models. 
-### A Specification of the Nonfinancial Income Process +### A specification of the nonfinancial income process For the purposes of this lecture, let's assume $\{y_t\}$ is a second-order univariate autoregressive process: @@ -198,7 +198,7 @@ C= \begin{bmatrix} U = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} $$ -## The LQ Approach +## The LQ approach {ref}`Previously ` we solved the permanent income model by solving a system of linear expectational difference equations subject to two boundary conditions. @@ -218,7 +218,7 @@ On the other hand, formulating the model in terms of an LQ dynamic programming p - finding the state (of a dynamic programming problem) is an art, and - iterations on a Bellman equation implicitly jointly solve both a forecasting problem and a control problem -### The LQ Problem +### The LQ problem Recall from our {doc}`lecture on LQ theory ` that the optimal linear regulator problem is to choose a decision rule for $u_t$ to minimize @@ -250,7 +250,7 @@ The optimal policy is $u_t = -Fx_t$, where $F := \beta (Q+\beta \tilde B'P \tild Under an optimal decision rule $F$, the state vector $x_t$ evolves according to $x_{t+1} = (\tilde A-\tilde BF) x_t + \tilde C w_{t+1}$. -### Mapping into the LQ Framework +### Mapping into the LQ framework To map into the LQ framework, we'll use @@ -325,7 +325,7 @@ The reason is that it drops out of the Euler equation for consumption. In what follows we set it equal to unity. -### The Exogenous Nonfinancial Income Process +### The exogenous nonfinancial income process First, we create the objects for the optimal linear regulator @@ -404,7 +404,7 @@ P, F, d = lqpi.stationary_values() # Compute value function and decision rule ABF = ALQ - BLQ @ F # Form closed loop system ``` -### Comparison with the Difference Equation Approach +### Comparison with the difference equation approach In our {doc}`first lecture ` on the infinite horizon permanent income problem we used a different solution method. 
@@ -469,7 +469,7 @@ Now let's create instances of the [LinearStateSpace](https://github.com/QuantEco To do this, we'll use the outcomes from our second method. -## Two Example Economies +## Two example economies In the spirit of Bewley models {cite}`Bewley86`, we'll generate panels of consumers. @@ -491,7 +491,7 @@ Those transient effects will not be present in the second example. We use methods affiliated with the [LinearStateSpace](https://github.com/QuantEcon/QuantEcon.py/blob/master/quantecon/lss.py) class to simulate the model. -### First Set of Initial Conditions +### First set of initial conditions We generate 25 paths of the exogenous non-financial income process and the associated optimal consumption and debt paths. @@ -505,7 +505,7 @@ Comparing sample paths with population distributions at each date $t$ is a usefu lss = qe.LinearStateSpace(A_LSS, C_LSS, G_LSS, mu_0=μ_0, Sigma_0=Σ_0) ``` -### Population and Sample Panels +### Population and sample panels In the code below, we use the [LinearStateSpace](https://github.com/QuantEcon/QuantEcon.py/blob/master/quantecon/lss.py) class to @@ -673,7 +673,7 @@ All of them accumulate debt in anticipation of rising nonfinancial income. They expect their nonfinancial income to rise toward the invariant distribution of income, a consequence of our having started them at $y_{-1} = y_{-2} = 0$. -#### Cointegration Residual +#### Cointegration residual The following figure plots realizations of the left side of {eq}`old12`, which, {ref}`as discussed in our last lecture `, is called the **cointegrating residual**. @@ -718,7 +718,7 @@ cointegration_figure(bsim0, csim0) plt.show() ``` -### A "Borrowers and Lenders" Closed Economy +### A "borrowers and lenders" closed economy When we set $y_{-1} = y_{-2} = 0$ and $b_0 =0$ in the preceding exercise, we make debt "head north" early in the sample. 
diff --git a/lectures/prob_matrix.md b/lectures/prob_matrix.md index b142b9e39..a23c38753 100644 --- a/lectures/prob_matrix.md +++ b/lectures/prob_matrix.md @@ -59,7 +59,7 @@ set_matplotlib_formats('retina') ``` -## Sketch of Basic Concepts +## Sketch of basic concepts We'll briefly define what we mean by a **probability space**, a **probability measure**, and a **random variable**. @@ -104,7 +104,7 @@ applied statisticians often proceed simply by specifying a form for an induced d That is how we'll proceed in this lecture and in many subsequent lectures. -## What Does Probability Mean? +## What does probability mean? Before diving in, we'll say a few words about what probability theory means and how it connects to statistics. @@ -194,7 +194,7 @@ Key concepts that connect probability theory with statistics are laws of large n * we say "partly" because a Bayesian also pays attention to relative frequencies -## Representing Probability Distributions +## Representing probability distributions A probability distribution $\textrm{Prob} (X \in A)$ can be described by its **cumulative distribution function (CDF)** @@ -231,7 +231,7 @@ Doing this enables us to confine our tool set basically to linear algebra. Later we'll briefly discuss how to approximate a continuous random variable with a discrete random variable. -## Univariate Probability Distributions +## Univariate probability distributions We'll devote most of this lecture to discrete-valued random variables, but we'll say a few things about continuous-valued random variables. @@ -323,7 +323,7 @@ $$ \textrm{Prob}\{X\in \tilde{X}\} =1 $$ -## Bivariate Probability Distributions +## Bivariate probability distributions We'll now discuss a bivariate **joint distribution**. 
@@ -357,7 +357,7 @@ $$ \sum_{i}\sum_{j}f_{ij}=1 $$ -## Marginal Probability Distributions +## Marginal probability distributions The joint distribution induce marginal distributions @@ -400,7 +400,7 @@ f(y)& = \int_{\mathbb{R}} f(x,y) dx \end{aligned} $$ -## Conditional Probability Distributions +## Conditional probability distributions Conditional probabilities are defined according to @@ -446,7 +446,7 @@ $$ $$ -## Transition Probability Matrix +## Transition probability matrix Consider the following joint probability distribution of two random variables. @@ -495,7 +495,7 @@ Note that -## Application: Forecasting a Time Series +## Application: forecasting a time series Suppose that there are two time periods. @@ -523,7 +523,7 @@ $$\text{Prob} \{X(1)=j|X(0)=i\}= \frac{f_{ij}}{ \sum_{j}f_{ij}}$$ - This formula is a workhorse for applied economic forecasters. -## Statistical Independence +## Statistical independence Random variables X and Y are statistically **independent** if @@ -550,7 +550,7 @@ $$ $$ -## Means and Variances +## Means and variances The mean and variance of a discrete random variable $X$ are @@ -571,7 +571,7 @@ $$ \end{aligned} $$ -## Matrix Representations of Some Bivariate Distributions +## Matrix representations of some bivariate distributions Let's use matrices to represent a joint distribution, conditional distribution, marginal distribution, and the mean and variance of a bivariate random variable. @@ -882,7 +882,7 @@ d_new.marg_dist() d_new.cond_dist() ``` -## A Continuous Bivariate Random Vector +## A continuous bivariate random vector A two-dimensional Gaussian distribution has joint density @@ -1079,7 +1079,7 @@ print(μy, σy) print(μ2 + ρ * σ2 * (1 - μ1) / σ1, np.sqrt(σ2**2 * (1 - ρ**2))) ``` -## Sum of Two Independently Distributed Random Variables +## Sum of two independently distributed random variables Let $X, Y$ be two independent discrete random variables that take values in $\bar{X}, \bar{Y}$, respectively. 
@@ -1237,7 +1237,7 @@ Thus, multiple joint distributions $[f_{ij}]$ can have the same marginals. **Remark:** - Couplings are important in optimal transport problems and in Markov processes. Please see this {doc}`lecture about optimal transport ` -## Copula Functions +## Copula functions Suppose that $X_1, X_2, \dots, X_n$ are $N$ random variables and that diff --git a/lectures/prob_meaning.md b/lectures/prob_meaning.md index dfde21873..abe8a57b2 100644 --- a/lectures/prob_meaning.md +++ b/lectures/prob_meaning.md @@ -69,7 +69,7 @@ import scipy.stats as st Empowered with these Python tools, we'll now explore the two meanings described above. -## Frequentist Interpretation +## Frequentist interpretation Consider the following classic example. @@ -337,7 +337,7 @@ $$ as $I$ goes to infinity. -## Bayesian Interpretation +## Bayesian interpretation We again use a binomial distribution. @@ -694,7 +694,7 @@ As shown in the figure above, as the number of observations grows, the Bayesian However, if you take a closer look, you will find that the centers of the BCIs are not exactly $0.4$, due to the persistent influence of the prior distribution and the randomness of the simulation path. -## Role of a Conjugate Prior +## Role of a conjugate prior We have made assumptions that link functional forms of our likelihood function and our prior in a way that has eased our calculations considerably. diff --git a/lectures/qr_decomp.md b/lectures/qr_decomp.md index 09c57f302..e4f2e85a8 100644 --- a/lectures/qr_decomp.md +++ b/lectures/qr_decomp.md @@ -26,7 +26,7 @@ This lecture describes the QR decomposition and how it relates to We'll write some Python code to help consolidate our understandings. -## Matrix Factorization +## Matrix factorization The QR decomposition (also called the QR factorization) of a matrix is a decomposition of a matrix into the product of an orthogonal matrix and a triangular matrix. 
@@ -48,7 +48,7 @@ We'll use a **Gram-Schmidt process** to compute a QR decomposition Because doing so is so educational, we'll write our own Python code to do the job -## Gram-Schmidt process +## Gram-Schmidt process We'll start with a **square** matrix $A$. @@ -58,7 +58,7 @@ We'll deal with a rectangular matrix $A$ later. Actually, our algorithm will work with a rectangular $A$ that is not square. -### Gram-Schmidt process for square $A$ +### Gram-Schmidt process for square $A$ Here we apply a Gram-Schmidt process to the **columns** of matrix $A$. @@ -137,7 +137,7 @@ R = \left[ \begin{matrix} a_1·e_1 & a_2·e_1 & \cdots & a_n·e_1\\ 0 & a_2·e_2 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_n·e_n \end{matrix} \right] $$ -### $A$ not square +### $A$ not square Now suppose that $A$ is an $n \times m$ matrix where $m > n$. @@ -162,7 +162,7 @@ a_{n+1} & = (a_{n+1}\cdot e_1) e_1 + (a_{n+1}\cdot e_2) e_2 + \cdots + (a_{n+1}\ a_m & = (a_m\cdot e_1) e_1 + (a_m\cdot e_2) e_2 + \cdots + (a_m \cdot e_n) e_n \cr \end{align*} -## Some Code +## Some code Now let's write some homemade Python code to implement a QR decomposition by deploying the Gram-Schmidt process described above. @@ -296,7 +296,7 @@ Q_scipy, R_scipy = adjust_sign(*qr(A)) Q_scipy, R_scipy ``` -## Using QR Decomposition to Compute Eigenvalues +## Using QR decomposition to compute eigenvalues Now for a useful fact about the QR algorithm. @@ -367,7 +367,7 @@ Compare with the `scipy` package. sorted(np.linalg.eigvals(A)) ``` -## $QR$ and PCA +## $QR$ and PCA There are interesting connections between the $QR$ decomposition and principal components analysis (PCA). diff --git a/lectures/rand_resp.md b/lectures/rand_resp.md index 1986812f2..c9bddf5a6 100644 --- a/lectures/rand_resp.md +++ b/lectures/rand_resp.md @@ -35,7 +35,7 @@ Related ideas underlie modern **differential privacy** systems.
(See https://en.wikipedia.org/wiki/Differential_privacy) -## Warner's Strategy +## Warner's strategy As usual, let's bring in the Python modules we'll be using. @@ -148,7 +148,7 @@ From expressions {eq}`eq:five` and {eq}`eq:seven` we can deduce that: - The MSE of $\hat{\pi}$ decreases as $p$ increases. -## Comparing Two Survey Designs +## Comparing two survey designs Let's compare the preceding randomized-response method with a stylized non-randomized response method. @@ -315,7 +315,7 @@ df3_mc Evidently, as $n$ increases, the randomized response method does better performance in more situations. -## Concluding Remarks +## Concluding remarks {doc}`This QuantEcon lecture ` describes some alternative randomized response surveys. diff --git a/lectures/rational_expectations.md b/lectures/rational_expectations.md index 5c57bb6e2..d9fb1b9f9 100644 --- a/lectures/rational_expectations.md +++ b/lectures/rational_expectations.md @@ -78,7 +78,7 @@ We'll also use the LQ class from `QuantEcon.py`. from quantecon import LQ ``` -### The Big Y, little y Trick +### The Big Y, little y trick This widely used method applies in contexts in which a **representative firm** or agent is a "price taker" operating within a competitive equilibrium. @@ -108,7 +108,7 @@ Please watch for how this strategy is applied as the lecture unfolds. We begin by applying the Big $Y$, little $y$ trick in a very simple static context. -#### A Simple Static Example of the Big Y, little y Trick +#### A simple static example of the Big Y, little y trick Consider a static model in which a unit measure of firms produce a homogeneous good that is sold in a competitive market. @@ -176,7 +176,7 @@ to be solved for the competitive equilibrium market-wide output $Y$. After solving for $Y$, we can compute the competitive equilibrium price $p$ from the inverse demand curve {eq}`ree_comp3d_static`.
-### Related Planning Problem +### Related planning problem Define **consumer surplus** as the area under the inverse demand curve: @@ -200,7 +200,7 @@ Thus, a $Y$ that solves {eq}`staticY` is a competitive equilibrium output as wel This type of outcome provides an intellectual justification for liking a competitive equilibrium. -### Further Reading +### Further reading References for this lecture include @@ -208,7 +208,7 @@ References for this lecture include * {cite}`Sargent1987`, chapter XIV * {cite}`Ljungqvist2012`, chapter 7 -## Rational Expectations Equilibrium +## Rational expectations equilibrium ```{index} single: Rational Expectations Equilibrium; Definition ``` @@ -229,7 +229,7 @@ law of motion generated by production choices induced by this belief. We formulate a rational expectations equilibrium in terms of a fixed point of an operator that maps beliefs into optimal beliefs. (ree_ce)= -### Competitive Equilibrium with Adjustment Costs +### Competitive equilibrium with adjustment costs ```{index} single: Rational Expectations Equilibrium; Competitive Equilbrium (w. Adjustment Costs) ``` @@ -252,7 +252,7 @@ where * $Y_t = \int_0^1 y_t(\omega) d \omega = y_t$ is the market-wide level of output (ree_fp)= -#### The Firm's Problem +#### The firm's problem Each firm is a price taker. @@ -288,7 +288,7 @@ This includes ones that the firm cares about but does not control like $p_t$. We turn to this problem now. -#### Prices and Aggregate Output +#### Prices and aggregate output In view of {eq}`ree_comp3d`, the firm's incentive to forecast the market price translates into an incentive to forecast aggregate output $Y_t$. @@ -298,7 +298,7 @@ The output $y_t(\omega)$ of a single firm $\omega$ has a negligible effect on ag That justifies firms in regarding their forecasts of aggregate output as being unaffected by their own output decisions. 
-#### Representative Firm's Beliefs +#### Representative firm's beliefs We suppose the firm believes that market-wide output $Y_t$ follows the law of motion @@ -312,7 +312,7 @@ where $Y_0$ is a known initial condition. The *belief function* $H$ is an equilibrium object, and hence remains to be determined. -#### Optimal Behavior Given Beliefs +#### Optimal behavior given beliefs For now, let's fix a particular belief $H$ in {eq}`ree_hlom` and investigate the firm's response to it. @@ -345,7 +345,7 @@ h(y, Y) := \textrm{argmax}_{y'} Evidently $v$ and $h$ both depend on $H$. -#### Characterization with First-Order Necessary Conditions +#### Characterization with first-order necessary conditions In what follows it will be helpful to have a second characterization of $h$, based on first-order conditions. @@ -385,7 +385,7 @@ A representative firm's decision rule solves the difference equation {eq}`ree_c Note that solving the Bellman equation {eq}`comp4` for $v$ and then $h$ in {eq}`ree_opbe` yields a decision rule that automatically imposes both the Euler equation {eq}`ree_comp7` and the transversality condition. -#### The Actual Law of Motion for Output +#### The actual law of motion for output As we've seen, a given belief translates into a particular decision rule $h$. @@ -400,7 +400,7 @@ Y_{t+1} = h(Y_t, Y_t) Thus, when firms believe that the law of motion for market-wide output is {eq}`ree_hlom`, their optimizing behavior makes the actual law of motion be {eq}`ree_comp9a`. 
(ree_def)= -### Definition of Rational Expectations Equilibrium +### Definition of rational expectations equilibrium A *rational expectations equilibrium* or *recursive competitive equilibrium* of the model with adjustment costs is a decision rule $h$ and an aggregate law of motion $H$ such that @@ -410,7 +410,7 @@ A *rational expectations equilibrium* or *recursive competitive equilibrium* of Thus, a rational expectations equilibrium equates the perceived and actual laws of motion {eq}`ree_hlom` and {eq}`ree_comp9a`. -#### Fixed Point Characterization +#### Fixed point characterization As we've seen, the firm's optimum problem induces a mapping $\Phi$ from a perceived law of motion $H$ for market-wide output to an actual law of motion $\Phi(H)$. @@ -418,14 +418,14 @@ The mapping $\Phi$ is the composition of two mappings, the first of which maps a The $H$ component of a rational expectations equilibrium is a fixed point of $\Phi$. -## Computing an Equilibrium +## Computing an equilibrium ```{index} single: Rational Expectations Equilibrium; Computation ``` Now let's compute a rational expectations equilibrium. -### Failure of Contractivity +### Failure of contractivity Readers accustomed to dynamic programming arguments might try to address this problem by choosing some guess $H_0$ for the aggregate law of motion and then iterating with $\Phi$. @@ -445,7 +445,7 @@ Lucas and Prescott {cite}`LucasPrescott1971` used this method to construct a rat Some details follow. (ree_pp)= -### A Planning Problem Approach +### A planning problem approach ```{index} single: Rational Expectations Equilibrium; Planning Problem Approach ``` @@ -478,7 +478,7 @@ $$ subject to an initial condition for $Y_0$. -### Solution of Planning Problem +### Solution of planning problem Evaluating the integral in {eq}`comp10` yields the quadratic form $a_0 Y_t - a_1 Y_t^2 / 2$. 
@@ -515,7 +515,7 @@ equation \beta a_0 + \gamma Y_t - [\beta a_1 + \gamma (1+ \beta)]Y_{t+1} + \gamma \beta Y_{t+2} =0 ``` -### Key Insight +### Key insight Return to equation {eq}`ree_comp7` and set $y_t = Y_t$ for all $t$. @@ -534,7 +534,7 @@ It follows that for this example we can compute equilibrium quantities by formin The optimal policy function for the planning problem is the aggregate law of motion $H$ that the representative firm faces within a rational expectations equilibrium. -#### Structure of the Law of Motion +#### Structure of the law of motion As you are asked to show in the exercises, the fact that the planner's problem is an LQ control problem implies an optimal policy --- and hence aggregate law diff --git a/lectures/re_with_feedback.md b/lectures/re_with_feedback.md index 48a0aae94..e8fb86bbc 100644 --- a/lectures/re_with_feedback.md +++ b/lectures/re_with_feedback.md @@ -76,7 +76,7 @@ as an **expectational difference equation** whose solution is a rational expecta We'll start this lecture with a quick review of deterministic (i.e., non-random) first-order and second-order linear difference equations. -## Linear Difference Equations +## Linear difference equations We'll use the *backward shift* or *lag* operator $L$. @@ -92,7 +92,7 @@ We'll often use the equality $L^{-1} x_t \equiv x_{t+1}$ below. The algebra of lag and forward shift operators can simplify representing and solving linear difference equations. -### First Order +### First order We want to solve a linear first-order scalar difference equation. @@ -179,7 +179,7 @@ diverge, in which case a solution of this form does not exist. The distributed lead in $u$ in {eq}`equn_5` need not converge when $|\lambda| < 1$. -### Second Order +### Second order Now consider the second order difference equation @@ -218,7 +218,7 @@ Equation {eq}`equn_7` has a form that we shall encounter often. 
* $\lambda_1 y_t$ is called the **feedback part** * $-{\frac{\lambda_2^{-1}}{1 - \lambda_2^{-1}L^{-1}}} u_{t+1}$ is called the **feedforward part** -## Illustration: Cagan's Model +## Illustration: Cagan's model Now let's use linear difference equations to represent and solve Sargent's {cite}`Sargent77hyper` rational expectations version of Cagan’s model {cite}`Cagan` that connects the price level to the public's anticipations of future money supplies. @@ -351,7 +351,7 @@ sequence $c \lambda^{-t}$ where $c$ is an arbitrary positive constant. ``` -## Some Python Code +## Some Python code We’ll construct examples that illustrate {eq}`equation_3`. @@ -464,7 +464,7 @@ Because - it happens that in this example future $m$’s are always less than the current $m$ -## Alternative Code +## Alternative code We could also have run the simulation using the quantecon **LinearStateSpace** code. @@ -498,7 +498,7 @@ plt.legend() plt.show() ``` -### Special Case +### Special case To simplify our presentation in ways that will let focus on an important idea, in the above second-order difference equation {eq}`equation_6` that governs @@ -534,7 +534,7 @@ $$ Please keep these formulas in mind as we investigate an alternative route to and interpretation of our formula for $F$. -## Another Perspective +## Another perspective Above, we imposed stability or non-explosiveness on the solution of the key difference equation {eq}`equation_1` in Cagan's model by solving the unstable root of the characteristic polynomial forward. @@ -685,7 +685,7 @@ p_0 = - (Q^{22})^{-1} Q^{21} m_0. This is the unique **stabilizing value** of $p_0$ expressed as a function of $m_0$. -### Refining the Formula +### Refining the formula We can get an even more convenient formula for $p_0$ that is cast in terms of components of $Q$ instead of components of @@ -757,7 +757,7 @@ $$ Q_1 = \begin{bmatrix} Q_{11} \\ Q_{21} \end{bmatrix}.
$$ -### Remarks about Feedback +### Remarks about feedback We have expressed {eq}`equation_8` in what superficially appears to be a form in which $y_{t+1}$ feeds back on $y_t$, even though what we @@ -778,7 +778,7 @@ We’ll keep these observations in mind as we turn now to a case in which the log money supply actually does feed back on the log of the price level. -## Log money Supply Feeds Back on Log Price Level +## Log money supply feeds back on log price level An arrangement of eigenvalues that split around unity, with one being below unity and another being greater than unity, sometimes prevails when there is *feedback* from the log price level to the log @@ -964,7 +964,7 @@ exist. magic_p0(1, δ=0.2) ``` -## Big $P$, Little $p$ Interpretation +## Big $P$, little $p$ interpretation It is helpful to view our solutions of difference equations having feedback from the price level or inflation to money or the rate of money creation in terms of the Big $K$, little $k$ idea discussed in {doc}`Rational Expectations Models `. @@ -1064,7 +1064,7 @@ Compare $F^*$ with $F_1 + F_2 F^*$ F_check[0] + F_check[1] * F_star, F_star ``` -## Fun with SymPy +## Fun with SymPy This section is a gift for readers who have made it this far. diff --git a/lectures/samuelson.md b/lectures/samuelson.md index a8b056724..edcf728e5 100644 --- a/lectures/samuelson.md +++ b/lectures/samuelson.md @@ -63,7 +63,7 @@ from sympy import Symbol, init_printing from cmath import sqrt ``` -### Samuelson's Model +### Samuelson's model Samuelson used a *second-order linear difference equation* to represent a model of national output based on three components: @@ -201,7 +201,7 @@ no random shocks hit aggregate demand --- has only transient fluctuations. We can convert the model to one that has persistent irregular fluctuations by adding a random shock to aggregate demand.
-### Stochastic Version of the Model +### Stochastic version of the model We create a **random** or **stochastic** version of the model by adding a random process of **shocks** or **disturbances** @@ -215,7 +215,7 @@ equation**: Y_t = G_t + a (1-b) Y_{t-1} - a b Y_{t-2} + \sigma \epsilon_{t} ``` -### Mathematical Analysis of the Model +### Mathematical analysis of the model To get started, let's set $G_t \equiv 0$, $\sigma = 0$, and $\gamma = 0$. @@ -354,7 +354,7 @@ absolute values strictly less than one, the absolute value of the larger one governs the rate of convergence to the steady state of the non stochastic version of the model. -### Things This Lecture Does +### Things this lecture does We write a function to generate simulations of a $\{Y_t\}$ sequence as a function of time. @@ -495,7 +495,7 @@ difference equation parameter pairs in the Samuelson model are such that: Later we'll present the graph with a red mark showing the particular point implied by the setting of $(a,b)$. -### Function to Describe Implications of Characteristic Polynomial +### Function to describe implications of characteristic polynomial ```{code-cell} python3 def categorize_solution(ρ1, ρ2): @@ -523,7 +523,7 @@ therefore get smooth convergence to a steady state') categorize_solution(1.3, -.4) ``` -### Function for Plotting Paths +### Function for plotting paths A useful function for our work below is @@ -540,7 +540,7 @@ def plot_y(function=None): plt.show() ``` -### Manual or "by hand" Root Calculations +### Manual or "by hand" root calculations The following function calculates roots of the characteristic polynomial using high school algebra. 
@@ -604,7 +604,7 @@ def y_nonstochastic(y_0=100, y_1=80, α=.92, β=.5, γ=10, n=80): plot_y(y_nonstochastic()) ``` -### Reverse-Engineering Parameters to Generate Damped Cycles +### Reverse-engineering parameters to generate damped cycles The next cell writes code that takes as inputs the modulus $r$ and phase $\phi$ of a conjugate pair of complex numbers in polar form @@ -619,8 +619,8 @@ $$ pairs that would generate those roots ```{code-cell} python3 -### code to reverse-engineer a cycle -### y_t = r^t (c_1 cos(ϕ t) + c2 sin(ϕ t)) +### Code to reverse-engineer a cycle +### y_t = r^t (c_1 cos(ϕ t) + c2 sin(ϕ t)) ### def f(r, ϕ): @@ -664,7 +664,7 @@ print(f"ρ1, ρ2 = {ρ1}, {ρ2}") ρ1, ρ2 ``` -### Root Finding Using Numpy +### Root finding using numpy Here we'll use numpy to compute the roots of the characteristic polynomial @@ -731,7 +731,7 @@ def y_nonstochastic(y_0=100, y_1=80, α=.9, β=.8, γ=10, n=80): plot_y(y_nonstochastic()) ``` -### Reverse-Engineered Complex Roots: Example +### Reverse-engineered complex roots: example The next cell studies the implications of reverse-engineered complex roots.
@@ -758,7 +758,7 @@ ytemp = y_nonstochastic(α=a, β=b, y_0=20, y_1=30) plot_y(ytemp) ``` -### Digression: Using Sympy to Find Roots +### Digression: using sympy to find roots We can also use sympy to compute analytic formulas for the roots @@ -781,7 +781,7 @@ r2 = -b sympy.solve(z**2 - r1*z - r2, z) ``` -## Stochastic Shocks +## Stochastic shocks Now we'll construct some code to simulate the stochastic version of the model that emerges when we add a random shock process to aggregate @@ -845,7 +845,7 @@ r = .97 period = 10 # Length of cycle in units of time ϕ = 2 * math.pi/period -### Apply the reverse-engineering function f +### Apply the reverse-engineering function f ρ1, ρ2, a, b = f(r, ϕ) @@ -857,7 +857,7 @@ print(f"a, b = {a}, {b}") plot_y(y_stochastic(y_0=40, y_1 = 42, α=a, β=b, σ=2, n=100)) ``` -## Government Spending +## Government spending This function computes a response to either a permanent or one-off increase in government expenditures @@ -958,7 +958,7 @@ We can also see the response to a one time jump in government expenditures plot_y(y_stochastic_g(g=500, g_t=50, duration='one-off')) ``` -## Wrapping Everything Into a Class +## Wrapping everything into a class Up to now, we have written functions to do the work. @@ -1158,7 +1158,7 @@ class Samuelson(): return fig ``` -### Illustration of Samuelson Class +### Illustration of Samuelson class Now we'll put our Samuelson class to work on an example @@ -1172,7 +1172,7 @@ sam.plot() plt.show() ``` -### Using the Graph +### Using the graph We'll use our graph to show where the roots lie and how their location is consistent with the behavior of the path just graphed. 
@@ -1184,7 +1184,7 @@ sam.param_plot() plt.show() ``` -## Using the LinearStateSpace Class +## Using the linearstatespace class It turns out that we can use the [QuantEcon.py](http://quantecon.org/quantecon-py) [LinearStateSpace](https://github.com/QuantEcon/QuantEcon.py/blob/master/quantecon/lss.py) class to do @@ -1235,7 +1235,7 @@ axes[-1].set_xlabel('Iteration') plt.show() ``` -### Other Methods in the `LinearStateSpace` Class +### Other methods in the `linearstatespace` class Let's plot **impulse response functions** for the instance of the Samuelson model using a method in the `LinearStateSpace` class @@ -1257,7 +1257,7 @@ w, v = np.linalg.eig(A) print(w) ``` -### Inheriting Methods from `LinearStateSpace` +### Inheriting methods from `linearstatespace` We could also create a subclass of `LinearStateSpace` (inheriting all its methods and attributes) to add more functions to use @@ -1394,7 +1394,7 @@ plt.show() samlss.multipliers() ``` -## Pure Multiplier Model +## Pure multiplier model Let's shut down the accelerator by setting $b=0$ to get a pure multiplier model diff --git a/lectures/sir_model.md b/lectures/sir_model.md index 537352e35..86238b51a 100644 --- a/lectures/sir_model.md +++ b/lectures/sir_model.md @@ -66,7 +66,7 @@ from scipy.integrate import odeint This routine calls into compiled code from the FORTRAN library odepack. -## The SIR Model +## The SIR model In the version of the SIR model we will analyze there are four states. @@ -80,7 +80,7 @@ Comments: * Those who have recovered are assumed to have acquired immunity. * Those in the exposed group are not yet infectious. -### Time Path +### Time path The flow across states follows the path $S \to E \to I \to R$. @@ -234,7 +234,7 @@ grid_size = 1000 t_vec = np.linspace(0, t_length, grid_size) ``` -### Experiment 1: Constant R0 Case +### Experiment 1: constant r0 case Let's start with the case where `R0` is constant. 
@@ -282,7 +282,7 @@ Here are cumulative cases, as a fraction of population: plot_paths(c_paths, labels) ``` -### Experiment 2: Changing Mitigation +### Experiment 2: changing mitigation Let's look at a scenario where mitigation (e.g., social distancing) is successively imposed. @@ -345,7 +345,7 @@ Here are cumulative cases, as a fraction of population: plot_paths(c_paths, labels) ``` -## Ending Lockdown +## Ending lockdown The following replicates [additional results](https://drive.google.com/file/d/1uS7n-7zq5gfSgrL3S0HByExmpq4Bn3oh/view) by Andrew Atkeson on the timing of lifting lockdown. diff --git a/lectures/stats_examples.md b/lectures/stats_examples.md index 1d3a397b0..7399f2a63 100644 --- a/lectures/stats_examples.md +++ b/lectures/stats_examples.md @@ -42,7 +42,7 @@ set_matplotlib_formats('retina') ``` -## Some Discrete Probability Distributions +## Some discrete probability distributions Let's write some Python code to compute means and variances of some univariate random variables. @@ -138,7 +138,7 @@ print("The population variance is: ", r*(1-p)/p**2) ``` -## Newcomb–Benford distribution +## Newcomb–Benford distribution The **Newcomb–Benford law** fits many data sets, e.g., reports of incomes to tax authorities, in which the leading digit is more likely to be small than large. @@ -233,7 +233,7 @@ print(μ-μ_hat < 1e-3) print(σ-σ_hat < 1e-3) ``` -## Uniform Distribution +## Uniform distribution $$ \begin{aligned} @@ -270,7 +270,7 @@ print("\nThe population mean is: ", (a+b)/2) print("The population variance is: ", (b-a)**2/12) ``` -## A Mixed Discrete-Continuous Distribution +## A mixed discrete-continuous distribution We'll motivate this example with a little story.
@@ -333,7 +333,7 @@ print("variance: ", var) ``` -## Drawing a Random Number from a Particular Distribution +## Drawing a random number from a particular distribution Suppose we have at our disposal a pseudo random number that draws a uniform random variable, i.e., one with probability distribution diff --git a/lectures/svd_intro.md b/lectures/svd_intro.md index a24e43c7a..42c08f507 100644 --- a/lectures/svd_intro.md +++ b/lectures/svd_intro.md @@ -28,7 +28,7 @@ Like principal components analysis (PCA), DMD can be thought of as a data-reduct In a sequel to this lecture about {doc}`Dynamic Mode Decompositions `, we'll describe how SVD's provide ways rapidly to compute reduced-order approximations to first-order Vector Autoregressions (VARs). -## The Setting +## The setting Let $X$ be an $m \times n$ matrix of rank $p$. @@ -58,7 +58,7 @@ In the $m > > n$ case in which there are many more attributes $m$ than individu We'll again use a **singular value decomposition**, but now to construct a **dynamic mode decomposition** (DMD) -## Singular Value Decomposition +## Singular value decomposition A **singular value decomposition** of an $m \times n$ matrix $X$ of rank $p \leq \min(m,n)$ is @@ -124,7 +124,7 @@ Later we'll also describe an **economy** or **reduced** SVD. Before we study a **reduced** SVD we'll say a little more about properties of a **full** SVD. -## Four Fundamental Subspaces +## Four fundamental subspaces Let ${\mathcal C}$ denote a column space, ${\mathcal N}$ denote a null space, and ${\mathcal R}$ denote a row space. @@ -319,7 +319,7 @@ print("Row space:\n", row_space.T) print("Right null space:\n", null_space.T) ``` -## Eckart-Young Theorem +## Eckart-Young theorem Suppose that we want to construct the best rank $r$ approximation of an $m \times n$ matrix $X$.
@@ -354,7 +354,7 @@ You can read about the Eckart-Young theorem and some of its uses [here](https:// We'll make use of this theorem when we discuss principal components analysis (PCA) and also dynamic mode decomposition (DMD). -## Full and Reduced SVD's +## Full and reduced SVD's Up to now we have described properties of a **full** SVD in which shapes of $U$, $\Sigma$, and $V$ are $\left(m, m\right)$, $\left(m, n\right)$, $\left(n, n\right)$, respectively. @@ -504,7 +504,7 @@ SShat=np.diag(Shat) np.allclose(X, Uhat@SShat@Vhat) ``` -## Polar Decomposition +## Polar decomposition A **reduced** singular value decomposition (SVD) of $X$ is related to a **polar decomposition** of $X$ @@ -532,7 +532,7 @@ and in our reduced SVD * $\Sigma$ is a $p \times p$ diagonal matrix * $V$ is an $n \times p$ orthonormal -## Application: Principal Components Analysis (PCA) +## Application: principal components analysis (PCA) Let's begin with a case in which $n >> m$, so that we have many more individuals $n$ than attributes $m$. @@ -628,7 +628,7 @@ T&= BV \cr $$ -## Relationship of PCA to SVD +## Relationship of PCA to SVD To relate an SVD to a PCA of data set $X$, first construct the SVD of the data matrix $X$: @@ -667,7 +667,7 @@ is a vector of **loadings** of variables $X_i$ on the $k$th principal component, * $\sigma_k $ for each $k=1, \ldots, p$ is the strength of $k$th **principal component**, where strength means contribution to the overall covariance of $X$. -## PCA with Eigenvalues and Eigenvectors +## PCA with eigenvalues and eigenvectors We now use an eigen decomposition of a sample covariance matrix to do PCA. diff --git a/lectures/troubleshooting.md b/lectures/troubleshooting.md index f9f162c5d..fcf2404b3 100644 --- a/lectures/troubleshooting.md +++ b/lectures/troubleshooting.md @@ -26,7 +26,7 @@ kernelspec: This page is for readers experiencing errors when running the code from the lectures.
-## Fixing Your Local Environment +## Fixing your local environment The basic assumption of the lectures is that code in a lecture should execute whenever @@ -62,7 +62,7 @@ Second, you can report an issue, so we can try to fix your local set up. We like getting feedback on the lectures so please don't hesitate to get in touch. -## Reporting an Issue +## Reporting an issue One way to give feedback is to raise an issue through our [issue tracker](https://github.com/QuantEcon/lecture-python/issues). diff --git a/lectures/two_auctions.md b/lectures/two_auctions.md index e06c7bb5c..998051ee2 100644 --- a/lectures/two_auctions.md +++ b/lectures/two_auctions.md @@ -51,7 +51,7 @@ Much of our Python code below is based on his. +++ -## First-Price Sealed-Bid Auction (FPSB) +## First-price sealed-bid auction (FPSB) +++ @@ -94,7 +94,7 @@ To complete the specification of the situation, we'll assume that prospective Bidder optimally chooses to bid less than $v_i$. -### Characterization of FPSB Auction +### Characterization of FPSB auction A FPSB auction has a unique symmetric Bayesian Nash Equilibrium. @@ -116,13 +116,13 @@ A proof for this assertion is available at the [Wikepedia page](https://en.wiki +++ -## Second-Price Sealed-Bid Auction (SPSB) +## Second-price sealed-bid auction (SPSB) +++ **Protocols:** In a second-price sealed-bid (SPSB) auction, the winner pays the second-highest bid. -## Characterization of SPSB Auction +## Characterization of SPSB auction In a SPSB auction bidders optimally choose to bid their values. @@ -133,7 +133,7 @@ A proof is provided at [the Wikepedia +++ -## Uniform Distribution of Private Values +## Uniform distribution of private values +++ @@ -184,13 +184,13 @@ $$ \end{aligned} $$ -## Second Price Sealed Bid Auction +## Second price sealed bid auction In a **SPSB**, it is optimal for bidder $i$ to bid $v_i$. 
+++ -## Python Code +## Python code ```{code-cell} ipython3 import numpy as np @@ -268,7 +268,7 @@ ax.set_ylabel('Bid, $b_i$') sns.despine() ``` -## Revenue Equivalence Theorem +## Revenue equivalence theorem +++ @@ -355,7 +355,7 @@ It follows that an optimal bidding strategy in a FPSB auction is $b(v_{i}) = \ma +++ -## Calculation of Bid Price in FPSB +## Calculation of bid price in FPSB +++ @@ -429,7 +429,7 @@ ax.set_title('Solution for FPSB') sns.despine() ``` -## $\chi^2$ Distribution +## $\chi^2$ distribution Let's try an example in which the distribution of private values is a $\chi^2$ distribution. @@ -518,7 +518,7 @@ ax.set_ylabel('Density') sns.despine() ``` -## 5 Code Summary +## 5 code summary +++ diff --git a/lectures/uncertainty_traps.md b/lectures/uncertainty_traps.md index b93d7582e..08e7f12c9 100644 --- a/lectures/uncertainty_traps.md +++ b/lectures/uncertainty_traps.md @@ -56,7 +56,7 @@ plt.rcParams["figure.figsize"] = (11, 5) #set default figure size import numpy as np ``` -## The Model +## The model The original model described in {cite}`fun` has many interesting moving parts. @@ -100,7 +100,7 @@ The higher is the precision, the more informative $x_m$ is about the fundamental Output shocks are independent across time and firms. -### Information and Beliefs +### Information and beliefs All entrepreneurs start with identical beliefs about $\theta_0$. diff --git a/lectures/util_rand_resp.md b/lectures/util_rand_resp.md index f6ac37f2c..08d7d9e20 100644 --- a/lectures/util_rand_resp.md +++ b/lectures/util_rand_resp.md @@ -34,7 +34,7 @@ proposed, for example, by {cite}`lanke1975choice`, {cite}`lanke1976degree`, {cit -## Privacy Measures +## Privacy measures We consider randomized response models with only two possible answers, "yes" and "no." 
@@ -55,11 +55,11 @@ $$
$$ (eq:util-rand-one)
-## Zoo of Concepts
+## Zoo of concepts
At this point we describe some concepts proposed by various researchers
-### Leysieffer and Warner(1976)
+### Leysieffer and Warner (1976)
The response $r$ is regarded as jeopardizing with respect to $A$ or $A^{'}$ if
@@ -173,9 +173,9 @@ $$ (eq:util-rand-eight-b)
This measure is just the first term in {eq}`eq:util-rand-seven-a`, i.e., the probability that an individual answers "yes" and is perceived to belong to $A$.
-## Respondent's Expected Utility
+## Respondent's expected utility
-### Truth Border
+### Truth border
Key assumptions that underlie a randomized response technique for estimating the fraction of a population that belongs to $A$ are:
@@ -263,7 +263,7 @@ The source of the positive relationship is:
- Suppose now that $\text{Pr}(A|\text{yes})$ increases. That reduces the utility of telling the truth. To preserve indifference between a truthful answer and a lie, $\text{Pr}(A|\text{no})$ must increase to reduce the utility of lying.
-### Drawing a Truth Border
+### Drawing a truth border
We can deduce two things about the truth border:
@@ -335,9 +335,9 @@ plt.title('Figure 1.2')
plt.show()
```
-## Utilitarian View of Survey Design
+## Utilitarian view of survey design
-### Iso-variance Curves
+### Iso-variance curves
A statistician's objective is
@@ -372,7 +372,7 @@ From expression {eq}`eq:util-rand-thirteen`, {eq}`eq:util-rand-fourteen-a` and {
- Iso-variance curves are always upward-sloping and concave.
-### Drawing Iso-variance Curves
+### Drawing iso-variance curves
We use Python code to draw iso-variance curves.
@@ -440,7 +440,7 @@ var = Iso_Variance(pi=0.3, n=100)
var.plotting_iso_variance_curve()
```
-### Optimal Survey
+### Optimal survey
A point on an iso-variance curves can be attained with the unrelated question design.
@@ -470,13 +470,13 @@ Here are some comments about the model design:
- A more general design problem would be to minimize some weighted sum of the estimator's variance and bias. It would be optimal to accept some lies from the most "reluctant" respondents.
-## Criticisms of Proposed Privacy Measures
+## Criticisms of proposed privacy measures
We can use a utilitarian approach to analyze some privacy measures.
We'll enlist Python Code to help us.
-### Analysis of Method of Lanke's (1976)
+### Analysis of method of Lanke's (1976)
Lanke (1976) recommends a privacy protection criterion that minimizes:
@@ -543,7 +543,7 @@ $$
This is not an optimal choice under a utilitarian approach.
-### Analysis on the Method of Chaudhuri and Mukerjee's (1988)
+### Analysis on the method of Chaudhuri and Mukerjee's (1988)
{cite}`Chadhuri_Mukerjee_88`
@@ -670,7 +670,7 @@ If the individuals are willing to volunteer this information, it seems that the
It ignores the fact that respondents retain the option of lying until they have seen the question to be answered.
-## Concluding Remarks
+## Concluding remarks
The justifications for a randomized response procedure are that
diff --git a/lectures/var_dmd.md b/lectures/var_dmd.md
index 8fa8d190f..2fbd5f315 100644
--- a/lectures/var_dmd.md
+++ b/lectures/var_dmd.md
@@ -20,7 +20,7 @@ This lecture applies computational methods that we learned about in this lectur
* dynamic mode decompositions (DMDs)
* connections between DMDs and first-order VARs
-## First-Order Vector Autoregressions
+## First-order vector autoregressions
We want to fit a **first-order vector autoregression**
@@ -258,7 +258,7 @@ $$ (eq:AhatSVDformula)
-## Dynamic Mode Decomposition (DMD)
+## Dynamic mode decomposition (DMD)
@@ -638,7 +638,7 @@ This concludes the proof.
Also see {cite}`DDSE_book` (p.
238) -### Decoder of $\check b$ as a linear projection +### Decoder of $\check b$ as a linear projection @@ -716,7 +716,7 @@ Rearranging the orthogonality conditions {eq}`eq:orthls` gives $X^\top \Phi = -### An Approximation +### An approximation @@ -817,7 +817,7 @@ We can then use a decoded $\check X_{t+j}$ or $\hat X_{t+j}$ to forecast $X_{t+ -### Using Fewer Modes +### Using fewer modes In applications, we'll actually use only a few modes, often three or less. @@ -832,7 +832,7 @@ Counterparts of all of the salient formulas above then apply. -## Source for Some Python Code +## Source for some Python code You can find a Python implementation of DMD here: diff --git a/lectures/von_neumann_model.md b/lectures/von_neumann_model.md index ac0c5452b..70d7eefc7 100644 --- a/lectures/von_neumann_model.md +++ b/lectures/von_neumann_model.md @@ -358,7 +358,7 @@ $a_{\cdot j}$ and $a_{i\cdot}$ denote the $j$ th column and $i$ th row of $A$, respectively. -## Model Ingredients and Assumptions +## Model ingredients and assumptions A pair $(A,B)$ of $m\times n$ non-negative matrices defines an economy. @@ -461,7 +461,7 @@ n2 = Neumann(A2, B2) n2 ``` -## Dynamic Interpretation +## Dynamic interpretation Attach a time index $t$ to the preceding objects, regard an economy as a dynamic system, and study sequences @@ -498,7 +498,7 @@ yesterday. Accordingly, $Ap_t$ tells the costs of production in period $t$ and $Bp_t$ tells revenues in period $t+1$. -### Balanced Growth +### Balanced growth We follow John von Neumann in studying “balanced growth”. @@ -662,7 +662,7 @@ They show that this extra condition does not affect the existence result, while it significantly reduces the number of (relevant) solutions. 
-## Interpretation as Two-player Zero-sum Game
+## Interpretation as two-player zero-sum game
To compute the equilibrium $(\gamma^{*}, x_0, p_0)$, we follow the
algorithm proposed by Hamburger, Thompson and Weil (1967), building on
@@ -711,7 +711,7 @@ $$
V(C) = \max_x \min_p \hspace{2mm} x^T C p = \min_p \max_x \hspace{2mm} x^T C p = (x^*)^T C p^*
$$
-### Connection with Linear Programming (LP)
+### Connection with linear programming (LP)
Nash equilibria of a finite two-player zero-sum game solve a linear programming problem.
@@ -956,7 +956,7 @@ case of an irreducible $(A,B)$ (like in Example 1), the maximal and minimal
roots of $V(M(\gamma))$ necessarily coincide implying a ‘‘full duality’’ result, i.e. $\alpha_0 = \beta_0 = \gamma^*$ so that the expansion (and interest) rate $\gamma^*$ is unique.
-### Uniqueness and Irreducibility
+### Uniqueness and irreducibility
As an illustration, compute first the maximal and minimal roots of $V(M(\cdot))$ for our Example 2 that has a reducible
@@ -998,7 +998,7 @@ $(\gamma^*, x_0, p_0)$.
**Theorem II:** Adopt the conditions of Theorem 1. If the economy $(A,B)$ is irreducible, then $\gamma^*=\alpha_0=\beta_0$.
-### A Special Case
+### A special case
There is a special $(A,B)$ that allows us to simplify the solution method significantly by invoking the powerful Perron-Frobenius theorem
diff --git a/lectures/wald_friedman.md b/lectures/wald_friedman.md
index 802f7908d..f92d28f5f 100644
--- a/lectures/wald_friedman.md
+++ b/lectures/wald_friedman.md
@@ -77,7 +77,7 @@ import pandas as pd
This lecture uses ideas studied in {doc}`the lecture on likelihood ratio processes` and {doc}`the lecture on Bayesian learning`.
-## Source of the Problem
+## Source of the problem
On pages 137-139 of his 1998 book *Two Lucky People* with Rose Friedman {cite}`Friedman98`, Milton Friedman described a problem presented to him and Allen Wallis
@@ -123,7 +123,7 @@ Realizing that, they told Abraham Wald about the problem.
That set Wald on a path that led him to create *Sequential Analysis* {cite}`Wald47`.
-## Neyman-Pearson Formulation
+## Neyman-Pearson formulation
It is useful to begin by describing the theory underlying the test
that the U.S. Navy told Captain G. S. Schuyler to use.
@@ -275,7 +275,7 @@ Here is how Wald introduces the notion of a sequential test
> a random variable, since the value of $n$ depends on the outcome of the
> observations.
-## Wald's Sequential Formulation
+## Wald's sequential formulation
By way of contrast to Neyman and Pearson's formulation of the problem, in Wald's formulation
@@ -341,7 +341,7 @@ Consequently, the observer has something to learn, namely, whether the observati
The decision maker wants to decide which of the two distributions is generating outcomes.
-### Type I and Type II Errors
+### Type I and type II errors
If we regard $f=f_0$ as a null hypothesis and $f=f_1$ as an alternative hypothesis, then
@@ -392,7 +392,7 @@ The following figure illustrates aspects of Wald's procedure.
```
-## Links Between $A,B$ and $\alpha, \beta$
+## Links between $A,B$ and $\alpha, \beta$
In chapter 3 of **Sequential Analysis** {cite}`Wald47` Wald establishes the inequalities
@@ -1072,7 +1072,7 @@ This increases the probability of Type II errors.
The table confirms this intuition: as $A$ decreases and $B$ increases from their optimal Wald values, both Type I and Type II error rates increase, while the mean stopping time decreases.
-## Related Lectures +## Related lectures We'll dig deeper into some of the ideas used here in the following earlier and later lectures: diff --git a/lectures/wald_friedman_2.md b/lectures/wald_friedman_2.md index 6d5142dc6..eaf41626b 100644 --- a/lectures/wald_friedman_2.md +++ b/lectures/wald_friedman_2.md @@ -97,7 +97,7 @@ from numba.experimental import jitclass from math import gamma ``` -## A Dynamic Programming Approach +## A dynamic programming approach The following presentation of the problem closely follows Dmitri Bertsekas's treatment in **Dynamic Programming and Stochastic Control** {cite}`Bertsekas75`. @@ -202,7 +202,7 @@ plt.tight_layout() plt.show() ``` -### Losses and Costs +### Losses and costs After observing $z_k, z_{k-1}, \ldots, z_0$, the decision-maker chooses among three distinct actions: @@ -222,7 +222,7 @@ kinds of losses: - A cost $c$ if he postpones deciding and chooses instead to draw another $z$ -### Digression on Type I and Type II Errors +### Digression on type I and type II errors If we regard $f=f_0$ as a null hypothesis and $f=f_1$ as an alternative hypothesis, then $L_1$ and $L_0$ are losses associated with two types of statistical errors @@ -262,7 +262,7 @@ Our problem is to determine threshold values $A, B$ that somehow depend on the p You might like to pause at this point and try to predict the impact of a parameter such as $c$ or $L_0$ on $A$ or $B$. -### A Bellman Equation +### A Bellman equation Let $J(\pi)$ be the total loss for a decision-maker with current belief $\pi$ who chooses optimally. @@ -537,7 +537,7 @@ ax.legend() plt.show() ``` -### Cost Function +### Cost function To solve the model, we will call our `solve_model` function @@ -725,7 +725,7 @@ def simulation_plot(wf): simulation_plot(wf) ``` -### Comparative Statics +### Comparative statics Now let's consider the following exercise. 
diff --git a/lectures/wealth_dynamics.md b/lectures/wealth_dynamics.md index bdd5d7d8f..d615ae675 100644 --- a/lectures/wealth_dynamics.md +++ b/lectures/wealth_dynamics.md @@ -60,7 +60,7 @@ It also gives us a way to quantify such concentration, in terms of the tail inde One question of interest is whether or not we can replicate Pareto tails from a relatively simple model. -### A Note on Assumptions +### A note on assumptions The evolution of wealth for any given household depends on their savings behavior. @@ -84,12 +84,12 @@ from numba import jit, float64, prange from numba.experimental import jitclass ``` -## Lorenz Curves and the Gini Coefficient +## Lorenz curves and the Gini coefficient Before we investigate wealth dynamics, we briefly review some measures of inequality. -### Lorenz Curves +### Lorenz curves One popular graphical measure of inequality is the [Lorenz curve](https://en.wikipedia.org/wiki/Lorenz_curve). @@ -152,7 +152,7 @@ You can see that, as the tail parameter of the Pareto distribution increases, in This is to be expected, because a higher tail index implies less weight in the tail of the Pareto distribution. -### The Gini Coefficient +### The Gini coefficient The definition and interpretation of the Gini coefficient can be found on the corresponding [Wikipedia page](https://en.wikipedia.org/wiki/Gini_coefficient). @@ -192,7 +192,7 @@ plt.show() The simulation shows that the fit is good. -## A Model of Wealth Dynamics +## A model of wealth dynamics Having discussed inequality measures, let us now turn to wealth dynamics. @@ -417,7 +417,7 @@ aggregate state is known. Let's try simulating the model at different parameter values and investigate the implications for the wealth distribution. -### Time Series +### Time series Let's look at the wealth dynamics of an individual household. @@ -437,7 +437,7 @@ Notice the large spikes in wealth over time. 
Such spikes are similar to what we observed in time series when {doc}`we studied Kesten processes `. -### Inequality Measures +### Inequality measures Let's look at how inequality varies with returns on financial assets. From 878e3d99d0fb6493c25800d889621a2a2e4af6a0 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 7 Aug 2025 04:41:16 +0000 Subject: [PATCH 3/5] Fix incorrectly capitalized Python comments in code cells Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com> --- lectures/back_prop.md | 2 +- lectures/hoist_failure.md | 6 +++--- lectures/samuelson.md | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/lectures/back_prop.md b/lectures/back_prop.md index 935e87095..abbe7d8c0 100644 --- a/lectures/back_prop.md +++ b/lectures/back_prop.md @@ -24,7 +24,7 @@ kernelspec: ```{code-cell} ipython3 import jax -## To check that gpu is activated in environment +## to check that gpu is activated in environment print(f"JAX backend: {jax.devices()[0].platform}") ``` diff --git a/lectures/hoist_failure.md b/lectures/hoist_failure.md index f8dd2f32f..b474bb3b8 100644 --- a/lectures/hoist_failure.md +++ b/lectures/hoist_failure.md @@ -216,7 +216,7 @@ We'll start by generating samples of size 25000 of three independent log normal Then we'll plot histograms and compare them with convolutions of appropriate discretized log normal distributions. ```{code-cell} python3 -## Create sums of two and three log normal random variates ssum2 = s1 + s2 and ssum3 = s1 + s2 + s3 +## create sums of two and three log normal random variates ssum2 = s1 + s2 and ssum3 = s1 + s2 + s3 mu1, sigma1 = 5., 1. # mean and standard deviation @@ -292,10 +292,10 @@ m = .1 # increment size ```{code-cell} python3 ## Cell to check -- note what happens when don't normalize! -## Things match up without adjustment. compare with above +## things match up without adjustment. 
Compare with above
p1,p1_norm,x = pdf_seq(mu1,sigma1,I,m)
-## Compute number of points to evaluate the probability mass function
+## compute number of points to evaluate the probability mass function
NT = x.size
plt.figure(figsize = (8,8))
diff --git a/lectures/samuelson.md b/lectures/samuelson.md
index edcf728e5..1b6f31b90 100644
--- a/lectures/samuelson.md
+++ b/lectures/samuelson.md
@@ -619,8 +619,8 @@ $$
pairs that would generate those roots
```{code-cell} python3
-### Code to reverse-engineer a cycle
-### Y_t = r^t (c_1 cos(ϕ t) + c2 sin(ϕ t))
+### code to reverse-engineer a cycle
+### Y_t = r^t (c_1 cos(ϕ t) + c2 sin(ϕ t))
###
def f(r, ϕ):
From 207081e95c6b922f6ee84eff164246a78496fff2 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 7 Aug 2025 04:44:11 +0000
Subject: [PATCH 4/5] Fix capitalization in index role for stochastic matrices
 heading

Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com>
---
 lectures/finite_markov.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lectures/finite_markov.md b/lectures/finite_markov.md
index 53e5aa8d4..089c65bda 100644
--- a/lectures/finite_markov.md
+++ b/lectures/finite_markov.md
@@ -64,7 +64,7 @@ from mpl_toolkits.mplot3d import Axes3D
The following concepts are fundamental.
(finite_dp_stoch_mat)= -### {index}`stochastic matrices ` +### {index}`Stochastic matrices ` ```{index} single: Finite Markov Chains; Stochastic Matrices ``` From 8638d68b9a41b1d7ec63535a75e7ec2c5bb657f8 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 7 Aug 2025 05:54:26 +0000 Subject: [PATCH 5/5] Fix capitalization in all {index} roles within headings according to style guide Co-authored-by: mmcky <8263752+mmcky@users.noreply.github.com> --- lectures/finite_markov.md | 4 ++-- lectures/linear_algebra.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/lectures/finite_markov.md b/lectures/finite_markov.md index 089c65bda..897cf766f 100644 --- a/lectures/finite_markov.md +++ b/lectures/finite_markov.md @@ -64,7 +64,7 @@ from mpl_toolkits.mplot3d import Axes3D The following concepts are fundamental. (finite_dp_stoch_mat)= -### {index}`Stochastic matrices ` +### {index}`stochastic matrices ` ```{index} single: Finite Markov Chains; Stochastic Matrices ``` @@ -79,7 +79,7 @@ Each row of $P$ can be regarded as a probability mass function over $n$ possible It is too not difficult to check [^pm] that if $P$ is a stochastic matrix, then so is the $k$-th power $P^k$ for all $k \in \mathbb N$. -### {index}`markov chains ` +### {index}`Markov chains ` ```{index} single: Finite Markov Chains ``` diff --git a/lectures/linear_algebra.md b/lectures/linear_algebra.md index 2326b42c8..4d9b7b3db 100644 --- a/lectures/linear_algebra.md +++ b/lectures/linear_algebra.md @@ -1073,7 +1073,7 @@ the left-hand side is a *matrix norm* --- in this case, the so-called For example, for a square matrix $S$, the condition $\| S \| < 1$ means that $S$ is *contractive*, in the sense that it pulls all vectors towards the origin [^cfn]. (la_neumann)= -#### {index}`neumann's theorem ` +#### {index}`Neumann's theorem ` ```{index} single: Linear Algebra; Neumann's Theorem ```