feat: Expand discussion of constraints and modifiers in HistFactory Intro (#55)

* Add further discussion of constraint terms and modifiers.
* Add links to other tools that use HistFactory.
* Add axis titles to plots.
matthewfeickert authored Jul 26, 2023
1 parent b5fe6d0 commit aa3d13b
Showing 1 changed file with 65 additions and 48 deletions.
113 changes: 65 additions & 48 deletions book/IntroToHiFa.ipynb
@@ -48,9 +48,9 @@
"Like HistFactory, `pyhf` does not work with unbinned analyses. These will not be covered in the tutorial.\n",
"\n",
"So what uses HistFactory?\n",
"* TRexFitter\n",
"* WSMaker\n",
"* HistFitter\n",
"* [TRexFitter](https://trexfitter-docs.web.cern.ch/trexfitter-docs/) (ATLAS internal)\n",
"* [WSMaker](https://gitlab.cern.ch/atlas-physics/higgs/hbb/WSMaker) (ATLAS internal)\n",
"* [HistFitter](https://github.com/histfitter/histfitter)\n",
"\n",
"Most everyone in SUSY and Exotics who performs an asymptotic fit as part of their analysis is likely using HistFactory!"
]
@@ -66,9 +66,9 @@
"What is a histogram? Fundamentally, a histogram is a tool to bookkeep arrays of numbers:\n",
"* binning\n",
"* counts\n",
"* errors\n",
"* uncertainties\n",
"\n",
"Beyond that, it contains lots of other magic ingredients to make them more user-friendly for common operations (addition, division, etc...)."
"Beyond that, it contains helpful ingredients to make them more user-friendly for common operations (addition, division, etc...)."
]
},
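As a minimal sketch of this bookkeeping view (an illustration added alongside the diff with hypothetical names, not code from the notebook):

```python
import numpy as np

# A histogram as three parallel arrays: bin edges, per-bin counts,
# and per-bin uncertainties (sqrt(N) is a common Poisson default).
bin_edges = np.array([0.0, 1.0, 2.0, 3.0])  # 3 bins
counts = np.array([3.0, 4.0, 4.0])
uncertainties = np.sqrt(counts)

# Common operations are then just array arithmetic, e.g. adding
# two histograms with identical binning:
other_counts = np.array([1.0, 2.0, 1.0])
summed_counts = counts + other_counts
```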
{
@@ -88,13 +88,13 @@
"\n",
"<img src=\"https://raw.githubusercontent.com/scikit-hep/pyhf/main/docs/_static/img/README_1bin_example.png\" alt=\"common operation - parameter scan\" width=400 />\n",
"\n",
"Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this incredibly simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
"Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
"\n",
"$$\n",
"p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b)\n",
"$$\n",
"\n",
"Why Poisson? This is a counting experiment after all. A region we want to model will then just be a series of Poissons."
"A Poisson model is used as we are performing a counting experiment (counting the number of random events with an expected rate) in each bin of the observable."
]
},
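To make the joint probability concrete, here is a short added sketch (the `expected_yields` values are assumed for illustration; the observed counts `[3, 4, 4]` appear later in the notebook):

```python
from scipy.stats import poisson

expected_yields = [4.0, 5.0, 4.0]  # assumed lambda_b values for illustration
observed = [3, 4, 4]

# p(n | lambda) = product over bins b of Pois(n_b | lambda_b)
joint_probability = poisson.pmf(observed, expected_yields).prod()
print(joint_probability)
```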
{
@@ -127,23 +127,25 @@
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.bar(bins, expected_yields, 1.0, label=r\"expected\", edgecolor=\"blue\", alpha=0.5)\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
"ax.bar(bins, expected_yields, 1.0, label=r\"Expected\", edgecolor=\"blue\", alpha=0.5)\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
"ax.set_ylim(0, 6)\n",
"ax.legend();"
"ax.legend()\n",
"ax.set_xlabel(\"Observable\", fontsize=12)\n",
"ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, we don't always often have just a single (MC) sample, and $\\lambda$ is often the sum of multiple sample yields\n",
"However, we don't always often have just a single expected (simulation) sample, and $\\lambda$ is often the sum of multiple sample yields\n",
"\n",
"$$\n",
"\\lambda = \\sum_{\\mathrm{sample}\\ s} \\lambda_s\n",
"$$\n",
"\n",
"A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. Then the p.d.f. might look like\n",
"A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. The model is then\n",
"\n",
"$$\n",
"p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b) \\qquad \\lambda_b = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}\n",
@@ -169,22 +171,24 @@
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
"ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
"ax.bar(\n",
" bins, signal, 1.0, label=r\"signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
" bins, signal, 1.0, label=r\"Signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
")\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
"ax.set_ylim(0, 6)\n",
"ax.legend();"
"ax.legend()\n",
"ax.set_xlabel(\"Observable\", fontsize=12)\n",
"ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc...\n",
"Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc.\n",
"\n",
"Suppose we wanted a $\\mu_s$ that is a normalization factor scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. expands to be\n",
"Suppose we wanted a a normalization factor $\\mu_s$ scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. expands to be\n",
"\n",
"$$\n",
"p(n|\\lambda(\\mu)) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b(\\theta)) \\qquad \\lambda_b(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}(\\theta)\n",
@@ -211,46 +215,57 @@
" print(f\"observed: {observed}\\n\")\n",
"\n",
" fig, ax = plt.subplots()\n",
" ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(\n",
" bins,\n",
" signal,\n",
" 1.0,\n",
" label=r\"signal\",\n",
" label=r\"Signal\",\n",
" edgecolor=\"blue\",\n",
" bottom=background,\n",
" alpha=0.5,\n",
" )\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
" ax.set_ylim(0, 6)\n",
" ax.legend();"
" ax.legend()\n",
" ax.set_xlabel(\"Observable\", fontsize=12)\n",
" ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One final thing to finish our build up of a simplified p.d.f. is about **auxiliary measurements**. What we mean is that perhaps the background sample is modeled by some normalization parameter, but we've also performed additional measurements in a separate analysis that constraints the parametrization (e.g. Jet Energy Scale) so we have stronger confidence that the true parameter is within a certain range.\n",
"One final thing to finish our build up of a simplified HistFactory model is the concept of **auxiliary measurements**. Perhaps the background sample rate is modified by some normalization parameter, and we've made measurements of this parameter in a separate analysis (e.g. studies of the Jet Energy Scale). These prior experimental studies give a constraint that the parameter lies within a certain range.\n",
"\n",
"For some parameters in a statistical model, all we have to infer its values is the given analysis. These parameters are **unconstrained** ($\\eta$):\n",
"For some parameters in a statistical model we don't have prior experimental evidence for their values and must infer its values is the given analysis. These are **unconstrained** parameters ($\\eta$) and enter into the main model as parameters of the event rate $\\lambda(\\theta)$\n",
"\n",
"$$\n",
"p(n | \\lambda(\\theta))\n",
"p(n | \\lambda(\\theta)).\n",
"$$\n",
"\n",
"For many parameters, we have the **auxiliary data** ($a$) given as an *auxiliary measurement* which is added to the main p.d.f.. These parameters are **constrained** ($\\chi$).\n",
"For many model parameters, their values in the model are constrained by a _constraint term function_, included in the model along with the the main model p.d.f, which describes **auxiliary measurements/data** ($a$) about the model parameter. These are **constrained** parameters ($\\chi$) and enter into the model both in the constraint terms and as parameters of the event rate $\\lambda(\\theta)$\n",
"\n",
"$$\n",
"p_\\chi(a | \\chi)\n",
"$$\n",
"\n",
"where $\\theta = \\{\\eta, \\chi\\}$. This constraining function can generally be anything, but most of the time in HistFactory - it's a Gaussian or a Poisson. The p.d.f. expands to be\n",
"where $\\theta = \\{\\eta, \\chi\\}$. This constraining function model is chosen by the physics it represents, but in HistFactory most constraint terms are modeled as a Normal (Gaussian) or Poisson.\n",
"\n",
"With the constraint terms the model expands to be\n",
"\n",
"$$\n",
"p(n,a|\\lambda(\\theta)) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b(\\theta)) \\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi) \\qquad \\lambda_b(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}(\\theta)\n",
"$$\n",
"\n",
"For this simple example, let's consider a Gaussian centered at $\\mu=0$ with $\\sigma=1$ for constraining the normalization on the background where an up-variation ($\\mu_b = +1$) scales by 1.3, and a down-variation ($\\mu_b = -1$) scales by 0.8."
"where the expected event rate $\\lambda_b(\\theta)$ is modified from its nominal value by a **chosen interpolation function** that smoothly interpolates between the up- and down-variations $(\\pm1 \\sigma)$ of the constraint term to provide an event rate modifier for any value of $\\chi$."
]
},
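An added sketch of the expanded product, with a single Normal constraint on the background normalization using the 0.8/1.3 variations from the text (all yields and names assumed):

```python
import numpy as np
from scipy.stats import norm, poisson

signal = np.array([0.5, 1.0, 0.5])      # assumed yields
background = np.array([2.5, 3.0, 3.5])
observed = np.array([3, 4, 4])

def joint_model(mu_s, mu_b):
    # piecewise-linear interpolation between the down (0.8) and
    # up (1.3) variations of the background normalization
    if mu_b >= 0:
        factor = (1.3 - 1.0) * mu_b + 1.0
    else:
        factor = (1.0 - 0.8) * mu_b + 1.0
    expected = mu_s * signal + factor * background
    main = poisson.pmf(observed, expected).prod()
    # constraint term: auxiliary measurement a = 0 for parameter mu_b
    constraint = norm.pdf(0.0, loc=mu_b, scale=1.0)
    return main * constraint

print(joint_model(1.0, 0.0))
```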
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this simple example, let's consider a constraint term of a Normal distribution centered at $\\mu=0$ (\"auxiliary measurement\" $a=0$) with $\\sigma=1$ for constraining the normalization on the background where an up-variation ($\\mu_b = +1$) scales by 1.3, and a down-variation ($\\mu_b = -1$) scales by 0.8."
]
},
{
@@ -259,11 +274,13 @@
"metadata": {},
"outputs": [],
"source": [
"def gaussian_constraint(mu_b=0.0):\n",
" return norm.pdf(mu_b, loc=0.0, scale=1.0)\n",
"def normal_constraint(mu_b=0.0):\n",
" # auxiliary measurement of 0\n",
" # though note that for Normal observation and mean are symmetric under exchange\n",
" return norm.pdf(0.0, loc=mu_b, scale=1.0)\n",
"\n",
"\n",
"# interpolating\n",
"# selected interpolation function\n",
"def interpolate(down, nom, up, alpha):\n",
" if alpha >= 0:\n",
" return (up - nom) * alpha + 1\n",
@@ -281,51 +298,51 @@
" print(f\"signal: {signal}\")\n",
" print(f\"background: {background}\")\n",
" print(f\"observed: {observed}\")\n",
" print(\n",
" f\"likelihood scaled by: {gaussian_constraint(mu_b)/gaussian_constraint(0.0)}\\n\"\n",
" )\n",
" print(f\"likelihood scaled by: {normal_constraint(mu_b)/normal_constraint(0.0)}\\n\")\n",
"\n",
" fig, ax = plt.subplots()\n",
" ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(\n",
" bins,\n",
" signal,\n",
" 1.0,\n",
" label=r\"signal\",\n",
" label=r\"Signal\",\n",
" edgecolor=\"blue\",\n",
" bottom=background,\n",
" alpha=0.5,\n",
" )\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
" ax.set_ylim(0, 6)\n",
" ax.legend();"
" ax.legend()\n",
" ax.set_xlabel(\"Observable\", fontsize=12)\n",
" ax.set_ylabel(\"Count\", fontsize=12);"
]
},
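A quick added check using the `normal_constraint` defined above: pulling $\mu_b$ away from the auxiliary measurement multiplies the likelihood by an increasingly small factor:

```python
for mu_b in (0.0, 0.5, 1.0, 2.0):
    penalty = normal_constraint(mu_b) / normal_constraint(0.0)
    print(f"mu_b = {mu_b}: likelihood scaled by {penalty:.3f}")
```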
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But that's not all! Notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different signal regions and control regions. Therefore, we compute the likelihood as\n",
"However, notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different analysis signal regions and control regions. Therefore, we compute the likelihood as\n",
"\n",
"$$\n",
"p_\\text{main} = p_\\text{channel1} * p_\\text{channel2} * p_\\text{channel3} \\cdots\n",
"$$\n",
"\n",
"So in fact, we can then expand out the likelihood definition further\n",
"We then expand out the likelihood definition further across channels\n",
"\n",
"$$\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}} \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}}\\, \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"$$\n",
"\n",
"As you can see, this is sort of a bookkeeping problem. We have two pieces of this likelihood:\n",
"* our main model, which consists of\n",
"There are now two pieces of the model:\n",
"* the main model, which consists of\n",
" * several channels (regions, histograms, etc), where\n",
" * each channel is a set of simultaneous Poissons measuring the bin count against an expected value, where\n",
" * each channel is a set of Poissons measuring the bin count for an expected value, where\n",
" * the expected value is the sum of various samples, where\n",
" * each samples expected value can be a function of parameters (or modifiers)\n",
"* the constraint model, which consists of\n",
" * several auxiliary measurements, where\n",
" * each measurement comes with auxiliary data\n",
" * constraint terms on model parameters, where\n",
" * each constraint term describes auxiliary measurements\n",
" \n",
"It should be clear by now that this is quite a lot of pieces to keep track of. This is where HistFactory comes in to play. Using HistFactory, we can\n",
"* describe observed event rates and expected event rates\n",
@@ -433,7 +450,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -447,7 +464,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.7"
"version": "3.10.4"
}
},
"nbformat": 4,
