From 714ef91ef7a681cf02554570ebcb97116e18dae0 Mon Sep 17 00:00:00 2001
From: Matthew Feickert
Date: Tue, 25 Jul 2023 23:36:48 -0400
Subject: [PATCH 1/3] Add revisions to histograms

---
 book/IntroToHiFa.ipynb | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/book/IntroToHiFa.ipynb b/book/IntroToHiFa.ipynb
index e25cfd1..d5955ef 100644
--- a/book/IntroToHiFa.ipynb
+++ b/book/IntroToHiFa.ipynb
@@ -48,9 +48,9 @@
     "Like HistFactory, `pyhf` does not work with unbinned analyses. These will not be covered in the tutorial.\n",
     "\n",
     "So what uses HistFactory?\n",
-    "* TRexFitter\n",
-    "* WSMaker\n",
-    "* HistFitter\n",
+    "* [TRexFitter](https://trexfitter-docs.web.cern.ch/trexfitter-docs/) (ATLAS internal)\n",
+    "* [WSMaker](https://gitlab.cern.ch/atlas-physics/higgs/hbb/WSMaker) (ATLAS internal)\n",
+    "* [HistFitter](https://github.com/histfitter/histfitter)\n",
     "\n",
     "Most everyone in SUSY and Exotics who performs an asymptotic fit as part of their analysis is likely using HistFactory!"
    ]
   },
@@ -66,9 +66,9 @@
     "What is a histogram? Fundamentally, a histogram is a tool to bookkeep arrays of numbers:\n",
     "* binning\n",
     "* counts\n",
-    "* errors\n",
+    "* uncertainties\n",
     "\n",
-    "Beyond that, it contains lots of other magic ingredients to make them more user-friendly for common operations (addition, division, etc...)."
+    "Beyond that, it contains helpful ingredients to make them more user-friendly for common operations (addition, division, etc.)."
    ]
   },
   {

From 8b6109a116eac9b7803f5f887c1381bd75365c49 Mon Sep 17 00:00:00 2001
From: Matthew Feickert
Date: Wed, 26 Jul 2023 01:23:21 -0400
Subject: [PATCH 2/3] Add more detail on constraint terms

---
 book/IntroToHiFa.ipynb | 103 ++++++++++++++++++++++++-----------------
 1 file changed, 60 insertions(+), 43 deletions(-)

diff --git a/book/IntroToHiFa.ipynb b/book/IntroToHiFa.ipynb
index d5955ef..556ad39 100644
--- a/book/IntroToHiFa.ipynb
+++ b/book/IntroToHiFa.ipynb
@@ -88,13 +88,13 @@
     "\n",
     "\"common\n",
     "\n",
-    "Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\lambda$ and measurements $n$. For this incredibly simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
+    "Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\lambda$ and measurements $n$. For this simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
     "\n",
     "$$\n",
     "p(n|\lambda) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b)\n",
     "$$\n",
     "\n",
-    "Why Poisson? This is a counting experiment after all. A region we want to model will then just be a series of Poissons."
+    "A Poisson model is used as we are performing a counting experiment (counting the number of random events with an expected rate) in each bin of the observable."
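+    "\n",
+    "As a quick numerical sketch of this joint probability (a hedged aside, not part of the original tutorial flow: the expected yields below are illustrative placeholders, and `numpy` and `scipy.stats.poisson` are assumed to be available, as both are used elsewhere in this tutorial):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from scipy.stats import poisson\n",
+    "\n",
+    "observed = np.array([3, 4, 4])  # observed counts n_b in each bin\n",
+    "expected_yields = np.array([4.0, 5.0, 4.0])  # expected rates lambda_b (illustrative)\n",
+    "\n",
+    "# joint probability over bins: product of independent per-bin Poisson terms\n",
+    "p_joint = np.prod(poisson.pmf(observed, mu=expected_yields))\n",
+    "```"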
    ]
   },
   {
@@ -127,23 +127,25 @@
    "outputs": [],
    "source": [
     "fig, ax = plt.subplots()\n",
-    "ax.bar(bins, expected_yields, 1.0, label=r\"expected\", edgecolor=\"blue\", alpha=0.5)\n",
-    "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
+    "ax.bar(bins, expected_yields, 1.0, label=r\"Expected\", edgecolor=\"blue\", alpha=0.5)\n",
+    "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
     "ax.set_ylim(0, 6)\n",
-    "ax.legend();"
+    "ax.legend()\n",
+    "ax.set_xlabel(\"Observable\", fontsize=12)\n",
+    "ax.set_ylabel(\"Count\", fontsize=12);"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "However, we don't always often have just a single (MC) sample, and $\lambda$ is often the sum of multiple sample yields\n",
+    "However, we don't often have just a single expected (simulation) sample, and $\lambda$ is instead the sum of multiple sample yields\n",
     "\n",
     "$$\n",
     "\lambda = \sum_{\mathrm{sample}\ s} \lambda_s\n",
     "$$\n",
     "\n",
-    "A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. Then the p.d.f. might look like\n",
+    "A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. The model is then\n",
     "\n",
     "$$\n",
     "p(n|\lambda) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b) \qquad \lambda_b = \sum_{\mathrm{sample}\ s} \lambda_{bs}\n",
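+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a minimal sketch of this multi-sample case (the numbers are illustrative stand-ins, not the tutorial's actual `signal`/`background` arrays; `numpy` and `scipy.stats.poisson` assumed):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from scipy.stats import poisson\n",
+    "\n",
+    "signal = np.array([1.0, 2.0, 1.0])  # illustrative per-bin signal yields\n",
+    "background = np.array([2.0, 3.0, 2.5])  # illustrative per-bin background yields\n",
+    "observed = np.array([3, 4, 4])\n",
+    "\n",
+    "lam = signal + background  # lambda_b = sum over samples s of lambda_bs\n",
+    "p_joint = np.prod(poisson.pmf(observed, mu=lam))\n",
+    "```"
+   ]
+  },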
@@ -169,22 +171,24 @@
    "outputs": [],
    "source": [
     "fig, ax = plt.subplots()\n",
-    "ax.bar(bins, background, 1.0, label=r\"$t\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
+    "ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
     "ax.bar(\n",
-    "    bins, signal, 1.0, label=r\"signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
+    "    bins, signal, 1.0, label=r\"Signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
     ")\n",
-    "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
+    "ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
     "ax.set_ylim(0, 6)\n",
-    "ax.legend();"
+    "ax.legend()\n",
+    "ax.set_xlabel(\"Observable\", fontsize=12)\n",
+    "ax.set_ylabel(\"Count\", fontsize=12);"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc.\n",
     "\n",
-    "Suppose we wanted a $\mu_s$ that is a normalization factor scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\lambda$ becomes a function of $\theta = \{\mu\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. expands to be\n",
+    "Suppose we want a normalization factor $\mu_s$ that scales the sample yield up (or down!). For example, we may want to parametrize the signal strength without changing the background. $\lambda$ then becomes a function of $\theta = \{\mu\}$ (the set of parameters that determine the expected event rate), and our p.d.f. expands to be\n",
     "\n",
     "$$\n",
     "p(n|\lambda(\mu)) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b(\theta)) \qquad \lambda_b(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{bs}(\theta)\n",
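+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A sketch of evaluating this likelihood as a function of the signal strength $\mu$ (illustrative yields again; this is the quantity a fit would maximize):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from scipy.stats import poisson\n",
+    "\n",
+    "signal = np.array([1.0, 2.0, 1.0])  # illustrative per-bin yields\n",
+    "background = np.array([2.0, 3.0, 2.5])\n",
+    "observed = np.array([3, 4, 4])\n",
+    "\n",
+    "def likelihood(mu):\n",
+    "    lam = mu * signal + background  # lambda_b(theta) with theta = {mu}\n",
+    "    return np.prod(poisson.pmf(observed, mu=lam))\n",
+    "\n",
+    "for mu in (0.0, 0.5, 1.0, 2.0):\n",
+    "    print(f\"mu = {mu}: p = {likelihood(mu):.5f}\")\n",
+    "```"
+   ]
+  },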
@@ -211,46 +215,57 @@
     "    print(f\"observed: {observed}\\n\")\n",
     "\n",
     "    fig, ax = plt.subplots()\n",
-    "    ax.bar(bins, background, 1.0, label=r\"$t\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
+    "    ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
     "    ax.bar(\n",
     "        bins,\n",
     "        signal,\n",
     "        1.0,\n",
-    "        label=r\"signal\",\n",
+    "        label=r\"Signal\",\n",
     "        edgecolor=\"blue\",\n",
     "        bottom=background,\n",
     "        alpha=0.5,\n",
     "    )\n",
-    "    ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
+    "    ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
     "    ax.set_ylim(0, 6)\n",
-    "    ax.legend();"
+    "    ax.legend()\n",
+    "    ax.set_xlabel(\"Observable\", fontsize=12)\n",
+    "    ax.set_ylabel(\"Count\", fontsize=12);"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "One final thing to finish our build up of a simplified p.d.f. is about **auxiliary measurements**. What we mean is that perhaps the background sample is modeled by some normalization parameter, but we've also performed additional measurements in a separate analysis that constraints the parametrization (e.g. Jet Energy Scale) so we have stronger confidence that the true parameter is within a certain range.\n",
+    "One final thing to finish our build-up of a simplified HistFactory model is the concept of **auxiliary measurements**. Perhaps the background sample rate is modified by some normalization parameter, and we've made measurements of this parameter in a separate analysis (e.g. studies of the Jet Energy Scale). These prior experimental studies give a constraint that the parameter lies within a certain range.\n",
     "\n",
-    "For some parameters in a statistical model, all we have to infer its values is the given analysis. These parameters are **unconstrained** ($\eta$):\n",
+    "For some parameters in a statistical model we don't have prior experimental evidence for their values and must infer their values from the given analysis alone. These are **unconstrained** parameters ($\eta$) and enter into the main model as parameters of the event rate $\lambda(\theta)$\n",
     "\n",
     "$$\n",
-    "p(n | \lambda(\theta))\n",
+    "p(n | \lambda(\theta)).\n",
     "$$\n",
     "\n",
-    "For many parameters, we have the **auxiliary data** ($a$) given as an *auxiliary measurement* which is added to the main p.d.f.. These parameters are **constrained** ($\chi$).\n",
+    "For many model parameters, their values are constrained by a _constraint term function_, included in the model along with the main model p.d.f., which describes **auxiliary measurements/data** ($a$) about the model parameter. These are **constrained** parameters ($\chi$) and enter into the model both in the constraint terms and as parameters of the event rate $\lambda(\theta)$\n",
     "\n",
     "$$\n",
     "p_\chi(a | \chi)\n",
     "$$\n",
     "\n",
-    "where $\theta = \{\eta, \chi\}$. This constraining function can generally be anything, but most of the time in HistFactory - it's a Gaussian or a Poisson. The p.d.f. expands to be\n",
+    "where $\theta = \{\eta, \chi\}$. The constraint term's functional form is chosen based on the physics it represents; in HistFactory, most constraint terms are modeled as a Normal (Gaussian) or Poisson distribution.\n",
+    "\n",
+    "With the constraint terms, the model expands to be\n",
     "\n",
     "$$\n",
     "p(n,a|\lambda(\theta)) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b(\theta)) \prod_{\mathrm{constraint}\ \chi} p_\chi(a_\chi | \chi) \qquad \lambda_b(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{bs}(\theta)\n",
     "$$\n",
     "\n",
-    "For this simple example, let's consider a Gaussian centered at $\mu=0$ with $\sigma=1$ for constraining the normalization on the background where an up-variation ($\mu_b = +1$) scales by 1.3, and a down-variation ($\mu_b = -1$) scales by 0.8."
+    "where the expected event rate $\lambda_b(\theta)$ is modified from its nominal value by a **chosen interpolation function** that smoothly interpolates between the up- and down-variations $(\pm1 \sigma)$ of the constraint term to provide an event rate modifer for any value of $\chi$."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For this simple example, let's consider a constraint term of a Normal distribution centered at $\mu=0$ (\"auxiliary measurement\" $a=0$) with $\sigma=1$ that constrains the background normalization, where an up-variation ($\mu_b = +1$) scales the background by 1.3 and a down-variation ($\mu_b = -1$) scales it by 0.8."
    ]
   },
@@ -259,11 +274,13 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "def gaussian_constraint(mu_b=0.0):\n",
-    "    return norm.pdf(mu_b, loc=0.0, scale=1.0)\n",
+    "def normal_constraint(mu_b=0.0):\n",
+    "    # auxiliary measurement of 0\n",
+    "    # though note that for a Normal, the observation and mean are symmetric under exchange\n",
+    "    return norm.pdf(0.0, loc=mu_b, scale=1.0)\n",
     "\n",
     "\n",
-    "# interpolating\n",
+    "# selected interpolation function\n",
     "def interpolate(down, nom, up, alpha):\n",
     "    if alpha >= 0:\n",
     "        return (up - nom) * alpha + 1\n",
@@ -281,51 +298,51 @@
     "    print(f\"signal: {signal}\")\n",
     "    print(f\"background: {background}\")\n",
     "    print(f\"observed: {observed}\")\n",
-    "    print(\n",
-    "        f\"likelihood scaled by: {gaussian_constraint(mu_b)/gaussian_constraint(0.0)}\\n\"\n",
-    "    )\n",
+    "    print(f\"likelihood scaled by: {normal_constraint(mu_b)/normal_constraint(0.0)}\\n\")\n",
     "\n",
     "    fig, ax = plt.subplots()\n",
-    "    ax.bar(bins, background, 1.0, label=r\"$t\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
+    "    ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
     "    ax.bar(\n",
     "        bins,\n",
     "        signal,\n",
     "        1.0,\n",
-    "        label=r\"signal\",\n",
+    "        label=r\"Signal\",\n",
     "        edgecolor=\"blue\",\n",
     "        bottom=background,\n",
     "        alpha=0.5,\n",
     "    )\n",
-    "    ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
+    "    ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
     "    ax.set_ylim(0, 6)\n",
-    "    ax.legend();"
+    "    ax.legend()\n",
+    "    ax.set_xlabel(\"Observable\", fontsize=12)\n",
+    "    ax.set_ylabel(\"Count\", fontsize=12);"
    ]
   },
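+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Putting the pieces together, here is a minimal sketch of the full single-channel model $p(n,a|\lambda(\theta))$: the product of per-bin Poisson terms multiplied by the Normal constraint term. The yields are illustrative, and the negative-$\alpha$ branch of `interp` is one simple piecewise-linear choice, written out because the notebook's own `interpolate` continues beyond what is shown above:\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from scipy.stats import norm, poisson\n",
+    "\n",
+    "signal = np.array([1.0, 2.0, 1.0])  # illustrative per-bin yields\n",
+    "background = np.array([2.0, 3.0, 2.5])\n",
+    "observed = np.array([3, 4, 4])\n",
+    "\n",
+    "def interp(down, nom, up, alpha):\n",
+    "    # piecewise-linear: alpha = +1 gives the up-variation, alpha = -1 the down-variation\n",
+    "    if alpha >= 0:\n",
+    "        return (up - nom) * alpha + 1\n",
+    "    return (nom - down) * alpha + 1\n",
+    "\n",
+    "def likelihood(mu, mu_b):\n",
+    "    # main model: per-bin Poisson terms with the interpolated background rate\n",
+    "    lam = mu * signal + interp(0.8, 1.0, 1.3, mu_b) * background\n",
+    "    main = np.prod(poisson.pmf(observed, mu=lam))\n",
+    "    # constraint term: auxiliary measurement a = 0 of the Normal(mu_b, 1) parameter\n",
+    "    return main * norm.pdf(0.0, loc=mu_b, scale=1.0)\n",
+    "```\n",
+    "\n",
+    "Maximizing this product over $(\mu, \mu_b)$ is exactly the kind of fit that HistFactory-style tools perform."
+   ]
+  },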
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "But that's not all! Notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different signal regions and control regions. Therefore, we compute the likelihood as\n",
+    "Howerver, notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different analysis signal regions and control regions. Therefore, we compute the likelihood as\n",
     "\n",
     "$$\n",
     "p_\text{main} = p_\text{channel1} * p_\text{channel2} * p_\text{channel3} \cdots\n",
     "$$\n",
     "\n",
-    "So in fact, we can then expand out the likelihood definition further\n",
+    "We then expand out the likelihood definition further across channels\n",
     "\n",
     "$$\n",
-    "p(n,a|\theta) = \underbrace{\prod_{\mathrm{channel}\ c}\prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_{cb} | \lambda_{cb}(\theta))}_{\text{main}} \underbrace{\prod_{\mathrm{constraint}\ \chi} p_\chi(a_\chi | \chi)}_{\text{auxiliary}} \qquad \lambda_{cb}(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{cbs}(\theta)\n",
+    "p(n,a|\theta) = \underbrace{\prod_{\mathrm{channel}\ c}\prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_{cb} | \lambda_{cb}(\theta))}_{\text{main}}\, \underbrace{\prod_{\mathrm{constraint}\ \chi} p_\chi(a_\chi | \chi)}_{\text{auxiliary}} \qquad \lambda_{cb}(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{cbs}(\theta)\n",
     "$$\n",
     "\n",
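+    "As a hedged sketch of this full bookkeeping for two illustrative channels (a signal region and a background-dominated control region; every number here is made up, and a real analysis would let `pyhf` build this from a declarative specification instead):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from scipy.stats import norm, poisson\n",
+    "\n",
+    "observed = {\"SR\": np.array([3, 4, 4]), \"CR\": np.array([10, 12, 9])}\n",
+    "\n",
+    "def expected(channel, mu, mu_b):\n",
+    "    # lambda_cb(theta): piecewise-linear background scale, 1.3 up / 0.8 down\n",
+    "    scale = (0.3 * mu_b + 1) if mu_b >= 0 else (0.2 * mu_b + 1)\n",
+    "    rates = {\n",
+    "        \"SR\": mu * np.array([1.0, 2.0, 1.0]) + scale * np.array([2.0, 3.0, 2.5]),\n",
+    "        \"CR\": scale * np.array([10.0, 11.0, 9.0]),\n",
+    "    }\n",
+    "    return rates[channel]\n",
+    "\n",
+    "def likelihood(mu, mu_b):\n",
+    "    # main: product over channels and bins of Poisson terms\n",
+    "    main = np.prod(\n",
+    "        [np.prod(poisson.pmf(observed[c], mu=expected(c, mu, mu_b))) for c in observed]\n",
+    "    )\n",
+    "    # auxiliary: product of constraint terms (here a single Normal)\n",
+    "    return main * norm.pdf(0.0, loc=mu_b, scale=1.0)\n",
+    "```\n",
+    "\n",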
-    "As you can see, this is sort of a bookkeeping problem. We have two pieces of this likelihood:\n",
-    "* our main model, which consists of\n",
+    "There are now two pieces of the model:\n",
+    "* the main model, which consists of\n",
     "  * several channels (regions, histograms, etc), where\n",
-    "    * each channel is a set of simultaneous Poissons measuring the bin count against an expected value, where\n",
+    "    * each channel is a set of Poisson terms comparing the observed bin count with an expected value, where\n",
     "      * the expected value is the sum of various samples, where\n",
     "        * each samples expected value can be a function of parameters (or modifiers)\n",
     "* the constraint model, which consists of\n",
-    "  * several auxiliary measurements, where\n",
-    "    * each measurement comes with auxiliary data\n",
+    "  * constraint terms on model parameters, where\n",
+    "    * each constraint term describes auxiliary measurements\n",
     " \n",
     "It should be clear by now that this is quite a lot of pieces to keep track of. This is where HistFactory comes in to play. Using HistFactory, we can\n",
     "* describe observed event rates and expected event rates\n",
@@ -433,7 +450,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -447,7 +464,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.7"
+   "version": "3.10.4"
   }
  },
 "nbformat": 4,

From bca5a7e72ba17a0c50a4e13037107d913225f7b4 Mon Sep 17 00:00:00 2001
From: Matthew Feickert
Date: Wed, 26 Jul 2023 01:26:14 -0400
Subject: [PATCH 3/3] fix typos

---
 book/IntroToHiFa.ipynb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/book/IntroToHiFa.ipynb b/book/IntroToHiFa.ipynb
index 556ad39..73324a9 100644
--- a/book/IntroToHiFa.ipynb
+++ b/book/IntroToHiFa.ipynb
@@ -258,7 +258,7 @@
     "p(n,a|\lambda(\theta)) = \prod_{\mathrm{bin}\ b} \mathrm{Pois}(n_b | \lambda_b(\theta)) \prod_{\mathrm{constraint}\ \chi} p_\chi(a_\chi | \chi) \qquad \lambda_b(\theta) = \sum_{\mathrm{sample}\ s} \lambda_{bs}(\theta)\n",
     "$$\n",
     "\n",
-    "where the expected event rate $\lambda_b(\theta)$ is modified from its nominal value by a **chosen interpolation function** that smoothly interpolates between the up- and down-variations $(\pm1 \sigma)$ of the constraint term to provide an event rate modifer for any value of $\chi$."
+    "where the expected event rate $\lambda_b(\theta)$ is modified from its nominal value by a **chosen interpolation function** that smoothly interpolates between the up- and down-variations $(\pm1 \sigma)$ of the constraint term to provide an event rate modifier for any value of $\chi$."
    ]
   },
   {
@@ -322,7 +322,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Howerver, notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different analysis signal regions and control regions. Therefore, we compute the likelihood as\n",
+    "However, notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different analysis signal regions and control regions. Therefore, we compute the likelihood as\n",
     "\n",
     "$$\n",
     "p_\text{main} = p_\text{channel1} * p_\text{channel2} * p_\text{channel3} \cdots\n",