feat: Expand discussion of constraints and modifiers in HistFactory Intro (#55)

* Add further discussion of constraint terms and modifiers.
* Add links to other tools that use HistFactory.
* Add axis titles to plots.
matthewfeickert authored Jul 26, 2023
1 parent b5fe6d0 commit aa3d13b
Showing 1 changed file with 65 additions and 48 deletions.
113 changes: 65 additions & 48 deletions book/IntroToHiFa.ipynb
@@ -48,9 +48,9 @@
"Like HistFactory, `pyhf` does not work with unbinned analyses. These will not be covered in the tutorial.\n",
"\n",
"So what uses HistFactory?\n",
"* TRexFitter\n",
"* WSMaker\n",
"* HistFitter\n",
"* [TRexFitter](https://trexfitter-docs.web.cern.ch/trexfitter-docs/) (ATLAS internal)\n",
"* [WSMaker](https://gitlab.cern.ch/atlas-physics/higgs/hbb/WSMaker) (ATLAS internal)\n",
"* [HistFitter](https://github.com/histfitter/histfitter)\n",
"\n",
"Most everyone in SUSY and Exotics who performs an asymptotic fit as part of their analysis is likely using HistFactory!"
]
@@ -66,9 +66,9 @@
"What is a histogram? Fundamentally, a histogram is a tool to bookkeep arrays of numbers:\n",
"* binning\n",
"* counts\n",
"* errors\n",
"* uncertainties\n",
"\n",
"Beyond that, it contains lots of other magic ingredients to make them more user-friendly for common operations (addition, division, etc...)."
"Beyond that, it contains helpful ingredients to make them more user-friendly for common operations (addition, division, etc...)."
]
},
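As a minimal sketch of this bookkeeping view (an illustration added alongside the diff with hypothetical names, not code from the notebook):

```python
import numpy as np

# A histogram as three parallel arrays: bin edges, per-bin counts,
# and per-bin uncertainties (sqrt(N) is a common Poisson default).
bin_edges = np.array([0.0, 1.0, 2.0, 3.0])  # 3 bins
counts = np.array([3.0, 4.0, 4.0])
uncertainties = np.sqrt(counts)

# Common operations are then just array arithmetic, e.g. adding
# two histograms with identical binning:
other_counts = np.array([1.0, 2.0, 1.0])
summed_counts = counts + other_counts
```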
{
@@ -88,13 +88,13 @@
"\n",
"<img src=\"https://raw.githubusercontent.com/scikit-hep/pyhf/main/docs/_static/img/README_1bin_example.png\" alt=\"common operation - parameter scan\" width=400 />\n",
"\n",
"Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this incredibly simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
"Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
"\n",
"$$\n",
"p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b)\n",
"$$\n",
"\n",
"Why Poisson? This is a counting experiment after all. A region we want to model will then just be a series of Poissons."
"A Poisson model is used as we are performing a counting experiment (counting the number of random events with an expected rate) in each bin of the observable."
]
},
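To make the joint probability concrete, here is a short added sketch (the `expected_yields` values are assumed for illustration; the observed counts `[3, 4, 4]` appear later in the notebook):

```python
from scipy.stats import poisson

expected_yields = [4.0, 5.0, 4.0]  # assumed lambda_b values for illustration
observed = [3, 4, 4]

# p(n | lambda) = product over bins b of Pois(n_b | lambda_b)
joint_probability = poisson.pmf(observed, expected_yields).prod()
print(joint_probability)
```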
{
@@ -127,23 +127,25 @@
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.bar(bins, expected_yields, 1.0, label=r\"expected\", edgecolor=\"blue\", alpha=0.5)\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
"ax.bar(bins, expected_yields, 1.0, label=r\"Expected\", edgecolor=\"blue\", alpha=0.5)\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
"ax.set_ylim(0, 6)\n",
"ax.legend();"
"ax.legend()\n",
"ax.set_xlabel(\"Observable\", fontsize=12)\n",
"ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, we don't always often have just a single (MC) sample, and $\\lambda$ is often the sum of multiple sample yields\n",
"However, we don't always often have just a single expected (simulation) sample, and $\\lambda$ is often the sum of multiple sample yields\n",
"\n",
"$$\n",
"\\lambda = \\sum_{\\mathrm{sample}\\ s} \\lambda_s\n",
"$$\n",
"\n",
"A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. Then the p.d.f. might look like\n",
"A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. The model is then\n",
"\n",
"$$\n",
"p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b) \\qquad \\lambda_b = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}\n",
@@ -169,22 +171,24 @@
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
"ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
"ax.bar(\n",
" bins, signal, 1.0, label=r\"signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
" bins, signal, 1.0, label=r\"Signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
")\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
"ax.set_ylim(0, 6)\n",
"ax.legend();"
"ax.legend()\n",
"ax.set_xlabel(\"Observable\", fontsize=12)\n",
"ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc...\n",
"Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc.\n",
"\n",
"Suppose we wanted a $\\mu_s$ that is a normalization factor scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. expands to be\n",
"Suppose we wanted a a normalization factor $\\mu_s$ scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. expands to be\n",
"\n",
"$$\n",
"p(n|\\lambda(\\mu)) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b(\\theta)) \\qquad \\lambda_b(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}(\\theta)\n",
@@ -211,46 +215,57 @@
" print(f\"observed: {observed}\\n\")\n",
"\n",
" fig, ax = plt.subplots()\n",
" ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(\n",
" bins,\n",
" signal,\n",
" 1.0,\n",
" label=r\"signal\",\n",
" label=r\"Signal\",\n",
" edgecolor=\"blue\",\n",
" bottom=background,\n",
" alpha=0.5,\n",
" )\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
" ax.set_ylim(0, 6)\n",
" ax.legend();"
" ax.legend()\n",
" ax.set_xlabel(\"Observable\", fontsize=12)\n",
" ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One final thing to finish our build up of a simplified p.d.f. is about **auxiliary measurements**. What we mean is that perhaps the background sample is modeled by some normalization parameter, but we've also performed additional measurements in a separate analysis that constraints the parametrization (e.g. Jet Energy Scale) so we have stronger confidence that the true parameter is within a certain range.\n",
"One final thing to finish our build up of a simplified HistFactory model is the concept of **auxiliary measurements**. Perhaps the background sample rate is modified by some normalization parameter, and we've made measurements of this parameter in a separate analysis (e.g. studies of the Jet Energy Scale). These prior experimental studies give a constraint that the parameter lies within a certain range.\n",
"\n",
"For some parameters in a statistical model, all we have to infer its values is the given analysis. These parameters are **unconstrained** ($\\eta$):\n",
"For some parameters in a statistical model we don't have prior experimental evidence for their values and must infer its values is the given analysis. These are **unconstrained** parameters ($\\eta$) and enter into the main model as parameters of the event rate $\\lambda(\\theta)$\n",
"\n",
"$$\n",
"p(n | \\lambda(\\theta))\n",
"p(n | \\lambda(\\theta)).\n",
"$$\n",
"\n",
"For many parameters, we have the **auxiliary data** ($a$) given as an *auxiliary measurement* which is added to the main p.d.f.. These parameters are **constrained** ($\\chi$).\n",
"For many model parameters, their values in the model are constrained by a _constraint term function_, included in the model along with the the main model p.d.f, which describes **auxiliary measurements/data** ($a$) about the model parameter. These are **constrained** parameters ($\\chi$) and enter into the model both in the constraint terms and as parameters of the event rate $\\lambda(\\theta)$\n",
"\n",
"$$\n",
"p_\\chi(a | \\chi)\n",
"$$\n",
"\n",
"where $\\theta = \\{\\eta, \\chi\\}$. This constraining function can generally be anything, but most of the time in HistFactory - it's a Gaussian or a Poisson. The p.d.f. expands to be\n",
"where $\\theta = \\{\\eta, \\chi\\}$. This constraining function model is chosen by the physics it represents, but in HistFactory most constraint terms are modeled as a Normal (Gaussian) or Poisson.\n",
"\n",
"With the constraint terms the model expands to be\n",
"\n",
"$$\n",
"p(n,a|\\lambda(\\theta)) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b(\\theta)) \\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi) \\qquad \\lambda_b(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}(\\theta)\n",
"$$\n",
"\n",
"For this simple example, let's consider a Gaussian centered at $\\mu=0$ with $\\sigma=1$ for constraining the normalization on the background where an up-variation ($\\mu_b = +1$) scales by 1.3, and a down-variation ($\\mu_b = -1$) scales by 0.8."
"where the expected event rate $\\lambda_b(\\theta)$ is modified from its nominal value by a **chosen interpolation function** that smoothly interpolates between the up- and down-variations $(\\pm1 \\sigma)$ of the constraint term to provide an event rate modifier for any value of $\\chi$."
]
},
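An added sketch of the expanded product, with a single Normal constraint on the background normalization using the 0.8/1.3 variations from the text (all yields and names assumed):

```python
import numpy as np
from scipy.stats import norm, poisson

signal = np.array([0.5, 1.0, 0.5])      # assumed yields
background = np.array([2.5, 3.0, 3.5])
observed = np.array([3, 4, 4])

def joint_model(mu_s, mu_b):
    # piecewise-linear interpolation between the down (0.8) and
    # up (1.3) variations of the background normalization
    if mu_b >= 0:
        factor = (1.3 - 1.0) * mu_b + 1.0
    else:
        factor = (1.0 - 0.8) * mu_b + 1.0
    expected = mu_s * signal + factor * background
    main = poisson.pmf(observed, expected).prod()
    # constraint term: auxiliary measurement a = 0 for parameter mu_b
    constraint = norm.pdf(0.0, loc=mu_b, scale=1.0)
    return main * constraint

print(joint_model(1.0, 0.0))
```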
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this simple example, let's consider a constraint term of a Normal distribution centered at $\\mu=0$ (\"auxiliary measurement\" $a=0$) with $\\sigma=1$ for constraining the normalization on the background where an up-variation ($\\mu_b = +1$) scales by 1.3, and a down-variation ($\\mu_b = -1$) scales by 0.8."
]
},
{
@@ -259,11 +274,13 @@
"metadata": {},
"outputs": [],
"source": [
"def gaussian_constraint(mu_b=0.0):\n",
" return norm.pdf(mu_b, loc=0.0, scale=1.0)\n",
"def normal_constraint(mu_b=0.0):\n",
" # auxiliary measurement of 0\n",
" # though note that for Normal observation and mean are symmetric under exchange\n",
" return norm.pdf(0.0, loc=mu_b, scale=1.0)\n",
"\n",
"\n",
"# interpolating\n",
"# selected interpolation function\n",
"def interpolate(down, nom, up, alpha):\n",
" if alpha >= 0:\n",
" return (up - nom) * alpha + 1\n",
@@ -281,51 +298,51 @@
" print(f\"signal: {signal}\")\n",
" print(f\"background: {background}\")\n",
" print(f\"observed: {observed}\")\n",
" print(\n",
" f\"likelihood scaled by: {gaussian_constraint(mu_b)/gaussian_constraint(0.0)}\\n\"\n",
" )\n",
" print(f\"likelihood scaled by: {normal_constraint(mu_b)/normal_constraint(0.0)}\\n\")\n",
"\n",
" fig, ax = plt.subplots()\n",
" ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(\n",
" bins,\n",
" signal,\n",
" 1.0,\n",
" label=r\"signal\",\n",
" label=r\"Signal\",\n",
" edgecolor=\"blue\",\n",
" bottom=background,\n",
" alpha=0.5,\n",
" )\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
" ax.set_ylim(0, 6)\n",
" ax.legend();"
" ax.legend()\n",
" ax.set_xlabel(\"Observable\", fontsize=12)\n",
" ax.set_ylabel(\"Count\", fontsize=12);"
]
},
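A quick added check using the `normal_constraint` defined above: pulling $\mu_b$ away from the auxiliary measurement multiplies the likelihood by an increasingly small factor:

```python
for mu_b in (0.0, 0.5, 1.0, 2.0):
    penalty = normal_constraint(mu_b) / normal_constraint(0.0)
    print(f"mu_b = {mu_b}: likelihood scaled by {penalty:.3f}")
```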
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But that's not all! Notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different signal regions and control regions. Therefore, we compute the likelihood as\n",
"However, notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different analysis signal regions and control regions. Therefore, we compute the likelihood as\n",
"\n",
"$$\n",
"p_\\text{main} = p_\\text{channel1} * p_\\text{channel2} * p_\\text{channel3} \\cdots\n",
"$$\n",
"\n",
"So in fact, we can then expand out the likelihood definition further\n",
"We then expand out the likelihood definition further across channels\n",
"\n",
"$$\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}} \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}}\\, \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"$$\n",
"\n",
"As you can see, this is sort of a bookkeeping problem. We have two pieces of this likelihood:\n",
"* our main model, which consists of\n",
"There are now two pieces of the model:\n",
"* the main model, which consists of\n",
" * several channels (regions, histograms, etc), where\n",
" * each channel is a set of simultaneous Poissons measuring the bin count against an expected value, where\n",
" * each channel is a set of Poissons measuring the bin count for an expected value, where\n",
" * the expected value is the sum of various samples, where\n",
" * each samples expected value can be a function of parameters (or modifiers)\n",
"* the constraint model, which consists of\n",
" * several auxiliary measurements, where\n",
" * each measurement comes with auxiliary data\n",
" * constraint terms on model parameters, where\n",
" * each constraint term describes auxiliary measurements\n",
" \n",
"It should be clear by now that this is quite a lot of pieces to keep track of. This is where HistFactory comes in to play. Using HistFactory, we can\n",
"* describe observed event rates and expected event rates\n",
@@ -433,7 +450,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -447,7 +464,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.7"
"version": "3.10.4"
}
},
"nbformat": 4,
