feat: Expand discussion of constraints and modifiers in HistFactory Intro #55

Merged · 3 commits · Jul 26, 2023
113 changes: 65 additions & 48 deletions book/IntroToHiFa.ipynb
@@ -48,9 +48,9 @@
"Like HistFactory, `pyhf` does not work with unbinned analyses. These will not be covered in the tutorial.\n",
"\n",
"So what uses HistFactory?\n",
"* TRexFitter\n",
"* WSMaker\n",
"* HistFitter\n",
"* [TRexFitter](https://trexfitter-docs.web.cern.ch/trexfitter-docs/) (ATLAS internal)\n",
"* [WSMaker](https://gitlab.cern.ch/atlas-physics/higgs/hbb/WSMaker) (ATLAS internal)\n",
"* [HistFitter](https://github.com/histfitter/histfitter)\n",
"\n",
"Most everyone in SUSY and Exotics who performs an asymptotic fit as part of their analysis is likely using HistFactory!"
]
@@ -66,9 +66,9 @@
"What is a histogram? Fundamentally, a histogram is a tool to bookkeep arrays of numbers:\n",
"* binning\n",
"* counts\n",
"* errors\n",
"* uncertainties\n",
"\n",
"Beyond that, it contains lots of other magic ingredients to make them more user-friendly for common operations (addition, division, etc...)."
"Beyond that, it contains helpful ingredients to make them more user-friendly for common operations (addition, division, etc...)."
]
},
{
@@ -88,13 +88,13 @@
"\n",
"<img src=\"https://raw.githubusercontent.com/scikit-hep/pyhf/main/docs/_static/img/README_1bin_example.png\" alt=\"common operation - parameter scan\" width=400 />\n",
"\n",
"Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this incredibly simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
"Let's make up some samples and histograms to go along with it to understand what's going on. Suppose we have an analysis with expected event rate $\\lambda$ and measurements $n$. For this simple case, the overall probability of the full experiment is the **joint probability** of each bin:\n",
"\n",
"$$\n",
"p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b)\n",
"$$\n",
"\n",
"Why Poisson? This is a counting experiment after all. A region we want to model will then just be a series of Poissons."
"A Poisson model is used as we are performing a counting experiment (counting the number of random events with an expected rate) in each bin of the observable."
]
},
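As a sketch of this joint probability (the expected yields here are assumed for illustration; the notebook defines its own `expected_yields` in a cell not shown in this diff):

```python
import numpy as np
from scipy.stats import poisson

expected_yields = np.array([4.0, 5.0, 4.0])  # assumed lambda_b values
observed = np.array([3, 4, 4])               # the n_b used throughout

# the joint probability is the product of the per-bin Poisson terms
p_joint = np.prod(poisson.pmf(observed, mu=expected_yields))
print(p_joint)
```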
{
@@ -127,23 +127,25 @@
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.bar(bins, expected_yields, 1.0, label=r\"expected\", edgecolor=\"blue\", alpha=0.5)\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
"ax.bar(bins, expected_yields, 1.0, label=r\"Expected\", edgecolor=\"blue\", alpha=0.5)\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
"ax.set_ylim(0, 6)\n",
"ax.legend();"
"ax.legend()\n",
"ax.set_xlabel(\"Observable\", fontsize=12)\n",
"ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, we don't always often have just a single (MC) sample, and $\\lambda$ is often the sum of multiple sample yields\n",
"However, we don't always often have just a single expected (simulation) sample, and $\\lambda$ is often the sum of multiple sample yields\n",
"\n",
"$$\n",
"\\lambda = \\sum_{\\mathrm{sample}\\ s} \\lambda_s\n",
"$$\n",
"\n",
"A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. Then the p.d.f. might look like\n",
"A typical case might be multiple (sub)dominant backgrounds or having a model where the observed events are described by a signal + background p.d.f. The model is then\n",
"\n",
"$$\n",
"p(n|\\lambda) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b) \\qquad \\lambda_b = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}\n",
@@ -169,22 +171,24 @@
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
"ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
"ax.bar(\n",
" bins, signal, 1.0, label=r\"signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
" bins, signal, 1.0, label=r\"Signal\", edgecolor=\"blue\", bottom=background, alpha=0.5\n",
")\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
"ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
"ax.set_ylim(0, 6)\n",
"ax.legend();"
"ax.legend()\n",
"ax.set_xlabel(\"Observable\", fontsize=12)\n",
"ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc...\n",
"Already, you can see the p.d.f. for this simple case starts expanding to be a little bit more generic, and a little bit more flexible. Now we want to incorporate when the expected yields for signal and backgrounds depend on some **parameters**, perhaps how we applied calibrations to some objects, or how we configured our Monte-Carlo generators, etc.\n",
"\n",
"Suppose we wanted a $\\mu_s$ that is a normalization factor scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. expands to be\n",
"Suppose we wanted a a normalization factor $\\mu_s$ scaling up (or down!) the sample. For example, if we want to parametrize the signal strength (without changing background). So $\\lambda$ becomes a function of $\\theta = \\{\\mu\\}$ (a set of the parameters that determine the expected event rate), then our p.d.f. expands to be\n",
"\n",
"$$\n",
"p(n|\\lambda(\\mu)) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b(\\theta)) \\qquad \\lambda_b(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}(\\theta)\n",
@@ -211,46 +215,57 @@
" print(f\"observed: {observed}\\n\")\n",
"\n",
" fig, ax = plt.subplots()\n",
" ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(\n",
" bins,\n",
" signal,\n",
" 1.0,\n",
" label=r\"signal\",\n",
" label=r\"Signal\",\n",
" edgecolor=\"blue\",\n",
" bottom=background,\n",
" alpha=0.5,\n",
" )\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
" ax.set_ylim(0, 6)\n",
" ax.legend();"
" ax.legend()\n",
" ax.set_xlabel(\"Observable\", fontsize=12)\n",
" ax.set_ylabel(\"Count\", fontsize=12);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One final thing to finish our build up of a simplified p.d.f. is about **auxiliary measurements**. What we mean is that perhaps the background sample is modeled by some normalization parameter, but we've also performed additional measurements in a separate analysis that constraints the parametrization (e.g. Jet Energy Scale) so we have stronger confidence that the true parameter is within a certain range.\n",
"One final thing to finish our build up of a simplified HistFactory model is the concept of **auxiliary measurements**. Perhaps the background sample rate is modified by some normalization parameter, and we've made measurements of this parameter in a separate analysis (e.g. studies of the Jet Energy Scale). These prior experimental studies give a constraint that the parameter lies within a certain range.\n",
"\n",
"For some parameters in a statistical model, all we have to infer its values is the given analysis. These parameters are **unconstrained** ($\\eta$):\n",
"For some parameters in a statistical model we don't have prior experimental evidence for their values and must infer its values is the given analysis. These are **unconstrained** parameters ($\\eta$) and enter into the main model as parameters of the event rate $\\lambda(\\theta)$\n",
"\n",
"$$\n",
"p(n | \\lambda(\\theta))\n",
"p(n | \\lambda(\\theta)).\n",
"$$\n",
"\n",
"For many parameters, we have the **auxiliary data** ($a$) given as an *auxiliary measurement* which is added to the main p.d.f.. These parameters are **constrained** ($\\chi$).\n",
"For many model parameters, their values in the model are constrained by a _constraint term function_, included in the model along with the the main model p.d.f, which describes **auxiliary measurements/data** ($a$) about the model parameter. These are **constrained** parameters ($\\chi$) and enter into the model both in the constraint terms and as parameters of the event rate $\\lambda(\\theta)$\n",
"\n",
"$$\n",
"p_\\chi(a | \\chi)\n",
"$$\n",
"\n",
"where $\\theta = \\{\\eta, \\chi\\}$. This constraining function can generally be anything, but most of the time in HistFactory - it's a Gaussian or a Poisson. The p.d.f. expands to be\n",
"where $\\theta = \\{\\eta, \\chi\\}$. This constraining function model is chosen by the physics it represents, but in HistFactory most constraint terms are modeled as a Normal (Gaussian) or Poisson.\n",
"\n",
"With the constraint terms the model expands to be\n",
"\n",
"$$\n",
"p(n,a|\\lambda(\\theta)) = \\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_b | \\lambda_b(\\theta)) \\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi) \\qquad \\lambda_b(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{bs}(\\theta)\n",
"$$\n",
"\n",
"For this simple example, let's consider a Gaussian centered at $\\mu=0$ with $\\sigma=1$ for constraining the normalization on the background where an up-variation ($\\mu_b = +1$) scales by 1.3, and a down-variation ($\\mu_b = -1$) scales by 0.8."
"where the expected event rate $\\lambda_b(\\theta)$ is modified from its nominal value by a **chosen interpolation function** that smoothly interpolates between the up- and down-variations $(\\pm1 \\sigma)$ of the constraint term to provide an event rate modifier for any value of $\\chi$."
]
},
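Putting the pieces together, a minimal sketch of the constrained model (yields assumed; a simple linear background modifier stands in for the interpolation function built in the next cells):

```python
import numpy as np
from scipy.stats import norm, poisson

signal = np.array([1.0, 2.0, 1.5])      # assumed
background = np.array([2.5, 2.0, 2.5])  # assumed
observed = np.array([3, 4, 4])

def constrained_likelihood(mu, mu_b):
    # main model: per-bin Poissons with the background rate modified by mu_b
    scale = 1.0 + 0.25 * mu_b  # stand-in linear modifier, illustration only
    expected = mu * signal + scale * background
    p_main = np.prod(poisson.pmf(observed, mu=expected))
    # constraint term: auxiliary data a=0, Normal with sigma=1 around mu_b
    p_constraint = norm.pdf(0.0, loc=mu_b, scale=1.0)
    return p_main * p_constraint

print(constrained_likelihood(mu=1.0, mu_b=0.0))
print(constrained_likelihood(mu=1.0, mu_b=1.0))
```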
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this simple example, let's consider a constraint term of a Normal distribution centered at $\\mu=0$ (\"auxiliary measurement\" $a=0$) with $\\sigma=1$ for constraining the normalization on the background where an up-variation ($\\mu_b = +1$) scales by 1.3, and a down-variation ($\\mu_b = -1$) scales by 0.8."
]
},
{
@@ -259,11 +274,13 @@
"metadata": {},
"outputs": [],
"source": [
"def gaussian_constraint(mu_b=0.0):\n",
" return norm.pdf(mu_b, loc=0.0, scale=1.0)\n",
"def normal_constraint(mu_b=0.0):\n",
" # auxiliary measurement of 0\n",
" # though note that for Normal observation and mean are symmetric under exchange\n",
" return norm.pdf(0.0, loc=mu_b, scale=1.0)\n",
"\n",
"\n",
"# interpolating\n",
"# selected interpolation function\n",
"def interpolate(down, nom, up, alpha):\n",
" if alpha >= 0:\n",
" return (up - nom) * alpha + 1\n",
Expand All @@ -281,51 +298,51 @@
" print(f\"signal: {signal}\")\n",
" print(f\"background: {background}\")\n",
" print(f\"observed: {observed}\")\n",
" print(\n",
" f\"likelihood scaled by: {gaussian_constraint(mu_b)/gaussian_constraint(0.0)}\\n\"\n",
" )\n",
" print(f\"likelihood scaled by: {normal_constraint(mu_b)/normal_constraint(0.0)}\\n\")\n",
"\n",
" fig, ax = plt.subplots()\n",
" ax.bar(bins, background, 1.0, label=r\"$t\\bar{t}$\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(bins, background, 1.0, label=r\"Background\", edgecolor=\"red\", alpha=0.5)\n",
" ax.bar(\n",
" bins,\n",
" signal,\n",
" 1.0,\n",
" label=r\"signal\",\n",
" label=r\"Signal\",\n",
" edgecolor=\"blue\",\n",
" bottom=background,\n",
" alpha=0.5,\n",
" )\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"observed\")\n",
" ax.scatter(bins, [3, 4, 4], color=\"black\", label=\"Observed\")\n",
" ax.set_ylim(0, 6)\n",
" ax.legend();"
" ax.legend()\n",
" ax.set_xlabel(\"Observable\", fontsize=12)\n",
" ax.set_ylabel(\"Count\", fontsize=12);"
]
},
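As a quick sanity check of the interpolation above (assuming the negative-`alpha` branch elided from this diff mirrors the positive one as `(nom - down) * alpha + 1`), the endpoints reproduce the quoted variations:

```python
# assuming interpolate() from the cell above
for alpha in (-1.0, 0.0, 1.0):
    print(alpha, interpolate(0.8, 1.0, 1.3, alpha))
# expected: -1.0 -> 0.8, 0.0 -> 1.0, 1.0 -> 1.3
```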
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But that's not all! Notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different signal regions and control regions. Therefore, we compute the likelihood as\n",
"However, notice that all along, we've been only discussing a single \"channel\" with 3 bins. The statistical analysis being studied might involve **multiple channels** corresponding to different analysis signal regions and control regions. Therefore, we compute the likelihood as\n",
"\n",
"$$\n",
"p_\\text{main} = p_\\text{channel1} * p_\\text{channel2} * p_\\text{channel3} \\cdots\n",
"$$\n",
"\n",
"So in fact, we can then expand out the likelihood definition further\n",
"We then expand out the likelihood definition further across channels\n",
"\n",
"$$\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}} \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}}\\, \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"$$\n",
"\n",
"As you can see, this is sort of a bookkeeping problem. We have two pieces of this likelihood:\n",
"* our main model, which consists of\n",
"There are now two pieces of the model:\n",
"* the main model, which consists of\n",
" * several channels (regions, histograms, etc), where\n",
" * each channel is a set of simultaneous Poissons measuring the bin count against an expected value, where\n",
" * each channel is a set of Poissons measuring the bin count for an expected value, where\n",
" * the expected value is the sum of various samples, where\n",
" * each samples expected value can be a function of parameters (or modifiers)\n",
"* the constraint model, which consists of\n",
" * several auxiliary measurements, where\n",
" * each measurement comes with auxiliary data\n",
" * constraint terms on model parameters, where\n",
" * each constraint term describes auxiliary measurements\n",
" \n",
"It should be clear by now that this is quite a lot of pieces to keep track of. This is where HistFactory comes in to play. Using HistFactory, we can\n",
"* describe observed event rates and expected event rates\n",
@@ -433,7 +450,7 @@
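All of this bookkeeping is exactly what `pyhf` automates. A minimal single-channel sketch (yields assumed; `uncorrelated_background` adds one constrained parameter per background bin):

```python
import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[1.0, 2.0, 1.5],           # assumed
    bkg=[2.5, 2.0, 2.5],              # assumed
    bkg_uncertainty=[0.5, 0.4, 0.5],  # assumed
)

# data is the observed main-model counts plus the auxiliary data
data = [3, 4, 4] + model.config.auxdata

# log of p(n, a | theta) at the suggested initial parameter values
init_pars = model.config.suggested_init()
print(model.logpdf(init_pars, data))
```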
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
Expand All @@ -447,7 +464,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.7"
"version": "3.10.4"
}
},
"nbformat": 4,