Merge pull request #52 from zuoxunwu/Xunwu_Spring2021_BuBc_release
Merge BuBc code from Xunwu for release for publication
zuoxunwu authored Aug 3, 2023
2 parents c3ff9ae + b0e8cd5 commit 803daef
Showing 30 changed files with 4,799 additions and 3 deletions.
54 changes: 54 additions & 0 deletions case-studies/flavour/BuBc2TauNu/README.md
@@ -0,0 +1,54 @@
## Combined search for Bu and Bc to tau + nu decays

### Overview

This analysis is a spin-off of the Bc2TauNu search. \
It inherits the basic steps from Bc2TauNu, but treats Bc2TauNu and Bu2TauNu as two separate signals. \
This analysis will also be used as a benchmark to help optimize the generic [FCC analysis workflow](https://github.com/HEP-FCC/FCCAnalyses).

### Contact

Clement Helsens <clement.helsens@cern.ch> \
Xunwu Zuo <xunwu.zuo@cern.ch>


### Scripts

Config files with common variables:
- `userConfig.py`: paths and common variables.
- `decay_mode_xs.py`: definitions of the branching ratios and cross-sections for the exclusive B-hadron and C-hadron modes considered as background.
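The dictionaries in `decay_mode_xs.py` are consumed by the spline script below via `prod[hadron]` and `modes[hadron][decay]`. A hypothetical sketch of the layout (the hadron names and numerical values here are placeholders, not the real analysis entries):

```python
# Hypothetical layout consistent with how bkg_eff_MVA_spline.py reads decay_mode_xs;
# hadron names and numbers below are placeholders, not the real analysis values.
prod = {"Bd": 0.43, "Dp": 0.25}            # production fractions per hadron species
modes = {"Bd": {"DTauNu": 1.1e-2},         # branching ratios: hadron -> {decay: BR}
         "Dp": {"KPiPi": 9.4e-2}}
b_hadrons = ["Bd"]                         # B-hadron species considered as background
c_hadrons = ["Dp"]                         # C-hadron species considered as background

# per-mode normalisation weight, as combined with efficiencies in the spline script
weight_Bd_DTauNu = prod["Bd"] * modes["Bd"]["DTauNu"]
```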

The following scripts are used to perform the BDT trainings:
- `process_sig_bkg_samples_for_xgb.py`: select a fraction of events from the first-stage training ntuples (ROOT) and convert them to pickle files (pandas DataFrames).
- `train_xgb.py`: train the first-stage binary BDT. Takes the pickle files from the previous step as inputs.
- `train_xgb_stage2.py`: train the second-stage multi-class BDT. Takes the second-stage training ntuples as inputs.
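The conversion step can be sketched as below. This is a minimal illustration only; the real `process_sig_bkg_samples_for_xgb.py` reads ROOT ntuples with uproot, and the column names, fraction, and output path here are placeholders:

```python
import os
import tempfile
import numpy as np
import pandas as pd

def convert_to_pickle(df, out_path, frac=0.1, seed=42):
    """Keep a random fraction of events and persist them as a pandas pickle."""
    sampled = df.sample(frac=frac, random_state=seed)  # subsample to keep training sets manageable
    sampled.to_pickle(out_path)                        # read back later with pd.read_pickle
    return sampled

# toy stand-in for a flattened training ntuple
rng = np.random.default_rng(0)
df = pd.DataFrame({"EVT_ThrustEmin_E": rng.uniform(0., 50., 1000),
                   "EVT_ThrustEmax_E": rng.uniform(0., 50., 1000)})
out_path = os.path.join(tempfile.gettempdir(), "toy_train.pkl")
sampled = convert_to_pickle(df, out_path, frac=0.1)
```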

The following scripts produce plots of the MVA performance and validate against overtraining:
- `plot_bdt_vars_stage1.py`: plot all input variables of the first-stage BDT. Takes the first-stage training and testing ntuples as inputs.
- `plot_bdt_vars_stage2.py`: plot all input variables of the second-stage BDT. Takes the second-stage training and testing ntuples as inputs.
- `plot_xgb.py`: make performance plots for the first-stage BDT. Takes the first-stage pickle files as inputs.
- `plot_xgb_stage2.py`: make performance plots for the second-stage BDT. Takes the second-stage training ntuples as inputs.
- `overtrain_test_stage1.py`: validate the first-stage BDT against overtraining. Takes the first-stage training and testing ntuples as inputs.
- `overtrain_test_stage2.py`: validate the second-stage BDT against overtraining. Takes the second-stage training and testing ntuples as inputs.
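Overtraining checks of this kind typically compare the BDT response on training and testing samples and quantify the agreement, for example with a two-sample Kolmogorov-Smirnov test. A minimal sketch with toy scores in place of real BDT outputs (the distribution and sample sizes are placeholders):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# toy BDT scores: train and test drawn from the same distribution (no overtraining)
train_scores = rng.beta(2., 5., size=5000)
test_scores = rng.beta(2., 5., size=5000)

# The KS statistic is the maximum distance between the two empirical CDFs;
# a large value (small p-value) would signal that the BDT memorised the training sample.
stat, p_value = ks_2samp(train_scores, test_scores)
```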

The statistical analysis is performed with the following steps:
- `bkg_eff_decide_baseline_sel.py`: scan different MVA cuts and check the changes in (inclusive) background efficiency and variable shapes. The variable chosen for the final fit needs to be independent of the BDT cuts.
- `bkg_eff_compare_incl_and_excl.py`: after the baseline selection, compare efficiencies for further cuts between inclusive and exclusive samples, to make sure the exclusive samples give good estimates of the background efficiency.
- `bkg_eff_make_tight_sel.py`: evaluate efficiencies in the exclusive samples after the tight selection.
- `bkg_eff_MVA_spline.py`: in each category, fit event efficiencies as a function of the MVA1 and MVA2 (Bu or Bc) cut values, and output splines.
- `final_sel_pretrim_signal.py`: process signal samples to skim events, purely to save time when loading samples in the final analysis. There is no need to pretrim backgrounds, as they are quick to load.
- `final_sel_estimate_purity.py`: read the splines and estimate signal and background yields. Scan MVA cuts for optimal signal purity and output the results to `yields.json`.
- `fit_simultaneous_template.py`: generate pseudo-data and perform a combined fit in the two categories. If more than one pseudo-dataset is generated, the results are written to `fit_results.json`.
- `fit_multidim_simultaneous_template.py`: (DEPRECATED) a modification of `fit_simultaneous_template.py` to fit multi-dimensional distributions.
- `check_sig_shape_variation.py`: split the signal samples into a few subsets and compare the signal shapes to estimate the signal shape uncertainty.
- `fit_test_sig_shape_uncert.py`: a duplicate of `fit_simultaneous_template.py`, with an additional setup for the signal shape uncertainty.
- `toy_study_stats.py`: take the fit results from the previous fit step and extract the total uncertainty on the signal strength.
- `toy_plot_summary.py`: read the results from the previous toy-study step and make a summary plot for several toy groups.
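The core of the spline step (see `bkg_eff_MVA_spline.py` included in this commit) is a reverse cumulative histogram of the transformed score, normalised to an efficiency per cut value and smoothed with a spline. A stripped-down sketch with toy scores (the binning and range are placeholders):

```python
import numpy as np
from scipy import interpolate

rng = np.random.default_rng(1)
# toy transformed scores; -log(1 - u) for uniform u is exponentially distributed
scores = -np.log(1. - rng.uniform(0., 1., 100000))

counts, bin_edges = np.histogram(scores, bins=50, range=(0., 5.))
# reverse cumulative sum: number of events passing each cut value
counts = np.flip(np.flip(counts).cumsum())
eff = counts / len(scores)

cut_bound = bin_edges[:-1]
spline = interpolate.splrep(cut_bound, eff)            # interpolating spline (no weights)
eff_at = lambda c: float(interpolate.splev(c, spline))
```

The efficiency at an arbitrary cut, say 1.0, can then be read off the spline with `eff_at(1.0)`.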

A small file with a custom-defined function:
- `double_sided_gaussian.py`: a double-sided Gaussian function in zfit, used in `toy_study_stats.py`. It would be better to replace this with a more centralized implementation.
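For reference, a double-sided Gaussian shares one mean but uses different widths below and above it. A plain-numpy sketch of the shape (unnormalised; the actual `double_sided_gaussian.py` wraps this form as a zfit PDF):

```python
import numpy as np

def double_sided_gaussian(x, mu, sigma_l, sigma_r):
    """Unnormalised double-sided Gaussian: width sigma_l below mu, sigma_r above."""
    x = np.asarray(x, dtype=float)
    sigma = np.where(x < mu, sigma_l, sigma_r)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# one sigma below and one sigma above the peak give the same height by construction
vals = double_sided_gaussian(np.array([-1., 0., 2.]), mu=0., sigma_l=1., sigma_r=2.)
```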


An additional script studies the decay chains in background events after the MVA selections, in order to understand which processes need to be generated for the exclusive background samples:
- `survey_exclusive_modes.py`: takes a special sample set (produced by `analysis_extra_bkg_composition.py` in FCCAnalysis) as input. It loops through events with high MVA scores and rebuilds the decay chains that are selected as signal-like decays. The output is a table in a text file.

306 changes: 306 additions & 0 deletions case-studies/flavour/BuBc2TauNu/bkg_eff_MVA_spline.py
@@ -0,0 +1,306 @@
import sys, os, argparse
import json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import joblib
from collections import OrderedDict
import uproot
import pickle
from decay_mode_xs import modes as bkg_modes
from decay_mode_xs import prod, b_hadrons, c_hadrons
from scipy import interpolate
import time

#Local code
from userConfig import loc, train_vars, train_vars_vtx, Ediff_cut, NBin_MVA_fit, MVA_cuts, FCC_label
import plotting
import utils as ut

from matplotlib import rc
rc('font',**{'family':'serif','serif':['Roman']})
rc('text', usetex=True)



def run(cat, doPlot):

    # BDT variables to optimise, and the hemisphere energy difference which we cut on > 10 GeV
    var_list = ["EVT_MVA1Bis", "EVT_MVA2_bc", "EVT_MVA2_bu", "EVT_MVA2_bkg", "EVT_ThrustEmax_E", "EVT_ThrustEmin_E"]
    path = loc.ANALYSIS

    EVT_MVA2 = 'EVT_MVA2_bc'
    if cat == 'bu':
        EVT_MVA2 = 'EVT_MVA2_bu'

    Cut_truth = 'CUT_CandTruth==0 and CUT_CandTruth2==0'
    Cut_sel = f'{Cut_truth} and CUT_CandRho==1 and CUT_CandVtxThrustEmin==1 and EVT_CandMass < 1.8 and EVT_ThrustDiff_E > {Ediff_cut}'
    cut = f"EVT_MVA1Bis > {MVA_cuts['tight']['MVA1']} and {EVT_MVA2} > {MVA_cuts['tight']['MVA2_sig']} and 1 - EVT_MVA2_bkg > {MVA_cuts['tight']['MVA2_bkg']} and {Cut_sel}"
    # cut = f"EVT_MVA1Bis > {MVA_cuts['tight']['MVA1']} and {EVT_MVA2} > {MVA_cuts['tight']['MVA2']} and {Cut_sel}"

    # Background decays: [sample name, production fraction, branching ratio]
    modes = OrderedDict()
    modes['bb'] = OrderedDict()
    modes['cc'] = OrderedDict()
    bb_prefix = "p8_ee_Zbb_ecm91_EvtGen"
    cc_prefix = "p8_ee_Zcc_ecm91_EvtGen"
    for b in b_hadrons:
        for d in bkg_modes[b]:
            modes['bb'][f"{b}_{d}"] = [f"{bb_prefix}_{b}2{d}", prod[b], bkg_modes[b][d]]

    for c in c_hadrons:
        for d in bkg_modes[c]:
            modes['cc'][f"{c}_{d}"] = [f"{cc_prefix}_{c}2{d}", prod[c], bkg_modes[c][d]]

    tree = {}
    N_gen = {}
    df = {}
    for flav in ['bb', 'cc']:
        for m in modes[flav]:
            N_gen[m] = uproot.open(f"{path}/{modes[flav][m][0]}.root")["eventsProcessed"].value
            tree[m] = uproot.open(f"{path}/{modes[flav][m][0]}.root")["events"]
            df[m] = tree[m].arrays(library="pd", how="zip", filter_name=["EVT_*", "CUT_*"])
            df[m] = df[m].query(cut)
            df[m] = df[m][var_list]
            df[m]["log_EVT_MVA1"] = -np.log(1. - df[m]["EVT_MVA1Bis"])
            df[m]["log_EVT_MVA2_bkg"] = -np.log(df[m]["EVT_MVA2_bkg"])
            df[m]["log_EVT_MVA2_sig"] = -np.log(1. - df[m][EVT_MVA2])
            print(f"{m} : {len(df[m])}")


    if doPlot:
        # Inclusive Z->bb and Z->cc samples, used only to cross-check the exclusive estimates
        for flav in ['bb', 'cc']:
            tree_inc = uproot.open(f"{path}/p8_ee_Z{flav}_ecm91.root")["events"]
            df[f'{flav}_inc'] = tree_inc.arrays(library="pd", how="zip", filter_name=["EVT_*", "CUT_*"])
            df[f'{flav}_inc'] = df[f'{flav}_inc'].query(cut)
            df[f'{flav}_inc'] = df[f'{flav}_inc'][var_list]
            df[f'{flav}_inc']["log_EVT_MVA1"] = -np.log(1. - df[f'{flav}_inc']["EVT_MVA1Bis"])
            df[f'{flav}_inc']["log_EVT_MVA2_bkg"] = -np.log(df[f'{flav}_inc']["EVT_MVA2_bkg"])
            df[f'{flav}_inc']["log_EVT_MVA2_sig"] = -np.log(1. - df[f'{flav}_inc'][EVT_MVA2])
            print(f"{flav} : {len(df[f'{flav}_inc'])}")


    # Make spline for MVA1
    print('\ngetting MVA1 spline\n-------------------')
    time_start = time.time()
    xmin = MVA_cuts["spline"][f'MVA1_in_{cat}']['xmin']
    xmax = MVA_cuts["spline"][f'MVA1_in_{cat}']['xmax']

    base = {}
    count = {}
    weight = {}
    eff_scan = {}
    spline = {}

    for m in df:
        base[m] = len(df[m])
    for flav in ['bb', 'cc']:
        base[flav] = 0
        eff_scan[flav] = 0
        weight[flav] = 0
        bin_edges = 0

        for m in modes[flav]:
            count[m], bin_edges = np.histogram(df[m]["log_EVT_MVA1"], NBin_MVA_fit['MVA1'], range=(xmin, xmax))
            # reverse cumulative sum: number of events above each cut value
            count[m] = np.flip(count[m])
            count[m] = count[m].cumsum()
            count[m] = np.flip(count[m])
            weight[m] = np.sqrt(count[m])
            base[flav] += base[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
            eff_scan[flav] += count[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
            weight[flav] += weight[m]
        eff_scan[flav] = eff_scan[flav] / base[flav]
        cut_bound = bin_edges[:-1]
        spline[flav] = interpolate.splrep(cut_bound, eff_scan[flav])  # do not apply weight w=weight[flav]:
        # weighted splines are heavily skewed from actual eff values

        if doPlot:
            eff_inc, bin_edges = np.histogram(df[f'{flav}_inc']["log_EVT_MVA1"], NBin_MVA_fit['MVA1'], range=(xmin, xmax))
            eff_inc = np.flip(eff_inc)
            eff_inc = eff_inc.cumsum()
            eff_inc = np.flip(eff_inc)
            eff_inc = eff_inc / len(df[f'{flav}_inc'])

            eff_spl = interpolate.splev(cut_bound, spline[flav])
            fig, ax = plt.subplots(figsize=(12, 8))
            plt.plot(cut_bound, eff_inc, color='red', label='Inclusive sample', linewidth=3)
            plt.plot(cut_bound, eff_scan[flav], color='blue', label='Exclusive samples', linewidth=3)
            plt.plot(cut_bound, eff_spl, color='black', label='Spline interpolation', linestyle='dashed')
            ax.tick_params(axis='both', which='major', labelsize=20)
            ax.set_title(FCC_label, loc='right', fontsize=20)
            plt.xlim(xmin, xmax)
            plt.xlabel('log(1-BDT1)', fontsize=30)
            plt.ylabel("Efficiency", fontsize=30)
            plt.yscale('log')
            plt.ylim(1e-6, 1)
            plt.legend(fontsize=18, loc="upper right")
            plt.grid(alpha=0.4, which="both")
            plt.tight_layout()
            fig.savefig(f"{loc.PLOTS}/cross_check_spline_MVA1_in_{cat}_{flav}.pdf")

        print(eff_scan[flav])
        print(weight[flav])
        with open(f'{loc.PKL}/spline/{cat}_MVA1_scan_{flav}_spline.pkl', 'wb') as f:
            pickle.dump(spline[flav], f)
    time_stop = time.time()
    print(f'Time for forming MVA1 spline: {time_stop - time_start}')

    # Make grid spline for MVA2
    print('\ngetting MVA2 spline\n-------------------')
    time_start = time.time()
    sig_min = MVA_cuts["spline"][f'MVA2_sig_in_{cat}']['xmin']
    sig_max = MVA_cuts["spline"][f'MVA2_sig_in_{cat}']['xmax']
    bkg_min = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmin']
    bkg_max = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmax']

    for flav in ['bb', 'cc']:
        base[flav] = 0
        eff_scan[flav] = 0
        weight[flav] = 0
        xedges, yedges = 0, 0

        for m in modes[flav]:
            count[m], xedges, yedges = np.histogram2d(df[m]["log_EVT_MVA2_sig"], df[m]["log_EVT_MVA2_bkg"], bins=[NBin_MVA_fit['MVA2_sig'], NBin_MVA_fit['MVA2_bkg']], range=[[sig_min, sig_max], [bkg_min, bkg_max]])
            # 2D reverse cumulative sum: number of events above each pair of cut values
            count[m] = np.flip(count[m])
            count[m] = count[m].cumsum(axis=0)
            count[m] = count[m].cumsum(axis=1)
            count[m] = np.flip(count[m])

            weight[m] = np.sqrt(count[m])
            base[flav] += base[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
            eff_scan[flav] += count[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
            weight[flav] += weight[m]
        eff_scan[flav] = eff_scan[flav] / base[flav]
        cut_bound_x = xedges[:-1]
        cut_bound_y = yedges[:-1]
        spline[flav] = interpolate.RectBivariateSpline(cut_bound_x, cut_bound_y, eff_scan[flav])  # RectBivariateSpline does not take weights

        if doPlot:
            eff_inc, xedges, yedges = np.histogram2d(df[f'{flav}_inc']["log_EVT_MVA2_sig"], df[f'{flav}_inc']["log_EVT_MVA2_bkg"], bins=[NBin_MVA_fit['MVA2_sig'], NBin_MVA_fit['MVA2_bkg']], range=[[sig_min, sig_max], [bkg_min, bkg_max]])
            eff_inc = np.flip(eff_inc)
            eff_inc = eff_inc.cumsum(axis=0)
            eff_inc = eff_inc.cumsum(axis=1)
            eff_inc = np.flip(eff_inc)
            eff_inc = eff_inc / len(df[f'{flav}_inc'])

            # Cross-check the spline along the signal-score axis (background cut at its minimum)
            eff_spl = [spline[flav].ev(cut_val, bkg_min) for cut_val in cut_bound_x]

            fig, ax = plt.subplots(figsize=(12, 8))
            plt.plot(cut_bound_x, eff_inc[:, 0], color='red', label='Inclusive sample', linewidth=3)
            plt.plot(cut_bound_x, eff_scan[flav][:, 0], color='blue', label='Exclusive samples', linewidth=3)
            plt.plot(cut_bound_x, eff_spl, color='black', label='Spline interpolation', linestyle='dashed')
            ax.tick_params(axis='both', which='major', labelsize=20)
            ax.set_title(FCC_label, loc='right', fontsize=20)
            plt.xlim(sig_min, sig_max)
            plt.xlabel(f'log(1-BDT2 {cat})', fontsize=30)
            plt.ylabel("Efficiency", fontsize=30)
            plt.yscale('log')
            plt.ylim(1e-6, 1)
            plt.legend(fontsize=18, loc="upper right")
            plt.grid(alpha=0.4, which="both")
            plt.tight_layout()
            fig.savefig(f"{loc.PLOTS}/cross_check_spline_MVA2_sig_in_{cat}_{flav}.pdf")

            # Same cross-check along the background-score axis (signal cut at its minimum)
            eff_spl = [spline[flav].ev(sig_min, cut_val) for cut_val in cut_bound_y]
            fig, ax = plt.subplots(figsize=(12, 8))
            plt.plot(cut_bound_y, eff_inc[0, :], color='red', label='Inclusive sample', linewidth=3)
            plt.plot(cut_bound_y, eff_scan[flav][0, :], color='blue', label='Exclusive samples', linewidth=3)
            plt.plot(cut_bound_y, eff_spl, color='black', label='Spline interpolation', linestyle='dashed')
            ax.tick_params(axis='both', which='major', labelsize=20)
            ax.set_title(FCC_label, loc='right', fontsize=20)
            plt.xlim(bkg_min, bkg_max)
            plt.xlabel('log(BDT2 bkg)', fontsize=30)
            plt.ylabel("Efficiency", fontsize=30)
            plt.yscale('log')
            plt.ylim(1e-6, 1)
            plt.legend(fontsize=18, loc="upper right")
            plt.grid(alpha=0.4, which="both")
            plt.tight_layout()
            fig.savefig(f"{loc.PLOTS}/cross_check_spline_MVA2_bkg_in_{cat}_{flav}.pdf")

        print(eff_scan[flav])
        print(weight[flav])
        with open(f'{loc.PKL}/spline/{cat}_MVA2_2d_scan_{flav}_spline.pkl', 'wb') as f:
            pickle.dump(spline[flav], f)
    time_stop = time.time()
    print(f'Time for forming MVA2 spline: {time_stop - time_start}')

# Make 1D spline for MVA2_bkg, as a cross check
# print ('\ngetting MVA2 spline\n-------------------')
# time_start = time.time()
# bkg_min = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmin']
# bkg_max = MVA_cuts["spline"][f'MVA2_bkg_in_{cat}']['xmax']
#
# for flav in ['bb', 'cc']:
# base[flav] = 0
# eff_scan[flav] = 0
# weight[flav] = 0
# bin_edges = 0
#
# for m in modes[flav]:
# count[m], bin_edges = np.histogram(df[m]["log_EVT_MVA2_bkg"], NBin_MVA_fit['MVA1'], range=(bkg_min, bkg_max))
# count[m] = np.flip(count[m])
# count[m] = count[m].cumsum()
# count[m] = np.flip(count[m])
# weight[m] = np.sqrt(count[m])
# base[flav] += base[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
# eff_scan[flav] += count[m] / N_gen[m] * modes[flav][m][1] * modes[flav][m][2]
# weight[flav] += weight[m]
# eff_scan[flav] = eff_scan[flav] / base[flav]
# cut_bound = bin_edges[:-1]
# spline[flav] = interpolate.splrep(cut_bound, eff_scan[flav], w=weight[flav])
#
# print (eff_scan[flav])
# print (weight[flav])
# with open(f'{loc.PKL}/spline/{cat}_MVA2_1d_bkg_scan_{flav}_spline.pkl', 'wb') as f:
# pickle.dump(spline[flav], f)
# time_stop = time.time()
# print (f'Time for forming MVA2 spline: {time_stop - time_start}')




def main():
    parser = argparse.ArgumentParser(description='Estimate optimal cuts and associated yields')
    parser.add_argument("--cat", choices=['bu', 'bc'], required=False, default='bc')
    # argparse does not convert command-line strings to booleans, so
    # choices=[True, False] would never match; parse the string explicitly
    parser.add_argument("--doPlot", type=lambda s: str(s).lower() in ('true', '1', 'yes'), required=False, default=True)
    args = parser.parse_args()

    run(args.cat, args.doPlot)


if __name__ == '__main__':
    main()
