Artem Ryzhikov$^{1,2}$, Andrey Ustyuzhanin$^{1,2,3}$
ACAT 2017 University of Washington, Seattle, August 21-25, 2017
In this research, a new approach for finding rare events in high-energy physics was tested, using the decay $τ → 3µ$ as the example physics channel.
The method shows significantly better results than Data Doping [2]. Moreover, gradient reversal offers extra flexibility: it can also enforce flatness of the network output with respect to chosen variables (e.g. nuisance parameters). The described approach is therefore a strong candidate for a new state of the art.
Many machine learning approaches are applied to filtering rare events in high-energy physics: decision trees, linear models, boosting, and neural networks have all found wide application. Today, with the advent of new deep learning techniques, new frameworks, and greater computing power, neural networks are becoming more and more relevant. However, applying many machine learning methods to high-energy physics problems is difficult or even impossible, and neural networks are no exception. The reasons are poor generalization quality and the many physical constraints imposed on ML models: besides high metric values (as in classic ML problems), models should be physically interpretable. For example, a model able to identify events from only one specific background channel is of little interest to physicists; models able to identify a whole family of events with common physics are far more valuable.

In this research we tried a new approach for finding rare events in high-energy physics. The method is based on cross-domain adaptation with gradient reversal [1] and takes the form of a dense multi-branch neural network: the first branch is responsible for signal detection, while the other branches help to avoid overfitting to Monte Carlo artifacts and to the mass peak. Our results show that the explored architecture outperforms the Data Doping technique [2]: it avoids overfitting without any loss of quality. The described approach is therefore a strong candidate for a new state of the art.
Using Monte Carlo-generated samples is a fairly common approach in high-energy physics. However, it is often hard to include all physics factors in a Monte Carlo simulation, and not all variables can be simulated accurately enough. The discrepancies lead either to (a) expensive simulation of both signal and background, or (b) ML models that, trained on a simulated sample, overfit to simulation artifacts and work poorly on real data.
This issue is especially acute for rare-signal detection, where it is too hard to collect enough real signal events for a training dataset.
Figure 1. Training on the mixture of simulated (MC) and real data

In this research we use $τ → 3µ$ events as the signal (analysis) channel, as published in the data-science challenge on kaggle.com [3]. The challenge is three-fold:
- Since the classifier is trained on a mixture of simulated signal and real data background, it is possible to reach a high performance by exploiting features that are not perfectly modeled in the simulation. We require that the classifier should not have a large discrepancy when applied to real and simulated data. To verify this, we use a control channel,
$D_s → φπ$, which has a similar topology to the signal decay $τ → 3µ$ (analysis channel). $D_s → φπ$ is a well-known decay that occurs much more frequently. The goal is therefore to train a classifier able to separate A from B but not C from D (Figure 1). A Kolmogorov–Smirnov (KS) test is used to evaluate the difference between the classifier's output distributions on each sample; in our problem it is computed between the prediction distributions for real and simulated data in the $D_s → φπ$ channel, and its value must be below 0.09.
- The classifier output should not be correlated with the reconstructed-mass feature, i.e. its output distribution should not sculpt artificial bumps that could be interpreted as a (false) signal. Flatness is tested with a Cramér–von Mises (CvM) test, which measures the uniformity of the distribution [4].
- The quality of signal discrimination should be as high as possible. The evaluation metric for signal discrimination is the weighted area under the ROC curve (truncated AUC) [3].
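The KS requirement above can be checked with a standard two-sample test. A minimal sketch using SciPy, where the score arrays are random stand-ins for the classifier's predictions on real and simulated control-channel events:

```python
# Illustrative check of the KS requirement: compare classifier outputs on
# real vs. simulated D_s -> phi pi events. The beta-distributed scores
# below are stand-ins for actual model predictions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_real = rng.beta(2.0, 5.0, size=10_000)  # stand-in: real data scores
scores_mc = rng.beta(2.0, 5.0, size=10_000)    # stand-in: simulated scores

ks_stat, _ = ks_2samp(scores_real, scores_mc)
passes = ks_stat < 0.09  # the challenge threshold quoted above
```

Identically distributed samples give a small KS statistic; a classifier that exploits simulation artifacts would separate the two samples and push the statistic above the threshold.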
At the moment of publication, Data Doping [2] was the best technique for training a classifier on Monte Carlo data without overfitting, so we selected it as the baseline. The idea of Data Doping is to “dope” the training set with a small number of Monte Carlo events from the control channel C, labeled as background. This discourages the classifier from using features that discriminate real data from simulation. The technique is shown in Figure 3 below.
Figure 3. Data doping

The optimal number of doping events was taken from [2].
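The doping step itself can be sketched in a few lines. This is an illustrative helper, not the code from [2]; the function name, array layout, and default doping size are assumptions:

```python
# Hypothetical sketch of Data Doping: append a small number of simulated
# control-channel (C) events to the training set, labeled as background
# (0), so the classifier is penalized for learning MC-vs-real artifacts.
import numpy as np

def dope_training_set(X_train, y_train, X_control_mc, n_dope=300, seed=0):
    """Return the training set with n_dope control-channel MC events
    appended as background and the result shuffled."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_control_mc), size=n_dope, replace=False)
    X = np.vstack([X_train, X_control_mc[idx]])
    y = np.concatenate([y_train, np.zeros(n_dope)])  # doped events -> label 0
    perm = rng.permutation(len(X))  # shuffle so doped events mix into batches
    return X[perm], y[perm]
```

The doped events carry the wrong physics label on purpose: any feature that separates them from real background is a simulation artifact, and using it now costs the classifier accuracy.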
As an alternative to Data Doping, we explored a new method based on cross-domain adaptation with gradient reversal [1]. The concept is similar to GANs (Generative Adversarial Networks) [7]: we use an additional branch trained to discriminate real from simulated data (a discriminator), but we reverse the gradient flowing from the discriminator so that the overall model becomes unable to tell real data from simulation. This is visualized in Figure 4.
Figure 4. Cross-domain adaptation with gradient reversal
Instead of a dedicated gradient-reversal layer, we placed the reversal into the loss sign, using the simple objective $L = L_{label} - \lambda \cdot L_{domain}$, where $\lambda$ controls the strength of the domain adaptation.
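A sketch of this sign-flip formulation (in PyTorch for illustration; the original implementation used Lasagne, and the weighting hyperparameter `lambda_` is an assumption):

```python
# Illustrative sign-flip loss: minimizing loss_label - lambda * loss_domain
# pushes the shared features towards being *uninformative* about the
# domain (real vs. MC), which is exactly the gradient-reversal effect.
import torch

def total_loss(label_logits, labels, domain_logits, domains, lambda_=1.0):
    ce = torch.nn.functional.cross_entropy
    loss_label = ce(label_logits, labels)     # signal/background term
    loss_domain = ce(domain_logits, domains)  # real/MC term, sign-flipped
    return loss_label - lambda_ * loss_domain
```

With `lambda_ = 0` this reduces to plain classification; increasing it trades discrimination power for domain invariance, mirroring the `learning_rate_multiplier` tuning discussed below.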
The network architecture has a dense 2(3*)-branch structure (Figure 5) and consists of the following parts:
1. Feature extractor – responsible for feature generation
2. Label predictor – responsible for target prediction (signal/background discrimination)
3. Domain classifier – responsible for cross-domain adaptation; prevents the network from overfitting to the MC domain
4*. Mass predictor – helps to eliminate the correlation between classifier predictions and the reconstructed mass of the decay
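The branch layout above could be sketched as follows (illustrative PyTorch, not the original Lasagne code; layer widths and the class name are assumptions):

```python
# Minimal sketch of the multi-branch structure: one shared feature
# extractor feeding a label head, a domain head (gradient-reversed in
# training), and an optional mass head for decorrelation.
import torch
import torch.nn as nn

class MultiBranchNet(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        # 1. Feature extractor: shared representation
        self.features = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.label_head = nn.Linear(hidden, 2)   # 2. signal vs. background
        self.domain_head = nn.Linear(hidden, 2)  # 3. real vs. simulated
        self.mass_head = nn.Linear(hidden, 1)    # 4*. reconstructed mass

    def forward(self, x):
        z = self.features(x)
        return self.label_head(z), self.domain_head(z), self.mass_head(z)
```

All heads share the same features `z`, which is what lets the reversed domain gradient reshape the representation used by the label predictor.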
The training dataset (analysis channel) consists of 67,000+ signal ($τ → 3µ$) events mixed with real-data background.
The architecture was implemented in Python 2.7 using the Lasagne (ver. 2.1) framework. We tuned the following parameters to obtain stable results:
- the learning-rate ratio between branches (learning_rate_multiplier);
- the batch-size ratio between branches: the best observed values were 1000 and 300 for the Label predictor and Domain classifier respectively;
- the ratio of batches per epoch: the best observed ratio was 6:1 for the Label predictor and Domain classifier respectively.
The model was trained for 20 epochs with the RMSProp optimizer. To achieve the highest stability and reproducibility, we first trained only the Feature extractor and Label predictor for several (20) epochs with the Domain classifier frozen (first step); after that the Domain classifier was trained as well (second step).
In the first step, as mentioned above, only the Feature extractor and Label predictor were trained: 20 epochs of RMSProp optimization of the categorical (2-class) cross-entropy loss, with batch size 1000 and an initial learning rate of 0.01 decayed by a factor of 10 every 5 epochs. In the second step we restored the Feature extractor and Label predictor from the previous step and trained all parts. The Feature extractor and Label predictor were trained in the same way (20 epochs, same loss, batch size, initial learning rate, and decay policy), while the Domain classifier was trained with the same parameters once per 6 batches of Feature extractor/Label predictor training, with a different batch size (batch_size = 300).
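The schedule above reduces to two small rules, sketched here as plain helper functions (illustrative, not the original training code; function names are assumptions):

```python
# Schematic of the second-step schedule: a stepped learning-rate decay
# and the 6:1 alternation between Label-predictor and Domain-classifier
# batches described in the text.
def learning_rate(epoch, base_lr=0.01, decay=0.1, step=5):
    """Initial lr 0.01, divided by 10 every 5 epochs."""
    return base_lr * decay ** (epoch // step)

def batch_roles(n_batches, domain_every=6):
    """Yield which branch trains on each batch: the Domain classifier
    runs once per 6 Label-predictor batches (the 6:1 ratio above)."""
    for i in range(1, n_batches + 1):
        yield "domain" if i % domain_every == 0 else "label"
```

In the actual loop, a `"label"` batch (size 1000) updates the Feature extractor and Label predictor, while a `"domain"` batch (size 300) updates the Domain classifier.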
To reduce the KS value, we increased the Domain classifier's learning rate, its batch size, and its batch frequency. Figure 6 shows the dependency on one of these parameters. We observed that very small KS values make the CvM value higher and the AUC metric smaller, so the goal was to find a balance between KS, CvM, and AUC using the parameters described above.
Figure 6. Metric dependence on the Domain classifier's learning_rate_multiplier

In this research the following models were compared: Baseline (the Label predictor from Figure 5, without domain adaptation), Domain Adaptation (our approach), and Data Doping. The models were tested on 85,000+ signal ($τ → 3µ$) and background events:
| | AUC (truncated) | KS (< 0.09) | CvM (< 0.002) |
|---|---|---|---|
| Mass-aware Classifier | 0.999 | 0.18 | 0.0008 |
| Data Doping | 0.9744 | 0.087 | 0.0011 |
| Domain-adaptation | 0.979 | 0.06 | 0.0008 |
The method proposed is shown to work well on a typical particle physics analysis problem:
- Remarkable classification quality;
- Robustness to the MC / real data mixture;
- Uniformity of the output with respect to the chosen feature (mass or a nuisance parameter);
- A tunable tradeoff between discrimination power and overfitting (Figure 6).
[1] Y. Ganin and V. Lempitsky, "Unsupervised Domain Adaptation by Backpropagation", International Conference on Machine Learning, 2015.
[2] V. Gaitan, "Data Doping solution for the 'Flavours in Physics' challenge", https://indico.cern.ch/event/433556/contributions/1930582/
[3] Flavours of Physics competition, https://www.kaggle.com/c/flavours-of-physics
[4] A. Rogozhnikov, A. Bukva, V. Gligorov, A. Ustyuzhanin, M. Williams, "New approaches for boosting to uniformity", JINST, 2015.
[5] A. Ryzhikov, A. Ustyuzhanin, source code for the domain-adaptation research, https://github.com/Leensman/Cross-domain-adaptation-on-HEP-HSE-course-work-
[6] A. Ryzhikov, A. Ustyuzhanin, "Gradient reversal for MC/real data calibration", https://indico.cern.ch/event/567550/contributions/2629724/attachments/1513629/2361286/Ryzhikov_poster_v6.pdf
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, "Generative Adversarial Networks", https://arxiv.org/abs/1406.2661