Artem Ryzhikov$^{1,2}$, Andrey Ustyuzhanin$^{1,2,3}$
ACAT 2017 University of Washington, Seattle, August 21-25, 2017
In this research, a new approach for finding rare events in high-energy physics was tested, using the decay $τ → 3µ$ as the example physics channel.
The method shows significantly better results than Data Doping [2]. Moreover, gradient reversal offers extra flexibility: it can also enforce flatness of the network output with respect to chosen variables (e.g. nuisance parameters). The described approach is therefore a strong candidate for a new state of the art.
Many machine learning approaches are applied to filtering rare events in high-energy physics: decision trees, linear models, boosting, and neural networks have all found wide application. Today, with the advent of new deep learning techniques, new frameworks, and greater computing power, neural networks are becoming more and more relevant. However, applying many machine learning methods to high-energy physics problems is difficult or even impossible, and neural networks are no exception. The reasons are poor generalization quality and the many physical constraints imposed on ML models: besides high metric values (as in classic ML problems), models should be physically interpretable. For example, a model able to identify events from only one specific background channel is of little interest to physicists; models able to identify a whole family of events with common physics are far more valuable.

In this research we tried a new approach for finding rare events in high-energy physics. The method is based on cross-domain adaptation with gradient reversal [1] and takes the form of a dense multi-branch neural network: the first branch is responsible for signal detection, while the other branches help to avoid overfitting to Monte Carlo artifacts and to the mass peak. Our results show that the explored architecture outperforms the Data Doping technique [2]: it avoids overfitting without any loss of quality. The described approach is therefore a strong candidate for a new state of the art.
Using Monte Carlo-generated samples is a fairly common approach in high-energy physics. However, it is often hard to include all physics factors in a Monte Carlo simulation, and not all variables can be simulated accurately enough. The discrepancies lead either to (a) expensive simulation of both signal and background, or (b) ML models that, trained on a simulated sample, overfit to simulation artifacts and work poorly on real data.
This issue is especially acute for rare-signal detection, where it is too hard to collect enough real signal events for a training dataset.
Figure 1. Training on the mixture of simulated (MC) and real data

In this research we use $τ → 3µ$ events as the signal (analysis) channel, as published in the data-science challenge on kaggle.com [3]. The challenge is three-fold:
- Since the classifier is trained on a mixture of simulated signal and real data background, it is possible to reach a high performance by exploiting features that are not perfectly modeled in the simulation. We require that the classifier should not have a large discrepancy when applied to real and simulated data. To verify this, we use a control channel,
$D_s → φπ$, which has a similar topology to the signal decay $τ → 3µ$ (analysis channel). $D_s → φπ$ is a well-known decay that occurs much more frequently. The goal is therefore to train a classifier able to separate A from B but not C from D (Figure 1). A Kolmogorov–Smirnov (KS) test is used to evaluate the difference between the classifier's output distributions on each sample; in our problem it is computed between the prediction distributions for real and simulated data in the $D_s → φπ$ channel, and its value must be below 0.09.
- The classifier output should not be correlated with the reconstructed-mass feature, i.e. its output distribution should not sculpt artificial bumps that could be interpreted as a (false) signal. Flatness is tested with a Cramér–von Mises (CvM) test, which measures the uniformity of the distribution [4].
- The quality of signal discrimination should be as high as possible. The evaluation metric for signal discrimination is the weighted area under the ROC curve (truncated AUC) [3].
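The KS requirement above can be checked with a standard two-sample test. A minimal sketch using SciPy, where the score arrays are random stand-ins for the classifier's predictions on real and simulated control-channel events:

```python
# Illustrative check of the KS requirement: compare classifier outputs on
# real vs. simulated D_s -> phi pi events. The beta-distributed scores
# below are stand-ins for actual model predictions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_real = rng.beta(2.0, 5.0, size=10_000)  # stand-in: real data scores
scores_mc = rng.beta(2.0, 5.0, size=10_000)    # stand-in: simulated scores

ks_stat, _ = ks_2samp(scores_real, scores_mc)
passes = ks_stat < 0.09  # the challenge threshold quoted above
```

Identically distributed samples give a small KS statistic; a classifier that exploits simulation artifacts would separate the two samples and push the statistic above the threshold.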
At the moment of publication, Data Doping [2] was the best technique for training a classifier on Monte Carlo data without overfitting, so we selected it as the baseline. The idea of Data Doping is to “dope” the training set with a small number of Monte Carlo events from the control channel C, labeled as background. This discourages the classifier from using features that discriminate real data from simulation. The technique is shown in Figure 3 below.
Figure 3. Data doping

The optimal number of doping events was taken from [2].
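The doping step itself can be sketched in a few lines. This is an illustrative helper, not the code from [2]; the function name, array layout, and default doping size are assumptions:

```python
# Hypothetical sketch of Data Doping: append a small number of simulated
# control-channel (C) events to the training set, labeled as background
# (0), so the classifier is penalized for learning MC-vs-real artifacts.
import numpy as np

def dope_training_set(X_train, y_train, X_control_mc, n_dope=300, seed=0):
    """Return the training set with n_dope control-channel MC events
    appended as background and the result shuffled."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_control_mc), size=n_dope, replace=False)
    X = np.vstack([X_train, X_control_mc[idx]])
    y = np.concatenate([y_train, np.zeros(n_dope)])  # doped events -> label 0
    perm = rng.permutation(len(X))  # shuffle so doped events mix into batches
    return X[perm], y[perm]
```

The doped events carry the wrong physics label on purpose: any feature that separates them from real background is a simulation artifact, and using it now costs the classifier accuracy.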
As an alternative to Data Doping, we explored a new method based on cross-domain adaptation with gradient reversal [1]. The concept is similar to GANs (Generative Adversarial Networks) [7]: we use an additional branch trained to discriminate real from simulated data (a discriminator), but we reverse the gradient flowing from the discriminator so that the overall model becomes unable to tell real data from simulation. This is visualized in Figure 4.
Figure 4. Cross-domain adaptation with gradient reversal
Instead of a dedicated gradient-reversal layer, we placed the reversal into the loss sign, using the simple objective $L = L_{label} - \lambda \cdot L_{domain}$, where $\lambda$ controls the strength of the domain adaptation.
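A sketch of this sign-flip formulation (in PyTorch for illustration; the original implementation used Lasagne, and the weighting hyperparameter `lambda_` is an assumption):

```python
# Illustrative sign-flip loss: minimizing loss_label - lambda * loss_domain
# pushes the shared features towards being *uninformative* about the
# domain (real vs. MC), which is exactly the gradient-reversal effect.
import torch

def total_loss(label_logits, labels, domain_logits, domains, lambda_=1.0):
    ce = torch.nn.functional.cross_entropy
    loss_label = ce(label_logits, labels)     # signal/background term
    loss_domain = ce(domain_logits, domains)  # real/MC term, sign-flipped
    return loss_label - lambda_ * loss_domain
```

With `lambda_ = 0` this reduces to plain classification; increasing it trades discrimination power for domain invariance, mirroring the `learning_rate_multiplier` tuning discussed below.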
The network architecture has a dense 2(3*)-branch structure (Figure 5) and consists of the following parts:
1. Feature extractor – responsible for feature generation
2. Label predictor – responsible for target prediction (signal/background discrimination)
3. Domain classifier – responsible for cross-domain adaptation; prevents the network from overfitting to the MC domain
4*. Mass predictor – helps to eliminate the correlation between classifier predictions and the reconstructed mass of the decay
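The branch layout above could be sketched as follows (illustrative PyTorch, not the original Lasagne code; layer widths and the class name are assumptions):

```python
# Minimal sketch of the multi-branch structure: one shared feature
# extractor feeding a label head, a domain head (gradient-reversed in
# training), and an optional mass head for decorrelation.
import torch
import torch.nn as nn

class MultiBranchNet(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        # 1. Feature extractor: shared representation
        self.features = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.label_head = nn.Linear(hidden, 2)   # 2. signal vs. background
        self.domain_head = nn.Linear(hidden, 2)  # 3. real vs. simulated
        self.mass_head = nn.Linear(hidden, 1)    # 4*. reconstructed mass

    def forward(self, x):
        z = self.features(x)
        return self.label_head(z), self.domain_head(z), self.mass_head(z)
```

All heads share the same features `z`, which is what lets the reversed domain gradient reshape the representation used by the label predictor.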
The training dataset (analysis channel) consists of 67,000+ signal ($τ → 3µ$) events mixed with real-data background.
The architecture was implemented in Python 2.7 using the Lasagne (ver. 2.1) framework. We tuned the following parameters to obtain stable results:
- the learning-rate ratio between branches (learning_rate_multiplier);
- the batch-size ratio between branches: the best observed values were 1000 and 300 for the Label predictor and Domain classifier respectively;
- the ratio of batches per epoch: the best observed ratio was 6:1 for the Label predictor and Domain classifier respectively.
The model was trained for 20 epochs with the RMSProp optimizer. To achieve the highest stability and reproducibility, we first trained only the Feature extractor and Label predictor for several (20) epochs with the Domain classifier frozen (first step); after that the Domain classifier was trained as well (second step).
In the first step, as mentioned above, only the Feature extractor and Label predictor were trained: 20 epochs of RMSProp optimization of the categorical (2-class) cross-entropy loss, with batch size 1000 and an initial learning rate of 0.01 decayed by a factor of 10 every 5 epochs. In the second step we restored the Feature extractor and Label predictor from the previous step and trained all parts. The Feature extractor and Label predictor were trained in the same way (20 epochs, same loss, batch size, initial learning rate, and decay policy), while the Domain classifier was trained with the same parameters once per 6 batches of Feature extractor/Label predictor training, with a different batch size (batch_size = 300).
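The schedule above reduces to two small rules, sketched here as plain helper functions (illustrative, not the original training code; function names are assumptions):

```python
# Schematic of the second-step schedule: a stepped learning-rate decay
# and the 6:1 alternation between Label-predictor and Domain-classifier
# batches described in the text.
def learning_rate(epoch, base_lr=0.01, decay=0.1, step=5):
    """Initial lr 0.01, divided by 10 every 5 epochs."""
    return base_lr * decay ** (epoch // step)

def batch_roles(n_batches, domain_every=6):
    """Yield which branch trains on each batch: the Domain classifier
    runs once per 6 Label-predictor batches (the 6:1 ratio above)."""
    for i in range(1, n_batches + 1):
        yield "domain" if i % domain_every == 0 else "label"
```

In the actual loop, a `"label"` batch (size 1000) updates the Feature extractor and Label predictor, while a `"domain"` batch (size 300) updates the Domain classifier.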
To reduce the KS value, we increased the Domain classifier's learning rate, its batch size, and its batch frequency. Figure 6 shows the dependency on one of these parameters. We observed that very small KS values make the CvM value higher and the AUC metric smaller, so the goal was to find a balance between KS, CvM, and AUC using the parameters described above.
Figure 6. Metric dependence on the Domain classifier's learning_rate_multiplier

In this research the following models were compared: Baseline (the Label predictor from Figure 5, without domain adaptation), Domain Adaptation (our approach), and Data Doping. The models were tested on 85,000+ signal ($τ → 3µ$) and background events:
| | AUC (truncated) | KS (< 0.09) | CvM (< 0.002) |
|---|---|---|---|
| Mass-aware Classifier | 0.999 | 0.18 | 0.0008 |
| Data Doping | 0.9744 | 0.087 | 0.0011 |
| Domain-adaptation | 0.979 | 0.06 | 0.0008 |
The method proposed is shown to work well on a typical particle physics analysis problem:
- Remarkable classification quality;
- Robustness to the MC / real data mixture;
- Uniformity of the output with respect to the chosen feature (mass or a nuisance parameter);
- A tunable tradeoff between discrimination power and overfitting (Figure 6).
[1] Y. Ganin and V. Lempitsky, "Unsupervised Domain Adaptation by Backpropagation", International Conference on Machine Learning, 2015.
[2] V. Gaitan, "Data Doping solution for the 'Flavours in Physics' challenge", https://indico.cern.ch/event/433556/contributions/1930582/
[3] Flavours of Physics competition, https://www.kaggle.com/c/flavours-of-physics
[4] A. Rogozhnikov, A. Bukva, V. Gligorov, A. Ustyuzhanin, M. Williams, "New approaches for boosting to uniformity", JINST, 2015.
[5] A. Ryzhikov, A. Ustyuzhanin, source code for the domain-adaptation research, https://github.com/Leensman/Cross-domain-adaptation-on-HEP-HSE-course-work-
[6] A. Ryzhikov, A. Ustyuzhanin, "Gradient reversal for MC/real data calibration", https://indico.cern.ch/event/567550/contributions/2629724/attachments/1513629/2361286/Ryzhikov_poster_v6.pdf
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, "Generative Adversarial Networks", https://arxiv.org/abs/1406.2661