Particle-FM: Flow Matching for Particle Physics

Easily train and evaluate multiple generative models on various particle physics datasets

ℹ️ Description

Flow Matching (FM) combines different generative models under one framework (e.g. diffusion, score-based models, continuous normalizing flows) and allows for easy comparison of these models. This repository contains multiple (mostly) generative neural networks combined under the FM framework and multiple point cloud datasets from particle physics. With this repository, you can easily train and evaluate multiple models on various datasets.
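
As a rough illustration (a minimal sketch, not this repository's exact implementation), the conditional flow matching objective regresses a model's velocity field onto a straight-line path between noise and data:

# Minimal sketch of a conditional flow matching (CFM) loss in PyTorch.
# Illustration only -- not this repository's exact implementation.
import torch

def cfm_loss(model, x1):
    # x1: batch of data samples, shape (batch, features)
    x0 = torch.randn_like(x1)           # noise endpoint
    t = torch.rand(x1.shape[0], 1)      # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1         # linear probability path
    v_target = x1 - x0                  # target velocity along the path
    v_pred = model(x_t, t)              # model predicts the velocity field
    return torch.mean((v_pred - v_target) ** 2)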

This repository contains the code for the following papers, as well as additional models and datasets. For the code repositories that contain only the paper-specific code, please refer to the repositories linked in the papers.

🤖 Models

Generative Models

Architectures:

Loss Functions:

Classification Models

📊 Datasets

Click on a dataset to get more information about it, its features, and how to download it.

JetNet
  • Description: (dataset reference)

    • Simulated particle jets produced in proton-proton collisions in a simplified detector. The dataset is split into jets originating from tops, light quarks, gluons, W bosons, and Z bosons, with a maximum of 150 particles per jet.
  • Features:

    • Lightning DataModule for easy exchange of datasets
    • Preprocessing and postprocessing of data
    • Evaluation during training and after training with comet and wandb
    • Many settings for trainings (e.g. conditioning on selected features, training on multiple jet types, etc.)
  • Download: Can be downloaded from Zenodo and should be saved under data_folder_specified_in_env/jetnet/
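
As a minimal usage sketch of the DataModule mentioned above (the constructor arguments here are illustrative assumptions; check the data configs in the configs folder for the actual ones):

from particle_fm.data.jetnet_datamodule import JetNetDataModule

# Argument names are assumptions -- see the data configs for the real ones.
dm = JetNetDataModule(data_dir="data/jetnet", jet_type="t", batch_size=256)
dm.setup(stage="fit")                  # standard Lightning DataModule API
train_loader = dm.train_dataloader()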

LHC Olympics
  • Description: (dataset reference)

    • A dataset for anomaly detection, where the generative models are used to generate the Standard Model background. It consists of 1M simulated QCD dijet events that, after clustering, yield two jets per event with up to 279 particles per jet.
  • Features:

    • Lightning DataModule for easy exchange of datasets
    • Preprocessing and postprocessing of data
    • Evaluation during training and after training with comet and wandb
    • Many settings for trainings (e.g. conditioning on selected features, training separately on dijets, on both dijets, on the whole event, etc.)
  • Download and Preprocessing: Can be downloaded from Zenodo. The file events_anomalydetection_v2.h5 is needed, as it contains all the particles of an event. Before use, the events need to be clustered and brought into point cloud format; this preprocessing can be done with this Code. Both events_anomalydetection_v2.h5 and the preprocessed data should be saved under data_folder_specified_in_env/lhco
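
For orientation, a hedged sketch of inspecting the raw file before clustering (the column layout here is an assumption; verify it against the Zenodo page and the linked preprocessing code):

import pandas as pd

# Assumed layout: one event per row, flattened (pT, eta, phi) triplets per
# particle, with a final signal/background label column -- verify on Zenodo.
df = pd.read_hdf("data/lhco/events_anomalydetection_v2.h5", stop=10_000)
events = df.to_numpy()
labels = events[:, -1]
particles = events[:, :-1].reshape(len(events), -1, 3)  # (event, particle, feature)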

JetClass
  • Description: (dataset reference)

    • Simulated particle jets as in JetNet, but JetClass provides much more data, more jet types, and more particle features.
    • If you run into any issues with the code for this dataset, please have a look at the official repository of the paper
  • Features:

    • Lightning DataModule for easy exchange of datasets
    • Preprocessing and postprocessing of data
    • Evaluation during training and after training with comet and wandb
    • Many settings for trainings (e.g. conditioning on selected features, training on multiple jet types, etc.)
  • Download and Preprocessing: Can be downloaded from Zenodo by following the instructions from jet-universe/particle_transformer. Adjust the paths in configs/preprocessing/data.yaml and run

    python scripts/prepare_dataset.py && python scripts/preprocessing.py
CaloChallenge
  • Description: (dataset reference)

    • Dataset 2 from the CaloChallenge, where the data is represented as point clouds (see also here)
    • Thanks to Benno Käch for providing the code for the data loader
  • Download and Preprocessing:

    • The data can be downloaded from Zenodo and preprocessed with the Python script scripts/preprocessing_calo_challenge.py. Note that the file paths in the script need to be adjusted to the correct paths.
    • The file paths in the data module also need to be adjusted to the correct paths. Inside the DESY network, the default paths are also accessible and can be used, so that the data does not need to be downloaded.
TwoMoons
  • Description: (dataset reference)

  • Simple toy dataset for testing the models in the notebook. Does not need to be downloaded because the dataset can be generated via a scikit-learn function.
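
For example, the data can be generated on the fly with scikit-learn:

from sklearn.datasets import make_moons

# Generate the two-moons toy data; no download required.
X, y = make_moons(n_samples=10_000, noise=0.05)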

🌟 Features

⭐️ Easily train multiple models on various particle physics datasets

⭐️ Lightning DataModules are provided for each dataset to allow easy exchange of datasets with automatic preprocessing and postprocessing

⭐️ Lightning Callbacks are provided for each dataset to automatically evaluate during training and log to wandb and comet (EMA callback also available)
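
As a simplified sketch of this pattern (not the repository's actual callbacks, and pl_module.sample is an assumed method), such a callback follows the standard Lightning API:

import pytorch_lightning as pl

class EvaluationCallback(pl.Callback):
    # Simplified sketch; the real callbacks compute dataset-specific
    # metrics and plots and log them to wandb/comet.
    def on_validation_epoch_end(self, trainer, pl_module):
        samples = pl_module.sample(n_samples=1000)  # assumed model API
        # Placeholder metric: mean feature value of the generated samples.
        trainer.logger.log_metrics(
            {"eval/sample_mean": samples.mean().item()},
            step=trainer.global_step,
        )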

⭐️ Best Practices for coding thanks to the Lightning-Hydra-Template with all major benefits of PyTorch Lightning and Hydra (configurations, logging, multi-GPU, data modules, callbacks, hyperparameter search, continuous integration etc.). See Lightning-Hydra-Template for more information.

⚡️ Quickstart

⚙️ Installation

Option 1: Install via pip

If you want to use some modules, you can install the package via pip like so:

pip install git+https://github.com/ewencedr/particle_fm.git --upgrade

This allows you to access the modules like so:

from particle_fm.data.jetnet_datamodule import JetNetDataModule

Option 2: Install from source

Install dependencies

# clone project
git clone https://github.com/ewencedr/particle_fm
cd particle_fm

# [OPTIONAL] create conda environment
conda create -n myenv python=3.10
conda activate myenv

# install pytorch according to instructions
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt

Create a .env file to set paths and API keys:

PROJEKT_ROOT="/folder/folder/"
DATA_DIR="/folder/folder/"
LOG_DIR="/folder/folder/"
COMET_API_TOKEN="XXXXXXXXXX"
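
The Lightning-Hydra-Template entry points typically load this file automatically; if you need the variables in your own scripts, python-dotenv is one option (an assumption, not a repository requirement):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()                    # reads .env from the current working directory
data_dir = os.environ["DATA_DIR"]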

🧠 Training

Train model with default configuration

# train on one GPU
python particle_fm/train.py trainer=gpu

# train on multiple GPUs
python particle_fm/train.py trainer=ddp

Train model with chosen experiment configuration from configs/experiment/

python particle_fm/train.py experiment=experiment_name.yaml

You can override any parameter from the command line like this

python particle_fm/train.py trainer.max_epochs=20 data.batch_size=64
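
Hydra also supports parameter sweeps via multirun, for example:

# sweep over several batch sizes in separate runs (Hydra multirun)
python particle_fm/train.py -m experiment=experiment_name.yaml data.batch_size=32,64,128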

📈 Evaluation

During training, metrics and plots are automatically evaluated via custom Lightning callbacks and logged via comet and wandb. After training, most models are also evaluated automatically, and the final results are saved locally and logged via the selected loggers. The evaluation can also be started manually like this

python particle_fm/eval.py experiment=experiment_name.yaml ckpt_path=checkpoint_path

You can also specify the config file that was saved at the beginning of the training

python particle_fm/eval.py cfg_path=<cfg_file_path> ckpt_path=<checkpoint_path>

Notebooks are available to quickly train and evaluate models and to create plots.

🚀 Preconfigured Experiments

The experiments are defined in yaml files that specify which loss function, architecture, dataset, and hyperparameters to use. Feel free to create your own experiments; some preconfigured ones are available in the configs/experiment/ folder.
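
As a rough illustration of the pattern (all names here are hypothetical; see the existing files in configs/experiment/ for real examples), an experiment config overrides parts of the default configuration:

# configs/experiment/my_experiment.yaml -- hypothetical example
# @package _global_
defaults:
  - override /data: jetnet.yaml
  - override /model: flow_matching.yaml

trainer:
  max_epochs: 1000
data:
  batch_size: 256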

Click on the dataset names to find out more about all the available experiments for the dataset.

JetNet Dataset
For the JetNet dataset, the experiments from the paper 2310.00049 are available:
  • fm_tops30_cond, fm_tops30, fm_tops150_cond, fm_tops150, where all are EPiC Flow Matching models trained on the top dataset. The numbers indicate whether the model is trained on the top30 or top150 dataset and the _cond indicates that the model is conditioned on the jet mass and pt.
  • diffusion_tops30_cond, diffusion_tops30, diffusion_tops150_cond, diffusion_tops150, where all are EPiC-JeDi models trained on the top dataset. The numbers indicate whether the model is trained on the top30 or top150 dataset and the _cond indicates that the model is conditioned on the jet mass and pt.

Although not shown in the paper, the models can easily be trained on different combinations of jet types and jet sizes. Examples are:

  • fm_alljet150_cond, which is an EPiC Flow Matching model trained on all jet types with a maximum of 150 particles per jet and conditioning on jet mass and pt.

  • diffusion_alljet150_cond, which is an EPiC-JeDi model trained on all jet types with a maximum of 150 particles per jet and conditioning on jet mass and pt.

    Additionally, other architectures can be used:

  • fm_mdma, which is an EPiC Flow Matching model trained on the top dataset with the MDMA architecture

LHCO Dataset
The LHCO dataset consists of dijet events, which allows for multiple ways of generating the two jets in each event. After clustering the event into two jets, each jet can be treated as a single point cloud similar to the JetNet dataset. Using this clustering, the following experiments are available for the point cloud models:
  • lhco/both_jets One EPiC-FM model trained on point clouds of jet 1 and jet 2 at the same time (experiment from the paper 2310.06897)
  • lhco/x_jet / lhco/y_jet One EPiC-FM model trained on a single point cloud, where lhco/x_jet trains the model on jet 1 and lhco/y_jet trains the model on jet 2
  • lhco/jets_crossattention Same as lhco/both_jets but with a cross attention transformer from 2307.06836 instead of the EPiC architecture
  • lhco/transformer Same as lhco/both_jets but with a full transformer from 2307.06836 instead of the EPiC architecture

All these models require conditioning on the jet features of the full dijet event:

  • lhco/jet_features FM model with a fully connected architecture, trained on the jet features of both jets to condition the generation of the point clouds (experiment from the paper 2310.06897)

Instead of the two-step approach, the event can also be generated in more complex ways:

  • lhco/bigPC Both jet point clouds are merged into one large point cloud and the model is trained on it. During evaluation, the large point cloud is clustered into two jets again
  • lhco/wholeEvent The generative model can also be trained directly on the whole event, which is harder for the model to learn and again requires clustering for evaluation. However, this still works well and shows that these models are powerful enough to learn large point clouds with fewer restrictions.

Additionally, classifiers are available to check whether the generated events are distinguishable from the original events.

  • lhco/epic_classifier Point cloud classifier based on the EPiC architecture. Paths to the data must be specified in the config file.
  • lhco/hl_classifier Fully connected classifier as in 2109.00546 to compare high-level features. Paths to the data must be specified in the config file.
JetClass Dataset
  • jetclass/jetclass_cond EPiC Flow Matching model trained on the JetClass dataset with conditioning
  • jetclass/jetclass_classifier After evaluating the generative model, a classifier test can be run. For this, the paths to the generated data need to be specified in the config file.
CaloChallenge Dataset
  • calo_challenge/fm_mdma Flow Matching model with MDMA architecture trained on the CaloChallenge dataset

🫱🏼‍🫲🏽 Contributing

Contributions to this repository are very welcome. If you have any questions, feel free to open an issue or contact me directly. When contributing, please follow the style guidelines by using the pre-commit hooks.

⚠ Note of Caution

This repository was originally used for a research project and is now being adapted to be more general. The experiments published in the papers have been tested and should work. If you run into issues, please have a look at the official repositories for the papers. All other preconfigured experiments should work but might not have been tested for functionality. Some code might be specific to a certain use case and could be generalized further to allow for more flexibility. Additionally, some non-functional leftovers from development might still exist. Please create an issue if you encounter any problems or have any questions.

📚 Citation

When using this repository in research, please cite the following papers:

@misc{birk2023flow,
      title={Flow Matching Beyond Kinematics: Generating Jets with Particle-ID and Trajectory Displacement Information},
      author={Joschka Birk and Erik Buhmann and Cedric Ewen and Gregor Kasieczka and David Shih},
      year={2023},
      eprint={2312.00123},
      archivePrefix={arXiv},
      primaryClass={hep-ph}
}
@misc{buhmann2023phase,
      title={Full Phase Space Resonant Anomaly Detection},
      author={Erik Buhmann and Cedric Ewen and Gregor Kasieczka and Vinicius Mikuni and Benjamin Nachman and David Shih},
      year={2023},
      eprint={2310.06897},
      archivePrefix={arXiv},
      primaryClass={hep-ph}
}
@misc{buhmann2023epicly,
      title={EPiC-ly Fast Particle Cloud Generation with Flow-Matching and Diffusion},
      author={Erik Buhmann and Cedric Ewen and Darius A. Faroughy and Tobias Golling and Gregor Kasieczka and Matthew Leigh and Guillaume Quétant and John Andrew Raine and Debajyoti Sengupta and David Shih},
      year={2023},
      eprint={2310.00049},
      archivePrefix={arXiv},
      primaryClass={hep-ph}
}