
Mlflow benchmark profiler update #38

Open
wants to merge 32 commits into base: develop
6507424
fix: saving frequency bug for inference checkpoints
anaprietonem Aug 20, 2024
7d2d620
Merge branch 'develop' into 257-bug-inference-checkpoints-saving-freq…
anaprietonem Aug 20, 2024
0027046
chore: update CHANGELOG
anaprietonem Aug 20, 2024
8cf698b
feat: add anemoi profiler with mlflow compatibility
anaprietonem Aug 20, 2024
d647bf9
fix: format error
anaprietonem Aug 20, 2024
352cd29
fix: removed atos path from noteook and fixed update_paths function
anaprietonem Aug 23, 2024
c7ab208
add hta functionality in documentation
anaprietonem Oct 7, 2024
ebe33bd
updating docs for profiler
anaprietonem Oct 7, 2024
9c67f3e
update profiler docs
anaprietonem Oct 7, 2024
2bcf957
update profiler docs
anaprietonem Oct 7, 2024
2e6a168
update profiler docs
anaprietonem Oct 7, 2024
29232ce
update profiler docs
anaprietonem Oct 7, 2024
c646e38
update profiler docs
anaprietonem Oct 7, 2024
4d9610b
update profiler docs
anaprietonem Oct 7, 2024
0a4070c
update profiler docs
anaprietonem Oct 7, 2024
45e7a7b
update profiler docs
anaprietonem Oct 7, 2024
3cea9d9
update profiler docs
anaprietonem Oct 7, 2024
3c2f2d9
update profiler docs
anaprietonem Oct 7, 2024
b8fcf99
update profiler docs
anaprietonem Oct 7, 2024
80e5522
Merge branch 'develop' into mlflow_benchmark_profiler_update
anaprietonem Oct 7, 2024
990aea9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 7, 2024
5aeeca4
fixing pre-commits on docs
anaprietonem Oct 7, 2024
b85eac2
fix pre-commit docs
anaprietonem Oct 7, 2024
ef54ffb
fix pre-commit docs
anaprietonem Oct 7, 2024
56e222f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 7, 2024
4aa225a
minor updates
anaprietonem Oct 7, 2024
81b57d8
Merge branch 'mlflow_benchmark_profiler_update' of github.com:ecmwf/a…
anaprietonem Oct 7, 2024
86e58ba
added docs for anemoi profiler
anaprietonem Oct 7, 2024
e943782
add section about profiling in overview
anaprietonem Oct 8, 2024
e177bd6
add section about profiling in overview
anaprietonem Oct 8, 2024
328ca19
add comment to avoid confussion with profiler for troubleshooting
anaprietonem Oct 8, 2024
702287e
added note about limit batches
anaprietonem Oct 9, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -26,6 +26,8 @@ Keep it human-readable, your future self will thank you!
- Feature: Add configurable models [#50](https://github.com/ecmwf/anemoi-training/pulls/50)
- Feature: Support training for datasets with missing time steps [#48](https://github.com/ecmwf/anemoi-training/pulls/48)
- Long Rollout Plots
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report


### Fixed

Binary file added docs/images/profiler/anemoi_profiler_config.png
Binary file added docs/images/profiler/example_memory_report.png
Binary file added docs/images/profiler/example_memory_timeline.png
Binary file added docs/images/profiler/example_model_summary.png
Binary file added docs/images/profiler/example_model_summary_2.png
Binary file added docs/images/profiler/example_system_report.png
Binary file added docs/images/profiler/example_time_report.png
Binary file added docs/images/profiler/idle_time_breakdown.png
Binary file added docs/images/profiler/kernel_breakdown_dfs.png
Binary file added docs/images/profiler/kernel_breakdown_plots.png
Binary file added docs/images/profiler/memory_snapshot_output.png
Binary file added docs/images/profiler/temporal_breakdown.png
1 change: 1 addition & 0 deletions docs/index.rst
@@ -43,6 +43,7 @@ This package provides the *Anemoi* training functionality.
user-guide/training
user-guide/models
user-guide/tracking
user-guide/benchmarking
user-guide/distributed
user-guide/debugging

12 changes: 12 additions & 0 deletions docs/overview.rst
@@ -91,6 +91,18 @@ and resolve issues during the training process, including:
- Debug configurations for quick error identification
- Guidance on isolating and addressing common problems

8. Benchmarking and HPC Profiling
=================================

Anemoi Training offers tools and configurations to support benchmarking
and High-Performance Computing (HPC) profiling, allowing users to
optimize training performance. This includes:

- Benchmarking configurations for evaluating training efficiency across
different hardware setups.
- Profiling tools for monitoring resource utilization (CPU, GPU,
memory) and identifying performance bottlenecks.

**************************
Components and Structure
**************************
746 changes: 746 additions & 0 deletions docs/user-guide/benchmarking.rst

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions pyproject.toml
@@ -76,6 +76,13 @@ optional-dependencies.docs = [
  "sphinx-argparse",
  "sphinx-rtd-theme",
]
optional-dependencies.profile = [
  "holistictraceanalysis>=0.2",
  "pandas>=1.3.2",
  "rich>=13.6",
  "tabulate>=0.9",
]

optional-dependencies.tests = [ "hypothesis", "pytest", "pytest-mock" ]

urls.Changelog = "https://github.com/ecmwf/anemoi-training/CHANGELOG.md"
47 changes: 47 additions & 0 deletions src/anemoi/training/commands/profiler.py
@@ -0,0 +1,47 @@
# (C) Copyright 2024 ECMWF.
#
# This software is licensed under the terms of the Apache Licence Version 2.0
# which can be obtained at http://www.apache.org/licenses/LICENSE-2.0.
# In applying this licence, ECMWF does not waive the privileges and immunities
# granted to it by virtue of its status as an intergovernmental organisation
# nor does it submit to any jurisdiction.

from __future__ import annotations

import logging
import sys
from typing import TYPE_CHECKING

from anemoi.training.commands import Command

if TYPE_CHECKING:
    import argparse

LOGGER = logging.getLogger(__name__)


class Profile(Command):
    """Commands to profile Anemoi models."""

    accept_unknown_args = True

    @staticmethod
    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        return parser

    @staticmethod
    def run(args: list[str], unknown_args: list[str] | None = None) -> None:
        del args

        # Forward any unrecognised arguments to the wrapped entry point via sys.argv.
        if unknown_args is not None:
            sys.argv = [sys.argv[0], *unknown_args]
        else:
            sys.argv = [sys.argv[0]]

        LOGGER.info("Running anemoi profiling command with overrides: %s", sys.argv[1:])
        from anemoi.training.train.profiler import main as anemoi_profile

        anemoi_profile()


command = Profile
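The `Profile.run` method hands unrecognised CLI arguments to the wrapped Hydra-style entry point by splicing them into `sys.argv`. A minimal sketch of that hand-off pattern, with a stand-in `main` (hypothetical, for illustration only — the real entry point is `anemoi.training.train.profiler.main`):

```python
from __future__ import annotations

import sys


def main() -> list[str]:
    # Stand-in for a Hydra-style entry point, which reads its
    # configuration overrides from sys.argv.
    return sys.argv[1:]


def run(unknown_args: list[str] | None = None) -> list[str]:
    # Splice the pass-through arguments into sys.argv before delegating,
    # so the wrapped entry point sees them as its own command line.
    if unknown_args is not None:
        sys.argv = [sys.argv[0], *unknown_args]
    else:
        sys.argv = [sys.argv[0]]
    return main()


print(run(["diagnostics.benchmark_profiler.memory.enabled=False"]))
```

This is why the command can accept arbitrary `key=value` overrides without declaring them in its own argparse parser: `add_arguments` returns the parser untouched and `accept_unknown_args = True` lets everything flow through.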
22 changes: 22 additions & 0 deletions src/anemoi/training/config/diagnostics/eval_rollout.yaml
@@ -57,6 +57,28 @@ debug:
  # remember to also activate the tensorboard logger (below)
  profiler: False

# Use anemoi-profile to profile the training process
benchmark_profiler:
  memory:
    enabled: True
    steps: 5 # wait warmup steps and then do steps (too many steps would lead to a big file)
    warmup: 2
    extra_plots: False
    trace_rank0_only: False # set to True to profile rank 0 only; reads SLURM_PROC_ID, so it won't work when not running via Slurm
  time:
    enabled: True
    verbose: False # if True, output every action the profiler captures; otherwise output the subset defined in PROFILER_ACTIONS at the top of aifs/diagnostics/profiler.py
  speed:
    enabled: True
  system:
    enabled: True
  model_summary:
    enabled: True
  snapshot:
    enabled: True
    steps: 4 # wait warmup steps and then do steps
    warmup: 0

checkpoint:
  every_n_minutes:
    save_frequency: 30 # approximate, as this is checked at the end of training steps
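Each profiler section above pairs a `warmup` count (discarded steps) with a `steps` count (recorded steps). The mapping from that pair to a per-step decision can be sketched as below, mirroring `torch.profiler.schedule` semantics; the names here are illustrative, not the actual anemoi-training scheduler:

```python
from enum import Enum


class Action(Enum):
    WARMUP = "warmup"
    RECORD = "record"
    NONE = "none"


def schedule(step: int, *, warmup: int, active: int) -> Action:
    # The first `warmup` steps are discarded, the next `active` steps are
    # recorded, and later steps are skipped (keeping trace files small).
    if step < warmup:
        return Action.WARMUP
    if step < warmup + active:
        return Action.RECORD
    return Action.NONE


# With the memory section above (warmup: 2, steps: 5): steps 0-1 warm up,
# steps 2-6 are recorded, and nothing is recorded from step 7 onwards.
print([schedule(s, warmup=2, active=5).value for s in range(8)])
```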
2 changes: 2 additions & 0 deletions src/anemoi/training/config/training/default.yaml
@@ -19,6 +19,8 @@ multistep_input: 2
# the effective batch size becomes num-devices * batch_size * k
accum_grad_batches: 1

num_sanity_val_steps: 6

# clip gradients; 0: don't clip. Default algorithm: norm, alternative: value
gradient_clip:
  val: 32.
66 changes: 66 additions & 0 deletions src/anemoi/training/diagnostics/callbacks/__init__.py
@@ -37,6 +37,7 @@
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint
from pytorch_lightning.utilities import rank_zero_only
from pytorch_lightning.utilities.types import STEP_OUTPUT

from anemoi.training.diagnostics.plots import init_plot_settings
from anemoi.training.diagnostics.plots import plot_graph_features
@@ -870,6 +871,71 @@ def on_load_checkpoint(
pl_module.hparams["metadata"]["parent_uuid"] = checkpoint["hyper_parameters"]["metadata"]["uuid"]


class MemorySnapshotRecorder(Callback):
    """Record a memory snapshot using torch.cuda.memory._record_memory_history()."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.dirpath = Path(self.config.hardware.paths.profiler)

        self.warmup = self.config.diagnostics.benchmark_profiler.snapshot.warmup
        if not self.warmup:
            self.warmup = 0
        # Be consistent with the profiler scheduler: record for `steps` steps after `warmup` steps.
        self.num_steps = self.config.diagnostics.benchmark_profiler.snapshot.steps + self.warmup
        self.status = False

        assert (
            self.num_steps % self.config.dataloader.batch_size.training == 0
        ), "Snapshot steps is not a multiple of batch size"
        assert (
            self.warmup % self.config.dataloader.batch_size.training == 0
        ), "Snapshot warmup steps is not a multiple of batch size"

    @rank_zero_only
    def _start_snapshot_recording(self):
        LOGGER.info("Starting snapshot record_memory_history")
        torch.cuda.memory._record_memory_history()
        self.status = True

    @rank_zero_only
    def _save_snapshot(self):
        self.memory_snapshot_fname = self.dirpath / "memory_snapshot.pickle"
        try:
            LOGGER.info("Saving memory snapshot to %s", self.memory_snapshot_fname)
            torch.cuda.memory._dump_snapshot(f"{self.memory_snapshot_fname}")
        except Exception:
            LOGGER.exception("Failed to capture memory snapshot")

    @rank_zero_only
    def stop_record_memory_history(self) -> None:
        LOGGER.info("Stopping snapshot record_memory_history")
        torch.cuda.memory._record_memory_history(enabled=None)

    def on_train_batch_start(
        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", batch: Any, batch_idx: int
    ) -> None:
        if trainer.global_step == self.warmup:
            self._start_snapshot_recording()

    def on_train_batch_end(
        self,
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        outputs: STEP_OUTPUT,
        batch: Any,
        batch_idx: int,
    ) -> None:
        if trainer.global_step == self.num_steps:
            if self.status is True:
                self._save_snapshot()
                self.stop_record_memory_history()
            else:
                LOGGER.info("Snapshot recording was not started so no snapshot was saved")

class AnemoiCheckpoint(ModelCheckpoint):
    """A checkpoint callback that saves the model after every validation epoch."""
