
Custom metrics emitting #170

Draft · wants to merge 10 commits into main
Conversation

priyanka-ganesha (Collaborator):
  • Cloud monitoring prototype
  • Checkpoint initialization metrics emitting

@@ -175,3 +175,6 @@ stack_trace_interval_seconds: 600 # Stack trace collection frequency in seconds

# Use iota operator in Embed
use_iota_embed: False

#Monitoring parameters
Collaborator:
Please write a note explaining this a little more carefully. "Export in-workload metrics to Cloud monitoring"
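One way to address this, using the `enable_cloud_monitoring` flag that appears later in this PR (the exact comment wording below is a suggestion, not part of the diff):

```yaml
# Monitoring parameters
# Export in-workload metrics (e.g. checkpointing start/end events)
# to Cloud Monitoring as custom metrics.
enable_cloud_monitoring: False
```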

Collaborator:
Also have you visualized these metrics? Where do I see them, etc? Probably when we print tensorboard stuff we should also print a link to see these metrics.

Collaborator (Author):

The metrics can be visualized on our monitoring dashboard titled 'Maxtext Metrics'. I've modified it to print a link that points to all the project's dashboards, where one can create a new dashboard that uses these metrics or add them to existing dashboards.
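A minimal sketch of building such a link. The helper name and the console URL format are assumptions for illustration; the PR's actual implementation may construct the link differently:

```python
def dashboards_link(project_id: str) -> str:
    # Build a link to a GCP project's Monitoring dashboards page,
    # where users can create a new dashboard or extend an existing one.
    return (
        "https://console.cloud.google.com/monitoring/dashboards"
        f"?project={project_id}"
    )

print(dashboards_link("my-gcp-project"))
```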

Collaborator:

Can you post a link for a run you did?

requirements.txt (outdated diff):
@@ -3,6 +3,8 @@ absl-py
argparse
cloud-tpu-diagnostics
datetime
google-cloud-compute==1.6.1
google-cloud-monitoring==2.11.3
Collaborator:

No pin please! It is hard to maintain the pin -- who updates it and when?
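After dropping the pins, the new requirements.txt entries would presumably read (sketch; pip then resolves the latest compatible releases):

```
google-cloud-compute
google-cloud-monitoring
```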

Collaborator (Author):

Done

event_time = time.strftime(
"%d %b %Y %H:%M:%S UTC", time.gmtime(seconds_since_epoch_utc)
)
print(
Collaborator:

Use max_logging in MaxText rather than print.
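A sketch of what the suggestion looks like, with a stand-in for MaxText's max_logging helper (the real helper lives in the repo and its exact API may differ):

```python
import time

def log(user_str):
    # Minimal stand-in for MaxText's max_logging.log; the reviewer is
    # asking the author to use the real helper instead of bare print.
    print(f"LOG: {user_str}", flush=True)

seconds_since_epoch_utc = 0  # example timestamp (the Unix epoch)
event_time = time.strftime(
    "%d %b %Y %H:%M:%S UTC", time.gmtime(seconds_since_epoch_utc)
)
log(f"checkpoint event at {event_time}")
```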

Collaborator (Author):

Done

MaxText/train.py (outdated diff):
monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
monitoring_api.create_custom_metric('checkpointing_init_start', "Checkpointing Initialization Start")
Collaborator:

These metric creations should be hidden inside a function in max_utils, something like "register_train_metrics".

Collaborator (Author):

Done

MaxText/train.py (outdated diff):
monitoring_api.create_custom_metric('checkpoint_test_run_start', "Checkpointing Test Run Start")
monitoring_api.create_custom_metric('checkpoint_test_run_end', "Checkpointing Test Run End")

monitoring_api.write_time_series_step('checkpoint_test_run_start', 0, monitoring_enabled)
Collaborator:

I'm a little confused by what is happening here. It looks to me like write_time_series_step has arguments in a different order than you're passing them in?
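Keyword arguments would make this kind of mismatch harmless regardless of parameter order. A sketch with a stand-in `write_time_series_step` (the real signature is in monitoring_api.py and may differ):

```python
calls = []

def write_time_series_step(metric_name, step, monitoring_enabled=True):
    # Stand-in that just records its arguments; the real function
    # writes a time-series point to Cloud Monitoring.
    if monitoring_enabled:
        calls.append((metric_name, step))

# Calling with keywords documents intent and survives reordering
# of the function's positional parameters.
write_time_series_step(
    metric_name="checkpoint_test_run_start",
    step=0,
    monitoring_enabled=True,
)
```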

Collaborator (Author):

I think it may have been an error in a non-latest commit. Fixed.

@tonyjohnchen (Collaborator):

Looks great in general! Thanks for putting this together. This would be great for measuring goodput!

Could you please:
  • Verify this works e2e on v4 and v5 again?
  • Add an integration test as Rafi suggested, maybe add a new XLML test?

@tonyjohnchen (Collaborator):

Looks great!

@@ -0,0 +1,182 @@
# pylint: disable=unused-argument, no-name-in-module
Collaborator:

It really feels like this file is in the wrong project. I wonder if Branden/Surbhi could recommend a better home for it?


monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
Collaborator:

Let's move the registrations and all this logic into a standalone function, something like register_all_train_metrics.
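A sketch of that standalone function, covering the four metrics this PR registers. Injecting the creation callable keeps the sketch self-contained; in the PR it would call monitoring_api.create_custom_metric directly:

```python
def register_all_train_metrics(create_custom_metric, enabled):
    # Register every custom train metric in one place so train.py
    # only makes a single call. Returns the registered names.
    if not enabled:
        return []
    metrics = [
        ("checkpointing_init_start", "Checkpointing Initialization Start"),
        ("checkpointing_init_end", "Checkpointing Initialization End"),
        ("checkpoint_test_run_start", "Checkpointing Test Run Start"),
        ("checkpoint_test_run_end", "Checkpointing Test Run End"),
    ]
    for name, description in metrics:
        create_custom_metric(name, description)
    return [name for name, _ in metrics]

registered = register_all_train_metrics(lambda n, d: None, enabled=True)
```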

@@ -227,6 +228,9 @@ def setup_initial_state(model, tx, config, rng, mesh, checkpoint_manager):
state = unbox_logicallypartioned_trainstate(state)
return state, state_mesh_annotations

def register_train_metrics(metric_name, metric_description):
Collaborator:

This seems like it should be in monitoring_api.py (notice that monitoring_api.py isn't ML, MaxText or Max specific AFAICT)

monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
max_utils.register_train_metrics('checkpointint_init_start', "Checkpointing Initialization Start")
Collaborator:

Typo in checkpointing!
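One way to rule this class of bug out (a sketch, not part of the PR): define metric names once as module-level constants, so a typo like 'checkpointint_init_start' becomes a NameError at the call site instead of a silently misnamed metric.

```python
# Metric names defined once; call sites reference the constants.
CHECKPOINTING_INIT_START = "checkpointing_init_start"
CHECKPOINTING_INIT_END = "checkpointing_init_end"

def register(name):
    # Stand-in for max_utils.register_train_metrics.
    return name

registered_name = register(CHECKPOINTING_INIT_START)
```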

if monitoring_enabled:
max_utils.register_train_metrics('checkpointint_init_start', "Checkpointing Initialization Start")
max_utils.register_train_metrics('checkpointing_init_end', "Checkpointing Initialization End")
max_utils.register_train_metrics('checkpoint_test_run_start', "Checkpointing Test Run Start")
Collaborator:

What do test_run_start and test_run_end mean in this context?

writer = SummaryWriter(config.tensorboard_dir)

monitoring_api.write_time_series_step('checkpointing_init_start', monitoring_enabled, pyconfig, 1)
Collaborator:

Why is this step 1?

checkpoint_manager = checkpointing.create_orbax_checkpoint_manager(
config.checkpoint_dir,
config.enable_checkpointing,
config.async_checkpointing,
config.save_period,
)

monitoring_api.write_time_series_step('checkpointing_init_end', monitoring_enabled, pyconfig, 1)
Collaborator:

Why is this step 1?
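One way to avoid the hard-coded step is to thread the actual step value through to the writer. A sketch with a stand-in recorder (the real write_time_series_step also takes pyconfig and its signature may differ):

```python
points = []

def write_time_series_step(metric_name, monitoring_enabled, step):
    # Stand-in recorder; the real call writes the point to
    # Cloud Monitoring with the given step as its value.
    if monitoring_enabled:
        points.append((metric_name, step))

step = 0  # the actual training/restart step, not a hard-coded 1
write_time_series_step("checkpointing_init_end", True, step)
```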

@SurbhiJainUSC SurbhiJainUSC self-requested a review December 16, 2023 00:52
A9isha pushed a commit that referenced this pull request Apr 11, 2024
Change-Id: I976a2b1c09d577392fb65f940ae03848dfca06a7