
Custom metrics emitting #170

Draft · wants to merge 10 commits into main
Conversation

priyanka-ganesha (Collaborator):
  • Cloud monitoring prototype
  • Checkpoint initialization metrics emitting

@@ -175,3 +175,6 @@ stack_trace_interval_seconds: 600 # Stack trace collection frequency in seconds

# Use iota operator in Embed
use_iota_embed: False

#Monitoring parameters
Collaborator:
Please write a note explaining this a little more carefully. "Export in-workload metrics to Cloud monitoring"
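One way to address this, using the `enable_cloud_monitoring` flag that appears later in this PR (the exact comment wording below is a suggestion, not part of the diff):

```yaml
# Monitoring parameters
# Export in-workload metrics (e.g. checkpointing start/end events)
# to Cloud Monitoring as custom metrics.
enable_cloud_monitoring: False
```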

Collaborator:
Also have you visualized these metrics? Where do I see them, etc? Probably when we print tensorboard stuff we should also print a link to see these metrics.

Collaborator (Author):

The metrics can be visualized on our monitoring dashboard titled 'Maxtext Metrics'. I've modified it to print a link that points to all the project's dashboards, where one can create a new dashboard that uses these metrics or add them to existing dashboards.
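A minimal sketch of building such a link. The helper name and the console URL format are assumptions for illustration; the PR's actual implementation may construct the link differently:

```python
def dashboards_link(project_id: str) -> str:
    # Build a link to a GCP project's Monitoring dashboards page,
    # where users can create a new dashboard or extend an existing one.
    return (
        "https://console.cloud.google.com/monitoring/dashboards"
        f"?project={project_id}"
    )

print(dashboards_link("my-gcp-project"))
```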

Collaborator:

Can you post a link for a run you did?

requirements.txt (outdated diff):
@@ -3,6 +3,8 @@ absl-py
argparse
cloud-tpu-diagnostics
datetime
google-cloud-compute==1.6.1
google-cloud-monitoring==2.11.3
Collaborator:

No pin please! It is hard to maintain the pin -- who updates it and when?
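After dropping the pins, the new requirements.txt entries would presumably read (sketch; pip then resolves the latest compatible releases):

```
google-cloud-compute
google-cloud-monitoring
```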

Collaborator (Author):

Done

event_time = time.strftime(
"%d %b %Y %H:%M:%S UTC", time.gmtime(seconds_since_epoch_utc)
)
print(
Collaborator:

Use max_logging in MaxText rather than print.
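A sketch of what the suggestion looks like, with a stand-in for MaxText's max_logging helper (the real helper lives in the repo and its exact API may differ):

```python
import time

def log(user_str):
    # Minimal stand-in for MaxText's max_logging.log; the reviewer is
    # asking the author to use the real helper instead of bare print.
    print(f"LOG: {user_str}", flush=True)

seconds_since_epoch_utc = 0  # example timestamp (the Unix epoch)
event_time = time.strftime(
    "%d %b %Y %H:%M:%S UTC", time.gmtime(seconds_since_epoch_utc)
)
log(f"checkpoint event at {event_time}")
```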

Collaborator (Author):

Done

MaxText/train.py (outdated diff):
monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
monitoring_api.create_custom_metric('checkpointing_init_start', "Checkpointing Initialization Start")
Collaborator:

These metric creations should be hidden inside a function in max_utils, something like "register_train_metrics".

Collaborator (Author):

Done

MaxText/train.py (outdated diff):
monitoring_api.create_custom_metric('checkpoint_test_run_start', "Checkpointing Test Run Start")
monitoring_api.create_custom_metric('checkpoint_test_run_end', "Checkpointing Test Run End")

monitoring_api.write_time_series_step('checkpoint_test_run_start', 0, monitoring_enabled)
Collaborator:

I'm a little confused by what is happening here. It looks to me like write_time_series_step has arguments in a different order than you're passing them in?
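Keyword arguments would make this kind of mismatch harmless regardless of parameter order. A sketch with a stand-in `write_time_series_step` (the real signature is in monitoring_api.py and may differ):

```python
calls = []

def write_time_series_step(metric_name, step, monitoring_enabled=True):
    # Stand-in that just records its arguments; the real function
    # writes a time-series point to Cloud Monitoring.
    if monitoring_enabled:
        calls.append((metric_name, step))

# Calling with keywords documents intent and survives reordering
# of the function's positional parameters.
write_time_series_step(
    metric_name="checkpoint_test_run_start",
    step=0,
    monitoring_enabled=True,
)
```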

Collaborator (Author):

I think it may have been an error in a non-latest commit. Fixed.

@tonyjohnchen (Collaborator):

Looks great in general! Thanks for putting this together. This would be great for measuring goodput!

Could you please:
  • Verify this works e2e on v4 and v5 again?
  • Add an integration test as Rafi suggested, maybe add a new XLML test?

@tonyjohnchen (Collaborator):

Looks great!

@@ -0,0 +1,182 @@
# pylint: disable=unused-argument, no-name-in-module
Collaborator:

It really feels like this file is in the wrong project. I wonder if Branden/Surbhi could recommend a better home for it?


monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
Collaborator:

Let's move the registrations and all this logic into a standalone function, something like register_all_train_metrics.
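A sketch of that standalone function, covering the four metrics this PR registers. Injecting the creation callable keeps the sketch self-contained; in the PR it would call monitoring_api.create_custom_metric directly:

```python
def register_all_train_metrics(create_custom_metric, enabled):
    # Register every custom train metric in one place so train.py
    # only makes a single call. Returns the registered names.
    if not enabled:
        return []
    metrics = [
        ("checkpointing_init_start", "Checkpointing Initialization Start"),
        ("checkpointing_init_end", "Checkpointing Initialization End"),
        ("checkpoint_test_run_start", "Checkpointing Test Run Start"),
        ("checkpoint_test_run_end", "Checkpointing Test Run End"),
    ]
    for name, description in metrics:
        create_custom_metric(name, description)
    return [name for name, _ in metrics]

registered = register_all_train_metrics(lambda n, d: None, enabled=True)
```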

@@ -227,6 +228,9 @@ def setup_initial_state(model, tx, config, rng, mesh, checkpoint_manager):
state = unbox_logicallypartioned_trainstate(state)
return state, state_mesh_annotations

def register_train_metrics(metric_name, metric_description):
Collaborator:

This seems like it should be in monitoring_api.py (notice that monitoring_api.py isn't ML, MaxText or Max specific AFAICT)

monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
max_utils.register_train_metrics('checkpointint_init_start', "Checkpointing Initialization Start")
Collaborator:

Typo in checkpointing!
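One way to rule this class of bug out (a sketch, not part of the PR): define metric names once as module-level constants, so a typo like 'checkpointint_init_start' becomes a NameError at the call site instead of a silently misnamed metric.

```python
# Metric names defined once; call sites reference the constants.
CHECKPOINTING_INIT_START = "checkpointing_init_start"
CHECKPOINTING_INIT_END = "checkpointing_init_end"

def register(name):
    # Stand-in for max_utils.register_train_metrics.
    return name

registered_name = register(CHECKPOINTING_INIT_START)
```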

if monitoring_enabled:
max_utils.register_train_metrics('checkpointint_init_start', "Checkpointing Initialization Start")
max_utils.register_train_metrics('checkpointing_init_end', "Checkpointing Initialization End")
max_utils.register_train_metrics('checkpoint_test_run_start', "Checkpointing Test Run Start")
Collaborator:

What do test_run_start and test_run_end mean in this context?

writer = SummaryWriter(config.tensorboard_dir)

monitoring_api.write_time_series_step('checkpointing_init_start', monitoring_enabled, pyconfig, 1)
Collaborator:

Why is this step 1?

checkpoint_manager = checkpointing.create_orbax_checkpoint_manager(
config.checkpoint_dir,
config.enable_checkpointing,
config.async_checkpointing,
config.save_period,
)

monitoring_api.write_time_series_step('checkpointing_init_end', monitoring_enabled, pyconfig, 1)
Collaborator:

Why is this step 1?
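One way to avoid the hard-coded step is to thread the actual step value through to the writer. A sketch with a stand-in recorder (the real write_time_series_step also takes pyconfig and its signature may differ):

```python
points = []

def write_time_series_step(metric_name, monitoring_enabled, step):
    # Stand-in recorder; the real call writes the point to
    # Cloud Monitoring with the given step as its value.
    if monitoring_enabled:
        points.append((metric_name, step))

step = 0  # the actual training/restart step, not a hard-coded 1
write_time_series_step("checkpointing_init_end", True, step)
```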

@SurbhiJainUSC SurbhiJainUSC self-requested a review December 16, 2023 00:52
A9isha pushed a commit that referenced this pull request Apr 11, 2024
Change-Id: I976a2b1c09d577392fb65f940ae03848dfca06a7