Custom metrics emitting #170
priyanka-ganesha commented Sep 22, 2023
- Cloud monitoring prototype
- Checkpoint initialization metrics emitting
MaxText/configs/base.yml
Outdated
@@ -175,3 +175,6 @@ stack_trace_interval_seconds: 600 # Stack trace collection frequency in seconds

# Use iota operator in Embed
use_iota_embed: False

#Monitoring parameters
Please write a note explaining this a little more carefully. "Export in-workload metrics to Cloud monitoring"
Also, have you visualized these metrics? Where do I see them? When we print the TensorBoard information, we should probably also print a link to these metrics.
The metrics can be visualized on our monitoring dashboard titled 'Maxtext Metrics'. I've modified the code to print a link to all of the project's dashboards, where one can create a new dashboard for these metrics or add them to an existing one.
Can you post a link for a run you did?
requirements.txt
Outdated
@@ -3,6 +3,8 @@ absl-py
argparse
cloud-tpu-diagnostics
datetime
google-cloud-compute==1.6.1
google-cloud-monitoring==2.11.3
No pin please! It is hard to maintain the pin -- who updates it and when?
Done
MaxText/monitoring_api.py
Outdated
event_time = time.strftime(
    "%d %b %Y %H:%M:%S UTC", time.gmtime(seconds_since_epoch_utc)
)
print(
Please use maxlogging in MaxText here instead of print.
Done
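For reference, a minimal sketch of the pattern this thread is about: format the event time once, then send the line through a single logging helper instead of a bare print. The names `format_event_time`, `log`, and `log_metric_event` are hypothetical, and `log` is only a stand-in for the project's logging helper, not its actual API.

```python
import time


def format_event_time(seconds_since_epoch_utc: float) -> str:
    """Render an epoch timestamp (UTC) in the layout used by the snippet above."""
    return time.strftime(
        "%d %b %Y %H:%M:%S UTC", time.gmtime(seconds_since_epoch_utc)
    )


def log(message: str) -> None:
    # Stand-in for the project's logging helper (assumption).
    # Flushing keeps the output ordered when many hosts write at once.
    print(message, flush=True)


def log_metric_event(metric_name: str, seconds_since_epoch_utc: float) -> str:
    """Build the log line, emit it through the helper, and return it."""
    line = f"Emitted {metric_name} at {format_event_time(seconds_since_epoch_utc)}"
    log(line)
    return line
```

Routing every message through one helper means the output format or destination can be changed in a single place later.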
MaxText/train.py
Outdated
monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
  monitoring_api.create_custom_metric('checkpointing_init_start', "Checkpointing Initialization Start")
These metric creations should be hidden inside of a function inside of maxutils, something like "register_train_metrics"
Done
MaxText/train.py
Outdated
monitoring_api.create_custom_metric('checkpoint_test_run_start', "Checkpointing Test Run Start")
monitoring_api.create_custom_metric('checkpoint_test_run_end', "Checkpointing Test Run End")

monitoring_api.write_time_series_step('checkpoint_test_run_start', 0, monitoring_enabled)
I'm a little confused by what is happening here. It looks to me like write_time_series_step has arguments in a different order than you're passing them in?
I think it may have been an error on a non-latest commit - Fixed.
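One way to make this kind of argument-order mix-up fail loudly is to make everything after the metric name keyword-only. The sketch below is illustrative only: the signature, the parameter names `monitoring_enabled` and `step`, and the returned dict are all assumptions, not the actual signature in monitoring_api.py, and the dict stands in for the real write to Cloud Monitoring.

```python
from typing import Optional


def write_time_series_step(
    metric_name: str, *, monitoring_enabled: bool, step: int
) -> Optional[dict]:
    """Hypothetical signature: arguments after metric_name must be named."""
    if not monitoring_enabled:
        # Monitoring is off, so nothing is written.
        return None
    # Placeholder for the real write; returns the point that would be sent.
    return {"metric": metric_name, "step": step}


# Call sites now state which value is which.
point = write_time_series_step(
    "checkpoint_test_run_start", monitoring_enabled=True, step=0
)
```

With this shape, passing the values positionally raises a TypeError at the call site instead of silently swapping the step and the enabled flag.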
Looks great in general! Thanks for putting this together. This would be great to measure goodput!
Could you please:
- Verify this works e2e on v4 and v5 again?
- Add an integration test as Rafi suggested, maybe add a new XLML test?
Looks great!
@@ -0,0 +1,182 @@
# pylint: disable=unused-argument, no-name-in-module
It really feels like this file is in the wrong project. I wonder if Branden/Surbhi could recommend a better home for it?
monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
Let's move the registrations and all this logic into a standalone function, e.g. register_all_train_metrics.
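A rough sketch of what such a standalone function might look like, assuming the metric names and descriptions from the diff. The `TRAIN_METRICS` table and the `create_metric` callback parameter are hypothetical. The callback stands in for whatever function actually creates the custom metric, so the sketch does not depend on the monitoring client.

```python
from typing import Callable, Dict

# Metric name -> description, declared once so call sites cannot misspell them.
TRAIN_METRICS: Dict[str, str] = {
    "checkpointing_init_start": "Checkpointing Initialization Start",
    "checkpointing_init_end": "Checkpointing Initialization End",
    "checkpoint_test_run_start": "Checkpointing Test Run Start",
    "checkpoint_test_run_end": "Checkpointing Test Run End",
}


def register_all_train_metrics(
    monitoring_enabled: bool, create_metric: Callable[[str, str], None]
) -> int:
    """Register every training metric; return how many were registered."""
    if not monitoring_enabled:
        return 0
    for name, description in TRAIN_METRICS.items():
        create_metric(name, description)
    return len(TRAIN_METRICS)


# Usage with a fake creator that just records the calls.
registered = []
count = register_all_train_metrics(
    True, lambda name, description: registered.append((name, description))
)
```

Keeping the names in one table also leaves train.py with a single call and makes a typo in a metric name a one-line fix.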
@@ -227,6 +228,9 @@ def setup_initial_state(model, tx, config, rng, mesh, checkpoint_manager):
  state = unbox_logicallypartioned_trainstate(state)
  return state, state_mesh_annotations

def register_train_metrics(metric_name, metric_description):
This seems like it should be in monitoring_api.py (notice that monitoring_api.py isn't ML, MaxText or Max specific AFAICT)
monitoring_enabled = config.enable_cloud_monitoring

if monitoring_enabled:
  max_utils.register_train_metrics('checkpointint_init_start', "Checkpointing Initialization Start")
Typo in checkpointing!
if monitoring_enabled:
  max_utils.register_train_metrics('checkpointint_init_start', "Checkpointing Initialization Start")
  max_utils.register_train_metrics('checkpointing_init_end', "Checkpointing Initialization End")
  max_utils.register_train_metrics('checkpoint_test_run_start', "Checkpointing Test Run Start")
What do test_run_start and test_run_end mean in this context?
writer = SummaryWriter(config.tensorboard_dir)

monitoring_api.write_time_series_step('checkpointing_init_start', monitoring_enabled, pyconfig, 1)
Why is this step 1?
checkpoint_manager = checkpointing.create_orbax_checkpoint_manager(
    config.checkpoint_dir,
    config.enable_checkpointing,
    config.async_checkpointing,
    config.save_period,
)

monitoring_api.write_time_series_step('checkpointing_init_end', monitoring_enabled, pyconfig, 1)
Why is this step 1?
Change-Id: I976a2b1c09d577392fb65f940ae03848dfca06a7