Skip to content

Run standalone dataloader and checkpointer on CPUs.#387

Merged
copybara-service[bot] merged 1 commit intomainfrom
roshanin_dl_ckpt
Feb 8, 2024
Merged

Run standalone dataloader and checkpointer on CPUs.#387
copybara-service[bot] merged 1 commit intomainfrom
roshanin_dl_ckpt

Conversation

@RoshaniN
Copy link
Contributor

@RoshaniN RoshaniN commented Jan 31, 2024

  • Introducing JAX distributed initialize for CPU multihost environment. (Thanks to @priyanka-ganesha for finding the required CPU settings.)
  • Added retry logic to ensure processes connect to the coordinator once it is up.
  • Updating the workflow test names.
  • Fixing bug in standalone_dataloader where Mesh was being initialized twice. Bug was introduced by commit http://shortn/_Q5ITxrra63 .
  • Introducing 'hint_hardware' in the config.

@RoshaniN RoshaniN requested a review from rwitten as a code owner January 31, 2024 01:13
@RoshaniN RoshaniN marked this pull request as draft February 1, 2024 23:29
@RoshaniN RoshaniN requested a review from gobbleturk February 2, 2024 19:48
jax.config.update('jax_default_prng_impl', 'unsafe_rbg')
jax.config.update('jax_cpu_enable_gloo_collectives', True)
jax.config.update('jax_platforms', 'cpu')
jax.distributed.initialize(coordinator_address=socket.gethostbyname(os.environ.get("JAX_COORDINATOR_ADDRESS")),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rwitten @gobbleturk I think this line needs to be specific to CPUs (v4 runner tests are failing due to this line) as these JAX variables will only be set for CPUs.

Is there a check that you could recommend?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a Matt problem!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will wait for his response, thanks Rafi!

Copy link
Contributor Author

@RoshaniN RoshaniN Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the offline advise, Matt! @gobbleturk

I moved the jax distributed initialize for CPUs to max utils. I also tested that async checkpointer now works for CPUs, with these changes.

Logs:
with async checkpointer - https://cloudlogging.app.goo.gl/V7Qog8rJvRDgJRci8
with sync checkpointer - https://cloudlogging.app.goo.gl/XzxRKqBYD2tXxW8g8

Test failures are due to "Install dependencies" step, probably unrelated to my changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#399 will unblock the tests, I will rebase onto it shortly.

@RoshaniN RoshaniN force-pushed the roshanin_dl_ckpt branch 2 times, most recently from 8f188af to ebe22fb Compare February 2, 2024 22:56
@RoshaniN RoshaniN marked this pull request as ready for review February 2, 2024 23:09
@RoshaniN RoshaniN force-pushed the roshanin_dl_ckpt branch 2 times, most recently from 6c2da5e to a65d0b6 Compare February 6, 2024 19:16
Copy link
Collaborator

@rwitten rwitten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little nits, please fix them before merging.

@RoshaniN RoshaniN marked this pull request as draft February 7, 2024 18:51
@RoshaniN
Copy link
Contributor Author

RoshaniN commented Feb 7, 2024

I noticed that the dataloader is timing out and the connection to the JAX coordinator is shutdown. Hence moving this PR to draft again.

Dataloader was working fine previously with these changes, will check what changed.

Update - standalone_dataloader was updated in http://shortn/_Q5ITxrra63 , where Mesh was being initialized twice, hence the timeouts. Fixed now.

@RoshaniN RoshaniN force-pushed the roshanin_dl_ckpt branch 2 times, most recently from c7d1fde to e73e87f Compare February 8, 2024 18:04
@RoshaniN RoshaniN marked this pull request as ready for review February 8, 2024 19:02
Copy link
Collaborator

@rwitten rwitten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little nit but giving approval. Please deal with nit before merging

@RoshaniN RoshaniN force-pushed the roshanin_dl_ckpt branch 2 times, most recently from b14b231 to 5011003 Compare February 8, 2024 19:49
@copybara-service copybara-service bot merged commit 8d0a594 into main Feb 8, 2024
@copybara-service copybara-service bot deleted the roshanin_dl_ckpt branch February 8, 2024 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants