Run standalone dataloader and checkpointer on CPUs. #387
copybara-service[bot] merged 1 commit into main
Conversation
MaxText/standalone_dataloader.py
Outdated
jax.config.update('jax_default_prng_impl', 'unsafe_rbg')
jax.config.update('jax_cpu_enable_gloo_collectives', True)
jax.config.update('jax_platforms', 'cpu')
jax.distributed.initialize(coordinator_address=socket.gethostbyname(os.environ.get("JAX_COORDINATOR_ADDRESS")),
@rwitten @gobbleturk I think this line needs to be specific to CPUs, since these JAX variables will only be set for CPUs (the v4 runner tests are failing because of this line).
Is there a check that you could recommend?
Sure, will wait for his response, thanks Rafi!
Thanks for the offline advice, Matt! @gobbleturk
I moved the jax.distributed.initialize call for CPUs into max_utils. I also tested that the async checkpointer now works on CPUs with these changes.
Logs:
with async checkpointer - https://cloudlogging.app.goo.gl/V7Qog8rJvRDgJRci8
with sync checkpointer - https://cloudlogging.app.goo.gl/XzxRKqBYD2tXxW8g8
The test failures come from the "Install dependencies" step and are probably unrelated to my changes.
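For readers following the thread, the CPU-only guard discussed above could look roughly like the sketch below. It is not the code merged in this PR. The `hardware` argument and the `JAX_PROCESS_COUNT` and `JAX_PROCESS_INDEX` variable names are hypothetical and used only for illustration, and the optional `:port` handling is likewise an assumption; only `JAX_COORDINATOR_ADDRESS` and the `socket.gethostbyname` lookup come from the diff.

```python
import os
import socket


def cpu_distributed_kwargs(hardware, env, resolve=socket.gethostbyname):
  """Build kwargs for jax.distributed.initialize, or return None to skip it.

  `hardware`, JAX_PROCESS_COUNT and JAX_PROCESS_INDEX are hypothetical names;
  only JAX_COORDINATOR_ADDRESS appears in the PR diff.
  """
  if hardware != "cpu":
    return None  # Non-CPU runs (e.g. the v4 runners) skip this path entirely.
  address = env.get("JAX_COORDINATOR_ADDRESS")
  if not address:
    return None  # Variable unset: never hand None to the hostname lookup.
  # Assumption: the address may carry an optional ":port" suffix.
  host, _, port = address.partition(":")
  resolved = resolve(host)
  return {
      "coordinator_address": f"{resolved}:{port}" if port else resolved,
      "num_processes": int(env["JAX_PROCESS_COUNT"]),
      "process_id": int(env["JAX_PROCESS_INDEX"]),
  }


def maybe_initialize_cpu_distributed(hardware, env=os.environ):
  """Call jax.distributed.initialize only when the CPU guard passes."""
  kwargs = cpu_distributed_kwargs(hardware, env)
  if kwargs is None:
    return False
  import jax  # Imported here so the helper above stays testable without JAX.
  jax.distributed.initialize(**kwargs)
  return True
```

Keeping the decision in a pure function means the guard can be unit-tested without a coordinator, and a TPU job that never sets these variables returns early instead of failing inside the hostname lookup.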
#399 will unblock the tests, I will rebase onto it shortly.
8f188af to ebe22fb
6c2da5e to a65d0b6
a65d0b6 to ac7950d
rwitten
left a comment
Little nits, please fix them before merging.
I noticed that the dataloader is timing out and the connection to the JAX coordinator is being shut down, so I am moving this PR back to draft. The dataloader was working fine with these changes previously; I will check what changed.
Update: standalone_dataloader was updated in http://shortn/_Q5ITxrra63 , where Mesh was being initialized twice, hence the timeouts. Fixed now.
c7d1fde to e73e87f
rwitten
left a comment
Little nit, but giving approval. Please deal with the nit before merging.
b14b231 to 5011003
5011003 to c78de11