Config for TPU pod #239

vochicong

I ran `train.py` on a v3-256 TPU pod and got the following error:

    ValueError: TPUConfig.num_shards is not set correctly ....

The Cloud TPU documentation at https://cloud.google.com/tpu/docs/training-on-tpu-pods#providing_the_tpu_name_and_region_to_tpuclusterresolver states:

> For single device training, you can specify either the TPU name or an IP address, for example: `grpc://1.2.3.4:8470`.
> For TPU Pods you must use the TPU name so that TensorFlow can discover the IP addresses of all the hosts available for training distribution.

So for a TPU pod, setting `master` doesn't work. I tried setting `cluster` instead and it worked: all 32 hosts in the TPU pod were detected and used correctly.
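For reviewers, here is a minimal sketch of the `master` versus `cluster` choice described above. It is not the code in this PR: `num_hosts`, `choose_target`, `build_run_config`, and the TPU name `my-pod` are hypothetical, and it assumes a TensorFlow release that still ships `tf.compat.v1.estimator.tpu.RunConfig` (the repository may use a different entry point such as `tf.contrib.tpu`). The host count relies on each v2/v3 host driving 8 cores, which matches the 32 hosts seen on v3-256.

```python
CORES_PER_HOST = 8  # each v2/v3 TPU host drives 8 cores


def num_hosts(accelerator_type):
    """Hosts in a slice such as 'v3-256' (core count divided by 8)."""
    cores = int(accelerator_type.rsplit("-", 1)[1])
    return max(1, cores // CORES_PER_HOST)


def choose_target(tpu, accelerator_type):
    """Return ('master', address) or ('cluster', tpu_name). Hypothetical helper."""
    is_address = tpu.startswith("grpc://")
    if is_address and num_hosts(accelerator_type) > 1:
        # A single grpc:// address cannot describe every host in a pod.
        raise ValueError("TPU pods need the TPU name, not a grpc:// address")
    return ("master", tpu) if is_address else ("cluster", tpu)


def build_run_config(tpu, accelerator_type, model_dir):
    """Build a RunConfig; TensorFlow is imported lazily (assumed installed)."""
    import tensorflow.compat.v1 as tf

    key, value = choose_target(tpu, accelerator_type)
    if key == "cluster":
        # The resolver looks up all host addresses from the TPU name.
        # Zone and project may also be needed outside a GCE VM.
        value = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=value)
    return tf.estimator.tpu.RunConfig(model_dir=model_dir, **{key: value})
```

With these helpers, `choose_target("my-pod", "v3-256")` selects `cluster`, while a `grpc://` address is only accepted for a single-host slice such as `v3-8`.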
