# Change aiplatform.gapic.AcceleratorType used from TPU to A100 GPU #7
**@StateGovernment:** How do I change the default accelerator type used for Dreambooth training? Simply changing the following line throws a cascade of RPC errors; please point me towards a way.

`serving-model-cards/training-dreambooth/gcp_run_train.py`, line 21 at commit `cd3cd10`
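The embedded code didn't survive this capture. For context, a hedged reconstruction of the kind of spec that line sits in, assuming the Vertex AI CustomJob machine spec (`cloud-tpu` and the TPU v2 variant are assumptions; only the `aiplatform.gapic.AcceleratorType` enum appears in the thread):

```python
from google.cloud import aiplatform

# Hypothetical reconstruction of the worker pool spec around line 21 of
# gcp_run_train.py: the default TPU accelerator the Dreambooth job trains on.
machine_spec = {
    "machine_type": "cloud-tpu",
    "accelerator_type": aiplatform.gapic.AcceleratorType.TPU_V2,
    "accelerator_count": 8,  # TPU jobs here use a fixed slice of 8 cores
}
```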
@StateGovernment please post the error message. Is there a reason you want to use an A100? TPU trains really fast, and the model weights can easily be converted to PyTorch weights. I haven't run this code with GPUs, but it should technically work. My guess is that the machine type needs to be changed to one that supports A100s. If you're using a single A100 (40GB), change the `machine_type` line accordingly (see the sketch below). For the compatibility of machine types with GPU types, take a look at this link. You'll also need to install the CUDA version of jaxlib, so change that line too:
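The original snippets were stripped from this capture. A minimal sketch of both changes, assuming the same CustomJob spec as above (on GCP, a single 40GB A100 is only available on the `a2-highgpu-1g` machine type):

```python
from google.cloud import aiplatform

# Hypothetical excerpt from gcp_run_train.py: switch the worker pool
# from TPU to a single A100 GPU.
machine_spec = {
    "machine_type": "a2-highgpu-1g",  # A2 series machines carry the A100s
    "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_A100,
    "accelerator_count": 1,
}
```

And for jaxlib, something along these lines in the Dockerfile (the exact `cuda` extra depends on the JAX version the repo pins):

```dockerfile
# Assumed change: install the CUDA-enabled jaxlib wheels instead of the CPU/TPU ones.
RUN pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```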
Rebuild the container, push it to GCR, and run `gcp_run_train.py` again.
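Roughly, assuming a hypothetical image name (substitute whatever tag `gcp_run_train.py` actually references):

```bash
# Hypothetical image name; match it to the one configured in gcp_run_train.py.
docker build -t gcr.io/$PROJECT_ID/training-dreambooth .
docker push gcr.io/$PROJECT_ID/training-dreambooth
python gcp_run_train.py
```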
@entrpn I only have a TPU quota of 8, so the training fails after 4-5 mins. I requested a quota increase to 30, which will take a while. So in the meanwhile I'd like to see how the model trains on A100s, and probably even get metrics to compare it with TPUs once I have some quota. This was the error I ran into as I tried to change the accelerator type.
@StateGovernment that's because you need to set the accelerator count to a minimum of 8; if you set the accelerator count to 8 with TPU, it should work.
@entrpn The accelerator count was set to 8 by default, and I only had a limited TPU quota of 8 for my account. I tried to change the count to 6 through the CLI, but it didn't let me, so the count is hard-set to 8 from what I can tell. Training still stops after 11 mins; let me attach a screenshot of what I see on the console when the training stops.
@entrpn I've successfully launched a training job with an A100 after changing the configuration as suggested above, but there is almost no activity in the console or logs; it has been almost 25 mins and it still says in progress with zero activity. Please refer to the screenshots below, along with CPU utilisation and logs at the very end. Please help.
@StateGovernment I forgot to add another step: the container doesn't install CUDA drivers, so it won't use the GPU and will be extremely slow. You'll need to change [this line](https://github.com/entrpn/serving-model-cards/blob/main/training-dreambooth/Dockerfile#L1) to something like the sketch below.
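The suggested replacement was stripped from this capture; a plausible sketch, assuming an NVIDIA CUDA base image (the specific tag is an assumption):

```dockerfile
# Hypothetical replacement for line 1 of the Dockerfile: a base image that
# ships the CUDA toolkit and cuDNN so jaxlib can actually reach the GPU.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
```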
At this point you might need to make extra modifications to the Dockerfile; you can look at [this Dockerfile](https://github.com/entrpn/serving-model-cards/blob/main/stable-diffusion-batch-job/Dockerfile) for reference.
@entrpn I've followed the instructions above, but the training wouldn't start at all. Please refer to the screenshots below; I've also attached the Dockerfile I used to build and the config used to launch the job. Please help.

Dockerfile

Config used to launch training-job
The reason your job completes without training is that the base TPU image knows to find main.sh as the entrypoint, while the custom CUDA image doesn't. Add this to the end of your Dockerfile:
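The line itself is missing here; a minimal sketch, assuming the training entrypoint is a `main.sh` in the image's working directory (the path is an assumption):

```dockerfile
# Hypothetical entrypoint so the custom CUDA image launches training the
# same way the base TPU image did: by executing main.sh on container start.
ENTRYPOINT ["bash", "main.sh"]
```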
This should start the job.