
Change aiplatform.gapic.AcceleratorType used from TPU to A100 GPU #7

Open
StateGovernment opened this issue Mar 15, 2023 · 9 comments


@StateGovernment

How do I change the default accelerator type used for Dreambooth training?

Simply changing the following line throws a cascade of RPC errors; please point me towards a way to do this.

"accelerator_type": aiplatform.gapic.AcceleratorType.TPU_V3,

@entrpn
Owner

entrpn commented Mar 15, 2023

@StateGovernment please post the error message.

Is there a reason you want to use A100? TPU trains really fast, and the model weights can easily be converted to PyTorch weights with diffusers later if needed.

I haven't run this code with GPUs, but it should technically work. My guess is that the machine type needs to be changed to one that supports A100s. If you're using a single A100 (40GB), change the machine_type line to a2-highgpu-1g and call gcp_run_train.py with --accelerator-count=1.

For the mapping of machine types to supported GPU types, take a look at this link.
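As a sketch, the machine_spec change described above might look like the following. Plain strings stand in for the `aiplatform.gapic.AcceleratorType` enum so this runs without the google-cloud-aiplatform SDK; the helper function is illustrative, not part of the repo.

```python
# Sketch of the Vertex AI worker-pool machine_spec for A100 training.
# In real code, "NVIDIA_TESLA_A100" would be
# aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_A100.
def a100_machine_spec(accelerator_count: int = 1) -> dict:
    # a2-highgpu machine types come in 1g/2g/4g/8g variants, each bundling
    # that many 40GB A100s; the machine type and the count must match.
    assert accelerator_count in (1, 2, 4, 8), "a2-highgpu supports 1, 2, 4, or 8 A100s"
    return {
        "machine_type": f"a2-highgpu-{accelerator_count}g",
        "accelerator_type": "NVIDIA_TESLA_A100",
        "accelerator_count": accelerator_count,
    }
```

For a single A100 this yields `machine_type: a2-highgpu-1g` with `accelerator_count: 1`, matching the --accelerator-count=1 flag above.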

You'll also need to install the CUDA version of jaxlib; change this line to:

RUN pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Rebuild the container, push it to GCR, and run gcp_run_train.py again.
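The rebuild-and-push step could look something like this; the project and image names are placeholders, not the ones used in this thread.

```shell
# Hypothetical names -- substitute your own project and image.
PROJECT_ID=my-project
IMAGE="gcr.io/${PROJECT_ID}/training-dreambooth:latest"

docker build -t "${IMAGE}" .
docker push "${IMAGE}"   # requires `gcloud auth configure-docker` to have been run once
python gcp_run_train.py --accelerator-count=1
```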

@StateGovernment
Author

@entrpn I only have a TPU quota of 8, so the training fails after 4-5 mins. I requested a quota increase to 30, which will take a while. So in the meantime I'd like to see how the model trains on A100s, and probably even gather metrics to compare with TPUs once I have some quota.

This was the error I ran into as I tried to change the accelerator type.
Screenshot 2023-03-16 at 10 13 48 AM

@entrpn
Owner

entrpn commented Mar 16, 2023

@StateGovernment that's because you need to set the accelerator count to a minimum of 8; if you set the accelerator count to 8 with TPU, it should work.
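In machine_spec terms, the TPU requirement above amounts to the following sketch; the strings stand in for `aiplatform.gapic` enum values so it runs without the SDK.

```python
# Sketch: Vertex AI pairs the cloud-tpu machine type with a full v3-8 host,
# so accelerator_count must be 8; smaller counts are rejected.
tpu_machine_spec = {
    "machine_type": "cloud-tpu",
    "accelerator_type": "TPU_V3",  # aiplatform.gapic.AcceleratorType.TPU_V3
    "accelerator_count": 8,        # one v3-8 slice
}
```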

@StateGovernment
Author

StateGovernment commented Mar 16, 2023

@entrpn The accelerator count was set to 8 by default, and I only had a TPU quota of 8 for my account. I tried to change the count to 6 through the CLI but it didn't let me, so the count seems hard-set to 8. Training still stops after 11 mins; let me attach a screenshot of what I see on the console when the training stops.

Screenshot 2023-03-16 at 11 02 00 AM

@StateGovernment
Author

@entrpn I've successfully launched a training job with an A100 after changing the configuration as suggested above, but there is almost no activity in the console or logs. It has been almost 25 mins and it still says in progress with zero activity. Please refer to the screenshots below, along with the CPU utilisation and logs at the very end. Please help.

Screenshot 2023-03-16 at 3 08 57 PM

Screenshot 2023-03-16 at 3 11 32 PM

Screenshot 2023-03-16 at 3 12 05 PM

@entrpn
Owner

entrpn commented Mar 16, 2023

@StateGovernment I forgot to add another step: the container doesn't install the CUDA drivers, so it won't use the GPU and will be extremely slow. You'll need to change [this line](https://github.com/entrpn/serving-model-cards/blob/main/training-dreambooth/Dockerfile#L1) to something like:

FROM nvidia/cuda:11.3.1-base-ubuntu20.04

At this point, you might need to make extra modifications to the Dockerfile; you can look at [this](https://github.com/entrpn/serving-model-cards/blob/main/stable-diffusion-batch-job/Dockerfile) Dockerfile for reference.

@StateGovernment
Author

@entrpn I see, I somehow missed that detail too. Thank you for pointing it out.

I also believe this line needs to change. I'm not sure what to change it to though; please help me out.

I might even end up making a different Dockerfile altogether for GPUs.

@StateGovernment
Author

StateGovernment commented Apr 5, 2023

@entrpn I've followed the instructions above, but the training won't start at all. Please refer to the screenshots below; I've also attached the Dockerfile I used to build, and the config used to launch the job. Please help.

Screenshot 2023-04-05 at 5 56 20 PM

Screenshot 2023-04-05 at 5 56 35 PM

Dockerfile

FROM nvidia/cuda:11.3.1-base-ubuntu20.04

RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get update && \
    apt install -y python3.8 && \
    apt-get -y install python3-pip

RUN apt-get update && apt-get -y upgrade \
  && apt-get install -y --no-install-recommends \
    git \
    wget \
    g++ \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && apt-get install -y curl
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | \
    tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
    tee /usr/share/keyrings/cloud.google.gpg && apt-get update -y && apt-get install google-cloud-sdk -y

# RUN pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
RUN pip install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
RUN pip install git+https://github.com/huggingface/diffusers.git
RUN pip install transformers flax optax torch torchvision ftfy tensorboard modelcards


WORKDIR 'training_dreambooth'

COPY . .

Config used to launch training-job

custom_job = {
        "display_name": "training-dreambooth-alisha-1000steps",
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        # "machine_type": "cloud-tpu",
                        # "accelerator_type": aiplatform.gapic.AcceleratorType.TPU_V3,
                        # "accelerator_count": 8,
                        "machine_type": "a2-highgpu-1g",
                        "accelerator_type": aiplatform.gapic.AcceleratorType.NVIDIA_TESLA_A100,
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "disk_spec" : {
                        "boot_disk_type": "pd-ssd",
                        "boot_disk_size_gb" : 500
                    },
                    "container_spec": {
                        "image_uri": "gcr.io/dreamboothtest/training-dreambooth-new-gpu:latest",
                        "command": [],
                        "args": [],
                        "env" : [
                            {"name" : "MODEL_NAME", "value" : "runwayml/stable-diffusion-v1-5"},
                            {"name" : "INSTANCE_PROMPT", "value" : "a photo of al45 person"},
                            {"name" : "GCS_OUTPUT_DIR", "value" : "gs://alishadreamboothtest"},
                            {"name" : "RESOLUTION", "value" : "512"},
                            {"name" : "BATCH_SIZE", "value" : "1"},
                            {"name" : "LEARNING_RATE", "value" : "1e-6"},
                            {"name" : "MAX_TRAIN_STEPS", "value" : "1000"},
                            {"name" : "HF_TOKEN", "value" : "<>"},
                            {"name" : "CLASS_PROMPT", "value" : "A photo of a person"},
                            {"name" : "NUM_CLASS_IMAGES", "value" : "56"},
                            {"name" : "PRIOR_LOSS_WEIGHT", "value" : "1.0"},
                            {"name" : "GCS_INPUT_DIR", "value" : "gs://alishadreamboothtest/training_images"},
                        ]
                    },
                }
            ],
            "enable_web_access" : True
        },
    }

@entrpn
Owner

entrpn commented Apr 6, 2023

The reason your job completes without running the training is that the base TPU image knows to use main.sh as the entrypoint, while your custom image doesn't. Add this to the end of your Dockerfile:

ENTRYPOINT ["./main.sh"]

This should start the job.
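One related gotcha worth checking (an assumption, not something confirmed in this thread): if main.sh lost its executable bit when copied into the image, the exec-form entrypoint will still fail. Since the Dockerfile above ends with `COPY . .` into the WORKDIR, the tail could be:

```dockerfile
COPY . .

# Ensure the entrypoint script is executable inside the image
RUN chmod +x ./main.sh

ENTRYPOINT ["./main.sh"]
```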
