This can be trivially reproduced, provided you have access to a Google Cloud project in which you have permission to run Vertex AI custom training jobs and to create Google Cloud Storage (GCS) buckets.
1. In Vertex AI, submit a dummy custom training job (using e.g. the `python:3.10` Docker image and `command: ["sleep", "3600"]`) with the interactive shell enabled (if you're submitting the job via the Python SDK, simply pass `enable_web_access=True` to `CustomJob`'s `.submit()` method), as sketched below.
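    For example (a sketch, not the exact job from this report; project, region, machine type and bucket names are placeholders):

    ```python
    from google.cloud import aiplatform

    # Placeholders: use your own project, region and staging bucket
    aiplatform.init(project="<YOUR_PROJECT>", location="us-central1")

    job = aiplatform.CustomJob(
        display_name="fuse-repro",
        worker_pool_specs=[{
            "machine_spec": {"machine_type": "n1-standard-4"},
            "replica_count": 1,
            "container_spec": {
                "image_uri": "python:3.10",
                "command": ["sleep", "3600"],
            },
        }],
        staging_bucket="gs://<YOUR_STAGING_BUCKET>",
    )
    job.submit(enable_web_access=True)  # enables "Launch web terminal"
    ```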
2. Once the job reaches the "Training" state, open two terminals by clicking "Launch web terminal" in its GUI; we will use them for Python and Bash, respectively.
3. In GCS, create a bucket of your choice and make sure that you can write to it.
4. In the first terminal, run `pip install tensorboard` and then launch the Python interpreter.
5. Copy and paste this code into the Python shell, replacing `<NAME_OF_YOUR_BUCKET>` with the name of the bucket you created in step 3:
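    A minimal sketch of the setup this needs (assuming the standard `tensorboard` module paths; Vertex AI mounts buckets under `/gcs/` via Cloud Storage FUSE):

    ```python
    from tensorboard.summary.writer.event_file_writer import EventFileWriter
    from tensorboard.compat.proto.event_pb2 import Event

    # Point the writer at the FUSE-mounted bucket
    path = "/gcs/<NAME_OF_YOUR_BUCKET>/"
    ```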
6. In the second terminal (Bash), `cd` to `/gcs/<NAME_OF_YOUR_BUCKET>/` and make sure that there are no existing event files there.
7. We will start with the default writer; paste this into the Python shell:

    ```python
    writer = EventFileWriter(path)
    ```
8. Switch to Bash and run the following; we will use it to monitor the size of the event files (at the beginning you should see one such file of 88 bytes):

    ```bash
    while true; do date '+%H:%M:%S'; ls -l | grep event; sleep 1; done
    ```
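9. Switch to Python and write a batch of small events, along these lines (a sketch; the event count here is a guess aimed at the ~10 KB file mentioned in the next step):

    ```python
    # Each add_event() enqueues one record; the writer's background
    # thread appends it to the event file on the FUSE mount
    for i in range(300):
        writer.add_event(Event(step=i, wall_time=float(i)))
    ```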
10. Switch to Bash and observe how your event file grows; it should reach slightly above 10 KB. Once it stops growing, terminate your monitoring loop and note how long it took (for me, it was 2m23s). Delete this file afterwards.
11. Now we will repeat this experiment with a slightly modified writer. Switch to Python and paste the following:
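    A sketch of the modification, assuming the `fs_supports_append` attribute lives on the TensorFlow-stub `GFile` that a plain `pip install tensorboard` uses:

    ```python
    writer = EventFileWriter(path)
    # Hack: make the underlying GFile believe the filesystem does not
    # support append, so that records are buffered in memory and written
    # out in one shot on flush(), instead of one open/write/close cycle
    # (and hence one FUSE round-trip) per record
    writer._general_file_writer.fs_supports_append = False
    ```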
12. Switch to Bash and run the same monitoring loop as in step 8.
13. Switch to Python and run the same loop as in step 9, followed by `writer.flush()`, because otherwise you'd have to wait an extra 2 minutes.
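    (For context: the extra 2 minutes come from the writer's timed flushing, whose `flush_secs` parameter defaults to 120 seconds.)

    ```python
    writer.flush()  # push the buffered records out immediately
    ```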
14. Switch to Bash and notice that your event file reaches its expected size almost instantly.
Q: What does this experiment show?
A: It shows that `EventFileWriter` is not aware of the Cloud Storage FUSE filesystem, which leads to suboptimal performance; most likely every per-record append re-opens and closes the file, and each close on a FUSE mount can trigger a re-upload of the whole object. (And to make it clear: this monkey-patching of `fs_supports_append` doesn't mean that the filesystem doesn't support append; it's just a hack meant to fool `EventFileWriter`.)
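Roughly, the slow path has this shape (a simplified paraphrase of the append path in `tensorboard`'s TensorFlow-stub `gfile`, not the verbatim library code):

```python
import io

# Simplified: without the hack above, every record ends up in a call
# like this, which opens, writes and closes the file each time; on a
# Cloud Storage FUSE mount each close() can re-upload the whole object
def append(filename, file_content, binary_mode=True):
    with io.open(filename, "ab" if binary_mode else "a") as f:
        f.write(file_content)
```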
Q: But writing events to disk is asynchronous, so even if it is slow, why should we care?
A: Because some training frameworks (e.g. PyTorch Lightning) explicitly flush metrics at the end of the validation loop (which is a quite reasonable thing to do, I'd say), and with hundreds of metrics (600+ in my case) training is stuck for a non-negligible time after each validation (7-8 minutes in my case); the more frequent the validations, the more this overhead accumulates relative to total training time and GPU utilization. And besides, good engineers simply can't stand seeing such a relatively small amount of data take so long to write ;)