Avoid idle gpu #1939

APJansen · 2024-02-13T13:40:17Z

The idea

We observed large gaps between training steps in the TensorBoard profile when running on the GPU. After a lot of fiddling, these were found to be (at least partially) due to a per-epoch overhead from TensorFlow. This overhead is reduced by redefining one actual training step as a single batch of size 1, rather than a whole epoch as it used to be.

Implementation-wise, this is done by copying the input up to 100 times, creating a set of up to 100 identical training inputs. Existing callbacks simply implement `on_step_end` instead of `on_epoch_end` and inherit from `CallbackStep`, which takes care of the conversion between steps, batches and epochs.
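
To illustrate the bookkeeping described above, here is a minimal, framework-free sketch. It is not the n3fit implementation: `StepCallbackSketch`, `RecordSteps`, `choose_steps_per_epoch` and `run_fake_training` are hypothetical names, and picking the largest divisor of the total number of steps (capped at 100) is an assumption made so that the epochs divide evenly.

```python
def choose_steps_per_epoch(total_steps, n_replicas, max_steps=100):
    """Hypothetical helper: how many steps to pack into one Keras epoch."""
    # With a single replica (typically on CPU) nothing is changed.
    if n_replicas == 1:
        return 1
    # Assumption: take the largest divisor of total_steps that is <= max_steps.
    for candidate in range(min(max_steps, total_steps), 0, -1):
        if total_steps % candidate == 0:
            return candidate
    return 1


class StepCallbackSketch:
    """Hypothetical stand-in for a step-based callback base class."""

    def __init__(self, steps_per_epoch):
        self.steps_per_epoch = steps_per_epoch
        self.epoch = 0

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch = epoch

    def on_batch_end(self, batch, logs=None):
        # Convert (epoch, batch) into a global training-step index.
        step = self.epoch * self.steps_per_epoch + batch
        self.on_step_end(step, logs)

    def on_step_end(self, step, logs=None):
        raise NotImplementedError


class RecordSteps(StepCallbackSketch):
    """Example subclass: only on_step_end needs to be implemented."""

    def __init__(self, steps_per_epoch):
        super().__init__(steps_per_epoch)
        self.seen = []

    def on_step_end(self, step, logs=None):
        self.seen.append(step)


def run_fake_training(callback, total_steps):
    """Mimic the order in which a training loop would call the hooks."""
    n_epochs = total_steps // callback.steps_per_epoch
    for epoch in range(n_epochs):
        callback.on_epoch_begin(epoch)
        for batch in range(callback.steps_per_epoch):
            callback.on_batch_end(batch)
```

With 300 training steps and 100 replicas, this sketch runs 3 epochs of 100 batches each, while the subclass still sees a contiguous sequence of step indices.
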
One extra issue is that Keras computes metrics cumulatively; they are converted back to per-step values in `CallbackStep.correct_logs`. This is the only source of slight numerical differences, which appear only in the logs and do not propagate: training results remain identical.
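
As an illustration of this correction, suppose the value logged after batch k of an epoch is the running mean m_k of the per-batch values so far. That is an assumption about how the cumulative metric is formed, and the actual `correct_logs` may differ in detail. The per-step value then follows from x_k = k·m_k − (k−1)·m_{k−1}. The function name below is hypothetical.

```python
def per_step_from_running_mean(running_means):
    """Recover per-step values from running means within one epoch.

    running_means[k-1] is assumed to be the mean of the first k per-step values.
    """
    per_step = []
    previous_sum = 0.0
    for k, mean in enumerate(running_means, start=1):
        current_sum = k * mean  # sum of the first k per-step values
        per_step.append(current_sum - previous_sum)
        previous_sum = current_sum
    return per_step


# Per-step losses 4, 2, 3 give running means 4, 3, 3.
recovered = per_step_from_running_mean([4.0, 3.0, 3.0])
```

Because the recovery goes through a multiplication and a subtraction, the recovered values can differ from the true per-step values at floating-point precision, which fits with the differences being confined to the logs.
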

Performance

Timings for 1000 epochs of the main runcard (NNPDF40_nnlo_as_01180_1000), on Snellius, with 100 replicas on the GPU or 1 replica on the CPU. In brackets the GPU memory used.

| branch | commit hash | 1 replica (s) | 100 replicas (s) |
|---|---|---|---|
| master | `0a5fc61` | 145 | 91 |
| avoid-idle-gpu | `bb366aa` | 145 | 67 |

This saves 24 seconds (roughly 26%) per 1k epochs with 100 replicas.
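
The saving can be checked directly from the 100-replica column of the table above. The variable names in this short sketch are illustrative only.

```python
# Wall-clock times in seconds for 1000 epochs with 100 replicas (from the table).
master_seconds = 91
branch_seconds = 67

saved_seconds = master_seconds - branch_seconds    # absolute saving
relative_saving = saved_seconds / master_seconds   # fraction of the master time
speedup = master_seconds / branch_seconds          # how many times faster

print(f"saved: {saved_seconds} s")
print(f"relative saving: {relative_saving:.1%}")
print(f"speedup: {speedup:.2f}x")
```
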

Profile

Note: slightly outdated profiles

This branch: [profile screenshot not preserved]

and before this single commit for comparison: [profile screenshot not preserved]

APJansen · 2024-03-06T08:29:06Z

The issue with the regression test was fixed by simply rebasing. (That then broke another test, but I think it's just a fluke that will be fixed by rerunning.)

I've updated the timings as well.
(Don't worry about the higher time on the CPU; that's just because the CPU nodes I was using before aren't available today. This PR only affects runs with more than 1 replica.)

Side comment: I added a slight fix to the logging, where for multiple replicas it would print "Validation chi2s:" and then nothing. Incidentally, I don't think the comment `# The partial chi2 makes no sense for more than one replica at once` is true anymore?

scarlehoff · 2024-06-06T15:28:11Z

I'll try to review this tomorrow and leave it ready for you to have a second look and merge it in case you need to touch n3fit as discussed @RoyStegeman . I need to check a few corner cases but afaics this seems ok.

github-actions · 2024-06-07T08:58:19Z

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Fit Name: NNBOT-9cf1f0522-2024-06-07
Fit Report wrt master: https://vp.nnpdf.science/XCW60m6ySVO1cqBCe6Zlig==
Fit Report wrt latest stable reference: https://vp.nnpdf.science/72Hgjp4vQOaUBgBm6PNnnQ==
Fit Data: https://data.nnpdf.science/fits/NNBOT-9cf1f0522-2024-06-07.tar.gz

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

n3fit/src/n3fit/backends/keras_backend/MetaModel.py

scarlehoff

I think this can be merged. #2188 can be dealt with in a separate PR.


github-actions · 2024-07-17T13:27:00Z

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Fit Name: NNBOT-c2dc50df8-2024-07-17
Fit Report wrt master: https://vp.nnpdf.science/1cLR0izpTQqk193L6Ih_6w==
Fit Report wrt latest stable reference: https://vp.nnpdf.science/fEvlOQ1QREyxenwccx4zIA==
Fit Data: https://data.nnpdf.science/fits/NNBOT-c2dc50df8-2024-07-17.tar.gz

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

scarlehoff · 2024-07-17T13:34:11Z

I'm going to merge this since the tests are passing and it is rebased on top of master (which means it is probably fixing something that has changed since the last merge, probably the TF / np version).

Worst case scenario, it can be reverted. The report in #2127 https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw== was made with a PR which is on top of this one, so we have a reproduction of 4.0 with these changes that seems to do ok.

APJansen added n3fit Issues and PRs related to n3fit performance escience labels Feb 13, 2024

APJansen self-assigned this Feb 13, 2024

APJansen mentioned this pull request Feb 13, 2024

Go from one step being an epoch to one step being a batch #1802

Closed

APJansen force-pushed the avoid-idle-gpu branch from 48f9625 to 9032cdf Compare February 28, 2024 13:12

APJansen mentioned this pull request Mar 4, 2024

Finalizing eScience contributions #1977

Closed

APJansen force-pushed the avoid-idle-gpu branch 2 times, most recently from ae92568 to bb366aa Compare March 6, 2024 07:58

APJansen force-pushed the avoid-idle-gpu branch from bb366aa to 95a7685 Compare March 8, 2024 13:02

scarlehoff force-pushed the avoid-idle-gpu branch from 95a7685 to 83f252a Compare June 6, 2024 13:46

scarlehoff changed the base branch from master to small_fixes_tf June 6, 2024 13:46

scarlehoff assigned scarlehoff and unassigned APJansen Jun 6, 2024

scarlehoff self-requested a review June 6, 2024 15:26

scarlehoff marked this pull request as ready for review June 6, 2024 15:27

scarlehoff added the run-fit-bot Starts fit bot from a PR. label Jun 7, 2024

RoyStegeman force-pushed the small_fixes_tf branch from 750204d to e5ca1ed Compare June 18, 2024 12:22

Base automatically changed from small_fixes_tf to master June 18, 2024 15:07

scarlehoff mentioned this pull request Jul 15, 2024

No agreement between parallel gpu and sequential cpu fits #2118

Closed

scarlehoff force-pushed the avoid-idle-gpu branch from a77ead0 to e72ce3c Compare July 15, 2024 09:05

scarlehoff added redo-regressions Recompute the regression data and removed run-fit-bot Starts fit bot from a PR. labels Jul 15, 2024

scarlehoff reviewed Jul 15, 2024


scarlehoff approved these changes Jul 15, 2024


scarlehoff and others added 2 commits July 17, 2024 14:13

add shape to boolean mask

feecfc9

Avoid TensorFlow overhead by making one step a batch rather than an epoch. Avoids memory overhead by only combining up to 100 steps into one epoch, and not changing anything when using only 1 replica (i.e. on CPU).

e7b6a5b

scarlehoff and others added 3 commits July 17, 2024 14:13

rebase + small changes

c0d1ad3

Automatically regenerated regressions from PR 1939, branch avoid-idle-gpu.

3bf822f

Update n3fit/src/n3fit/backends/keras_backend/MetaModel.py

1637dba

scarlehoff force-pushed the avoid-idle-gpu branch from a3ea3ac to 1637dba Compare July 17, 2024 12:13

scarlehoff added run-fit-bot Starts fit bot from a PR. and removed redo-regressions Recompute the regression data labels Jul 17, 2024

scarlehoff merged commit 68722a9 into master Jul 17, 2024
8 checks passed

scarlehoff deleted the avoid-idle-gpu branch July 17, 2024 13:34
