Disable jit compilation in tf > 2.16 #2135
Conversation
Force-pushed from e7fd0ca to 6498f0f
Here's a report: https://vp.nnpdf.science/8vyjWCvZRvSzu0JGbLbErQ==/ The computer used to run this fit was the RTX 3070 mentioned below; it took about 60 minutes for 120 replicas. I'll merge once the tests are passing (there has been some trouble with the CERN server and some PDFs were not downloaded from LHAPDF), since this changes nothing. Also, I'm relatively proud of this stability taking into account the amount of things that have changed between these two fits (including the fact that after 4.0.9 we are no longer backwards compatible by law) :) |
Do you know why the training length distribution looks so different? |
The specific pattern looks more different than it is, because by chance the fits stopped in a way that made them flatter. But the fits that arrive at the last bin (which is the only relevant one) are ~25 vs ~17. That said, for some reason (and with only N=2 fits) the GPU seems to produce flatter distributions. I don't know whether there's a reason: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==/ or whether it is related to the change in the positivity datasets |
I should have done a CPU fit with that runcard and master at some point, but let me quickly redo it to be sure. |
We already discussed it, but for later reference let me put here the results of the fit with the current version of the master branch, but on CPU and using the same runcard as in your fit: https://vp.nnpdf.science/C0EQblACS2qzaMbpNLuePQ==/ As expected, the TL distribution looks different, but it would be interesting to see how the TL distribution changes on GPU if the pseudodata sampling seed is changed, as in both your fits this is the same. |
Good eye! Indeed, there was a problem with the multi-replica stopping, where a replica could become active again after it had stopped. Here's the same fit with this corrected: https://vp.nnpdf.science/RWnazARaSb6TbKMtkxONGA== It doesn't seem to have an effect on the fit (in this report even the seed for the pseudodata is different), but it's good that it has been caught :) @Cmurilochem in the same manner, it should not impact your hyperopt runs, but if you need to run new ones, you'd better use this branch, I guess. |
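As an illustration of the kind of bug described above (this is a toy sketch, not the actual n3fit stopping code, and the per-epoch check below is a hypothetical stand-in for the real patience-based criterion): once a replica has triggered its stopping condition it should never be counted as active again, which can be enforced with a monotonically shrinking boolean mask.

```python
# Toy sketch only, not the nnpdf implementation: keep a per-replica "active"
# mask and make it monotonically decreasing, so a replica that has stopped
# can never become active again even if its validation loss improves later.
import numpy as np

n_replicas = 4
active = np.ones(n_replicas, dtype=bool)

def update_active(active, passes_stopping_check):
    """passes_stopping_check[i] is True if replica i may keep training this epoch.

    The logical AND guarantees monotonicity: once False, always False.
    """
    return active & passes_stopping_check

# Example: replica 1 fails the check at some epoch...
active = update_active(active, np.array([True, False, True, True]))
# ...and stays inactive even if a later check would pass for it again.
active = update_active(active, np.array([True, True, True, True]))
print(active)  # [ True False  True  True]
```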
For good measure, a fit using the same seeds as the first one: https://vp.nnpdf.science/CyjXLYyARtKfWK1iUnJrYg==/ This is ready to merge. |
Force-pushed from 5721909 to 2120355
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉! |
Also with the same runcard as your first (240726-jcm-004 wasn't uploaded so I couldn't compare to that): https://vp.nnpdf.science/rr476H7nT9GjOb6WBasw7Q==/
It took 2:15h and 25 GB on a V100.
Anyway, looks good to me |
They are the same. I'm surprised my puny RTX was faster, though. I wonder why. |
From the XLA docs: "All operations must have inferrable shapes. Note that because XLA is a JIT compiler, the shapes can vary across runs, as long as they can be inferred given the inputs to the cluster. So this example is fine." Could it be that the masking layers give rise to 'unpredictable shape' tensors? |
Then it should crash at compilation time (so either way it is a bug on their side). If, for fun, you want to look further into this, I suggest this as a starting point: #2137 |
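As a hedged illustration of the "unpredictable shape" hypothesis (I don't know whether the n3fit masking layers actually work this way): an op like tf.boolean_mask produces a tensor whose shape depends on the mask values, which is exactly the kind of shape XLA cannot infer statically. Wrapping such an op in a jit-compiled function is a quick way to probe how a given TF version reacts.

```python
# Minimal probe, not the nnpdf code: does a data-dependent-shape op survive
# XLA JIT compilation on this TF version?
import tensorflow as tf

@tf.function(jit_compile=True)
def masked_sum(x, mask):
    # The output shape of tf.boolean_mask depends on how many entries of
    # `mask` are True, so it cannot be inferred from the input shapes alone.
    return tf.reduce_sum(tf.boolean_mask(x, mask))

x = tf.random.normal((4, 3))
mask = tf.constant([True, False, True, False])
try:
    print(masked_sum(x, mask).numpy())
except Exception as err:  # some TF/XLA versions reject this at compile time
    print(f"XLA compilation failed: {err}")
```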
This is enough to run fits on GPU with TF > 2.16, at least on my systems. I can run 120 replicas (it took me a few iterations of CUDA drivers to make it work).
This disables XLA also on CPU (if it is active, which at the moment I think it isn't by default).
I would keep running with 2.15, since we know for sure that one works, and I've only tested on one system for now (python 3.12, TF 2.17, RTX 3070).
I don't notice any degradation of performance, but this GPU only has 8 GB of RAM, so maybe I was already bottlenecked by memory before.
Note that before TF 2.16, XLA compilation was disabled by default. The funny thing is that if you enable it for TF 2.15 we also see some problems (even on CPU).
I don't know whether this is a fundamental problem with XLA (and thus there is nothing we can do) or whether there is a problem in our code that makes XLA not work.
If someone wants to investigate, probably the best thing to do is to try TF 2.15 and set JIT_COMPILE=True, because from 2.16 onwards it will work but then just crash due to the memory leak.
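For reference, a minimal sketch of how the flag can be pinned explicitly when compiling a Keras model (this is not the actual change in this PR; the toy model and parameters are placeholders), so the behaviour is the same regardless of the TF/Keras version's default:

```python
# Sketch only: set jit_compile explicitly instead of relying on the default,
# which was off before TF 2.16 (as noted above) and changed with Keras 3.
import tensorflow as tf

inputs = tf.keras.Input(shape=(4,))
hidden = tf.keras.layers.Dense(8, activation="tanh")(inputs)
outputs = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.Model(inputs, outputs)

# Use jit_compile=True on TF 2.15 to try to reproduce the XLA problems
# mentioned above, or jit_compile=False to keep XLA compilation disabled.
model.compile(optimizer="adam", loss="mse", jit_compile=True)
```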