Disable jit compilation in tf > 2.16 #2135
Conversation
Force-pushed from e7fd0ca to 6498f0f
Here's a report: https://vp.nnpdf.science/8vyjWCvZRvSzu0JGbLbErQ==/ The computer used to run this fit was the RTX 3070 mentioned below; it took about 60 minutes for 120 replicas. I'll merge once the tests are passing (there has been some trouble with the CERN server and some PDFs were not downloaded from LHAPDF), since this changes nothing. Also, I'm relatively proud of this stability taking into account the amount of things that have changed between these two fits (including the fact that after 4.0.9 we are no longer backwards compatible by law) :) |
Do you know why the training length distribution looks so different? |
The specific pattern looks more different than it is, because by chance the fits stopped in a way that made them flatter. But the fits that arrive at the last bin (which is the only relevant one) are ~25 vs ~17. That said, for some reason (and with only N=2 fits) the GPU seems to produce flatter distributions. I don't know whether there's a reason: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==/ or whether it is related to the change in the positivity datasets |
I should have done a CPU fit with that runcard and master at some point, but let me quickly redo it to be sure. |
We already discussed it, but for later reference let me put here the results of the fit with the current version of the master branch, but on CPU and using the same runcard as in your fit: https://vp.nnpdf.science/C0EQblACS2qzaMbpNLuePQ==/ As expected, the TL distribution looks different, but it would be interesting to see how the TL distribution changes on GPU if the pseudodata sampling seed is changed, as in both your fits this is the same. |
Good eye! Indeed, there was a problem with the multi-replica stopping, where a replica could become active again after it had stopped. Here's the same fit with this corrected: https://vp.nnpdf.science/RWnazARaSb6TbKMtkxONGA== It doesn't seem to have an effect on the fit (in this report even the seed for the pseudodata is different), but it's good that it has been caught :) @Cmurilochem in the same manner, it should not impact your hyperopt runs, but if you need to run new ones, you'd better use this branch, I guess. |
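As an illustration of the kind of bug described above (this is a toy sketch, not the actual n3fit stopping code, and the per-epoch check below is a hypothetical stand-in for the real patience-based criterion): once a replica has triggered its stopping condition it should never be counted as active again, which can be enforced with a monotonically shrinking boolean mask.

```python
# Toy sketch only, not the nnpdf implementation: keep a per-replica "active"
# mask and make it monotonically decreasing, so a replica that has stopped
# can never become active again even if its validation loss improves later.
import numpy as np

n_replicas = 4
active = np.ones(n_replicas, dtype=bool)

def update_active(active, passes_stopping_check):
    """passes_stopping_check[i] is True if replica i may keep training this epoch.

    The logical AND guarantees monotonicity: once False, always False.
    """
    return active & passes_stopping_check

# Example: replica 1 fails the check at some epoch...
active = update_active(active, np.array([True, False, True, True]))
# ...and stays inactive even if a later check would pass for it again.
active = update_active(active, np.array([True, True, True, True]))
print(active)  # [ True False  True  True]
```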
For good measure, a fit using the same seeds as the first one: https://vp.nnpdf.science/CyjXLYyARtKfWK1iUnJrYg==/ This is ready to merge. |
Force-pushed from 5721909 to 2120355
Greetings from your nice fit 🤖!
Check the report carefully, and please buy me a ☕, or better, a GPU 😉! |
Also with the same runcard as your first (240726-jcm-004 wasn't uploaded so I couldn't compare to that): https://vp.nnpdf.science/rr476H7nT9GjOb6WBasw7Q==/
It took 2:15h and 25 GB on a V100.
Anyway, looks good to me |
They are the same. I'm surprised my puny RTX was faster, though. I wonder why. |
From the XLA docs: "All operations must have inferrable shapes. Note that because XLA is a JIT compiler, the shapes can vary across runs, as long as they can be inferred given the inputs to the cluster. So this example is fine." Could it be that the masking layers give rise to 'unpredictable shape' tensors? |
Then it should crash at compilation time (so either way it is a bug on their side). If, for fun, you want to look further into this, I suggest this as a starting point: #2137 |
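As a hedged illustration of the "unpredictable shape" hypothesis (I don't know whether the n3fit masking layers actually work this way): an op like tf.boolean_mask produces a tensor whose shape depends on the mask values, which is exactly the kind of shape XLA cannot infer statically. Wrapping such an op in a jit-compiled function is a quick way to probe how a given TF version reacts.

```python
# Minimal probe, not the nnpdf code: does a data-dependent-shape op survive
# XLA JIT compilation on this TF version?
import tensorflow as tf

@tf.function(jit_compile=True)
def masked_sum(x, mask):
    # The output shape of tf.boolean_mask depends on how many entries of
    # `mask` are True, so it cannot be inferred from the input shapes alone.
    return tf.reduce_sum(tf.boolean_mask(x, mask))

x = tf.random.normal((4, 3))
mask = tf.constant([True, False, True, False])
try:
    print(masked_sum(x, mask).numpy())
except Exception as err:  # some TF/XLA versions reject this at compile time
    print(f"XLA compilation failed: {err}")
```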
This is enough to run fits on GPU with TF > 2.16, at least on my systems. I can run 120 replicas (it took me a few iterations of CUDA drivers to make it work).
This disables XLA also on CPU (if it is active, which at the moment I think it isn't by default).
I would keep running with 2.15, since we know for sure that one works, and I've only tested on one system for now (python 3.12, TF 2.17, RTX 3070).
I don't notice any degradation of performance, but this GPU only has 8 GB of RAM, so maybe I was already bottlenecked by memory before.
Note that before TF 2.16, XLA compilation was disabled by default. The funny thing is that if you enable it for TF 2.15 we also see some problems (even on CPU).
I don't know whether this is a fundamental problem with XLA (and thus there is nothing we can do) or whether there is a problem in our code that makes XLA not work.
If someone wants to investigate, probably the best thing to do is to try TF 2.15 and set JIT_COMPILE=True, because from 2.16 onwards it will work but then just crash due to the memory leak.
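For reference, a minimal sketch of how the flag can be pinned explicitly when compiling a Keras model (this is not the actual change in this PR; the toy model and parameters are placeholders), so the behaviour is the same regardless of the TF/Keras version's default:

```python
# Sketch only: set jit_compile explicitly instead of relying on the default,
# which was off before TF 2.16 (as noted above) and changed with Keras 3.
import tensorflow as tf

inputs = tf.keras.Input(shape=(4,))
hidden = tf.keras.layers.Dense(8, activation="tanh")(inputs)
outputs = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.Model(inputs, outputs)

# Use jit_compile=True on TF 2.15 to try to reproduce the XLA problems
# mentioned above, or jit_compile=False to keep XLA compilation disabled.
model.compile(optimizer="adam", loss="mse", jit_compile=True)
```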