
Fk refactor #1936

Merged: 8 commits merged into master on Mar 5, 2024
Conversation

@APJansen (Collaborator) commented on Feb 12, 2024

The idea

It's a relatively small change, affecting only the observable layers: it changes the order in which indices are contracted, and it switches from a boolean mask to a float mask.
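
To illustrate the change from a boolean to a float mask, a minimal sketch (shapes and names are illustrative, not the actual layer code): a boolean mask selects the active flavours with a gather-like op, while a 0/1 float mask can be folded directly into an einsum contraction.

    import tensorflow as tf

    pdf = tf.random.uniform((8, 14))                      # (x, flavour), illustrative shapes
    bool_mask = tf.constant([True] * 9 + [False] * 5)     # active flavours

    # boolean-mask style: gathers into a smaller tensor of dynamic shape
    masked_pdf = tf.boolean_mask(pdf, bool_mask, axis=1)  # (x, 9)

    # float-mask style: the mask becomes part of the contraction itself
    float_mask = tf.cast(bool_mask, tf.float32)
    zeroed_pdf = tf.einsum('xf,f->xf', pdf, float_mask)   # (x, 14), inactive flavours zeroed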

Performance

Timings (in seconds) for 1000 epochs of the main runcard (NNPDF40_nnlo_as_01180_1000), on Snellius, with the multi-replica runs on the GPU and the 1-replica run on the CPU. GPU memory used is given in brackets.

| branch | commit hash | 1 replica | 100 replicas | 500 replicas | 1000 replicas |
| --- | --- | --- | --- | --- | --- |
| multi-dense + trvl | 59e5b58 | 145 | 320 (16.8 GB) | x | x |
| fk-refactor | 67cd5f0 | 122 | 176 (4.5 GB) | 423 (16.8 GB) | x |
| fk-refactor (precompute) | 22ef7b0 | 175 | 100 | | |
| fk-refactor (enforce order einsum) | 1a751c2 | 175 | 90 | | |
| fk-refactor (fix 1 replica) | 16551f6 | 118 | 90 | | |

Profile

[profiling screenshot omitted]

The validation step will be addressed in #1855 and the gaps in #1802.

@APJansen self-assigned this on Feb 12, 2024
@APJansen added the n3fit (Issues and PRs related to n3fit), performance, and escience labels on Feb 12, 2024
This was referenced on Feb 13, 2024
@APJansen force-pushed the fk-refactor branch 2 times, most recently from 5ff498f to d794992 on February 22, 2024
@APJansen force-pushed the fk-refactor branch 2 times, most recently from ad94132 to 5d742c9 on February 22, 2024
@APJansen marked this pull request as ready for review on February 23, 2024
@APJansen force-pushed the fk-refactor branch 2 times, most recently from 9375dec to 16551f6 on February 26, 2024
@scarlehoff (Member) left a comment:

I guess that if you run 1 replica on a GPU it would be faster, right?

(otherwise it makes no sense that running 1 replica would be slower than running 100).

Review thread on n3fit/src/n3fit/layers/DIS.py (outdated, resolved)
Review thread on n3fit/src/n3fit/layers/DY.py (outdated, resolved)
# the masked convolution removes the batch dimension
ret = op.transpose(self.operation(results))
return op.batchit(ret)
self.compute_observable = compute_observable
@scarlehoff (Member) commented:

Same comment as before: having these functions as module-level functions would be great.

I would also expect a speed-up and memory reduction since they will be shared by different observables... You might want to put a @tf.function decorator on them.

... unless the speed-up is coming from a memory trade-off by having these functions be observable-specific? But I would hope not...
If that were the case, you can still compile them when you attach them to the given layer by doing

self.compute_observable = tf.function(compute_observable)
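
For reference, a minimal sketch of what the module-level + tf.function variant suggested here could look like (the function name and the einsum signature are assumptions, not the PR's code):

    import tensorflow as tf

    # module level, shared by all observables of this type
    def _compute_observable(pdf, padded_fk):
        # pdf: (batch, replica, x, flavour); padded_fk: (ndata, x, flavour)
        return tf.einsum('brxf,nxf->brn', pdf, padded_fk)

    # inside the layer's build(), compile when attaching:
    # self.compute_observable = tf.function(_compute_observable)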

@APJansen (Collaborator, author) replied:

I think having a decorator or not shouldn't matter, as they're already used in a call which should get compiled, but I can do a test.
I don't think the speedup has to do with them being observable-specific, at least not intentionally.

Review thread on n3fit/src/n3fit/layers/observable.py (resolved)
Review thread on n3fit/src/n3fit/layers/observable.py (outdated, resolved)
Review thread on n3fit/src/n3fit/layers/observable.py (outdated, resolved)
@@ -42,36 +43,104 @@ def __init__(self, fktable_data, fktable_arr, operation_name, nfl=14, **kwargs):
super(MetaLayer, self).__init__(**kwargs)

self.nfl = nfl
self.num_replicas = None # set in build
self.compute_observable = None # set in build
@scarlehoff (Member) commented:

Add a comment about compute_observable being a function with signature (pdf, masked_pdf) that needs to be overwritten by any children of this class.

Actually, maybe it makes sense to make compute_observable an abstract method, and then in DIS.py and DY.py the choice of which (outside) function to use becomes:

def compute_observable(self, pdf, fk):
    if self._one_replica:
        return _compute_dis_observable_one_replica(pdf, fk)
    return _compute_dis_observable_many_replica(pdf, fk)
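
For concreteness, a hedged sketch of what the two module-level DIS implementations referred to here might look like (shapes, names and exact contractions are assumptions, not the PR's code): the many-replica path contracts a zero-padded fk table with einsum, while the one-replica path works in the reduced flavour basis with tensordot.

    import tensorflow as tf

    def _compute_dis_observable_many_replica(pdf, padded_fk):
        # pdf: (batch, replica, x, flavour); padded_fk: (ndata, x, flavour)
        return tf.einsum('brxf,nxf->brn', pdf, padded_fk)

    def _compute_dis_observable_one_replica(pdf, masked_fk):
        # pdf: (1, 1, x, flavour_subset); masked_fk: (ndata, x, flavour_subset)
        observable = tf.tensordot(pdf[0][0], masked_fk, axes=[(0, 1), (1, 2)])  # -> (ndata,)
        return observable[tf.newaxis, tf.newaxis, :]  # restore batch and replica dims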

@APJansen (Collaborator, author) replied:

I added the comment; I wanted to avoid these if statements in the call. It probably doesn't matter, but I thought it looked cleaner. I can make it an abstract method if you prefer.

@APJansen (Collaborator, author) commented on Mar 5, 2024

Thanks for the comments, I addressed them all. I will do some timings after these changes, with vs without a decorator on compute_observable, as well as 1 replica on the GPU.

@APJansen (Collaborator, author) commented on Mar 5, 2024

I have no idea why, but the 1 replica case became significantly faster: 92s (vs 118 before, both on CPU).
100 replicas on the GPU now take 92 seconds as well, slightly more than before, but most likely just random variation.

1 replica on the GPU takes 65 seconds, indeed faster than 100 of course, but the scaling is great.

The @tf.function decorators don't make a difference in performance.

@scarlehoff (Member) commented:

> I have no idea why, but the 1 replica case became significantly faster: 92s (vs 118 before, both on CPU).

It might be able to compile the functions differently now? Maybe by playing with the tf.function options (https://www.tensorflow.org/api_docs/python/tf/function) you might be able to get it even faster.
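
For instance, a minimal example of the kind of tf.function options that could be tried (untested here, and not part of the PR):

    import tensorflow as tf

    compiled = tf.function(
        _compute_dis_observable_many_replica,  # any of the observable functions above
        jit_compile=True,                      # let XLA compile the contraction
        reduce_retracing=True,                 # trace more generically to avoid retraces
    )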

> 1 replica on the GPU takes 65 seconds, indeed faster than 100 of course, but the scaling is great.

Indeed, a factor of 100 in exchange for a factor of 1.5!!!!

@scarlehoff (Member) left a comment:

It looks good, thank you!

I have just one question... why is masking the PDF instead of the fktable faster in the case of one replica? Could this also have been a fluke of the CPU profiling?

It might be worthwhile to test again because part of the complexity here is coming from that. If that limitation is lifted this would be super clear and elegant!!

Same operations as above but a specialized implementation that is more efficient for 1 replica,
masking the PDF rather than the fk table.
"""
# TODO: check if that is actually true
@scarlehoff (Member) commented:

Since you are already testing the timings, and we are already free from the yoke of 4.0.9 (which was supposed to be 7... so only two versions out, not bad)... why don't you remove the conditional and try to use the "multireplica" version in all of them?

@APJansen (Collaborator, author) replied on Mar 5, 2024:

Good point, I just tested using the many-replica version in all the observables. On the GPU it's actually faster, 53 seconds. On the CPU it's way slower, 330 seconds.

There are two factors here: one is just einsum vs tensordot, the other is the order of contractions for DIS.

Masking the PDF has the downside that it cannot be precomputed, but it has the upside that it reduces the number of flavours. Masking the fk table only has to be done once, but it enlarges it (perhaps masking is not the best word in this case, it's expanding it into all flavours with zeroes).

I've also tested the approach of masking the fk table but using tensordot for 1 replica on the CPU:

def compute_dy_observable_one_replica(pdf, masked_fk):
    pdf = pdf[0][0]  # yg
    fk_pdf = op.tensor_product(masked_fk, pdf, axes=[(3, 4), (0, 1)])  # nxfyg, yg -> nxf
    observable = op.tensor_product(fk_pdf, pdf, axes=[(1, 2), (0, 1)])  # nxf, xf -> n
    return op.batchit(op.batchit(observable))  # brn

This takes 177 seconds on CPU, and again 53 on the GPU.

I don't fully understand these timings, but unfortunately we'll have to keep the branching, unless you're prepared to do away with CPU runs completely, but I don't think that's the case.

Edit, for clarity: 1-replica timings in seconds (CPU \ GPU)

| | einsum | tensordot |
| --- | --- | --- |
| mask pdf | - | 92 \ 65 |
| mask fk | 330 \ 53 | 177 \ 53 |
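
For contrast with the padded-fk snippet above, a hedged sketch (names, shapes and the exact mask handling are assumptions, not the PR's actual code) of the PDF-masking variant for one replica in the DY case, where the PDF is first projected onto the fk table's reduced flavour basis so the contractions run over fewer flavours:

    def compute_dy_observable_one_replica_mask_pdf(pdf, mask, fk):
        # pdf: (batch, replica, y, 14); mask: (g, 14); fk: (n, x, f, y, g), f and g in the reduced basis
        pdf = pdf[0][0]                                                            # (y, 14)
        pdf_masked = op.tensor_product(mask, pdf, axes=[(1,), (1,)])               # project onto reduced basis: (g, y)
        pdf_masked = op.transpose(pdf_masked)                                      # yg
        fk_pdf = op.tensor_product(fk, pdf_masked, axes=[(3, 4), (0, 1)])          # nxfyg, yg -> nxf
        observable = op.tensor_product(fk_pdf, pdf_masked, axes=[(1, 2), (0, 1)])  # nxf, xf -> n
        return op.batchit(op.batchit(observable))                                  # brn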

@scarlehoff (Member) replied:

> Masking the fk table only has to be done once, but it enlarges it (perhaps masking is not the best word in this case, it's expanding it into all flavours with zeroes).

Ah, OK, I was misled by the word "masking". From this everything else makes sense.

Sadly we cannot avoid CPU runs because that will still be what most people use.
Could you add this (basically this comment, as a code comment) at the top of the DIS.py module, for instance, together with the version of TensorFlow that you used?

We can then revisit it in the future. The important thing is that in the 1-replica case using einsum adds a factor of 2 (and the multi-replica case needs einsum). And since we already have the branching, we might as well mask the PDF for an extra factor of two.

Thank you for these checks.

@APJansen (Collaborator, author) replied:

Added something, is that what you meant?
Perhaps I should change masked_fk to something like padded_fk?

I also did a quick check with multi-replica: using the observable code as it is now (so tensordot and masking the pdf), but using einsum in multidense, it comes out at 115 seconds, so also slower.

@APJansen (Collaborator, author) commented on Mar 5, 2024

I've rewritten it as padded_fk, it's clearer right?

@scarlehoff (Member) replied:

Even better. IMHO mask was OK once the clarification was added.

@scarlehoff added the redo-regressions (Recompute the regression data) label on Mar 5, 2024
@scarlehoff commented:

Oh, I thought this was tracking master. Is it fine for you if I rebase this one on master, and then the fix for 2.16 on this one?

@APJansen (Collaborator, author) commented on Mar 5, 2024

Yep that's fine, I rebased on master a while ago, but not recently.

APJansen and others added 2 commits on March 5, 2024: "Add timing comment" and "additional comment".
Co-authored-by: Juan M. Cruz-Martinez <juacrumar@lairen.eu>
@scarlehoff added the redo-regressions (Recompute the regression data) and run-fit-bot (Starts fit bot from a PR) labels and removed the redo-regressions label on Mar 5, 2024
github-actions bot commented on Mar 5, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@scarlehoff merged commit 23844cb into master on Mar 5, 2024
8 checks passed
@scarlehoff deleted the fk-refactor branch on March 5, 2024
@scarlehoff mentioned this pull request on Mar 5, 2024
Labels: escience, n3fit, performance, redo-regressions, run-fit-bot