remove OverridenKVCache and fix some peculiar cases of prims.copy_ + NVFuser #788
I have a fix that appears to resolve all our issues...
The fix is in! @jjsjann123, @crcrpar, @tfogal, could you please also have a look and tell me whether you approve the fix in the NVFuser fusion logic?
Thanks. I may have missed something in the motivation for dropping the test option. Could you educate me on where you're coming from here?
For the nvfuser aspect, in general I think our fusion logic should be revisited. This seems like an extension in the same vein as what's there, and as such seems reasonable. I also think that the higher-level rework is a big can of worms, as we're presently experiencing with the small-scale change of bookends. So I'm inclined to say we should just go with what's here and accrue a little bit more debt pending a larger rewrite.
But it's more reasonable for @jjsjann123 to make that call than me.
    # NOTE: filter all first "dangling" no-op copies
    while len(bsyms) > 0 and bsyms[0].sym.id is prims.PrimIDs.COPY_:
        fused_bsyms.append(bsyms[0])
        bsyms = bsyms[1:]
- It seems we're looking at the absolute order of operations, rather than a dataflow order. But maybe we don't have infra to do so via dataflow?
- Should logic like this go into _should_fuse, above, instead of here? @jjsjann123
Oh, side note: could I ask you to have the comment define what "dangling" means here?
I had a similar idea. Tried the easiest solution first... I suspect the usage of prims.copy_ is very structured, and issues do appear only at fusion boundaries. But I totally agree - I'd rather move it into the data-flow logic...
@tfogal, after some thought, I believe the fix is legit. In #656 I fixed the issue of bsym groups not being topologically sorted (horizontally) when they are formed. Because an in-place op is implemented as op + copy, if the op is not claimed, the problematic copy (or copies) can only appear at the top of a fusion group. So there is no need to follow the dataflow order: it is already implicitly encoded by the bsyms being topologically sorted both horizontally and vertically (only for the op + copy combo, of course, and their positions in the graph).
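To make the argument above concrete, here is a minimal toy sketch of peeling leading copies off a group. `ToyBoundSymbol` and `split_leading_copies` are hypothetical names invented for illustration; they are not part of the actual codebase, and the real logic operates on bound symbols and `prims.PrimIDs.COPY_` as in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyBoundSymbol:
    """Hypothetical stand-in for a bound symbol: an op name plus inputs/outputs."""
    op: str
    inputs: tuple = ()
    outputs: tuple = ()


def split_leading_copies(bsyms, copy_op="copy_"):
    """Split a group into (leading copies, remainder).

    Only copies at the very top of the group are peeled off; a copy that
    follows a non-copy op stays in the remainder.
    """
    n = 0
    while n < len(bsyms) and bsyms[n].op == copy_op:
        n += 1
    return list(bsyms[:n]), list(bsyms[n:])


# Toy group: the first copy's producing op is assumed to live outside the group.
group = [
    ToyBoundSymbol("copy_", ("t0", "x"), ("x",)),
    ToyBoundSymbol("add", ("x", "y"), ("t1",)),
    ToyBoundSymbol("copy_", ("t1", "y"), ("y",)),
]
leading, rest = split_leading_copies(group)
```

Here `leading` holds only the first copy, while the trailing copy stays grouped with the `add` that produces its source. This relies on the ordering assumption stated above; it does not inspect dataflow.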
Yeah, that's exactly the kind of thing I was worried about :-)
As you're highlighting, we have a lot of implicit assumptions here on the sorting of ops because we are not relying on dataflow. This requirement implicitly depends on the behavior of earlier code, and there's really not a strong reason for that earlier behavior to exist (at least before this).
This is a good definition of fragility--when disconnected, distant pieces of code impose specific requirements on one another.
One of the reasons to ping Jie to get his thoughts is that maybe this fragility is fine here: if our thought is that we're going to scrap much of the fusion logic entirely in a rewrite, a little bit of extra debt here probably is not worth worrying about.
Luckily, it only concerns the usage of prims.copy_, which, as it appears, is a helper symbol that is not user-facing. The logic does nothing if/when prims.copy_ is removed and if/when we re-implement the in-place logic as something different. The current logic of fusing based on just the parent->child relation is limiting, as it does not consider horizontal relationships (nodes on the same level), but that got fixed in #656. Luckily, if the input program is valid, it is already topologically sorted both horizontally and vertically, so it only makes sense to preserve this structure and exploit it if possible (including fusions, fusion regions, and/or any valid subset of a program). The logic here appears to me to be a good and simple example of such exploitation... But I agree, I should add a comment and "prove" correctness.
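The ordering property this comment relies on can be stated as a small check. The sketch below is an assumption-laden toy: a program is a plain list of `(op, inputs, outputs)` tuples, names never produced by any op are treated as external inputs, and `is_topologically_sorted` is a hypothetical helper, not an existing function in the project.

```python
def is_topologically_sorted(program):
    """Return True if every op only reads values that are already available.

    `program` is a list of (op_name, inputs, outputs) tuples. A name that no
    op produces is treated as an external input and is always available.
    """
    produced_anywhere = {name for _, _, outputs in program for name in outputs}
    available = set()
    for _, inputs, outputs in program:
        for name in inputs:
            # An internally produced value must appear before its consumer.
            if name in produced_anywhere and name not in available:
                return False
        available.update(outputs)
    return True


valid_program = [
    ("mul", ("a", "b"), ("t0",)),
    ("add", ("t0", "a"), ("t1",)),
]
# Same ops in reverse order: "add" reads t0 before "mul" produces it.
invalid_program = list(reversed(valid_program))

valid_ok = is_topologically_sorted(valid_program)
invalid_ok = is_topologically_sorted(invalid_program)
```

Any contiguous slice of a program that passes this check also passes it, which is the sense in which a fusion region can inherit the ordering of the whole program.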
+1 from me but I'd prefer we wait for Jie's thoughts before merging.
There goes my chance to merge without review.
Thank you @nikitaved @tfogal
As per title, we can remove OverridenKVCache in our litgpt tests since the missing op, index_copy_, is in the system.
Additionally, this PR addresses:
- index_copy_: incorrect with CUDA inputs #789
- prims.copy_ with NVFuser: sometimes it has to be a kernel. #791