[ntuple] properly support incremental merging with Union mode #17563
Conversation
Test Results: 0 tests, 0 ✅, 0s ⏱️. Results for commit bbcd1b6.
When serializing the header and header extension, we currently assume that no header extension exists yet the first time we serialize. While this is true in most cases, it is not when we incrementally merge an existing RNTuple. With this fix we correctly handle that case by skipping all columns belonging to the header extension when serializing the regular header.
- verify that we can correctly serialize/deserialize a header that already contains a header extension before the first serialization
- verify that we can correctly serialize/deserialize a header that contains deferred columns in the non-extension header
Force-pushed from c867a32 to 9469b92.
I have some minor comments but looks good in general! However I don't have the full RNTupleMerger picture so an extra set of eyes will be necessary.
@@ -1399,6 +1410,8 @@ public:
   const RNTupleDescriptor &GetDescriptor() const { return fDescriptor; }
   RNTupleDescriptor MoveDescriptor();

   void CreateFromSchema(const RNTupleDescriptor &descriptor);
To me this method name is a bit vague; perhaps something like SetSchemaFromExisting?
tree/ntuple/v7/src/RNTupleMerger.cxx (Outdated)
return R__FAIL("Passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
               "merging) can only be done if you provided a valid RNTupleModel");
- return R__FAIL("Passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
-                "merging) can only be done if you provided a valid RNTupleModel");
+ return R__FAIL("passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
+                "merging) can only be done with a valid RNTupleModel");
I prefer the longer version because it might not be immediately clear where this "valid RNTupleModel" should be provided. In fact, I would make it even more obvious, e.g. "if you provided a valid RNTupleModel when constructing the RNTupleMerger", because it's a bit of a long-distance relation: you might have constructed the merger far away in the code from the place where you call Merge. I think it's helpful to make the user's life easier by being very verbose with this error.
My suggestion is mainly a stylistic one, stemming from the fact that all of the other error messages (as far as I'm aware) are written in the passive voice and non-capitalized. I agree with the verbosity! I would personally prefer to stay consistent with the phrasing style of the existing errors, but I'm also open to having my mind changed.
I understand your point and agree with it to an extent: I'm certainly fixing the non-capitalization, but I wouldn't know how to write this error in the passive voice without it sounding weird, and I don't believe that would add value in this case. I'd argue that for user-facing messages it's more important to be easy to read and act upon than to keep a super-consistent formal style (and the passive voice sometimes hinders readability, imho).
However, if you have ideas on how to rephrase this message, I'm open to suggestions!
Some comments inline. There are again many unrelated changes sprinkled into the various commits, which makes it very hard to follow. As a suggestion, it may be worth splitting the descriptor work into a separate PR and getting the cloning and serialization of the "nonstandard" header right before plugging it into the incremental merging (now that you have a working prototype and all the individual pieces figured out).
@@ -1603,3 +1603,60 @@ TEST(RNTupleMerger, MergeAsymmetric1TFileMerger)
}
}
}

TEST(RNTupleMerger, MergeIncrementalLMExt)
Will this test work at that point in the history? We should not have intermittently failing tests, so as not to break git-bisect.
// Incrementally merge the inputs
FileRaii fileGuard("test_ntuple_merge_incr_lmext.root");
fileGuard.PreserveFile();
const auto compression = 0;//ROOT::RCompressionSetting::EDefaults::kUseGeneralPurpose
Why is this commented out?
// We don't expose this publicly because when we add sharded clusters, this interface does not make sense anymore
DescriptorId_t FindClusterId(NTupleSize_t entryIdx) const;

/// Fills `into` with the schema information about this RNTuple, i.e. all the information needed to create
`into` isn't a parameter; is this comment up to date?
@@ -529,7 +530,7 @@ public:
   void UpdateExtraTypeInfo(const RExtraTypeInfoDescriptor &extraTypeInfo) final;

   /// Initialize sink based on an existing descriptor and fill into the descriptor builder.
-  void InitFromDescriptor(const RNTupleDescriptor &descriptor);
+  [[nodiscard]] std::unique_ptr<RNTupleModel> InitFromDescriptor(const RNTupleDescriptor &descriptor);
It's not only `nodiscard`, but the user must also ensure a lifetime longer than the sink. I'm not sure if it is possible to make this clear from the interface; did you consider passing the model as a parameter?
> did you consider passing the model as a parameter?

How would this change the lifetime problem compared to the return value? In the end we have no control over when the model gets destroyed on the caller's side either way, no?
Also, the lifetime must be at least as long as the sink, not necessarily longer (it's fine if they get dropped at the same time)
> How would this change the lifetime problem compared to the return value?

That's true. IMO it makes wrong use less likely than a unique_ptr return value, where you already had to apply a [[nodiscard]] to avoid one source of problems.
> the lifetime must be at least as long as the sink, not necessarily longer

Actually, I think we have to destroy the model before the sink because the fields are connected to the sink. We have to check RNTupleFillContext regarding the order of destruction.
- auto &buffer = sealedPageData.fBuffers.emplace_back(new unsigned char[bufSize]);
+ auto &buffer = sealedPageData.fBuffers.emplace_back(new std::uint8_t[bufSize]);
I think buffers in RNTuple generally have type unsigned char. Is there a reason to change this here, as part of a seemingly unrelated commit?
It was for consistency with the rest of the file, which uses std::uint8_t for all its buffers (which I personally like slightly more than unsigned char, but that's beside the point).
However, if we want consistency across all RNTuple code we should decide on that and change all the other uses (though I'd say in a separate PR).
Regardless of how we decide, I'm against "sneaking in" these changes in other commits.
   writer->Fill();
}
writer->Fill();
I think this is already a mistake in the first addition of the test and should be fixed there (or the entire commit should move here; see the earlier question).
Marking as draft because I'll split it into 2 PRs as suggested by @hahnjo
Split into 3 separate PRs, in order:
Based on #17559. Opening as a draft because I'd like to add some more unit tests.
This is the last(*) PR of a series that adds proper support for incremental merging with Union mode, i.e. proper handling of deferred columns in the Merger.
With this change we should be able to support ATLAS-like workflows where the TFileMerger is used to incrementally construct an RNTuple from a sequence of files that may contain additional fields compared to the already-merged ones.
(*) at least for a minimal working case - more edge cases remain to be tested and likely fixed.
A brief overview
When Union-merging files containing compatible-but-different models (e.g. the second file contains an additional field), we resort to Late Model Extension in the RNTupleMerger and therefore produce a merged RNTuple containing deferred columns in the header extension.
When we incrementally merge a new file, we need to read the metadata from the existing RNTuple, preserve the header extension as-is, and possibly apply Late Model Extension again to the new model created from this metadata. This requires some care, because we need to explicitly keep all the extended fields and columns in the header extension when writing the file back (rather than merging them into the regular header).
This requires dropping the serializer API's implicit assumption that no header extension is present when the header is serialized for the first time.
Checklist:
cc @amete who is probably interested, as he provided the ParallelFileMerger example linked above