
[ntuple] properly support incremental merging with Union mode #17563

Closed
wants to merge 9 commits

Conversation

silverweed (Contributor) commented:

Based on #17559. Opening as a draft because I'd like to add some more unit tests.

This is the last(*) PR of a series that adds proper support for incremental merging with Union mode, i.e. proper handling of deferred columns in the Merger.
With this change we should be able to support ATLAS-like workflows where the TFileMerger is used to incrementally construct an RNTuple from a sequence of files that may contain additional fields compared to the already-merged ones.

(*) at least for a minimal working case - more edge cases remain to be tested and likely fixed.

A brief overview

When Union-merging files containing compatible-but-different models (e.g. the second file contains an additional field), we resort to Late Model Extension in the RNTupleMerger and therefore produce a merged RNTuple containing deferred columns in the header extension.
When we incrementally merge a new file, we need to read the metadata from the existing RNTuple, preserve the header extension as-is and possibly Late Model Extend again the new model created from this metadata. This requires some care because we need to explicitly keep all the extended fields and columns in the header extension when writing the file back (rather than merging them in the regular header).
This requires dropping the implicit assumption in our serializer API that assumes that no header extension is present when serializing the header the first time.

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

cc @amete who is probably interested, as he provided the ParallelFileMerger example linked above


github-actions bot commented Jan 29, 2025

Test Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit bbcd1b6.

♻️ This comment has been updated with latest results.

When serializing the header and header extension we currently assume that,
the first time we serialize them, there is no header extension yet.
While this is true in most cases, it does not hold when we are incrementally
merging an existing RNTuple.
This fix handles that case correctly by skipping all columns belonging
to the header extension when serializing the regular header.
- verify that we can correctly serialize/deserialize a header that
  already contains a header extension before the first serialization
- verify that we can correctly serialize/deserialize a header that
  contains deferred columns in the non-extension header
@silverweed silverweed marked this pull request as ready for review January 31, 2025 10:37
@silverweed silverweed requested a review from jblomer as a code owner January 31, 2025 10:37
enirolf (Contributor) left a comment:


I have some minor comments, but it looks good in general! However, I don't have the full RNTupleMerger picture, so an extra set of eyes will be necessary.

tree/ntuple/v7/src/RNTupleSerialize.cxx
@@ -1399,6 +1410,8 @@ public:
const RNTupleDescriptor &GetDescriptor() const { return fDescriptor; }
RNTupleDescriptor MoveDescriptor();

void CreateFromSchema(const RNTupleDescriptor &descriptor);
Contributor:
To me this method name is a bit vague; perhaps something like `SetSchemaFromExisting`?

tree/ntuple/v7/src/RNTupleMerger.cxx
Comment on lines 952 to 953
return R__FAIL("Passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
"merging) can only be done if you provided a valid RNTupleModel");
Contributor:
Suggested change:
- return R__FAIL("Passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
-                "merging) can only be done if you provided a valid RNTupleModel");
+ return R__FAIL("passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
+                "merging) can only be done with a valid RNTupleModel");

silverweed (Contributor, author):
I prefer the longer version because it might not be immediately clear where this "valid RNTupleModel" should be provided. In fact, I think I want to make it even more obvious, e.g. "if you provided a valid RNTupleModel when constructing the RNTupleMerger", because it's a bit of a long-distance relation: you might have constructed the merger far away in the code from the place where you call Merge. I think it's helpful to make the user's life easier by being very verbose with this error.

enirolf (Contributor):

My suggestion is mainly a stylistic one, stemming from the fact that all of the other error messages (as far as I'm aware) are written in the passive voice and non-capitalized. I agree with the verbosity! I would personally prefer to keep consistent with the phrasing style of the errors, but I'm also open to having my mind changed.

silverweed (Contributor, author) commented Feb 3, 2025:

I understand your point and agree with it to an extent: I'm certainly fixing the non-capitalization, but I wouldn't know how to write this error in a way that doesn't sound weird in the passive voice, and I don't believe it would add value in this case. I'd argue that for user-facing messages it's more important to be easy to read and act upon than to keep a super-consistent formal style (and the passive voice sometimes hinders readability, imho).
However, if you have ideas on how to rephrase this message, I'm open to suggestions!

hahnjo (Member) left a comment:

Some comments inline. There are again many unrelated changes sprinkled into the various commits, which makes them very hard to follow. As a suggestion, it may be worth splitting out the descriptor work into a separate PR and getting the cloning and serialization of the "nonstandard" header right before plugging it into the incremental merging (now that you have a working prototype and all the individual pieces figured out).

@@ -1603,3 +1603,60 @@ TEST(RNTupleMerger, MergeAsymmetric1TFileMerger)
}
}
}

TEST(RNTupleMerger, MergeIncrementalLMExt)
hahnjo (Member):

Will this test work at this point in the history? We should not have intermittently failing tests, so as not to break git-bisect.

// Incrementally merge the inputs
FileRaii fileGuard("test_ntuple_merge_incr_lmext.root");
fileGuard.PreserveFile();
const auto compression = 0;//ROOT::RCompressionSetting::EDefaults::kUseGeneralPurpose;
hahnjo (Member):

commented?


// We don't expose this publicly because when we add sharded clusters, this interface does not make sense anymore
DescriptorId_t FindClusterId(NTupleSize_t entryIdx) const;

/// Fills `into` with the schema information about this RNTuple, i.e. all the information needed to create
hahnjo (Member):
`into` isn't a parameter; is this comment up-to-date?

@@ -529,7 +530,7 @@ public:
void UpdateExtraTypeInfo(const RExtraTypeInfoDescriptor &extraTypeInfo) final;

/// Initialize sink based on an existing descriptor and fill into the descriptor builder.
void InitFromDescriptor(const RNTupleDescriptor &descriptor);
[[nodiscard]] std::unique_ptr<RNTupleModel> InitFromDescriptor(const RNTupleDescriptor &descriptor);
hahnjo (Member):

It's not only nodiscard: the user must also ensure a lifetime longer than the sink's. I'm not sure it is possible to make this clear from the interface; did you consider passing the model as a parameter?

silverweed (Contributor, author):

> did you consider passing the model as a parameter?

How would this change the lifetime problem compared to the return value? In the end, we have no control over when the model gets destroyed on the caller's side either way, no?

silverweed (Contributor, author):

Also, the lifetime must be at least as long as the sink's, not necessarily longer (it's fine if they get dropped at the same time).

hahnjo (Member):

> How would this change the lifetime problem compared to the return value?

That's true. IMO it makes it less likely to be used wrongly than a unique_ptr return value, where you already had to apply a [[nodiscard]] to avoid one source of problems.

> the lifetime must be at least as long as the sink, not necessarily longer

Actually, I think we have to destroy the model before the sink, because the fields are connected to the sink. We have to check RNTupleFillContext regarding the order of destruction.

tree/ntuple/v7/src/RNTupleMerger.cxx
Comment on lines -729 to +731
- auto &buffer = sealedPageData.fBuffers.emplace_back(new unsigned char[bufSize]);
+ auto &buffer = sealedPageData.fBuffers.emplace_back(new std::uint8_t[bufSize]);
hahnjo (Member):

I think buffers in RNTuple generally have type unsigned char. Is there a reason to change this here, as part of a seemingly unrelated commit?

silverweed (Contributor, author) commented Feb 3, 2025:

It was for consistency with the rest of the file, which uses std::uint8_t for all its buffers (which I personally like slightly more than unsigned char, but that's beside the point).
However, if we want consistency across all RNTuple code, we should decide on that and change all the other uses (though I'd say in a separate PR).

hahnjo (Member):

Regardless of how we decide, I'm against "sneaking in" these changes in other commits.

Comment on lines 1640 to 1647
writer->Fill();
}
writer->Fill();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already a mistake in the first addition of the test and should be moved there (or the entire commit should be here; see earlier question)

@silverweed silverweed requested a review from enirolf February 3, 2025 09:07
@silverweed silverweed marked this pull request as draft February 3, 2025 09:09
silverweed (Contributor, author):

Marking as draft because I'll split it into 2 PRs, as suggested by @hahnjo.

silverweed (Contributor, author):

Split into 3 separate PRs, in order:

@silverweed silverweed closed this Feb 4, 2025

3 participants