[ntuple] properly support incremental merging with Union mode #17563
Conversation
Test Results: 0 tests, 0 ✅, 0s ⏱️. Results for commit bbcd1b6.
When serializing the header and header extension, we currently assume that no header extension exists yet the first time we serialize. While this is true in most cases, it is not when we incrementally merge an existing RNTuple. With this fix we correctly handle that case by skipping all columns belonging to the header extension when serializing the regular header.
- verify that we can correctly serialize/deserialize a header that already contains a header extension before the first serialization
- verify that we can correctly serialize/deserialize a header that contains deferred columns in the non-extension header
Force-pushed from c867a32 to 9469b92.
I have some minor comments but looks good in general! However I don't have the full RNTupleMerger picture so an extra set of eyes will be necessary.
@@ -1399,6 +1410,8 @@ public:
   const RNTupleDescriptor &GetDescriptor() const { return fDescriptor; }
   RNTupleDescriptor MoveDescriptor();

   void CreateFromSchema(const RNTupleDescriptor &descriptor);
To me this method name is a bit vague; perhaps something like SetSchemaFromExisting?
tree/ntuple/v7/src/RNTupleMerger.cxx (Outdated)
return R__FAIL("Passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
               "merging) can only be done if you provided a valid RNTupleModel");
- return R__FAIL("Passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
-                "merging) can only be done if you provided a valid RNTupleModel");
+ return R__FAIL("passing an already-initialized destination to RNTupleMerger (i.e. trying to do incremental "
+                "merging) can only be done with a valid RNTupleModel");
I prefer the longer version because it might not be immediately clear where this "valid RNTupleModel" should be provided. In fact, I would make it even more obvious, e.g. "if you provided a valid RNTupleModel when constructing the RNTupleMerger", because it's a bit of a long-distance relation: you might have constructed the merger far away in the code from the place where you call Merge. I think it's helpful to make the user's life easier by being very verbose with this error.
My suggestion is mainly a stylistic one, stemming from the fact that all of the other error messages (as far as I'm aware) are written in the passive voice and non-capitalized. I agree with the verbosity! I would personally prefer to stay consistent with the phrasing style of the existing errors, but I'm also open to having my mind changed.
I understand your point and agree with it to an extent: I'm certainly fixing the non-capitalization, but I wouldn't know how to write this error in the passive voice without it sounding weird, and I don't believe that would add value in this case. I'd argue that for user-facing messages it's more important to be easy to read and act upon than to keep a super-consistent formal style (and the passive voice sometimes hinders readability, imho).
However, if you have ideas on how to rephrase this message, I'm open to suggestions!
Some comments inline. There are again many unrelated changes sprinkled into the various commits, which makes it very hard to follow. As a suggestion, it may be worth splitting the descriptor work into a separate PR and getting the cloning and serialization of the "nonstandard" header right before plugging it into the incremental merging (now that you have a working prototype and all the individual pieces figured out).
@@ -1603,3 +1603,60 @@ TEST(RNTupleMerger, MergeAsymmetric1TFileMerger)
}
}
}

TEST(RNTupleMerger, MergeIncrementalLMExt)
Will this test work at that point in the history? We should not have intermittently failing tests, so as not to break git-bisect.
// Incrementally merge the inputs
FileRaii fileGuard("test_ntuple_merge_incr_lmext.root");
fileGuard.PreserveFile();
const auto compression = 0;//ROOT::RCompressionSetting::EDefaults::kUseGeneralPurpose
Why is this commented out?
// We don't expose this publicly because when we add sharded clusters, this interface does not make sense anymore
DescriptorId_t FindClusterId(NTupleSize_t entryIdx) const;

/// Fills `into` with the schema information about this RNTuple, i.e. all the information needed to create
`into` isn't a parameter; is this comment up to date?
@@ -529,7 +530,7 @@ public:
   void UpdateExtraTypeInfo(const RExtraTypeInfoDescriptor &extraTypeInfo) final;

   /// Initialize sink based on an existing descriptor and fill into the descriptor builder.
-  void InitFromDescriptor(const RNTupleDescriptor &descriptor);
+  [[nodiscard]] std::unique_ptr<RNTupleModel> InitFromDescriptor(const RNTupleDescriptor &descriptor);
It's not only `nodiscard`, but the user must also ensure a lifetime longer than the sink. I'm not sure if it is possible to make this clear from the interface; did you consider passing the model as a parameter?
> did you consider passing the model as a parameter?

How would this change the lifetime problem compared to the return value? In the end we have no control over when the model gets destroyed on the caller's side either way, no?
Also, the lifetime must be at least as long as the sink, not necessarily longer (it's fine if they get dropped at the same time)
> How would this change the lifetime problem compared to the return value?

That's true. IMO it makes wrong use less likely than a unique_ptr return value, where you already had to apply a [[nodiscard]] to avoid one source of problems.
> the lifetime must be at least as long as the sink, not necessarily longer

Actually, I think we have to destroy the model before the sink because the fields are connected to the sink. We have to check RNTupleFillContext regarding the order of destruction.
- auto &buffer = sealedPageData.fBuffers.emplace_back(new unsigned char[bufSize]);
+ auto &buffer = sealedPageData.fBuffers.emplace_back(new std::uint8_t[bufSize]);
I think buffers in RNTuple generally have type unsigned char. Is there a reason to change this here, as part of a seemingly unrelated commit?
It was for consistency with the rest of the file, which uses std::uint8_t for all its buffers (which I personally like slightly more than unsigned char, but that's beside the point).
However, if we want consistency across all RNTuple code we should decide on that and change all the other uses (though I'd say in a separate PR).
Regardless of how we decide, I'm against "sneaking in" these changes in other commits.
   writer->Fill();
}
writer->Fill();
I think this is already a mistake in the first addition of the test and should be fixed there (or the entire commit should move here; see the earlier question).
Marking as draft because I'll split it into 2 PRs as suggested by @hahnjo
Split into 3 separate PRs, in order:
Based on #17559. Opening as a draft because I'd like to add some more unit tests.
This is the last(*) PR of a series that adds proper support for incremental merging with Union mode, i.e. proper handling of deferred columns in the Merger.
With this change we should be able to support ATLAS-like workflows where the TFileMerger is used to incrementally construct an RNTuple from a sequence of files that may contain additional fields compared to the already-merged ones.
(*) at least for a minimal working case - more edge cases remain to be tested and likely fixed.
A brief overview
When Union-merging files containing compatible-but-different models (e.g. the second file contains an additional field), we resort to Late Model Extension in the RNTupleMerger and therefore produce a merged RNTuple containing deferred columns in the header extension.
When we incrementally merge a new file, we need to read the metadata from the existing RNTuple, preserve the header extension as-is, and possibly apply Late Model Extension again to the new model created from this metadata. This requires some care, because we need to explicitly keep all the extended fields and columns in the header extension when writing the file back (rather than merging them into the regular header).
This requires dropping the serializer API's implicit assumption that no header extension is present when the header is serialized for the first time.
Checklist:
cc @amete who is probably interested, as he provided the ParallelFileMerger example linked above