[ntuple] properly support incremental merging with Union mode #17563
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -548,15 +548,26 @@ private: | |
/// Free text from the user | ||
std::string fDescription; | ||
|
||
std::uint64_t fOnDiskHeaderXxHash3 = 0; ///< Set by the descriptor builder when deserialized | ||
DescriptorId_t fFieldZeroId = kInvalidDescriptorId; ///< Set by the descriptor builder | ||
|
||
std::uint64_t fNPhysicalColumns = 0; ///< Updated by the descriptor builder when columns are added | ||
std::uint64_t fOnDiskHeaderSize = 0; ///< Set by the descriptor builder when deserialized | ||
|
||
std::set<unsigned int> fFeatureFlags; | ||
std::unordered_map<DescriptorId_t, RFieldDescriptor> fFieldDescriptors; | ||
std::unordered_map<DescriptorId_t, RColumnDescriptor> fColumnDescriptors; | ||
|
||
std::vector<RExtraTypeInfoDescriptor> fExtraTypeInfoDescriptors; | ||
std::unique_ptr<RHeaderExtension> fHeaderExtension; | ||
|
||
//// All fields above are part of the schema and are cloned when creating a new descriptor from a given one | ||
//// (see CloneSchema()) | ||
|
||
std::uint64_t fOnDiskHeaderXxHash3 = 0; ///< Set by the descriptor builder when deserialized | ||
std::uint64_t fOnDiskFooterSize = 0; ///< Like fOnDiskHeaderSize, contains both cluster summaries and page locations | ||
|
||
std::uint64_t fNEntries = 0; ///< Updated by the descriptor builder when the cluster groups are added | ||
std::uint64_t fNClusters = 0; ///< Updated by the descriptor builder when the cluster groups are added | ||
std::uint64_t fNPhysicalColumns = 0; ///< Updated by the descriptor builder when columns are added | ||
|
||
DescriptorId_t fFieldZeroId = kInvalidDescriptorId; ///< Set by the descriptor builder | ||
|
||
/** | ||
* Once constructed by an RNTupleDescriptorBuilder, the descriptor is mostly immutable except for set of | ||
|
@@ -567,9 +578,6 @@ private: | |
*/ | ||
std::uint64_t fGeneration = 0; | ||
|
||
std::set<unsigned int> fFeatureFlags; | ||
std::unordered_map<DescriptorId_t, RFieldDescriptor> fFieldDescriptors; | ||
std::unordered_map<DescriptorId_t, RColumnDescriptor> fColumnDescriptors; | ||
std::unordered_map<DescriptorId_t, RClusterGroupDescriptor> fClusterGroupDescriptors; | ||
/// References cluster groups sorted by entry range and thus allows for binary search. | ||
/// Note that this list is empty during the descriptor building process and will only be | ||
|
@@ -578,12 +586,15 @@ private: | |
/// May contain only a subset of all the available clusters, e.g. the clusters of the current file | ||
/// from a chain of files | ||
std::unordered_map<DescriptorId_t, RClusterDescriptor> fClusterDescriptors; | ||
std::vector<RExtraTypeInfoDescriptor> fExtraTypeInfoDescriptors; | ||
std::unique_ptr<RHeaderExtension> fHeaderExtension; | ||
|
||
// We don't expose this publicly because when we add sharded clusters, this interface does not make sense anymore | ||
DescriptorId_t FindClusterId(NTupleSize_t entryIdx) const; | ||
|
||
/// Fills `into` with the schema information about this RNTuple, i.e. all the information needed to create | ||
/// a new RNTuple with the same schema as this one but not necessarily the same clustering. This is used | ||
/// when merging two RNTuples. | ||
RNTupleDescriptor CloneSchema() const; | ||
|
||
public: | ||
static constexpr unsigned int kFeatureFlagTest = 137; // Bit reserved for forward-compatibility testing | ||
|
||
|
@@ -1399,6 +1410,8 @@ public: | |
const RNTupleDescriptor &GetDescriptor() const { return fDescriptor; } | ||
RNTupleDescriptor MoveDescriptor(); | ||
|
||
void CreateFromSchema(const RNTupleDescriptor &descriptor); | ||
To me this method name is a bit vague, perhaps something like |
||
|
||
void SetNTuple(const std::string_view name, const std::string_view description); | ||
void SetFeature(unsigned int flag); | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -281,6 +281,9 @@ protected: | |
/// with the page source, we leave it up to the derived class whether or not the compressor gets constructed. | ||
std::unique_ptr<RNTupleCompressor> fCompressor; | ||
|
||
/// Flag if sink was initialized | ||
bool fIsInitialized = false; | ||
|
||
/// Helper for streaming a page. This is commonly used in derived, concrete page sinks. Note that if | ||
/// compressionSetting is 0 (uncompressed) and the page is mappable and not checksummed, the returned sealed page | ||
/// will point directly to the input page buffer. Otherwise, the sealed page references an internal buffer | ||
|
@@ -289,8 +292,6 @@ protected: | |
RSealedPage SealPage(const RPage &page, const RColumnElementBase &element); | ||
|
||
private: | ||
/// Flag if sink was initialized | ||
bool fIsInitialized = false; | ||
std::vector<Callback_t> fOnDatasetCommitCallbacks; | ||
std::vector<unsigned char> fSealPageBuffer; ///< Used as destination buffer in the simple SealPage overload | ||
|
||
|
@@ -529,7 +530,7 @@ public: | |
void UpdateExtraTypeInfo(const RExtraTypeInfoDescriptor &extraTypeInfo) final; | ||
|
||
/// Initialize sink based on an existing descriptor and fill into the descriptor builder. | ||
void InitFromDescriptor(const RNTupleDescriptor &descriptor); | ||
[[nodiscard]] std::unique_ptr<RNTupleModel> InitFromDescriptor(const RNTupleDescriptor &descriptor); | ||
It's not only
How would this change the lifetime problem compared to the return value? In the end we have no control over when the model gets destroyed on the caller's side either way, no?
Also, the lifetime must be at least as long as the sink's, not necessarily longer (it's fine if they get dropped at the same time).
That's true. IMO it makes it less likely to be used wrongly than a unique_ptr return value where you had to apply a
Actually, I think we have to destroy the model before the sink because the fields are connected to the sink. We have to check |
||
|
||
void CommitSuppressedColumn(ColumnHandle_t columnHandle) final; | ||
void CommitPage(ColumnHandle_t columnHandle, const RPage &page) final; | ||
|
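The lifetime question raised in the review thread on `InitFromDescriptor` can be illustrated with a small standalone sketch. It is not ROOT code: `Sink`, `Model`, `Owner`, `InitModel` and `DestructionOrder` are hypothetical stand-ins, and the sketch assumes the model really does touch the sink during destruction, which the thread says still has to be checked. It relies only on the C++ rule that non-static data members are destroyed in reverse order of declaration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a page sink; it only records its own destruction.
struct Sink {
   std::vector<std::string> &fLog;
   explicit Sink(std::vector<std::string> &log) : fLog(log) {}
   ~Sink() { fLog.push_back("sink"); }
};

// Hypothetical stand-in for a model whose fields stay connected to the sink.
struct Model {
   Sink &fSink;
   explicit Model(Sink &sink) : fSink(sink) {}
   // Assumption: destruction still touches the sink, so the sink must be alive here.
   ~Model() { fSink.fLog.push_back("model"); }
};

// Hypothetical factory with the same shape as the new return type:
// [[nodiscard]] asks the compiler to warn if the caller drops the model.
[[nodiscard]] std::unique_ptr<Model> InitModel(Sink &sink)
{
   return std::make_unique<Model>(sink);
}

// Members are destroyed in reverse declaration order, so declaring the sink
// first guarantees that the model is destroyed before the sink.
struct Owner {
   std::unique_ptr<Sink> fSink;
   std::unique_ptr<Model> fModel;
   explicit Owner(std::vector<std::string> &log)
      : fSink(std::make_unique<Sink>(log)), fModel(InitModel(*fSink))
   {
   }
};

// Builds and destroys an Owner, then reports the recorded destruction order.
std::string DestructionOrder()
{
   std::vector<std::string> log;
   {
      Owner owner(log);
   } // owner goes out of scope here
   std::string result;
   for (const auto &entry : log)
      result += (result.empty() ? "" : ",") + entry;
   return result;
}
```

With this declaration order, `DestructionOrder()` returns `"model,sink"`. Swapping the two members in `Owner` would destroy the sink first and leave the model's destructor with a dangling reference, which is the failure mode the last comment in the thread is concerned about.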
`into` isn't a parameter, is this comment up-to-date?