jackapet

@jackapet jackapet commented Sep 25, 2025

This Pull request:

Add support for THnSparseD histograms in RDF

Changes or fixes:

  • Add THnSparseDModel to HistoModels
  • Add HistoNSparseD functions to RInterface
  • Remove the deletion of the THnSparse copy constructor: THnSparse(const THnSparse&) = delete;

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

Fixes #19969

@ferdymercury
Collaborator

ferdymercury commented Sep 25, 2025

Thanks a lot!
Optional improvements:

  • Adapt bindings/distrdf/python/DistRDF/Operation.py
  • Adding tests in roottest/python/distrdf/backends/check_reducer_merge.py, bindings/distrdf/test/test_operation.py, tree/dataframe/test/dataframe_histomodels.cxx, tree/dataframe/test/dataframe_merge_results.cxx, tree/dataframe/test/dataframe_cloning.cxx, tree/dataframe/test/dataframe_vary.cxx
  • Adding documentation in tree/dataframe/inc/ROOT/RDF/RMergeableValue.hxx, tree/dataframe/src/RDataFrame.cxx and README/ReleaseNotes/v626/index.md (or rather, copy-paste it into v638)

@jackapet jackapet changed the title Add support for THnSparseD histogram in RDataFrame Fixes #19969 Sep 25, 2025
@jackapet jackapet changed the title Fixes #19969 Fixes Add support for THnSparseD histogram in RDataFrame #19969 Sep 25, 2025
@jackapet jackapet changed the title Fixes Add support for THnSparseD histogram in RDataFrame #19969 Fixes #19969 Sep 25, 2025
@jackapet jackapet changed the title Fixes #19969 Add support for THnSparseD histogram in RDataFrame Sep 25, 2025
@jackapet
Author

Thanks @ferdymercury ,

I will try to follow your instructions.

I noticed in my local tests that both THnSparseD and THnD fail with multithreading. Both seem to work fine with a single thread.

@ferdymercury
Collaborator

THnSparseD and THnD too do not work with multithreading

See https://root-forum.cern.ch/t/filling-histograms-in-parallel/35460

Collaborator

@ferdymercury ferdymercury left a comment

Thanks a lot! Feel free to click on the button "Ready to Review" so that core developers are notified.

@jackapet
Author

THnSparseD and THnD too do not work with multithreading

See https://root-forum.cern.ch/t/filling-histograms-in-parallel/35460

I see, thank you for the advice.

@jackapet
Author

I implemented all the required changes. I have a few comments/questions:

  • I am not able to run the tests locally because I do not have the necessary software installed. Is there a Docker image I can use for development/testing?
  • In the documentation, I was not able to determine the links to specific lines in the HTML file, e.g. [HistoNSparseD](https://root.cern/doc/master/classROOT_1_1RDF_1_1RInterface.html#a5f3e2f0a3d1c8e4f0e2f3e7f0e8c6b7a). The link was generated by Copilot and I do not know if it is correct.

@jackapet jackapet marked this pull request as ready for review September 25, 2025 13:49
Collaborator

@ferdymercury ferdymercury left a comment

Concerning the link to the html file, leave it empty for the moment.
Thanks.

@ferdymercury
Collaborator

One further comment: commit messages usually start with the involved component:

[math] fix sth
[RDF] improve Rdataframe...
[NFC] document better
(NFC signals changes that are not relevant from the code point of view, i.e. no functional change)

@jackapet
Author

One further comment: commit messages usually start with the involved component:

[math] fix sth [RDF] improve Rdataframe... [NFC] document better (NFC is a signal for changes that are not relevant from the code point)

OK, I added [RDF] or [NFC] prefixes to the commits.

@ferdymercury
Collaborator

ferdymercury commented Sep 25, 2025

Thanks again!

In README/ReleaseNotes/v638/index.md
you could add:
- Add HistoNSparseD action that fills a sparse N-dimensional histogram.

  • You could add yourself as a contributor to the list in that file.

Nitpick regarding the commit messages: [hist] is usually written in lowercase, in contrast to [RDF] for RDataFrame.

@martamaja10
Contributor

Hi @jackapet, thank you very much for this contribution! It looks good to me at the first glance, so I will let the CI run and once we see those results I will also give you a more detailed review. Thanks to @ferdymercury for giving the first reviews and hints as well!

@martamaja10 martamaja10 self-assigned this Sep 26, 2025

github-actions bot commented Sep 26, 2025

Test Results

    21 files      21 suites   3d 22h 58m 50s ⏱️
 3 681 tests  3 677 ✅ 0 💤  4 ❌
75 475 runs  75 397 ✅ 0 💤 78 ❌

For more details on these failures, see this check.

Results for commit d8a9b0b.

♻️ This comment has been updated with latest results.

@martamaja10
Contributor

martamaja10 commented Sep 26, 2025

Hi @jackapet, so unfortunately there is an issue with your changes at the ROOT building stage. Have you tried building ROOT locally with all these changes? If you go through the logs of the CI workers, for example, you can see the following:

/Users/sftnight/ROOT-CI/src/tree/dataframe/test/dataframe_histomodels.cxx:363:31: error: no matching constructor for initialization of '::THnSparseD' (aka 'THnSparseT<TArrayD>')
  363 |    auto h1e = d.HistoNSparseD(::THnSparseD("h1e", "h1e", 4, nbinse, edges), {"x0", "x1", "x2", "x3"});
      |                               ^            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/sftnight/ROOT-CI/src/hist/hist/inc/THnSparse.h:72:4: note: candidate inherited constructor not viable: no known conversion from 'std::vector<std::vector<double>>' to 'const Double_t *' (aka 'const double *') for 5th argument
   72 |    THnSparse(const char* name, const char* title, Int_t dim,
      |    ^
   73 |              const Int_t* nbins, const Double_t* xmin = nullptr, const Double_t* xmax = nullptr,
      |                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/sftnight/ROOT-CI/src/hist/hist/inc/THnSparse.h:213:21: note: constructor from base class 'THnSparse' inherited here
  213 |    using THnSparse::THnSparse;
      |                     ^
/Users/sftnight/ROOT-CI/src/hist/hist/inc/THnSparse.h:75:4: note: candidate inherited constructor not viable: requires at most 4 arguments, but 5 were provided
   75 |    THnSparse(const char* name, const char* title,
      |    ^         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   76 |              const std::vector<TAxis>& axes,
      |              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   77 |              Int_t chunksize = 1024 * 16);
      |              ~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/sftnight/ROOT-CI/src/hist/hist/inc/THnSparse.h:213:21: note: constructor from base class 'THnSparse' inherited here
  213 |    using THnSparse::THnSparse;
      |                     ^
/Users/sftnight/ROOT-CI/src/hist/hist/inc/THnSparse.h:37:7: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 5 were provided
   37 | class THnSparse: public THnBase {
      |       ^~~~~~~~~
/Users/sftnight/ROOT-CI/src/hist/hist/inc/THnSparse.h:213:21: note: constructor from base class 'THnSparse' inherited here
  213 |    using THnSparse::THnSparse;
      |                     ^
/Users/sftnight/ROOT-CI/src/tree/dataframe/inc/ROOT/RDF/HistoModels.hxx:24:7: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 5 were provided
   24 | class THnSparseT;
      |       ^~~~~~~~~~
/Users/sftnight/ROOT-CI/src/tree/dataframe/inc/ROOT/RDF/HistoModels.hxx:24:7: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 5 were provided
   24 | class THnSparseT;
      |       ^~~~~~~~~~
/Users/sftnight/ROOT-CI/src/tree/dataframe/inc/ROOT/RDF/HistoModels.hxx:24:7: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 5 were provided
1 error generated.
make[2]: *** [tree/dataframe/test/CMakeFiles/dataframe_histomodels.dir/dataframe_histomodels.cxx.o] Error 1
make[1]: *** [tree/dataframe/test/CMakeFiles/dataframe_histomodels.dir/all] Error 2

This will need to be fixed first.

@jackapet
Author

jackapet commented Sep 26, 2025

Hello @martamaja10 ,

I am able to compile the code without tests.

I was not able to compile or run the tests because I cannot install the required system packages.

I am using Windows with WSL2. The ROOT build fails while compiling RVec:

/mnt/c/cern/root_dev/build/include/ROOT/RVec.hxx:516:49: error: use of undeclared identifier 'R__HARDWARE_INTERFERENCE_SIZE'
   static constexpr std::size_t cacheLineSize = R__HARDWARE_INTERFERENCE_SIZE;

I guess this macro is not defined under WSL2. Is there a workaround for this?

When running on a cluster, I am able to compile ROOT, but I can't install libraries. So I was able to check RDataFrame with my own local script, but not with the required tests.

Could you provide instructions on how to compile the tests? Do you have a Singularity/Docker image for developers with all the required libraries installed? I can access CVMFS if it helps. Otherwise, I will try to build my own image.

@ferdymercury
Collaborator

When calling cmake, did you enable the flags -Dtesting=ON -Droottest=ON?

(Building is enough, no need to install)

@ferdymercury
Copy link
Collaborator

The failure is likely because there is something equivalent to

THn::THn(const char *name, const char *title, Int_t dim, const Int_t *nbins,

that needs to be implemented in THnSparse.h and .cxx
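As a rough, hypothetical illustration in plain C++ (not ROOT code; `AxisSpec` and `MakeAxesFromEdges` are made-up names), a constructor that accepts per-dimension bin edges, like the `std::vector<std::vector<double>>` passed in the failing test, mainly has to check that each dimension provides nbins + 1 strictly increasing edges and then build one variable-width axis per dimension:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a variable-width axis (in ROOT this role is played by TAxis).
struct AxisSpec {
   int nbins;
   std::vector<double> edges; // nbins + 1 strictly increasing bin edges
};

// Hypothetical helper: validate per-dimension edges against nbins and
// build one axis per dimension. Throws std::invalid_argument on bad input.
std::vector<AxisSpec> MakeAxesFromEdges(int dim, const int *nbins,
                                        const std::vector<std::vector<double>> &xbins)
{
   if (dim <= 0 || static_cast<std::size_t>(dim) != xbins.size())
      throw std::invalid_argument("dim must match the number of edge vectors");

   std::vector<AxisSpec> axes;
   axes.reserve(xbins.size());
   for (int i = 0; i < dim; ++i) {
      const auto &e = xbins[i];
      // Each dimension needs exactly nbins + 1 edges.
      if (nbins[i] <= 0 || e.size() != static_cast<std::size_t>(nbins[i]) + 1)
         throw std::invalid_argument("axis " + std::to_string(i) + ": expected nbins + 1 edges");
      // Edges must be strictly increasing.
      for (std::size_t j = 1; j < e.size(); ++j)
         if (!(e[j - 1] < e[j]))
            throw std::invalid_argument("axis " + std::to_string(i) + ": edges must increase");
      axes.push_back(AxisSpec{nbins[i], e});
   }
   return axes;
}
```

One possible design, assuming it fits ROOT's conventions, is to turn the edges into TAxis objects this way and delegate to the existing std::vector<TAxis> constructor listed in the CI log above.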

@jackapet
Copy link
Author

When calling cmake, did you enable the flags -Dtesting=ON -Droottest=ON?

(Building is enough, no need to install)

OK, thank you. I am recompiling it now. I made an Apptainer image with the definition below. I hope all the required libraries are now installed.

Bootstrap: docker
From: ubuntu:24.04

%post
    # Refresh the package lists
    apt-get -y update
    # Build prerequisites for ROOT
    apt-get -y install binutils cmake dpkg-dev g++ gcc libssl-dev git libx11-dev libxext-dev libxft-dev libxpm-dev python3 libtbb-dev libvdt-dev libgif-dev
    # Optional dependencies enabling additional ROOT features
    apt-get -y install gfortran libpcre3-dev libglu1-mesa-dev libglew-dev libftgl-dev libfftw3-dev libcfitsio-dev libgraphviz-dev libavahi-compat-libdnssd-dev libldap2-dev python3-dev python3-numpy libxml2-dev libkrb5-dev libgsl-dev qtwebengine5-dev nlohmann-json3-dev libmysqlclient-dev libgl2ps-dev liblzma-dev libxxhash-dev liblz4-dev libzstd-dev
    # GoogleTest/GoogleMock for the C++ unit tests
    apt-get -y install libgtest-dev libgmock-dev
    # pytest for the Python tests
    apt-get -y install python3-pytest

@jackapet
Author

The failure is likely because there is something equivalent to

THn::THn(const char *name, const char *title, Int_t dim, const Int_t *nbins,

that needs to be implemented in THnSparse.h and .cxx

I see, good point. The THnD and THnSparseD constructors are inconsistent, and I suggest making them consistent in this PR: I can add the missing constructor to THnSparseD, and I can also add to THnD the std::vector<TAxis>-based constructor that THnSparseD already has.

@ferdymercury
Copy link
Collaborator

hope all libraries are now installed.
Thanks.
Check also the packages file here:
https://github.com/root-project/root-ci-images/tree/main/ubuntu2404

@jackapet
Author

hope all libraries are now installed.
Thanks.
Check also the packages file here:
https://github.com/root-project/root-ci-images/tree/main/ubuntu2404

Thanks, this will help.

I managed to add the missing constructor, and I can now compile with tests. However, I still need to install a few missing packages to be able to run the Python tests.

@martamaja10
Copy link
Contributor

hope all libraries are now installed.
Thanks.
Check also the packages file here:
https://github.com/root-project/root-ci-images/tree/main/ubuntu2404

Thanks, this will help.

I managed to add the missing constructor, and I can now compile with tests. However, I still need to install a few missing packages to be able to run the Python tests.

Thank you! Let's see what CI says now :)

@jackapet
Author

Hi @martamaja10 , @ferdymercury ,

I was able to prepare an Apptainer image containing the same packages as listed in root-ci-images (#19975 (comment)).

However, I am still not able to run the tests locally because I do not know how to execute them.

Should I run some setup script for the tests? Should I execute them using root -x -q -b something.C? Or is there some executable or Python script?

@ferdymercury
Copy link
Collaborator

Should I run some setup script for the tests? Should I execute them using root -x -q -b something.C? Or is there some executable or Python script?

From the build directory, you can run

ctest -R root-io-filemerger

(specify the name or the subset of test names you want to run, e.g. root-io-*; -R accepts a regular expression)

@jackapet
Copy link
Author

jackapet commented Sep 29, 2025

I ran all tests selected by ctest -R root-io-. Two tests were skipped; all other tests passed.

I also added a test to THn.cxx that checks all constructors. This test passes too.

About the failing test in check_reducer_merge: it is related to multithreading, since THnSparse cannot run in MT mode. The test fails with:

malloc_consolidate(): unaligned fastbin chunk detected
 *** Break *** abort
 Generating stack trace...
malloc_consolidate(): unaligned fastbin chunk detected
 *** Break *** abort

Should we remove this test?

@jackapet
Copy link
Author

@ferdymercury , @martamaja10 ,

I was looking at the THnSparse implementation. There is a lot of manual memory management using new, malloc, memcpy, delete and similar operations, because the implementation is based on raw arrays such as Int_t*.

I would suggest rewriting it using std::unordered_map and std::vector. This way we can avoid any direct memory manipulation. We can also implement it without chunks, which are causing trouble, and leave the optimization to std::unordered_map.
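A minimal sketch of that direction, assuming equal-width bins and no under/overflow handling (the class `SparseHistN` is hypothetical and far simpler than THnSparse): only filled bins are stored, in a `std::unordered_map` keyed by the linearized global bin index, so there is no manual memory management and no chunking.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical minimal sparse N-dimensional histogram.
// Equal-width bins per dimension, no under/overflow bins.
class SparseHistN {
public:
   SparseHistN(std::vector<int> nbins, std::vector<double> xmin, std::vector<double> xmax)
      : fNbins(std::move(nbins)), fXmin(std::move(xmin)), fXmax(std::move(xmax))
   {
      if (fNbins.size() != fXmin.size() || fNbins.size() != fXmax.size())
         throw std::invalid_argument("dimension mismatch");
   }

   // Returns false (and stores nothing) if any coordinate is outside [xmin, xmax).
   bool Fill(const std::vector<double> &x, double w = 1.)
   {
      std::int64_t idx;
      if (!GlobalBin(x, idx))
         return false;
      fContent[idx] += w; // operator[] value-initializes a new bin to 0.0
      return true;
   }

   double GetBinContent(const std::vector<double> &x) const
   {
      std::int64_t idx;
      if (!GlobalBin(x, idx))
         return 0.;
      auto it = fContent.find(idx);
      return it == fContent.end() ? 0. : it->second;
   }

   std::size_t GetNbinsFilled() const { return fContent.size(); }

private:
   // Maps coordinates to a row-major linearized bin index.
   bool GlobalBin(const std::vector<double> &x, std::int64_t &idx) const
   {
      if (x.size() != fNbins.size())
         throw std::invalid_argument("wrong number of coordinates");
      idx = 0;
      for (std::size_t d = 0; d < x.size(); ++d) {
         if (!(x[d] >= fXmin[d] && x[d] < fXmax[d]))
            return false;
         int bin = static_cast<int>((x[d] - fXmin[d]) / (fXmax[d] - fXmin[d]) * fNbins[d]);
         if (bin >= fNbins[d])
            bin = fNbins[d] - 1; // guard against rounding just below xmax
         idx = idx * fNbins[d] + bin;
      }
      return true;
   }

   std::vector<int> fNbins;
   std::vector<double> fXmin, fXmax;
   std::unordered_map<std::int64_t, double> fContent; // only filled bins
};
```

Because every member is a standard container, the compiler-generated copy constructor already makes a deep copy, which is exactly the property that raw-pointer members lack. Whether such a design could match the performance of the chunked implementation is not tested here.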

The error messages in the dataframe cloning test indicate that some internal object is deleted twice. It seems that some object is shared after cloning, so two deletions happen when the original and the cloned objects are deleted.

TExMap fBinsContinued; ///<! Filled bins for non-unique hashes, containing pairs of (bin index 0, bin index 1)
THnSparseCompactBinCoord *fCompactCoord; ///<! Compact coordinate

THnSparse(const THnSparse&) = delete;
Collaborator

One question: what is the reason for undeleting this constructor?
Could this be what leads to the failures seen in the CI?

Author

The constructor is undeleted because it is required by RResultPtr. In other words, an object without this constructor cannot be used by RDataFrame.

Perhaps this constructor requires an explicit implementation.

@ferdymercury
Copy link
Collaborator

I would suggest rewriting it using std::unordered_map and std::vector. This way we can avoid any direct memory manipulation. We can also implement it without chunks, which are causing trouble, and leave the optimization to std::unordered_map.

I guess that since it's old code, the trend is to not touch it too much and apply minimal fixes to solve encountered issues/bugs. (Unless there is a significant performance improvement by switching to unordered_map).

Question: does the crash also happen without the additions of this PR? That is, if you run valgrind --leak-check=full --suppressions=$ROOTSYS/etc/valgrind-root.supp root.exe -l mytest.C+
In other words, does the "unaligned fastbin chunk detected" crash predate the changes in the PR, meaning that the new test just added more coverage?
If the issue comes from multithreading, then I guess the only solution is to add an R__LOCKGUARD like the one you see in TChain.cxx, or to remove the multi-threaded test.
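R__LOCKGUARD is ROOT's own locking macro. As a standard-C++ analogue only (`std::mutex` with `std::lock_guard`; `NonThreadSafeCounter` and `LockedCounter` are hypothetical stand-ins for the histogram), serializing every fill through a single mutex could look like this:

```cpp
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an object whose Fill() is not thread-safe:
// concurrent insertions into std::map would be a data race.
class NonThreadSafeCounter {
public:
   void Fill(int bin) { ++fCounts[bin]; }
   long Total() const
   {
      long sum = 0;
      for (const auto &kv : fCounts)
         sum += kv.second;
      return sum;
   }

private:
   std::map<int, long> fCounts;
};

// Wrapper that serializes every call through one mutex,
// similar in spirit to guarding a critical section with R__LOCKGUARD.
class LockedCounter {
public:
   void Fill(int bin)
   {
      std::lock_guard<std::mutex> lock(fMutex); // unlocked automatically at scope exit
      fImpl.Fill(bin);
   }
   long Total() const
   {
      std::lock_guard<std::mutex> lock(fMutex);
      return fImpl.Total();
   }

private:
   mutable std::mutex fMutex;
   NonThreadSafeCounter fImpl;
};

// Each of nThreads threads performs nPerThread fills into bins i % 100.
long FillInParallel(unsigned nThreads, int nPerThread)
{
   LockedCounter counter;
   std::vector<std::thread> workers;
   for (unsigned t = 0; t < nThreads; ++t)
      workers.emplace_back([&counter, nPerThread] {
         for (int i = 0; i < nPerThread; ++i)
            counter.Fill(i % 100);
      });
   for (auto &w : workers)
      w.join();
   return counter.Total();
}
```

The trade-off is that all threads contend on one lock, so this restores correctness rather than parallel speed-up.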

@jackapet
Copy link
Author

jackapet commented Sep 30, 2025

I would suggest rewriting it using std::unordered_map and std::vector. This way we can avoid any direct memory manipulation. We can also implement it without chunks, which are causing trouble, and leave the optimization to std::unordered_map.

I guess that since it's old code, the trend is to not touch it too much and apply minimal fixes to solve encountered issues/bugs. (Unless there is a significant performance improvement by switching to unordered_map).

Question: does the crash also happen without the additions of this PR? That is, if you run valgrind --leak-check=full --suppressions=$ROOTSYS/etc/valgrind-root.supp root.exe -l mytest.C+ In other words, does the "unaligned fastbin chunk detected" crash predate the changes in the PR, meaning that the new test just added more coverage? If the issue comes from multithreading, then I guess the only solution is to add an R__LOCKGUARD like the one you see in TChain.cxx, or to remove the multi-threaded test.

The crash happens in the new tests added in this PR. All failures are related to THnSparse usage in RDataFrame. Two things do not work: running with multithreading, and lazy cloning via ROOT::Internal::RDF::CloneResultAndAction (which leads to a double deletion).

Multithreading leads to failures in gtest-tree-dataframe-dataframe-merge-results and check_reducer_merge.
Cloning leads to failures in gtest-tree-dataframe-dataframe-cloning and gtest-tree-dataframe-dataframe-vary.

I will try running it in a debugger to identify where exactly the code crashes. Perhaps that will give some hints on how to fix it.

@jackapet
Copy link
Author

I think I understand the issue now.

THnSparse has a raw pointer as a member variable. When we copy or clone the object, we only copy the pointer, so both objects point to the same location in memory. This leads to a crash when the original and the cloned objects are deleted, and the bin contents can also be wrong. Maybe the multithreading issues are caused by this as well.

So, we need an explicit implementation of the copy constructor (The one which is undeleted).
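The failure mode and the fix can be sketched in plain C++ with hypothetical types (`CompactCoord` here only mimics the role of THnSparse's fCompactCoord member; it is not ROOT's class): a copy constructor that merely copies the pointer leaves two owners deleting the same object, whereas a deep copy gives each copy its own allocation.

```cpp
#include <algorithm>
#include <cstddef>
#include <type_traits>

// Hypothetical helper object owned through a raw pointer.
struct CompactCoord {
   explicit CompactCoord(std::size_t n) : fSize(n), fData(new int[n]()) {}
   // Deep copy of the owned buffer.
   CompactCoord(const CompactCoord &other) : fSize(other.fSize), fData(new int[other.fSize])
   {
      std::copy(other.fData, other.fData + fSize, fData);
   }
   CompactCoord &operator=(const CompactCoord &) = delete;
   ~CompactCoord() { delete[] fData; }

   std::size_t fSize;
   int *fData;
};

// Hypothetical owner. With an implicit (pointer-copying) copy constructor,
// the original and the copy would both delete the same CompactCoord.
class SparseOwner {
public:
   explicit SparseOwner(std::size_t dim) : fCoord(new CompactCoord(dim)) {}
   // Deep copy: allocate a new CompactCoord instead of sharing the pointer.
   SparseOwner(const SparseOwner &other)
      : fCoord(other.fCoord ? new CompactCoord(*other.fCoord) : nullptr)
   {
   }
   SparseOwner &operator=(const SparseOwner &) = delete; // out of scope for this sketch
   ~SparseOwner() { delete fCoord; }

   CompactCoord *Coord() { return fCoord; }
   const CompactCoord *Coord() const { return fCoord; }

private:
   CompactCoord *fCoord;
};

static_assert(std::is_copy_constructible<SparseOwner>::value,
              "SparseOwner must be copy-constructible");
```

Following the rule of three, the real class also has to decide what copy assignment does; the sketch simply deletes it.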

@ferdymercury
Copy link
Collaborator

So, we need an explicit implementation of the copy constructor (The one which is undeleted).

Good catch!

@jackapet
Copy link
Author

The explicit implementation fixed the problem with cloning and varying THnSparse histograms.

The problem with multithreading is still there. The implementation is not thread-safe at all. There is a dynamic allocation of the memory and two threads may attempt to allocate memory at the same time. I suggest removing this merge test. After that, we can check the pipeline again.

@ferdymercury
Copy link
Collaborator

Thanks for the fix!
@martamaja10 could you restart the CI?

@pcanal
Copy link
Member

pcanal commented Sep 30, 2025

There is a dynamic allocation of the memory and two threads may attempt to allocate memory at the same time

'That' in itself is not usually an issue (malloc/operator new are thread-safe). But indeed, we first need to see it work in single-thread mode (including when running with valgrind), and then we can investigate the harder cases.

Development

Successfully merging this pull request may close these issues.

Add support for THnSparseD histogram in RDataFrame
4 participants