Allow different Snapshot behaviours in distributed RDataFrame #17136

Open
vepadulano opened this issue Nov 29, 2024 · 3 comments

Comments

@vepadulano
Member

Feature description

When called locally, the Snapshot operation produces one output file at the path provided by the user. The distributed version of Snapshot currently produces one output file per task, each containing the result of running the computation graph on the event range processed by that task. This difference in behaviour was introduced for performance reasons, but it can be confusing for end users. We should add new behaviours to the distributed RDataFrame Snapshot operation:

  1. Allow the user to request one final merged output file, understanding that this will incur a performance cost.
  2. Allow the user to get back the list of partial output files, so they can harvest them according to further workflow requirements. We should also check that the files can be written to a remote storage location where the user has write access (e.g. via an xrootd path).
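
To make the two requested behaviours concrete, here is a minimal pure-Python sketch of the selection logic. It does not use ROOT and is not a proposed API: `SnapshotOutput`, `partial_path`, `collect_snapshot` and the `out_<task>.root` naming scheme are all hypothetical names chosen for illustration, and the actual merging step is left to a caller-supplied function.

```python
from enum import Enum
from typing import Callable, List, Sequence


class SnapshotOutput(Enum):
    """Hypothetical user-facing option; not an existing ROOT API."""
    PARTIAL_FILES = "partial"  # behaviour 2: hand back the per-task files
    MERGED_FILE = "merged"     # behaviour 1: one final file, extra cost


def partial_path(user_path: str, task_id: int) -> str:
    """Derive a per-task output name from the user path.

    The '<stem>_<task>.<ext>' scheme is an assumption for this sketch.
    Only the last path segment is changed, so URL prefixes such as
    'root://host//dir/' are left untouched.
    """
    head, sep, name = user_path.rpartition("/")
    stem, dot, ext = name.rpartition(".")
    name = f"{stem}_{task_id}.{ext}" if dot else f"{name}_{task_id}"
    return head + sep + name


def collect_snapshot(
    partials: Sequence[str],
    mode: SnapshotOutput,
    merged_path: str,
    merge: Callable[[str, Sequence[str]], None],
) -> List[str]:
    """Return the files the user should see after all tasks finished."""
    if mode is SnapshotOutput.PARTIAL_FILES:
        # No extra work: the user harvests the partial files themselves.
        return list(partials)
    # Merging happens after all tasks are done; this is the extra cost.
    merge(merged_path, partials)
    return [merged_path]


if __name__ == "__main__":
    files = [partial_path("out.root", task) for task in range(3)]
    # Stand-in merge function that only reports what it would do.
    report = lambda target, sources: print("merge", list(sources), "->", target)
    print(collect_snapshot(files, SnapshotOutput.PARTIAL_FILES, "out.root", report))
    print(collect_snapshot(files, SnapshotOutput.MERGED_FILE, "out.root", report))
```

In a real implementation the `merge` callable could plausibly wrap an existing merging tool such as `hadd`, but that choice, and where the merge should run, is an open design question rather than something this sketch settles.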

Alternatives considered

No response

Additional context

No response

@VISHESH0932

Hello @vepadulano I would love to work on this issue.

@vepadulano
Member Author

Dear @VISHESH0932 , thanks for your availability! Let's discuss this offline, you can contact me at vincenzo.eduardo.padulano@cern.ch

@vepadulano
Member Author

After discussion with CMS users, it has come to my attention that at certain computing sites xrootd is not available at all, and davix is used instead. We know from user experience that remote files can be written via xrootd (provided the user has write permission on the remote storage location), but we have no tests of writing remote data with Snapshot via davix (i.e. https through TFile). We should gain experience with this case too.
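
As a rough aid for planning those tests, the sketch below classifies a Snapshot output path by the remote protocol it would go through. It is plain Python with no ROOT dependency; the function name `remote_protocol` and the exact scheme-to-protocol mapping are assumptions made for illustration, and the example host names are placeholders.

```python
from urllib.parse import urlsplit

# Assumed mapping from URL scheme to the remote I/O layer involved.
XROOTD_SCHEMES = {"root", "roots"}
DAVIX_SCHEMES = {"http", "https", "davs"}


def remote_protocol(path: str) -> str:
    """Return 'xrootd', 'davix' or 'local' for an output path."""
    scheme = urlsplit(path).scheme.lower()
    if scheme in XROOTD_SCHEMES:
        return "xrootd"
    if scheme in DAVIX_SCHEMES:
        return "davix"
    # Anything else is treated as a local (or locally mounted) path.
    return "local"


if __name__ == "__main__":
    for candidate in (
        "root://eos.example.org//store/user/out.root",
        "https://storage.example.org/store/user/out.root",
        "/tmp/out.root",
    ):
        print(candidate, "->", remote_protocol(candidate))
```

A test matrix for distributed Snapshot could then cover at least one writable destination per category, with the davix case being the one for which no experience exists yet.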


2 participants