Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix uninitialized reads in parquet chunked reader #17810

Open
wants to merge 2 commits into
base: branch-25.04
Choose a base branch
from

Conversation

pmattione-nvidia
Copy link
Contributor

@pmattione-nvidia pmattione-nvidia commented Jan 24, 2025

This PR fixes a couple of uninitialized reads in parquet chunked reader.

We are trying to go from an array of size N to a list of offsets to the data referred to by the array of length N+1, where the last offset is the total offset to the end. The exclusive_scan() calls compute the offsets, but do an uninitialized read of array[N]. This introduces thrust counting_transform_iterator's to perform an indirection so that we don't read array[N]. Note that exclusive_scan() computes a prefix sum, so the uninitialized data in array[N] shouldn't have been used to compute anything anyway.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pmattione-nvidia pmattione-nvidia added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 24, 2025
@pmattione-nvidia pmattione-nvidia self-assigned this Jan 24, 2025
@pmattione-nvidia pmattione-nvidia requested review from a team as code owners January 24, 2025 16:12
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas cudf.polars Issues specific to cudf.polars pylibcudf Issues specific to the pylibcudf package labels Jan 24, 2025
@pmattione-nvidia pmattione-nvidia changed the base branch from branch-25.02 to branch-25.04 January 24, 2025 16:13
@davidwendt
Copy link
Contributor

I would be concerned about any performance impact this my incur.
Overall, I don't think this extra code is necessary since the read does not actual contribute to the result.
We have many places in libcudf that work this way with exclusive-scan.
See the discussion here #12667 (comment) and the proposed resolution to be fixed in CCCL here: NVIDIA/cccl#876

@pmattione-nvidia
Copy link
Contributor Author

pmattione-nvidia commented Jan 24, 2025

That CCCL resolution has been open for 3.5 years now; I doubt it is getting fixed. And UB is bad. And any overhead this adds is negligible compared to everything else needed for parquet reads (decompression, reading global device memory, etc.)

@davidwendt
Copy link
Contributor

That CCCL resolution has been open for 3.5 years now; I doubt it is getting fixed. And UB is bad.

You are welcome to note your comments in the CCCL issue.
But I don't think we should try to work around this in general.

@pmattione-nvidia
Copy link
Contributor Author

Also, this overhead isn't being applied to processing of the data itself, it's just overhead on calculating a few page offsets. The perf cost of this is negligible, UB should be avoided, and CCCL is not going to fix this (it may even require a branch in exclusive_scan(), so they may not even want to change anything there).

@github-actions github-actions bot removed the Python Affects Python cuDF API. label Feb 3, 2025
@github-actions github-actions bot removed CMake CMake build issue Java Affects Java cuDF API. cudf.pandas Issues specific to cudf.pandas cudf.polars Issues specific to cudf.polars pylibcudf Issues specific to the pylibcudf package labels Feb 3, 2025
@nvdbaranec
Copy link
Contributor

I would be concerned about any performance impact this my incur. Overall, I don't think this extra code is necessary since the read does not actual contribute to the result. We have many places in libcudf that work this way with exclusive-scan. See the discussion here #12667 (comment) and the proposed resolution to be fixed in CCCL here: NVIDIA/cccl#876

The counterargument here is that this isn't being applied to bulk data. It's being applied to metadata - dozens/hundreds of things (page structs) during the setup phase.

@davidwendt
Copy link
Contributor

Also, this overhead isn't being applied to processing of the data itself, it's just overhead on calculating a few page offsets. The perf cost of this is negligible, UB should be avoided, and CCCL is not going to fix this (it may even require a branch in exclusive_scan(), so they may not even want to change anything there).

I don't want to workaround issues like this but prefer CCCL fix (or not) the issue in their library.
I think the risk here is not worth this change.
Although the read is UB, the value is discarded by CCCL and there is no risk of data corruption.
Perhaps this could even be a feature request for compute-sanitizer to detect the read value is not used.
Regardless, this has been covered in great detail in #12667 and you are welcome to add more comments to that closed issue.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants