Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable readColumn to read all rows #53

Merged
merged 15 commits into from
Dec 20, 2024

Conversation

park-brian
Copy link
Contributor

Hi @platypii, this is a preliminary pull request to enable readColumn to read all rows in a column DataReader when setting the rowLimit to Infinity (I initially thought about using undefined, but this would change the type signature/introduce an additional type check). Let me know if this looks good to you, and I can write some tests to mock out this behavior.

@park-brian park-brian mentioned this pull request Dec 14, 2024
@platypii
Copy link
Collaborator

Yea this looks good thanks. Tests would be appreciated.

I'm not totally sold on Infinity over undefined... The nice thing about undefined is that if someone comes to depend on this function, and in the future we require rowLimit again (maybe we want to pre-allocate array? no plans for this, but just as an example), then with infinity there is no way to express that in the types, whereas with undefined it would be a type error when the user tries to upgrade the library.

@platypii
Copy link
Collaborator

More general question: why do you need this? I saw your code for predicate push down uses it with unknown rowLimit, but first of all, might this result in reading MORE than you wanted? It sounds like you just want the first page? (but there could be multiple?) Also... don't you know what the page size is in advance and could just pass that to readColumn as-is without this PR?

@park-brian
Copy link
Contributor Author

Thanks for the questions! We're reading individual pages by slicing and concatenating individual headers + dictionaries + data pages from the parquet file. Since each column chunk only contains one page, we can simply read the entire buffer each time. In the future, a more optimized version of this could concatenate adjacent pages into a single buffer (so this behavior is still desirable in that case).

@park-brian
Copy link
Contributor Author

To answer your other question (knowing the page size in advance). This is kind of tricky - we do know how many bytes the page takes up, but the statistics don't include the number of values in the page (just null_count, distinct_count, min_value, max_value, etc). num_rows are only stored at the RowGroup level.

@park-brian
Copy link
Contributor Author

Actually, I should correct myself. it is possible to get a data page's num_values. We just need to read the first i32 value from the page buffer, but it is convenient to have a way to read the full page without needing to specify its size.

@park-brian
Copy link
Contributor Author

Hi @platypii, it might actually be better to squash these commits, since there was a bit of tweaking that took place.

@platypii
Copy link
Collaborator

Repo is set to squash commits only so no worries there.

The source changes look good. 👍

I'm not as sure about the tests. Writing out the columns for every test file is a bit redundant with the readFiles tests. I would prefer a smaller number of targeted tests of rowLimit = undefined against maybe one of the test files? I could be convinced otherwise here, it just seems like a lot of additional testing noise while adding a marginal amount of safety?

@park-brian
Copy link
Contributor Author

Thanks @platypii, changes have been made

Copy link
Collaborator

@platypii platypii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great thanks @park-brian!

@platypii platypii merged commit 9992316 into hyparam:master Dec 20, 2024
3 checks passed
@platypii
Copy link
Collaborator

Published v1.7.0 with these changes

@park-brian park-brian deleted the read-column-rowlimit-infinite branch December 20, 2024 02:31
import { asyncBufferFromFile } from '../src/utils.js'

describe('readColumn', () => {
it('read columns when rowLimit is undefined', async () => {
Copy link
Contributor

@severo severo Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could factor these four tests with https://vitest.dev/api/#test-for, as they only differ in the last two lines. I'll send a PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#55

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants