Enable readColumn to read all rows #53
Yea this looks good thanks. Tests would be appreciated. I'm not totally sold on Infinity over undefined... The nice thing about undefined is that if someone comes to depend on this function, and in the future we require |
More general question: why do you need this? I saw that your code for predicate pushdown uses it with an unknown rowLimit, but first of all, might this result in reading MORE than you wanted? It sounds like you just want the first page? (But there could be multiple?) Also... don't you know what the page size is in advance and could just pass that to |
Thanks for the questions! We're reading individual pages by slicing and concatenating individual headers + dictionaries + data pages from the parquet file. Since each column chunk only contains one page, we can simply read the entire buffer each time. In the future, a more optimized version of this could concatenate adjacent pages into a single buffer (so this behavior is still desirable in that case).
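To illustrate the slicing and concatenating described above, here is a minimal sketch, assuming the byte ranges of the header, dictionary, and data page are already known. The helper name `concatByteRanges` and the `{ start, end }` range shape are hypothetical and are not part of hyparquet.

```javascript
// Hypothetical helper (not part of hyparquet): copy several byte ranges
// out of a source ArrayBuffer into one contiguous ArrayBuffer.
function concatByteRanges(source, ranges) {
  // Total size of the output is the sum of all range lengths
  const total = ranges.reduce((sum, { start, end }) => sum + (end - start), 0)
  const out = new Uint8Array(total)
  let offset = 0
  for (const { start, end } of ranges) {
    // View the range in place, then copy it into the output
    out.set(new Uint8Array(source, start, end - start), offset)
    offset += end - start
  }
  return out.buffer
}

// Usage: pick bytes [0, 2) and [5, 8) out of a 10-byte buffer
const file = Uint8Array.from([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).buffer
const combined = new Uint8Array(concatByteRanges(file, [
  { start: 0, end: 2 },
  { start: 5, end: 8 },
]))
```

Because the resulting buffer holds only the selected pages, a reader can consume it to the end without knowing the row count in advance.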
To answer your other question (knowing the page size in advance): this is kind of tricky. We do know how many bytes the page takes up, but the statistics don't include the number of values in the page (just null_count, distinct_count, min_value, max_value, etc.).
Actually, I should correct myself. It is possible to get a data page's |
Hi @platypii, it might actually be better to squash these commits, since there was a bit of tweaking that took place.
Repo is set to squash commits only so no worries there. The source changes look good. 👍 I'm not as sure about the tests. Writing out the columns for every test file is a bit redundant with the |
Thanks @platypii, changes have been made.
This is great thanks @park-brian!
Published v1.7.0 with these changes.
import { asyncBufferFromFile } from '../src/utils.js'

describe('readColumn', () => {
  it('read columns when rowLimit is undefined', async () => {
We could factor these four tests with https://vitest.dev/api/#test-for, as they only differ in the last two lines. I'll send a PR.
Hi @platypii, this is a preliminary pull request to enable readColumn to read all rows in a column DataReader when setting the rowLimit to Infinity (I initially thought about using undefined, but this would change the type signature/introduce an additional type check). Let me know if this looks good to you, and I can write some tests to mock out this behavior.
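The difference between Infinity and undefined as a "no limit" value can be sketched as follows. This is a simplified stand-in, not hyparquet's actual readColumn: `readRows` and its array-of-pages input are hypothetical, and the loop only assumes that the reader compares a running row count against rowLimit.

```javascript
// Hypothetical stand-in for a page-reading loop (not hyparquet's readColumn).
// `pages` is an array of arrays, each inner array holding one page's values.
function readRows(pages, rowLimit) {
  const rows = []
  let pageIndex = 0
  // With rowLimit = Infinity this comparison is always true, so the loop
  // only stops once the pages run out.
  while (rows.length < rowLimit && pageIndex < pages.length) {
    rows.push(...pages[pageIndex])
    pageIndex++
  }
  // Trim any overshoot from the last page that was read
  return rows.slice(0, rowLimit)
}

const pages = [[1, 2, 3], [4, 5]]
const allRows = readRows(pages, Infinity)
const firstTwo = readRows(pages, 2)
// Any numeric comparison with undefined is false, so without an explicit
// default or type check the loop never runs.
const withUndefined = readRows(pages, undefined)
```

In this sketch Infinity needs no special handling, whereas supporting undefined would require an extra check or default, which matches the trade-off raised in the description above.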