Enable readColumn to read all rows #53

park-brian · 2024-12-14T19:59:24Z

Hi @platypii, this is a preliminary pull request to enable readColumn to read all rows in a column DataReader when setting the rowLimit to Infinity (I initially thought about using undefined, but this would change the type signature/introduce an additional type check). Let me know if this looks good to you, and I can write some tests to mock out this behavior.

platypii · 2024-12-15T04:04:37Z

Yeah, this looks good, thanks. Tests would be appreciated.

I'm not totally sold on Infinity over undefined. The nice thing about undefined is that if someone comes to depend on this function and we later require rowLimit again (say, to pre-allocate an array; no plans for this, just an example), there is no way to express that in the types with Infinity, whereas with undefined it would be a type error when the user tries to upgrade the library.
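To make the trade-off concrete, here is a minimal sketch of an optional row limit. `readRows` and its `pages` argument are hypothetical stand-ins for illustration, not hyparquet's actual `readColumn`; only the `hasRowLimit` name comes from the commit messages in this PR.

```javascript
/**
 * Hypothetical stand-in for a column reading loop (not hyparquet's real code).
 * Each entry of `pages` plays the role of one already-decoded page of values.
 *
 * @param {number[][]} pages decoded pages, in file order
 * @param {number} [rowLimit] maximum rows to return; undefined means "read everything"
 * @returns {number[]}
 */
function readRows(pages, rowLimit) {
  // Assumption: "no limit" is signalled by undefined, so the JSDoc type
  // `[rowLimit]` documents it and a later switch to a required parameter
  // would surface as a type error for callers.
  const hasRowLimit = rowLimit !== undefined
  const rows = []
  for (const page of pages) {
    // Stop before decoding further pages once the limit is satisfied.
    if (hasRowLimit && rows.length >= rowLimit) break
    rows.push(...page)
  }
  // The last page read may overshoot the limit, so trim when one was given.
  return hasRowLimit ? rows.slice(0, rowLimit) : rows
}
```

Passing `Infinity` to this sketch also reads every row, but it is typed as an ordinary `number`, so the type checker cannot distinguish "no limit" from a real limit. That is the concern raised above.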

platypii · 2024-12-15T04:11:44Z

More general question: why do you need this? I saw your code for predicate push down uses it with unknown rowLimit, but first of all, might this result in reading MORE than you wanted? It sounds like you just want the first page? (but there could be multiple?) Also... don't you know what the page size is in advance and could just pass that to readColumn as-is without this PR?

park-brian · 2024-12-16T17:30:32Z

Thanks for the questions! We're reading individual pages by slicing and concatenating individual headers + dictionaries + data pages from the parquet file. Since each column chunk only contains one page, we can simply read the entire buffer each time. In the future, a more optimized version of this could concatenate adjacent pages into a single buffer (so this behavior is still desirable in that case).
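The slice-and-concatenate step described above could look roughly like the following. This is a sketch under assumptions: `concatBuffers` is a hypothetical helper, and the byte ranges of the header, dictionary page, and data page are assumed to be known already.

```javascript
/**
 * Hypothetical helper: join several ArrayBuffer slices (for example a header,
 * a dictionary page, and a data page) into one contiguous ArrayBuffer.
 *
 * @param {ArrayBuffer[]} parts slices in the order they should appear
 * @returns {ArrayBuffer}
 */
function concatBuffers(parts) {
  // Total size of the combined buffer.
  const total = parts.reduce((sum, part) => sum + part.byteLength, 0)
  const out = new Uint8Array(total)
  let offset = 0
  for (const part of parts) {
    // Copy each slice's bytes at the running offset.
    out.set(new Uint8Array(part), offset)
    offset += part.byteLength
  }
  return out.buffer
}
```

Because the combined buffer then holds exactly one page's worth of data, a reader without a row limit can consume it to the end without knowing the value count up front.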

park-brian · 2024-12-16T21:15:53Z

To answer your other question (knowing the page size in advance): this is kind of tricky. We do know how many bytes the page takes up, but the statistics don't include the number of values in the page (just null_count, distinct_count, min_value, max_value, etc.). num_rows is only stored at the RowGroup level.

park-brian · 2024-12-16T21:21:46Z

Actually, I should correct myself. It is possible to get a data page's num_values: we just need to read the first i32 value from the page buffer. Still, it is convenient to have a way to read the full page without needing to specify its size.

park-brian · 2024-12-19T23:27:43Z

Hi @platypii, it might actually be better to squash these commits, since there was a bit of tweaking that took place.

platypii · 2024-12-20T00:21:24Z

Repo is set to squash commits only so no worries there.

The source changes look good. 👍

I'm not as sure about the tests. Writing out the columns for every test file is a bit redundant with the readFiles tests. I would prefer a smaller number of targeted tests of rowLimit = undefined against maybe one of the test files? I could be convinced otherwise here, it just seems like a lot of additional testing noise while adding a marginal amount of safety?

park-brian · 2024-12-20T01:54:50Z

Thanks @platypii, changes have been made.

platypii

This is great thanks @park-brian!

platypii · 2024-12-20T02:19:36Z

Published v1.7.0 with these changes

severo · 2024-12-20T08:26:22Z

test/column.test.js

+import { asyncBufferFromFile } from '../src/utils.js'
+
+describe('readColumn', () => {
+  it('read columns when rowLimit is undefined', async () => {


We could factor these four tests with https://vitest.dev/api/#test-for, as they only differ in the last two lines. I'll send a PR.

park-brian added 2 commits December 14, 2024 14:41

Enable readColumn to read all rows

d6e04ea

Refactor readColumn to use hasRowLimit

b4a79b0

park-brian mentioned this pull request Dec 14, 2024

Query API #16

Open

park-brian added 2 commits December 14, 2024 16:24

Simplify hasRowLimit condition

3a29c7d

Check less common condition first

24fa314

park-brian and others added 8 commits December 18, 2024 11:03

Merge branch 'hyparam:master' into read-column-rowlimit-infinite

ee45b24

add readColumn test files

764d40e

implement readColumn tests for undefined rowLimits

afb38b8

remove unused variable

4b25716

return early if no metadata is present

acb49f7

address tsc warnings

6da005b

add comparison

0426230

clarify that undefined is valid for rowLimit

dab11c2

park-brian added 3 commits December 19, 2024 20:42

remove test files

481c5d7

verify edge case works when rowLimit is undefined

3eb6568

add test cases for readColumn

79ccbe3

platypii approved these changes Dec 20, 2024

platypii merged commit 9992316 into hyparam:master Dec 20, 2024
3 checks passed

park-brian deleted the read-column-rowlimit-infinite branch December 20, 2024 02:31

severo reviewed Dec 20, 2024
