Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-823151: First result batch has int64 arrow type for NUMBER column with real values #1568

Open
tekumara opened this issue May 19, 2023 · 12 comments
Assignees
Labels
bug status-triage_done Initial triage done, will be further handled by the driver team

Comments

@tekumara
Copy link
Contributor

When fetching multiple batches via get_result_batches , the first batch is empty and has the incorrect arrow data type for NUMBER columns containing real values, eg:

import snowflake.connector
import pyarrow

conn = snowflake.connector.connect(... )
cur = conn.cursor()
cur.execute("SELECT cast(1.5 as NUMBER(32,4)) as AMOUNT;")
batches = cur.get_result_batches()
batch = batches[0]
# the first batch is normally empty, which triggers _create_empty_table
# in this MRE we trigger it explicitly to demonstrate the error 
table = batch._create_empty_table()
print(table.schema)

assert table.schema.types[0] == pyarrow.float64()

Will fail with an assertion error and output:

AMOUNT: int64

snowflake-connector-python==3.0.3

@github-actions github-actions bot changed the title First result batch has int64 arrow type for NUMBER column with real values SNOW-823151: First result batch has int64 arrow type for NUMBER column with real values May 19, 2023
@sfc-gh-achandrasekaran
Copy link
Contributor

@tekumara did this happen with previous versions?

@tekumara
Copy link
Contributor Author

I'm not entirely sure.

@sfc-gh-aling
Copy link
Collaborator

sfc-gh-aling commented Jun 15, 2023

hey @tekumara , you are just one step close to get the result you want.
get_result_batches returns an ArrowResultChunk object -- it's a data structure storing data, not the data itself. You can either call the create_iter to create an iterator to iterate the data within the chunk or call to_arrow, to_pandas to convert to pyarrow Table or pandas DataFrame if you have installed the pandas dependency.

please let me know if you have any other questions, I'm more than happy to answer.

@tekumara
Copy link
Contributor Author

tekumara commented Jun 15, 2023

I still have the same problem using those methods. The MRE above uses _create_empty_table() to make it easier to reproduce.

@sfc-gh-aling
Copy link
Collaborator

can you try this?

conn = snowflake.connector.connect(**CONNECTION_PARAMETERS)
cur = conn.cursor()
cur.execute("SELECT cast(1.5 as NUMBER(32,4)) as AMOUNT;")

batches = cur.get_result_batches()
batch = batches[0]
table = batch.to_arrow()
assert table.schema.types[0] == pyarrow.float64()

@tekumara
Copy link
Contributor Author

That example works but not in the general case ie: when I fetch and use batches[0].to_arrow() from one of our real tables containing a NUMBER(32,4) the first batch is empty. This triggers the _create_empty_table() code path and the fault demonstrated in the contrived MRE above.

@ShahNewazKhan
Copy link

@tekumara Did you ever find out why the first batch always returns empty?

@tekumara
Copy link
Contributor Author

No I didn't, are you seeing this too?

@ShahNewazKhan
Copy link

No I didn't, are you seeing this too?

Yes I am seeing the same thing, when the connector returns a list of ResultBatch objects, I have found that if it is a large list and rowcounts exceed ~ 2k then the first ResultBatch is empty.

To work around this, when distributing ResultBatch objects across multiple workers I simply check if the ResultBatch is empty and don't process it in the worker.

@tekumara
Copy link
Contributor Author

@sfc-gh-aling could you reopen this issue please as it’s still happening? 🙏

@sfc-gh-dszmolka
Copy link
Contributor

thank you all for your feedbacks on this one! i reopened this issue and we'll take a look

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed needs triage labels Mar 24, 2024
@gmhewett
Copy link

gmhewett commented Aug 6, 2024

Hey @sfc-gh-dszmolka, I am experiencing this issue as well. When calling snowflake_cursor.get_result_batches(), the first batch will sporadically contain contain 0 rows. This usually happens when the entire query contains at least 10k rows.

As a workaround, I simply do not process the first batch if it contains 0 rows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug status-triage_done Initial triage done, will be further handled by the driver team
Projects
None yet
Development

No branches or pull requests

6 participants