SNOW-869536: Iterators from to_local_iterator stop returning results after another query occurs #945

orrdermer1 · 2023-07-16T14:47:28Z

Please answer these questions before submitting your issue. Thanks!

What version of Python are you using?

Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)]

What operating system and processor architecture are you using?

macOS-13.3.1-arm64-arm-64bit

What are the component versions in the environment (pip freeze)?

... (Snowpark 1.5.1)

What did you do?

We iterate over the result of to_local_iterator() and use the DataFrame's schema to parse each row.

# Note: df.schema must not be accessed before this point, or the bug will not reproduce
df = session.table("some_table")
my_iter = df.to_local_iterator()
counter = 0
for row in my_iter:
    len(df.schema.fields)  # Stand-in for our parsing logic, which used df.schema; this is enough to reproduce the bug
    counter += 1
print(counter)  # Prints 1: only the first row was iterated

What did you expect to see?

We expected to iterate over all the rows, but only the first row was returned.
Calling df.schema probably caused the Snowflake Python connector to execute another query on the same cursor, so the cursor no longer pointed at the results of our query and the iterator stopped early.
This probably means that, in general, no other query can be run while an iterator is being consumed.
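The suspected failure mode can be sketched without Snowflake at all. The toy model below is not the connector's implementation: ToyCursor and count_rows are hypothetical names, and the only assumption is the one guessed at above, namely that a cursor holds a single result set and execute() replaces it.

```python
class ToyCursor:
    """Hypothetical stand-in for a DB cursor that holds one result set at a time."""

    def __init__(self):
        self._results = iter(())

    def execute(self, rows):
        # Running a new "query" discards whatever result set was active before.
        self._results = iter(rows)
        return self

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._results)


def count_rows(shared: bool) -> int:
    """Iterate a 3-row result while running a metadata query for every row."""
    data_cursor = ToyCursor()
    # With shared=True the metadata query reuses the cursor being iterated.
    meta_cursor = data_cursor if shared else ToyCursor()
    counter = 0
    for _row in data_cursor.execute(["r1", "r2", "r3"]):
        meta_cursor.execute([])  # stands in for the query behind df.schema
        counter += 1
    return counter


rows_with_shared_cursor = count_rows(shared=True)
rows_with_separate_cursors = count_rows(shared=False)
```

In this model the shared cursor yields one row and then stops, while separate cursors yield all three, which matches the symptom reported here.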
Note that there is an easy workaround using an AsyncJob, which fetches results specifically by our query ID and therefore stays stable while other queries are running:

my_iter = df.to_local_iterator(block=False).result()
counter = 0
for row in my_iter:
    len(df.schema.fields)
    counter += 1
print(counter)  # Prints the number of rows in the table

Can you set logging to DEBUG and collect the logs?

Hard to do with our current environment :(


orrdermer1 · 2023-07-25T12:56:24Z

Apparently the workaround I suggested isn't good either: (block=False).result() actually invokes fetchall, which loads all the data into the process at once and defeats the purpose of an iterator.
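To make the difference concrete, here is a small sketch that contrasts an iterator built on fetchall with one built on fetchone. CountingCursor, eager_iter and lazy_iter are hypothetical names over an in-memory result set; nothing here calls the real connector.

```python
class CountingCursor:
    """Hypothetical cursor over an in-memory result set that counts rows handed out."""

    def __init__(self, n_rows: int):
        self._rows = iter(range(n_rows))
        self.rows_fetched = 0

    def fetchone(self):
        row = next(self._rows, None)
        if row is not None:
            self.rows_fetched += 1
        return row

    def fetchall(self):
        rows = list(self._rows)
        self.rows_fetched += len(rows)
        return rows


def eager_iter(cursor):
    # Materializes every row before the caller sees the first one.
    return iter(cursor.fetchall())


def lazy_iter(cursor):
    # Pulls one row at a time, only when the caller asks for it.
    while (row := cursor.fetchone()) is not None:
        yield row


eager_cursor = CountingCursor(1000)
next(eager_iter(eager_cursor))

lazy_cursor = CountingCursor(1000)
lazy_rows = lazy_iter(lazy_cursor)
next(lazy_rows)
next(lazy_rows)
```

After a single next() the eager cursor has already handed out all 1000 rows, while the lazy cursor has handed out only the two rows that were requested.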

I ended up writing a function to work around it. For some reason _cursor.description also didn't return quite the right value, but this should work:

from snowflake.snowpark._internal.utils import result_set_to_iter
from snowflake.snowpark.dataframe import DataFrame
from snowflake.snowpark.async_job import AsyncJob


def get_iterator_from_df(df: DataFrame, case_sensitive=True):
    """
    Workaround for a bug in Snowpark that allows iterating over multiple DataFrames simultaneously.
    """
    # Async jobs create a new cursor, which is good for us
    async_job: AsyncJob = df.to_local_iterator(block=False)  # type: ignore
    
    # Not using "async_job.result" because it uses fetchall - effectively collecting everything
    result_meta = async_job._cursor.describe(async_job._query)
    assert result_meta is not None, "Failed to get result metadata"
    async_job._cursor.get_results_from_sfqid(async_job.query_id)
    
    return result_set_to_iter(
        iter(async_job._cursor),
        result_meta,
        case_sensitive=case_sensitive,
    )

The code below shows that it works: profiling shows that fetchone was called only 4 times, matching the four next() calls, rather than once for every row in the results.

import cProfile

df_1 = session.table("<>").limit(500000)
df_2 = session.table("<>").limit(400000)

with cProfile.Profile() as pr:
    iterator_1 = get_iterator_from_df(df_1)
    iterator_2 = get_iterator_from_df(df_2, case_sensitive=False)
    print(f"First iterator: {next(iterator_1)}")
    print(f"Second iterator: {next(iterator_2)}")
    print()
    print(f"First iterator: {next(iterator_1)}")
    print(f"Second iterator: {next(iterator_2)}")
    pr.print_stats()
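Since pr.print_stats() prints every profiled function, it can help to pull out just the call count for one name. This is a minimal sketch under a few assumptions: count_calls is a hypothetical helper, the local fetchone is a stand-in rather than the connector's method, and the helper reads the stats dictionary that pstats.Stats builds from the profiler. It matches on function name only, so any profiled function with the same name is counted too.

```python
import cProfile
import pstats


def count_calls(profile: cProfile.Profile, func_name: str) -> int:
    """Return the total number of profiled calls to functions named func_name."""
    stats = pstats.Stats(profile)
    total = 0
    # stats.stats maps (filename, line, function name) to
    # (primitive calls, total calls, total time, cumulative time, callers).
    for (_file, _line, name), (_cc, n_calls, _tt, _ct, _callers) in stats.stats.items():
        if name == func_name:
            total += n_calls
    return total


def fetchone():
    """Hypothetical stand-in for the connector's fetchone."""
    return None


profiler = cProfile.Profile()
profiler.enable()
for _ in range(4):
    fetchone()
profiler.disable()

fetchone_calls = count_calls(profiler, "fetchone")
```

With the profile from the snippet above, the same helper could be called as count_calls(pr, "fetchone") after the with block exits.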

sfc-gh-stan · 2023-10-02T16:14:45Z

I can repro. This happens because the same cursor is being reused for both queries:

snowpark-python/src/snowflake/snowpark/_internal/server_connection.py, line 347 in e93d054:

results_cursor = self._cursor.execute(query, params=params, **kwargs)

Assigned to myself for fixing.

orrdermer1 added the bug (Something isn't working) and needs triage (Initial RCA is required) labels Jul 16, 2023

github-actions bot changed the title ~~Iterators from to_local_iterator stop returning results after another query occurs~~ SNOW-869536: Iterators from to_local_iterator stop returning results after another query occurs Jul 16, 2023

sfc-gh-stan removed the needs triage (Initial RCA is required) label Oct 2, 2023

sfc-gh-stan self-assigned this Oct 2, 2023

sfc-gh-stan mentioned this issue Jan 30, 2024

SNOW-869536 Fix buggy behavior in DataFrame.to_local_iterator #1226 (Merged, 5 tasks)

sfc-gh-stan closed this as completed in #1226 Feb 1, 2024
