Skip to content

SNOW-1672579 Encode DataFrame.to_snowpark_pandas #2711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Dec 12, 2024

Conversation

sfc-gh-vbudati
Copy link
Contributor

@sfc-gh-vbudati sfc-gh-vbudati commented Dec 4, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1672579

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

  • Added AST encoding for DataFrame.to_snowpark_pandas.
  • I had to add some Local testing functionality to help my expectation test to pass. Also needed to add some janky logic to create a temp read-only table in lieu of the table that is created by to_snowpark_pandas.
  • I updated the script generating the relevant proto files to create them in the correct directory.

@sfc-gh-vbudati sfc-gh-vbudati requested review from a team as code owners December 4, 2024 01:24
@sfc-gh-vbudati sfc-gh-vbudati added NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md snowpark-ast Change materially affects the Snowpark AST labels Dec 4, 2024
@github-actions github-actions bot added local testing Local Testing issues/PRs snowpark-pandas labels Dec 4, 2024
@sfc-gh-vbudati sfc-gh-vbudati added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Dec 4, 2024

snowpark_pandas_df = df.to_snowpark_pandas(index_col=["A"], columns=["C", "B"])

# Perform a pandas operation on the result to ensure nothing breaks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this operation should then produce with AST enabled the internal Snowpark calls, or should it not? Just trying to understand the logic here. Perhaps you'll need to add snowpark_pandas_df.groupby("A").to_numpy() or so to trigger an eval... Yet our test logic should flush the last batch no matter what.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't produce an AST because I don't think we added any pandas AST logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid scope creep, I think it's a good idea to test the pandas AST logic in a different PR. This PR should only focus on whether the to_snowpark_pandas protobuf is generated correctly. Here's the ticket for it: https://snowflakecomputing.atlassian.net/browse/SNOW-1849281

Testing the pandas logic with local testing/nop testing is very janky so it will have to be in a separate test file.

Copy link
Contributor

@sfc-gh-nkrishna sfc-gh-nkrishna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modin file changes LGTM, but someone on the IR team should look at the rest

# create a temporary table out of the current snowpark dataframe
temporary_table_name = random_name_for_temp_object(
TempObjectType.TABLE
) # pragma: no cover
ast_id = self._ast_id
self._ast_id = None # set the AST ID to None to prevent AST emission.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So setting _emit_ast=False wasn't enough? This seems like a bug.

@@ -3925,7 +3950,7 @@ def write(self, _emit_ast: bool = True) -> DataFrameWriter:
"""

# AST.
if _emit_ast and self._ast_id is not None:
if self._ast_id is not None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. This is a bit of a hack, but probably reasonable given the alternative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, since write is a property, I can't use the _emit_ast parameter.

@@ -94,6 +94,11 @@ def load_test_cases():
Returns: a list of test cases.
"""
test_files = DATA_DIR.glob("*.test")
if sys.version_info[1] < 9:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably check that [0] == 3. Your grandchildren who will have to maintain this code will thank you ;).

@sfc-gh-vbudati sfc-gh-vbudati merged commit 782de21 into main Dec 12, 2024
40 of 41 checks passed
@sfc-gh-vbudati sfc-gh-vbudati deleted the vbudati/SNOW-1672579-to-snowpark-pandas-2 branch December 12, 2024 22:47
@github-actions github-actions bot locked and limited conversation to collaborators Dec 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
local testing Local Testing issues/PRs NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-ast Change materially affects the Snowpark AST snowpark-pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants