Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-888898: Session.create_dataframe fails to properly set nullable columns #994

Closed
eugene-roach opened this issue Aug 8, 2023 · 6 comments · Fixed by #998
Closed

SNOW-888898: Session.create_dataframe fails to properly set nullable columns #994

eugene-roach opened this issue Aug 8, 2023 · 6 comments · Fixed by #998
Assignees
Labels
bug Something isn't working triaged

Comments

@eugene-roach
Copy link

  1. Python version: Python 3.8.16
  2. OS/architecture? macOS-13.4.1-arm64-arm-64bit
  3. Component version: snowflake-connector-python==3.1.0, snowflake-snowpark-python==1.6.1
  4. What did you do?
    data = [{"A": 1}, {"B": 2}]
    df = session.create_dataframe(data)
    df.write.save_as_table("DATA_TABLE", mode="overwrite")
  1. Expectation: The table get created with two nullable columns.
  2. Error:
 snowflake.connector.errors.IntegrityError: 100072 (22000): NULL result in a non-nullable column

Pretty sure the fix is just to change change https://github.com/snowflakedb/snowpark-python/blob/main/src/snowflake/snowpark/_internal/type_utils.py#L391
to StructField(n, name_to_datatype_b[n], True) since the columns is nullable if DataType a is missing the field

@eugene-roach eugene-roach added bug Something isn't working needs triage Initial RCA is required labels Aug 8, 2023
@github-actions github-actions bot changed the title Session.create_dataframe fails to properly set nullable columns SNOW-888898: Session.create_dataframe fails to properly set nullable columns Aug 8, 2023
@sfc-gh-aalam
Copy link
Contributor

In this case when you create a dataframe with data without providing schema, the schema is inferred from the data given. If the given data does not have any nullable columns, the default inference is that the column is non-nullable. This is a recent behavior change where we respect nullable columns.

What you want to do here is explicitely define schema using StructType where the StructField for the nullable columns must be set to nullable.

@sfc-gh-aalam sfc-gh-aalam added triaged and removed needs triage Initial RCA is required bug Something isn't working labels Aug 8, 2023
@eugene-roach
Copy link
Author

I understand but the current implementation isn't correct.
In line 383, the nullable parameter considers whether Datatype b contains the field. If it doesn't, then the nullable field is set to True. That means that if Datatype a doesn't contain the field, the field should also be considered nullable. So line 391 should always be True for the nullable parameter since this only runs when the field is missing from Datatype a

@sfc-gh-aalam
Copy link
Contributor

I don't understand why line 391 should always be True. Considering this is a merge_type function here's what we are doing for merging StructTypes:

  1. collect all fields that are common in both a and b and store it in fields. If for a column in a, the corresponding column in b does not exist, we assume the default value of datatype=NullType and default nullable=True.
  2. For all the fields that are in b for not in a, append the (name_b, datatype_b, nullable_b) into field (line 390-392).

Does that explain why the current code in line 391 shouldn't always be true?

@eugene-roach
Copy link
Author

Your first point, you mention that if a field is in a but not b then nullable=True so shouldn't the opposite be true as well? If a field is in b but not a then nullable=True. Why would the order of a and b matter?

Just as an example, given the code below:

data = [{"A": 1, "B": 2}, {"B": 3, "C": 4}]
df = session.create_dataframe(data)
for field in df.schema.fields:
    print(field)

Currently, writing to a table will fail as mentioned above and the output is:

StructField('A', LongType(), nullable=True)
StructField('B', LongType(), nullable=False)
StructField('C', LongType(), nullable=False)

If I make the suggested change, writing to a table is successful and the output is:

StructField('A', LongType(), nullable=True)
StructField('B', LongType(), nullable=False)
StructField('C', LongType(), nullable=True)

From what I can tell, this second schema represents the provided data more accurately and still considers the nullability of field A

@sfc-gh-aalam
Copy link
Contributor

Thanks for the explanation @eugene-roach. You are right. I'll update the code

@sfc-gh-aalam sfc-gh-aalam reopened this Aug 9, 2023
@sfc-gh-aalam sfc-gh-aalam self-assigned this Aug 9, 2023
@eugene-roach
Copy link
Author

Thanks!

@sfc-gh-aalam sfc-gh-aalam added the bug Something isn't working label Aug 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants