chore: move pyspark tests into main test suite #1761
expr = plx.all_horizontal(
    *chain(predicates, (plx.col(name) == v for name, v in constraints.items()))
)
Needed to implement Expr.__eq__ to get this to work. It overlaps with @EdAbati's PR
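For context, a minimal sketch of why keyword constraints depend on `Expr.__eq__`: each constraint becomes a `col(name) == value` expression, so `==` has to return an expression rather than a bool. `ToyExpr`, `col`, `all_horizontal`, and `build_filter` below are hypothetical stand-ins written for illustration, not the narwhals or pyspark API.

```python
from itertools import chain
from typing import Any, Callable, Dict

Row = Dict[str, Any]


class ToyExpr:
    """Hypothetical expression: wraps a function from a row to a value."""

    def __init__(self, func: Callable[[Row], Any]) -> None:
        self._func = func

    def __eq__(self, other: object) -> "ToyExpr":  # type: ignore[override]
        # Return a new expression instead of a bool, so it can be combined later.
        return ToyExpr(lambda row: self._func(row) == other)

    def evaluate(self, row: Row) -> Any:
        return self._func(row)


def col(name: str) -> ToyExpr:
    # Expression that looks up one column in a row.
    return ToyExpr(lambda row: row[name])


def all_horizontal(*exprs: ToyExpr) -> ToyExpr:
    # True only when every expression is true for the row.
    exprs_list = list(exprs)
    return ToyExpr(lambda row: all(e.evaluate(row) for e in exprs_list))


def build_filter(*predicates: ToyExpr, **constraints: Any) -> ToyExpr:
    # Same shape as the diff: predicates plus one equality per keyword constraint.
    return all_horizontal(
        *chain(predicates, (col(name) == v for name, v in constraints.items()))
    )


rows = [{"a": 1, "b": "x"}, {"a": 2, "b": "x"}, {"a": 1, "b": "y"}]
expr = build_filter(a=1, b="x")
kept = [row for row in rows if expr.evaluate(row)]
```

Without the overridden `__eq__`, `col(name) == v` would fall back to identity comparison and yield a plain `False`, which is why the method was a prerequisite here.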
awesome @FBruzzesi - not sure what's happening with tests running in CI?
We are not installing pyspark at all, therefore but now
ah i see - maybe
tests/conftest.py
Outdated
if constructor == "pyspark":
    if sys.version_info < (3, 12):
        constructors.append(pyspark_lazy_constructor())
    else:
        continue
@MarcoGorelli maybe this is too much?
with pyspark 4.0.0 this would go
i think this is fine
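A minimal sketch of the version gate under discussion, pulled out into a small function so it can be checked in isolation. `select_constructors` and the plain-string constructor names are hypothetical; the only detail taken from the diff is that pyspark is skipped on Python 3.12 and later.

```python
import sys
from typing import List, Sequence, Tuple


def select_constructors(
    requested: Sequence[str],
    version_info: Tuple[int, ...] = tuple(sys.version_info[:2]),
) -> List[str]:
    """Drop "pyspark" from the requested constructors on Python >= 3.12."""
    selected: List[str] = []
    for name in requested:
        if name == "pyspark" and version_info >= (3, 12):
            # Mirrors the `else: continue` branch in the diff.
            continue
        selected.append(name)
    return selected


on_311 = select_constructors(["pandas", "pyspark", "polars"], (3, 11))
on_312 = select_constructors(["pandas", "pyspark", "polars"], (3, 12))
```

Passing the version in as a parameter keeps the gate easy to delete once the pyspark release mentioned above makes it unnecessary.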
    module="pyspark",
    category=DeprecationWarning,
)
pd_df = pd.DataFrame(obj).replace({float("nan"): None}).reset_index()
If the objects that come into these constructors are (always?) dictionaries I think we can skip the trip through pandas and construct from a built-in Python object that spark knows how to ingest directly (list of dictionaries). Could be overly cautious, but Spark may infer data types differently if it is handed a pandas DataFrame rather than lists of Python objects.
Since pyspark supports a list of records we could convert dict → list of dicts like so
if isinstance(obj, dict):
    obj = [{k: v for k, v in zip(obj, row)} for row in zip(*obj.values())]
Or could pass in the rows & schema separately
if isinstance(obj, dict):
    df = ...createDataFrame([*zip(*obj.values())], schema=[*obj.keys()])
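To make the two conversions concrete, here is a plain-Python sketch that runs without Spark. The helper names `dict_to_records` and `dict_to_rows_and_schema` and the sample data are made up for illustration; only the `zip`-based transformations come from the comment above.

```python
from typing import Any, Dict, List, Tuple


def dict_to_records(data: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    # Column-oriented dict -> list of row dicts (one dict per record).
    return [dict(zip(data, row)) for row in zip(*data.values())]


def dict_to_rows_and_schema(
    data: Dict[str, List[Any]],
) -> Tuple[List[Tuple[Any, ...]], List[str]]:
    # Column-oriented dict -> (list of row tuples, list of column names).
    return list(zip(*data.values())), list(data.keys())


data = {"a": [1, 2, 3], "b": [4.0, None, 6.0]}
records = dict_to_records(data)
rows, schema = dict_to_rows_and_schema(data)
# records == [{"a": 1, "b": 4.0}, {"a": 2, "b": None}, {"a": 3, "b": 6.0}]
# rows == [(1, 4.0), (2, None), (3, 6.0)] and schema == ["a", "b"]
```

One caveat with both forms: `zip` silently truncates to the shortest column, so columns of unequal length lose rows instead of raising an error.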
I remember having issues with some tests, where we may need to specify the schema with column types (but I don't remember exactly what the problem was).
But if we can skip pandas here, it would be great
I had the same thought when migrating the codebase, yet I can confirm the data type being an issue for a subset of the tests. I would say to keep it like this for now and eventually address it
Thank you very much for doing this!
yield session
session.stop()

register(session.stop)
TIL atexit.register, nice!
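For anyone else new to it, a minimal sketch of the `atexit.register` pattern from the diff. `FakeSession` and `make_session` are hypothetical stand-ins so the example runs without Spark; the real code registers the Spark session's `stop` method.

```python
import atexit


class FakeSession:
    """Hypothetical stand-in for a Spark session; only tracks whether stop() ran."""

    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


def make_session() -> FakeSession:
    session = FakeSession()
    # Ask the interpreter to call session.stop() at normal shutdown,
    # instead of relying on a yield/teardown pair in a fixture.
    atexit.register(session.stop)
    return session


session = make_session()
```

Registered handlers run at normal interpreter termination; they are not called if the process is killed by an unhandled signal or exits through `os._exit()`.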
'ignore:.*The distutils package is deprecated and slated for removal in Python 3.12:DeprecationWarning:pyspark',
'ignore:.*distutils Version classes are deprecated. Use packaging.version instead.*:DeprecationWarning:pyspark',
@MarcoGorelli I moved these back to pyproject.toml, but targeting the pyspark module. Would that work for you?
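A rough sketch of what scoping a filter to the pyspark module means, using the standard-library `warnings` API rather than the pyproject.toml entries themselves. The helper `count_visible` and the module names passed to it are made up for illustration; the message pattern is taken from the filter strings above.

```python
import warnings

MESSAGE = "distutils Version classes are deprecated. Use packaging.version instead."


def count_visible(module_name: str) -> int:
    """Emit the deprecation warning as if from `module_name`; count what gets through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Ignore the warning only when it originates from a module whose
        # name starts with "pyspark" (the module argument is a regex matched
        # from the start of the module name).
        warnings.filterwarnings(
            "ignore",
            message=".*distutils Version classes are deprecated.*",
            category=DeprecationWarning,
            module="pyspark",
        )
        warnings.warn_explicit(
            MESSAGE,
            DeprecationWarning,
            filename="fake.py",
            lineno=1,
            module=module_name,
        )
    return len(caught)


from_pyspark = count_visible("pyspark.sql.utils")
from_elsewhere = count_visible("mypackage.utils")
```

The point of the module field is that the same deprecation raised from any other package still surfaces, so the suppression does not hide problems in the project's own code.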
TIL
nice!
thanks @FBruzzesi
non-tests changes look good
i'm about to go out so I didn't finish reading through all the changes in the tests folder, but if you checked them and there are no rogue changes feel free to merge
nice one!
…ev/narwhals into tests/pyspark-to-main
Thanks Marco! Aside from CI time increasing significantly when pyspark runs (maybe we could skip the windows one and run pyspark on ubuntu only), I don't see a big risk in merging now. It is such a better developer experience to add features with tests already there
Yeees thank you @FBruzzesi 🥳🥳🥳
yes to this, windows is already really slow to run...