
chore: move pyspark tests into main test suite #1761

Merged · 20 commits into main from tests/pyspark-to-main · Jan 9, 2025

Conversation

FBruzzesi (Member):

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Comment on lines +113 to +115
expr = plx.all_horizontal(
    *chain(predicates, (plx.col(name) == v for name, v in constraints.items()))
)
FBruzzesi (Member, Author):

Needed to implement `Expr.__eq__` to get this to work. It overlaps with @EdAbati's PR
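
(For context, here is a minimal sketch of the kind of `__eq__` override being described; the class and attribute names are illustrative assumptions, not narwhals' actual internals. The key point is that the comparison returns a new expression instead of a bool, which is what lets `plx.col(name) == v` be fed into `all_horizontal`.)

```python
# Illustrative sketch only; names are assumptions, not narwhals' real code.
class Expr:
    def __init__(self, call):
        self._call = call  # maps a native frame to a native column

    def __eq__(self, other):  # type: ignore[override]
        # Return a new Expr rather than a bool, so `col("a") == 1`
        # composes with combinators such as `all_horizontal`.
        return Expr(lambda df: self._call(df) == other)


def col(name):
    return Expr(lambda df: df[name])
```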

MarcoGorelli (Member) left a comment:

awesome @FBruzzesi - not sure what's happening with the tests running in CI?

pyproject.toml (outdated review thread, resolved)
FBruzzesi (Member, Author):

> awesome @FBruzzesi - not sure what's happening with the tests running in CI?

We are not installing pyspark at all, but now --all-cpu-constructors includes pyspark. Also, I believe some Python versions (3.12 and 3.13) don't support pyspark at all.

MarcoGorelli (Member):

ah i see - maybe --all-cpu-constructors should only include those which are available?

Comment on lines 234 to 238
if constructor == "pyspark":
    if sys.version_info < (3, 12):
        constructors.append(pyspark_lazy_constructor())
    else:
        continue
FBruzzesi (Member, Author):

@MarcoGorelli maybe this is too much? πŸ™ˆ

Collaborator:

with pyspark 4.0.0 this could go away 🤞

Member:

i think this is fine πŸ‘

FBruzzesi marked this pull request as ready for review on January 8, 2025, 17:00
module="pyspark",
category=DeprecationWarning,
)
pd_df = pd.DataFrame(obj).replace({float("nan"): None}).reset_index()
camriddell (Contributor), Jan 8, 2025:

If the objects that come into these constructors are (always?) dictionaries, I think we can skip the trip through pandas and construct from a built-in Python object that Spark knows how to ingest directly (a list of dictionaries). This could be overly cautious, but Spark may infer data types differently if it is handed a pandas DataFrame rather than lists of Python objects.

Since pyspark supports a list of records, we could convert the dict → a list of dicts like so:

if isinstance(obj, dict):
    obj = [{k: v for k, v in zip(obj, row)} for row in zip(*obj.values())]

Or we could pass in the rows & schema separately:

if isinstance(obj, dict):
    df = ...createDataFrame([*zip(*obj.values())], schema=[*obj.keys()])
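
(A self-contained sketch of both suggestions, under the assumption that `obj` is a dict mapping column names to equal-length lists; `spark` is a locally created session standing in for whatever the test suite provides.)

```python
# Hedged sketch of the two options above; `obj` and `spark` are local
# stand-ins, not the PR's actual fixtures.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
obj = {"a": [1, 2, 3], "b": ["x", "y", "z"]}

# Option 1: dict of columns -> list of record dicts.
records = [dict(zip(obj, row)) for row in zip(*obj.values())]
df1 = spark.createDataFrame(records)

# Option 2: rows plus a separate list of column names as the schema.
df2 = spark.createDataFrame([*zip(*obj.values())], schema=[*obj.keys()])
```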

Collaborator:

I remember having issues with some tests, where we may need to specify the schema with column types (but I don't remember exactly what the problem was).

But if we can skip pandas here, it would be πŸ‘ŒπŸ‘ŒπŸ‘Œ

FBruzzesi (Member, Author):

I had the same thought when migrating the codebase, yet I can confirm that data types are an issue for a subset of the tests. I would say to keep it like this for now and eventually address it.
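
(Should this get revisited, one plausible way around the inference problem is an explicit schema; the following is a hypothetical sketch, not code from this PR.)

```python
# Hypothetical sketch: pin column types with an explicit StructType so
# Spark's inference (which can differ between pandas input and plain
# Python rows) is taken out of the equation.
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
data = {"a": [1.0, None, 3.0], "b": ["x", "y", "z"]}
schema = StructType(
    [StructField("a", DoubleType()), StructField("b", StringType())]
)
df = spark.createDataFrame([*zip(*data.values())], schema=schema)
```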

EdAbati (Collaborator) left a comment:

Thank you very much for doing this! πŸ™Œ

-    yield session
-    session.stop()
+    register(session.stop)
Collaborator:

TIL atexit.register, nice!
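
(For readers seeing it for the first time: `atexit.register` schedules a callback to run at interpreter shutdown, so the session teardown no longer needs to live after the fixture's `yield`. A rough sketch of the pattern, with the fixture name and scope assumed rather than taken from the PR:)

```python
# Rough sketch of the pattern in the diff above; fixture name and scope
# are assumptions, not the PR's exact conftest.
from atexit import register

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    session = SparkSession.builder.getOrCreate()
    register(session.stop)  # stop the session once, at interpreter exit
    return session
```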

narwhals/_spark_like/group_by.py (outdated review thread, resolved)

Comment on lines +168 to +169
'ignore:.*The distutils package is deprecated and slated for removal in Python 3.12:DeprecationWarning:pyspark',
'ignore:.*distutils Version classes are deprecated. Use packaging.version instead.*:DeprecationWarning:pyspark',
FBruzzesi (Member, Author):

@MarcoGorelli I moved these back to pyproject.toml, but targeting the pyspark module. Would that work for you?
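
(For context: pytest filterwarnings entries follow Python's warnings-filter syntax, `action:message:category:module`, so the trailing `pyspark` above scopes each filter to warnings raised from the pyspark module. In pyproject.toml the entries sit roughly like this; a sketch of the structure, not the PR's exact file:)

```toml
[tool.pytest.ini_options]
filterwarnings = [
    'ignore:.*The distutils package is deprecated and slated for removal in Python 3.12:DeprecationWarning:pyspark',
    'ignore:.*distutils Version classes are deprecated. Use packaging.version instead.*:DeprecationWarning:pyspark',
]
```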

MarcoGorelli (Member):

TIL

nice!

FBruzzesi requested a review from MarcoGorelli on January 9, 2025, 15:54
MarcoGorelli (Member) left a comment:

thanks @FBruzzesi

non-test changes look good

I'm about to go out, so I didn't finish reading through all the changes in the tests folder, but if you checked them and there are no rogue changes, feel free to merge

nice one! πŸ™Œ

FBruzzesi (Member, Author):

> non-test changes look good
>
> I'm about to go out, so I didn't finish reading through all the changes in the tests folder, but if you checked them and there are no rogue changes, feel free to merge
>
> nice one! 🙌

Thanks Marco! Aside from CI time increasing significantly when pyspark runs (maybe we could skip the Windows one and run pyspark on Ubuntu only), I don't see a big risk in merging now.

It is such a better developer experience to add features with tests already there πŸ™ˆ

FBruzzesi merged commit 20eb53b into main on Jan 9, 2025
23 checks passed
FBruzzesi deleted the tests/pyspark-to-main branch on January 9, 2025, 16:54
EdAbati (Collaborator) commented Jan 9, 2025:

Yeees thank you @FBruzzesi πŸ₯³πŸ₯³πŸ₯³

MarcoGorelli (Member):

> maybe we could skip the Windows one and run pyspark on Ubuntu only

yes, 👍 to this, Windows is already really slow to run...


Successfully merging this pull request may close these issues:
  • [Enh]: Move spark tests and constructor into main test suite

4 participants