Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement partial "lazy" support for DuckDB (even with this PR, DuckDB support is work-in-progress!) #1725

Merged
merged 116 commits into from
Jan 6, 2025

Conversation

MarcoGorelli
Copy link
Member

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 5, 2025 12:40
@choucavalier
Copy link

choucavalier commented Jan 5, 2025

can't wait for this to be merged πŸ’ͺ thanks @MarcoGorelli that's gonna provide a great DX

@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Jan 5, 2025
@MarcoGorelli MarcoGorelli mentioned this pull request Jan 5, 2025
4 tasks
Copy link
Collaborator

@EdAbati EdAbati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting times πŸ¦† πŸ₯°

just added a couple of small questions and (hopefully useful) comments

Comment on lines +673 to +675
if subset is not None and any(x not in self.columns for x in subset):
msg = f"Column(s) {subset} not found in {self.columns}"
raise ColumnNotFoundError(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we create a check_columns_exist function in narwhals.utils so we can reuse everywhere else? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah sure!

narwhals/_duckdb/expr.py Outdated Show resolved Hide resolved
narwhals/implementation.py Outdated Show resolved Hide resolved
tests/duckdb_test.py Outdated Show resolved Hide resolved
tests/expr_and_series/arithmetic_test.py Show resolved Hide resolved
@@ -38,6 +38,8 @@ def test_arithmetic_expr(
constructor: Constructor,
request: pytest.FixtureRequest,
) -> None:
if "duckdb" in str(constructor) and attr == "__floordiv__":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__floordiv__ should be implemented, or am I looking at the wrong thing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it behaves differently though, so i think we need a separate discussion for how to deal with it, e.g.

In [3]: duckdb.sql('select 1.5 // 2.5')
Out[3]: 
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ (1.5 // 2.5) β”‚
β”‚    double    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚          0.6 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In [4]: 1.5 // 2.5
Out[4]: 0.0

@@ -109,18 +112,18 @@ def test_cast(


def test_cast_series(
constructor: Constructor,
constructor_eager: ConstructorEager,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to take a crash course in duckDB. I tried to leave a couple of comments.

Main point being, how we want to design the collect method in the main LazyFrame class

Comment on lines +84 to +88
return ArrowDataFrame(
native_dataframe=self._native_frame.arrow(),
backend_version=parse_version(pa.__version__),
version=self._version,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding was that duckdb is dependency free.
Should we jump to the discussion in #1479 before deciding how to collect for duckdb?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think even in that one the likely default for duckdb would still be pyarrow though, right? i've added a try-except anyway to show a less surprising error message

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to make a draft/RFC tomorrow to follow up on my comment in the thread ;)


def to_pandas(self: Self) -> pd.DataFrame:
# only is version if v1, keep around for backcompat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: implement version check? Same for to_arrow?

Suggested change
# only is version if v1, keep around for backcompat
# only if version is v1, keep around for backcompat

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_pandas wouldn't be available on nw.LazyFrame anyway so this wouldn't be reachable for non-v1

Comment on lines +227 to +228
assert left_on is not None # noqa: S101
assert right_on is not None # noqa: S101
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should already never be the case after all the checks in BaseFrame.join considering how in {"inner", "left"}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true but then mypy complains 😭

narwhals/_duckdb/dataframe.py Outdated Show resolved Hide resolved
@MarcoGorelli
Copy link
Member Author

thanks both for your reviews and comments, much appreciated!

any objections to merging as-is before merge conflicts, and then we iterate on it until it's complete?

@MarcoGorelli MarcoGorelli changed the title feat: Implement "lazy" support for DuckDB feat: Implement partial "lazy" support for DuckDB (even with this PR, DuckDB support is work-in-progress!) Jan 5, 2025
@MarcoGorelli
Copy link
Member Author

thanks all for comments!

doesn't look like there's been objections, and this PR is quite self-standing (it doesn't affect existing core backends) so I'll go ahead and ship it so we can release, then we can fill it out bit-by-bit and it can become truly incredible

@choucavalier thanks for your interest - please note that this is really work-in-progress so you'll likely run into quite a few missing methods if you try it out. Nonetheless, i'd be curious to hear how you find it if you do

@MarcoGorelli MarcoGorelli merged commit aa48faa into narwhals-dev:main Jan 6, 2025
23 checks passed
@choucavalier
Copy link

thanks @MarcoGorelli i'll try it out and report any issue with proper failing tests :)

thanks for your amazing work. you're a MACHINE. like @EdAbati said: exciting times!! πŸš€

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants