SNOW-860063 Update documentation #964

Merged
merged 8 commits into from
Jul 31, 2023
Changes from 6 commits
1 change: 1 addition & 0 deletions docs/source/functions.rst
@@ -217,6 +217,7 @@ Functions
object_keys
object_pick
pandas_udf
pandas_udtf
parse_json
parse_xml
percent_rank
23 changes: 23 additions & 0 deletions src/snowflake/snowpark/functions.py
@@ -7034,6 +7034,29 @@ def pandas_udtf(
- :func:`udtf`
- :meth:`UDTFRegistration.register() <snowflake.snowpark.udf.UDTFRegistration.register>`

Compared to the default row-by-row processing pattern of a normal UDTF, which is sometimes
inefficient, vectorized Python UDTFs (user-defined table functions) enable seamless partition-by-partition processing
by operating on partitions as
`Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_
and returning results as
`Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_
or lists of `Pandas arrays <https://pandas.pydata.org/docs/reference/api/pandas.array.html>`_
or `Pandas Series <https://pandas.pydata.org/docs/reference/series.html>`_.
In addition, vectorized Python UDTFs allow for easy integration with libraries that operate on pandas DataFrames or pandas arrays.

A vectorized UDTF handler class:
- defines an :code:`end_partition` method that takes a DataFrame argument and returns a :code:`pandas.DataFrame`, or a tuple of :code:`pandas.Series` or :code:`pandas.arrays` in which each element is one output column.
- does NOT define a :code:`process` method.
- optionally defines an :code:`__init__` method, which is invoked before processing each partition.

You can use :func:`~snowflake.snowpark.functions.udtf`, :meth:`register` or
:func:`~snowflake.snowpark.functions.pandas_udtf` to create a vectorized UDTF by providing
appropriate return and input types. If you would like to use :meth:`register_from_file` to
create a vectorized UDTF, you need to explicitly mark the handler method as vectorized, either by
applying the decorator `@vectorized(input=pandas.DataFrame)` or by setting `<class>.end_partition._sf_vectorized_input = pandas.DataFrame`.

Note: A vectorized UDTF must be called with a PARTITION BY clause to build the partitions.

Example::
>>> from snowflake.snowpark.types import PandasSeriesType, PandasDataFrameType, IntegerType
>>> class multiply:
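The handler contract described in the docstring above can be sketched locally with plain pandas. This is a minimal illustration, not the example from the PR (which is truncated in this view): the class name `CenterByPartitionMean` and the column name `value` are hypothetical, and the block assumes only that pandas is installed. It shows an `end_partition` method that receives one partition as a DataFrame and returns a DataFrame, no `process` method, and the attribute-based marking that the docstring mentions for `register_from_file`.

```python
import pandas as pd


class CenterByPartitionMean:
    """Hypothetical vectorized UDTF handler; the name and columns are illustrative."""

    # Note: no `process` method is defined, as the docstring requires.
    # An `__init__` method is optional and is omitted here.

    def end_partition(self, df: pd.DataFrame) -> pd.DataFrame:
        # The whole partition arrives as a single DataFrame, so the
        # computation can use vectorized pandas operations.
        result = df.copy()
        result["centered"] = result["value"] - result["value"].mean()
        return result


# Marking the handler as vectorized by attribute, the second option the
# docstring lists for handlers registered through register_from_file.
CenterByPartitionMean.end_partition._sf_vectorized_input = pd.DataFrame

# Local check with one hand-built partition (no Snowflake session needed).
partition = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
output = CenterByPartitionMean().end_partition(partition)
```

Running the handler locally like this only exercises the pandas logic; how column names and types map to the registered input and output schema is decided at registration time and is not covered by the sketch.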
5 changes: 5 additions & 0 deletions src/snowflake/snowpark/session.py
@@ -1845,6 +1845,11 @@ def create_dataframe(
>>> import pandas as pd
>>> session.create_dataframe(pd.DataFrame([(1, 2, 3, 4)], columns=["a", "b", "c", "d"])).collect()
[Row(a=1, b=2, c=3, d=4)]

Note:
When `data` is a pandas DataFrame, `snowflake.connector.pandas_tools.write_pandas` is called, which
requires permission to (1) CREATE STAGE, (2) CREATE TABLE, and (3) CREATE FILE FORMAT under the current
database and schema.
"""
if data is None:
raise ValueError("data cannot be None.")
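For the three privileges listed in the note above, a small helper can generate the corresponding `GRANT` statements for an administrator to review. This is only a sketch under stated assumptions: the function name `build_grant_statements` and the database, schema, and role names are hypothetical, and the note does not say whether further privileges (for example `USAGE` on the database and schema) are also needed, so none are added here.

```python
# Privileges named in the create_dataframe note for pandas input.
REQUIRED_SCHEMA_PRIVILEGES = ("CREATE STAGE", "CREATE TABLE", "CREATE FILE FORMAT")


def build_grant_statements(database: str, schema: str, role: str) -> list:
    """Return one GRANT statement per privilege (hypothetical helper)."""
    target = f"{database}.{schema}"
    return [
        f"GRANT {privilege} ON SCHEMA {target} TO ROLE {role}"
        for privilege in REQUIRED_SCHEMA_PRIVILEGES
    ]


# Placeholder identifiers, chosen only for illustration.
statements = build_grant_statements("MY_DB", "MY_SCHEMA", "ANALYST")
```

The helper only builds strings; executing them (for instance through `session.sql(...).collect()`) is left to a role that is allowed to grant these privileges.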
22 changes: 18 additions & 4 deletions src/snowflake/snowpark/udtf.py
@@ -299,15 +299,29 @@ class UDTFRegistration:
- :meth:`~snowflake.snowpark.DataFrame.join_table_function`

Compared to the default row-by-row processing pattern of a normal UDTF, which is sometimes
inefficient, a vectorized UDTF allows vectorized operations on a dataframe, with the input as a
`Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_. In a
vectorized UDTF, you can operate on batches of rows by handling Pandas DataFrame or Pandas
Series. You can use :func:`~snowflake.snowpark.functions.udtf`, :meth:`register` or
inefficient, vectorized Python UDTFs (user-defined table functions) enable seamless partition-by-partition processing
by operating on partitions as
`Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_
and returning results as
`Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_
or lists of `Pandas arrays <https://pandas.pydata.org/docs/reference/api/pandas.array.html>`_
or `Pandas Series <https://pandas.pydata.org/docs/reference/series.html>`_.
Vectorized Python UDTFs allow for easy integration with libraries that operate on pandas DataFrames or pandas arrays.

A vectorized UDTF handler class:
- defines an :code:`end_partition` method that takes a DataFrame argument and returns a :code:`pandas.DataFrame`, or a tuple of :code:`pandas.Series` or :code:`pandas.arrays` in which each element is one output column.
- does NOT define a :code:`process` method.
- optionally defines an :code:`__init__` method, which is invoked before processing each partition.

You can use :func:`~snowflake.snowpark.functions.udtf`, :meth:`register` or
:func:`~snowflake.snowpark.functions.pandas_udtf` to create a vectorized UDTF by providing
appropriate return and input types. If you would like to use :meth:`register_from_file` to
create a vectorized UDTF, you need to explicitly mark the handler method as vectorized, either by
applying the decorator `@vectorized(input=pandas.DataFrame)` or by setting `<class>.end_partition._sf_vectorized_input = pandas.DataFrame`.

Note: A vectorized UDTF must be called with a PARTITION BY clause to build the partitions.


Example 11
Creating a vectorized UDTF by specifying a `PandasDataFrameType` as `input_types` and a `PandasDataFrameType` with column names as `output_schema`.
>>> from snowflake.snowpark.types import PandasDataFrameType, IntegerType, StringType, FloatType
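The PARTITION BY requirement from the note in the docstring can be sketched as a small wrapper. It assumes `df` is a Snowpark DataFrame and `vectorized_udtf` is the object returned when the UDTF was registered (for example by `udtf` or `pandas_udtf`); the helper name `call_vectorized_udtf` and its parameters are hypothetical. Calling the registered UDTF with column names and chaining `.over(partition_by=...)` is what supplies the partitions that each `end_partition` call receives.

```python
def call_vectorized_udtf(df, vectorized_udtf, input_cols, partition_col):
    """Join `df` with a vectorized UDTF, partitioned by `partition_col` (sketch).

    Assumptions: `df` is a Snowpark DataFrame and `vectorized_udtf` is a
    registered UDTF object; neither is created here.
    """
    # Build the table function call and attach the PARTITION BY clause.
    table_function_call = vectorized_udtf(*input_cols).over(partition_by=partition_col)
    # join_table_function is the DataFrame method referenced in the docstring.
    return df.join_table_function(table_function_call)
```

A call might look like `call_vectorized_udtf(df, my_udtf, ["id", "value"], "id")`, where `my_udtf` and the column names are placeholders for objects defined elsewhere.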