API change: Transformer.transform_iter() returns an iter-dict generator #481

Merged
merged 41 commits into from
Sep 18, 2024
Changes from 4 commits
ceb9369
deprecate caching
cmacdonald Sep 4, 2024
2c8552a
Revert "deprecate caching"
cmacdonald Sep 4, 2024
304943c
wip - transform_iter return changes
cmacdonald Sep 4, 2024
be5eb5f
correct dataframe method name
cmacdonald Sep 4, 2024
6f788d0
use __new__ for checking impls
cmacdonald Sep 5, 2024
cd0c6e2
spacing fix
cmacdonald Sep 5, 2024
cbda858
improved testing for new API
cmacdonald Sep 5, 2024
ac24c9d
todo comment on types
cmacdonald Sep 5, 2024
60202ee
added apply impl that work for transform_iter
cmacdonald Sep 5, 2024
f02525d
documentation fix
cmacdonald Sep 5, 2024
1376fe5
documentation and type improvements
cmacdonald Sep 5, 2024
31db9b0
Merge branch 'master' into transform_iter
cmacdonald Sep 5, 2024
e92856b
changed Transformer API - transform_iter returns an Iterator; __call_…
cmacdonald Sep 13, 2024
16aafcc
pt.apply.generic now fixed
cmacdonald Sep 14, 2024
ac5a2b1
added iter support for by_query as well
cmacdonald Sep 14, 2024
cd119fd
documentation fixes
cmacdonald Sep 14, 2024
9334837
doc improvements
cmacdonald Sep 14, 2024
7a23903
eliminate mypy warning
cmacdonald Sep 14, 2024
0617801
check that returning lists still works
cmacdonald Sep 14, 2024
9c83c0e
revert API s.t. transform_iter returns an Iterable (which can be a ge…
cmacdonald Sep 14, 2024
c21226e
revise return types
cmacdonald Sep 14, 2024
d5c88e9
remove gen() fn
cmacdonald Sep 14, 2024
192af88
use itertools.groupby
cmacdonald Sep 14, 2024
ad20b8a
remove superfluous iter()
cmacdonald Sep 14, 2024
8f5df56
simplify transform_iter() in Compose
cmacdonald Sep 14, 2024
5445c91
doc clarification
cmacdonald Sep 14, 2024
8bb25da
`__call__` input/output, type annotations, documentation
seanmacavaney Sep 17, 2024
8fd74d0
fix documentation generation
seanmacavaney Sep 17, 2024
087d342
alternative way to handle indexer implementations of transform/transf…
seanmacavaney Sep 17, 2024
1c59bc7
remove temporary comment
seanmacavaney Sep 17, 2024
d76744d
weird
seanmacavaney Sep 17, 2024
5d16f7e
use new typevars
cmacdonald Sep 17, 2024
5a087ac
ApplyByRowTransformer and DropColumnTransformer
seanmacavaney Sep 17, 2024
94fb3d8
Merge remote-tracking branch 'origin/transform_iter' into transform_iter
seanmacavaney Sep 17, 2024
5cca486
missing kwargs
seanmacavaney Sep 17, 2024
58175a4
cleanup apply transformers
seanmacavaney Sep 18, 2024
cff0601
importing Transformer from pyterrier.apply_base? weird
seanmacavaney Sep 18, 2024
f2e837c
push_queries_dict
seanmacavaney Sep 18, 2024
9133a9a
fix tests
seanmacavaney Sep 18, 2024
f544d2f
type fix
cmacdonald Sep 18, 2024
ae9bc59
type fixes
cmacdonald Sep 18, 2024
9 changes: 6 additions & 3 deletions pyterrier/ops.py
@@ -325,15 +325,18 @@ def index(self, iter : Iterable[dict], batch_size=100):

         def gen():
             for batch in chunked(iter, batch_size):
-                batch_df = prev_transformer.transform_iter(batch)
-                for row in batch_df.itertuples(index=False):
-                    yield row._asdict()
+                yield from prev_transformer.transform_iter(batch)
         return last_transformer.index(gen())

     def transform(self, topics):
         for m in self.models:
             topics = m.transform(topics)
         return topics

+    def transform_iter(self, topics):
+        for m in self.models:
+            topics = m.transform_iter(topics)
+        return topics
+
     def fit(self, topics_or_res_tr, qrels_tr, topics_or_res_va=None, qrels_va=None):
         """
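The new `transform_iter()` on the composed pipeline simply threads the iterable through each stage in turn. The following is a minimal sketch of that pattern using only the standard library; `AddLength`, `KeepLong`, the stand-in `Compose`, and `source()` are hypothetical names invented for illustration and are not part of PyTerrier. It assumes each stage is written as a generator, in which case nothing is read from the source until the result is iterated.

```python
from typing import Iterable, Iterator, List


class AddLength:
    """Hypothetical stage: annotates each record with the length of its text."""

    def transform_iter(self, records: Iterable[dict]) -> Iterator[dict]:
        for rec in records:
            yield {**rec, "length": len(rec["text"])}


class KeepLong:
    """Hypothetical stage: drops records shorter than min_length."""

    def __init__(self, min_length: int):
        self.min_length = min_length

    def transform_iter(self, records: Iterable[dict]) -> Iterator[dict]:
        for rec in records:
            if rec["length"] >= self.min_length:
                yield rec


class Compose:
    """Stand-in for the composed pipeline; only transform_iter mirrors the diff."""

    def __init__(self, models: List):
        self.models = models

    def transform_iter(self, records: Iterable[dict]) -> Iterable[dict]:
        # Each stage wraps the iterable produced by the previous stage.
        for m in self.models:
            records = m.transform_iter(records)
        return records


consumed: List[str] = []


def source() -> Iterator[dict]:
    for docno, text in [("d1", "short"), ("d2", "a much longer passage")]:
        consumed.append(docno)  # record when the source is actually read
        yield {"docno": docno, "text": text}


pipeline = Compose([AddLength(), KeepLong(min_length=10)])
out = pipeline.transform_iter(source())
consumed_before = list(consumed)  # snapshot taken before any iteration
results = list(out)               # iterating drives the whole chain
```

If a stage returns a list instead of a generator, the same composition still works, but that stage will consume its input eagerly.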
35 changes: 25 additions & 10 deletions pyterrier/transformer.py
@@ -54,8 +54,17 @@ class Transformer:
     """
     Base class for all transformers. Implements the various operators ``>>`` ``+`` ``*`` ``|`` ``&``
     as well as ``search()`` for executing a single query and ``compile()`` for rewriting complex pipelines into simpler ones.
+
+    Any class extending this is expected to implement either ``.transform()`` or ``.transform_iter()``. This rule
+    does not apply to indexers, which instead implement ``.index()``.
     """

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._transform_implemented = type(self).transform != Transformer.transform
+        self._transform_iter_implemented = type(self).transform_iter != Transformer.transform_iter
+        # we can't require here that at least one of the two flags is set, because indexers implement neither method
+
     @staticmethod
     def identity() -> 'Transformer':
         """
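The `__init__` above detects overrides by comparing the function found on the instance's class with the one defined on the base class. A minimal standalone sketch of that check follows; `Base`, `OnlyIter`, and `Forgetful` are hypothetical classes, not PyTerrier code. It also illustrates one caveat of placing the check in `__init__`: the flags only exist if the subclass calls `super().__init__()`.

```python
class Base:
    def __init__(self):
        # Looking the method up on the class yields a plain function, so an
        # inequality with the base class's function means "overridden".
        self._transform_implemented = type(self).transform != Base.transform
        self._transform_iter_implemented = type(self).transform_iter != Base.transform_iter

    def transform(self, df):
        raise NotImplementedError

    def transform_iter(self, records):
        raise NotImplementedError


class OnlyIter(Base):
    def transform_iter(self, records):
        return [dict(r) for r in records]


class Forgetful(Base):
    def __init__(self):
        pass  # deliberately does not call super().__init__()


t = OnlyIter()
flags = (t._transform_implemented, t._transform_iter_implemented)
forgetful_has_flags = hasattr(Forgetful(), "_transform_implemented")
```

Here `flags` records that only `transform_iter` was overridden, while `forgetful_has_flags` shows that a subclass skipping the base constructor never gets the attributes at all.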
@@ -93,16 +102,21 @@ def transform(self, topics_or_res : pd.DataFrame) -> pd.DataFrame:
         Abstract method for all transformations. Typically takes as input a Pandas
         DataFrame, and also returns one.
         """
-        pass
+        if not self._transform_iter_implemented:
+            raise NotImplementedError("You need to implement either .transform() or .transform_iter() in %s" % str(type(self)))
+        return pd.DataFrame(self.transform_iter(topics_or_res.to_dict(orient='records')))

-    def transform_iter(self, input: Iterable[dict]) -> pd.DataFrame:
+    def transform_iter(self, input: Iterable[dict]) -> Iterable[dict]:
         """
-        Method that proesses an iter-dict by instantiating it as a dataframe and calling transform().
-        Returns the DataFrame returned by transform(). This can be a handier version of transform()
-        that avoids constructing a dataframe by hand. Alo used in the implementation of index() on a composed
-        pipeline.
+        Method that processes an iter-dict by instantiating it as a dataframe and calling ``transform()``.
+        Returns an Iterable[dict] equivalent to the DataFrame returned by ``transform()``. This can be a
+        handier version of ``transform()`` that avoids constructing a dataframe by hand. Also used in the
+        implementation of ``index()`` on a composed pipeline.
         """
-        return self.transform(pd.DataFrame(list(input)))
+        if not self._transform_implemented:
+            raise NotImplementedError("You need to implement either .transform() or .transform_iter() in %s" % str(type(self)))
+
+        return self.transform(pd.DataFrame(list(input))).to_dict(orient='records')

     def transform_gen(self, input : pd.DataFrame, batch_size=1, output_topics=False) -> Iterator[pd.DataFrame]:
         """
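In this hunk the two defaults are defined in terms of each other, so the flags act as a guard: without them, a subclass implementing neither method would bounce between the two defaults until Python raised a `RecursionError`. The sketch below reproduces that structure without pandas. As a labeled assumption, a column-oriented `dict` stands in for the DataFrame, and `to_records`, `to_table`, `Base`, `IterOnly`, and `Neither` are hypothetical names.

```python
from typing import Dict, Iterable, List

Table = Dict[str, list]  # assumption: column-oriented stand-in for a DataFrame


def to_records(table: Table) -> List[dict]:
    cols = list(table)
    n = len(table[cols[0]]) if cols else 0
    return [{c: table[c][i] for c in cols} for i in range(n)]


def to_table(records: Iterable[dict]) -> Table:
    rows = list(records)
    cols = list(rows[0]) if rows else []
    return {c: [r[c] for r in rows] for c in cols}


class Base:
    def __init__(self):
        self._transform_implemented = type(self).transform != Base.transform
        self._transform_iter_implemented = type(self).transform_iter != Base.transform_iter

    def transform(self, table: Table) -> Table:
        # Default: delegate to transform_iter, but only if a subclass provides it.
        if not self._transform_iter_implemented:
            raise NotImplementedError(
                "implement either .transform() or .transform_iter() in %s" % type(self).__name__)
        return to_table(self.transform_iter(to_records(table)))

    def transform_iter(self, records: Iterable[dict]) -> Iterable[dict]:
        # Default: delegate to transform, but only if a subclass provides it.
        if not self._transform_implemented:
            raise NotImplementedError(
                "implement either .transform() or .transform_iter() in %s" % type(self).__name__)
        return to_records(self.transform(to_table(records)))


class IterOnly(Base):
    def transform_iter(self, records):
        for r in records:
            yield {**r, "query": r["query"].strip()}


class Neither(Base):
    pass


table_out = IterOnly().transform({"qid": ["1"], "query": ["  hello "]})

try:
    Neither().transform({"qid": [], "query": []})
    error = None
except NotImplementedError as exc:
    error = str(exc)
```

`IterOnly` gets a working `transform()` for free, and `Neither` fails fast with a message naming the offending class.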
@@ -214,10 +228,11 @@ def set_parameter(self, name : str, value):
             raise ValueError(('Invalid parameter name %s for transformer %s. '+
                     'Check the list of available parameters') %(name, str(self)))

-    def __call__(self, input : Union[pd.DataFrame, Iterable[dict]]) -> pd.DataFrame:
+    def __call__(self, input : Union[pd.DataFrame, Iterable[dict]]) -> Union[pd.DataFrame, Iterable[dict]]:
         """
-        Sets up a default method for every transformer, which is aliased to transform() (for DataFrames)
-        or transform_iter() (for iterable dictionaries) depending on the type of input.
+        Sets up a default method for every transformer, which is aliased to ``transform()`` (for DataFrames)
+        or ``transform_iter()`` (for iterable dictionaries) depending on the type of input. The return type
+        matches the input type.
         """
         if isinstance(input, pd.DataFrame):
             return self.transform(input)
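The docstring says the return type of `__call__` now matches the input type. The hunk is cut off after the DataFrame branch, so the fallback to `transform_iter()` in the sketch below is an assumption based on that docstring rather than on visible code. `Lowercase` is a hypothetical transformer that does not extend PyTerrier's `Transformer`; only the pandas calls (`pd.DataFrame(...)`, `DataFrame.assign`, `DataFrame.to_dict(orient="records")`) are real library API.

```python
from typing import Iterable, Union

import pandas as pd


class Lowercase:
    """Hypothetical transformer used only to illustrate the dispatch."""

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(query=df["query"].str.lower())

    def transform_iter(self, records: Iterable[dict]) -> Iterable[dict]:
        # Same round trip as the default transform_iter() in the diff:
        # records -> DataFrame -> transform() -> records.
        return self.transform(pd.DataFrame(list(records))).to_dict(orient="records")

    def __call__(self, inp: Union[pd.DataFrame, Iterable[dict]]):
        if isinstance(inp, pd.DataFrame):
            return self.transform(inp)
        # Assumed fallback for any other iterable of dicts.
        return self.transform_iter(inp)


t = Lowercase()
as_frame = t(pd.DataFrame([{"qid": "1", "query": "Chemical Reactions"}]))
as_records = t([{"qid": "1", "query": "Chemical Reactions"}])
```

Passing a DataFrame yields a DataFrame, and passing a list of dicts yields a list of dicts, which is the behaviour the updated type annotation describes.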