API change: Transformer.transform_iter() returns an iter-dict generator #481

cmacdonald · 2024-09-04T17:01:01Z

Problem

Currently .transform_iter() returns a DataFrame.

As a result, ComposedPipeline.index() calls .transform_iter() and has to undo the DataFrame for the next invocation.

The disadvantages are:

Extra DataFrame constructions and deconstructions
No API that users can use and implement that operates entirely in iter-dicts, and therefore avoid DataFrames

Proposal

1. .transform_iter() takes and returns an iterable of dicts (which can be a generator), rather than a DataFrame.
2. As a consequence of (1), .__call__() takes and returns an iter-dict (instantiated as a list), OR takes and returns a DataFrame.
3. Implementers need only implement .transform() OR .transform_iter(); .__call__() can detect which one is implemented and redirect appropriately.
4. Can we do the same thing for .transform(), i.e. its default implementation calls .transform_iter() (and vice versa)?
5. Add an easy-to-remember kwarg for pt.apply to take iter-dicts (iter=True).
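To make the intent of the proposal concrete, here is a minimal, library-free sketch of stages that operate entirely on iter-dicts. The names IterStage, AddQueryLength, UppercaseQuery and Chain are hypothetical stand-ins, not PyTerrier classes; the sketch only assumes that each stage exposes transform_iter() and that __call__() materialises the result as a list.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


class IterStage:
    """Hypothetical stand-in for a transformer that only handles iter-dicts."""

    def transform_iter(self, rows: Iterable[Row]) -> Iterable[Row]:
        raise NotImplementedError

    def __call__(self, rows: Iterable[Row]) -> List[Row]:
        # __call__ instantiates the (possibly lazy) result as a list.
        return list(self.transform_iter(rows))


class AddQueryLength(IterStage):
    def transform_iter(self, rows):
        for row in rows:
            # Build a new dict instead of mutating the caller's row.
            yield dict(row, query_len=len(row["query"].split()))


class UppercaseQuery(IterStage):
    def transform_iter(self, rows):
        for row in rows:
            yield dict(row, query=row["query"].upper())


class Chain(IterStage):
    """Feeds each stage's iterable straight into the next; no table is built."""

    def __init__(self, *stages: IterStage):
        self.stages = stages

    def transform_iter(self, rows):
        for stage in self.stages:
            rows = stage.transform_iter(rows)
        return rows


pipeline = Chain(AddQueryLength(), UppercaseQuery())
result = pipeline([{"qid": "q1", "query": "chemical reactions"}])
```

Because every stage here is a generator, rows flow through the chain one at a time and nothing is materialised until __call__() builds the final list.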

Prerequisites

We need to be able to identify overridden methods; this probably needs to be done once, in the Transformer.__init__() constructor, rather than on every invocation. https://stackoverflow.com/questions/9436681/how-to-detect-method-overloading-in-subclasses-in-python looks like a possible solution.
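One way the override detection could look, as a rough sketch rather than the code in this PR: compare the function found on the instance's class with the one defined on the base class, and cache the answer in the constructor. Base, OnlyIter and the _has_* attribute names are hypothetical.

```python
class Base:
    def __init__(self):
        cls = type(self)
        # Looking a method up on the class returns the plain function, so an
        # identity check tells us whether a subclass replaced it.
        self._has_transform = cls.transform is not Base.transform
        self._has_transform_iter = cls.transform_iter is not Base.transform_iter

    def transform(self, inp):
        raise NotImplementedError

    def transform_iter(self, inp):
        raise NotImplementedError


class OnlyIter(Base):
    def transform_iter(self, inp):
        yield from inp


stage = OnlyIter()
```

A drawback of doing this in __init__() is that subclasses which define their own constructor and forget to call super().__init__() never get the flags set.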

Potential for breaking changes

We only told people to implement .transform() in the past, not .transform_iter()

However, I have found some repositories implementing .transform_iter(). See https://github.com/search?q=pyterrier+transform_iter&type=code, specifically
https://github.com/argonism/OPEX

Is this change enough to make this a major version change?

cmacdonald · 2024-09-04T21:11:56Z

I didn't know about __init_subclass__, and having read about it, it sounds attractive, so I followed your suggestions:

class Transformer:
    def __init_subclass__(cls, **kwargs):
        # Keep cooperative subclass hooks working.
        super().__init_subclass__(**kwargs)
        # Reject subclasses that override neither method.
        if cls.transform == Transformer.transform and cls.transform_iter == Transformer.transform_iter:
            raise NotImplementedError("You need to implement either .transform() or .transform_iter() in %s" % str(cls))

and

from typing import Iterable

class Indexer(Transformer):
    def index(self, iter: Iterable[dict], **kwargs):
        pass

    # TODO - what about classes that are both Indexers and Transformers?
    def transform(self, df):
        raise TypeError("It's not expected that .transform() is called on an Indexer")

    def transform_iter(self, df):
        raise TypeError("It's not expected that .transform_iter() is called on an Indexer")

This has two problems:
(1) pt.TransformerBase throws the error (as will any other "abstract" classes)
(2) We've now overridden transform and transform_iter in our Indexer, which would be a problem for a class that is both an Indexer and a Transformer?

My code is in https://github.com/terrier-org/pyterrier/tree/transform_iter_subclass

seanmacavaney · 2024-09-05T07:33:21Z

For (1) ABC handles this somehow. Maybe we dig in a bit to see how it's done there?

(2) isn't so common, I suppose it could just be up to the implementer to make sure this is handled?

Or, as a more major change, maybe it doesn't make very much sense to have indexer inherit from transformer? Some of the transformer functionality applies, but actually not that much of it. They should be able to be included in a ComposedPipeline, but only at the end. Rank cutoff, feature composition, etc. do not apply. Etc.

cmacdonald · 2024-09-05T09:23:09Z

> For (1) ABC handles this somehow. Maybe we dig in a bit to see how it's done there?

They override __new__, which is much less common to override.

> (2) isn't so common, I suppose it could just be up to the implementer to make sure this is handled?

I've excluded Indexer from the check. However, it won't fail gracefully.

> (3) Or, as a more major change, maybe it doesn't make very much sense to have indexer inherit from transformer? Some of the transformer functionality applies, but actually not that much of it. They should be able to be included in a ComposedPipeline, but only at the end. Rank cutoff, feature composition, etc. do not apply. Etc.

I'm amenable to exploring this, but I'm not sure it's relevant to this PR. In essence, Indexer is a consumer, not a pipeline component. We should check standard patterns to see what's common.
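For illustration, a check placed in __new__ runs at instantiation time rather than at class-definition time, which is what lets "abstract" intermediate classes such as pt.TransformerBase exist without tripping the error. This is only a sketch of the idea under that assumption; the classes below are stand-ins and not the code in the branch.

```python
class Transformer:
    def __new__(cls, *args, **kwargs):
        # Checked when an object is created, so merely defining an abstract
        # intermediate subclass is allowed.
        if (cls.transform is Transformer.transform
                and cls.transform_iter is Transformer.transform_iter):
            raise NotImplementedError(
                "You need to implement either .transform() or "
                ".transform_iter() in %s" % cls.__name__)
        return super().__new__(cls)

    def transform(self, inp):
        raise NotImplementedError

    def transform_iter(self, inp):
        raise NotImplementedError


class TransformerBase(Transformer):
    """Abstract helper: defining it is fine, instantiating it is not."""


class Identity(TransformerBase):
    def transform(self, inp):
        return inp


try:
    TransformerBase()
    abstract_error = None
except NotImplementedError as exc:
    abstract_error = str(exc)

identity = Identity()
```

The trade-off is that the mistake is reported later than with __init_subclass__: only when someone tries to construct the incomplete class.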

cmacdonald · 2024-09-05T10:03:42Z

I can confirm that the following works: invoke __call__ with an iter-dict and it will return an iter-dict 😀 :

>>> r = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed')
>>> (r % 1) ([{'qid' : 'q1', 'query' : 'chemical reactions'}])
[{'qid': 'q1', 'docid': 9373, 'docno': '9374', 'rank': 0, 'score': 10.619215559726808, 'query': 'chemical reactions'}]

cmacdonald · 2024-09-14T17:31:22Z

I think this one is ready for re-review, Sean. Some notes for your consideration:

I've indeed reverted to .transform_iter() returning Iterable[Dict]. This means it /can/ return a list (because yield from on a list works as expected), but returning a generator is supported, and indeed I think it should be encouraged.
I wondered if the return type of __call__ should be List[Dict]?
~~We may want to consider the default case, specifically whether Transformer.transform_iter() default impl needs to make a list from a DataFrame then return iter(). This is perhaps unnecessary?~~ ✅
pt.apply factories for functions that take multiple rows can now accept an iter=True kwarg; no change is needed for single-row pt.apply factories. Test cases check that functions can return generators or lists.
pt.model.add_ranks support for iter-dicts is probably needed in the future.
Some typevars would be nice, but that's feature-creep at this stage.
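A small sketch of why an Iterable[Dict] return type accommodates both styles: a consumer that uses yield from does not care whether the function it wraps hands back a list or a generator. The function names are hypothetical and this does not use pt.apply.

```python
def returns_list(rows):
    # Eager: the whole output exists before the caller sees any of it.
    return [dict(r, seen=True) for r in rows]


def returns_generator(rows):
    # Lazy: rows are produced one at a time as the caller iterates.
    for r in rows:
        yield dict(r, seen=True)


def consume(fn, rows):
    # Hypothetical downstream stage; yield from accepts any iterable.
    yield from fn(rows)


rows = [{"docno": "d1"}, {"docno": "d2"}]
from_list = list(consume(returns_list, rows))
from_generator = list(consume(returns_generator, rows))
```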

cmacdonald · 2024-09-14T18:12:13Z

Question: can we somehow know how many times DataFrames are constructed, and check that this makes an indexing pipeline faster?

cmacdonald · 2024-09-14T19:11:44Z

> We only told people to implement .transform() in the past, not .transform_iter()

I think the API change may be backwards compatible, at least in some cases. If .transform_iter() returns a DataFrame rather than an Iterable, it might work fine, at least in cases where the output of transform_iter() is used to construct a DataFrame, e.g. pd.DataFrame(list(self.transform_iter())); cf. https://github.com/terrier-org/pyterrier/blob/transform_iter/pyterrier/transformer.py#L105

If we really care about this, we could add some unit tests.

seanmacavaney · 2024-09-17T13:41:57Z

> I think the API change may be backwards compatible

I think you're right, and in the cases where it doesn't work, I'm happy with having it as a breaking change.

seanmacavaney · 2024-09-17T13:44:46Z

> Question: can we somehow know how many times DataFrames are constructed, and check that this makes an indexing pipeline faster?

Perhaps we could patch pandas.DataFrame with one that counts the number of invocations? It depends on having transformers that implement transform_iter() to have a positive effect though, right?
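A rough sketch of such a counting patch, demonstrated on a stand-in namespace so that it runs without pandas. The helper count_constructions, the Frame class and fake_pd are all hypothetical; pointing the helper at the real pandas module and "DataFrame" is the assumed use and is untested here.

```python
import contextlib
import types


@contextlib.contextmanager
def count_constructions(namespace, name):
    """Temporarily replace namespace.<name> with a subclass that counts inits."""
    original = getattr(namespace, name)
    counter = {"calls": 0}

    class Counting(original):
        def __init__(self, *args, **kwargs):
            counter["calls"] += 1
            super().__init__(*args, **kwargs)

    setattr(namespace, name, Counting)
    try:
        yield counter
    finally:
        # Always restore the original class, even if the body raised.
        setattr(namespace, name, original)


class Frame:
    """Stand-in for a table type; not pandas."""

    def __init__(self, rows):
        self.rows = list(rows)


fake_pd = types.SimpleNamespace(DataFrame=Frame)

with count_constructions(fake_pd, "DataFrame") as counter:
    fake_pd.DataFrame([{"qid": "q1"}])
    fake_pd.DataFrame([])
```

Note that a patch of this kind only sees constructions that look the class up through the patched namespace at call time; code that imported the class by name beforehand, or objects the library builds internally, would likely go uncounted.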

cmacdonald · 2024-09-17T14:02:15Z

> Question: can we somehow know how many times DataFrames are constructed, and check that this makes an indexing pipeline faster?

> Perhaps we could patch pandas.DataFrame with one that counts the number of invocations? It depends on having transformers that implement transform_iter() to have a positive effect though, right?

At least an indexing pipeline should now be much simpler:

pipeline = pt.apply.text(lambda row: row["title"] + " " + row["text"])
pipeline.index(some_corpus_with_title_and_body)

seanmacavaney · 2024-09-17T16:30:17Z

Alright, @cmacdonald can you take a look at my change?

With another look, I'm not so jazzed that somebody who accidentally uses an Indexer as a regular transformer will get a stack overflow error. I'll try to address in another commit.

seanmacavaney · 2024-09-17T17:03:52Z

Alright, I think I sorted that out now too, with test cases to verify.

cmacdonald · 2024-09-17T19:39:23Z

Thanks Sean, this is looking really good now. I think we should:
(a) do some testing that this indeed improves indexing speed for some indexing pipelines, and
(b) decide the next version number this applies to.

seanmacavaney · 2024-09-17T20:32:16Z

I tried and it was the same. Looks like pt.apply.[x] still uses DataFrames.

I mocked up a quick implementation below to see the potential gains, and it looks like we could see a ~2.5x speedup (98s for msmarco-document down to 37s).

import pyterrier as pt

class NullIndexer(pt.Indexer):
  def index(self, it):
    for _ in it:
      pass

class NewGenericApply(pt.Transformer):
  def __init__(self, col, fn):
    self.col = col
    self.fn = fn
  def transform_iter(self, inp):
    for rec in inp:
      yield dict(rec, **{self.col: self.fn(rec)})

dataset = pt.get_dataset('irds:msmarco-document')

# iterator
pipeline = NewGenericApply('text', lambda x: '{title}\n{body}'.format(**x)) >> NullIndexer()
pipeline.index(dataset.get_corpus_iter()) # 37s (86345.29docs/s)

# dataframe
pipeline = pt.apply.text(lambda x: '{title}\n{body}'.format(**x)) >> NullIndexer()
pipeline.index(dataset.get_corpus_iter()) # 98s (32509docs/s)

(There's a tiny further boost, from 86345.29 docs/s to 87789 docs/s, or 37s to 36s, if we modify the existing dicts and yield them instead of building new ones. But that makes me a little uncomfortable; I prefer pure functions :-))
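To spell out the trade-off between the two variants, here is a small sketch using plain generator functions; pure_apply and in_place_apply are hypothetical names, not part of the PR. The pure version leaves the caller's dicts untouched, while the in-place version hands back the very same objects with an extra key.

```python
def pure_apply(rows, col, fn):
    for rec in rows:
        # New dict per record; the input record is left as it was.
        yield dict(rec, **{col: fn(rec)})


def in_place_apply(rows, col, fn):
    for rec in rows:
        # Saves an allocation, but the caller's record is modified.
        rec[col] = fn(rec)
        yield rec


def join_fields(rec):
    return "{title}\n{body}".format(**rec)


docs = [{"title": "t", "body": "b"}]

pure_out = list(pure_apply(docs, "text", join_fields))
untouched_after_pure = "text" not in docs[0]

in_place_out = list(in_place_apply(docs, "text", join_fields))
mutated_after_in_place = "text" in docs[0]
```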

cmacdonald · 2024-09-18T13:57:43Z

Merged, thanks for your work on this too Sean :-)

seanmacavaney · 2024-09-18T15:43:38Z

No problem, awesome enhancement!

cmacdonald added 4 commits September 4, 2024 17:30

deprecate caching

ceb9369

Revert "deprecate caching"

2c8552a

This reverts commit ceb9369.

wip - transform_iter return changes

304943c

correct dataframe method name

be5eb5f

seanmacavaney reviewed Sep 4, 2024

pyterrier/transformer.py (two review threads, outdated and resolved)

cmacdonald added 3 commits September 5, 2024 10:08

use __new__ for checking impls

6f788d0

spacing fix

cd0c6e2

improved testing for new API

cbda858

cmacdonald added 9 commits September 5, 2024 12:43

todo comment on types

ac24c9d

added apply impl that work for transform_iter

60202ee

documentation fix

f02525d

documentation and type improvements

1376fe5

Merge branch 'master' into transform_iter

31db9b0

changed Transformer API - transform_iter returns an Iterator; __call_…

e92856b

…_ returns Iterable apply.generic(iter) API too

pt.apply.generic now fixed

16aafcc

added iter support for by_query as well

ac5a2b1

documentation fixes

cd119fd

cmacdonald changed the title ~~Transform iter - WIP~~ API change: Transformer.transform_iter() returns an iter-dict generator Sep 14, 2024

doc improvements

9334837

cmacdonald requested a review from seanmacavaney September 14, 2024 15:37

cmacdonald added 4 commits September 14, 2024 18:16

eliminate mypy warning

7a23903

check that returning lists still works

0617801

revert API s.t. transform_iter returns an Iterable (which can be a ge…

9c83c0e

…nerator)

revise return types

c21226e

seanmacavaney reviewed Sep 14, 2024

doc clarification

5445c91

__call__ input/output, type annotations, documentation

8bb25da

seanmacavaney added 3 commits September 17, 2024 18:02

fix documentation generation

8fd74d0

alternative way to handle indexer implementations of transform/transf…

087d342

…orm_iter

remove temporary comment

1c59bc7

weird

d76744d

cmacdonald and others added 10 commits September 17, 2024 21:49

use new typevars

5d16f7e

ApplyByRowTransformer and DropColumnTransformer

5a087ac

Merge remote-tracking branch 'origin/transform_iter' into transform_iter

94fb3d8

missing kwargs

5cca486

cleanup apply transformers

58175a4

importing Transformer from pyterrier.apply_base? weird

cff0601

push_queries_dict

f2e837c

fix tests

9133a9a

type fix

f544d2f

type fixes

ae9bc59

cmacdonald merged commit f8e47fb into master Sep 18, 2024
21 checks passed

cmacdonald deleted the transform_iter branch September 18, 2024 13:57

cmacdonald mentioned this pull request Sep 19, 2024

two versions of transform #232

Closed
