Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API change: Transformer.transform_iter() returns an iter-dict generator #481

Merged
merged 41 commits into from
Sep 18, 2024

Conversation

cmacdonald
Copy link
Contributor

@cmacdonald cmacdonald commented Sep 4, 2024

Problem

Currently .transform_iter() returns a DataFrame.

As a results, ComposedPipeline.index() calls .transform_iter() and has to undo the DataFrame for the next invocation.

The disadvantages are:

  • Extra DataFrame constructions and deconstructions
  • No API that users can use and implement that operates entirely in iter-dicts, and therefore avoid DataFrames

Proposal

  1. .transform_iter() takes and returns an iterable of dicts (which can be a generator), rather than a DataFrame
  2. As a consequence of (1), .__call__() takes and returns an iter-dict (instantiated as a list), OR takes and returns a DataFrame.
  3. Implementers need only implement .transform() OR .transform_iter(). .__call__() can detect which one is implemented, redirect appropriately.
  4. Can we do the same thing for .transform() - default implementation is to call .transform_iter() (and vice versa)
  5. Add an easy to remember kwarg for pt.apply to take iter-dicts (iter=True).

Prerequisites

Potential for breaking changes

We only told people to implement .transform() in the past, not .transform_iter()

However, I have found some repositories implementing .transform_iter(). See https://github.com/search?q=pyterrier+transform_iter&type=code, specifically
https://github.com/argonism/OPEX

Is this change enough to make this a major version change?

pyterrier/transformer.py Outdated Show resolved Hide resolved
pyterrier/transformer.py Outdated Show resolved Hide resolved
@cmacdonald
Copy link
Contributor Author

I didnt know about __init_subclass__, and on reading it sounds attractive, so I followed your suggestions:

class Transformer:
    def __init_subclass__(cls, **kwargs):
        if cls.transform == Transformer.transform and cls.transform_iter == Transformer.transform_iter:
            raise NotImplementedError("You need to implement either .transform() or .transform_iter() in %s" % str(cls))

and

class Indexer(Transformer):
    def index(self, iter : Iterable[dict], **kwargs):
         pass

    def transform(self, df):
        raise TypeError("Its not expected that .transform() is called on an Indexer")
        # TODO - what about classes that are both Indexers and Transformers?

    def transform_iter(self, df):
        raise TypeError("Its not expected that .transform_iter() is called on an Indexer")
        # TODO - what about classes that are both Indexers and Transformers?

This has two problems:
(1) pt.TransformerBase throws the error (as will any other "abstract" classes)
(2) We've now overridden transform and transform_iter in our Indexer, which would be a problem for a class that is both an Indexer and a Transformer?

My code is in https://github.com/terrier-org/pyterrier/tree/transform_iter_subclass

@seanmacavaney
Copy link
Collaborator

For (1) ABC handles this somehow. Maybe we dig in a bit to see how it's done there?

(2) isn't so common, I suppose it could just be up to the implementer to make sure this is handled?

Or, as a more major change, maybe it doesn't make very much sense to have indexer inherit from transformer? Some of the transformer functionality applies, but actually not that much of it. They should be able to be included in a ComposedPipeline, but only at the end. Rank cutoff, feature composition, etc. do not apply. Etc.

@cmacdonald
Copy link
Contributor Author

cmacdonald commented Sep 5, 2024

For (1) ABC handles this somehow. Maybe we dig in a bit to see how it's done there?

They override __new__, which is much less common to override.

(2) isn't so common, I suppose it could just be up to the implementer to make sure this is handled?

I've excluded Indexer from the check. However, it wont fail gracefully.

(3) Or, as a more major change, maybe it doesn't make very much sense to have indexer inherit from transformer? Some of the transformer functionality applies, but actually not that much of it. They should be able to be included in a ComposedPipeline, but only at the end. Rank cutoff, feature composition, etc. do not apply. Etc.

I'm amenable to exploring this, but not sure its relevant to this PR. In essence, Indexer is a consumer not a pipeline component. We should check standard patterns to see whats common.

@cmacdonald
Copy link
Contributor Author

I can confirm that the following works - apply __call__ with an iterdict and it will return an iterdict 😀 :

>>> r = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed')
>>> (r % 1) ([{'qid' : 'q1', 'query' : 'chemical reactions'}])
[{'qid': 'q1', 'docid': 9373, 'docno': '9374', 'rank': 0, 'score': 10.619215559726808, 'query': 'chemical reactions'}]

@cmacdonald cmacdonald changed the title Transform iter - WIP API change: Transformer.transform_iter() returns an iter-dict generator Sep 14, 2024
@cmacdonald
Copy link
Contributor Author

cmacdonald commented Sep 14, 2024

I think this one is ready for re-review Sean. Some notes for your consideration:

  1. I've indeed reverted back to .transform_iter() returning Iterable[Dict]. This means it /can/ return a list (because yield from List works as expected), but returning a generator is supported, and indeed I think should encouraged.
  2. I wondered if the return type of __call__ should be List[Dict]?
  3. We may want to consider the default case, specifically whether Transformer.transform_iter() default impl needs to make a list from a DataFrame then return iter(). This is perhaps unnecessary?
  4. pt.apply factories for functions those that take multiple rows can now accept an iter=True kwarg; no change is needed for single-row pt.apply factories. Test cases check that functions can return generators or lists.
  5. pt.model.add_ranks support for iter-dicts is probably needed in the future.
  6. Some typevars would be nice, but that's feature-creep at this stage.

pyterrier/apply_base.py Outdated Show resolved Hide resolved
pyterrier/apply_base.py Outdated Show resolved Hide resolved
pyterrier/ops.py Outdated Show resolved Hide resolved
pyterrier/transformer.py Outdated Show resolved Hide resolved
pyterrier/transformer.py Outdated Show resolved Hide resolved
pyterrier/apply.py Show resolved Hide resolved
@cmacdonald
Copy link
Contributor Author

Question: can we somehow know how many times DataFrames are constructed, and check that this makes an indexing pipeline faster?

@cmacdonald
Copy link
Contributor Author

We only told people to implement .transform() in the past, not .transform_iter()

I think the API change may be backwards compatible, at least in some cases. If .transform_iter() returns a DataFrame rather than an Iterable, it might work fine, at least in cases where the output of transform_iter() is used to construct a DataFrame (e.g. pd.DataFrame(list(self.transform_iter())) c.f. https://github.com/terrier-org/pyterrier/blob/transform_iter/pyterrier/transformer.py#L105))

If we really care about this, we could add some unit tests.

@seanmacavaney
Copy link
Collaborator

I think the API change may be backwards compatible

I think you're right, and in the cases where it doesn't work, I'm happy with having it as a breaking change.

@seanmacavaney
Copy link
Collaborator

Question: can we somehow know how many times DataFrames are constructed, and check that this makes an indexing pipeline faster?

Perhaps we could patch pandas.DataFrame with one that counts the number of invocations? It depends on having transformers that implement transform_iter() though to have a positive effect though, right?

@cmacdonald
Copy link
Contributor Author

Question: can we somehow know how many times DataFrames are constructed, and check that this makes an indexing pipeline faster?

Perhaps we could patch pandas.DataFrame with one that counts the number of invocations? It depends on having transformers that implement transform_iter() though to have a positive effect though, right?

At least an indexing pipeline should now be much simpler:

pipeline = pt.apply.text(lambda row: row["title"] + " " + row["text"] )
pipeline.index(some_corpus_with_title_and_body)

@seanmacavaney
Copy link
Collaborator

Alright, @cmacdonald can you take a look at my change?

With another look, I'm not so jazzed that somebody who accidentally uses an Indexer as a regular transformer will get a stack overflow error. I'll try to address in another commit.

@seanmacavaney
Copy link
Collaborator

Alright, I think I sorted that out now too, with test cases to verify.

@cmacdonald
Copy link
Contributor Author

Thanks Sean, this is looking really good now. I think we should:
(a) do some testing that this indeed improves indexing speed for some indexing pipelines, and
(b) decide the next version number this applies to.

@seanmacavaney
Copy link
Collaborator

seanmacavaney commented Sep 17, 2024

I tried and it was the same. Looks like pt.apply.[x] still uses DataFrames.

I mocked up a quick implementation below to see the potential gains, and it looks like we could see a ~2.5x speedup (98s for msmarco-document down to 37s).

import pyterrier as pt

class NullIndexer(pt.Indexer):
  def index(self, it):
    for _ in it:
      pass

class NewGenericApply(pt.Transformer):
  def __init__(self, col, fn):
    self.col = col
    self.fn = fn
  def transform_iter(self, inp):
    for rec in inp:
      yield dict(rec, **{self.col: self.fn(rec)})

dataset = pt.get_dataset('irds:msmarco-document')

# iterator
pipeline = NewGenericApply('text', lambda x: '{title}\n{body}'.format(**x)) >> NullIndexer()
pipeline.index(dataset.get_corpus_iter()) # 37s (86345.29docs/s)

# dataframe
pipeline = pt.apply.text(lambda x: '{title}\n{body}'.format(**x)) >> NullIndexer()
pipeline.index(dataset.get_corpus_iter()) # 98s (32509docs/s)

(There's a tiny boost (86345.29docs/s) to 87789d/s, or 37s to 36s) if we modify the existing dicts and yield them instead of building a new one. But that makes me a little uncomfortable, I prefer pure functions :-))

@cmacdonald cmacdonald merged commit f8e47fb into master Sep 18, 2024
21 checks passed
@cmacdonald cmacdonald deleted the transform_iter branch September 18, 2024 13:57
@cmacdonald
Copy link
Contributor Author

Merged, thanks for your work on this too Sean :-)

@seanmacavaney
Copy link
Collaborator

No problem, awesome enhancement!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants