API change: Transformer.transform_iter() returns an iter-dict generator #481
I didn't know about
class Transformer:
    def __init_subclass__(cls, **kwargs):
        if cls.transform == Transformer.transform and cls.transform_iter == Transformer.transform_iter:
            raise NotImplementedError("You need to implement either .transform() or .transform_iter() in %s" % str(cls))
and
class Indexer(Transformer):
    def index(self, iter : Iterable[dict], **kwargs):
        pass
    def transform(self, df):
        raise TypeError("Its not expected that .transform() is called on an Indexer")
        # TODO - what about classes that are both Indexers and Transformers?
    def transform_iter(self, df):
        raise TypeError("Its not expected that .transform_iter() is called on an Indexer")
        # TODO - what about classes that are both Indexers and Transformers?
This has two problems:
My code is in https://github.com/terrier-org/pyterrier/tree/transform_iter_subclass |
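The subclass check discussed above can be tried in isolation. The following is a minimal sketch using a stand-in `Transformer` class, not the real PyTerrier one; `AddLength` and `Broken` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator


class Transformer:
    """Stand-in base class for illustration; not the real pyterrier.Transformer."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Functions looked up on a class are plain function objects, so an
        # identity check tells us whether the subclass overrode each method.
        if (cls.transform is Transformer.transform
                and cls.transform_iter is Transformer.transform_iter):
            raise NotImplementedError(
                "You need to implement either .transform() or .transform_iter() in %s"
                % cls.__name__)

    def transform(self, inp):
        raise NotImplementedError

    def transform_iter(self, inp: Iterable[dict]) -> Iterator[dict]:
        raise NotImplementedError


class AddLength(Transformer):
    # Hypothetical subclass: overrides transform_iter only, which passes the check.
    def transform_iter(self, inp):
        for rec in inp:
            yield dict(rec, length=len(rec["text"]))


try:
    class Broken(Transformer):  # hypothetical subclass that overrides neither method
        pass
    definition_failed = False
except NotImplementedError:
    # The error is raised while the class statement runs, not on first use.
    definition_failed = True
```

Because the check runs once per class definition, it adds no per-invocation cost.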
For (1) (2) isn't so common, I suppose it could just be up to the implementer to make sure this is handled? Or, as a more major change, maybe it doesn't make very much sense to have indexer inherit from transformer? Some of the transformer functionality applies, but actually not that much of it. They should be able to be included in a |
They override
I've excluded Indexer from the check. However, it won't fail gracefully.
I'm amenable to exploring this, but not sure it's relevant to this PR. In essence, Indexer is a consumer, not a pipeline component. We should check standard patterns to see what's common. |
I can confirm that the following works - apply
>>> r = pt.terrier.Retriever.from_dataset('vaswani', 'terrier_stemmed')
>>> (r % 1)([{'qid': 'q1', 'query': 'chemical reactions'}])
[{'qid': 'q1', 'docid': 9373, 'docno': '9374', 'rank': 0, 'score': 10.619215559726808, 'query': 'chemical reactions'}] |
…_ returns Iterable apply.generic(iter) API too
I think this one is ready for re-review Sean. Some notes for your consideration:
|
Question: can we somehow know how many times DataFrames are constructed, and check that this makes an indexing pipeline faster? |
I think the API change may be backwards compatible, at least in some cases. If .transform_iter() returns a DataFrame rather than an Iterable, it might work fine, at least in cases where the output of transform_iter() is used to construct a DataFrame (e.g. If we really care about this, we could add some unit tests. |
I think you're right, and in the cases where it doesn't work, I'm happy with having it as a breaking change. |
Perhaps we could patch |
At least an indexing pipeline should now be much simpler:
pipeline = pt.apply.text(lambda row: row["title"] + " " + row["text"])
pipeline.index(some_corpus_with_title_and_body) |
Alright, @cmacdonald can you take a look at my change? With another look, I'm not so jazzed that somebody who accidentally uses an |
Alright, I think I sorted that out now too, with test cases to verify. |
Thanks Sean, this is looking really good now. I think we should: |
I tried and it was the same. Looks like I mocked up a quick implementation below to see the potential gains, and it looks like we could see a ~2.5x speedup (98s for msmarco-document down to 37s).
import pyterrier as pt

class NullIndexer(pt.Indexer):
    def index(self, it):
        for _ in it:
            pass

class NewGenericApply(pt.Transformer):
    def __init__(self, col, fn):
        self.col = col
        self.fn = fn
    def transform_iter(self, inp):
        for rec in inp:
            yield dict(rec, **{self.col: self.fn(rec)})

dataset = pt.get_dataset('irds:msmarco-document')
# iterator
pipeline = NewGenericApply('text', lambda x: '{title}\n{body}'.format(**x)) >> NullIndexer()
pipeline.index(dataset.get_corpus_iter()) # 37s (86345.29docs/s)
# dataframe
pipeline = pt.apply.text(lambda x: '{title}\n{body}'.format(**x)) >> NullIndexer()
pipeline.index(dataset.get_corpus_iter()) # 98s (32509docs/s)
(There's a tiny boost, from 86345.29docs/s to 87789docs/s, or 37s to 36s, if we modify the existing dicts and yield them instead of building a new one. But that makes me a little uncomfortable, I prefer pure functions :-)) |
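To make the trade-off in that last remark concrete, here is a small sketch contrasting the two styles on an in-memory list. The helpers `apply_pure` and `apply_in_place` are hypothetical and use only the standard library, not PyTerrier. It illustrates the behavioural difference only and says nothing about timing.

```python
def apply_pure(col, fn, records):
    # Builds a new dict per record; the caller's dicts are left untouched.
    for rec in records:
        yield dict(rec, **{col: fn(rec)})


def apply_in_place(col, fn, records):
    # Mutates and re-yields the caller's dicts; avoids one allocation per record.
    for rec in records:
        rec[col] = fn(rec)
        yield rec


def join(rec):
    return "{title}\n{body}".format(**rec)


corpus_a = [{"title": "T", "body": "B"}]
corpus_b = [{"title": "T", "body": "B"}]

pure_out = list(apply_pure("text", join, corpus_a))
in_place_out = list(apply_in_place("text", join, corpus_b))
```

Both produce the same output records, but only the in-place variant leaves a `text` key behind in the input dicts, which matters if the same corpus list is reused by another pipeline.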
Merged, thanks for your work on this too Sean :-) |
No problem, awesome enhancement! |
Problem
Currently .transform_iter() returns a DataFrame. As a result, ComposedPipeline .index() calls .transform_iter() and has to undo the DataFrame for the next invocation. The disadvantages are:
Proposal
.transform_iter() takes and returns an iterable of dicts (which can be a generator), rather than a DataFrame.
__call__() takes and returns an iter-dict (instantiated as a list), OR takes and returns a DataFrame.
.transform() OR .transform_iter().
.__call__() can detect which one is implemented, redirect appropriately.
.transform() - default implementation is to call .transform_iter() (and vice versa)
Prerequisites
Transformer.__init__() constructor, rather than once per-invocation. https://stackoverflow.com/questions/9436681/how-to-detect-method-overloading-in-subclasses-in-python looks like a possible solution.
Potential for breaking changes
We only told people to implement .transform() in the past, not .transform_iter().
However, I have found some repositories implementing .transform_iter(). See https://github.com/search?q=pyterrier+transform_iter&type=code, specifically https://github.com/argonism/OPEX
Is this change enough to make this a major version change?
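The iter-dict part of the proposal can be sketched without PyTerrier or pandas. Everything below is a hypothetical stand-in (`IterTransformer`, `Apply`, `ListIndexer`, `index_through`), meant only to show generators being chained from stage to stage so that no intermediate DataFrame is built. The DataFrame branch of `__call__()` is deliberately left out.

```python
from typing import Callable, Iterable, Iterator, List


class IterTransformer:
    """Hypothetical stand-in for a transformer under the proposed API."""

    def transform_iter(self, inp: Iterable[dict]) -> Iterator[dict]:
        raise NotImplementedError

    def __call__(self, inp: Iterable[dict]) -> List[dict]:
        # Only the iter-dict branch is sketched: the generator is
        # materialised as a list. The DataFrame branch is omitted.
        return list(self.transform_iter(inp))


class Apply(IterTransformer):
    def __init__(self, col: str, fn: Callable[[dict], object]):
        self.col = col
        self.fn = fn

    def transform_iter(self, inp):
        for rec in inp:
            yield dict(rec, **{self.col: self.fn(rec)})


class ListIndexer:
    """Toy consumer that records the text of every record it receives."""

    def __init__(self):
        self.texts = []

    def index(self, it: Iterable[dict]) -> int:
        for rec in it:
            self.texts.append(rec["text"])
        return len(self.texts)


def index_through(stages, indexer, corpus):
    # Chain the generators: each stage wraps the previous stream lazily,
    # so records flow one at a time into the indexer.
    stream = corpus
    for stage in stages:
        stream = stage.transform_iter(stream)
    return indexer.index(stream)


corpus = [
    {"docno": "d1", "title": "T1", "body": "B1"},
    {"docno": "d2", "title": "T2", "body": "B2"},
]
stages = [
    Apply("text", lambda r: r["title"] + " " + r["body"]),
    Apply("text", lambda r: r["text"].lower()),
]
indexer = ListIndexer()
n_indexed = index_through(stages, indexer, iter(corpus))
```

In this sketch, nothing is computed until the indexer starts iterating, which is the property the proposal relies on to avoid rebuilding a DataFrame between stages.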