
Create HuggingFaceTransformer.py #35

Open

mobashgr wants to merge 2 commits into main
Conversation

mobashgr commented Feb 7, 2022

Here is the code for adding any HuggingFace Transformer model on top of INCEpTION.
Review comment on ariadne/contrib/HuggingFaceTransformer.py (outdated):
tokenizer = AutoTokenizer.from_pretrained(self._model)
model = AutoModelForTokenClassification.from_pretrained(self._model)
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")
for c, sentence in enumerate(cas.select(SENTENCE_TYPE)):
A member commented on this line:

I can't see `c` being used. If it is not needed, I guess the `enumerate` is not needed either?

mobashgr (Author) replied:

Yes, true. I was using them for other purposes and forgot to remove them.
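With the counter removed, the loop would simply be the following (get_covered_text() is assumed from the dkpro-cassis API that ariadne recommenders build on):

for sentence in cas.select(SENTENCE_TYPE):
    entities = nlp_ner(sentence.get_covered_text())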

jcklie (Contributor) commented Feb 7, 2022

Thank you for the PR! The basics look good to me.

This code still has the issue that it does not use INCEpTION's tokenization, which then requires you to use character-level granularity. That is not so nice later when exporting the corpus and using it downstream. In the other recommenders, we align the recommenders' predictions to the INCEpTION tokenization, which you would need to do here as well before I would merge it, tbh. Examples and hints can be found in the links below; a sketch of the alignment follows them.

huggingface/transformers#14305
https://huggingface.co/docs/transformers/custom_datasets?highlight=offset_mapping#token-classification-with-wnut-emerging-entities
https://discuss.huggingface.co/t/predicting-with-token-classifier-on-data-with-no-gold-labels/9373
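A minimal sketch of such an alignment in plain Python (function and variable names here are illustrative, not from the PR; it assumes the pipeline's character offsets in "start"/"end" and a sorted list of (begin, end) token offsets):

def align_to_tokens(entities, tokens):
    # Snap character-offset entity spans to whole-token boundaries.
    aligned = []
    for ent in entities:
        # Tokens that overlap the predicted character span.
        covered = [(b, e) for (b, e) in tokens if b < ent["end"] and e > ent["start"]]
        if covered:
            # Widen the span to the first and last covering token.
            aligned.append({**ent, "start": covered[0][0], "end": covered[-1][1]})
    return aligned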

I also do not understand why you need pandas here; it is certainly possible to do this without it.
As you only support token classification here, I would also name it TransformerTokenClassifier or so; the current name suggests that it is a general implementation.

The file name also does not fit with the rest of the project; for Python files, we typically use snake case.

It would be nice to have a unit test, even if it just does smoke testing.

mobashgr commented Feb 7, 2022

Regarding the first point: yes, I was facing this problem yesterday and used character-level granularity as suggested by Richard. My problem was resolved, and I don't think I have the time to do this alignment now. I just wanted to share what I have as a solution to a problem I was facing, especially since the Adapter code isn't working and was misleading, TBH. I believe that INCEpTION is a very powerful tool, and it should definitely have examples for HuggingFace classifiers.

For the second point, I need pandas because the output of the pipeline in my case is a list of lists of dictionaries. A sample of the pipeline output looks like this:

[{'entity_group': 'Chemical', 'score': 0.9996301, 'word': 'acety', 'start': 66, 'end': 71}, {'entity_group': 'Chemical', 'score': 0.99999845, 'word': 'nicotine', 'start': 98, 'end': 106}, {'entity_group': 'Chemical', 'score': 0.99911577, 'word': 'la dicine evised', 'start': 122, 'end': 144}, {'entity_group': 'Chemical', 'score': 0.9999038, 'word': 'alpha - only hete', 'start': 308, 'end': 325}]

So I prefer to convert it into a DataFrame.
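For reference, the same output can also be consumed without pandas by iterating the list of dictionaries directly. A minimal sketch using the fields from the sample above (the confidence cutoff is illustrative, not from the PR):

predictions = [
    {"entity_group": "Chemical", "score": 0.9996301, "word": "acety", "start": 66, "end": 71},
    {"entity_group": "Chemical", "score": 0.99999845, "word": "nicotine", "start": 98, "end": 106},
]

for ent in predictions:
    if ent["score"] >= 0.5:  # illustrative threshold
        print(ent["entity_group"], ent["start"], ent["end"])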

reckart (Member) commented Feb 27, 2024

@mobashgr Sorry for getting back to you late. Could you please add the same Apache License header to the file that we use in the other files?

I believe it should not be a big problem if the recommender uses a different tokenization. If the recommender creates a suggestion that does not fit the layer settings in INCEpTION, it will be ignored; it should not cause trouble.

reckart added the ⭐️ Enhancement (New feature or request) label on Feb 27, 2024