Train and embed with pretokenized input? #2935
Hi, I think this would do the trick:

Unfortunately the … Hope this helps.
@ir2718
I wasn't getting that error, but I'm glad my comment helped in solving your problem.
Hi, I'm new to NLP, and I am currently trying to fine-tune jina for text similarity comparison.

I construct a dataset with columns `sentence1`, `sentence2`, and `score`, and I can easily train the model with `SentenceTransformerTrainer` and `CoSENTLoss`. But I found the tokenizing process time-consuming, as there are many duplicate sentences in the pairs. For example, for the two pairs [A, a, 1.0] and [A, B, 0.0], my code needs to tokenize `A` twice.

I've found the following code for embedding with tokenized inputs:

But I am still not sure how to train with pre-tokenized inputs. I am not sure if this is the best place to ask my question; any help would be appreciated!
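Since the bottleneck described above is tokenizing the same sentence more than once, one general workaround is to cache tokenization per unique sentence. A minimal sketch of that idea, where `make_cached_tokenizer` and the toy tokenizer are hypothetical stand-ins (not sentence-transformers API; in practice `tokenize_fn` would wrap your model's tokenizer call):

```python
def make_cached_tokenizer(tokenize_fn):
    """Wrap a per-sentence tokenizer so duplicate sentences hit a cache."""
    cache = {}

    def cached(sentence):
        # Tokenize each unique sentence exactly once.
        if sentence not in cache:
            cache[sentence] = tokenize_fn(sentence)
        return cache[sentence]

    return cached

# Toy tokenizer that records how often it is actually called.
calls = []

def toy_tokenize(s):
    calls.append(s)
    return s.split()  # stand-in for real token ids

tok = make_cached_tokenizer(toy_tokenize)

# "A" appears in both pairs but is only tokenized once.
pairs = [("A", "a"), ("A", "B")]
features = [(tok(s1), tok(s2)) for s1, s2 in pairs]
# toy_tokenize ran once per unique sentence: 3 calls, not 4
```

The same wrapper can sit in front of a dataset-preprocessing step, so repeated sentences across pairs reuse the cached features instead of being re-tokenized.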