Accessing Document vectors and computing similarity is too slow without batching #18

ATAboukhadra · 2021-07-10T11:07:45Z

Hi,

I use this library and other spaCy models to create Doc objects.
I apply them to a large text corpus with the pipe() method. The main bottleneck is that accessing the vector of each document is too slow.
Is there a way to get only the vector when applying the model to the text, or to extract the vectors in batches?
This came up while computing similarity: I could only call similarity() on one pair of documents at a time, not on batches. Is there a way to compute similarity in batches?
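To make the batched-similarity question concrete, here is a minimal sketch of what it could look like outside of spaCy, assuming the document vectors have already been collected into NumPy arrays (for example by stacking `doc.vector` for every Doc). The function name `batched_cosine_similarity` is hypothetical and not part of this library or of spaCy.

```python
import numpy as np


def batched_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between every row of `a` and every row of `b`.

    Hypothetical helper, not a spaCy or plugin API.
    `a` has shape (n, d), `b` has shape (m, d); the result has shape (n, m).
    """
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    # Normalize each row to unit length; eps guards against all-zero vectors,
    # which end up with similarity 0 to everything.
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    # One matrix product replaces n * m separate similarity() calls.
    return a_unit @ b_unit.T
```

With vectors stacked into a matrix `vectors`, `batched_cosine_similarity(vectors, vectors)` would give all pairwise similarities at once. This only addresses the similarity step; it does not by itself make access to each document's vector faster.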

I'm using a 4-core CPU.

I hope my question is clear.

Thanks.

repodiac · 2021-07-29T13:37:56Z

Hi, IMHO the spaCy way of handling each document as a separate object sort of gets in the way here. I do not recall a way to handle batches of spaCy docs.

I come at this from another direction: I have a huge number of precomputed USE embeddings and would like to feed them into a spaCy pipeline. I think this would also be easier for your case: overwrite or add the USE embedding as a "vector" hook on the respective Doc object.

Of course, it would be great if @ATAboukhadra could assist here and extend his plugin with a convenience method, maybe.
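The "vector" hook idea above could be sketched roughly as follows. This assumes the embeddings were computed elsewhere in one batch and are available as an array with one row per document; `attach_precomputed_vectors` is a hypothetical helper name, not something the plugin provides. It relies only on spaCy's `Doc.user_hooks` dictionary, whose "vector" entry is consulted when `doc.vector` is read.

```python
import numpy as np


def attach_precomputed_vectors(docs, embeddings):
    """Make each doc return its precomputed embedding as `doc.vector`.

    Hypothetical helper: `docs` is an iterable of spaCy Doc objects and
    `embeddings` holds one row per doc, in the same order.
    """
    docs = list(docs)
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if len(docs) != len(embeddings):
        raise ValueError("need exactly one embedding row per doc")
    for doc, row in zip(docs, embeddings):
        # Bind `row` as a default argument so each hook keeps its own vector
        # instead of all hooks sharing the last row of the loop.
        doc.user_hooks["vector"] = lambda _doc, v=row: v
    return docs
```

With spaCy installed, something like `docs = attach_precomputed_vectors(nlp.pipe(texts), use_embeddings)` (where `nlp`, `texts` and `use_embeddings` are placeholders) should leave `docs[i].vector` returning row `i` without calling the encoder again.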

MartinoMensio added the "enhancement" (New feature or request) label · Feb 7, 2023
