hey folks in the adapter community,
I am looking for a way to run inference for multiple NLP services (e.g. NER, POS tagging, QA, summarization, chatbot) on a single given base model like LLaMA-2, where concurrent users call different services but all adapters share the same base model.
I am hoping such an implementation would reduce GPU utilization significantly, since instead of running several fully fine-tuned models in parallel, one base model with multiple adapters could be used.
If anyone has experience with such a setup, please reach out; any help would be highly appreciated.
I'm not sure if I understand you correctly, but this sounds like a case for the Parallel composition block.
This block can be used to load and use multiple adapters in parallel, each with its own prediction head.
I took the example from the docs:
```python
from adapters import AutoAdapterModel
from transformers import AutoTokenizer
import adapters.composition as ac

model = AutoAdapterModel.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# load two adapters, each with its own prediction head
adapter1 = model.load_adapter("sts/sts-b@ukp")
adapter2 = model.load_adapter("sts/mrpc@ukp")

# activate both adapters in a Parallel composition block
model.active_adapters = ac.Parallel(adapter1, adapter2)

input_ids = tokenizer("Adapters are great!", "Adapters are awesome!", return_tensors="pt")
# the forward pass returns one output per parallel adapter
output1, output2 = model(**input_ids)
```
(Short warning: the example already uses the imports from the adapters package, the new successor of adapter-transformers; you can find more information on that in #584.)
Each adapter generates its own output. Depending on which service is required, you can then pick out the corresponding output and process it further, e.g. with a small routing layer as sketched below.
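A minimal sketch of such a routing layer, continuing from the snippet above (the `services` mapping and the `run_service` helper are illustrative names I made up, not part of the adapters API):

```python
# map each service name to its position in the Parallel block above;
# "services" and "run_service" are hypothetical names, not library API
services = {"sts-b": 0, "mrpc": 1}

def run_service(service_name, text_a, text_b):
    inputs = tokenizer(text_a, text_b, return_tensors="pt")
    per_adapter = list(model(**inputs))  # one output per parallel adapter
    return per_adapter[services[service_name]]

mrpc_output = run_service("mrpc", "Adapters are great!", "Adapters are awesome!")
```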
The section in the docs can be found here if you want some more information about possible composition blocks.
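For the LLaMA-2 scenario from the original post, the same pattern should carry over, since the adapters library also supports Llama models. A minimal sketch, assuming you have already trained one adapter per service for the base model (the adapter paths below are placeholders, not real Hub entries):

```python
from adapters import AutoAdapterModel
from transformers import AutoTokenizer
import adapters.composition as ac

# one shared base model, loaded onto the GPU once
model = AutoAdapterModel.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# placeholder paths: point these at your own trained task adapters
ner = model.load_adapter("path/to/ner_adapter")
qa = model.load_adapter("path/to/qa_adapter")
summ = model.load_adapter("path/to/summarization_adapter")

# all three adapters share the frozen base model weights
model.active_adapters = ac.Parallel(ner, qa, summ)

inputs = tokenizer("Some user input", return_tensors="pt")
ner_out, qa_out, summ_out = model(**inputs)  # one output per adapter
```

Note that Parallel replicates the input for every branch, so whether you run all services on every request or only activate the adapter a given user needs is a serving decision layered on top of this.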