Postprocessing hangs with multiple return sequences #18

Open
onadegibert opened this issue Oct 8, 2024 · 1 comment

@onadegibert

Hello,

I am running the IndicTrans2 models from Hugging Face and need to return multiple translations per source sentence (e.g., 8). With the code provided in the repository's README.md, whenever I set the num_return_sequences parameter above 1, the postprocessing step hangs indefinitely without any error message.

Expected behavior: I expect the postprocessing step to handle multiple return sequences and provide the output without hanging.

Actual behavior: When increasing num_return_sequences, the postprocessing step hangs, and there is no further message or output.

Here is the code I’m using (adapted from the README.md):

import torch

from IndicTransToolkit import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import time

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
    "Please send an SMS to 9876543210 and an email on newemail123@xyz.com by 15th October, 2023.",
]

batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva")
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt")

num_return_sequences = 2
print(f"num_return_sequences == {num_return_sequences}")
with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=num_return_sequences, max_length=256)

with tokenizer.as_target_tokenizer():
    # This scoping is absolutely necessary, as it will instruct the tokenizer to tokenize using the target vocabulary.
    # Failure to use this scoping will result in gibberish/unexpected predictions as the output will be de-tokenized with the source vocabulary instead.
    outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

start = time.time()
print("Starting posprocessing")
outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
end = time.time()
total_time = end - start
print(f"Postprocessing took {total_time:.2f} seconds")
print(outputs)
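
For what it's worth, the hang seems to coincide with the decoded list being longer than the batch that was preprocessed; a quick check right after the batch_decode call shows the mismatch (the counts simply follow from generate returning num_return_sequences hypotheses per input):

# Inserted right after the batch_decode call above: with 3 source sentences
# and num_return_sequences = 2, batch_decode yields 3 * 2 = 6 strings, while
# preprocess_batch was called with only 3 sentences.
print(len(sentences), num_return_sequences, len(outputs))  # prints: 3 2 6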

I'm using Python 3.10.12, with the following libraries installed:

  • torch==2.4.1
  • transformers==4.45.2

Thanks!

@VarunGumma
Owner

Hi @onadegibert, thank you for reaching out. The processor is not designed to handle multiple return sequences from generate: as of now, the post-processing assumes a 1-to-1 mapping between inputs and outputs when patching the placeholders. We will try to incorporate this request in the next release. If you have a fix before then, please feel free to open a PR.
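
Until that lands, one possible workaround (a minimal, untested sketch; it assumes postprocess_batch restores placeholders positionally, in the same order the sentences were preprocessed, which I have not verified against the toolkit internals) is to preprocess each sentence once per requested hypothesis, so that the decoded outputs line up 1-to-1 with the preprocessed batch:

import torch

from IndicTransToolkit import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

num_return_sequences = 2
sentences = [
    "This is a test sentence.",
    "This is another longer different test sentence.",
]

ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)

# Preprocess each sentence once per requested hypothesis, so the processor
# keeps one placeholder entry per decoded output (1-to-1 mapping preserved).
repeated = [s for s in sentences for _ in range(num_return_sequences)]
pre = ip.preprocess_batch(repeated, src_lang="eng_Latn", tgt_lang="hin_Deva")

# Tokenize only one copy of each preprocessed sentence for generation.
batch = tokenizer(pre[::num_return_sequences], padding="longest", truncation=True, max_length=256, return_tensors="pt")

with torch.inference_mode():
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=num_return_sequences, max_length=256)

with tokenizer.as_target_tokenizer():
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)

# generate() returns hypotheses grouped per input (all of input 0's sequences
# first, then input 1's, ...), which matches the order of the repeated
# preprocessed list above.
translations = ip.postprocess_batch(decoded, lang="hin_Deva")
print(translations)

Whether the placeholder restoration really is positional is the key assumption here, so treat this as a starting point rather than a drop-in fix.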
