
minimal-trainer-zoo #62

Open · 2 tasks done
manisnesan opened this issue Jan 2, 2024 · 8 comments

Comments


manisnesan commented Jan 2, 2024

https://github.com/muellerzr/minimal-trainer-zoo

Checklist

Using data collators for training & error analysis

  • Data collators are used for training and error analysis of NLP tasks: they pad the model inputs in each batch to the length of the longest example in that batch (dynamic padding), which speeds up training and avoids redundant computation on padding tokens and attention masks.
  • Using data collators for error analysis involves inspecting the model's top losses.
    • Why: to identify potential training data errors and understand the model's performance.
    • By computing per-example losses and sorting them, practitioners can see the specific examples the model struggles with most.
    • This helps surface data quality issues, model weaknesses, and areas for improvement.
    • The top losses give error analysis a focus, pointing at the specific training examples or aspects of model behavior worth investigating (a sketch follows this list).
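
A minimal sketch of both ideas, assuming an already tokenized classification dataset; the tokenized_ds, tokenizer, and model names are placeholders, and the loss/metric choices are assumptions rather than anything from the repo:

import torch
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# Dynamic padding: each batch is padded only to its own longest example.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=16),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Error analysis: compute per-example losses on the eval set and sort them.
preds = trainer.predict(tokenized_ds["test"])
logits = torch.tensor(preds.predictions)
labels = torch.tensor(preds.label_ids)
per_example_loss = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
top_loss_idx = per_example_loss.argsort(descending=True)[:10]  # 10 worst examples to inspect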
manisnesan changed the title from minimal- model-zoo to minimal-trainer-zoo on Jan 2, 2024
@manisnesan

I am trying to get started with summarization, question answering, and multilingual sequence classification tasks. The next steps are to work through the minimal-trainer-zoo examples and the HF book code examples.

@manisnesan

Example Sequence Classification Notebook using imdb-movie-review - Colab


manisnesan commented Jan 3, 2024

Follow-up: next steps on Sequence Classification (L2)

@manisnesan

User

Explain the following piece of Python code from the transformers library. Specifically, explain the axis parameter in the line np.argmax(logits, axis=-1) with an example.

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) tuple produced by the Trainer during evaluation
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class index per example
    return metric.compute(predictions=predictions, references=labels)

Bing

The provided Python code is a function named compute_metrics that takes in a tuple eval_pred containing two elements: logits and labels. This function is typically used in the context of machine learning, where logits are the model's predictions and labels are the actual values or targets.
The line predictions = np.argmax(logits, axis=-1) is using the argmax function from the numpy library to find the indices of the maximum values along the last axis (since axis=-1) of the logits array.
The axis parameter in np.argmax specifies the axis of the input array along which to find the indices of the maximum values. If axis is not specified, the function returns the index of the maximum value in the flattened array.
For example, consider a 2D array (matrix) a:

a = np.array([[10, 11, 12], [13, 14, 15]])

If we call np.argmax(a, axis=0), it will return the indices of the maximum values along the columns (axis 0), which is [1, 1, 1]. This means the maximum value in each column is found in the second row (index 1).
If we call np.argmax(a, axis=1), it will return the indices of the maximum values along the rows (axis 1), which is [2, 2]. This means the maximum value in each row is found in the third column (index 2).
In your code, np.argmax(logits, axis=-1) is finding the indices of the maximum values along the last axis of the logits array. This is commonly used in multi-class classification problems, where logits often has shape (batch_size, num_classes), and you want to select the class with the highest predicted value for each instance in the batch.
Finally, these indices (class labels with the highest predicted values) are used to compute metrics with the actual labels. The specific metrics computed would depend on the metric object. It could be accuracy, F1 score, precision, recall, etc., depending on the problem at hand.
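
A small runnable sketch tying this together, assuming the accuracy metric from the evaluate library (the metric choice and the toy values are assumptions, not from the original code):

import numpy as np
import evaluate

metric = evaluate.load("accuracy")  # assumed metric; could be f1, precision, etc.

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # index of the largest logit per row
    return metric.compute(predictions=predictions, references=labels)

# Toy logits for a batch of 3 examples and 2 classes.
logits = np.array([[0.1, 2.3], [1.5, 0.2], [0.4, 0.9]])
labels = np.array([1, 0, 0])
print(np.argmax(logits, axis=-1))          # [1 0 1]
print(compute_metrics((logits, labels)))   # {'accuracy': 0.666...}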


manisnesan commented Jan 5, 2024

Starting with the summarization task.

  • Sequence-to-sequence task (using a T5 model)
  • The billsum dataset is used
  • Preprocess function: the prompt "summarize:" is prepended to each input (see the sketch after this list)
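
A minimal sketch of that preprocessing step, assuming the billsum columns text and summary, a t5-small checkpoint, and assumed max_length values:

from datasets import load_dataset
from transformers import AutoTokenizer

billsum = load_dataset("billsum", split="ca_test")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

prefix = "summarize: "

def preprocess_function(examples):
    # Prepend the task prompt so T5 knows which task to perform.
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    # Tokenize the target summaries as labels.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_billsum = billsum.map(preprocess_function, batched=True)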

Questions

  • How can we format the output of the summary?
  • How should we think about max_length & truncation while tokenizing?

Very informative log messages from the transformers library

When I did not provide max_length during tokenization:
Token indices sequence length is longer than the specified maximum sequence length for this model (1694 > 512). Running this sequence through the model will result in indexing errors

Similarly, I got the message below when I only provided max_length without truncation=True:

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
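
A small sketch of a call that avoids both messages, assuming a t5-small tokenizer and an assumed max_length of 512 (the right value is model-specific):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
long_document = "A very long bill text. " * 400  # placeholder text long enough to exceed max_length

# Passing both max_length and truncation=True silences both messages:
# sequences longer than max_length are explicitly truncated instead of
# silently exceeding the model's maximum input length.
model_inputs = tokenizer("summarize: " + long_document, max_length=512, truncation=True)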

@manisnesan

User

Explain the warning note from the transformers library when calling the trainer.train method.

The following code snippet shows how the tokenizer is currently used in my code:

from transformers import AutoTokenizer

# model_name and inputs are defined earlier in the notebook (e.g. model_name = "t5-small")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

Warning Note:
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

Bing

The warning you're seeing is related to the efficiency of using the __call__ method of the T5TokenizerFast tokenizer.

In the Hugging Face Transformers library, the __call__ method of a tokenizer performs both tokenization and padding in a single step, which is more efficient than performing these two operations separately.

In your code, you're currently using the tokenizer to encode the inputs and then separately handling the padding. The warning is suggesting that you can do both in one step by using the __call__ method, which would be faster.

Here's how you might adjust your code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_inputs = tokenizer(inputs, padding='longest', max_length=1024, truncation=True)

In this adjusted code, the padding='longest' argument in the __call__ method ensures that all the encoded inputs in a batch are padded to the length of the longest sequence. This should help to address the warning you're seeing and make your code more efficient.

Please note that the padding strategy ('longest' in this case) can be adjusted based on your specific use case. For example, you can set padding='max_length' to pad all sequences up to max_length, or padding=False (the default, also spelled 'do_not_pad') to skip padding at tokenization time and handle it separately afterwards, for example with a data collator.
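
As an aside, when training with the Trainer a common alternative is to tokenize with truncation only and let a data collator pad each batch dynamically; a sketch assuming a t5-small checkpoint:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Truncate at tokenization time; defer padding to the collator.
model_inputs = tokenizer(["summarize: some text"], max_length=1024, truncation=True)

# Pads inputs (and labels, when present) per batch to the longest example in that batch.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)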


I asked Bing to respond in the format below:

User

Question that I posed

Bing

Your response

@manisnesan

User

Explain the error message from the transformers library during the model.generate step of the summarization task.

Code

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')
encoded_input = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(encoded_input, max_new_tokens=100, do_sample=False)

Error Message
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

Bing

The error message you're seeing is related to the device allocation of your model and data. In PyTorch, both your model and the data you're using for training or inference need to be on the same device, either CPU or GPU (CUDA).

In your code, it seems like your model and data are on different devices, which is causing the RuntimeError. This typically happens when your model is on one device (e.g., CPU) and your data is on another device (e.g., GPU/CUDA), or vice versa.

To resolve this issue, you need to ensure that both your model and data are on the same device. You can do this by using the to(device) method in PyTorch, which moves the model or tensor to the specified device.

Here's how you might adjust your code:

import torch
from transformers import AutoModelForSeq2SeqLM

# Specify the device (use 'cuda' if a GPU is available, else 'cpu')
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')
model = model.to(device)  # Move the model to the device

encoded_input = tokenizer(text, return_tensors="pt").input_ids
encoded_input = encoded_input.to(device)  # Move the encoded input to the same device

outputs = model.generate(encoded_input, max_new_tokens=100, do_sample=False)

In this adjusted code, both the model and the encoded_input are moved to the same device before the model.generate step¹²³. This should help to address the error you're seeing. Please replace 'cuda' with 'cpu' if your machine does not have a CUDA-enabled GPU¹²³.

Source: Conversation with Bing, 1/5/2024
(1) Pytorch : Expected all tensors on same device - Stack Overflow. https://stackoverflow.com/questions/71278607/pytorch-expected-all-tensors-on-same-device.
(2) RuntimeError: Expected all tensors to be on the same device, but found .... https://stackoverflow.com/questions/70102323/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least.
(3) RuntimeError: Expected all tensors to be on the same device, but found .... https://stackoverflow.com/questions/74514534/runtimeerror-expected-all-tensors-to-be-on-the-same-device-but-found-at-least.
(4) Expected all tensors to be on the same device, but found at least two .... https://github.com/huggingface/transformers/issues/24410.


manisnesan commented Jan 5, 2024

Example Summarization Task using billsum dataset - Colab
