
French to English translation task notebook #12

Open

Gaurav7888 opened this issue Mar 24, 2023 · 3 comments

@Gaurav7888
https://colab.research.google.com/drive/14KegLD0ymq4vTRzCjUvP77w9l-IGCsnj?usp=sharing

@mlevans @tejasvicsr1

@ShubhamBhut

Hello! It appears that when the model is used to make predictions on external input, the tokenization differs from the one applied to the original input data, so the model cannot predict the correct output for input that does not come from the dataset. Even when the external text appears verbatim in the dataset, the prediction is still wrong.

Below is the code I used for an external prediction; the preprocessing is the same as for the input dataset tmp_x. I still got a wrong prediction, even though "bonjour" clearly appears multiple times in the training data.

text = ["Bonjour", "mon cheri"]
text[0]
preprocess_x, tk_x = tokenize(text)
preprocess_x[0]
tmp_text = pad(preprocess_x, preproc_french_sentences.shape[1])
tmp_text = tmp_text.reshape((-1, preproc_french_sentences.shape[-2])) 

logits_to_text(loaded_model.predict(tmp_text[[1]])[0], english_tokenizer)
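For comparison, here is a minimal sketch of what I would expect to work: reuse the tokenizer that was fitted on the training corpus instead of fitting a new one on the external text. `french_tokenizer` is a placeholder name for whatever the notebook calls the tokenizer behind preproc_french_sentences:

from keras.preprocessing.sequence import pad_sequences

text = ["Bonjour", "mon cheri"]

# Reuse the training-time word index instead of fitting a new Tokenizer,
# so "bonjour" maps to the same id it had during training
# (words the training tokenizer has never seen are simply dropped)
seqs = french_tokenizer.texts_to_sequences(text)

tmp_text = pad_sequences(seqs, maxlen=preproc_french_sentences.shape[1],
                         padding='post')
tmp_text = tmp_text.reshape((-1, preproc_french_sentences.shape[-2]))

# "Bonjour" is index 0 in this batch
logits_to_text(loaded_model.predict(tmp_text[[0]])[0], english_tokenizer)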

@Gaurav7888 (Author)

Yes, for that we could use a tokenizer from a pretrained LLM on Hugging Face, which should give better results.
I tried to build my own using the dataset.
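A rough sketch of that idea (the notebook's model would need to be trained on the same vocabulary for this to plug in directly, so treat it as an assumption rather than a drop-in fix):

from transformers import AutoTokenizer

# Tokenizer shipped with a pretrained French-to-English model on the Hub
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")

# Consistent ids for any input text, no refitting needed
batch = tokenizer(["Bonjour", "mon cheri"], padding=True, return_tensors="np")
print(batch["input_ids"])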

@Tirth678 commented Jul 21, 2024

Hello, I hope your day is going great.
This is my first contribution, so apologies for any mistakes.

  1. You can try redefining 'tokenize' and 'pad' as:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def tokenize(sentences):
    # Fit a new Tokenizer on the given sentences and return the sequences
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    return tokenizer.texts_to_sequences(sentences), tokenizer

def pad(sequences, length):
    # Pad (or truncate) every sequence to the same length, padding at the end
    return pad_sequences(sequences, maxlen=length, padding='post')

  2. If this goes well, you can also try predicting on the whole reshaped input:

predictions = loaded_model.predict(tmp_text)
print("Predictions:", predictions)

I hope it serves you well.
