❓ Questions and Help

What is your question?

For some English words, the model adds a "." to the end of the translation; for example: "ok". See the code below, which produces the following output:
Translating: 'ok'
Translation from eng_Latn to heb_Hebr: 'בסדר, בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '괜찮아요'
Translating: 'Ok'
Translation from eng_Latn to heb_Hebr: 'בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо.'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '좋아'
Translating: 'OK'
Translation from eng_Latn to heb_Hebr: 'בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо.'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '괜찮아요'
Code
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(from_lang: str, to_lang: str, text: str):
    translator = pipeline(
        task='translation',
        model=model,
        tokenizer=tokenizer,
        src_lang=from_lang,
        tgt_lang=to_lang,
        max_length=400,
    )
    output = translator(text)
    translated_text = output[0]['translation_text']
    print(f"Translation from {from_lang} to {to_lang}: '{translated_text}'")

text_to_translates = [
    "ok",
    "Ok",
    "OK",
]

to_langs = [
    "heb_Hebr",
    "rus_Cyrl",
    "fra_Latn",
    "kor_Hang",
]

for text in text_to_translates:
    print(f"Translating: '{text}'")
    for lang in to_langs:
        translate(from_lang="eng_Latn", to_lang=lang, text=text)
    print("\n\n")
What have you tried?
Tried adding an extra space; sometimes it helps, but most often it does not.
Tried different combinations of the following generation hyperparameters; none of them made any difference (a sketch of how they map onto the generation call follows this list):
max_length: This parameter controls the maximum length of the input sequence to the model. If the input sequence is longer than max_length, it will be truncated. Increasing this might prevent important context from being lost but could increase computation time.
num_beams: This parameter is used in beam search, which is a strategy for generating text where multiple translation paths are considered at each step. Increasing the number of beams can potentially improve the quality of the output at the cost of more computation.
temperature: This parameter controls randomness in the output generation. Lower temperatures make the model outputs more deterministic and conservative, while higher temperatures encourage more diversity but can also introduce more mistakes.
top_k: This parameter is used with sampling strategies, limiting the number of highest probability vocabulary tokens to be considered for each step. A lower top_k reduces randomness.
top_p (nucleus sampling): This sampling strategy involves selecting the smallest set of tokens whose cumulative probability exceeds the probability p. The model will then only consider this set of tokens for generating the next word. This can lead to more fluent and coherent text generation.
repetition_penalty: This parameter discourages the model from repeating the same line verbatim. Adjusting this can help in reducing redundancies in the translations.
length_penalty: Adjusts the length of the generated output. Setting this parameter can help if the model consistently generates too short or too long outputs.
no_repeat_ngram_size: This parameter prevents the repetition of n-grams. This can be useful to avoid repeated phrases or sentences, which is a common issue in generated text.
early_stopping: If set to True, generation will stop if all beam candidates reach the EOS token (end of sequence). This can save computation time without affecting the output quality significantly.
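For reference, here is how those knobs map onto the generation call, reusing model, tokenizer, and inputs from the sketch above. The values are illustrative only; note that temperature, top_k, and top_p only take effect when sampling is enabled.

# Beam-search-style knobs (illustrative values, not a recommendation):
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("heb_Hebr"),
    max_length=400,
    num_beams=5,
    length_penalty=1.0,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    early_stopping=True,
)

# Sampling knobs require do_sample=True and are usually used without beam search:
sampled = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("heb_Hebr"),
    max_length=400,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.batch_decode(sampled, skip_special_tokens=True)[0])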
What's your environment?
fairseq Version (e.g., 1.0 or main): N/A
PyTorch Version (e.g., 1.0): N/A
OS (e.g., Linux): macOS (MacBook Pro, Apple M3 Max)
How you installed fairseq (pip, source): N/A
Build command you used (if compiling from source): N/A
Python version: Python 3.11.8
CUDA/cuDNN version: N/A
GPU models and configuration: N/A
Any other relevant information:
transformers: 4.40.0
The NLLB model has been trained to translate sentences, not individual words.
This is a long-standing tradition in machine translation.
Texts longer than a sentence used to be too difficult for translation models (modern LLMs can translate longer documents, but most MT training data is at sentence level anyway).
Texts shorter than one sentence are often too ambiguous to translate. For example, the English word "OK" may be a noun that means "approval" (as in "We can start as soon as we get the OK."), a verb that means "to approve" (as in "I don't want to OK this amount of money."), an adjective that means "acceptable" (as in "Do you think it's OK to stay here for the night?"), or an adverb that means "sufficiently well".
So most machine translation technology focuses on sentence translation, and NLLB is no exception.
And because it was trained on sentences, NLLB perceives the text "OK" as a sentence, and translates it into Hebrew, Russian, and French sentences that mean "Well.", "Good.", and "I agree.", respectively.
By the way, if you want pure word translation, you probably don't need a neural network at all; it's just the wrong tool. A dictionary would suffice, and if you want a highly multilingual one, I would suggest https://en.wiktionary.org/ or https://panlex.org/.
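If the extra period is the only practical concern, a small post-processing step (not something NLLB or the pipeline provides; just a sketch) can drop a terminal period from the translation whenever the source text itself ends without punctuation:

import re

def strip_added_period(source: str, translation: str) -> str:
    # Only strip when the source has no terminal punctuation of its own,
    # so genuine sentence-final periods are left untouched.
    if source and source[-1] not in ".!?。؟":
        return re.sub(r"[.。]+\s*$", "", translation)
    return translation

print(strip_added_period("ok", "בסדר."))  # -> 'בסדר'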