
Additional dot in translated text #5487

Open
gosha70 opened this issue Apr 21, 2024 · 1 comment
gosha70 commented Apr 21, 2024

❓ Questions and Help

What is your question?

For some English words, the model adds a '.' to the end of the translation; for example: ok.
See the code below, which produces the following output:

Translating: 'ok'
Translation from eng_Latn to heb_Hebr: 'בסדר, בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '괜찮아요'

Translating: 'Ok'
Translation from eng_Latn to heb_Hebr: 'בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо.'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '좋아'

Translating: 'OK'
Translation from eng_Latn to heb_Hebr: 'בסדר.'
Translation from eng_Latn to rus_Cyrl: 'Хорошо.'
Translation from eng_Latn to fra_Latn: 'Je suis d'accord.'
Translation from eng_Latn to kor_Hang: '괜찮아요'

Code

# Load the NLLB checkpoint once; the model and tokenizer are reused for every language pair.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(from_lang: str, to_lang: str, text: str):
    # Build a translation pipeline for this language pair on top of the shared model/tokenizer.
    translator = pipeline(
        task="translation",
        model=model,
        tokenizer=tokenizer,
        src_lang=from_lang,
        tgt_lang=to_lang,
        max_length=400,
    )
    output = translator(text)
    translated_text = output[0]["translation_text"]
    print(f"Translation from {from_lang} to {to_lang}: '{translated_text}'")


texts_to_translate = [
    "ok",
    "Ok",
    "OK",
]

to_langs = [
    "heb_Hebr",
    "rus_Cyrl",
    "fra_Latn",
    "kor_Hang",
]

for text in texts_to_translate:
    print(f"Translating: '{text}'")
    for lang in to_langs:
        translate(from_lang="eng_Latn", to_lang=lang, text=text)
    print("\n\n")  # blank lines between the groups of translations

What have you tried?

Tried adding an extra space to the input; sometimes it helps, but most often it does not.

Tried different combinations of hyperparameters; none of them made any difference (a sketch of how they were passed appears after this list):

  • max_length: This parameter controls the maximum length of the generated sequence; if generation would exceed max_length, the output is cut off. Increasing it might prevent the output from being cut short but could increase computation time.
  • num_beams: This parameter is used in beam search, which is a strategy for generating text where multiple translation paths are considered at each step. Increasing the number of beams can potentially improve the quality of the output at the cost of more computation.
  • temperature: This parameter controls randomness in the output generation. Lower temperatures make the model outputs more deterministic and conservative, while higher temperatures encourage more diversity but can also introduce more mistakes.
  • top_k: This parameter is used with sampling strategies, limiting the number of highest probability vocabulary tokens to be considered for each step. A lower top_k reduces randomness.
  • top_p (nucleus sampling): This sampling strategy involves selecting the smallest set of tokens whose cumulative probability exceeds the probability p. The model will then only consider this set of tokens for generating the next word. This can lead to more fluent and coherent text generation.
  • repetition_penalty: This parameter discourages the model from repeating the same line verbatim. Adjusting this can help in reducing redundancies in the translations.
  • length_penalty: Adjusts the length of the generated output. Setting this parameter can help if the model consistently generates too short or too long outputs.
  • no_repeat_ngram_size: This parameter prevents the repetition of n-grams. This can be useful to avoid repeated phrases or sentences, which is a common issue in generated text.
  • early_stopping: If set to True, generation will stop if all beam candidates reach the EOS token (end of sequence). This can save computation time without affecting the output quality significantly.
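For reference, here is a minimal sketch of how these parameters were passed; the translation pipeline forwards extra keyword arguments to model.generate (as far as I can tell), and the exact values below are only illustrative:

from transformers import pipeline

# Same NLLB checkpoint as above, with generation parameters passed at call time.
translator = pipeline(
    task="translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="heb_Hebr",
)
output = translator(
    "ok",
    num_beams=5,              # wider beam search
    no_repeat_ngram_size=3,   # discourage repeated n-grams
    repetition_penalty=1.2,
    max_length=400,
)
print(output[0]["translation_text"])  # the trailing dot was still there in my runs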

What's your environment?

  • fairseq Version (e.g., 1.0 or main): N/A
  • PyTorch Version (e.g., 1.0): N/A
  • OS (e.g., Linux): MacBook Pro - Apple M3 Max
  • How you installed fairseq (pip, source): N/A
  • Build command you used (if compiling from source): N/A
  • Python version: Python 3.11.8
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • Any other relevant information:
    • transformers: 4.40.0
avidale commented Jun 21, 2024

The NLLB model has been trained to translate sentences, not individual words.

This is a tradition in machine translation.
Texts longer than a sentence used to be too difficult for translation models (modern LLMs can translate longer documents, but most MT training data is at sentence level anyway).
Texts shorter than one sentence are often too ambiguous to translate. For example, the English word "OK" may be a noun that means "approval" (as in "We can start as soon as we get the OK."), a verb that means "to approve" (as in "I don't want to OK this amount of money."), an adjective that means "acceptable" (as in "Do you think it's OK to stay here for the night?"), or an adverb that means "sufficiently well".
So most machine translation technologies focus on sentence translation, and NLLB is no exception.

Having been trained on sentences, NLLB perceives the text "OK" as a sentence and translates it into Hebrew, Russian, and French sentences that mean "Well.", "Good.", and "I agree.", respectively.
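If the trailing period is your only concern, a simple workaround is plain post-processing, nothing NLLB-specific: strip the period from the translation when the source text itself has no terminal punctuation. A minimal sketch (the function name and heuristic are my own, not part of the library):

def strip_added_period(source: str, translation: str) -> str:
    # Heuristic: if the source does not end with sentence punctuation
    # but the translation does, drop the final period(s).
    if source and source[-1] not in ".!?" and translation.endswith("."):
        return translation.rstrip(".")
    return translation

print(strip_added_period("ok", "בסדר, בסדר."))  # -> 'בסדר, בסדר'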

By the way, if you want pure word translation, you probably don't need a neural network at all; it's just the wrong tool. A dictionary would suffice, and if you want a highly multilingual one, I would suggest https://en.wiktionary.org/ or https://panlex.org/.
