
Exploring the Potential of High-Quality Synthetic Datasets Using Translation-agent to Enhance the Performance of Multilingual Large Language Models #18

Open
universea opened this issue Jun 16, 2024 · 11 comments
Labels
research Requires R&D

Comments

@universea

Background
In the field of multilingual large language models, non-English corpora in particular often suffer from insufficient quantity and poor quality. High-quality training data is crucial for model performance, and for under-resourced minority languages the lack of sufficient training data is often the main limiting factor. Could translating an English corpus into various minority languages to create high-quality synthetic corpora therefore be a potential solution?

Proposal
The following steps might help generate and utilize synthetic datasets to enhance the capabilities of large multilingual models, especially for minority languages:

Selection and Optimization of Translation Models: Use the translation-agent project or other efficient translation models as the translation tool to ensure high-quality output.

Generation of Synthetic Datasets: Utilize the aforementioned translation model to process English data, generating synthetic datasets in various target languages. Clean and validate the generated data to ensure its quality (a minimal sketch of this step follows the list).

Training and Fine-Tuning of Multilingual Large Language Models: Use the synthetic datasets to train or fine-tune multilingual large models, observing improvements in the models' ability to process multilingual texts.

Evaluation and Iteration: Assess the model performance through standardized tests and real-world scenario testing, with a particular focus on the performance in minority languages, and iterate on the model and data generation processes based on feedback.
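As a rough, non-authoritative sketch of the translation and dataset-generation steps above, the loop below assumes a translate() helper following translation-agent's documented usage, plus a hypothetical english_corpus.jsonl file with one {"text": ...} record per line; the exact signature, file names, and target languages are assumptions rather than fixed parts of this proposal.

```python
# Hypothetical sketch: translate an English corpus into target languages to
# build synthetic fine-tuning sets. The translate() call follows the usage
# shown in translation-agent's README, but treat the exact signature as an
# assumption; the file names and target languages are placeholders.
import json

import translation_agent as ta

TARGET_LANGS = ["Swahili", "Khmer", "Quechua"]  # example low-resource targets


def build_synthetic_dataset(in_path: str, out_path: str, target_lang: str) -> None:
    """Translate each English record and write (source, translation) pairs."""
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            translation = ta.translate(
                source_lang="English",
                target_lang=target_lang,
                source_text=record["text"],
                country="",  # optional locale hint used by the agent workflow
            )
            fout.write(json.dumps(
                {"lang": target_lang, "source": record["text"], "text": translation},
                ensure_ascii=False,
            ) + "\n")


if __name__ == "__main__":
    for lang in TARGET_LANGS:
        build_synthetic_dataset("english_corpus.jsonl", f"synthetic_{lang.lower()}.jsonl", lang)
```

The cleaning and validation step would then filter these records (for example by deduplication, length-ratio checks, or spot-checking translations) before any fine-tuning.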

Discussion
This approach could greatly promote data generation and model training for minority languages, as well as improve the overall performance and breadth of application of multilingual models. I am eager to discuss the feasibility of this plan, particularly the technical implementation and resource requirements. If you are interested in this topic or have relevant experience, please join the discussion so we can explore this area together.

@universea universea changed the title Exploring the Potential of High-Quality Synthetic Datasets Using Translation-agent to Enhance the Performance of Large Multilingual Models Exploring the Potential of High-Quality Synthetic Datasets Using Translation-agent to Enhance the Performance of Multilingual Large Language Models Jun 16, 2024
@enismaxim1

Having conducted research on a closely related topic, I think that this approach is unlikely to work.

A useful mental model for how LLMs operate on different languages is the following: they first convert text (in any particular language) into a shared vector space representing meaning. Crucially, this vector space is language-agnostic: the LLM will map a Russian text and an English text of equal meaning to nearly the same region of the vector space. The conversion from language into meaning is essentially translation, and the LLM will be lossy on this conversion in precisely the languages it is less effective at translating.
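A quick way to see this intuition concretely is to compare embeddings of translation pairs. Below is a minimal sketch, assuming the sentence-transformers library and one of its off-the-shelf multilingual checkpoints; the model and sentences are purely illustrative.

```python
# Sketch: check whether parallel sentences land near each other in a shared
# multilingual embedding space. The model choice here is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

english = "The cat is sleeping on the sofa."
russian = "Кошка спит на диване."  # same meaning as the English sentence
unrelated = "Quarterly revenue grew by ten percent."

embeddings = model.encode([english, russian, unrelated], convert_to_tensor=True)
print("en vs ru (same meaning):", util.cos_sim(embeddings[0], embeddings[1]).item())
print("en vs unrelated sentence:", util.cos_sim(embeddings[0], embeddings[2]).item())
# The first similarity should be much higher, which is the sense in which the
# representation space is (approximately) language-agnostic.
```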

But the point is that the LLM's ability to effectively process low-resource language text is limited by its underlying translation ability. If we generate synthetic data, we are limited by exactly the same thing, meaning we are unlikely to extract extra signal from synthetic data with this approach.

A related way to generate synthetic data that does work is to use LLMs, which translate particular documents better than existing dedicated translation systems, to train those smaller translation systems to match the translation performance of LLMs. If you're interested, see issue #9 (or the paper https://arxiv.org/abs/2404.13813).
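For reference, here is a minimal sketch of the fine-tuning side of that idea, assuming Hugging Face datasets/transformers and a JSONL file of (English, LLM translation) pairs; the student model, file name, language pair, and hyperparameters are illustrative assumptions rather than the setup from issue #9 or the paper.

```python
# Sketch: fine-tune a small MT "student" on (English, LLM-translation) pairs,
# stored as JSONL records of the form {"translation": {"en": ..., "hi": ...}}.
# Model name, language pair, file name, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

student_name = "Helsinki-NLP/opus-mt-en-hi"  # example English->Hindi student
tokenizer = AutoTokenizer.from_pretrained(student_name)
model = AutoModelForSeq2SeqLM.from_pretrained(student_name)

raw = load_dataset("json", data_files="llm_parallel_pairs.jsonl", split="train")


def preprocess(batch):
    # Tokenize source texts and LLM translations as seq2seq training targets.
    sources = [pair["en"] for pair in batch["translation"]]
    targets = [pair["hi"] for pair in batch["translation"]]
    return tokenizer(sources, text_target=targets, truncation=True, max_length=256)


tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="student-mt", num_train_epochs=1,
                                  per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```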

@siddhantx0

I can try Hindi/Bhojpuri bad words...

@random-yang

Having conducted research on a closely related topic, I think that this approach is unlikely to work. […]

Do you mean that even an LLM contains very little low-resource language knowledge, so we cannot effectively distill that knowledge to further improve low-resource language translation?

@enismaxim1

enismaxim1 commented Jun 22, 2024

No, I just mean that translating English data into a low-resource language using an LLM and then training the LLM on that same data is unlikely to yield any performance improvement. The argument is that the LLM's level of multilinguality should come from its ability to translate; therefore, translating does not provide the LLM with any additional useful data.

@universea
Author

No, I just mean that translating English data into a low-resource language using an LLM and then training the LLM on that same data is unlikely to yield any performance improvement. The argument is that the LLM's level of multilinguality should come from its ability to translate; therefore, translating does not provide the LLM with any additional useful data.

Thank you for your comment. I would also like to know whether this method can improve the reasoning, logic, and mathematical abilities of large models in low-resource languages.

@enismaxim1

enismaxim1 commented Jun 22, 2024

I would guess that it cannot. The reasoning/logic/math capabilities should be independent of any particular language (instead operating on the shared latent space of meaning). Therefore I would expect the logic/math/reasoning capabilities to again be bottlenecked by translation/language understanding, which I don't expect this method can improve.

@siddhantx0

Your point about reasoning being language-independent is valid. However, language-specific training might still enhance performance indirectly by improving comprehension and expression. While core logic remains unchanged, better language understanding can facilitate more accurate interpretations and responses, potentially leading to improved overall reasoning capabilities.

@universea
Author

I mostly support your viewpoints, but I have also noticed a phenomenon. For example, many large models can answer an English math question quite well, but when the same math question is presented in other languages, the accuracy and quality of the answers from the large model tend to decline. If we translate the math question from a lesser-known language into English, the quality and accuracy of the model's answers improve. How do you view this issue?
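To make the comparison concrete, here is a minimal sketch of the two settings I mean, assuming an OpenAI-style chat client; the model name and prompts are placeholders, not what was actually used in these observations.

```python
# Sketch of the observed comparison: answer a math question directly in a
# low-resource language vs. translate it to English first, then answer.
# Uses the OpenAI Python client; the model name and prompts are assumptions.
from openai import OpenAI

client = OpenAI()


def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


question_lowres = "..."  # the same math question written in the low-resource language

# (a) direct answering in the original language
direct_answer = ask(question_lowres)

# (b) translate to English first, then answer
english_question = ask(f"Translate the following question to English:\n\n{question_lowres}")
translated_answer = ask(f"Solve this math problem step by step:\n\n{english_question}")

print("direct:", direct_answer)
print("translate-then-solve:", translated_answer)
```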

@siddhantx0

How would we add new remote languages?

@siddhantx0

I mostly support your viewpoints, but I have also noticed a phenomenon. For example, many large models can answer an English math question quite well, but when the same math question is presented in other languages, the accuracy and quality of the answers from the large model tend to decline. If we translate the math question from a lesser-known language into English, the quality and accuracy of the model's answers improve. How do you view this issue?

More languages means more data means more insight

@enismaxim1

I mostly support your viewpoints, but I have also noticed a phenomenon. For example, many large models can answer an English math question quite well, but when the same math question is presented in other languages, the accuracy and quality of the answers from the large model tend to decline. If we translate the math question from a lesser-known language into English, the quality and accuracy of the model's answers improve. How do you view this issue?

This is interesting. If it is true, then these methods should work. Did you translate the problem into English using an LLM or with something else (like Google)?

@methanet methanet added the research Requires R&D label Jun 28, 2024