Translate Hokkien

The goal of this project is to create quality ML Hokkien Translations.
This project contains tools to help translate and evaluate English, Hokkien (POJ script), Hokkien (Tai-Lo script), and Hokkien (漢字 script).
This project focuses on Text-to-Text translations.
(Hokkien is also known as Minnan, Taiwanese, Hoklo, Southern Min, and iso 639-3: NAN.)

Demo

Try out the Online Demo of the latest Hokkien Translation Model

Updates

2023-11-07

Added Models, Translations, and Evaluations of Hokkien (Latin script) -> Hokkien (漢字 script)*
- - Hokkien (Latin Script) = A mix of Manual and Automated Translations/Transliterations. Automated ones are a mix of Southern + Northern Hokkien Dialects, and also a mix of Tai-Lo and POJ scripts.
- Results: Fine Tuned GPT3.5 achieved 30% BLEU (5x more than GPT4-Zero-Shot which got 6%).
- Outcomes: This model would be useful for processing Hokkien Wikipedia, as it is the largest source of easily accessible Hokkien Texts.

2023-10-31

Data Management; Moved basic MoeDict transformations to dbt Pipeline. Appended MoeDict data for Mart_sample usecases as well.

2023-10-26

Added Translations and Evaluations of: GPT-3.5 Fine Tuned on 12,000 Examples (Almost all of MoeDict samples), for Mandarin -> Hokkien (漢字 script).
Result: BLEU score of 21
Conclusions:
- A Finetuned GPT3.5 Model definitely performs better than a GPT4 Zero-Shot Model when there is 1000+ sentence-pairs.
- A Finetuned GPT3.5 Model with ~10,000 sentence pairs performs ~ ↑ 55% better than GPT4 Zero-Shot, and ~ ↑ 282% better than GPT3.5 Zero-Shot.

2023-10-24

Added MoeDict Dataset. It along with an "English" column (translated from Mandarin via GPT4).
Calculated BLEU Scores with new data.
⚠️ Discovered previous BLEU Score calculations were off. Update with Corrected BLEU Scores!
(Data Structures: Refactored so they're easier to deal with.)
Findings:
- English -> Hokkien (POJ script) - Initial naieve models have very low BLEU scores (1%)
- Mandarin -> Hokkien (漢字 script) - Has a much higher BLEU (7% to 17%). This is about half of what one would expect of a passable BLEU score (30%).
  - GPT-3.5 Zero Shot: BLEU 7%
  - GPT-3.5 Fine Tuned on 100 Examples: 10%
  - GPT-4 Zero Shot: BLEU 13%
  - GPT-3.5 Fine Tuned on 1,000 Examples: 16%
    - (Yup, a Fine-Tuned GPT3.5 Model surpases GPT-4 Zero Shot)
Hypotheses:
- For ZH->NAN(TC): given the change in Magnitude with Finetuning (0->100->1,000 Examples = 7%->10%->16% BLEU), it is foreseeable that if most of the MOEDICT dataset is used (~13,201 sentence pairs), then there's a chance the BLEU score could reach a passable level (30%).

2023-10-19

Management: Continuing to replace more data models, with dbt models.

2023-10-12

Management: Formatted the downstream 'ML_TESTSET_EVALUATIONS_AVERAGE' table as a dbt Model, as part of the pipeline.

2023-10-11

Management: Reformatted the data as SQLITE3, and initialized a DBT Project from it.

2023-10-10

Reference Texts
- Gathered some reference text from Wikipedia (GFDL license) and Omniglot (Non-Commercial license)
- Cleaned up reference texts
- Generated some reference English translations from Minnan Wikipedia (POJ). Generated by taking the "median text" from GPT4 translations. This is not necessarily accurate, but serves as a basis.
Candidate Texts
- Generated some EN→NAN translations (via GPT4 and GPT3.5)
Evaluations
- Generated several evaluations based on BLEU
Conclusions and Next Steps
- Results: The BLEU scores for these evaluations are quite bad, with only unigram scores showing any non-zero results. Things to try to improve this:
  - A more lenient POJ Tokenizer that tokenizes by Syllable rather than Word. This is because word-separation isn't always consistent.
  - A more lenient POJ Tokenizer that ignores diacritics. This is because current POJ sources can be inconsistent.
  - Using Hanzi as a base script before any POJ conversions, for early translation models.
  - Using Mandarin Chinese as an intermediary.
  - Consider the use of Tâi-lô (as a Hanzi→Tâi-lô converter currently exists, but not a Hanzi→POJ one). And how Tâi-lô effects some of the source data.
  - Refer to romanized words, like "Hanzi", as "Hàn-jī / Hàn-lī" in any LLM prompts. Using Hokkien scripts may slightly bias the LLM towards more accurate Hokkien vocabulary, grammar, and script writing.
- Pipeline: These were all generated in spreadsheets. In future, they should be better automated as part of a data pipeline.

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
data_as_dbt_project		data_as_dbt_project
data_as_sqlite		data_as_sqlite
images		images
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data_as_dbt_project

data_as_dbt_project

data_as_sqlite

data_as_sqlite

images

images

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

Translate Hokkien

Demo

Updates

2023-11-07

2023-10-31

2023-10-26

2023-10-24

2023-10-19

2023-10-12

2023-10-11

2023-10-10

About

Releases

Packages

License

whiai/translate-hokkien

Folders and files

Latest commit

History

Repository files navigation

Translate Hokkien

Demo

Updates

2023-11-07

2023-10-31

2023-10-26

2023-10-24

2023-10-19

2023-10-12

2023-10-11

2023-10-10

About

Topics

Resources

License

Stars

Watchers

Forks