📖 Bible Neural Machine Translation with NLLB and LLMs

🔧 Setup

pip install -r requirements.txt

Fine-tuning NLLB models also requires a GPU that supports FlashAttention and bfloat16 (e.g. H100 or L40S).
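A quick way to sanity-check a GPU before training is to look at its CUDA compute capability. The sketch below assumes that both FlashAttention-2 and native bfloat16 need an Ampere-class or newer GPU (compute capability 8.0 and up). The helper name `supports_flash_attn_bf16` is hypothetical and not part of this repository.

```python
def supports_flash_attn_bf16(major: int, minor: int = 0) -> bool:
    """Return True if the compute capability is Ampere (8.0) or newer.

    Assumption: FlashAttention-2 and native bfloat16 both require sm_80+.
    """
    return (major, minor) >= (8, 0)


if __name__ == "__main__":
    try:
        import torch  # optional; only needed to query the local GPU
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        name = torch.cuda.get_device_name(0)
        print(f"{name}: sm_{major}{minor}, ok={supports_flash_attn_bf16(major, minor)}")
    else:
        print("No CUDA device visible; cannot check.")
```

An H100 reports compute capability 9.0 and an L40S reports 8.9, so both pass this check, while older cards such as the T4 (7.5) do not.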

🌏 Fine-tuning with NLLB

Fine-tune a multilingual translation model (e.g. NLLB-200) to translate Bible verses between languages, using domain-specific corpora such as the eBible corpus (`bible-nlp/biblenlp-corpus`) and Alkitab Sabda (`LazarusNLP/alkitab-sabda-mt`).

The training script supports language-pair fine-tuning and evaluation with BLEU and chrF, builds on 🤗 Hugging Face `transformers` and `datasets`, and is optimized for fast training with FlashAttention and bfloat16.
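For readers unfamiliar with chrF, here is a simplified sentence-level version in plain Python. It is only an illustration of the idea (a character n-gram F-score with recall weighted more heavily than precision), not the scoring code behind the numbers reported in this README, and it will not match a reference implementation exactly. The function names and the defaults `max_n=6` and `beta=2.0` are assumptions chosen to mirror common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # one of the strings is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)

    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Because chrF works on characters rather than words, it gives partial credit for near-miss word forms, which is one reason it is often reported alongside BLEU for morphologically rich or low-resource languages.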

🚀 Usage

To fine-tune the distilled NLLB-200 1.3B model to translate from Indonesian into a regional language (e.g. Bambam, `ptu`) using the eBible corpus, run:

python src/run_translation.py \
    --model_name facebook/nllb-200-distilled-1.3B \
    --dataset_name bible-nlp/biblenlp-corpus \
    --src_lang ind \
    --tgt_lang ptu \
    --src_lang_nllb ind_Latn \
    --tgt_lang_nllb ptu_Latn \
    --max_length 256 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 16 \
    --learning_rate 2e-4 \
    --max_steps 5000
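To get a feel for the training budget these flags imply, the arithmetic can be sketched as below. It assumes a single GPU and no gradient accumulation, neither of which the command specifies, and the helper name `sequences_seen` is hypothetical.

```python
def sequences_seen(per_device_batch_size: int, max_steps: int,
                   num_devices: int = 1, grad_accum_steps: int = 1) -> int:
    """Number of training sequence pairs consumed over the whole run."""
    return per_device_batch_size * num_devices * grad_accum_steps * max_steps


# Values taken from the command above; device count and accumulation are assumed.
print(sequences_seen(per_device_batch_size=64, max_steps=5000))  # 320000
```

With more GPUs or gradient accumulation, the total scales linearly, so the number of passes over the corpus changes even though `--max_steps` stays the same.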

To fine-tune the model from Indonesian into a regional language (e.g. Batak Karo, `btx`) using Alkitab Sabda, run:

python src/run_translation.py \
    --model_name facebook/nllb-200-distilled-1.3B \
    --dataset_name LazarusNLP/alkitab-sabda-mt \
    --src_lang ind \
    --tgt_lang btx \
    --src_lang_nllb ind_Latn \
    --tgt_lang_nllb btx_Latn \
    --max_length 256 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 16 \
    --learning_rate 2e-4 \
    --max_steps 5000

📊 Results

| Source | Target | Dataset | Eval Set | BLEU | chrF |
| --- | --- | --- | --- | --- | --- |
| `ind` | `ptu` | `biblenlp-corpus` | `validation` | 18.4774 | 47.1922 |
| `ind` | `ptu` | `biblenlp-corpus` | `test` | 19.4204 | 49.0809 |
| `ind` | `btx` | `alkitab-sabda-mt` | `validation` (NT) | 26.2684 | 52.8884 |
| `ind` | `btx` | `alkitab-sabda-mt` | `test` (OT) | 8.4475 | 28.9892 |

🦙 Few-shot Translation with LLMs

LLMs (e.g. GPT, Gemini) can also translate Bible verses few-shot, using the same datasets. This approach is more flexible and works for languages that NLLB does not cover. For each source verse, a retriever (TF-IDF, BM25, or Sentence Transformers) selects similar source-target example pairs from the corpus, and those pairs are placed in the prompt ahead of the verse to translate.
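The retrieval step can be illustrated with a small hand-rolled Okapi BM25 ranker over the source side of a parallel corpus. This is a minimal sketch, not the repository's implementation: the class name `BM25Retriever`, the whitespace tokenizer, and the parameters `k1=1.5` and `b=0.75` are assumptions, and the corpus is toy data with placeholder target strings rather than real translations.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (an assumption; real code may differ)."""
    return text.lower().split()


class BM25Retriever:
    """Minimal Okapi BM25 over the source side of (source, target) pairs."""

    def __init__(self, pairs: list[tuple[str, str]], k1: float = 1.5, b: float = 0.75):
        self.pairs = pairs
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(src)) for src, _ in pairs]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        doc_freq = Counter(term for doc in self.docs for term in doc)
        n = len(self.docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5))
                    for t, df in doc_freq.items()}

    def score(self, query_terms: list[str], index: int) -> float:
        doc, length = self.docs[index], self.lengths[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(self.idf[t] * doc[t] * (self.k1 + 1) / (doc[t] + norm)
                   for t in query_terms if t in doc)

    def retrieve(self, query: str, k: int = 5) -> list[tuple[str, str]]:
        terms = tokenize(query)
        ranked = sorted(range(len(self.pairs)),
                        key=lambda i: self.score(terms, i), reverse=True)
        return [self.pairs[i] for i in ranked[:k]]


# Toy corpus: Indonesian sources with placeholder targets (not real translations).
corpus = [
    ("Pada mulanya Allah menciptakan langit dan bumi", "<target verse 1>"),
    ("Allah melihat bahwa terang itu baik", "<target verse 2>"),
    ("Yesus pergi ke Galilea", "<target verse 3>"),
]
retriever = BM25Retriever(corpus)
for src, tgt in retriever.retrieve("Allah menciptakan manusia", k=2):
    print(src, "->", tgt)
```

The query shares two terms with the first pair and one with the second, so those two are returned in that order. Lexical retrievers like this reward exact word overlap, whereas an embedding-based retriever can also match paraphrases.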

🚀 Usage

To run few-shot translation from Indonesian to Bambam with an LLM (e.g. Gemini 2.0 Flash) and BM25 as the retriever, run:

python src/run_llm_few_shot.py \
    --model gemini-2.0-flash \
    --src_lang ind \
    --tgt_lang ptu \
    --src_lang_name Indonesian \
    --tgt_lang_name Bambam \
    --vectorizer bm25 \
    --num_few_shot 5
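To make the roles of `--src_lang_name`, `--tgt_lang_name`, and `--num_few_shot` concrete, here is one way retrieved example pairs could be assembled into a prompt. The template and the function name `build_few_shot_prompt` are hypothetical; the actual prompt used by `src/run_llm_few_shot.py` may be worded differently.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], source_text: str,
                          src_lang_name: str, tgt_lang_name: str) -> str:
    """Format retrieved (source, target) pairs followed by the verse to translate."""
    lines = [f"Translate the following text from {src_lang_name} to {tgt_lang_name}."]
    for src, tgt in examples:
        lines.append("")
        lines.append(f"{src_lang_name}: {src}")
        lines.append(f"{tgt_lang_name}: {tgt}")
    # The final block leaves the target side empty for the LLM to complete.
    lines += ["", f"{src_lang_name}: {source_text}", f"{tgt_lang_name}:"]
    return "\n".join(lines)


# Placeholder examples; with --num_few_shot 5 there would be five such pairs.
examples = [
    ("Pada mulanya Allah menciptakan langit dan bumi", "<target verse 1>"),
    ("Allah melihat bahwa terang itu baik", "<target verse 2>"),
]
prompt = build_few_shot_prompt(examples, "Allah menciptakan manusia",
                               "Indonesian", "Bambam")
print(prompt)
```

Ending the prompt on an open target-language line nudges the model to answer with the translation only, which keeps post-processing simple.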

📊 Results

| Source | Target | Dataset | Eval Set | BLEU | chrF |
| --- | --- | --- | --- | --- | --- |
| `ind` | `ptu` | `biblenlp-corpus` | `test` | 14.5615 | 41.1438 |

