ZurichNLP/multilingual-instruction-tuning


This repository contains the code and data for the paper "Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?"

Environment Setup

To set up the environment, we recommend using conda, e.g.:

conda create -n ml_llm -c conda-forge python=3.10 cudatoolkit=11.8 -y
conda activate ml_llm
pip install vllm==0.2.1
pip install -r requirements.txt

Download the model used for language detection to resources/lid/:

mkdir -p resources/lid
wget https://data.statmt.org/lid/lid201-model.bin.gz -P resources/lid/
gzip -d resources/lid/lid201-model.bin.gz
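
The LID model is a fastText classifier, so it can be queried directly with the fasttext package. A minimal sketch for illustration only (how the repo's own evaluation scripts wrap this may differ):

import fasttext

# Load the model downloaded above.
lid_model = fasttext.load_model("resources/lid/lid201-model.bin")

# predict() returns (labels, probabilities); labels look like "__label__deu_Latn".
labels, probs = lid_model.predict("Wie viel Mehrsprachigkeit wird benötigt?", k=1)
print(labels[0], probs[0])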

For evaluations using EleutherAI's LM Evaluation Harness, run:

git clone git@github.com:EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git reset --hard 3ccea2b2
pip install -e ".[multilingual]"

API Keys

If running experiments with OpenAI's API-based models, create a file containing your API key, e.g.:

echo "OPENAI_API_KEY = 'YOUR_OPENAI_API_KEY'" > api_secrets.py

Models

All models and training datasets used in our experiments are available on the Hugging Face Hub.

Data

The data used for our experiments is available in data/.

This includes:

- Guanaco and its subsets (Mono, Multi-2, Multi-3, etc.)
- Alpaca Eval prompts in different languages (used for single-turn dialogue evaluation)
- MultiSim simplification benchmark (used for sentence simplification evaluation)
- XQuAD (used for extractive QA evaluation)
- X-CSQA (used for commonsense reasoning evaluation)

Where applicable, we include the prompt templates used to run the evaluations with each dataset.

For reproducibility, the data can be prepared from the original sources using the relevant notebooks in data_prep/.

Model Training

To train a model on a given dataset, use the script sft_training.py. For example:

CUDA_VISIBLE_DEVICES=2,3 nohup python sft_training.py \
    --model_name_or_path "meta-llama/Llama-2-7b-hf" \
    --train_dataset "data/guanaco/guanaco_train_ml2.json" \
    --eval_dataset "data/guanaco/guanaco_test.json" \
    --output_dir "resources/models/llama_2_7b_hf_ml2" \
    --num_train_epochs 10 \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 4 \
    --log_with "wandb" >| resources/models/logs/llama_2_7b_hf_ml2.log &

Once training has completed, we merge the learned adapters with the base model for easy loading with vLLM:

python merge_peft_adapter.py \
    --adapter_model_name_or_path "resources/models/llama_2_7b_hf_ml2" \
    --output_dir "resources/models/llama_2_7b_hf_ml2_merged"

Inference

To run inference for the different tasks, use the appropriate run_*_inference.sh script in scripts/, specifying the GPU device ID, model directories, and evaluation datasets.
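
The merged models can be loaded directly with vLLM. A minimal sketch (the Guanaco-style prompt template and sampling settings shown here are assumptions, not necessarily what the inference scripts use):

from vllm import LLM, SamplingParams

llm = LLM(model="resources/models/llama_2_7b_hf_ml2_merged")
params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=512)

prompt = "### Human: Hallo, wie geht's?\n### Assistant:"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)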

Single-turn Dialogue

bash scripts/run_alpaca_inference.sh \
    -d 0 \
    -m resources/models/llama_2_7b_hf_ml2_merged resources/models/llama_2_7b_hf_ml3_merged \
    -t data/alpaca_eval/alpaca_eval_instructions_is.json data/alpaca_eval/alpaca_eval_instructions_el.json data/alpaca_eval/alpaca_eval_instructions_hi.json

Sentence Simplification

bash scripts/run_ts_inference.sh -d 0 \
    -m resources/models/llama_2_7b_hf_ml2_merged resources/models/llama_2_7b_hf_ml3_merged \
    -t data/multisim/en-en.json data/multisim/en-de.json data/multisim/de-de.json 

X-CSQA

bash scripts/run_xcsqa_inference.sh \
    -d 0 \
    -m resources/models/llama_2_7b_hf_ml2_merged resources/models/llama_2_7b_hf_ml3_merged \
    -t data/xcsqa/xcsqa_dev_zh_zh.json data/xcsqa/xcsqa_dev_fr_fr.json

XQuAD

bash scripts/run_xquad_inference.sh \
    -d 0 \
    -m resources/models/llama_2_7b_hf_ml2_merged resources/models/llama_2_7b_hf_ml3_merged \
    -t data/xquad/xquad_dev_en_hi.json data/xquad/xquad_dev_hi_hi.json

XNLI

XNLI is evaluated with the LM Evaluation Harness, e.g.:

nohup bash scripts/run_lm_eval_harness.sh 0 resources/models/llama_2_7b_hf_ml2_merged >| logs/llama_2_7b_hf_ml2_merged.log &

Evaluation

The script run_llm_judge.sh can be used to evaluate chat responses given multiple models and target languages, e.g.:

bash scripts/run_llm_judge.sh \
    -m data/alpaca_eval_outputs/llama_2_7b_hf_ml2_merged data/alpaca_eval_outputs/llama_2_7b_hf_ml3_merged \
    -l is el hi
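
Note that the LLM judge calls OpenAI's API, so api_secrets.py (see API Keys above) must be in place.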

Results

Plots from the paper can be generated using the plotting notebook included in this repository. This assumes that the model outputs and evaluation results are available in ./resources/outputs.

Citation

@misc{kew2023turning,
      title={Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?}, 
      author={Tannon Kew and Florian Schottmann and Rico Sennrich},
      year={2023},
      eprint={2312.12683},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
