This repository contains the training and evaluation code for llm-jp/llm-jp-modernbert-base.
The technical report is available here: llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length.
Install the dependencies with uv:
$ uv sync
We use llm-jp-tokenizer v3.0b2 as the tokenizer for the model.
The original llm-jp-tokenizer v3.0b2 is designed for decoder-only models. It adds a beginning-of-sequence (BOS) token <s> before each input:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-1.8b")
>>> print([tokenizer.decode(token) for token in tokenizer("foo", return_tensors="pt")["input_ids"][0]])
['<s>', 'foo']
>>> print([tokenizer.decode(token) for token in tokenizer("foo", "bar", return_tensors="pt")["input_ids"][0]])
['<s>', 'foo', '<s>', 'bar']
For encoder models such as BERT, the special tokens [CLS] and [SEP] must be used correctly. To adapt the tokenizer for this use case, run:
$ python src/train/prepare_tokenizer.py
This modifies the tokenizer to output BERT-compatible tokens:
tokenizer("こんにちは")
# {'input_ids': [5, 39801, 6], 'attention_mask': [1, 1, 1]}
tokenizer.decode([5, 39801, 6])
# <CLS|LLM-jp> こんにちは <SEP|LLM-jp>
To set up a modernbert-base model with the modified tokenizer, run:
$ python src/train/prepare_modernbert.py
To train the model, run the following command (this script uses the Japanese subset of Wikipedia):
$ bash script/train.sh
To handle datasets that don't fit in memory, this implementation uses IterableDataset. It is adapted from the official Hugging Face script: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm_no_trainer.py
Unlike the original script, which does not support checkpoint resumption with an IterableDataset, this code can resume from checkpoints even when an IterableDataset is used.
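Conceptually, resumption works by skipping the examples that the interrupted run already consumed. Below is a minimal sketch of that idea, assuming a streaming JSON dataset and a step counter restored from the checkpoint; the variable names and bookkeeping are illustrative, not the script's actual API:

```python
# Minimal sketch of resuming with an IterableDataset: skip the examples that the
# previous run already consumed. The checkpoint bookkeeping below is illustrative,
# not the actual implementation in this repository.
from datasets import load_dataset

completed_steps = 1000       # assumed: restored from the saved training state
train_batch_size = 32        # assumed: same batch size as the interrupted run

dataset = load_dataset(
    "json",
    data_files={"train": "dataset/wiki_ja_nano/train/*.json"},
    split="train",
    streaming=True,
)
# Fast-forward past everything the previous run already saw, then keep iterating.
dataset = dataset.skip(completed_steps * train_batch_size)
```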
Prepare your dataset by placing JSON Lines files that contain only a "text" field into separate train and test directories, as shown below:
Example JSON file:
{"text": "foo"}
{"text": "bar"}
In this example, we use a subset of the Japanese Wikipedia dataset.
$ python src/train/prepare_dataset.py
This will create a directory structure like this:
$ tree dataset
dataset/
└── wiki_ja_nano
├── test
│ └── 00000.json
└── train
└── 00000.json
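If you want to build the same layout from another corpus, the following is a minimal sketch; the shard size, file names, and output path are assumptions rather than the repository's exact settings:

```python
# Minimal sketch: write a corpus into small JSON Lines shards that contain only a
# "text" field. The shard size, file names, and output path are assumptions.
import json
import os

def write_shards(texts, out_dir, shard_size=10_000):
    os.makedirs(out_dir, exist_ok=True)
    shard, shard_id = [], 0
    for text in texts:
        shard.append({"text": text})
        if len(shard) == shard_size:
            flush_shard(shard, out_dir, shard_id)
            shard, shard_id = [], shard_id + 1
    if shard:
        flush_shard(shard, out_dir, shard_id)

def flush_shard(records, out_dir, shard_id):
    # One JSON object per line, written to files like 00000.json, 00001.json, ...
    path = os.path.join(out_dir, f"{shard_id:05d}.json")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

write_shards(["foo", "bar"], "dataset/my_corpus/train")
```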
You can download a large-scale Japanese corpus from llm-jp-corpus-v3.
When using a large, non-shuffled dataset, it is recommended to keep each JSON file small. IterableDataset performs approximate shuffling: it first shuffles the list of shards (files), then sequentially loads them into a buffer and shuffles only within that buffer. If individual files are too large, shuffling becomes less effective because there is little mixing across files. For more details, refer to the Hugging Face documentation.
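In code, the approximate shuffling described above looks roughly like this; the buffer_size and seed values are illustrative:

```python
# Approximate shuffling with a streaming dataset: the shard (file) order is
# shuffled first, then examples are mixed only within a fixed-size buffer.
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={"train": "dataset/wiki_ja_nano/train/*.json"},
    split="train",
    streaming=True,
)
# buffer_size and seed are illustrative values; smaller files mix better across shards.
dataset = dataset.shuffle(seed=42, buffer_size=10_000)
```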
Let's evaluate BERT models from various perspectives!
Evaluate the performance of models on the JGLUE benchmark:
$ python src/eval/run_glue_no_trainer.py \
--model_name_or_path llm-jp/llm-jp-modernbert-base \
--task_name JSTS \
--max_length 512 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir results
Make the leaderboard:
$ python src/eval/make_leaderboard.py
| Model | JSTS | JNLI | JCoLA | Avg (JGLUE) |
|---|---|---|---|---|
| llm-jp/llm-jp-modernbert-base | 91.8 | 91.3 | 84.4 | 89.2 |
Evaluate the retrieval performance of models on the MIRACL dataset:
$ python src/eval/zeroshot_retrieval.py --model llm-jp/llm-jp-modernbert-base
Model: llm-jp/llm-jp-modernbert-base
Recall@10: 0.574
MRR@10: 0.389
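For reference, Recall@10 and MRR@10 can be computed from ranked retrieval results roughly as follows. The function and its input format are assumptions rather than the script's actual interface, and Recall@10 is shown here as a per-query hit rate, which may differ from the exact definition used in the evaluation:

```python
# Illustrative computation of Recall@10 (per-query hit rate) and MRR@10
# from ranked document IDs.
def recall_and_mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    recall_hits, reciprocal_ranks = 0, 0.0
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        top_k = ranked_ids[:k]
        # Recall@k (hit rate): did any relevant document appear in the top k?
        if any(doc_id in relevant_ids for doc_id in top_k):
            recall_hits += 1
        # MRR@k: reciprocal rank of the first relevant document within the top k.
        for rank, doc_id in enumerate(top_k, start=1):
            if doc_id in relevant_ids:
                reciprocal_ranks += 1.0 / rank
                break
    n = len(ranked_ids_per_query)
    return recall_hits / n, reciprocal_ranks / n
```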
Conduct the pseudo-perplexity evaluation (see the NeoBERT paper):
$ python src/eval/pseudo_perplexity.py --model llm-jp/llm-jp-modernbert-base --num_examples 2000
You can decrease the number of examples to speed up the evaluation.
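Pseudo-perplexity masks each token in turn, scores the original token with the MLM head, and exponentiates the average negative log-likelihood. A minimal sketch of the idea, not the script's exact implementation:

```python
# Minimal sketch of pseudo-perplexity: mask each position in turn, score the
# original token with the MLM head, and exponentiate the mean negative log-likelihood.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pseudo_perplexity(text, model_name="llm-jp/llm-jp-modernbert-base"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nll, count = 0.0, 0
    for i in range(1, len(input_ids) - 1):  # skip the special tokens at both ends
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[input_ids[i]].item()
        count += 1
    return float(torch.exp(torch.tensor(nll / count)))

print(pseudo_perplexity("今日は良い天気です。"))
```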
Conduct the alignment and uniformity evaluation (see the SimCSE paper):
$ python src/eval/alignment_and_uniformity.py
$ python src/eval/plot_align_and_uni.py
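Alignment and uniformity follow the definitions used in the SimCSE paper (originally from Wang & Isola). A minimal sketch of the two quantities, assuming L2-normalized sentence embeddings as input:

```python
# Alignment and uniformity as defined by Wang & Isola and used in the SimCSE paper.
# x, y: L2-normalized embeddings of positive pairs; z: L2-normalized corpus embeddings.
import torch

def alignment(x, y, alpha=2):
    # Mean distance between embeddings of positive pairs (lower is better).
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(z, t=2):
    # Log of the mean Gaussian potential over all pairs of embeddings (lower is better).
    sq_dist = torch.pdist(z, p=2).pow(2)
    return sq_dist.mul(-t).exp().mean().log()
```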
Visualize the sentence similarity distribution:
$ python src/eval/sim_distribution.py --model llm-jp/llm-jp-modernbert-base
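A minimal sketch of how such a similarity distribution can be obtained, assuming mean pooling over the last hidden states; the pooling strategy and sample sentences are assumptions, not necessarily what the script does:

```python
# Minimal sketch: pairwise cosine similarities of sentence embeddings obtained
# by mean pooling over non-padding tokens (pooling choice is an assumption).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["今日は晴れです。", "明日は雨が降りそうです。", "猫が好きです。"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Mean pooling over non-padding tokens, then pairwise cosine similarity.
mask = batch["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
emb = torch.nn.functional.normalize(emb, dim=-1)
print(emb @ emb.T)
```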
Evaluate the fill-mask performance of models:
$ python src/eval/fill_mask_test.py --text "今日のご飯は{mask_str}である。"
Question: 今日のご飯は{mask_str}である。
cl-tohoku/bert-base-japanese-v3: こう, sbintuitions/modernbert-ja-130m: カレーライス, llm-jp/llm-jp-modernbert-base: 納豆,
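Under the hood this is ordinary masked-token prediction. A minimal sketch using the transformers fill-mask pipeline, where the {mask_str} placeholder is replaced with each model's own mask token:

```python
# Minimal sketch: fill-mask with the transformers pipeline. The {mask_str}
# placeholder is replaced with the model's own mask token before inference.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="llm-jp/llm-jp-modernbert-base")
text = "今日のご飯は{mask_str}である。".replace("{mask_str}", fill_mask.tokenizer.mask_token)
for prediction in fill_mask(text, top_k=3):
    print(prediction["token_str"], prediction["score"])
```

If you use this model or the code in this repository, please cite the technical report: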
@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained,
title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length},
author={Issa Sugiura and Kouta Nakayama and Yusuke Oda},
year={2025},
eprint={2504.15544},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.15544},
}