dataset: Add JapaneseSentimentClassification #2913

lsz05 · 2025-07-18T08:46:43Z

This PR adds a Japanese dataset, JapaneseSentimentClassification.

We built this dataset from MultilingualSentimentClassification. In the Japanese split of MultilingualSentimentClassification, however, the sentences have been split by a morphological analysis tool into tokens separated by spaces, which do not normally appear in natural Japanese text. We found that model performance differs substantially with and without these spaces, so we reverted the morphological segmentation to remove the unnatural spaces. Our method is best-effort rather than perfect: some corner cases remain at the boundaries between Japanese and non-Japanese words.

The dataset is available in JMTEB, and this PR cites the JMTEB dataset.
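
For readers who want a feel for the space-removal step, here is a minimal sketch. It is not the procedure used to build the dataset (that code is not part of this description); the function name `remove_segmentation_spaces` and the choice of Unicode ranges are assumptions made for illustration. The heuristic drops a space whenever either neighbour is a Japanese character and keeps spaces between two non-Japanese tokens, so it also drops spaces that a reviewer originally typed between Japanese and Latin text, which is one of the corner cases mentioned above.

```python
import re

# Assumed character ranges for "Japanese" text: CJK punctuation, hiragana,
# katakana, common CJK ideographs, and fullwidth forms.
_JA = r"\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\uFF00-\uFFEF"


def remove_segmentation_spaces(text: str) -> str:
    """Best-effort removal of spaces inserted by morphological analysis."""
    # Drop the trailing newline and collapse whitespace runs to one space.
    text = re.sub(r"\s+", " ", text.strip())
    # Remove a space that directly follows a Japanese character.
    text = re.sub(rf"(?<=[{_JA}]) ", "", text)
    # Remove a space that directly precedes a Japanese character.
    text = re.sub(rf" (?=[{_JA}])", "", text)
    return text


if __name__ == "__main__":
    # The space inside "iPod nano" is kept; the others are removed.
    print(remove_segmentation_spaces("iPod  nano は 電池"))
```

A more faithful reversal would need the original, unsegmented reviews to decide which spaces were typed by the author, which this sketch does not have.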

I have outlined why this dataset is filling an existing gap in mteb
I have tested that the dataset runs with the mteb package.

I have run the following models on the task (adding the results to the pr). These can be run using the mteb run -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks)

Here are some examples that show the difference between this dataset (JapaneseSentimentClassification) and the Japanese split of the original MultilingualSentimentClassification.

JapaneseSentimentClassification

In [1]: import mteb

In [2]: ja_sent_cls = mteb.get_task("JapaneseSentimentClassification")

In [3]: ja_sent_cls
Out[3]: JapaneseSentimentClassification(name='JapaneseSentimentClassification', languages=['jpn'])

In [4]: ja_sent_cls.load_data()

In [5]: ja_sent_cls.dataset
Out[5]: 
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9831
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1677
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2552
    })
})

In [8]: ja_sent_cls.dataset["test"][500:510]
Out[8]: 
{'text': ['良い商品でしたよ、ツムツムしてる人にはおすすめです！また買います',
  '凄くいい感じです。厚みも気にならないし、買って良かったです。',
  '今使っている物が使えなくなったらのために買いました。早く発送されましたし、梱包も良かったです。',
  '今まで、いくつも延長ケーブルを使っていましたが、これは非常にいいです。がっしり接続しますし、接触不良が皆無です。最近 USB経由でMicroSDカードの変換アダプタを使っていたのですが、よく、数回に1度くらい認識しなかったり、"unkown device"になったり、最悪 "フォーマットしますか"とか出たり、結構、遭遇しました。いくつも、延長コードを替えて試しましたが、これがいい。密着したと言う、すごい安心感があります。家と会社、両方ともこれに置き換えました。お勧めです。',
  '床置きのミドルタワーPCに接続していたUSB2.0ハブ（ケーブル長1.5m）が購入後12年経過して壊れて（寿命と思われます）しまいました。当製品を購入し、ノートPC用として過去に購入済みのUSB2.0ハブ（ケーブル長6cm）を接続し、机上に置いて使用しています。ケーブル長がちょっと長過ぎたため、余分な長さは輪っかにして向かい合う2か所をそれぞれ結束バンドで括りました。 USBハブには、主に、コードレスマウスのレシーバーを接続しています。時々、フライトシミュレイターのジョイスティックを接続します。たまーに、USB外付けHDD、プリンター、iPod nano（第五世代）を接続します。 iPod nanoは電池残量なしの状態からフル充電まで3時間くらいかかりました。使用開始から３ヶ月程度経過したところです。『抜けやすい』とのカスタマーレビューを見かけましたが、届いた製品はしっかり噛んでくれます。USBハブ（コードレスマウスのレシーバーを接続）を接続したまま、ケーブルを持って逆さにして軽く10秒間ほど振っても外れませんでした。横方向に力を加えると少しグラグラしますが、外れることはないです。今のところ、抜き差しはちょうどいい感じです。特に問題ありません。',
  'PCから 1.5mケーブル付のUSBハブ→ 3m延長ケーブル(この商品)→ 1.5mUSBケーブル。という合計6mの長さで3Dプリンターを操作していますが、問題ありません。案外、大丈夫なんですね。',
  'PS3コントローラーを充電中でもいつもの位置で利用するために購入しました。問題なく充電できますし、操作遅延等もありませんでした。',
  '問題なく使えています。適度な弾力があり斜めにつって使用しても９０度折れることはありません。対ノイズ効果については他の製品との違いや具体的な効果が目に見えて理解できてるわけではありませんので？です。',
  'PC本体を机下に置いているため、手元での接続用に購入しました。長すぎるかと思いつつ2ｍを選びましたが、モニタの支柱に1巻きしておけばずり落ち防止にもなるので、余裕のある長さを選んで正解でした。接続部は、邪魔にならずかつ見失うほど小さくもない、という大きさで、机の上での収まりがよいと思います。抜差しは、硬すぎずかつ外れにくい硬さだと思いますが、操作には両手が必要です。接続機器の認識速度はPCへの直挿しより数秒遅い気がします。',
  '接触不良を心配していたけれど、５本購入して１つも異常が無いのがありがたい。'],
 'label': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The Japanese split of MultilingualSentimentClassification

In [1]: import mteb

In [2]: senti = mteb.get_task("MultilingualSentimentClassification", languages=["jpn"])

In [3]: senti
Out[3]: MultilingualSentimentClassification(name='MultilingualSentimentClassification', languages=['jpn'])

In [4]: senti.load_data()

In [5]: senti
Out[5]: MultilingualSentimentClassification(name='MultilingualSentimentClassification', languages=['jpn'])

In [6]: senti.dataset
Out[6]: 
{'jpn': DatasetDict({
     train: Dataset({
         features: ['label', 'text'],
         num_rows: 9831
     })
     test: Dataset({
         features: ['label', 'text'],
         num_rows: 2552
     })
     validation: Dataset({
         features: ['label', 'text'],
         num_rows: 1677
     })
 })}

In [7]: senti.dataset["jpn"]["test"][500:510]
Out[7]: 
{'label': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'text': ['良い 商品 でした よ 、 ツムツム して る 人 に は お すすめ です ！ また 買い ます\n',
  '凄く いい 感じ です 。 厚み も 気 に なら ない し 、 買って 良かった です 。\n',
  '今 使って いる 物 が 使え なく なったら の ため に 買い ました 。  早く 発送 さ れ ました し 、 梱包 も 良かった です 。\n',
  '今 まで 、 いく つ も 延長 ケーブル を 使って い ました が 、 これ は 非常に いい です 。  がっしり 接続 し ます し 、 接触 不良 が 皆無です 。  最近  USB 経由 で MicroSD カード の 変換 アダプタ を 使って いた のです が 、  よく 、 数 回 に 1 度 くらい 認識 し なかったり 、 "unkown  device" に なったり 、  最悪  " フォーマット し ます か " と か 出たり 、 結構 、 遭遇 し ました 。  いく つ も 、 延長 コード を 替えて 試し ました が 、 これ が いい 。  密着 した と 言う 、 すごい 安心 感 が あり ます 。  家 と 会社 、 両方 と も これ に 置き 換え ました 。  お 勧め です 。\n',
  '床 置き の ミドル タワー PC に 接続 して いた USB2.0 ハブ （ ケーブル 長 1.5m ） が  購入 後 12 年 経過 して 壊れて （ 寿命 と 思わ れ ます ） しまい ました 。  当 製品 を 購入 し 、 ノート PC 用 と して 過去 に 購入 済み の USB2.0 ハブ （ ケーブル 長 6cm ） を  接続 し 、 机上 に 置いて 使用 して い ます 。  ケーブル 長 が ちょっと 長 過ぎた ため 、 余分な 長 さ は 輪 っ か に して 向かい合う 2 か 所 を  それぞれ 結束 バンド で 括り ました 。  USB ハブ に は 、 主に 、 コードレス マウス の レシーバー を 接続 して い ます 。  時々 、 フライト シミュレイター の ジョイスティック を 接続 し ます 。  たまーに 、 USB 外 付け HDD 、 プリンター 、 iPod  nano （ 第 五 世 代 ） を 接続 し ます 。  iPod  nano は 電池 残 量 なし の 状態 から フル 充電 まで 3 時間 くらい かかり ました 。  使用 開始 から ３ ヶ月 程度 経過 した ところ です 。  『 抜け やすい 』 と の カスタマー レビュー を 見かけ ました が 、 届いた 製品 は  しっかり 噛んで くれ ます 。 USB ハブ （ コードレス マウス の レシーバー を 接続 ） を 接続 した まま 、  ケーブル を 持って 逆さ に して 軽く 10 秒 間 ほど 振って も 外れ ませ ん でした 。  横 方向 に 力 を 加える と 少し グラグラ し ます が 、 外れる こと は ない です 。  今 の ところ 、 抜き差し は ちょうど いい 感じ です 。  特に 問題 あり ませ ん 。\n',
  'PC から  1.5m ケーブル 付 のUSB ハブ → 3m 延長 ケーブル ( この 商品 ) → 1.5mUSB ケーブル 。  と いう 合計 6m の 長 さ で 3D プリンター を 操作 して い ます が 、 問題 あり ませ ん 。  案外 、 大丈夫な んです ね 。\n',
  'PS3 コント ローラー を 充電 中 でも いつも の 位置 で 利用 する ため に 購入 し ました 。  問題 なく 充電 でき ます し 、 操作 遅延 等 も あり ませ ん でした 。\n',
  '問題 なく 使えて い ます 。 適度な 弾力 が あり 斜めに つって 使用 して も ９０ 度 折れる こと は あり ませ ん 。 対 ノイズ 効果 に ついて は 他の 製品 と の 違い や 具体 的な 効果 が 目に見えて 理解 できて る わけで は あり ませ ん ので ？ です 。\n',
  'PC 本体 を 机 下 に 置いて いる ため 、 手元 で の 接続 用 に 購入 し ました 。 長 すぎる か と 思い つつ 2 ｍ を 選び ました が 、 モニタ の 支柱 に 1 巻き して おけば ずり 落ち 防止 に も なる ので 、 余裕 の ある 長 さ を 選んで 正解 でした 。 接続 部 は 、 邪魔に なら ず かつ 見失う ほど 小さく も ない 、 と いう 大き さ で 、 机 の 上 で の 収まり が よい と 思い ます 。 抜差し は 、 硬 すぎ ず かつ 外れ にくい 硬 さ だ と 思い ます が 、 操作 に は 両手 が 必要です 。 接続 機器 の 認識 速度 は PC へ の 直 挿し より 数 秒 遅い 気 が し ます 。\n',
  '接触 不良 を 心配 して いた けれど 、 ５ 本 購入 して １ つ も 異常 が 無い の が ありがたい 。\n']}
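
The two listings above should contain the same reviews, differing only in whitespace. The following sketch shows one way to check that for a pair of texts; the helper name `same_modulo_whitespace` is hypothetical, and the sample strings are copied from the outputs above rather than loaded from the datasets.

```python
def same_modulo_whitespace(a: str, b: str) -> bool:
    """Return True if the two strings are equal once all whitespace is removed."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # including newlines and the ideographic space.
    return "".join(a.split()) == "".join(b.split())


# (text without spaces, text with spaces), taken from the examples above.
pairs = [
    (
        "凄くいい感じです。厚みも気にならないし、買って良かったです。",
        "凄く いい 感じ です 。 厚み も 気 に なら ない し 、 買って 良かった です 。\n",
    ),
    (
        "接触不良を心配していたけれど、５本購入して１つも異常が無いのがありがたい。",
        "接触 不良 を 心配 して いた けれど 、 ５ 本 購入 して １ つ も 異常 が 無い の が ありがたい 。\n",
    ),
]

all_match = all(same_modulo_whitespace(clean, spaced) for clean, spaced in pairs)
print(all_match)
```

Applied row by row to the full splits, the same comparison would flag any example where the two datasets differ by more than whitespace.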

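We tested several models to show that removing the spaces makes a significant difference to the scores. In the tables below, "with spaces" is the Japanese split of MultilingualSentimentClassification and "without spaces" is JapaneseSentimentClassification.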

evaluation script

import mteb
import torch

from sentence_transformers import SentenceTransformer

model_names = [
    "cl-nagoya/ruri-v3-30m",
    "cl-nagoya/ruri-v3-70m",
    "cl-nagoya/ruri-v3-130m",
    "cl-nagoya/ruri-v3-310m",
    "intfloat/multilingual-e5-small",
    "intfloat/multilingual-e5-base",
    "intfloat/multilingual-e5-large",
    "sbintuitions/sarashina-embedding-v1-1b",
    "pkshatech/GLuCoSE-base-ja-v2",
    "pkshatech/RoSEtta-base-ja",
]

tasks = mteb.get_tasks(tasks=["MultilingualSentimentClassification", "JapaneseSentimentClassification"], languages=["jpn"])

all_results = {}

def evaluate(model_name):
    model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"torch_dtype": torch.bfloat16})
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, encode_kwargs={"batch_size": 4}, output_folder=f"results/{model_name.replace('/', '_')}")
    return results

for model_name in model_names:
    all_results[model_name] = evaluate(model_name)

test accuracy:

model name	with spaces	without spaces
cl-nagoya/ruri-v3-30m	76.80	87.71
cl-nagoya/ruri-v3-70m	80.97	88.47
cl-nagoya/ruri-v3-130m	84.40	89.42
cl-nagoya/ruri-v3-310m	89.13	90.47
intfloat/multilingual-e5-small	72.03	74.97
intfloat/multilingual-e5-base	72.38	78.44
intfloat/multilingual-e5-large	76.97	80.21
sbintuitions/sarashina-embedding-v1-1b	91.74	94.29
pkshatech/GLuCoSE-base-ja-v2	70.31	80.58
pkshatech/RoSEtta-base-ja	65.27	73.28

test f1:

model name	with spaces	without spaces
cl-nagoya/ruri-v3-30m	75.98	87.21
cl-nagoya/ruri-v3-70m	80.35	87.98
cl-nagoya/ruri-v3-130m	83.84	88.94
cl-nagoya/ruri-v3-310m	88.65	90.00
intfloat/multilingual-e5-small	71.46	74.40
intfloat/multilingual-e5-base	71.62	77.79
intfloat/multilingual-e5-large	76.48	79.67
sbintuitions/sarashina-embedding-v1-1b	91.32	93.97
pkshatech/GLuCoSE-base-ja-v2	69.70	80.00
pkshatech/RoSEtta-base-ja	64.74	72.74
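
To make the size of the effect easier to read, the sketch below computes the per-model accuracy gain from removing spaces. The numbers are copied from the test accuracy table above; the variable names are illustrative only.

```python
# model name -> (accuracy with spaces, accuracy without spaces)
accuracy = {
    "cl-nagoya/ruri-v3-30m": (76.80, 87.71),
    "cl-nagoya/ruri-v3-70m": (80.97, 88.47),
    "cl-nagoya/ruri-v3-130m": (84.40, 89.42),
    "cl-nagoya/ruri-v3-310m": (89.13, 90.47),
    "intfloat/multilingual-e5-small": (72.03, 74.97),
    "intfloat/multilingual-e5-base": (72.38, 78.44),
    "intfloat/multilingual-e5-large": (76.97, 80.21),
    "sbintuitions/sarashina-embedding-v1-1b": (91.74, 94.29),
    "pkshatech/GLuCoSE-base-ja-v2": (70.31, 80.58),
    "pkshatech/RoSEtta-base-ja": (65.27, 73.28),
}

# Gain in accuracy points when the unnatural spaces are removed.
gains = {
    model: round(without - with_spaces, 2)
    for model, (with_spaces, without) in accuracy.items()
}

# Print models from largest to smallest gain.
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}\t+{gain:.2f}")

mean_gain = sum(gains.values()) / len(gains)
print(f"mean gain\t+{mean_gain:.2f}")
```

Every model scores higher without spaces, with gains ranging from 1.34 points (cl-nagoya/ruri-v3-310m) to 10.91 points (cl-nagoya/ruri-v3-30m).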

isaac-chung · 2025-07-19T08:25:58Z

@lsz05 thanks for this interesting addition!

Add JapaneseSentimentClassification

0a7fec7

isaac-chung changed the title ~~Add JapaneseSentimentClassification~~ dataset: Add JapaneseSentimentClassification Jul 18, 2025

isaac-chung approved these changes Jul 19, 2025

isaac-chung merged commit 57438c2 into embeddings-benchmark:main Jul 19, 2025
8 of 9 checks passed
