Sudachi Transformers (chiTra)

chiTra provides large-scale pretrained language models and a Japanese tokenizer for Transformers.

chiTra stands for Sudachi Transformers.

Pretrained Model

The data is hosted on AWS through the Open Data Sponsorship Program.

Version	Normalized	SudachiTra	Sudachi	SudachiDict	Text	Pretrained Model
v1.0	normalized_and_surface	v0.1.7	0.6.2	20211220-core	NWJC (109GB)	395 MB (tar.gz)
v1.1	normalized_nouns	v0.1.8	0.6.6	20220729-core	NWJC with additional cleaning (79GB)	396 MB (tar.gz)
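
Each release in the table above was built with specific tool and dictionary versions, so reproducing its tokenization most likely means installing matching versions. The following is a minimal sketch that turns the table into a `pip install` command. The PyPI package names (`sudachitra`, `sudachipy`, `sudachidict_core`) and the helper `pip_command` are assumptions of this example, not something stated in this README.

```python
# Versions copied from the table above; the PyPI package names are assumed.
MODEL_REQUIREMENTS = {
    "v1.0": {"sudachitra": "0.1.7", "sudachipy": "0.6.2", "sudachidict_core": "20211220"},
    "v1.1": {"sudachitra": "0.1.8", "sudachipy": "0.6.6", "sudachidict_core": "20220729"},
}


def pip_command(model_version: str) -> str:
    """Build a pip command that pins the versions listed for a model release."""
    pins = MODEL_REQUIREMENTS[model_version]
    specs = " ".join(f"{name}=={version}" for name, version in pins.items())
    return f"pip install {specs}"


print(pip_command("v1.1"))
# pip install sudachitra==0.1.8 sudachipy==0.6.6 sudachidict_core==20220729
```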

Features

Training on large texts
- Models are trained on the NINJAL Web Japanese Corpus (NWJC) to cover a wide variety of expressions and domains.
Using Sudachi
- The morphological analyzer Sudachi is used to reduce the negative effects of spelling variation.
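
To illustrate why normalization helps with spelling variation, here is a toy sketch. It does not call Sudachi: the table `TOY_NORMALIZED_FORM` is hand-written for illustration, and only the すだち → 酢橘 pair is taken from the Quick Tour output in this README. The point is that variants which would otherwise occupy separate vocabulary entries collapse into one after normalization.

```python
# Hand-written toy table standing in for Sudachi's normalization (hypothetical).
TOY_NORMALIZED_FORM = {"すだち": "酢橘", "スダチ": "酢橘", "酢橘": "酢橘"}


def build_vocab(words, normalize):
    """Assign one integer id per distinct word, optionally after normalization."""
    vocab = {}
    for word in words:
        key = TOY_NORMALIZED_FORM.get(word, word) if normalize else word
        vocab.setdefault(key, len(vocab))
    return vocab


variants = ["すだち", "スダチ", "酢橘"]
surface_vocab = build_vocab(variants, normalize=False)
normalized_vocab = build_vocab(variants, normalize=True)
print(len(surface_vocab), len(normalized_vocab))  # 3 1
```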

How to use chiTra

Quick Tour

Requirements

$ pip install sudachitra
$ wget https://sudachi.s3.ap-northeast-1.amazonaws.com/chitra/chiTra-1.1.tar.gz
$ tar -zxvf chiTra-1.1.tar.gz
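
If `wget` and `tar` are not available, the same download and extraction can be sketched with the Python standard library. Only the `chiTra-1.1.tar.gz` URL comes from this README; that other versions follow the same naming pattern is an assumption, and the helper names `model_url` and `download_and_extract` are made up for this example.

```python
import tarfile
import urllib.request

BASE_URL = "https://sudachi.s3.ap-northeast-1.amazonaws.com/chitra"


def model_url(version: str) -> str:
    """Return the archive URL for a model version such as "1.1" (pattern assumed)."""
    return f"{BASE_URL}/chiTra-{version}.tar.gz"


def download_and_extract(version: str, dest: str = ".") -> str:
    """Download the model archive to a temporary file and unpack it into dest."""
    archive_path, _ = urllib.request.urlretrieve(model_url(version))
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return archive_path


print(model_url("1.1"))
```

Calling `download_and_extract("1.1")` fetches the archive (roughly 400 MB according to the table above) and unpacks it into the current directory.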

Load the model

>>> from sudachitra.tokenization_bert_sudachipy import BertSudachipyTokenizer
>>> from transformers import BertModel

>>> tokenizer = BertSudachipyTokenizer.from_pretrained('chiTra-1.1')
>>> tokenizer.tokenize("選挙管理委員会とすだち")
['選挙', '##管理', '##委員会', 'と', '酢', '##橘']

>>> model = BertModel.from_pretrained('chiTra-1.1')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state
tensor([[[ 0.8583, -1.1752, -0.7987,  ..., -1.1691, -0.8355,  3.4678],
         [ 0.0220,  1.1702, -2.3334,  ...,  0.6673, -2.0774,  2.7731],
         [ 0.0894, -1.3009,  3.4650,  ..., -0.1140,  0.1767,  1.9859],
         ...,
         [-0.4429, -1.6267, -2.1493,  ..., -1.7801, -1.8009,  2.5343],
         [ 1.7204, -1.0540, -0.4362,  ..., -0.0228,  0.5622,  2.5800],
         [ 1.1125, -0.3986,  1.8532,  ..., -0.8021, -1.5888,  2.9520]]],
       grad_fn=<NativeLayerNormBackward0>)
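
The token list above uses a `##` prefix on some pieces. Assuming this follows the usual WordPiece convention, where `##` marks a piece that continues the preceding word, the pieces can be joined back into words with a small helper. `merge_subwords` is a hypothetical name for this sketch and is not part of sudachitra.

```python
def merge_subwords(tokens):
    """Join WordPiece-style tokens, attaching each '##' piece to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


tokens = ['選挙', '##管理', '##委員会', 'と', '酢', '##橘']
print(merge_subwords(tokens))  # ['選挙管理委員会', 'と', '酢橘']
```

Note that the merged result contains 酢橘 although the input text was すだち, so the words recovered this way reflect the tokenizer's output, not necessarily the original spelling.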

Installation

$ pip install sudachitra

The default Sudachi dictionary is SudachiDict-core.

You can also use other dictionaries, such as SudachiDict-small and SudachiDict-full.
In that case, install the dictionaries you want to use as shown below.
Note that the pretrained models were trained with SudachiDict-core.

$ pip install sudachidict_small sudachidict_full
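
To check which dictionary editions are available in the current environment, a standard-library sketch such as the following may help. It assumes the dictionaries are importable under the module names `sudachidict_small`, `sudachidict_core`, and `sudachidict_full`; `installed_sudachi_dicts` is a hypothetical helper for this example.

```python
from importlib.util import find_spec

# Assumed import names of the SudachiDict packages.
DICT_MODULES = {
    "small": "sudachidict_small",
    "core": "sudachidict_core",
    "full": "sudachidict_full",
}


def installed_sudachi_dicts():
    """Return the dictionary editions whose Python package can be found."""
    return [name for name, module in DICT_MODULES.items() if find_spec(module) is not None]


print(installed_sudachi_dicts())
```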

Pretraining

For details on pretraining, see pretraining/bert/README.md.

For Developers

TBD

License

"chiTra" is distributed by the National Institute for Japanese Language and Linguistics and Works Applications Co., Ltd. under the Apache License, Version 2.0.

Contact us

For questions and discussion, open an issue or use our Slack workspace.

The Slack workspace for developers and users is at https://sudachi-dev.slack.com/ (an invitation is required).

Citing chiTra

We have published the following paper about chiTra:

勝田哲弘, 林政義, 山村崇, Tolmachev Arseny, 高岡一馬, 内田佳孝, 浅原正幸, 単語正規化による表記ゆれに頑健な BERT モデルの構築. 言語処理学会第28回年次大会, 2022.

When citing chiTra in papers, books, or services, please use the following BibTeX entry:

@INPROCEEDINGS{katsuta2022chitra,
    author    = {勝田哲弘 and 林政義 and 山村崇 and Tolmachev Arseny and 高岡一馬 and 内田佳孝 and 浅原正幸},
    title     = {単語正規化による表記ゆれに頑健な BERT モデルの構築},
    booktitle = "言語処理学会第28回年次大会(NLP2022)",
    year      = "2022",
    pages     = "",
    publisher = "言語処理学会",
}

Models used in the experiments

The models used in the experiments for "単語正規化による表記ゆれに頑健なBERTモデルの構築" are available below.

Normalized	Text	Pretrained Model
surface	Wiki-40B	tar.gz
normalized_and_surface	Wiki-40B	tar.gz
normalized_conjugation	Wiki-40B	tar.gz
normalized	Wiki-40B	tar.gz

Enjoy chiTra!

