Improve kanji-to-hiragana conversion #32
Differences between Janome and pykakasi
Notes: This diff cannot detect cases where both Janome and pykakasi are wrong ❗
$ cat diff_janome_pykakasi.txt | wc -l
503 (501 + Markdown table header 2 lines)
$ cat diff_janome_pykakasi.uniq.txt | wc -l
283 (281 + Markdown table header 2 lines)
$ rg ':white_check_mark:' diff_janome_pykakasi.uniq.txt | wc -l
179
$ rg ':heavy_check_mark:' diff_janome_pykakasi.uniq.txt | wc -l
17
$ rg ':x:' diff_janome_pykakasi.uniq.txt | wc -l
85
$ python
>>> (179 + 17) / 283
0.692
diff_janome_pykakasi.txt (collapsed)
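The ratio above divides by 283, which still includes the two Markdown table header lines, while the three marker counts (179 + 17 + 85) sum to 281. A minimal sketch of the same arithmetic with the header lines excluded follows. `agreement_rate` is a hypothetical helper name, and using the `:white_check_mark:` and `:heavy_check_mark:` counts as the numerator simply mirrors the calculation in the session:

```python
def agreement_rate(white_check: int, heavy_check: int,
                   total_lines: int, header_lines: int = 2) -> float:
    """Share of table rows marked with either check mark.

    header_lines is an assumption taken from the "+ 2 header lines" notes.
    """
    rows = total_lines - header_lines
    if rows <= 0:
        raise ValueError("total_lines must exceed header_lines")
    return (white_check + heavy_check) / rows


# Ratio as computed in the session (denominator includes the header lines).
rate_with_header = (179 + 17) / 283
# Same counts, header lines excluded from the denominator.
rate_without_header = agreement_rate(179, 17, 283)

print(f"{rate_with_header:.3f} {rate_without_header:.3f}")
```

The difference is small here, but excluding the header keeps the denominator equal to the sum of the three marker counts.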
cf. (very experimental) How to build janome with the NEologd dictionary bundled · mocobeta/janome Wiki
Setup and steps for building Janome with neologd (the build takes a long time).
FROM python:3.8-slim-buster
ENV DEBIAN_FRONTEND=noninteractive
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
RUN apt-get update && \
    apt-get -y install --no-install-recommends \
    wget \
    jq \
    git curl make zip xz-utils file sudo mecab libmecab-dev mecab-ipadic-utf8 && \
    apt-get autoclean && \
    apt-get clean && \
    apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*
RUN git clone --depth=1 https://github.com/neologd/mecab-ipadic-neologd.git /build/mecab-ipadic-neologd
WORKDIR /build/mecab-ipadic-neologd
RUN ./bin/install-mecab-ipadic-neologd -n -a -y
WORKDIR /build
COPY requirements.txt ./requirements.txt
RUN python3 -m pip install --no-cache-dir --upgrade pip && \
    python3 -m pip install --no-cache-dir -r ./requirements.txt && \
    python3 -m pip check
RUN git clone --depth=1 https://github.com/mocobeta/janome.git -b 0.4.1 /build/janome
WORKDIR /build/janome
RUN python setup.py develop && \
    cd ipadic && \
    ./build.sh /build/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-???????? utf8
RUN rm -rf /build
RUN wget -q 'https://raw.githubusercontent.com/yagays/emoji-ja/20190726/data/emoji_ja.json' \
    -O /root/emoji_ja.json
WORKDIR /src
cf. mecab-ipadic-neologd/README.ja.md at master · neologd/mecab-ipadic-neologd
docker run --rm -i -t --cpus 6 -v ${PWD}:/src docker.pkg.github.com/peaceiris/emoji-ime-dictionary/emoji-dev:latest bash
apt-get update && apt-get -y install --no-install-recommends wget jq git curl make zip xz-utils file patch sudo mecab libmecab-dev mecab-ipadic-utf8
git clone --depth=1 https://github.com/neologd/mecab-ipadic-neologd.git /build/mecab-ipadic-neologd
cd /build/mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n -a -y
git clone --depth=1 https://github.com/mocobeta/janome.git -b 0.4.1 /build/janome
cd /build/janome
python setup.py develop && cd ipadic
./build.sh /build/mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-???????? utf8 &
jobs
# Detach from the container and wait, about 1.5 h
vim janome/version.py
# JANOME_VERSION = '0.4.1-neologd-20200910'
python setup.py sdist
cp dist/Janome-0.4.1-neologd-20200910.tar.gz /src/
pip install dist/Janome-0.4.1-neologd-20200910.tar.gz --no-compile
$ echo "異星人" | janome
異 接頭詞,名詞接続,*,*,*,*,異,イ,イ
星人 名詞,固有名詞,人名,名,*,*,星人,セイジン,セイジン
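For the kanji-to-hiragana step this issue is about, the reading is the eighth comma-separated feature in the output above (イ, セイジン). A minimal sketch, assuming the surface form and the feature string are separated by whitespace as shown, that pulls the reading out of each line and shifts katakana to hiragana. `kata_to_hira` and `reading_from_line` are hypothetical helper names, not Janome APIs:

```python
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks


def kata_to_hira(text: str) -> str:
    """Shift katakana code points to hiragana; leave everything else as-is."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


def reading_from_line(line: str) -> str:
    """Return the hiragana reading for one 'surface features' output line."""
    surface, features = line.split(None, 1)
    fields = features.strip().split(",")
    # Field 7 (0-based) is the reading in the lines shown above; fall back to
    # the surface form when it is missing or '*'.
    reading = fields[7] if len(fields) > 7 and fields[7] != "*" else surface
    return kata_to_hira(reading)


lines = [
    "異 接頭詞,名詞接続,*,*,*,*,異,イ,イ",
    "星人 名詞,固有名詞,人名,名,*,*,星人,セイジン,セイジン",
]
hiragana = "".join(reading_from_line(line) for line in lines)
print(hiragana)
```

With the two lines above this prints いせいじん. The long vowel mark ー sits outside the shifted range and is passed through unchanged.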
Differences between Janome + neologd and pykakasi
$ cat diff_janome-neologd_pykakasi.txt | wc -l
593 (591 + Markdown table header 2 lines)
$ cat diff_janome-neologd_pykakasi.txt | uniq > diff_janome-neologd_pykakasi.uniq.txt
$ cat diff_janome-neologd_pykakasi.uniq.txt | wc -l
328
$ rg ':white_check_mark:' diff_janome-neologd_pykakasi.uniq.txt | wc -l
203 (201 + Markdown table header 2 lines)
$ rg ':heavy_check_mark:' diff_janome-neologd_pykakasi.uniq.txt | wc -l
58
$ rg ':x:' diff_janome-neologd_pykakasi.uniq.txt | wc -l
65
$ python
>>> (201 + 58) / 328
0.789
diff_janome-neologd_pykakasi.uniq.txt (collapsed)
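Two observations on the session above. First, `uniq` only drops adjacent duplicate lines, so the count of 328 depends on line order. Second, the reported marker counts (203 + 58 + 65) sum to 326, which is 328 minus two header lines, so the "(201 + ...)" note may be attached to the wrong line; that is a guess, not something the issue confirms. A minimal Python sketch of the same pipeline (adjacent-duplicate removal, then counting lines that contain each marker, as `rg ... | wc -l` does) is below. The function names and the sample rows are hypothetical, since the real table columns are not shown here:

```python
from collections import Counter
from itertools import groupby

MARKERS = (":white_check_mark:", ":heavy_check_mark:", ":x:")


def uniq(lines):
    """Drop adjacent duplicates only, like the `uniq` command."""
    return [key for key, _group in groupby(lines)]


def count_markers(lines):
    """Count lines containing each marker (one count per matching line)."""
    counts = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Hypothetical sample rows; the real diff table layout is an assumption.
sample = [
    "| a | b | verdict |",
    "| --- | --- | --- |",
    "| row1 | row1 | :white_check_mark: |",
    "| row1 | row1 | :white_check_mark: |",
    "| row2 | row2 | :x: |",
    "| row1 | row1 | :white_check_mark: |",
]
deduped = uniq(sample)
counts = count_markers(deduped)
print(len(deduped), dict(counts))
```

In the sample, the last `row1` line survives because it is not adjacent to the earlier copies. Sorting first (the equivalent of `sort | uniq`) would remove it, at the cost of moving the header lines.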
Conclusion
TODO
#31 was closed, #40 opened.