Algorithms and models for Bilibili Search Engine (blbl.top).
Train sentencepiece model from video texts.
python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -ec -vs 400000 -cc 0.9995 -sf 0.9 -e
or use train.sh with pre-grouped regions and pre-defined input-output paths:
./models/sentencepiece/train.sh 1
./models/sentencepiece/train.sh 2
Test:
python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -t
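The model name in the commands above appears to encode the training settings (sp_507m_400k_0.9995_0.9 next to -vs 400000 -cc 0.9995 -sf 0.9). A minimal sketch of decoding such a name follows. It assumes the name follows sp_<corpus>_<vocab>_<coverage>_<shrink>, and that -vs, -cc and -sf correspond to the sentencepiece trainer options vocab_size, character_coverage and shrinking_factor. The helper parse_sp_model_name is hypothetical and not part of this repo.

```python
def parse_sp_model_name(name: str) -> dict:
    """Decode a name like 'sp_507m_400k_0.9995_0.9' (assumed naming convention)."""
    prefix, _corpus_tag, vocab_tag, coverage, shrink = name.split("_")
    if prefix != "sp":
        raise ValueError(f"unexpected prefix: {prefix!r}")
    # '400k' -> 400000; plain digits pass through unchanged
    scale = {"k": 1_000, "m": 1_000_000}
    if vocab_tag[-1] in scale:
        vocab_size = int(vocab_tag[:-1]) * scale[vocab_tag[-1]]
    else:
        vocab_size = int(vocab_tag)
    return {
        "model_prefix": name,
        "vocab_size": vocab_size,
        "character_coverage": float(coverage),
        "shrinking_factor": float(shrink),
    }


opts = parse_sp_model_name("sp_507m_400k_0.9995_0.9")
print(opts)
# With sentencepiece installed, these options could be passed on as:
#   sentencepiece.SentencePieceTrainer.train(input="corpus.txt", **opts)
# ("corpus.txt" is a placeholder path, not a file from this repo)
```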
Merge sentencepiece models which are trained on different data_utils.
python -m models.sentencepiece.merge
or with more params:
python -m models.sentencepiece.merge -vs 1000000 -i sp_518m_ -o sp_merged
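How models.sentencepiece.merge combines the models is not described here. One plausible approach, shown only as a sketch, is to take the union of the vocabularies, keep the best score for each piece, and truncate to the requested size (the role assumed for -vs). The function merge_vocabs and the sample pieces and scores are illustrative assumptions.

```python
def merge_vocabs(vocabs, vocab_size):
    """Union several (piece, score) vocabularies, keeping the top `vocab_size` pieces."""
    best = {}
    for vocab in vocabs:
        for piece, score in vocab:
            # Keep the highest score seen for each piece
            if piece not in best or score > best[piece]:
                best[piece] = score
    # Sort by score (descending), then by piece for a stable order
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:vocab_size]


# Toy vocabularies standing in for two region-specific models
region_a = [("游戏", -3.2), ("视频", -4.0), ("原神", -5.1)]
region_b = [("视频", -3.5), ("音乐", -4.4)]
merged = merge_vocabs([region_a, region_b], vocab_size=3)
print(merged)
```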
Tokenize video texts from database, and save to parquets. Used by data_utils.videos.freq and models.fasttext.train.
python -m data_utils.videos.cache -ec -dn video_texts_tid_all -fw 200 -bw 100 -bs 10000
or use cache.sh with pre-grouped regions and pre-defined input-output paths:
./data_utils/videos/cache.sh 1
./data_utils/videos/cache.sh 2
Count video term freqs from database or parquets (-ds) with dataset name (-dn) and filters (-td, -pd, -fg), and save to csv and pickle with prefix (-o). Used by models.fasttext.train.
Specify region tid:
python -m data_utils.videos.freq -o video_texts_freq_tid_17_nt -dn "video_texts_tid_17" -td 17 -nt
All regions:
python -m data_utils.videos.freq -o video_texts_freq_tid_all_nt -dn "video_texts_tid_all" -nt
or use freq.sh with pre-grouped regions and pre-defined input-output paths:
./data_utils/videos/freq.sh 1
./data_utils/videos/freq.sh 2
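The counting step described above can be sketched with the standard library alone. This is not the code in data_utils.videos.freq: the function names, the token/count column names and the .pkl extension are assumptions, and the input is a toy list of already-tokenized texts rather than a database or parquet source.

```python
import csv
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def count_term_freqs(token_lists):
    """Count how often each token occurs across all tokenized texts."""
    freqs = Counter()
    for tokens in token_lists:
        freqs.update(tokens)
    return freqs


def save_freqs(freqs, output_prefix, out_dir="."):
    """Write <prefix>.csv (sorted by count) and <prefix>.pkl (plain dict)."""
    out_dir = Path(out_dir)
    csv_path = out_dir / f"{output_prefix}.csv"
    pkl_path = out_dir / f"{output_prefix}.pkl"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["token", "count"])  # assumed column names
        writer.writerows(freqs.most_common())
    with open(pkl_path, "wb") as f:
        pickle.dump(dict(freqs), f)
    return csv_path, pkl_path


docs = [["原神", "攻略"], ["原神", "音乐"]]
freqs = count_term_freqs(docs)
with tempfile.TemporaryDirectory() as tmp:
    csv_path, pkl_path = save_freqs(freqs, "video_texts_freq_demo", out_dir=tmp)
    with open(pkl_path, "rb") as f:
        reloaded = pickle.load(f)
print(reloaded)
```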
Merge token freqs with total vocab size limit (-mv) from different regions (generated by freq.sh), and save to csv with prefix (-o).
python -m models.fasttext.vocab -mv 1000000 -o merged_video_texts
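A minimal sketch of what such a merge could look like: sum the per-region counts and keep only the most frequent tokens up to the limit. Whether models.fasttext.vocab sums counts or applies another rule is an assumption here, and merge_freqs and the sample counts are illustrative only.

```python
from collections import Counter


def merge_freqs(region_freqs, max_vocab):
    """Sum token counts across regions and keep the `max_vocab` most frequent."""
    total = Counter()
    for freqs in region_freqs:
        total.update(freqs)  # adds counts for tokens seen in several regions
    return total.most_common(max_vocab)


# Toy per-region counts standing in for the csv files produced per region
game = {"原神": 5, "攻略": 2}
music = {"音乐": 4, "原神": 1}
merged_freqs = merge_freqs([game, music], max_vocab=2)
print(merged_freqs)
```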
Train FastText model from video texts, with (merged) vocab freqs (.csv) and cached tokens (.parquet).
python -m models.fasttext.train -m fasttext_other_game_vf_merged_csv -ep 1 -dr "parquets" -dn "video_texts_other_game" -vf "merged_video_texts" -vl csv -bs 20000 -mv 900000
Add pos tags to original tokens freqs data (merged_video_texts.csv), and save to new csv file (merged_video_texts_pos.csv).
python -m models.fasttext.pos
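The tagging step amounts to appending one column to the freqs csv. The sketch below assumes the csv has a token column and takes the tagger as a plain callable; the real tagger used by models.fasttext.pos is not named here, so a toy lookup table stands in for it. The function add_pos_column and the tag values are hypothetical.

```python
import csv
import io


def add_pos_column(src, dst, tag_token):
    """Copy csv rows from `src` to `dst`, appending a 'pos' column per token."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + ["pos"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["pos"] = tag_token(row["token"])  # assumes a 'token' column
        writer.writerow(row)


# In-memory stand-ins for merged_video_texts.csv and merged_video_texts_pos.csv
src = io.StringIO("token,count\n原神,6\n音乐,4\n")
dst = io.StringIO()
toy_tags = {"原神": "nz", "音乐": "n"}  # placeholder tags, not real tagger output
add_pos_column(src, dst, lambda token: toy_tags.get(token, "x"))
result = dst.getvalue()
print(result)
```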
Run fasttext model (word or doc) as server or client.
-m: model_prefix, default is fasttext_merged
-v: vocab_limit, default is 150000
-w: vector weighted, must be specified to enable
-r: run as server
-ms: model class, default is word, but doc is recommended for most cases
-l: list models
-t: test model
-tc: test client
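For reference, the options listed above can be mirrored with argparse as follows. The flags and defaults are taken from the list; the dest names and the build_parser function are assumptions and may differ from the actual parser in models.fasttext.run.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (dest names are assumed)."""
    parser = argparse.ArgumentParser(prog="models.fasttext.run")
    parser.add_argument("-m", dest="model_prefix", default="fasttext_merged")
    parser.add_argument("-v", dest="vocab_limit", type=int, default=150000)
    parser.add_argument("-w", dest="vector_weighted", action="store_true")
    parser.add_argument("-r", dest="run_server", action="store_true")
    parser.add_argument("-ms", dest="model_class", choices=["word", "doc"], default="word")
    parser.add_argument("-l", dest="list_models", action="store_true")
    parser.add_argument("-t", dest="test", action="store_true")
    parser.add_argument("-tc", dest="test_client", action="store_true")
    return parser


# Same arguments as a "run as server" invocation with the doc model class
args = build_parser().parse_args("-r -ms doc -m fasttext_merged -v 150000 -w".split())
defaults = build_parser().parse_args([])
print(args)
```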
List models:
python -m models.fasttext.run -l
Test models:
python -m models.fasttext.run -t -m fasttext_tid_all_mv_60w -v 150000
Run as server:
python -m models.fasttext.run -r -m fasttext_merged -v 150000 -w
python -m models.fasttext.run -r -ms doc -m fasttext_merged -v 150000 -w
Test remote client:
python -m models.fasttext.run -tc
python -m models.fasttext.run -tc -ms doc