NLP algorithms and models used in Bilibili Search Engine

Hansimov/bili-search-algo

bili-search-algo

Algorithms and models for Bilibili Search Engine (blbl.top).

models.sentencepiece.train

Train a SentencePiece model from video texts.

python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -ec -vs 400000 -cc 0.9995 -sf 0.9 -e

or use train.sh with pre-grouped regions and pre-defined input-output paths:

./models/sentencepiece/train.sh 1
./models/sentencepiece/train.sh 2

Test:

python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -t

models.sentencepiece.merge

Merge SentencePiece models that were trained on different datasets.

python -m models.sentencepiece.merge

or with more params:

python -m models.sentencepiece.merge -vs 1000000 -i sp_518m_ -o sp_merged

data_utils.videos.cache

Tokenize video texts from the database and save them as Parquet files. The output is used by data_utils.videos.freq and models.fasttext.train.

python -m data_utils.videos.cache -ec -dn video_texts_tid_all -fw 200 -bw 100 -bs 10000

or use cache.sh with pre-grouped regions and pre-defined input-output paths:

./data_utils/videos/cache.sh 1
./data_utils/videos/cache.sh 2
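
The caching step can be pictured as a batched tokenize loop. The sketch below is only an illustration under stated assumptions: it treats -bs as a batch size, uses a whitespace-splitting `tokenize` stand-in instead of the repository's tokenizer, and keeps results in memory rather than writing Parquet files. The function names are hypothetical and do not come from the repository.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def tokenize(text: str) -> List[str]:
    # Hypothetical stand-in: the real project uses its own tokenizer.
    return text.split()


def cache_tokens(texts: Iterable[str], batch_size: int = 10000) -> List[List[List[str]]]:
    """Tokenize texts batch by batch.

    In a real pipeline each inner batch would be written out
    (for example as one Parquet row group) instead of being returned.
    """
    return [[tokenize(text) for text in batch] for batch in batched(texts, batch_size)]
```

Processing in batches keeps memory use bounded when the source is a large database cursor rather than an in-memory list.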

data_utils.videos.freq

Count video term frequencies from the database or Parquet files (-ds), selected by dataset name (-dn) and filters (-td, -pd, -fg), and save the results to CSV and pickle files with the given prefix (-o). The output is used by models.fasttext.train.

For a specific region (tid):

python -m data_utils.videos.freq -o video_texts_freq_tid_17_nt -dn "video_texts_tid_17" -td 17 -nt

All regions:

python -m data_utils.videos.freq -o video_texts_freq_tid_all_nt -dn "video_texts_tid_all" -nt

or use freq.sh with pre-grouped regions and pre-defined input-output paths:

./data_utils/videos/freq.sh 1
./data_utils/videos/freq.sh 2
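
Conceptually, the counting step reduces to tallying tokens and writing them out sorted by count. A minimal sketch with the standard library follows; the column names `token` and `freq` and both function names are assumptions for illustration and may not match the files the module actually produces.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List


def count_term_freqs(token_lists: Iterable[List[str]]) -> Counter:
    """Count how often each token occurs across all tokenized texts."""
    freqs = Counter()
    for tokens in token_lists:
        freqs.update(tokens)
    return freqs


def freqs_to_csv(freqs: Counter) -> str:
    """Render the counts as CSV text, most frequent token first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["token", "freq"])  # assumed column names
    writer.writerows(freqs.most_common())
    return buffer.getvalue()
```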

models.fasttext.vocab

Merge token frequencies from different regions (generated by freq.sh), capped at a total vocabulary size (-mv), and save the result to a CSV file with the given prefix (-o).

python -m models.fasttext.vocab -mv 1000000 -o merged_video_texts
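
As a rough picture of what merging with a size limit means, the sketch below sums per-region counts and keeps the most frequent tokens. It assumes plain summation and a hard top-N cut; the actual module may weight regions or break ties differently, and `merge_freqs` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def merge_freqs(
    region_freqs: Iterable[Dict[str, int]], max_vocab: int
) -> List[Tuple[str, int]]:
    """Sum token counts across regions and keep the max_vocab most frequent."""
    merged = Counter()
    for freqs in region_freqs:
        merged.update(freqs)  # adds counts for tokens seen in several regions
    return merged.most_common(max_vocab)
```

For example, merging `{"boss": 5, "speedrun": 3}` with `{"cover": 4, "boss": 1}` and a limit of 2 keeps `boss` (6) and `cover` (4) and drops `speedrun`.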

models.fasttext.train

Train a FastText model from video texts, using the (merged) vocabulary frequencies (.csv) and cached tokens (.parquet).

python -m models.fasttext.train -m fasttext_other_game_vf_merged_csv -ep 1 -dr "parquets" -dn "video_texts_other_game" -vf "merged_video_texts" -vl csv -bs 20000 -mv 900000

models.fasttext.pos

Add part-of-speech (POS) tags to the original token frequency data (merged_video_texts.csv) and save the result to a new CSV file (merged_video_texts_pos.csv).

python -m models.fasttext.pos

models.fasttext.run

Run a FastText model (word or doc) as a server or client.

  • -m: model prefix; default is fasttext_merged
  • -v: vocab limit; default is 150000
  • -w: use weighted vectors; disabled unless specified
  • -r: run as server
  • -ms: model class; default is word, but doc is recommended for most cases
  • -l: list models
  • -t: test model
  • -tc: test client

List models:

python -m models.fasttext.run -l

Test models:

python -m models.fasttext.run -t -m fasttext_tid_all_mv_60w -v 150000

Run as server:

python -m models.fasttext.run -r -m fasttext_merged -v 150000 -w
python -m models.fasttext.run -r -ms doc -m fasttext_merged -v 150000 -w

Test remote client:

python -m models.fasttext.run -tc
python -m models.fasttext.run -tc -ms doc
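
One common way to turn word vectors into a document vector is a weighted average, which is a plausible reading of the doc model class combined with -w. The sketch below assumes that reading; the per-token `weights` mapping and the `doc_vector` function are illustrative assumptions, not the repository's implementation. It also simply skips unknown tokens, whereas FastText itself can compose vectors for unseen words from subwords.

```python
from typing import Dict, List, Optional, Sequence


def doc_vector(
    tokens: Sequence[str],
    word_vectors: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> Optional[List[float]]:
    """Average the vectors of known tokens, optionally weighted per token.

    Returns None when no token has a vector.
    """
    total: Optional[List[float]] = None
    weight_sum = 0.0
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # skip tokens without a vector in this sketch
        weight = 1.0 if weights is None else weights.get(token, 1.0)
        if total is None:
            total = [0.0] * len(vector)
        for i, value in enumerate(vector):
            total[i] += weight * value
        weight_sum += weight
    if total is None or weight_sum == 0.0:
        return None
    return [value / weight_sum for value in total]
```

Weighting lets rare, informative tokens contribute more to the document vector than very common ones, for instance by deriving the weights from the token frequency CSV.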
