This project aims to optimize the inference speed of BERT models deployed on CPU with PyTorch in Python, while also supporting C++ projects.
Here's a blog post (written in Chinese) that describes a real-world case of using this project to optimize inference speed.
To replace PyTorch inference in Python via the dynamic library:

- Make sure Rust is installed (`curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh`, and check it with `rustc -V`).
- Convert the PyTorch checkpoint file and configs to a ggml file:

  ```sh
  cd scripts/
  python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  ```

- Make sure `tokenizer.json` exists; otherwise execute:

  ```sh
  cd scripts/
  python generate_tokenizer_json.py ${dir_hf_model}
  ```

- Build the dynamic library (`libbert_shared.so`):

  ```sh
  git submodule update --init --recursive
  mkdir build
  cd build/
  cmake ..
  make
  ```

- Refer to `examples/sample_dylib.py` and replace your PyTorch inference with it; a minimal calling sketch follows this list.
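For orientation, here is a minimal sketch of driving the dynamic library from Python with `ctypes`. The exported symbol names, signatures, and paths below are assumptions for illustration only; `examples/sample_dylib.py` is the authoritative reference.

```python
# Hypothetical sketch: load libbert_shared.so and run one forward pass.
# All symbol names and signatures here are assumed, not the project's
# real API; consult examples/sample_dylib.py for the actual interface.
import ctypes

lib = ctypes.CDLL("./build/libbert_shared.so")

# assumed exports: a loader taking the ggml model path, and an encoder
lib.bert_load.argtypes = [ctypes.c_char_p, ctypes.c_int]   # path, n_threads
lib.bert_load.restype = ctypes.c_void_p                    # opaque handle
lib.bert_encode.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                            ctypes.POINTER(ctypes.c_float)]
lib.bert_encode.restype = ctypes.c_int                     # embedding size

ctx = lib.bert_load(b"/path/to/model.ggml", 4)
out = (ctypes.c_float * 768)()                             # assumes 768-dim BERT
lib.bert_encode(ctx, "a sample sentence".encode("utf-8"), out)
print(list(out[:4]))
```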
To use the library in a C++ project:

- Make sure Rust is installed (`curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh`, and check it with `rustc -V`).
- Convert the PyTorch checkpoint file and configs to a ggml file:

  ```sh
  cd scripts/
  python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  ```

- Make sure `tokenizer.json` exists; otherwise execute:

  ```sh
  cd scripts/
  python generate_tokenizer_json.py ${dir_hf_model}
  ```

- Add this project as a submodule and include it via `add_subdirectory` in your CMake project. You also need to turn on C++17 support; you can then link the library. A sketch of a consuming `CMakeLists.txt` is shown below.
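A minimal sketch of the consuming `CMakeLists.txt`. The submodule path and the exported target name (`bert_shared`) are assumptions; check this project's own `CMakeLists.txt` for the actual target.

```cmake
# Sketch only: the path and the `bert_shared` target name are assumed.
cmake_minimum_required(VERSION 3.14)
project(my_app CXX)

set(CMAKE_CXX_STANDARD 17)             # this project requires C++17
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# wherever you added the submodule
add_subdirectory(third_party/bert_cpp)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE bert_shared)
```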
| Tokenizer type | Cost time |
| --- | --- |
| transformers.BertTokenizer (Python) | 734 ms |
| tokenizers-cpp (Rust binding) | 3 ms |
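Tokenizer timings like these can be reproduced with a simple harness along the following lines; the model name, sentences, and batch size are assumptions, not the project's actual benchmark setup.

```python
# Rough timing harness (assumed setup, not the project's benchmark script).
import time
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed model
sentences = ["A short sample sentence."] * 50             # assumed workload

start = time.perf_counter()
tok(sentences, padding=True, truncation=True)
print(f"{(time.perf_counter() - start) * 1e3:.0f} ms")
```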
| Type | Cost time |
| --- | --- |
| Python, w/o loading | 248 ms |
| C++ & Rust, w/o loading (n_thread=4) | 2 ms |
| Python, w/ loading | 1104 ms |
| C++ & Rust, w/ loading (n_thread=4) | 19 ms |
| Type | Cost time |
| --- | --- |
| Python (batch_size=50) | 260 ms |
| C++ & Rust (batch_size=50, n_thread=8) | 23 ms |
ggml's performance gets worse as sentence length increases; on long sentences the Python and C++ & Rust pipelines take roughly the same time:

| Type | Cost time |
| --- | --- |
| Python & C++ & Rust (batch_size=50, n_thread=8) | 26 ms |
- Use broadcasting instead of `ggml.repeat` (WIP); the idea is sketched below.
- Update the ggml format to gguf.
- Implement a Python binding instead of the dynamic library.
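To illustrate the first item: `ggml.repeat` materializes a full tiled copy of the smaller tensor before the elementwise op, while broadcasting computes the same result without that copy. A NumPy analogy (not ggml code; shapes are illustrative only):

```python
# NumPy analogy of repeat-vs-broadcast; shapes are illustrative only.
import numpy as np

h = np.random.rand(50, 128, 768)    # (batch, seq_len, hidden) activations
b = np.random.rand(768)             # e.g. a bias vector

tiled = np.tile(b, (50, 128, 1))    # explicit repeat: allocates a full
out_repeat = h + tiled              # (50, 128, 768) copy of the bias

out_broadcast = h + b               # broadcasting: no materialized copy

assert np.allclose(out_repeat, out_broadcast)
```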
Thanks to the projects we rely on or refer to.