This project aims to optimize the inference speed of BERT models deployed on CPU with PyTorch in Python, while also supporting C++ projects.
Here's a blog post (written in Chinese) that describes a real-world case of using this project to optimize inference speed.
To replace PyTorch inference in Python via the dynamic library:

- Make sure Rust is installed (`curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh`, and check it with `rustc -V`).
- Convert the PyTorch checkpoint file and configs to a ggml file:

  ```sh
  cd scripts/
  python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  ```

- Make sure `tokenizer.json` exists; otherwise execute:

  ```sh
  cd scripts/
  python generate_tokenizer_json.py ${dir_hf_model}
  ```

- Build the dynamic library (`libbert_shared.so`):

  ```sh
  git submodule update --init --recursive
  mkdir build
  cd build/
  cmake ..
  make
  ```

- Refer to `examples/sample_dylib.py` and replace your PyTorch inference with it; a minimal calling sketch follows this list.
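For orientation, here is a minimal sketch of driving the dynamic library from Python with `ctypes`. The exported symbol names, signatures, and paths below are assumptions for illustration only; `examples/sample_dylib.py` is the authoritative reference.

```python
# Hypothetical sketch: load libbert_shared.so and run one forward pass.
# All symbol names and signatures here are assumed, not the project's
# real API; consult examples/sample_dylib.py for the actual interface.
import ctypes

lib = ctypes.CDLL("./build/libbert_shared.so")

# assumed exports: a loader taking the ggml model path, and an encoder
lib.bert_load.argtypes = [ctypes.c_char_p, ctypes.c_int]   # path, n_threads
lib.bert_load.restype = ctypes.c_void_p                    # opaque handle
lib.bert_encode.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                            ctypes.POINTER(ctypes.c_float)]
lib.bert_encode.restype = ctypes.c_int                     # embedding size

ctx = lib.bert_load(b"/path/to/model.ggml", 4)
out = (ctypes.c_float * 768)()                             # assumes 768-dim BERT
lib.bert_encode(ctx, "a sample sentence".encode("utf-8"), out)
print(list(out[:4]))
```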
To use the library in a C++ project:

- Make sure Rust is installed (`curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh`, and check it with `rustc -V`).
- Convert the PyTorch checkpoint file and configs to a ggml file:

  ```sh
  cd scripts/
  python convert_hf_to_ggml.py ${dir_hf_model} -s ${dir_saved_ggml_model}
  ```

- Make sure `tokenizer.json` exists; otherwise execute:

  ```sh
  cd scripts/
  python generate_tokenizer_json.py ${dir_hf_model}
  ```

- Add this project as a submodule and include it via `add_subdirectory` in your CMake project. You also need to turn on C++17 support; you can then link the library. A sketch of a consuming `CMakeLists.txt` is shown below.
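A minimal sketch of the consuming `CMakeLists.txt`. The submodule path and the exported target name (`bert_shared`) are assumptions; check this project's own `CMakeLists.txt` for the actual target.

```cmake
# Sketch only: the path and the `bert_shared` target name are assumed.
cmake_minimum_required(VERSION 3.14)
project(my_app CXX)

set(CMAKE_CXX_STANDARD 17)             # this project requires C++17
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# wherever you added the submodule
add_subdirectory(third_party/bert_cpp)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE bert_shared)
```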
| Tokenizer type | Cost time |
| --- | --- |
| transformers.BertTokenizer (Python) | 734 ms |
| tokenizers-cpp (Rust binding) | 3 ms |
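Tokenizer timings like these can be reproduced with a simple harness along the following lines; the model name, sentences, and batch size are assumptions, not the project's actual benchmark setup.

```python
# Rough timing harness (assumed setup, not the project's benchmark script).
import time
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed model
sentences = ["A short sample sentence."] * 50             # assumed workload

start = time.perf_counter()
tok(sentences, padding=True, truncation=True)
print(f"{(time.perf_counter() - start) * 1e3:.0f} ms")
```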
| Type | Cost time |
| --- | --- |
| Python, w/o loading | 248 ms |
| C++ & Rust, w/o loading (n_thread=4) | 2 ms |
| Python, w/ loading | 1104 ms |
| C++ & Rust, w/ loading (n_thread=4) | 19 ms |
| Type | Cost time |
| --- | --- |
| Python (batch_size=50) | 260 ms |
| C++ & Rust (batch_size=50, n_thread=8) | 23 ms |
ggml's performance gets worse as sentence length increases; on long sentences the Python and C++ & Rust pipelines take roughly the same time:

| Type | Cost time |
| --- | --- |
| Python & C++ & Rust (batch_size=50, n_thread=8) | 26 ms |
- Use broadcasting instead of `ggml.repeat` (WIP); the idea is sketched below.
- Update the ggml format to gguf.
- Implement a Python binding instead of the dynamic library.
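To illustrate the first item: `ggml.repeat` materializes a full tiled copy of the smaller tensor before the elementwise op, while broadcasting computes the same result without that copy. A NumPy analogy (not ggml code; shapes are illustrative only):

```python
# NumPy analogy of repeat-vs-broadcast; shapes are illustrative only.
import numpy as np

h = np.random.rand(50, 128, 768)    # (batch, seq_len, hidden) activations
b = np.random.rand(768)             # e.g. a bias vector

tiled = np.tile(b, (50, 128, 1))    # explicit repeat: allocates a full
out_repeat = h + tiled              # (50, 128, 768) copy of the bias

out_broadcast = h + b               # broadcasting: no materialized copy

assert np.allclose(out_repeat, out_broadcast)
```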
Thanks to the projects we rely on or refer to.