This is a modern alternative for deploying Speech Recognition models developed using Kaldi.
Features:
- Standardized API. We use a modified version of the Jarvis proto files, which mimic the Google Speech API. This allows for easy switching between Google Cloud speech recognizers and custom models developed with Kaldi
- Fully pythonic implementation. We utilize pykaldi bindings to interface with Kaldi programmatically. This allows for a clean, customizable and extendable implementation
- Fully bidirectional streaming using HTTP/2 (gRPC). Binary speech segments are streamed to the server and partial hypotheses are streamed back to the client
- Transcribe arbitrarily long speech
- DNN-HMM models supported out of the box
- Supports RNNLM lattice rescoring
- Clients for other languages can be easily generated using the proto files
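The bidirectional streaming flow above can be illustrated with a minimal sketch: the client splits raw audio into fixed-size binary segments and sends them to the server one by one, receiving partial hypotheses as they arrive. The helper below is illustrative only (the function name and chunk size are not part of the actual client API):

```python
def chunk_audio(pcm_bytes, chunk_size=4096):
    """Split raw audio bytes into fixed-size binary segments,
    as a streaming client would send them over the gRPC stream."""
    for start in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[start:start + chunk_size]

# Example: 10000 bytes of audio split into 4096-byte segments
chunks = list(chunk_audio(b"\x00" * 10000))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```

Because the audio is consumed as a stream of segments rather than loaded whole, this is also what makes transcribing arbitrarily long speech possible.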
We recommend the following structure for the deployed model:

```
model
├── conf
│   ├── ivector_extractor.conf
│   ├── mfcc.conf
│   ├── online_cmvn.conf
│   ├── online.conf
│   └── splice.conf
├── final.mdl
├── global_cmvn.stats
├── HCLG.fst
├── ivector_extractor
│   ├── final.dubm
│   ├── final.ie
│   ├── final.mat
│   ├── global_cmvn.stats
│   ├── online_cmvn.conf
│   ├── online_cmvn_iextractor
│   └── splice_opts
└── words.txt
```
The key files / directories are:

- `conf`: Configuration files that are used to train the model
- `final.mdl`: The acoustic model
- `HCLG.fst`: The composed HCLG graph (output of mkgraph.sh)
- `global_cmvn.stats`: Mean and std used for CMVN normalization
- `words.txt`: Vocabulary file, mapping words to integers
- `ivector_extractor`: Model trained to extract ivector features (used for tdnn / chain models)
We provide the option to build a (for all intents and purposes) binary file using the Kaldi bindings through Singularity containers. In short, Singularity containers build a fakeroot filesystem into a single, executable file. For more info, check the Singularity documentation.
Instructions:
- Install Singularity on your machine. Instructions here
- Build the container:

  ```shell
  make build-singularity kaldi_model=$MY_MODEL_DIR image_tag=myasr
  ```

  Do not include special characters like `:` in the `image_tag` argument, because this will be the path to the container file.
- Run the container with:

  ```shell
  ./containers/myasr.sif --beam=11 --streaming --wav=$MYTEST.wav
  ```

- For more options run:

  ```shell
  ./containers/myasr.sif --help
  ```
Note: You can also use the command `make build-flex-singularity` to build a more flexible container that does not include / expect the model at build time and can therefore run any local model. Then you can do something like:

```shell
./containers/asr.sif --model_dir=$MY_LOCAL_MODEL --wav=$MYTEST.wav
./containers/asr.sif --model_dir=$MY_OTHER_LOCAL_MODEL --wav=$MYTEST.wav
```
Once you create this model structure, you can use the provided Dockerfile to build the server container. Run:
```shell
make build-server kaldi_model=$MY_MODEL_DIR image_tag=$CONTAINER_TAG
# example: make build-server kaldi_model=/models/kaldi/english_model image_tag=kaldigrpc:en-latest
```
And you can run the container:

```shell
# Run your container for a maximum of 3 simultaneous clients on port 1234
make run-server image_tag=kaldigrpc:en-latest max_workers=3 server_port=1234
```
Install the client library:

```shell
pip install kaldigrpc-client
```
Run the client from the command line:

```shell
kaldigrpc-transcribe --streaming --host localhost --port 50051 mytest.wav
```
For more information refer to `client/README.md`
TODO:
- Write documentation
- Add support for mixed kaldi and pytorch acoustic / language models
- Add full support for pause detection (interim results)
- Add load balancer / benchmarks
- Streamlined conversion scripts from exp folder to model tarball
- Support all Speech API configuration options