AI Edge Quantizer is a tool for advanced developers to quantize converted LiteRT models. It is aimed at users who need the best possible performance from resource-demanding models (e.g., GenAI models).
Nightly PyPI package:

    pip install ai-edge-quantizer-nightly
The quantizer requires two inputs:
- An unquantized source LiteRT model (FP32 data type, in the FlatBuffers format with the `.tflite` extension)
- A quantization recipe (details below)
and outputs a quantized LiteRT model that's ready for deployment on edge devices.
In a nutshell, the quantizer works according to the following steps:
- Instantiate a `Quantizer` class. This is the entry point to the quantizer's functionality.
- Load a desired quantization recipe (details in the subsection below).
- Quantize (and save) the model. This is where most of the quantizer's internal logic runs.
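    from ai_edge_quantizer import quantizer, recipe

    # Instantiate the quantizer, load a recipe, then quantize and save.
    qt = quantizer.Quantizer("path/to/input/tflite")
    qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())
    qt.quantize().export_model("/path/to/output/tflite")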
Please see the getting started colab for the simplest quick-start guide to these three steps, and the selective quantization colab for more details on advanced features.
Please refer to the LiteRT documentation for ways to generate LiteRT models from JAX, PyTorch, and TensorFlow. The input source model should be an FP32 (unquantized) model in the FlatBuffers format with the `.tflite` extension.
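As a quick sanity check before quantizing, the sketch below tests whether a file looks like a LiteRT FlatBuffers model. It relies on the `TFL3` file identifier that FlatBuffers places in bytes 4 to 7 of a `.tflite` file. The helper name `looks_like_tflite` is made up for this example and is not part of AI Edge Quantizer, and the check does not verify that the model is FP32.

```python
import os
import tempfile

# "TFL3" is the FlatBuffers file identifier of .tflite models; FlatBuffers
# stores it in bytes 4..7, right after the 4-byte root table offset.
TFLITE_IDENTIFIER = b"TFL3"


def looks_like_tflite(path: str) -> bool:
    """Return True if `path` ends in .tflite and carries the TFL3 identifier.

    Hypothetical helper for illustration only. It does not inspect tensor
    data types, so it cannot tell an FP32 model from a quantized one.
    """
    if not path.endswith(".tflite"):
        return False
    with open(path, "rb") as f:
        header = f.read(8)
    return len(header) == 8 and header[4:8] == TFLITE_IDENTIFIER


# Usage with a synthetic 8-byte header (not a real model).
with tempfile.TemporaryDirectory() as tmp:
    fake_model = os.path.join(tmp, "model.tflite")
    with open(fake_model, "wb") as f:
        f.write(b"\x1c\x00\x00\x00" + TFLITE_IDENTIFIER)
    print(looks_like_tflite(fake_model))  # True
```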
The user needs to specify a quantization recipe using AI Edge Quantizer's API to apply to the source model. The quantization recipe encodes all information on how a model is to be quantized, such as number of bits, data type, symmetry, scope name, etc.
Essentially, a quantization recipe is a collection of commands of the following form:
"Apply Quantization Algorithm X on Operator Y under Scope Z with Config N".
For example:
"Uniformly quantize the FullyConnected op under scope 'dense1/' with INT8 symmetric with Dynamic Quantization".
All unspecified ops are kept as FP32 (unquantized). The scope of an operator in TFLite is defined as the output tensor name of the op, which preserves the hierarchical model information from the source model (e.g., scope in TF). The best way to obtain a scope name is to visualize the model with Model Explorer.
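To make the "Algorithm X on Operator Y under Scope Z with Config N" idea concrete, here is a minimal sketch of one recipe command as plain Python data. The `RecipeEntry` class, its field names, the `"uniform"` label, and the use of a regular expression to match the scope are all assumptions made for illustration; the actual schema lives in the library and may differ.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RecipeEntry:
    """One hypothetical recipe command (not the library's own class)."""

    algorithm: str    # "Quantization Algorithm X"
    operator: str     # "Operator Y", e.g. "FULLY_CONNECTED"
    scope_regex: str  # "Scope Z", matched against the op's output tensor name
    config: dict = field(default_factory=dict)  # "Config N"

    def applies_to(self, op_type: str, output_tensor_name: str) -> bool:
        """True if the op type matches and the scope pattern is found."""
        return (
            op_type == self.operator
            and re.search(self.scope_regex, output_tensor_name) is not None
        )


# The example from the text: INT8 symmetric dynamic quantization of
# FullyConnected ops under scope "dense1/".
entry = RecipeEntry(
    algorithm="uniform",  # hypothetical label
    operator="FULLY_CONNECTED",
    scope_regex=r"^dense1/",
    config={"weight_num_bits": 8, "weight_symmetric": True, "mode": "dynamic"},
)

print(entry.applies_to("FULLY_CONNECTED", "dense1/MatMul"))  # True
print(entry.applies_to("FULLY_CONNECTED", "dense2/MatMul"))  # False
print(entry.applies_to("CONV_2D", "dense1/Conv2D"))          # False
```

Any op that no entry applies to would stay in FP32, which matches the default described above.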
The simplest way to get started is to use one of the existing recipes from `recipe.py`. This is demonstrated in the getting started colab example.
Please refer to the LiteRT deployment documentation for ways to deploy a quantized LiteRT model.
There are many ways the user can configure and customize the quantization recipe beyond using a template in recipe.py. For example, the user can configure the recipe to achieve these features:
- Selective quantization (exclude selected ops from being quantized)
- Flexible mixed-scheme quantization (a mixture of different precisions, compute precisions, scopes, ops, configs, etc.)
- 4-bit weight quantization
The selective quantization colab shows some of these more advanced features.
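For intuition about what 8-bit versus 4-bit symmetric weight quantization does to the values, the following sketch applies textbook symmetric uniform quantization to a small list of weights. It is a generic illustration, not the quantizer's internal algorithm; in particular, the narrow integer range (for example -127 to 127 for 8 bits) and the per-tensor scale are assumptions.

```python
def symmetric_quantize(values, num_bits):
    """Quantize floats to signed integers using one symmetric scale.

    Assumes a narrow range [-qmax, qmax] and a single (tensor-wise) scale.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.4, 0.0, 0.3, 1.0]

q8, s8 = symmetric_quantize(weights, 8)
q4, s4 = symmetric_quantize(weights, 4)

print(q8)  # [-127, -51, 0, 38, 127]
print(q4)  # [-7, -3, 0, 2, 7]

# The 4-bit grid is much coarser, so the round-trip error is larger.
err8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, s4)))
print(err8 < err4)  # True
```

The coarser 4-bit grid is the reason selective or mixed schemes can help: sensitive ops can stay at higher precision while the rest use fewer bits.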
For specifics of the recipe schema, please refer to `OpQuantizationRecipe` in [recipe_manager.py].
For advanced usage involving mixed quantization, the following APIs may be useful:
- Use `Quantizer.load_quantization_recipe()` in `quantizer.py` to load a custom recipe.
- Use `Quantizer.update_quantization_recipe()` in `quantizer.py` to extend or override specific parts of the recipe.
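The difference between loading a recipe and updating one can be pictured with a toy model. The `ToyRecipe` class below is purely hypothetical: it is not the library's implementation, its method signatures are invented, and the "last matching entry wins" rule is an assumption about how an override could behave.

```python
import re


class ToyRecipe:
    """Hypothetical stand-in for a recipe; not the library's implementation."""

    def __init__(self):
        self.entries = []

    def load(self, entries):
        """Replace the whole recipe, as loading a custom recipe would."""
        self.entries = list(entries)

    def update(self, regex, operation, config):
        """Append one entry; later entries take precedence in resolve()."""
        self.entries.append(
            {"regex": regex, "operation": operation, "config": config}
        )

    def resolve(self, operation, tensor_name):
        """Return the config of the last matching entry, or None (stay FP32)."""
        for entry in reversed(self.entries):
            if entry["operation"] == operation and re.search(
                entry["regex"], tensor_name
            ):
                return entry["config"]
        return None


toy = ToyRecipe()
# Start from a blanket rule for all FullyConnected ops ...
toy.load(
    [{"regex": ".*", "operation": "FULLY_CONNECTED", "config": "DYNAMIC_WI8_AFP32"}]
)
# ... then override only the ops under scope "dense1/".
toy.update(r"^dense1/", "FULLY_CONNECTED", "DYNAMIC_WI4_AFP32")

print(toy.resolve("FULLY_CONNECTED", "dense1/MatMul"))  # DYNAMIC_WI4_AFP32
print(toy.resolve("FULLY_CONNECTED", "dense2/MatMul"))  # DYNAMIC_WI8_AFP32
print(toy.resolve("CONV_2D", "conv1/Conv2D"))           # None
```

For the real method signatures and precedence rules, check `quantizer.py` and `recipe_manager.py` directly.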
The table below outlines the allowed configurations for available recipes.
| Config | | DYNAMIC_WI8_AFP32 | DYNAMIC_WI4_AFP32 | STATIC_WI8_AI16 | STATIC_WI4_AI16 | STATIC_WI8_AI8 | STATIC_WI4_AI8 | WEIGHTONLY_WI8_AFP32 | WEIGHTONLY_WI4_AFP32 |
|---|---|---|---|---|---|---|---|---|---|
| activation | num_bits | None | None | 16 | 16 | 8 | 8 | None | None |
| | symmetric | None | None | TRUE | TRUE | [TRUE, FALSE] | [TRUE, FALSE] | None | None |
| | granularity | None | None | TENSORWISE | TENSORWISE | TENSORWISE | TENSORWISE | None | None |
| | dtype | None | None | INT | INT | INT | INT | None | None |
| weight | num_bits | 8 | 4 | 8 | 4 | 8 | 4 | 8 | 4 |
| | symmetric | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | [TRUE, FALSE] | [TRUE, FALSE] |
| | granularity | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] | [CHANNELWISE, TENSORWISE] |
| | dtype | INT | INT | INT | INT | INT | INT | INT | INT |
| explicit_dequantize | | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE |
| compute_precision | | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER | INTEGER | FLOAT | FLOAT |
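One way to read the table is as a set of constraints that a configuration must satisfy. The sketch below encodes three of the columns as plain Python data and checks a candidate configuration against them. The dictionary layout, key names, and the `validate` helper are hypothetical and exist only for this example; they are not the library's API, and only the allowed values are taken from the table.

```python
BOTH_GRANULARITIES = {"CHANNELWISE", "TENSORWISE"}

# Allowed values for three of the recipes, copied from the table above.
ALLOWED = {
    "DYNAMIC_WI8_AFP32": {
        "activation": None,  # activations are not quantized
        "weight": {"num_bits": {8}, "symmetric": {True},
                   "granularity": BOTH_GRANULARITIES, "dtype": {"INT"}},
        "explicit_dequantize": False,
        "compute_precision": "INTEGER",
    },
    "STATIC_WI8_AI8": {
        "activation": {"num_bits": {8}, "symmetric": {True, False},
                       "granularity": {"TENSORWISE"}, "dtype": {"INT"}},
        "weight": {"num_bits": {8}, "symmetric": {True},
                   "granularity": BOTH_GRANULARITIES, "dtype": {"INT"}},
        "explicit_dequantize": False,
        "compute_precision": "INTEGER",
    },
    "WEIGHTONLY_WI4_AFP32": {
        "activation": None,
        "weight": {"num_bits": {4}, "symmetric": {True, False},
                   "granularity": BOTH_GRANULARITIES, "dtype": {"INT"}},
        "explicit_dequantize": True,
        "compute_precision": "FLOAT",
    },
}


def validate(recipe_name, config):
    """Return a list of human-readable problems; empty means allowed."""
    rules = ALLOWED[recipe_name]
    problems = []
    for tensor in ("activation", "weight"):
        allowed, given = rules[tensor], config.get(tensor)
        if allowed is None:
            if given is not None:
                problems.append(f"{tensor} must be None")
            continue
        if given is None:
            problems.append(f"{tensor} config is required")
            continue
        for key, choices in allowed.items():
            if given.get(key) not in choices:
                problems.append(f"{tensor}.{key}={given.get(key)!r} is not allowed")
    for key in ("explicit_dequantize", "compute_precision"):
        if config.get(key) != rules[key]:
            problems.append(f"{key} must be {rules[key]!r}")
    return problems


candidate = {
    "activation": None,
    "weight": {"num_bits": 8, "symmetric": True,
               "granularity": "CHANNELWISE", "dtype": "INT"},
    "explicit_dequantize": False,
    "compute_precision": "INTEGER",
}
print(validate("DYNAMIC_WI8_AFP32", candidate))     # []
print(validate("WEIGHTONLY_WI4_AFP32", candidate))  # three problems
```

The same candidate fails the weight-only recipe because that column requires 4-bit weights, explicit dequantization, and float compute precision.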
Operators Supporting Quantization
Each operator below is supported by the indicated number of the eight recipes listed above (DYNAMIC_WI8_AFP32, DYNAMIC_WI4_AFP32, STATIC_WI8_AI16, STATIC_WI4_AI16, STATIC_WI8_AI8, STATIC_WI4_AI8, WEIGHTONLY_WI8_AFP32, WEIGHTONLY_WI4_AFP32). FULLY_CONNECTED is supported by all of them.

| Operator | Supported recipes (of 8) |
|---|---|
| FULLY_CONNECTED | 8 |
| CONV_2D | 6 |
| BATCH_MATMUL | 4 |
| EMBEDDING_LOOKUP | 4 |
| DEPTHWISE_CONV_2D | 4 |
| AVERAGE_POOL_2D | 2 |
| RESHAPE | 2 |
| SOFTMAX | 2 |
| TANH | 2 |
| TRANSPOSE | 2 |
| GELU | 2 |
| ADD | 2 |
| CONV_2D_TRANSPOSE | 3 |
| SUB | 2 |
| MUL | 2 |
| MEAN | 2 |
| RSQRT | 2 |
| CONCATENATION | 2 |
| STRIDED_SLICE | 2 |
| SPLIT | 2 |
| LOGISTIC | 2 |