Overview
The idea is to change the implementation of the TOSAQuantizer and the subclasses EthosUQuantizer/VgfQuantizer to be based on the same modular system as used for the Cortex-M backend. This has the following benefits:
- Improved configurability - The CortexMQuantizer allows custom annotation filters or even custom quantizers, enabling precise tailoring of quantization parameters.
- Improved visibility into the annotation process - The CortexMQuantizer comes with a QuantizerReporter which helps debug the annotation process, and a single quantizer_support file which clearly defines supported operators.
- Code sharing - By aligning the two quantizers, users get more predictable behaviour, and both backends will continuously benefit from each other's improvements.
API
The function call API will stay consistent with the previous implementation, with new functions added to expose the new configuration possibilities. Known behaviour changes are listed here:
- Input/output node dtypes are now determined by set_global, rather than by the closest annotated node.
- Nodes with SharedQspecs will by default inherit their dtype from their input rather than having it set explicitly; see SharedQspecQuantizer in the detailed breakdown.
Sketch of new API:
qconfig1 = get_symmetric_quantization_config() # Old way of creating quantization configs intact
qconfig2 = TOSAQuantizationConfig() # TOSAQuantizationConfigs can also be created directly
# Old API still intact
quantizer = TOSAQuantizer()
quantizer.set_global(qconfig1)
quantizer.set_module_name("sigmoid", qconfig2)
# New API function using a NodeFinder to filter out nodes, does the same thing but more flexible
node_finder = ModuleNameNodeFinder("sigmoid") # Many more available, or create your own implementing the NodeFinder interface
quantizer.set_node_finder(node_finder, qconfig2)
# Third way of doing the same thing, even more flexible
pattern_matcher = PatternMatcher(TOSA_QUANTIZER_SUPPORT_DICT) # TOSA_QUANTIZER_SUPPORT_DICT is defined in the arm backend.
pattern_quantizer = PatternQuantizer(qconfig2, node_finder, pattern_matcher)
quantizer.add_quantizer(pattern_quantizer)
Detailed breakdown
The CortexMQuantizer is in turn made up of multiple smaller quantizers run sequentially, which is what enables the highest level of flexibility: custom quantizers. Realistically, however, the most commonly used will be two types of predefined quantizers: PatternQuantizers and the SharedQspecQuantizer.
PatternQuantizer
The PatternQuantizer is used for annotating a selected set of nodes in the operator graph with a given QuantizationConfig. The nodes are selected via a NodeFinder, which can be either one of the ready-made finders already available, or custom made. The QuantizationConfig defines which QuantizationSpecs to use for inputs, outputs, weights and biases respectively, i.e. dtypes, symmetric/asymmetric, observer types and so on. Backends may have special requirements on the qspecs for certain operators, for example equal qparams on input and output, which is why the configs are backend-specific. The goal is to expose choices the user is interested in (int8 or int16 activations?) while hiding implementation details (transpose conv must set ch_axis=1).
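To make the NodeFinder idea concrete, here is a minimal sketch of what a custom finder could look like. The interface name, the find method, and the use of plain strings as stand-in node names are all assumptions for illustration; the actual interface in the backend may differ.

```python
# Hypothetical sketch of the NodeFinder interface described above; the real
# interface and node type in the Arm backend may differ.
from typing import List, Protocol


class NodeFinder(Protocol):
    def find(self, nodes: List[str]) -> List[str]:
        """Return the subset of nodes that should be annotated."""
        ...


class SuffixNodeFinder:
    """Illustrative custom finder: selects nodes whose name ends with a suffix."""

    def __init__(self, suffix: str):
        self.suffix = suffix

    def find(self, nodes: List[str]) -> List[str]:
        return [n for n in nodes if n.endswith(self.suffix)]


finder = SuffixNodeFinder("_sigmoid")
print(finder.find(["conv1", "block1_sigmoid", "add"]))  # ['block1_sigmoid']
```

A finder like this would be passed to set_node_finder together with a QuantizationConfig, as in the API sketch above.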
The selected nodes are partitioned into patterns, which are defined as supported for each backend by a support_dict. A pattern is a group of nodes which maps to one QuantizationConfig, most commonly a single node or something like a convolution together with an activation function. The support_dict lists all such patterns which the backend handles, and maps each to a PatternChecker which checks whether that particular configuration of the pattern and QuantizationConfig is supported. For example, a convolution might generally be supported, so the pattern exists in the support_dict; however, it might only be supported for channels_last input or int8 quantization.
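The support_dict mechanism can be sketched roughly as follows. The dict structure, the checker signatures, and the dtype restriction are assumptions chosen to mirror the convolution example above, not the actual backend definitions.

```python
# Illustrative sketch of a support_dict: patterns map to PatternCheckers
# which accept or reject a specific (pattern, qconfig) combination.
# All names and the qconfig shape are hypothetical.

def check_conv(pattern, qconfig):
    """Conv patterns are listed as supported, but in this sketch only
    accepted when quantizing to int8 (mirroring the example above)."""
    return qconfig["dtype"] == "int8"

def check_add(pattern, qconfig):
    # add is supported for any dtype in this sketch
    return True

SUPPORT_DICT = {
    ("conv2d",): check_conv,
    ("conv2d", "relu"): check_conv,  # conv fused with an activation
    ("add",): check_add,
}

def is_supported(pattern, qconfig):
    """A pattern is supported if it is listed AND its checker accepts it."""
    checker = SUPPORT_DICT.get(pattern)
    return checker is not None and checker(pattern, qconfig)

print(is_supported(("conv2d", "relu"), {"dtype": "int8"}))   # True
print(is_supported(("conv2d",), {"dtype": "int16"}))         # False: checker rejects
print(is_supported(("sub",), {"dtype": "int8"}))             # False: not listed
```

This two-level check (listed in the dict, then validated by the checker) is what lets the report distinguish "rejected nodes" from nodes that were never candidates at all.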
SharedQspecQuantizer
The SharedQspecQuantizer is always applied after all other quantizers and aims to handle nodes which the user typically doesn't care about and which should just work. This refers to, for example, comparison ops, max/min-ops, and data movement ops such as copies, transposes and concats. These are simply annotated with a SharedQspec, with some extra logic to handle edge cases.
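The "just works" behaviour of shared qspecs amounts to data-movement nodes reusing the quantization parameters of their input. A toy sketch of that inheritance, with hypothetical names and string qspec labels standing in for real QuantizationSpecs:

```python
# Sketch of SharedQspec inheritance: a data-movement node shares (inherits)
# the qspec of the input it moves, rather than getting its own parameters.
# Names and the dict-based representation are illustrative only.

def resolve_shared(annotations, shared_edges):
    """annotations: node -> qspec label for already-annotated nodes.
    shared_edges: data-movement node -> the input node it shares with,
    in topological order so chains (e.g. transpose -> cat) resolve."""
    resolved = dict(annotations)
    for node, src in shared_edges.items():
        resolved[node] = resolved[src]  # inherit the input's qspec
    return resolved

annotations = {"add": "INT8_PER_TENSOR_QSPEC"}
shared_edges = {"transpose": "add", "cat": "transpose"}
print(resolve_shared(annotations, shared_edges))
```

Under the behaviour change noted earlier, this inheritance follows the input's dtype by default instead of forcing a fixed dtype on the shared node.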
Quantizer ordering
The quantizers are generally applied "bottom-up", so the quantizer added last is applied first, and previous annotations are never overwritten. The exception is the quantizer configured by set_global, which is always applied second-last as the default quantizer, and the SharedQspecQuantizer, which is applied last as previously noted.
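The ordering rules above can be simulated in a few lines. Everything here is a hypothetical sketch: quantizers are reduced to (name, label, predicate) tuples, and the global quantizer is assumed to leave data-movement ops to the shared quantizer.

```python
# Sketch of the ordering rules: user quantizers run in reverse order of
# addition, the global quantizer runs second-to-last, the shared-qspec
# quantizer runs last, and annotations are never overwritten.

def apply_quantizers(nodes, user_quantizers, global_label, shared_label):
    annotations = {}
    shared_ops = {"transpose", "cat"}  # assumed data-movement ops
    ordering = list(reversed(user_quantizers)) + [
        # Global default: in this sketch it skips data-movement ops,
        # leaving them for the shared-qspec pass.
        ("global", global_label, lambda n: n not in shared_ops),
        ("shared", shared_label, lambda n: n in shared_ops),
    ]
    for _name, label, matches in ordering:
        for node in nodes:
            if node not in annotations and matches(node):
                annotations[node] = label  # first writer wins
    return annotations

result = apply_quantizers(
    nodes=["conv", "sigmoid", "transpose"],
    user_quantizers=[
        ("q1", "INT16", lambda n: n == "conv"),
        # Added last, so applied first: it claims conv before q1 can.
        ("q2", "NO_QSPEC", lambda n: n in {"conv", "sigmoid"}),
    ],
    global_label="INT8",
    shared_label="SHARED",
)
print(result)  # {'conv': 'NO_QSPEC', 'sigmoid': 'NO_QSPEC', 'transpose': 'SHARED'}
```

Note how q1 never annotates conv even though it matches it: the last-added quantizer q2 got there first, and earlier annotations are never overwritten.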
A final important detail is that the quantization is run twice to allow support for mixed int/float quantization. This relates to the graph transforms applied in transform_for_annotation which handles decompositions required before quantization. The first run of the quantizer marks which nodes should be decomposed by these transforms, while the second performs the actual annotation.
Quantizer reporter
The reporter prints a short per-quantizer summary at info-level logging and a per-operator report at debug-level logging. Here is an example of what the report looks like when one sigmoid has been selected to not be quantized, while all other nodes are int8.
----------------------------------------------------------------------------------------------------
FINAL QUANTIZATION REPORT
----------------------------------------------------------------------------------------------------
PatternQuantizer using ModuleTypeNodeFinder targeting module types: Sigmoid
Annotating with NO_QSPEC
Supported operators and patterns defined by TOSA_QUANTIZER_SUPPORT_DICT
Accepted nodes: 1
Rejected due to previous annotation: 0
Rejected nodes: 0
NODE NAME INPUT QSPEC MAP OUTPUT QSPEC MAP
----------- ----------------- ------------------
sigmoid add: NO_QSPEC NO_QSPEC
----------------------------------------------------------------------------------------------------
PatternQuantizer using GlobalNodeFinder targeting all nodes
Annotating with INT8_TOSA_QCONFIG
Supported operators and patterns defined by TOSA_QUANTIZER_SUPPORT_DICT
Accepted nodes: 5
Rejected due to previous annotation: 1
Rejected nodes: 0
NODE NAME INPUT QSPEC MAP OUTPUT QSPEC MAP
----------- ------------------------------ ---------------------
x INT8_PER_TENSOR_QSPEC
y INT8_PER_TENSOR_QSPEC
add x: INT8_PER_TENSOR_QSPEC INT8_PER_TENSOR_QSPEC
y: INT8_PER_TENSOR_QSPEC
mul sigmoid: INT8_PER_TENSOR_QSPEC INT8_PER_TENSOR_QSPEC
x: INT8_PER_TENSOR_QSPEC
output mul: INT8_PER_TENSOR_QSPEC NO_QSPEC
----------------------------------------------------------------------------------------------------
SharedQspecQuantizer using
Annotating with SHARED_QCONFIG
Supported operators and patterns defined by executorch.backends.cortex_m.quantizer.quantizer.SharedQspecQuantizer.SHARED_QSPEC_OPS_DEFAULT
No patterns accepted or rejected.
----------------------------------------------------------------------------------------------------
Non annotated nodes:
None
----------------------------------------------------------------------------------------------------
Implementation plan
The Cortex-M quantizer is in the process of being updated to have the same level of support as the old TOSAQuantizer. When this is ready, implementing the new TOSAQuantizer will only be a matter of creating a TOSA support_dict, TOSA QuantizationConfigs, and some interface glue.
The new TOSAQuantizer will first be available as an experimental feature for some time before it becomes the default and the old TOSAQuantizer starts being deprecated. Feedback during this period is much appreciated and can be posted in this thread.
cc @digantdesai @SS-JIA @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell