Overview
The idea is to change the implementation of the TOSAQuantizer and the subclasses EthosUQuantizer/VgfQuantizer to be based on the same modular system as used for the Cortex-M backend. This has the following benefits:
- Improved configurability - The CortexMQuantizer allows custom annotation filters or even custom quantizers, enabling precise tailoring of quantization parameters.
- Improved visibility into the annotation process - The CortexMQuantizer comes with a QuantizerReporter which helps debug the annotation process, and a single quantizer_support file which clearly defines supported operators.
- Code sharing - By aligning the two quantizers, users get more predictable behaviour, and both backends will continuously benefit from each other's improvements.
API
The function call API will stay consistent with the previous implementation, with new functions added to expose the new configuration possibilities. Known behaviour changes are listed here:
- Input/output node dtypes are now determined by set_global, rather than by the closest annotated node.
- Nodes with SharedQspecs will by default inherit their dtype from their input rather than having it set explicitly; see SharedQspecQuantizer in the detailed breakdown.
Sketch of new API:
qconfig1 = get_symmetric_quantization_config() # Old way of creating quantization configs intact
qconfig2 = TOSAQuantizationConfig() # TOSAQuantizationConfigs can also be created directly
# Old API still intact
quantizer = TOSAQuantizer()
quantizer.set_global(qconfig1)
quantizer.set_module_name("sigmoid", qconfig2)
# New API function using a NodeFinder to filter out nodes, does the same thing but more flexible
node_finder = ModuleNameNodeFinder("sigmoid") # Many more available, or create your own implementing the NodeFinder interface
quantizer.set_node_finder(node_finder, qconfig2)
# Third way of doing the same thing, even more flexible
pattern_matcher = PatternMatcher(TOSA_QUANTIZER_SUPPORT_DICT) # TOSA_QUANTIZER_SUPPORT_DICT is defined in the arm backend.
pattern_quantizer = PatternQuantizer(qconfig2, node_finder, pattern_matcher)
quantizer.add_quantizer(pattern_quantizer)
Detailed breakdown
The CortexMQuantizer is in turn made up of multiple smaller quantizers run sequentially, which is what enables the highest level of flexibility: custom quantizers. Realistically, however, the most commonly used will be two types of predefined quantizers: PatternQuantizers and the SharedQspecQuantizer.
PatternQuantizer
The PatternQuantizer is used for annotating a selected set of nodes in the operator graph with a given QuantizationConfig. The nodes are selected via a NodeFinder, which can be either one of the ready-made finders already available, or custom made. The QuantizationConfig defines which QuantizationSpecs to use for inputs, outputs, weights and biases respectively, i.e. dtypes, symmetric/asymmetric, observer types and so on. Backends may have special requirements on the qspecs for certain operators, for example equal qparams on input and output, which is why the configs are backend-specific. The goal is to expose choices the user is interested in (int8 or int16 activations?) while hiding implementation details (transpose conv must set ch_axis=1).
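To make the NodeFinder idea concrete, here is a minimal sketch of what a custom finder could look like. The interface name, the find method, and the use of plain strings as stand-in node names are all assumptions for illustration; the actual interface in the backend may differ.

```python
# Hypothetical sketch of the NodeFinder interface described above; the real
# interface and node type in the Arm backend may differ.
from typing import List, Protocol


class NodeFinder(Protocol):
    def find(self, nodes: List[str]) -> List[str]:
        """Return the subset of nodes that should be annotated."""
        ...


class SuffixNodeFinder:
    """Illustrative custom finder: selects nodes whose name ends with a suffix."""

    def __init__(self, suffix: str):
        self.suffix = suffix

    def find(self, nodes: List[str]) -> List[str]:
        return [n for n in nodes if n.endswith(self.suffix)]


finder = SuffixNodeFinder("_sigmoid")
print(finder.find(["conv1", "block1_sigmoid", "add"]))  # ['block1_sigmoid']
```

A finder like this would be passed to set_node_finder together with a QuantizationConfig, as in the API sketch above.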
The selected nodes are partitioned into patterns, which are defined as supported for each backend by a support_dict. A pattern is a group of nodes which maps to one QuantizationConfig, most commonly a single node or something like a convolution together with an activation function. The support_dict lists all such patterns which the backend handles, and maps each to a PatternChecker which checks whether that particular configuration of the pattern and QuantizationConfig is supported. For example, a convolution might generally be supported, so the pattern exists in the support_dict; however, it might only be supported for channels_last input or int8 quantization.
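The support_dict mechanism can be sketched roughly as follows. The dict structure, the checker signatures, and the dtype restriction are assumptions chosen to mirror the convolution example above, not the actual backend definitions.

```python
# Illustrative sketch of a support_dict: patterns map to PatternCheckers
# which accept or reject a specific (pattern, qconfig) combination.
# All names and the qconfig shape are hypothetical.

def check_conv(pattern, qconfig):
    """Conv patterns are listed as supported, but in this sketch only
    accepted when quantizing to int8 (mirroring the example above)."""
    return qconfig["dtype"] == "int8"

def check_add(pattern, qconfig):
    # add is supported for any dtype in this sketch
    return True

SUPPORT_DICT = {
    ("conv2d",): check_conv,
    ("conv2d", "relu"): check_conv,  # conv fused with an activation
    ("add",): check_add,
}

def is_supported(pattern, qconfig):
    """A pattern is supported if it is listed AND its checker accepts it."""
    checker = SUPPORT_DICT.get(pattern)
    return checker is not None and checker(pattern, qconfig)

print(is_supported(("conv2d", "relu"), {"dtype": "int8"}))   # True
print(is_supported(("conv2d",), {"dtype": "int16"}))         # False: checker rejects
print(is_supported(("sub",), {"dtype": "int8"}))             # False: not listed
```

This two-level check (listed in the dict, then validated by the checker) is what lets the report distinguish "rejected nodes" from nodes that were never candidates at all.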
SharedQspecQuantizer
The SharedQspecQuantizer is always applied after all other quantizers and aims to handle nodes which the user typically doesn't care about and which should just work. This refers to, for example, comparison ops, max/min-ops, and data movement ops such as copies, transposes and concats. These are simply annotated with a SharedQspec, with some extra logic to handle edge cases.
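The "just works" behaviour of shared qspecs amounts to data-movement nodes reusing the quantization parameters of their input. A toy sketch of that inheritance, with hypothetical names and string qspec labels standing in for real QuantizationSpecs:

```python
# Sketch of SharedQspec inheritance: a data-movement node shares (inherits)
# the qspec of the input it moves, rather than getting its own parameters.
# Names and the dict-based representation are illustrative only.

def resolve_shared(annotations, shared_edges):
    """annotations: node -> qspec label for already-annotated nodes.
    shared_edges: data-movement node -> the input node it shares with,
    in topological order so chains (e.g. transpose -> cat) resolve."""
    resolved = dict(annotations)
    for node, src in shared_edges.items():
        resolved[node] = resolved[src]  # inherit the input's qspec
    return resolved

annotations = {"add": "INT8_PER_TENSOR_QSPEC"}
shared_edges = {"transpose": "add", "cat": "transpose"}
print(resolve_shared(annotations, shared_edges))
```

Under the behaviour change noted earlier, this inheritance follows the input's dtype by default instead of forcing a fixed dtype on the shared node.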
Quantizer ordering
The quantizers are generally applied "bottom-up", so the quantizer added last is applied first, and previous annotations are never overwritten. The exception is the quantizer configured by set_global, which is always applied second-last as the default quantizer, and the SharedQspecQuantizer, which is applied last as previously noted.
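The ordering rules above can be simulated in a few lines. Everything here is a hypothetical sketch: quantizers are reduced to (name, label, predicate) tuples, and the global quantizer is assumed to leave data-movement ops to the shared quantizer.

```python
# Sketch of the ordering rules: user quantizers run in reverse order of
# addition, the global quantizer runs second-to-last, the shared-qspec
# quantizer runs last, and annotations are never overwritten.

def apply_quantizers(nodes, user_quantizers, global_label, shared_label):
    annotations = {}
    shared_ops = {"transpose", "cat"}  # assumed data-movement ops
    ordering = list(reversed(user_quantizers)) + [
        # Global default: in this sketch it skips data-movement ops,
        # leaving them for the shared-qspec pass.
        ("global", global_label, lambda n: n not in shared_ops),
        ("shared", shared_label, lambda n: n in shared_ops),
    ]
    for _name, label, matches in ordering:
        for node in nodes:
            if node not in annotations and matches(node):
                annotations[node] = label  # first writer wins
    return annotations

result = apply_quantizers(
    nodes=["conv", "sigmoid", "transpose"],
    user_quantizers=[
        ("q1", "INT16", lambda n: n == "conv"),
        # Added last, so applied first: it claims conv before q1 can.
        ("q2", "NO_QSPEC", lambda n: n in {"conv", "sigmoid"}),
    ],
    global_label="INT8",
    shared_label="SHARED",
)
print(result)  # {'conv': 'NO_QSPEC', 'sigmoid': 'NO_QSPEC', 'transpose': 'SHARED'}
```

Note how q1 never annotates conv even though it matches it: the last-added quantizer q2 got there first, and earlier annotations are never overwritten.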
A final important detail is that the quantization is run twice to allow support for mixed int/float quantization. This relates to the graph transforms applied in transform_for_annotation which handles decompositions required before quantization. The first run of the quantizer marks which nodes should be decomposed by these transforms, while the second performs the actual annotation.
Quantizer reporter
The reporter prints a short per-quantizer summary at info-level logging and a per-operator report at debug-level logging. Here is an example of what the report looks like when one sigmoid has been selected to not be quantized, while all other nodes are int8.
----------------------------------------------------------------------------------------------------
FINAL QUANTIZATION REPORT
----------------------------------------------------------------------------------------------------
PatternQuantizer using ModuleTypeNodeFinder targeting module types: Sigmoid
Annotating with NO_QSPEC
Supported operators and patterns defined by TOSA_QUANTIZER_SUPPORT_DICT
Accepted nodes: 1
Rejected due to previous annotation: 0
Rejected nodes: 0
NODE NAME INPUT QSPEC MAP OUTPUT QSPEC MAP
----------- ----------------- ------------------
sigmoid add: NO_QSPEC NO_QSPEC
----------------------------------------------------------------------------------------------------
PatternQuantizer using GlobalNodeFinder targeting all nodes
Annotating with INT8_TOSA_QCONFIG
Supported operators and patterns defined by TOSA_QUANTIZER_SUPPORT_DICT
Accepted nodes: 5
Rejected due to previous annotation: 1
Rejected nodes: 0
NODE NAME INPUT QSPEC MAP OUTPUT QSPEC MAP
----------- ------------------------------ ---------------------
x INT8_PER_TENSOR_QSPEC
y INT8_PER_TENSOR_QSPEC
add x: INT8_PER_TENSOR_QSPEC INT8_PER_TENSOR_QSPEC
y: INT8_PER_TENSOR_QSPEC
mul sigmoid: INT8_PER_TENSOR_QSPEC INT8_PER_TENSOR_QSPEC
x: INT8_PER_TENSOR_QSPEC
output mul: INT8_PER_TENSOR_QSPEC NO_QSPEC
----------------------------------------------------------------------------------------------------
SharedQspecQuantizer using
Annotating with SHARED_QCONFIG
Supported operators and patterns defined by executorch.backends.cortex_m.quantizer.quantizer.SharedQspecQuantizer.SHARED_QSPEC_OPS_DEFAULT
No patterns accepted or rejected.
----------------------------------------------------------------------------------------------------
Non annotated nodes:
None
----------------------------------------------------------------------------------------------------
Implementation plan
The Cortex-M quantizer is in the process of being updated to have the same level of support as the old TOSAQuantizer. When this is ready, implementing the new TOSAQuantizer will only be a matter of creating a TOSA support_dict, TOSA QuantizationConfigs, and some interface glue.
The new TOSAQuantizer will first be available as an experimental feature for some time before it becomes the default and the old TOSAQuantizer starts being deprecated. Feedback during this period is much appreciated and can be posted in this thread.
cc @digantdesai @SS-JIA @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell