Does SpinQuant implement R3 when using a quantized KV cache? #9705
Labels:
module: llm (Issues related to LLM examples and apps, and to the extensions/llm/ code)
module: quantization (Issues related to quantization)
When we run SpinQuant through ExecuTorch, we observe that only the R4 rotation is applied online, while R3 is not. Could you confirm whether ExecuTorch does not support R3 for SpinQuant, even with the quantized KV cache enabled? A sketch of the expected R3 behavior follows.
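For context, here is a minimal sketch of what an online R3 rotation would look like. This is our illustration, not ExecuTorch code; the `hadamard` helper is a hypothetical stand-in for the fused kernel. R3 applies the same orthonormal Hadamard rotation H along the head dimension of the queries and keys before the keys enter the (quantized) KV cache; because H is orthonormal, the attention logits are unchanged while the cached keys become easier to quantize.

```python
import math
import torch

def hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix (Sylvester construction; n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / math.sqrt(n)

head_dim = 64
H = hadamard(head_dim)

q = torch.randn(1, 8, 16, head_dim)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, head_dim)

# R3: rotate q and k along head_dim; k_rot is what would be cached/quantized.
q_rot, k_rot = q @ H, k @ H

# Attention logits are preserved: q H (k H)^T == q k^T since H H^T == I.
assert torch.allclose(q @ k.transpose(-1, -2),
                      q_rot @ k_rot.transpose(-1, -2), atol=1e-4)
```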
We convert to .pte with quantize_kv_cache already enabled:
```bash
python -m examples.models.llama.export_llama \
  --model "llama3_2" \
  --checkpoint "/home/zhuan.zhang/llama_models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/consolidated.00.pth" \
  --params "/home/zhuan.zhang/llama_models/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8/params.json" \
  --use_sdpa_with_kv_cache \
  -X \
  --xnnpack-extended-ops \
  --preq_mode 8da4w_output_8da8w \
  --preq_group_size 32 \
  --max_seq_length 2048 \
  -kv \
  -d fp32 \
  --preq_embedding_quantize 8,0 \
  --quantize_kv_cache \
  --use_spin_quant native \
  --generate_etrecord \
  --output_name 'llama3_2_spinquant_qkv.pte'
```
The runtime op statistics show "llama_fast_hadamard_transform_default" being called 16 times (once per decoder layer), which corresponds to R4:
| op_type | occurrences_in_delegated_graphs | occurrences_in_non_delegated_graphs |
| --- | --- | --- |
| llama_fast_hadamard_transform | 0 | 16 |
The source code shows that enabling SpinQuant only replaces FeedForward with FeedForwardNativeCustom via inject_fast_hadamard_transform_native_for_spin_quant, i.e. only the R4 rotation is injected (see the sketch below).
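For reference, here is a rough sketch of what that replacement amounts to, based on our reading of the source transformation; treat the module layout as our paraphrase, not the verbatim ExecuTorch code. The only change relative to the stock Llama FeedForward is a Hadamard rotation (R4) applied to the hidden activation right before the down projection w2. The sketch assumes hidden_dim is a power of two and uses a plain matmul where ExecuTorch uses its fused custom op (the one profiled above as llama_fast_hadamard_transform).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix (Sylvester construction; n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / math.sqrt(n)

class FeedForwardWithR4(nn.Module):
    """Stock Llama FFN plus an online R4 rotation before the down projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection
        self.register_buffer("r4", hadamard(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.w1(x)) * self.w3(x)
        # R4: rotate the hidden activation so it quantizes better; in
        # ExecuTorch this is the fused fast-Hadamard-transform custom op.
        h = h @ self.r4
        return self.w2(h)
```

Nothing analogous is injected around the attention block, which is consistent with R3 not appearing in the op statistics.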
cc @kimishpatel @jerryzh168 @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng