When running basic generation with NVFP4 models, I sometimes see torch._dynamo hit its recompile limit (see the warnings below).
This occurred while running examples/quantization_w4a4_fp4/qwen_30b_a3b.py
========== SAMPLE GENERATION ==============
[transformers] The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
W0612 01:46:22.922000 871313 torch/_dynamo/convert_frame.py:1853] [0/8] torch._dynamo hit config.recompile_limit (8)
W0612 01:46:22.922000 871313 torch/_dynamo/convert_frame.py:1853] [0/8] function: 'cast_to_fp4' (/home/kylesayrs/compressed-tensors/src/compressed_tensors/quantization/quant_args.py:55)
W0612 01:46:22.922000 871313 torch/_dynamo/convert_frame.py:1853] [0/8] last reason: 0/7: tensor 'x' rank mismatch. expected 4, actual 3
W0612 01:46:22.922000 871313 torch/_dynamo/convert_frame.py:1853] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0612 01:46:22.922000 871313 torch/_dynamo/convert_frame.py:1853] [0/8] To diagnose recompilation issues, see https://docs.pytorch.org/docs/main/user_guide/torch_compiler/compile/programming_model.recompilation.html
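The last recompile reason in the log is a rank mismatch on `x` (expected 4, actual 3), which suggests the compiled function is being called with inputs of differing rank. One possible mitigation, shown here only as a minimal sketch, is to flatten inputs to a fixed rank before calling the compiled function and restore the shape afterwards. `cast_like_fn` and `call_rank_stable` are hypothetical names, and `torch.round` is a stand-in for an elementwise cast; this is not the actual `cast_to_fp4` implementation. The sketch assumes the op is elementwise and that inputs are non-empty with rank >= 1.

```python
import torch


def cast_like_fn(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for an elementwise cast; NOT the real cast_to_fp4.
    return torch.round(x)


# backend="eager" keeps the sketch light: Dynamo still traces and guards on
# input properties, but no code generation backend is required.
compiled_cast = torch.compile(cast_like_fn, backend="eager")


def call_rank_stable(x: torch.Tensor) -> torch.Tensor:
    # Collapse all leading dims so the compiled function always sees rank 2.
    # Assumes x is non-empty and has at least one dimension.
    flat = x.reshape(-1, x.shape[-1])
    out = compiled_cast(flat)
    # Restore the caller's original shape.
    return out.reshape(x.shape)


if __name__ == "__main__":
    for shape in [(2, 3, 8), (2, 2, 3, 8), (5, 8)]:
        x = torch.randn(*shape)
        y = call_rank_stable(x)
        print(tuple(x.shape), "->", tuple(y.shape))
```

Changing sizes within a fixed rank can still trigger some recompiles, so running with TORCH_LOGS="recompiles" (as the warning suggests) would be the way to confirm whether this actually helps for the real function.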