Your current environment
The output of python collect_env.py
Your output of `python collect_env.py` here
🐛 Describe the bug
The vLLM model structure may differ from the HuggingFace model structure, so a remapping (WeightsMapper) is applied when loading models. The remapping is defined in the model executor logic, e.g. Qwen2.5-VL defines it here
In quantized models, in addition to the model weights, there may be extra config entries describing which modules are not quantized. For example, compressed-tensors has an "ignore" list in its config naming modules that shall be treated as not quantized; NVIDIA ModelOpt has an "exclude_modules" list in its config serving the same purpose. The issue is that these entries may not be simple module path/prefix names: they may be regexes (in compressed-tensors) or wildcards (in NVIDIA ModelOpt).
Take compressed-tensors as an example, given that it is also a vLLM-owned project. An item in the ignore list may be a regex pattern such as "re:vision_tower.*" to exclude the whole vision encoder in a VLM. One example quantized model found on HF:
https://huggingface.co/gaunernst/gemma-3-27b-it-qat-compressed-tensors
In its config, it has the ignore list:
"ignore": [
"lm_head",
"re:vision_tower.*"
],
When vLLM loads the quantization configs, it also applies the weights mapping through apply_vllm_mapper, but WeightsMapper assumes it only works with individual weight names. It breaks when it hits a regex or a wildcard. For example, for the above ignore list, if a model has the weight remap:
prefix remap: vision_tower. -> vision.
then the remap will miss the regular expression "re:vision_tower.*".
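To make the failure mode concrete, here is a minimal sketch (the `remap_prefix` helper and `PREFIX_MAP` below are hypothetical and only mimic the prefix-substitution behavior, not vLLM's actual WeightsMapper implementation):

```python
# Hypothetical illustration of the failure mode: a prefix-based remap
# rewrites plain module names, but the regex entry from the ignore list
# passes through untouched because it is never interpreted as a pattern.

PREFIX_MAP = {"vision_tower.": "vision."}  # the prefix remap from the example above


def remap_prefix(name: str) -> str:
    """Apply a simple prefix substitution, as is done for plain weight names."""
    for old, new in PREFIX_MAP.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name


ignore = ["lm_head", "re:vision_tower.*"]
print([remap_prefix(item) for item in ignore])
# ['lm_head', 're:vision_tower.*']  <- the regex entry is not remapped,
# so it no longer matches the modules that vLLM renamed to "vision.*".
```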
Searching for models quantized with compressed-tensors, it seems they each dodge this issue in one way or another. E.g. the above Gemma3 model does not have weight remapping in the vLLM model executor; https://huggingface.co/cpatonn/Qwen3-VL-30B-A3B-Thinking-AWQ-4bit (Qwen3-VL) does have weight remapping, but this quantized checkpoint simply marks all the submodules of the vision encoder as ignored instead of using a regex.
There are ways to dodge the issue, but the semantics are broken. This needs to be fixed so it is semantically correct, and it is currently causing issues for models quantized by NVIDIA ModelOpt.
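One possible direction, sketched very roughly below (the `remap_ignore_entry` helper is hypothetical, not vLLM code; it only handles the simple case where the remapped prefix appears verbatim at the start of a "re:" pattern, while general regexes and ModelOpt-style wildcards would need more careful handling):

```python
PREFIX_MAP = {"vision_tower.": "vision."}  # same hypothetical remap as above


def remap_ignore_entry(entry: str) -> str:
    """Remap plain names and 're:'-prefixed patterns alike.

    Note: inside a regex, "." is a metacharacter, so treating the prefix as
    literal text is only an approximation; it happens to work for patterns
    like "vision_tower.*".
    """
    if entry.startswith("re:"):
        pattern = entry[len("re:"):]
        for old, new in PREFIX_MAP.items():
            if pattern.startswith(old):
                return "re:" + new + pattern[len(old):]
        return entry
    for old, new in PREFIX_MAP.items():
        if entry.startswith(old):
            return new + entry[len(old):]
    return entry


print([remap_ignore_entry(item) for item in ["lm_head", "re:vision_tower.*"]])
# ['lm_head', 're:vision.*']
```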
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.