[RFC]: Support specifying quant_config details in the LLM or Server entrypoints

🚀 The feature, motivation and pitch

Background:
With the recent support for `deepspeedfp` quantization introduced in #4652 and #4690, a new issue has emerged from the nature of the runtime quantization implementation. Runtime quantization lets users load an unquantized model and pass the `quantization` argument to reduce the memory footprint required to load it. The catch is that the `deepspeedfp` implementation has a `num_bits` parameter that supports quantizing the weights down to either 8 or 6 bits, with a default of 8.
Problem Statement:
Currently, if a user wants to apply `quantization="deepspeedfp"`, vLLM will only quantize the model with `num_bits=8`, since that is the default value. The only way to change this is to provide a `quant_config.json` file that explicitly sets the desired value for `num_bits`. This prevents users from customizing the quantization settings without modifying a configuration file.
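For illustration, such a file might look like the sketch below. This assumes the loader reads the `num_bits` key described above; the exact schema is whatever the `deepspeedfp` implementation expects.

```json
{
  "num_bits": 6
}
```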
Proposed Solution:
To address this issue, we propose adding a new argument, `quant_kwargs: Union[str, Dict]`, to the common `LLM()` and OpenAI server interfaces in vLLM. This argument would accept either a dictionary of keyword arguments or a string that can be parsed into one. The purpose of `quant_kwargs` is to let users override the default values, or the values loaded from a config file, for the quantization configuration.
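A sketch of what the proposed interface could look like; note that `quant_kwargs` does not exist yet (it is what this RFC proposes), and the model name is only a placeholder:

```python
from vllm import LLM

# Proposed dict form: override the deepspeedfp default of num_bits=8
# at load time, without shipping a quant_config.json with the model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    quantization="deepspeedfp",
    quant_kwargs={"num_bits": 6},
)
```

On the server side, the string form would cover the same ground, e.g. a hypothetical `--quant-kwargs '{"num_bits": 6}'` flag that parses to the same dict.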
By introducing this new argument, users can specify custom quantization settings directly through the API without touching a `quant_config.json` file, making it easy to experiment with different quantization settings for their specific requirements.
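A minimal sketch of how the `Union[str, Dict]` handling could work, assuming the string form is JSON; the function name and validation are illustrative, not an existing vLLM API:

```python
import json
from typing import Any, Dict, Optional, Union


def parse_quant_kwargs(
        value: Optional[Union[str, Dict[str, Any]]]) -> Dict[str, Any]:
    """Normalize the proposed quant_kwargs argument to a plain dict.

    A string (e.g. from a CLI flag) is assumed to be JSON; a dict is
    shallow-copied. The result would then be merged over the defaults
    or over the values loaded from quant_config.json.
    """
    if value is None:
        return {}
    if isinstance(value, str):
        parsed = json.loads(value)
        if not isinstance(parsed, dict):
            raise ValueError("quant_kwargs string must encode a JSON object")
        return parsed
    return dict(value)
```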
Alternatives
No response
Additional context
No response