[RFC]: Support specifying quant_config details in the LLM or Server entrypoints

🚀 The feature, motivation and pitch

Background:
With the recent support for `deepspeedfp` quantization introduced in #4652 and #4690, a new issue has emerged from the nature of the runtime quantization implementation. Runtime quantization lets users load an unquantized model and pass the `quantization` argument to reduce the memory footprint required to load it. The catch is that the `deepspeedfp` implementation has a `num_bits` parameter that supports quantizing the weights down to either 8 or 6 bits, with a default of 8.
Problem Statement:
Currently, if a user wants to apply `quantization="deepspeedfp"`, vLLM will only quantize the model with `num_bits=8`, since that is the default value. The only way to change this is to provide a `quant_config.json` file that explicitly sets the desired value for `num_bits`. This prevents users from customizing the quantization settings without modifying a configuration file.
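For illustration, such a file might look like the sketch below. This assumes the loader reads the `num_bits` key described above; the exact schema is whatever the `deepspeedfp` implementation expects.

```json
{
  "num_bits": 6
}
```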
Proposed Solution:
To address this issue, we propose adding a new argument, `quant_kwargs: Union[str, Dict]`, to the common `LLM()` and OpenAI server interfaces in vLLM. This argument would accept either a dictionary of keyword arguments or a string that can be parsed into one. The purpose of `quant_kwargs` is to let users override the default values, or the values loaded from a config file, for the quantization configuration.
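A sketch of what the proposed interface could look like; note that `quant_kwargs` does not exist yet (it is what this RFC proposes), and the model name is only a placeholder:

```python
from vllm import LLM

# Proposed dict form: override the deepspeedfp default of num_bits=8
# at load time, without shipping a quant_config.json with the model.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    quantization="deepspeedfp",
    quant_kwargs={"num_bits": 6},
)
```

On the server side, the string form would cover the same ground, e.g. a hypothetical `--quant-kwargs '{"num_bits": 6}'` flag that parses to the same dict.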
By introducing this new argument, users can specify custom quantization settings directly through the API without touching a `quant_config.json` file, making it easy to experiment with different quantization settings for their specific requirements.
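A minimal sketch of how the `Union[str, Dict]` handling could work, assuming the string form is JSON; the function name and validation are illustrative, not an existing vLLM API:

```python
import json
from typing import Any, Dict, Optional, Union


def parse_quant_kwargs(
        value: Optional[Union[str, Dict[str, Any]]]) -> Dict[str, Any]:
    """Normalize the proposed quant_kwargs argument to a plain dict.

    A string (e.g. from a CLI flag) is assumed to be JSON; a dict is
    shallow-copied. The result would then be merged over the defaults
    or over the values loaded from quant_config.json.
    """
    if value is None:
        return {}
    if isinstance(value, str):
        parsed = json.loads(value)
        if not isinstance(parsed, dict):
            raise ValueError("quant_kwargs string must encode a JSON object")
        return parsed
    return dict(value)
```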
Alternatives
No response
Additional context
No response