Dear AutoAWQ team,
Thanks a lot for maintaining this amazing repository.
As far as I understand, AutoAWQ supports quantization of models that are distributed across multiple GPUs, i.e. loaded with `device_map="auto"`. However, during the initialization of the quantization, I think the device casting could be improved. In particular, in `awq/quantize/quantizer.py`, lines 542-544 cast both the embeddings and the first decoder layer to `best_device`.
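Paraphrasing those lines from memory (the exact statements and names may differ slightly in the current source):

```python
best_device = get_best_device()

# move the first decoder layer to the chosen device
modules[0] = modules[0].to(best_device)

# move the embeddings to the same device
self.awq_model.move_embed(self.model, best_device)
```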
However, the `get_best_device()` function, when CUDA is used, forces a common cast onto GPU:0. This can fail with a CUDA OOM in line 542, even though with multiple GPUs available it would not have to be an issue.
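As far as I can tell, the helper boils down to something like this (paraphrased, not the literal source):

```python
import torch

# Paraphrase of get_best_device() -- the real helper may differ slightly.
def get_best_device():
    if torch.cuda.is_available():
        return "cuda:0"  # hard-coded first GPU, regardless of how full it already is
    elif torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```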
The failure can occur if something else is held on GPU:0 and blocks memory, or if a model loaded with `device_map="sequential"` has an unfortunate distribution of the decoder and embedding layers (e.g. model components are loaded onto GPU:0 until it is full and the embeddings end up on GPU:1).

Do you think it would be possible and useful to adjust this device selection, e.g. by selecting the least utilized GPU?
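For example, something along these lines (just a sketch; `torch.cuda.mem_get_info` is one way to query the free memory per device):

```python
import torch

def get_best_device():
    """Hypothetical variant: pick the CUDA device with the most free memory."""
    if torch.cuda.is_available():
        # mem_get_info(i) returns (free_bytes, total_bytes) for device i
        free_mem = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
        return f"cuda:{free_mem.index(max(free_mem))}"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```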
I would be grateful to hear your thoughts on this, and I would be happy to help out with it if I can. Thank you!
Here is a simple example to reproduce the issue (`autoawq==0.2.6`). On a system with 4 NVIDIA A10Gs, where sufficient memory is left on the other GPUs, it fails with OOM.
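A sketch of the reproduction (the checkpoint is a placeholder; what matters is that GPU:0 ends up close to full after loading):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder checkpoint -- any model whose shards (nearly) fill GPU:0 after
# loading across 4x24 GB A10Gs reproduces the OOM.
model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# During init_quant() the embeddings and the first decoder layer are moved to
# get_best_device() == "cuda:0", which raises a CUDA OOM even though the other
# three GPUs still have free memory.
model.quantize(tokenizer, quant_config=quant_config)
```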