Experiment with implementing AWQ for BERT models #4
Comments
Would love this for image captioning with quantized speedup.
Hi @casper-hansen, I would be curious to implement this for https://github.com/michaelfeil/infinity. Do you see any blockers regarding encoder-only architectures? Will the GEMM kernels work for non-causal masked LMs?
@michaelfeil AWQ/GEMM kernels can work for any linear layer. However, there is a challenge in applying AWQ to BERT models because they lack some of the scaling paths we rely on. For example, we would usually scale from a layernorm into a following linear layer. See more about the scaling of layers here: I also created a PR for better scaling for Mixtral, which may be interesting to you:
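The layernorm-to-linear scaling mentioned above can be illustrated with a small numeric sketch. This is not AutoAWQ's actual implementation, just a NumPy demonstration (with made-up shapes and scales) of why per-channel AWQ scales can be folded into a layernorm's weight/bias and the next linear layer's weight columns without changing the output:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard layernorm over the last dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
d_in, d_out = 8, 4                       # hypothetical toy dimensions
x = rng.normal(size=(2, d_in))           # activations entering the norm
gamma = rng.normal(size=d_in)            # layernorm weight
beta = rng.normal(size=d_in)             # layernorm bias
W = rng.normal(size=(d_out, d_in))       # weight of the following linear layer

# Hypothetical per-channel AWQ scales (in AWQ they are chosen from
# activation statistics; here they are random just for the demo).
s = rng.uniform(0.5, 2.0, size=d_in)

# Original computation: layernorm followed by the linear layer.
ref = layernorm(x, gamma, beta) @ W.T

# AWQ-style folding: divide the norm's affine parameters by s,
# multiply the matching input columns of W by s. The product is unchanged,
# but W * s is what actually gets quantized, at a friendlier scale.
folded = layernorm(x, gamma / s, beta / s) @ (W * s).T

assert np.allclose(ref, folded)
```

This is exactly why the scaling is "free" when a layernorm feeds a linear layer, and why architectures without such a preceding scalable op make AWQ harder to apply.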
If we can speed up BERT models, we could significantly increase throughput for many use cases. A good first experiment would be with SentenceTransformers.