Experiment with implementing AWQ for BERT models #4

Open
casper-hansen opened this issue Aug 25, 2023 · 3 comments
Labels
help wanted

Comments

@casper-hansen
Owner

If we can speed up the BERT model, we will significantly increase throughput for many use cases. Experiment with SentenceTransformers first.
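As a rough illustration of the kind of throughput experiment this points at, here is a minimal sketch of a SentenceTransformers embedding baseline. The model name, batch size, and workload are illustrative assumptions, not part of AutoAWQ; an AWQ-quantized backbone would be swapped in and compared against this baseline.

```python
import time
from sentence_transformers import SentenceTransformer

# Illustrative baseline: measure embedding throughput of a stock BERT-style
# encoder. An AWQ-quantized backbone would be dropped in for comparison.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = ["AWQ for encoder models"] * 1024  # dummy workload

start = time.perf_counter()
embeddings = model.encode(sentences, batch_size=64)
elapsed = time.perf_counter() - start

print(f"{len(sentences) / elapsed:.1f} sentences/s, dim={embeddings.shape[1]}")
```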

casper-hansen added the help wanted label on Sep 6, 2023
casper-hansen mentioned this issue on Sep 14, 2023
@z3ugma

z3ugma commented Nov 25, 2023

Would love this for image captioning with a quantized speedup. The kosmos-2 model from Microsoft would be another good candidate.

@michaelfeil

Hey @casper-hansen, I would be curious to implement this for https://github.com/michaelfeil/infinity. Do you see any roadblockers with regard to encoder-only architectures? Will the GEMM kernels work for non-causal masked LMs?

@casper-hansen
Owner Author

@michaelfeil The AWQ/GEMM kernels can work for any linear layer. However, applying AWQ to BERT models is challenging because AutoAWQ lacks some of the required scaling methods. For example, we would usually scale from a layernorm into a linear layer.

See more about the scaling of layers here:
https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/scale.py
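For reference, here is a minimal sketch of the layernorm-to-linear scaling pattern described above; the function name and signature are illustrative, not the actual scale.py API. The idea is that per-channel scales are folded out of the layernorm's affine parameters and into the input channels of the linear layers it feeds, so the network output is unchanged while the linear weights become easier to quantize.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def scale_ln_to_linears(ln: nn.LayerNorm, linears: list[nn.Linear], scales: torch.Tensor):
    """Fold per-channel AWQ scales into a LayerNorm and the linears it feeds.

    Dividing the LayerNorm's affine parameters by `scales` shrinks its output,
    and multiplying the input channels of each downstream linear layer by
    `scales` compensates, leaving the overall function unchanged.
    """
    scales = scales.to(ln.weight.device)

    # Shrink the LayerNorm's affine parameters.
    ln.weight.div_(scales)
    if ln.bias is not None:
        ln.bias.div_(scales)

    # Grow the input channels of every linear layer fed by this LayerNorm.
    for fc in linears:
        fc.weight.mul_(scales.view(1, -1))
```

The missing piece for BERT is wiring up which layernorm/linear pairs inside the encoder block should be rebalanced this way.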

I also created a PR for better scaling for Mixtral, which may be interesting to you:
#301
