Experiment with implementing AWQ for BERT models #4
Comments
Would love this for image captioning with quantized speedup.
Hi @casper-hansen, I would be curious to implement this for https://github.com/michaelfeil/infinity. Do you see any blockers regarding encoder-only architectures? Will the GEMM kernels work for non-causal masked LMs?
@michaelfeil AWQ/GEMM kernels can work for any linear layer. However, there is a challenge in applying AWQ to BERT models because they lack some of the scaling paths we rely on. For example, we would usually scale from a layernorm into a following linear layer. See more about the scaling of layers here: I also created a PR for better scaling for Mixtral, which may be interesting to you:
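The layernorm-to-linear scaling mentioned above can be illustrated with a small numeric sketch. This is not AutoAWQ's actual implementation, just a NumPy demonstration (with made-up shapes and scales) of why per-channel AWQ scales can be folded into a layernorm's weight/bias and the next linear layer's weight columns without changing the output:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard layernorm over the last dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
d_in, d_out = 8, 4                       # hypothetical toy dimensions
x = rng.normal(size=(2, d_in))           # activations entering the norm
gamma = rng.normal(size=d_in)            # layernorm weight
beta = rng.normal(size=d_in)             # layernorm bias
W = rng.normal(size=(d_out, d_in))       # weight of the following linear layer

# Hypothetical per-channel AWQ scales (in AWQ they are chosen from
# activation statistics; here they are random just for the demo).
s = rng.uniform(0.5, 2.0, size=d_in)

# Original computation: layernorm followed by the linear layer.
ref = layernorm(x, gamma, beta) @ W.T

# AWQ-style folding: divide the norm's affine parameters by s,
# multiply the matching input columns of W by s. The product is unchanged,
# but W * s is what actually gets quantized, at a friendlier scale.
folded = layernorm(x, gamma / s, beta / s) @ (W * s).T

assert np.allclose(ref, folded)
```

This is exactly why the scaling is "free" when a layernorm feeds a linear layer, and why architectures without such a preceding scalable op make AWQ harder to apply.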
If we can speed up BERT models, we could significantly increase throughput for many use cases. A good first experiment would be with SentenceTransformers.