Is it possible to use AirLLM with a quantized input model? #117
I just re-read the README and learned about the compression option! However, it doesn't quite work; I get this error:
I tried changing that. Reading the bitsandbytes docs, it says bitsandbytes is a CUDA library, so I'm guessing this compression feature is only meant for CUDA machines. They're working on Mac support, but it's not done yet. Unfortunate! Hopefully there's a way to quantize the input model instead.
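For context, this is roughly the invocation from the README that trips over bitsandbytes on my machine (the model name is just an example):

```python
# Sketch of AirLLM's documented compression option (per the README).
# Passing compression pulls in bitsandbytes, which is CUDA-only,
# hence the failure on Apple Silicon.
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "garage-bAInd/Platypus2-70B-instruct",  # example 70B checkpoint
    compression='4bit',                     # block-wise quantization via bitsandbytes
)
```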
Looking at the code more, it looks like AirLLM only supports the PyTorch and safetensors file formats. This might work if I can get a quantized model into one of those.
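If it comes to converting, something like this standard transformers call should produce safetensors shards (the model name is an example; whether AirLLM's loader then accepts a quantized state dict in this layout is exactly the open question):

```python
# Sketch: write a Hugging Face checkpoint out as safetensors shards,
# one of the two on-disk formats AirLLM appears to read.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.save_pretrained("llama-2-7b-safetensors", safe_serialization=True)
```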
Will add.
Hi there! Thanks for this amazing library. I was able to run a 70B model on my M2 MacBook Pro!
I was able to get about one token every 100 seconds, which is almost good enough for my overnight tasks; I'm hoping I can get it down to 20 seconds per token, though.
Is it possible to quantize the input model to make it faster?
I've tried quantizing with llama.cpp, but I think its output format is wrong for AirLLM. I see that PyTorch has a way to quantize, but I can't figure out how to do it with AutoModel.
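For what it's worth, here's a minimal sketch of PyTorch dynamic quantization on a Hugging Face model (the model name is an example; I'm not sure AirLLM's layer-by-layer loader can consume the resulting state dict, since the Linear modules get replaced):

```python
# Sketch: PyTorch dynamic int8 quantization of a Hugging Face model.
# Only nn.Linear weights are quantized; embeddings stay in fp32.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
qmodel = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # module types whose weights become int8
    dtype=torch.qint8,
)
```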
Any pointers in the right direction would help. Thanks!