Unexpected ppl diff #116
This issue is documented in our paper (https://arxiv.org/pdf/2309.05516v3) in Table 14, with a detailed explanation in Section 4.1. We hypothesize that perplexity is highly sensitive to outliers; however, our limited tests did not show a significant impact in real deployment. To avoid this issue, setting the minmax lr to 2.0/iterations could be a solution, based on my experiments with this model.
Besides, if your GPU memory is sufficient, you could set `--disable_gpu_memory_usage`, which typically gives a 1.5x-2x speedup based on my experiments.
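A minimal sketch of how the two suggestions above combine on the command line. This assumes this version of main.py accepts `--minmax_lr` alongside the flags discussed here, and `<model_path>` is a placeholder for your local checkpoint; with iters=1000, the suggested minmax lr is 2.0/1000 = 0.002:

```bash
# Sketch only: flag names are taken from the discussion above and are not
# verified against every version of main.py.
# minmax lr = 2.0 / iters = 2.0 / 1000 = 0.002
python3 main.py \
  --model_name <model_path> \
  --bits 4 \
  --group_size 128 \
  --iters 1000 \
  --minmax_lr 0.002 \
  --disable_gpu_memory_usage
```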
Cool! Thanks for your help!
I'm now trying to quantize llama2-7b under the w4a16g128 setting.
The script is:
```bash
python3 main.py \
  --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model \
  --device 0 \
  --group_size 128 \
  --bits 4 \
  --iters 1000 \
  --deployment_device 'fake,cpu,gpu' \
  --output_dir "/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round"
```
The result (perplexity, lower is better) is:

| Model | wikitext2 | c4 |
|---|---|---|
| llama2-7b-fp16 | 5.4721 | 6.9727 |
| llama2-7b-w4a16g128 (auto_round) | 10.4401 | 7.4204 |
Any insight here?