
Unexpected ppl diff #116

Open
YihengBrianWu opened this issue May 23, 2024 · 3 comments

Comments

@YihengBrianWu

I'm now trying to quantize llama2-7b under the w4a16g128 setting.
The script is:

python3 main.py \
    --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model \
    --device 0 \
    --group_size 128 \
    --bits 4 \
    --iters 1000 \
    --deployment_device 'fake,cpu,gpu' \
    --output_dir "/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round"

The results are:

                                  wikitext2   c4
llama2-7b-fp16                    5.4721      6.9727
llama2-7b-w4a16g128 (auto_round)  10.4401     7.4204

Any insight here?

@wenhuach21
Contributor

This issue is documented in our paper (https://arxiv.org/pdf/2309.05516v3), Table 14, with a detailed explanation in Section 4.1. We hypothesize that perplexity is highly sensitive to outliers; however, our limited tests did not show a significant impact in real deployment. To avoid this issue, setting the minmax lr to 2.0/iterations could be a solution, based on my experiments with this model.
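
As a sketch, the command above could be rerun with that override, assuming the example main.py exposes a --minmax_lr flag (with --iters 1000, 2.0/iterations works out to 0.002):

python3 main.py \
    --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model \
    --device 0 \
    --group_size 128 \
    --bits 4 \
    --iters 1000 \
    --minmax_lr 0.002 \
    --deployment_device 'fake,cpu,gpu' \
    --output_dir "/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round"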

@wenhuach21
Contributor

wenhuach21 commented May 23, 2024

Besides, if your GPU memory is sufficient, you could set --disable_gpu_memory_usage, which typically gives a 1.5x-2x speedup based on my experiments.
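
For example, appended to the command from the first post (all other arguments unchanged; whether this helps depends on how much GPU memory is available):

python3 main.py \
    --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model \
    --device 0 \
    --group_size 128 \
    --bits 4 \
    --iters 1000 \
    --deployment_device 'fake,cpu,gpu' \
    --disable_gpu_memory_usage \
    --output_dir "/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round"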

@YihengBrianWu
Author

> Besides, if your GPU memory is sufficient, you could set --disable_gpu_memory_usage, which typically gives a 1.5x-2x speedup based on my experiments.

Cool! Thanks for your help!
