
IQ1_M: 1.75 bpw quantization #6302

Merged: 24 commits into master from ik/iq1_m_new, Mar 26, 2024
Conversation

ikawrakow (Contributor)

While waiting for the 1.58 bit era...

Compared to IQ1_S:

  • Same codebook with 2048 entries, so 11 bits per group of 8 weights: 11/8 bpw
  • Blocks of 16 weights instead of the blocks of 32 used by IQ1_S. Scales are 3 bits, so 3/16 bpw
  • A separate shift for each group of 8 weights instead of one shift per 32 weights. This costs 1/8 bpw

Along with the fp16 super-block scale, this ends up being exactly 1.75 bpw.
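
A quick arithmetic check of the 1.75 bpw figure (a sketch, assuming the usual 256-weight super-block, i.e. QK_K = 256, so the fp16 super-block scale contributes 16/256 bpw):

# Per-weight bit cost of IQ1_M, following the breakdown above.
codebook_bits   = 11 / 8    # 2048-entry codebook: 11 bits per group of 8 weights
block_scales    = 3 / 16    # 3-bit scale per block of 16 weights
group_shifts    = 1 / 8     # 1 shift bit per group of 8 weights
superblock_fp16 = 16 / 256  # fp16 scale per super-block (assumed 256 weights)
print(codebook_bits + block_scales + group_shifts + superblock_fp16)  # 1.75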

The table below shows a PPL comparison between IQ1_S and IQ1_M (this PR). Context is 2048 tokens for LLaMA-v1 and 4096 tokens for all other models. The last column shows the rms_norm_epsilon used to generate the PR results.

Model          PPL (IQ1_S)   PPL (IQ1_M)   rms_norm_epsilon
LLaMA-v1-7B    12.83         10.13         5e-5
LLaMA-v1-13B   8.338         7.236         4e-5
LLaMA-v1-30B   6.722         6.053         2.5e-5
LLaMA-v2-7B    11.86         9.335         1.875e-5
LLaMA-v2-13B   7.741         6.842         2e-5
LLaMA-v2-70B   5.211         4.829         3e-5
Mistral-7B     10.42         8.162         default
Mixtral8x7B    6.168         5.574         default
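
Not part of the original table, but as a quick derived summary, the relative PPL reduction in each row works out as follows (a small sketch, values copied from the table above):

# PPL(IQ1_S), PPL(IQ1_M) pairs from the table above.
ppl = {
    "LLaMA-v1-7B":  (12.83, 10.13),
    "LLaMA-v1-13B": (8.338, 7.236),
    "LLaMA-v1-30B": (6.722, 6.053),
    "LLaMA-v2-7B":  (11.86, 9.335),
    "LLaMA-v2-13B": (7.741, 6.842),
    "LLaMA-v2-70B": (5.211, 4.829),
    "Mistral-7B":   (10.42, 8.162),
    "Mixtral8x7B":  (6.168, 5.574),
}
for model, (iq1_s, iq1_m) in ppl.items():
    print(f"{model}: {100 * (1 - iq1_m / iq1_s):.1f}% lower PPL than IQ1_S")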

@Nexesenex Looking forward to your improved 2.0 / sub-2.0 bpw quantization mixes.

Nexesenex (Contributor) commented Mar 25, 2024

@ikawrakow Thank you so much, man!

I was almost done with my IQ1_S strategy. Mixtral caused me trouble (it's heavy to requantize it endlessly), but I found my mistake and it now works as intended, with sizeable improvements in perplexity and often on ARC benchmarks.

I will PR tonight or tomorrow an IQ1_XS LLAMA_FTYPE, which offers almost comparable quality to your current IQ1_S LLAMA_FTYPE with a slight reduction in size, to act as a new "demo of the smallest quant", before being refactored on top of the IQ1_M GGML_TYPE in a later PR.

As for the IQ1_S LLAMA_FTYPE I revamped, it's almost ready as well and will follow shortly after in another PR, before likewise being refactored on top of the IQ1_M GGML_TYPE in a later PR.

Then I'll (and/or you, and/or anyone, lol) work on a derived IQ1_M LLAMA_FTYPE to make the best sub-2 bpw quant possible.

Review threads on ggml-cuda/convert.cu (outdated, resolved).

ggml.h (outdated):
GGML_TYPE_I16 = 26,
GGML_TYPE_I32 = 27,
GGML_TYPE_I64 = 28,
GGML_TYPE_F64 = 29,
Owner:

Need to also update the enum in gguf-py/gguf/constants.py:

class GGMLQuantizationType(IntEnum):
F32 = 0
F16 = 1
Q4_0 = 2
Q4_1 = 3
Q5_0 = 6
Q5_1 = 7
Q8_0 = 8
Q8_1 = 9
Q2_K = 10
Q3_K = 11
Q4_K = 12
Q5_K = 13
Q6_K = 14
Q8_K = 15
IQ2_XXS = 16
IQ2_XS = 17
IQ3_XXS = 18
IQ1_S = 19
IQ4_NL = 20
IQ3_S = 21
IQ2_S = 22
IQ4_XS = 23
I8 = 24
I16 = 25
I32 = 26
I64 = 27
F64 = 28

Also, move GGML_TYPE_IQ1_M to the end of the enum to keep backwards compatibility with any GGUF files that might have started using the integer or 64-bit types.
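
A minimal sketch of the requested addition to gguf-py/gguf/constants.py, assuming the new type ends up at value 29 (the slot after F64, mirroring GGML_TYPE_IQ1_M being moved to the end of the C enum):

from enum import IntEnum

class GGMLQuantizationType(IntEnum):
    # ... existing entries F32 = 0 through IQ4_XS = 23 stay exactly as listed above ...
    I8    = 24
    I16   = 25
    I32   = 26
    I64   = 27
    F64   = 28
    IQ1_M = 29  # assumed value: appended last so existing GGUF files keep their numeric meaning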

ikawrakow merged commit 55c1b2a into master Mar 26, 2024
51 of 57 checks passed
ikawrakow deleted the ik/iq1_m_new branch March 26, 2024 14:21
Nexesenex (Contributor)

@ikawrakow, the IQ1_M quant is about twice as slow to quantize as IQ1_S (on an i7-6700K with AVX and AVX2 enabled). Is there anything that can be done about that?

ikawrakow (Contributor, Author)

Sorry, I did not see a way to make it more efficient. It is doing 4X the work, so being only 2X slower is not too bad. Both IQ1_S and IQ1_M use the exact solution of the mixed-integer optimization problem that minimizes the difference between the fp16 weights and the ternary quantization used by these quants. I have found that heuristics that run faster but are not guaranteed to find the best solution tend to produce significantly worse quantization. The solution method in IQ1_S is quite effective, being O(BS^2), where BS is the block size (32 weights). But in IQ1_M we have a separate shift for each group of 8, so the only solution technique I see is O(BS^3) (though now BS = 16, hence the 4X).
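
A quick check of the "4X the work" figure, using only the stated per-block complexities and ignoring constant factors (a sketch, not the actual quantization code):

# IQ1_S: exact O(BS^2) search over one block of 32 weights.
# IQ1_M: O(BS^3) search over one block of 16 weights (separate shift per group of 8).
iq1_s_ops_per_block = 32 ** 2   # 1024
iq1_m_ops_per_block = 16 ** 3   # 4096
print(iq1_m_ops_per_block / iq1_s_ops_per_block)  # 4.0, the "4X the work" per block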

ikawrakow (Contributor, Author)

@Nexesenex

I do have another version of IQ1_M that uses 1.8125 bpw. Quantization is much faster (basically the same as IQ1_S), and the PPL vs. size tradeoff is better (see the graph below, which shows results for LLaMA-v2-70B).

[Graph iq1_70: PPL vs. model size for LLaMA-v2-70B]

The reason I'm reluctant to make a PR is that it uses an even larger codebook (4096 entries vs. 2048 in IQ1_M on master). CUDA on my GPU (RTX 4080) handles the associated large lookup table quite well: performance decreases only by ~4%, from 198 t/s to 190 t/s for a 7B model. But on my Ryzen 5950X CPU, the AVX2 implementation drops from 24 t/s to 15 t/s. I have not even bothered implementing it for Apple Silicon, but based on experience with other quants, I'm expecting a complete disaster there.
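
A rough way to see why the CPU path suffers: the codebook acts as a lookup table in the dot product, and doubling its entry count roughly doubles the cache footprint. A back-of-the-envelope estimate, assuming for illustration (this is an assumption, not the exact storage layout) 8 bytes per codebook entry:

entry_bytes = 8                          # assumed: one 8-weight codebook entry packed into 8 bytes
print(2048 * entry_bytes / 1024, "KiB")  # 16.0 KiB for the 2048-entry codebook on master
print(4096 * entry_bytes / 1024, "KiB")  # 32.0 KiB for the 4096-entry variant, roughly a typical L1 data cache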

Nexesenex (Contributor) commented Mar 29, 2024

@ikawrakow I understand that speed on all platforms has its relative importance in the final choices, as does size, but it's a pity to leave such jewels on a shelf!

Could you eventually share the quant as a "CUDA-optimized quant" for those interested in using it?

Ultimately, even if the "one quant for all architectures" approach makes sense for the sake of compatibility, the differences between architectures should also be accounted for, so that we rely not only on the "common denominator" but also on the "best for each case", in order to have SOTA quants for the broad common cases, CUDA being one of them.

In my opinion, if llama.cpp doesn't adopt this approach, others eventually will.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* iq1_m: basics

* iq1_m: basics-2

* iq1_m: CUDA dequantize works

Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.

* iq1_m: separate shifts for each group of 8 in a block

We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105

Not bad, but slightly higher than
  sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
 PPL = 9.14 for LLaMA-v2-7B
 PPL = 6.63 for LLaMA-v2-13B

* iq1_m: go to 3-bit scales

There is slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.

We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw

* iq1_m: scalar dot product

* iq1_m: AVX2 dot product

* iq1_m: very slightly faster AVX2 dot product

* iq1_m: ARM_NEON dot product

Works, but very slow (10.5 t/s)

* iq1_m: Metal - dequantize works, dot product does not

* iq1_m: Metal now works

About the same performance as iq1_s.

* iq1_m: minor

* iq1_m: checking pure iq1_m quantization

It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.

* iiq1_m: slightly faster ARM_NEON dot product

10.5 t/s -> 11.65 t/s

* iq1_m: faster ARM_NEON dot product

11.65 t/s -> 14.9 t/s

* iq1_m: another minor ARM_NEON dot product improvement

14.9 -> 15.0 t/s

* iq1_m: small PPL improvement via super-block scale adjustment

After quantizing block scales redo the super-block scale fit.

PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B  ) = 8.1624

* iq1_m: adapt to CUDA refactoring

* iq1_m: remove unused variable

We have progressed to warnings being errors.

* iq1_m: add to backend-ops tests

* iq1_m: fix Windows ARM

* iq1_m: use common definition of iq1m_scale_t

* cuda: assert -> NO_DEVICE_CODE

* iq1_M: PR comments

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
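
As a side note on the geometric-mean expectation in the "separate shifts" commit note above: since IQ1_M sits halfway between IQ1_S and IQ2_XXS in bpw, the commit estimates PPL(IQ1_M) as sqrt(PPL(IQ1_S) * PPL(IQ2_XXS)). The IQ2_XXS perplexities themselves are not listed in this PR, but the stated expectation can be cross-checked against the numbers that are (a small sketch; 11.86 is PPL(IQ1_S) for LLaMA-v2-7B from the table in the PR description):

import math

ppl_iq1_s = 11.86   # LLaMA-v2-7B, IQ1_S, from the table in the PR description
expected  = 9.14    # stated expectation for IQ1_M on LLaMA-v2-7B
achieved  = 9.2810  # measured after the "separate shifts" commit

# Back-solve the PPL(IQ2_XXS) implied by expected = sqrt(PPL(IQ1_S) * PPL(IQ2_XXS)).
implied_iq2_xxs = expected ** 2 / ppl_iq1_s
print(round(implied_iq2_xxs, 2))                         # ~7.04
print(round(math.sqrt(ppl_iq1_s * implied_iq2_xxs), 2))  # 9.14, recovering the stated expectation
print(round(achieved / expected, 3))                     # ~1.015, i.e. ~1.5% above the interpolation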

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024
tybalex pushed a commit to tybalex/function.cpp that referenced this pull request Apr 17, 2024