Adding Support for Custom Qwen2moe Architectures with mergekit-qwen2 #6453

Draft · DisOOM wants to merge 3 commits into master

Conversation

@DisOOM commented Apr 3, 2024

Statement: This has nothing to do with the fine-grained MoE architecture in Qwen/Qwen1.5-MoE-A2.7B. It is more akin to a traditional MoE, except that its experts are derived from the qwen2 (qwen1.5) model.
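
To make the structure concrete: each decoder layer of such a merge keeps a small router on top of several complete qwen2 FFN blocks, and the layer output is the routing-weighted sum of the top-k experts' outputs (in the model below, n_expert = 2 and n_expert_used = 2). The following is a minimal, self-contained C++ sketch of that routing step only; the names, shapes, the stand-in expert function, and the softmax-over-selected-scores detail are illustrative assumptions, not code from llama.cpp or mergekit-qwen2.

```cpp
// Minimal sketch of one "traditional" MoE feed-forward layer of the kind
// mergekit-qwen2 produces: a router scores n_expert complete qwen2 FFN blocks,
// the top-k are selected, and their outputs are mixed by the normalized scores.
// Everything here (shapes, names, the stand-in expert_ffn) is illustrative only.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// stand-in for a full qwen2 FFN block (gate/up/down projections)
static Vec expert_ffn(int expert_id, const Vec & x) {
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * (1.0f + 0.1f * (float) expert_id);
    }
    return y;
}

static Vec moe_layer(const Vec & x, const std::vector<Vec> & router_w, int n_expert_used) {
    const int n_expert = (int) router_w.size();

    // 1. router logits: one score per expert (dot product with that expert's router row)
    std::vector<std::pair<float, int>> scores(n_expert);
    for (int e = 0; e < n_expert; ++e) {
        float s = 0.0f;
        for (size_t i = 0; i < x.size(); ++i) s += router_w[e][i] * x[i];
        scores[e] = {s, e};
    }

    // 2. select the top-k experts by score
    std::partial_sort(scores.begin(), scores.begin() + n_expert_used, scores.end(),
                      std::greater<std::pair<float, int>>());

    // 3. softmax over the selected experts' scores to get mixing weights
    float sum = 0.0f;
    std::vector<float> w(n_expert_used);
    for (int k = 0; k < n_expert_used; ++k) {
        w[k] = std::exp(scores[k].first - scores[0].first);
        sum += w[k];
    }

    // 4. weighted sum of the selected experts' outputs
    Vec out(x.size(), 0.0f);
    for (int k = 0; k < n_expert_used; ++k) {
        const Vec y = expert_ffn(scores[k].second, x);
        for (size_t i = 0; i < x.size(); ++i) out[i] += (w[k] / sum) * y[i];
    }
    return out;
}

int main() {
    const Vec x = {0.5f, -1.0f, 0.25f, 2.0f};
    const std::vector<Vec> router_w = {{0.1f, 0.2f, 0.3f, 0.4f},
                                       {0.4f, 0.3f, 0.2f, 0.1f}};  // n_expert = 2
    const Vec y = moe_layer(x, router_w, /*n_expert_used=*/2);
    for (float v : y) printf("%.3f ", v);
    printf("\n");
    return 0;
}
```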

I was previously using mergekit-moe to merge qwen1.5 models into an MoE, but the resulting models were corrupted after being converted to the GGUF format.
I then discovered a custom mergekit script that successfully merges them into a qwen2 MoE: https://github.com/Aratako/mergekit-qwen2. Following the example of #4912, I made some modifications to llama.cpp that enable it to correctly convert, quantize, and run MoEs merged with this custom script.
These changes work well on older versions of llama.cpp, but with the latest version the model converts and quantizes correctly yet fails to run. I believe the issue is an incompatibility with the changes made to llama.cpp in #6122, but I am unsure how to resolve it.

I am new to coding and this is my first PR, so please be lenient.

Converting with convert-hf-to-gguf.py and quantizing with quantize.exe completed without issues, but running main.exe produced the following error:

PS D:\llama.cpp\llama.cpp> ./build/bin/Release/main.exe -m D:/model/ggml-model-f16.gguf -n 128
Log start
main: build = 2585 (f87f7b89)
main: built with MSVC 19.39.33523.0 for x64
main: seed  = 1712122664
llama_model_loader: loaded meta data with 21 key-value pairs and 643 tensors from D:/model-merge/Merged/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Merged
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 40
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 13696
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 40
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                         qwen2.expert_count u32              = 2
llama_model_loader: - kv  11:                    qwen2.expert_used_count u32              = 2
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 151645
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - type  f32:  201 tensors
llama_model_loader: - type  f16:  442 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 2
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 22.58 B
llm_load_print_meta: model size       = 42.07 GiB (16.00 BPW)
llm_load_print_meta: general.name     = Merged
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151645 '<|im_end|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size =    0.25 MiB
llm_load_tensors:        CPU buffer size = 43074.71 MiB
................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   400.00 MiB
llama_new_context_with_model: KV self size  =  400.00 MiB, K (f16):  200.00 MiB, V (f16):  200.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   353.76 MiB
llama_new_context_with_model: graph nodes  = 2164
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: D:\llama.cpp\llama.cpp:9701: lctx.inp_out_ids && "every model that can must skip unused outputs"
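
My best guess, from reading the assert, is that the latest master only computes the outputs it actually needs and expects every graph builder to gather those rows through an inp_out_ids tensor before the final layers, which my custom qwen2-MoE graph does not do. Below is a hedged sketch of the pattern I see in the other build_*() functions in llama.cpp around this point; the helper name build_inp_out_ids() and the exact placement are copied from those functions and may not be the right fix for the MoE branch.

```cpp
// Sketch (not a tested patch): inside the custom qwen2-MoE graph builder,
// on the last layer, keep only the rows whose outputs are actually needed
// before the feed-forward / output computation, as the other architectures do.
for (int il = 0; il < n_layer; ++il) {
    // ... attention block for layer il ...

    if (il == n_layer - 1) {
        // skip computing unused outputs (this is what the assert checks for)
        struct ggml_tensor * inp_out_ids = build_inp_out_ids();
        cur   = ggml_get_rows(ctx0, cur,   inp_out_ids);
        inpSA = ggml_get_rows(ctx0, inpSA, inp_out_ids);
    }

    // ... MoE feed-forward block for layer il ...
}
```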

github-actions bot (Contributor) commented Apr 3, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 503 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=9295.2ms p(90)=26525.97ms fails=0, finish reason: stop=503 truncated=0
  • Prompt processing (pp): avg=241.95tk/s p(90)=732.6tk/s total=200.08tk/s
  • Token generation (tg): avg=98.97tk/s p(90)=277.24tk/s total=130.21tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=115f49a08a1c9fd59c60ed1425827d9ae2614565
Time series (charts omitted): llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for the bench-server-baseline run on Standard_NC4as_T4_v3 (duration=10m, 503 iterations).

@maziyarpanahi commented

Thanks @DisOOM for creating this PR based on our discussion regarding why MoE models based on Qwen don't work properly.

I will tag @compilade and @slaren, who were involved in the PR you mentioned. However, have you tried using this PR to see if MoE models based on the Qwen architecture work properly? #6387

I am testing #6387 now for DBRX, but if it is meant to solve MoE issues in general (I am not sure whether there is a difference between a MergeKit MoE and others like Qwen, Mixtral, or DBRX), I would personally try it to see whether my quantized Qwen MoE model works.

@ggerganov (Owner) commented

Qwen MoE models should be able to work after merging #6387 and then #6074

DBRX models likely also depend on #6387, plus we need conversion scripts and a compute graph implementation for them.

@DisOOM (Author) commented Apr 3, 2024

> However, have you tried using this PR to see if MoE models based on the Qwen architecture work properly? #6387

I haven't tried this PR yet. I will give it a try later.

@maziyarpanahi commented

I have pulled and used the latest changes from the master branch. I have successfully converted this model to an fp16 GGUF: https://huggingface.co/MaziyarPanahi/Qwen1.5-8x7b-v0.1

It works fine and produces coherent output. However, any model quantized from this fp16 results in the following error:

..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   343.26 MiB
llama_new_context_with_model: graph nodes  = 1638
llama_new_context_with_model: graph splits = 1
GGML_ASSERT: ggml.c:11015: wdata == wdata_src1_end
Aborted (core dumped)

@ggerganov I am not sure what causes this error. This is an MoE made by MergeKit based on Qwen models (one of those situations where the fp16 GGUF model works fine, but the quantized one either crashes or outputs nonsense).

@mofosyne added the labels "review complexity : high" (generally requires in-depth knowledge of LLMs or GPUs) and "model" (model specific) on May 10, 2024.