Model loading failed with --gpulayer 80 on Metal #744

Closed
beebopkim opened this issue Mar 12, 2024 · 21 comments
Labels
bug Something isn't working

Comments

@beebopkim

Commit hash: edb05e7
Branch: concedo_experimental

With --gpulayers 80:

% python koboldcpp.py --noblas --gpulayers 80 --model $LLM_MODEL_Q/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf
***
Welcome to KoboldCpp - Version 1.61
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model='/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf', model_param='/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=4, usecublas=None, usevulkan=None, useclblast=None, noblas=True, gpulayers=80, tensor_split=None, contextsize=2048, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=4, lora=None, smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, onready='', benchmark=None, multiuser=0, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, sdconfig=None, mmproj='', password=None)
==========
Loading model: /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf 
[Threads: 4, BlasThreads: 4, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
llama_model_loader: loaded meta data with 24 key-value pairs and 723 tensors from /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_1, some F16
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 23.71 GiB (2.95 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.64 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloaded 80/81 layers to GPU
llm_load_tensors:        CPU buffer size = 24282.14 MiB
llm_load_tensors:      Metal buffer size = 24200.09 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 2128
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: failed to initialize Metal backend
gpttype_load_model: error: failed to load model '/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf'
Load Text Model OK: False
Could not load text model: /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf
% 

Without --gpulayers 80:

% python koboldcpp.py --noblas --model $LLM_MODEL_Q/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf 
***
Welcome to KoboldCpp - Version 1.61
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model='/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf', model_param='/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=4, usecublas=None, usevulkan=None, useclblast=None, noblas=True, gpulayers=0, tensor_split=None, contextsize=2048, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=4, lora=None, smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, onready='', benchmark=None, multiuser=0, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, sdconfig=None, mmproj='', password=None)
==========
Loading model: /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf 
[Threads: 4, BlasThreads: 4, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
llama_model_loader: loaded meta data with 24 key-value pairs and 723 tensors from /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q4_1, some F16
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 23.71 GiB (2.95 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.32 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors:        CPU buffer size = 24282.14 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 2128
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   665.00 MiB
llama_new_context_with_model: KV self size  =  665.00 MiB, K (f16):  332.50 MiB, V (f16):  332.50 MiB
llama_new_context_with_model:        CPU input buffer size   =    21.18 MiB
llama_new_context_with_model:        CPU compute buffer size =   330.00 MiB
llama_new_context_with_model: graph splits (measure): 1
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

For comparison, here is the output of llama.cpp's server (commit hash 306d34b) run with -ngl 999:

% ./server -m /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf -ngl 999 -c 16384
{"build":2409,"commit":"306d34be","function":"main","level":"INFO","line":2732,"msg":"build info","tid":"0x1e27d5c40","timestamp":1710286276}
{"function":"main","level":"INFO","line":2739,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"0x1e27d5c40","timestamp":1710286276,"total_threads":10}
llama_model_loader: loaded meta data with 24 key-value pairs and 723 tensors from /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 32764
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   4:                          llama.block_count u32              = 80
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 10
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q2_K:  321 tensors
llama_model_loader: - type q3_K:  160 tensors
llama_model_loader: - type q5_K:   80 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 23.71 GiB (2.95 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.55 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 24200.12 MiB, (24200.19 / 49152.00)
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:      Metal buffer size = 24200.12 MiB
llm_load_tensors:        CPU buffer size =    82.03 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/******/test/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  5120.00 MiB, (29322.00 / 49152.00)
llama_kv_cache_init:      Metal KV buffer size =  5120.00 MiB
llama_new_context_with_model: KV self size  = 5120.00 MiB, K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    49.13 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  2144.02 MiB, (31466.02 / 49152.00)
llama_new_context_with_model:      Metal compute buffer size =  2144.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.00 MiB
llama_new_context_with_model: graph splits (measure): 2
{"function":"init","level":"INFO","line":700,"msg":"initializing slots","n_slots":1,"tid":"0x1e27d5c40","timestamp":1710286281}
{"function":"init","id_slot":0,"level":"INFO","line":712,"msg":"new slot","n_ctx_slot":16384,"tid":"0x1e27d5c40","timestamp":1710286281}
{"function":"main","level":"INFO","line":2828,"msg":"model loaded","tid":"0x1e27d5c40","timestamp":1710286281}
{"built_in":true,"chat_example":"[INST] You are a helpful assistant\nHello [/INST]Hi there</s>[INST] How are you? [/INST]","function":"main","level":"INFO","line":2853,"msg":"chat template","tid":"0x1e27d5c40","timestamp":1710286281}
{"function":"main","hostname":"127.0.0.1","level":"INFO","line":3494,"msg":"HTTP server listening","n_threads_http":"9","port":"8080","tid":"0x1e27d5c40","timestamp":1710286281}
{"function":"update_slots","level":"INFO","line":1647,"msg":"all slots are idle","tid":"0x1e27d5c40","timestamp":1710286281}

@LostRuins
Owner

LostRuins commented Mar 13, 2024

Is this a new issue? Did the same model load correctly previously?

Looking at your debug logs, I don't see metal being initialized.
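
When Metal does initialize, the load log should include ggml_metal_init lines, like these from the llama.cpp run you posted above:

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max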

Did you build with LLAMA_METAL=1 when compiling KoboldCpp?
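
For reference, a Metal build and launch on macOS normally looks roughly like this (the model path below is just a placeholder):

% make clean
% LLAMA_METAL=1 make -j
% python koboldcpp.py --gpulayers 80 --model /path/to/your-model.gguf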

@beebopkim
Author

Yes, this is a new problem.

When I first saw your reply, I thought I might have made a mistake. But now I have confirmed that this really is an issue.

I pulled the concedo_experimental branch up to date.

(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % git pull
remote: Enumerating objects: 47, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 47 (delta 30), reused 42 (delta 28), pack-reused 0
Unpacking objects: 100% (47/47), 429.47 KiB | 1.70 MiB/s, done.
From https://github.com/LostRuins/koboldcpp
   edb05e76..7a2de82c  concedo_experimental -> origin/concedo_experimental
   5174f9de..7a2de82c  concedo              -> origin/concedo
   48358b2e..306d34be  master               -> origin/master
 * [new tag]           v1.61                -> v1.61
 * [new tag]           v1.61.1              -> v1.61.1
Updating edb05e76..7a2de82c
Fast-forward
 CMakeLists.txt                           |    4 +-
 Makefile                                 |   22 +-
 colab.ipynb                              |    2 +-
 common/common.cpp                        |   18 +-
 examples/batched-bench/batched-bench.cpp |    2 +-
 examples/batched/batched.cpp             |    2 +-
 examples/main/main.cpp                   |    1 +
 examples/perplexity/perplexity.cpp       |    6 +-
 ggml-common.h                            |  410 +++++++++++++++++++++++++++++-
 ggml-cuda.cu                             |  280 ++-------------------
 ggml-metal.m                             |    2 +-
 ggml-metal.metal                         |  198 ++-------------
 ggml-quants.c                            |  173 +++++++------
 ggml-quants.h                            |  244 +-----------------
 ggml-sycl.cpp                            |  254 +++----------------
 gpttype_adapter.cpp                      |   20 +-
 klite.embd                               |  117 +++++++--
 koboldcpp.py                             |    2 +-
 llama.cpp                                |  113 ++++-----
 llama.h                                  |   14 +-
 model_adapter.cpp                        |    4 +
 model_adapter.h                          |    1 +
 unicode.cpp                              | 1672 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 unicode.h                                |  790 ++--------------------------------------------------------
 24 files changed, 2488 insertions(+), 1863 deletions(-)
 create mode 100644 unicode.cpp

And I built it with LLAMA_METAL=1.

(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % git rev-parse HEAD
7a2de82c96906ae7d331ce229948ebcf55601f7c
(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % make clean
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread
I LDFLAGS:   -ld_classic -framework Accelerate
I CC:       Apple clang version 15.0.0 (clang-1500.1.0.2.5)
I CXX:      Apple clang version 15.0.0 (clang-1500.1.0.2.5)

rm -vf *.o main sdmain quantize_gguf quantize_clip quantize_gpt2 quantize_gptj quantize_neox quantize_mpt quantize-stats perplexity embedding benchmark-matmult save-load-state gguf imatrix imatrix.exe gguf.exe main.exe quantize_clip.exe quantize_gguf.exe quantize_gptj.exe quantize_gpt2.exe quantize_neox.exe quantize_mpt.exe koboldcpp_default.dll koboldcpp_openblas.dll koboldcpp_failsafe.dll koboldcpp_noavx2.dll koboldcpp_clblast.dll koboldcpp_clblast_noavx2.dll koboldcpp_cublas.dll koboldcpp_hipblas.dll koboldcpp_vulkan.dll koboldcpp_vulkan_noavx2.dll koboldcpp_default.so koboldcpp_openblas.so koboldcpp_failsafe.so koboldcpp_noavx2.so koboldcpp_clblast.so koboldcpp_clblast_noavx2.so koboldcpp_cublas.so koboldcpp_hipblas.so koboldcpp_vulkan.so koboldcpp_vulkan_noavx2.so
(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % make clean                                                                                              
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread
I LDFLAGS:   -ld_classic -framework Accelerate
I CC:       Apple clang version 15.0.0 (clang-1500.1.0.2.5)
I CXX:      Apple clang version 15.0.0 (clang-1500.1.0.2.5)

rm -vf *.o main sdmain quantize_gguf quantize_clip quantize_gpt2 quantize_gptj quantize_neox quantize_mpt quantize-stats perplexity embedding benchmark-matmult save-load-state gguf imatrix imatrix.exe gguf.exe main.exe quantize_clip.exe quantize_gguf.exe quantize_gptj.exe quantize_gpt2.exe quantize_neox.exe quantize_mpt.exe koboldcpp_default.dll koboldcpp_openblas.dll koboldcpp_failsafe.dll koboldcpp_noavx2.dll koboldcpp_clblast.dll koboldcpp_clblast_noavx2.dll koboldcpp_cublas.dll koboldcpp_hipblas.dll koboldcpp_vulkan.dll koboldcpp_vulkan_noavx2.dll koboldcpp_default.so koboldcpp_openblas.so koboldcpp_failsafe.so koboldcpp_noavx2.so koboldcpp_clblast.so koboldcpp_clblast_noavx2.so koboldcpp_cublas.so koboldcpp_hipblas.so koboldcpp_vulkan.so koboldcpp_vulkan_noavx2.so
(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % LLAMA_METAL=1 make -j
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL
I CXXFLAGS: -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL
I LDFLAGS:   -ld_classic -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 15.0.0 (clang-1500.1.0.2.5)
I CXX:      Apple clang version 15.0.0 (clang-1500.1.0.2.5)

cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -c ggml.c -o ggml.o
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -c otherarch/ggml_v3.c -o ggml_v3.o
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -c otherarch/ggml_v2.c -o ggml_v2.o
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -c otherarch/ggml_v1.c -o ggml_v1.o
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c expose.cpp -o expose.o
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c common/common.cpp -o common.o
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c gpttype_adapter.cpp -o gpttype_adapter.o
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -c ggml-quants.c -o ggml-quants.o
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL -c ggml-alloc.c -o ggml-alloc.o
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL -c ggml-backend.c -o ggml-backend.o
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c examples/llava/llava.cpp -o llava.o
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c examples/llava/clip.cpp -o llavaclip.o
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c unicode.cpp -o unicode.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c common/grammar-parser.cpp -o grammar-parser.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL -c otherarch/sdcpp/sdtype_adapter.cpp -o sdcpp_default.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL -c ggml-metal.m -o ggml-metal.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c ggml.c -o ggml_v4_openblas.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c otherarch/ggml_v3.c -o ggml_v3_openblas.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc  -I.            -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -Ofast -DNDEBUG -std=c11   -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -DSD_USE_METAL  -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c otherarch/ggml_v2.c -o ggml_v2_openblas.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
In file included from expose.cpp:20:
./expose.h:32:8: warning: struct 'load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct load_model_inputs
       ^
./expose.h:34:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:35:15: note: const member 'blasthreads' will never be initialized
    const int blasthreads;
              ^
./expose.h:36:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:37:16: note: const member 'low_vram' will never be initialized
    const bool low_vram;
               ^
./expose.h:38:16: note: const member 'use_mmq' will never be initialized
    const bool use_mmq;
               ^
./expose.h:39:16: note: const member 'use_rowsplit' will never be initialized
    const bool use_rowsplit;
               ^
./expose.h:45:16: note: const member 'use_mmap' will never be initialized
    const bool use_mmap;
               ^
./expose.h:46:16: note: const member 'use_mlock' will never be initialized
    const bool use_mlock;
               ^
./expose.h:47:16: note: const member 'use_smartcontext' will never be initialized
    const bool use_smartcontext;
               ^
./expose.h:48:16: note: const member 'use_contextshift' will never be initialized
    const bool use_contextshift;
               ^
./expose.h:61:8: warning: struct 'generation_inputs' does not declare any constructor to initialize its non-modifiable members
struct generation_inputs
       ^
./expose.h:63:15: note: const member 'seed' will never be initialized
    const int seed;
              ^
./expose.h:67:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:68:15: note: const member 'max_length' will never be initialized
    const int max_length;
              ^
./expose.h:69:17: note: const member 'temperature' will never be initialized
    const float temperature;
                ^
./expose.h:70:15: note: const member 'top_k' will never be initialized
    const int top_k;
              ^
./expose.h:72:17: note: const member 'top_p' will never be initialized
    const float top_p;
                ^
./expose.h:74:17: note: const member 'typical_p' will never be initialized
    const float typical_p;
                ^
./expose.h:75:17: note: const member 'tfs' will never be initialized
    const float tfs;
                ^
./expose.h:76:17: note: const member 'rep_pen' will never be initialized
    const float rep_pen;
                ^
./expose.h:77:15: note: const member 'rep_pen_range' will never be initialized
    const int rep_pen_range;
              ^
./expose.h:80:17: note: const member 'mirostat_eta' will never be initialized
    const float mirostat_eta;
                ^
./expose.h:81:17: note: const member 'mirostat_tau' will never be initialized
    const float mirostat_tau;
                ^
./expose.h:83:15: note: const member 'sampler_len' will never be initialized
    const int sampler_len;
              ^
./expose.h:84:16: note: const member 'unban_tokens_rt' will never be initialized
    const bool unban_tokens_rt;
               ^
./expose.h:86:16: note: const member 'stream_sse' will never be initialized
    const bool stream_sse;
               ^
./expose.h:88:16: note: const member 'grammar_retain_state' will never be initialized
    const bool grammar_retain_state;
               ^
./expose.h:106:8: warning: struct 'sd_load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct sd_load_model_inputs
       ^
./expose.h:112:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:116:8: warning: struct 'sd_generation_inputs' does not declare any constructor to initialize its non-modifiable members
struct sd_generation_inputs
       ^
./expose.h:120:17: note: const member 'cfg_scale' will never be initialized
    const float cfg_scale;
                ^
./expose.h:121:15: note: const member 'sample_steps' will never be initialized
    const int sample_steps;
              ^
./expose.h:122:15: note: const member 'width' will never be initialized
    const int width;
              ^
./expose.h:123:15: note: const member 'height' will never be initialized
    const int height;
              ^
./expose.h:124:15: note: const member 'seed' will never be initialized
    const int seed;
              ^
In file included from gpttype_adapter.cpp:12:
In file included from ./model_adapter.h:14:
./expose.h:32:8: warning: struct 'load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct load_model_inputs
       ^
./expose.h:34:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:35:15: note: const member 'blasthreads' will never be initialized
    const int blasthreads;
              ^
./expose.h:36:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:37:16: note: const member 'low_vram' will never be initialized
    const bool low_vram;
               ^
./expose.h:38:16: note: const member 'use_mmq' will never be initialized
    const bool use_mmq;
               ^
./expose.h:39:16: note: const member 'use_rowsplit' will never be initialized
    const bool use_rowsplit;
               ^
./expose.h:45:16: note: const member 'use_mmap' will never be initialized
    const bool use_mmap;
               ^
./expose.h:46:16: note: const member 'use_mlock' will never be initialized
    const bool use_mlock;
               ^
./expose.h:47:16: note: const member 'use_smartcontext' will never be initialized
    const bool use_smartcontext;
               ^
./expose.h:48:16: note: const member 'use_contextshift' will never be initialized
    const bool use_contextshift;
               ^
./expose.h:61:8: warning: struct 'generation_inputs' does not declare any constructor to initialize its non-modifiable members
struct generation_inputs
       ^
./expose.h:63:15: note: const member 'seed' will never be initialized
    const int seed;
              ^
./expose.h:67:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:68:15: note: const member 'max_length' will never be initialized
    const int max_length;
              ^
./expose.h:69:17: note: const member 'temperature' will never be initialized
    const float temperature;
                ^
./expose.h:70:15: note: const member 'top_k' will never be initialized
    const int top_k;
              ^
./expose.h:72:17: note: const member 'top_p' will never be initialized
    const float top_p;
                ^
./expose.h:74:17: note: const member 'typical_p' will never be initialized
    const float typical_p;
                ^
./expose.h:75:17: note: const member 'tfs' will never be initialized
    const float tfs;
                ^
./expose.h:76:17: note: const member 'rep_pen' will never be initialized
    const float rep_pen;
                ^
./expose.h:77:15: note: const member 'rep_pen_range' will never be initialized
    const int rep_pen_range;
              ^
./expose.h:80:17: note: const member 'mirostat_eta' will never be initialized
    const float mirostat_eta;
                ^
./expose.h:81:17: note: const member 'mirostat_tau' will never be initialized
    const float mirostat_tau;
                ^
./expose.h:83:15: note: const member 'sampler_len' will never be initialized
    const int sampler_len;
              ^
./expose.h:84:16: note: const member 'unban_tokens_rt' will never be initialized
    const bool unban_tokens_rt;
               ^
./expose.h:86:16: note: const member 'stream_sse' will never be initialized
    const bool stream_sse;
               ^
./expose.h:88:16: note: const member 'grammar_retain_state' will never be initialized
    const bool grammar_retain_state;
               ^
./expose.h:106:8: warning: struct 'sd_load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct sd_load_model_inputs
       ^
./expose.h:112:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:116:8: warning: struct 'sd_generation_inputs' does not declare any constructor to initialize its non-modifiable members
struct sd_generation_inputs
       ^
./expose.h:120:17: note: const member 'cfg_scale' will never be initialized
    const float cfg_scale;
                ^
./expose.h:121:15: note: const member 'sample_steps' will never be initialized
    const int sample_steps;
              ^
./expose.h:122:15: note: const member 'width' will never be initialized
    const int width;
              ^
./expose.h:123:15: note: const member 'height' will never be initialized
    const int height;
              ^
./expose.h:124:15: note: const member 'seed' will never be initialized
    const int seed;
              ^
expose.cpp:210:24: warning: 'generate' has C-linkage specified, but returns user-defined type 'generation_outputs' which is incompatible with C [-Wreturn-type-c-linkage]
    generation_outputs generate(const generation_inputs inputs)
                       ^
expose.cpp:219:27: warning: 'sd_generate' has C-linkage specified, but returns user-defined type 'sd_generation_outputs' which is incompatible with C [-Wreturn-type-c-linkage]
    sd_generation_outputs sd_generate(const sd_generation_inputs inputs)
                          ^
expose.cpp:267:25: warning: 'token_count' has C-linkage specified, but returns user-defined type 'token_count_outputs' which is incompatible with C [-Wreturn-type-c-linkage]
    token_count_outputs token_count(const char * input)
                        ^
In file included from otherarch/sdcpp/sdtype_adapter.cpp:13:
In file included from ./model_adapter.h:14:
./expose.h:32:8: warning: struct 'load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct load_model_inputs
       ^
./expose.h:34:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:35:15: note: const member 'blasthreads' will never be initialized
    const int blasthreads;
              ^
./expose.h:36:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:37:16: note: const member 'low_vram' will never be initialized
    const bool low_vram;
               ^
./expose.h:38:16: note: const member 'use_mmq' will never be initialized
    const bool use_mmq;
               ^
./expose.h:39:16: note: const member 'use_rowsplit' will never be initialized
    const bool use_rowsplit;
               ^
./expose.h:45:16: note: const member 'use_mmap' will never be initialized
    const bool use_mmap;
               ^
./expose.h:46:16: note: const member 'use_mlock' will never be initialized
    const bool use_mlock;
               ^
./expose.h:47:16: note: const member 'use_smartcontext' will never be initialized
    const bool use_smartcontext;
               ^
./expose.h:48:16: note: const member 'use_contextshift' will never be initialized
    const bool use_contextshift;
               ^
./expose.h:61:8: warning: struct 'generation_inputs' does not declare any constructor to initialize its non-modifiable members
struct generation_inputs
       ^
./expose.h:63:15: note: const member 'seed' will never be initialized
    const int seed;
              ^
./expose.h:67:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:68:15: note: const member 'max_length' will never be initialized
    const int max_length;
              ^
./expose.h:69:17: note: const member 'temperature' will never be initialized
    const float temperature;
                ^
./expose.h:70:15: note: const member 'top_k' will never be initialized
    const int top_k;
              ^
./expose.h:72:17: note: const member 'top_p' will never be initialized
    const float top_p;
                ^
./expose.h:74:17: note: const member 'typical_p' will never be initialized
    const float typical_p;
                ^
./expose.h:75:17: note: const member 'tfs' will never be initialized
    const float tfs;
                ^
./expose.h:76:17: note: const member 'rep_pen' will never be initialized
    const float rep_pen;
                ^
./expose.h:77:15: note: const member 'rep_pen_range' will never be initialized
    const int rep_pen_range;
              ^
./expose.h:80:17: note: const member 'mirostat_eta' will never be initialized
    const float mirostat_eta;
                ^
./expose.h:81:17: note: const member 'mirostat_tau' will never be initialized
    const float mirostat_tau;
                ^
./expose.h:83:15: note: const member 'sampler_len' will never be initialized
    const int sampler_len;
              ^
./expose.h:84:16: note: const member 'unban_tokens_rt' will never be initialized
    const bool unban_tokens_rt;
               ^
./expose.h:86:16: note: const member 'stream_sse' will never be initialized
    const bool stream_sse;
               ^
./expose.h:88:16: note: const member 'grammar_retain_state' will never be initialized
    const bool grammar_retain_state;
               ^
./expose.h:106:8: warning: struct 'sd_load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct sd_load_model_inputs
       ^
./expose.h:112:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:116:8: warning: struct 'sd_generation_inputs' does not declare any constructor to initialize its non-modifiable members
struct sd_generation_inputs
       ^
./expose.h:120:17: note: const member 'cfg_scale' will never be initialized
    const float cfg_scale;
                ^
./expose.h:121:15: note: const member 'sample_steps' will never be initialized
    const int sample_steps;
              ^
./expose.h:122:15: note: const member 'width' will never be initialized
    const int width;
              ^
./expose.h:123:15: note: const member 'height' will never be initialized
    const int height;
              ^
./expose.h:124:15: note: const member 'seed' will never be initialized
    const int seed;
              ^
In file included from otherarch/sdcpp/sdtype_adapter.cpp:15:
In file included from ./otherarch/sdcpp/stable-diffusion.cpp:1:
./otherarch/sdcpp/ggml_extend.hpp:84:43: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
    printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
                  ~~~                     ^~~~~~~~~~~~~
                  %lld
./otherarch/sdcpp/ggml_extend.hpp:84:58: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
    printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
                       ~~~                               ^~~~~~~~~~~~~
                       %lld
./otherarch/sdcpp/ggml_extend.hpp:84:73: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
    printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
                            ~~~                                         ^~~~~~~~~~~~~
                            %lld
./otherarch/sdcpp/ggml_extend.hpp:84:88: warning: format specifies type 'size_t' (aka 'unsigned long') but the argument has type 'int64_t' (aka 'long long') [-Wformat]
    printf("shape(%zu, %zu, %zu, %zu)\n", tensor->ne[0], tensor->ne[1], tensor->ne[2], tensor->ne[3]);
                                 ~~~                                                   ^~~~~~~~~~~~~
                                 %lld
In file included from gpttype_adapter.cpp:18:
In file included from ./otherarch/llama_v2.cpp:9:
./otherarch/llama_v2.h:171:33: warning: 'legacy_llama_v2_tokenize' has C-linkage specified, but returns user-defined type 'std::vector<llama_v2_token>' (aka 'vector<int>') which is incompatible with C [-Wreturn-type-c-linkage]
    std::vector<llama_v2_token> legacy_llama_v2_tokenize(struct llama_v2_context * ctx, const std::string & text, bool add_bos);
                                ^
7 warnings generated.
In file included from otherarch/sdcpp/sdtype_adapter.cpp:15:
In file included from ./otherarch/sdcpp/stable-diffusion.cpp:14:
./otherarch/sdcpp/tae.hpp:194:17: warning: field 'decode_only' is uninitialized when used here [-Wuninitialized]
          taesd(decode_only),
                ^
In file included from gpttype_adapter.cpp:23:
./otherarch/gptj_v2.cpp:298:52: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
                            __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
                                                   ^~~~~~~~~~~~~
./otherarch/gptj_v2.cpp:298:67: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
                            __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
                                                                  ^~~~~~~~~~~~~
In file included from gpttype_adapter.cpp:24:
./otherarch/gptj_v3.cpp:308:52: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
                            __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
                                                   ^~~~~~~~~~~~~
./otherarch/gptj_v3.cpp:308:67: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
                            __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
                                                                  ^~~~~~~~~~~~~
In file included from gpttype_adapter.cpp:26:
./otherarch/gpt2_v2.cpp:291:48: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
                        __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
                                               ^~~~~~~~~~~~~
./otherarch/gpt2_v2.cpp:291:63: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
                        __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
                                                              ^~~~~~~~~~~~~
In file included from gpttype_adapter.cpp:28:
./otherarch/rwkv_v2.cpp:370:103: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
    RWKV_V2_ASSERT_NULL(emb->ne[0] == model->n_embed, "Unexpected dimension of embedding matrix %ld", emb->ne[0]);
                                                                                                ~~~   ^~~~~~~~~~
                                                                                                %lld
./otherarch/rwkv_v2.cpp:39:29: note: expanded from macro 'RWKV_V2_ASSERT_NULL'
            fprintf(stderr, __VA_ARGS__); \
                            ^~~~~~~~~~~
./otherarch/rwkv_v2.cpp:371:103: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
    RWKV_V2_ASSERT_NULL(emb->ne[1] == model->n_vocab, "Unexpected dimension of embedding matrix %ld", emb->ne[1]);
                                                                                                ~~~   ^~~~~~~~~~
                                                                                                %lld
./otherarch/rwkv_v2.cpp:39:29: note: expanded from macro 'RWKV_V2_ASSERT_NULL'
            fprintf(stderr, __VA_ARGS__); \
                            ^~~~~~~~~~~
gpttype_adapter.cpp:2312:128: warning: format specifies type 'int' but the argument has type 'size_type' (aka 'unsigned long') [-Wformat]
    printf("\nCtxLimit: %d/%d, Process:%.2fs (%.1fms/T = %.2fT/s), Generate:%.2fs (%.1fms/T = %.2fT/s), Total:%.2fs (%.2fT/s)",current_context_tokens.size(),nctx, time1, pt1, ts1, time2, pt2, ts2, (time1 + time2), tokens_per_second);
                        ~~                                                                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                        %zu
otherarch/sdcpp/sdtype_adapter.cpp:234:73: warning: result of comparison of constant 244 with expression of type 'char' is always true [-Wtautological-constant-out-of-range-compare]
        if (static_cast<unsigned char>(ch) <= 0x7F || (ch >= 0xC2 && ch <= 0xF4)) {
                                                                     ~~ ^  ~~~~
otherarch/sdcpp/sdtype_adapter.cpp:234:59: warning: result of comparison of constant 194 with expression of type 'char' is always false [-Wtautological-constant-out-of-range-compare]
        if (static_cast<unsigned char>(ch) <= 0x7F || (ch >= 0xC2 && ch <= 0xF4)) {
                                                       ~~ ^  ~~~~
otherarch/sdcpp/sdtype_adapter.cpp:325:13: warning: format specifies type 'int' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
            sd_params->seed,
            ^~~~~~~~~~~~~~~
otherarch/sdcpp/sdtype_adapter.cpp:327:13: warning: format specifies type 'int' but the argument has type 'sd_image_t *' [-Wformat]
            control_image,
            ^~~~~~~~~~~~~
13 warnings generated.
14 warnings generated.
c++ -I. -I./common -I./include -I./include/CL -I./otherarch -I./otherarch/tools -I./otherarch/sdcpp -I./otherarch/sdcpp/thirdparty -I./include/vulkan -O3 -DNDEBUG -std=c++11 -fPIC -DLOG_DISABLE_LOGS -D_GNU_SOURCE -pthread -s -Wno-multichar -Wno-write-strings -Wno-deprecated -Wno-deprecated-declarations -pthread -DGGML_USE_METAL -DSD_USE_METAL  ggml.o ggml_v3.o ggml_v2.o ggml_v1.o expose.o common.o gpttype_adapter.o ggml-quants.o ggml-alloc.o ggml-backend.o llava.o llavaclip.o unicode.o grammar-parser.o sdcpp_default.o ggml-metal.o -shared -o koboldcpp_default.so  -ld_classic -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
Your OS  does not appear to be Windows. For faster speeds, install and link a BLAS library. Set LLAMA_OPENBLAS=1 to compile with OpenBLAS support or LLAMA_CLBLAST=1 to compile with ClBlast support. This is just a reminder, not an error.
ld: warning: -s is obsolete
ld: warning: option -s is obsolete and being ignored
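
A quick aside on the -Wformat warnings above: they flag int64_t values (tensor dimensions) being printed with %ld, while int64_t is long long on this platform, as clang notes. They are cosmetic and unrelated to the load failure below. For reference, a minimal illustrative C sketch of the portable pattern the compiler suggests; this is not koboldcpp code:

#include <inttypes.h>   // PRId64 macro for portable int64_t printing
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t ne0 = 8192, ne1 = 32000;   // stand-ins for tensor dimensions

    // "%ld" expects long; int64_t may be long long, as the clang warning notes.
    // Portable options: the PRId64 macro, or "%lld" with an explicit cast.
    printf("shape = %" PRId64 " x %" PRId64 "\n", ne0, ne1);
    printf("shape = %lld x %lld\n", (long long) ne0, (long long) ne1);
    return 0;
}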

It compiled without errors, so I tried to run it.

(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % python koboldcpp.py --noblas --gpulayers 999 --contextsize 8192 --model $LLM_MODEL_Q/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf
***
Welcome to KoboldCpp - Version 1.61.1
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model='/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf', model_param='/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=4, usecublas=None, usevulkan=None, useclblast=None, noblas=True, gpulayers=999, tensor_split=None, contextsize=8192, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=4, lora=None, smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, onready='', benchmark=None, multiuser=0, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, sdconfig=None, mmproj='', password=None)
==========
Loading model: /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf 
[Threads: 4, BlasThreads: 4, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
llama_model_loader: loaded meta data with 24 key-value pairs and 723 tensors from /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32764
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32764
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 23.71 GiB (2.95 BPW) 
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.64 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =    82.03 MiB
llm_load_tensors:      Metal buffer size = 24200.12 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 8272
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: failed to initialize Metal backend
gpttype_load_model: error: failed to load model '/Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf'
Load Text Model OK: False
Could not load text model: /Volumes/cuttingedge/large_language_models/models_ggml_converted/maywell_kiqu-70b-GGUF/kiqu-70b.Q2_K.gguf

It failed, so I tried to run a small GGUF model on the same disk volume.

(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % python koboldcpp.py --noblas --gpulayers 999 --contextsize 8192 --model ../llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf
***
Welcome to KoboldCpp - Version 1.61.1
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model='../llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf', model_param='../llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=4, usecublas=None, usevulkan=None, useclblast=None, noblas=True, gpulayers=999, tensor_split=None, contextsize=8192, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=4, lora=None, smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, onready='', benchmark=None, multiuser=0, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, sdconfig=None, mmproj='', password=None)
==========
Loading model: /Users/******/test/llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf 
[Threads: 4, BlasThreads: 4, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /Users/******/test/llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attm      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 6.67 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   132.81 MiB
llm_load_tensors:      Metal buffer size =  6695.83 MiB
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:32000.0).
llama_new_context_with_model: n_ctx      = 8272
llama_new_context_with_model: freq_base  = 32000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: failed to initialize Metal backend
gpttype_load_model: error: failed to load model '/Users/******/test/llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf'
Load Text Model OK: False
Could not load text model: /Users/******/test/llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf

It failed again.

@beebopkim
Author

beebopkim commented Mar 13, 2024

I also confirmed that tag v1.60.1 works and tag v1.61 fails. There must be something between v1.60.1 and v1.61. I also found that the last successful commit is 9229ea6, just before 6a32c14, "Merge branch 'master' into concedo_experimental" by you.

@LostRuins added the bug (Something isn't working) label Mar 13, 2024
@LostRuins
Owner

For some reason, your program is not calling the ggml_backend_metal_init function; otherwise you would see ggml_metal_init: allocating displayed. I looked through the commits between these two tags and could not find any reason why this would happen.

Let's try to troubleshoot this sequentially.
First, start from the latest commit [9f102b9]; I added the extra headers that were not present in the file before.
Do a full make clean followed by make LLAMA_METAL=1 and see if metal gets initialized correctly.

If it's still not working, there are 3 commits that change the Metal-related files: be858f6, bb6d00b, and 8a3012a. Of these 3, I think the most likely one to cause issues is 8a3012a.

Unfortunately, you won't be able to directly revert these commits due to merge conflicts. But perhaps you could examine the changes and see if you can figure out what causes the problems. If you're still stuck, let me know and I'll create a few separate checkpoints you can try - I can't debug this on my side as I don't have a mac. It's weird as it seems the Init isn't even being called.

@LostRuins
Owner

LostRuins commented Mar 13, 2024

If you can stick some print statements in this function https://github.com/LostRuins/koboldcpp/blob/concedo/ggml-metal.m#L2835 and within ggml_metal_init itself, it would be helpful to know if it's called, and which part the init fails at.

ggml_backend_t ggml_backend_metal_init(void) {
    struct ggml_metal_context * ctx = ggml_metal_init(GGML_DEFAULT_N_THREADS);

    if (ctx == NULL) {
        return NULL;
    }

    ggml_backend_t metal_backend = malloc(sizeof(struct ggml_backend));

    *metal_backend = (struct ggml_backend) {
        /* .guid      = */ ggml_backend_metal_guid(),
        /* .interface = */ ggml_backend_metal_i,
        /* .context   = */ ctx,
    };

    return metal_backend;
}
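
For illustration, a minimal sketch of where such print statements could go, assuming plain fprintf to stderr rather than the GGML_METAL_LOG_* macros shown above; this is only a sketch, not an actual patch:

ggml_backend_t ggml_backend_metal_init(void) {
    fprintf(stderr, "%s: entered\n", __func__);                           // added debug print

    struct ggml_metal_context * ctx = ggml_metal_init(GGML_DEFAULT_N_THREADS);

    if (ctx == NULL) {
        fprintf(stderr, "%s: ggml_metal_init returned NULL\n", __func__); // added debug print
        return NULL;
    }

    ggml_backend_t metal_backend = malloc(sizeof(struct ggml_backend));

    *metal_backend = (struct ggml_backend) {
        /* .guid      = */ ggml_backend_metal_guid(),
        /* .interface = */ ggml_backend_metal_i,
        /* .context   = */ ctx,
    };

    fprintf(stderr, "%s: backend created\n", __func__);                   // added debug print
    return metal_backend;
}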

It would also be helpful to compare the terminal output of the successful v1.60.1 build.

@beebopkim
Author

I found that the first line of ggml_backend_metal_init failed.

    struct ggml_metal_context * ctx = ggml_metal_init(GGML_DEFAULT_N_THREADS);

So I investigated ggml_metal_init.

static struct ggml_metal_context * ggml_metal_init(int n_cb) {
    GGML_METAL_LOG_INFO("%s: allocating\n", __func__);

#if TARGET_OS_OSX && !GGML_METAL_NDEBUG
    // Show all the Metal device instances in the system
    NSArray * devices = MTLCopyAllDevices();
    for (id<MTLDevice> device in devices) {
        GGML_METAL_LOG_INFO("%s: found device: %s\n", __func__, [[device name] UTF8String]);
    }
    [devices release]; // since it was created by a *Copy* C method
#endif

    // Pick and show default Metal device
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    GGML_METAL_LOG_INFO("%s: picking default device: %s\n", __func__, [[device name] UTF8String]);

    // Configure context
    struct ggml_metal_context * ctx = malloc(sizeof(struct ggml_metal_context));
    ctx->device = device;
    ctx->n_cb   = MIN(n_cb, GGML_METAL_MAX_BUFFERS);
    ctx->queue  = [ctx->device newCommandQueue];
    ctx->d_queue = dispatch_queue_create("ggml-metal", DISPATCH_QUEUE_CONCURRENT);

    id<MTLLibrary> metal_library;

    // load library
    {
        NSBundle * bundle = nil;
#ifdef SWIFT_PACKAGE
        bundle = SWIFTPM_MODULE_BUNDLE;
#else
        bundle = [NSBundle bundleForClass:[GGMLMetalClass class]];
#endif
        NSError * error = nil;
        NSString * libPath = [bundle pathForResource:@"default" ofType:@"metallib"];
        if (libPath != nil) {
            // pre-compiled library found
            NSURL * libURL = [NSURL fileURLWithPath:libPath];
            GGML_METAL_LOG_INFO("%s: loading '%s'\n", __func__, [libPath UTF8String]);
            metal_library = [ctx->device newLibraryWithURL:libURL error:&error];
            if (error) {
                GGML_METAL_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
                return NULL;
            }
        } else {
#if GGML_METAL_EMBED_LIBRARY
            GGML_METAL_LOG_INFO("%s: using embedded metal library\n", __func__);

            extern const char ggml_metallib_start[];
            extern const char ggml_metallib_end[];

            NSString * src  = [[NSString alloc] initWithBytes:ggml_metallib_start length:(ggml_metallib_end-ggml_metallib_start) encoding:NSUTF8StringEncoding];
#else
            GGML_METAL_LOG_INFO("%s: default.metallib not found, loading from source\n", __func__);

            NSString * sourcePath;
            NSString * ggmlMetalPathResources = [[NSProcessInfo processInfo].environment objectForKey:@"GGML_METAL_PATH_RESOURCES"];

            GGML_METAL_LOG_INFO("%s: GGML_METAL_PATH_RESOURCES = %s\n", __func__, ggmlMetalPathResources ? [ggmlMetalPathResources UTF8String] : "nil");

            if (ggmlMetalPathResources) {
                sourcePath = [ggmlMetalPathResources stringByAppendingPathComponent:@"ggml-metal.metal"];
            } else {
                sourcePath = [bundle pathForResource:@"ggml-metal" ofType:@"metal"];
            }
            if (sourcePath == nil) {
                GGML_METAL_LOG_WARN("%s: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd\n", __func__);
                sourcePath = @"ggml-metal.metal";
            }
            GGML_METAL_LOG_INFO("%s: loading '%s'\n", __func__, [sourcePath UTF8String]);
            NSString * src = [NSString stringWithContentsOfFile:sourcePath encoding:NSUTF8StringEncoding error:&error];
            if (error) {
                GGML_METAL_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
                return NULL;
            }
#endif

            @autoreleasepool {
                // dictionary of preprocessor macros
                NSMutableDictionary * prep = [NSMutableDictionary dictionary];

#ifdef GGML_QKK_64
                prep[@"GGML_QKK_64"] = @(1);
#endif

                MTLCompileOptions* options = [MTLCompileOptions new];
                options.preprocessorMacros = prep;

                //[options setFastMathEnabled:false];

                metal_library = [ctx->device newLibraryWithSource:src options:options error:&error];
                if (error) {
                    GGML_METAL_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
                    return NULL;
                }
            }
        }
    }

The error occurs at L347 - metal_library = [ctx->device newLibraryWithSource:src options:options error:&error]; - and the problem is that I have no knowledge of Metal shaders...

@beebopkim
Author

And even more confusingly, ggml-metal.m from koboldcpp and from the most recent llama.cpp commit d8fd0cc are exactly the same. 😢

@beebopkim
Author

I changed the Makefile to remove -DGGML_METAL_NDEBUG and saw the following error messages.

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/******/test/koboldcpp_dev/ggml-metal.metal'
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:3:10: fatal error: 'ggml-common.h' file not found
#include "ggml-common.h"
         ^~~~~~~~~~~~~~~
" UserInfo={NSLocalizedDescription=program_source:3:10: fatal error: 'ggml-common.h' file not found
#include "ggml-common.h"
         ^~~~~~~~~~~~~~~
}
llama_new_context_with_model: failed to initialize Metal backend
gpttype_load_model: error: failed to load model '/Users/******/test/llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf'
Load Text Model OK: False
Could not load text model: /Users/******/test/llama.cpp/models/llama-7b-v2/ggml-model-q8_0.gguf
(kdev_env) ******@Mac-Studio-2022-01 koboldcpp_dev % 

ggml-common.h definitely exists. I find this very strange.

@LostRuins
Owner

Related: ggerganov#5977

@LostRuins
Owner

ggerganov#5940 (comment)

@beebopkim
Author

Related: ggerganov#5977

As a workaround, this works:

xcrun -sdk macosx metal    -O3 -c ggml-metal.metal -o ggml-metal.air
xcrun -sdk macosx metallib        ggml-metal.air   -o default.metallib

@LostRuins
Owner

Yeah, but it's not ideal.

@LostRuins
Owner

I might go with the

sed -e '/#include "ggml-common.h"/r ggml-common.h' -e '/#include "ggml-common.h"/d' < ggml-metal.metal > ggml-metal-embed.metal which is basically sticking the contents of ggml-common.h directly into the metal shader. I don't really want to precompile the metal lib.

Might need your help to test it again after I tweak it. It's annoying cause I will not be able to test anything myself as I don't have a mac.

@beebopkim
Author

beebopkim commented Mar 13, 2024

I might go with the

sed -e '/#include "ggml-common.h"/r ggml-common.h' -e '/#include "ggml-common.h"/d' < ggml-metal.metal > ggml-metal-embed.metal which is basically sticking the contents of ggml-common.h directly into the metal shader. I don't really want to precompile the metal lib.

Might need your help to test it again after I tweak it. It's annoying cause I will not be able to test anything myself as I don't have a mac.

After doing the tweak using sed, I renamed ggml-metal-embed.metal to ggml-metal.metal and ran koboldcpp. The result is:

Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Tada!

I also noticed that ggml-metal.metal has 2 #include "ggml-common.h" lines.

#define GGML_COMMON_DECL_METAL
#define GGML_COMMON_IMPL_METAL
#include "ggml-common.h"

#include <metal_stdlib>

#define GGML_COMMON_IMPL_METAL
#include "ggml-common.h"

So, in ggml-metal-embed.metal, the content of ggml-common.h was inserted twice, once for each of those lines. 😕

@LostRuins
Owner

LostRuins commented Mar 13, 2024

Ah yeah, that is fixed in ggerganov#6015, which I will merge in when fixing the makefile tomorrow. Thanks for helping test.

@LostRuins
Owner

Hi @beebopkim, if you don't mind, can you see if the latest experimental branch runs fine with LLAMA_METAL=1 for you?

@beebopkim
Author

@LostRuins I wish I could do it right now, but I'm afraid I can only do it in about 9 hours... Sorry for the wait.

@LostRuins
Owner

No problem, just let me know

@beebopkim
Author

beebopkim commented Mar 14, 2024

@LostRuins With f3b7651, there is no problem. Now I can run bakllava-mistal-v1 with --gpulayers 99! Thanks a lot! 😃

@LostRuins
Owner

Thanks for testing!

@beebopkim
Author

I confirmed that ec5dea1 works too. You're welcome!
