llama.cpp server UR_RESULT_ERROR_OUT_OF_RESOURCES error #12872

Open
easyfab opened this issue Feb 22, 2025 · 3 comments

easyfab commented Feb 22, 2025

Hi, with the latest intelanalytics/ipex-llm-inference-cpp-xpu:latest image I get this error:

UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml/src/ggml-sycl/ggml-sycl.cpp, line:2819

I use this command line: ./llama-server -m Mistral-Small-24B-Instruct-2501-IQ4_XS.gguf -c 2048 -ngl 99 --temp 0 --port 1234 --host 192.168.1.64

Full output:

build: 1 (e66308a) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: HTTP server is listening, hostname: 192.168.1.64, port: 1234, http threads: 15
main: loading model
srv    load_model: loading model '../Mistral-Small-24B-Instruct-2501-IQ4_XS.gguf'
llama_load_model_from_file: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 363 tensors from ../Mistral-Small-24B-Instruct-2501-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Mistral Small 24B Instruct 2501
llama_model_loader: - kv   3:                            general.version str              = 2501
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Mistral-Small
llama_model_loader: - kv   6:                         general.size_label str              = 24B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Mistral Small 24B Base 2501
llama_model_loader: - kv  10:               general.base_model.0.version str              = 2501
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Mistralai
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/mistralai/Mist...
llama_model_loader: - kv  13:                          general.languages arr[str,10]      = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv  14:                          llama.block_count u32              = 40
llama_model_loader: - kv  15:                       llama.context_length u32              = 32768
llama_model_loader: - kv  16:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 32768
llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  19:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  20:                       llama.rope.freq_base f32              = 100000000.000000
llama_model_loader: - kv  21:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  22:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  23:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  24:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  25:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  33:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  37:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 30
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = /models_out/Mistral-Small-24B-Instruc...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  42:             quantize.imatrix.entries_count i32              = 280
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  241 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 1000
llm_load_vocab: token to piece cache size = 0.8498 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 131072
llm_load_print_meta: n_merges         = 269443
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 32768
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 100000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params     = 23.57 B
llm_load_print_meta: model size       = 11.88 GiB (4.33 BPW)
llm_load_print_meta: general.name     = Mistral Small 24B Instruct 2501
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1196 'Ä'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 150
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   340.00 MiB
llm_load_tensors:        SYCL0 model buffer size = 11820.33 MiB
...............................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 2048
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 100000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16225M|     1.6.32224.500000|
llama_kv_cache_init:      SYCL0 KV buffer size =   320.00 MiB
llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.50 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  1064.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    56.02 MiB
llama_new_context_with_model: graph nodes  = 1126
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

ShaderDumpEnable Warning! BufferVec[] has 25752 elements. Including first 1000 items in ShaderDumps. To print all elements set IGC_ShowFullVectorsInShaderDumps register flag to True. ShaderOverride flag may not work properly without IGC_ShowFullVectorsInShaderDumps enabled.

srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 2048
main: model loaded
main: chat template, built_in: 1, chat_example: '[SYSTEM_PROMPT] You are a helpful assistant[/SYSTEM_PROMPT][INST] Hello[/INST] Hi there</s>[INST] How are you?[/INST]'
main: server is listening on http://192.168.1.64:1234 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1042
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1042, n_tokens = 1042, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1042, n_tokens = 1042
UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml/src/ggml-sycl/ggml-sycl.cpp, line:2819

san-nos commented Feb 23, 2025

I am also experiencing this UR backend returns:40 error, specifically with nomic_embed_text. Right now, my workaround for this is to use this command:

set OLLAMA_NUM_GPU=11

(note: nomic_embed_text has a total of 13 layers)
Any value over this will result in the error, but using this command forces the CPU to do some of the work instead of leaving it all to the GPU.

Oddly enough, setting this parameter to 999, as instructed in the docs, works just fine with any other model.
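For anyone doing this on Linux rather than Windows, a rough equivalent would be the following (a sketch only; the ./ollama serve launch line is just an example and may differ in your setup):

export OLLAMA_NUM_GPU=11    # keep 11 of nomic_embed_text's 13 layers on the GPU, the rest on the CPU
./ollama serve              # then start the server as usual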

qiuxin2012 (Contributor) commented Feb 24, 2025

Yes, the A770 only has 16GB of VRAM, and a 24B model is too big for the A770, so you will get ERROR_OUT_OF_RESOURCES.
We can update our docs to warn users about this.

easyfab (Author) commented Feb 24, 2025

@qiuxin2012 I'm not sure it's memory-related.
It's a 24B model, but quantized: model size = 11820.33 MiB + 1064.00 MiB compute buffer (rough sum below). That should be OK for 16GB of VRAM.
And it worked with an older version.
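For reference, a rough sum of the GPU buffers from the log above (ignoring driver and allocator overhead):

11820.33 MiB (SYCL0 model buffer) + 320.00 MiB (KV cache) + 1064.00 MiB (SYCL0 compute buffer) ≈ 13204 MiB ≈ 12.9 GiB, which is well under the 15473 MiB reported free on the A770.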

After @san-nos's comment, I tried some options, and with --batch-size <= 1024 it works (full command below).
Is the default logical maximum batch size of 2048 the problem?
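For example, the command from my first post works once the batch size is capped (the only change is the added --batch-size flag):

./llama-server -m Mistral-Small-24B-Instruct-2501-IQ4_XS.gguf -c 2048 -ngl 99 --temp 0 --port 1234 --host 192.168.1.64 --batch-size 1024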
