
Eval bug: Repeated sequences with gemma3 and image recognition #14888

@deiteris

Name and Version

.\llama-server.exe --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
version: 5996 (11dd5a4)
built with MSVC 19.43.34808.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 7 7800X3D + RTX 4070 Ti Super

Models

Gemma-3-27b-it

Problem description & steps to reproduce

When I run llama-server with an mmproj file and ask the model to generate text from a provided image, it sometimes gets stuck generating a repeated sequence of tokens. Sometimes the repetition starts mid-generation and the model then recovers.

For example, I provide the following image (found on the Internet; it contains no private information) with the prompt:

Translate the provided document
Image
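
For scripted reproduction, the same kind of request can be sent to the server's OpenAI-compatible `/v1/chat/completions` endpoint, which is the endpoint that appears in the log further down. The following is a minimal sketch, not the exact request the web UI sends. The host, port, and model alias come from the command line in this report. The assumptions are that the image is passed as a base64 data URI in an `image_url` content part and that the image part precedes the text part. The function names and the default MIME type are illustrative only.

```python
import base64
import json
import urllib.request

# Host, port, and alias are copied from the llama-server command line in this report.
SERVER_URL = "http://192.168.0.7:8083/v1/chat/completions"
MODEL_ALIAS = "gemma3"


def build_payload(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat request with one inline image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    return {
        "model": MODEL_ALIAS,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Assumption: image first, then the text prompt.
                    {"type": "image_url", "image_url": {"url": data_uri}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }


def send(payload: dict, url: str = SERVER_URL) -> str:
    """POST the payload and return the assistant text. Needs a running server."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, something like `send(build_payload(open("document.png", "rb").read(), "Translate the provided document"))` would issue one request; `document.png` is a hypothetical file name standing in for the attached image. Because the problem is intermittent, the call may need to be repeated several times.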

I get the following output:

Here's a translation of the provided document, along with explanations to clarify some of the legal terminology:

**Document Translation: Decision**

**[Logo of the Agency for Registration of Companies - Serbia]**

**DECISION**

No 1/28.02.2015

On February 28, 2015, in the city of Belgrade, the owner of the capital of a **LIMITED LIABILITY COMPANY** with the name "ERMA 11" Ltd., registered and operating under the name: "ERMA 11" Ltd., with registered seat and address: Belgrade, Vasilija Mikica Street No. 16, 3rd floor, office 3, PIO 20156290, and a private limited company with the name: "BRITISH HOLDING & FINANCE INC.", registered in the register of legal entities as a British Virgin Islands company, represented by its director **SINISA MAJIĆ**, a citizen of Serbia, born on February 25, 1972, residing in Belgrade, Republic of Serbia, with a permanent address in Belgrade, Kiril i Metodij Street No. 008.

**DECISION:**

1.  **ACCEPT** the annual financial report of the company for the year 2014.
2.  **DECLARE** and **PUBLISH** the annual financial report of the company for the year 2014 in the Business Register, based on the Agency for Private Entrepreneurship.
3.  **AUTHORIZE** attorney Ana Ivanova Planinova, registered with the Bar Association under number СтК 7312170917, with personal ID card No. 641374607, issued on 07.12.2010, at MBR – Belgrade, with a permanent address in Belgrade, ул. Дердинца No. 108, et. 1, ап. 1, and Radomir Dimitrova Stamatova, ЕГН 8709250439, притежаваща лична карта No. 6015977087, issued on 03.01.2008, at MBR – Belgrade, with a permanent address in Belgrade, ул. Грънчар No. 27, et. 2, ап. 6, ДАЕШО, and Поотделно да изготвят всички документи и предприемат необходимите действия по отношение на регистриране на финансовия отчет за 2014 година в публичния регистър, водени от Агенцията за частна инициатива.

**[Signature]**

**[Stamp]**

**[Name and Title]**

**Explanation of Terms:**

*   **Limited Liability Company (ООД):**  A common type of business structure in many countries, where the owners are not personally liable for the company's debts.
*   **PIO:**  A unique identification number assigned to companies in Serbia.
*   **Business Register (Търговски регистър):** The official register where companies are registered and their information is publicly available.
*   **Agency for Private Entrepreneurship (Агенция за частна инициатива):** The government agency responsible for registering companies and maintaining the Business Register.
*   **СтК/ЕГН:** Identification numbers used in Serbia.
*   **ул.:** Abbreviation for улица (street)
*   **et.:** Abbreviation for етаж (floor)
*   **ап.:** Abbreviation for апартамент (apartment)
*   **ДАЕШО:** Likely refers to a department or section within the agency.



**In summary, this document is a decision by the Serbian Agency for Registration of Companies accepting the annual financial report of "ERMA 11" Ltd. for the year 2014 and authorizing specific attorneys to handle the registration of the report in the official Business Register.**



If you have another document you'd like me to translate, and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and
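
Since the failure only happens sometimes, it can help to flag degenerate outputs automatically when running the request many times. The sketch below is a simple heuristic, assuming the loop looks like the one above (a single word repeated back to back). The function names and the threshold of 8 are arbitrary choices for illustration, and loops made of longer multi-word sequences would need an n-gram check instead.

```python
import re


def longest_word_run(text: str) -> tuple[str, int]:
    """Return the word with the longest run of consecutive repeats and the run length."""
    best_word, best_len = "", 0
    prev, run = None, 0
    for word in re.findall(r"\S+", text):
        # Extend the current run if the word repeats, otherwise start a new run.
        run = run + 1 if word == prev else 1
        prev = word
        if run > best_len:
            best_word, best_len = word, run
    return best_word, best_len


def looks_degenerate(text: str, threshold: int = 8) -> bool:
    """Flag output in which one word repeats at least `threshold` times in a row."""
    return longest_word_run(text)[1] >= threshold
```

Applied to the output above, `looks_degenerate` would return `True` because of the long run of "and".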

First Bad Commit

No response

Relevant log output

.\llama-server.exe -c 8192 -n -1 -m C:\Temp\gemma-3-27b-it-UD-Q3_K_XL.gguf -ngl 99 -fa --host 192.168.0.7 -ctk q8_0 -ctv q8_0 -a gemma3 --port 8083 --mmproj C:\Temp\mmproj-BF16.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
build: 5996 (11dd5a44e) with MSVC 19.43.34808.0 for x64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 192.168.0.7, port: 8083, http threads: 15
main: loading model
srv    load_model: loading model 'C:\Temp\gemma-3-27b-it-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070 Ti SUPER) - 15089 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from C:\Temp\gemma-3-27b-it-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3-27B-It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = Gemma-3-27B-It
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 27B
llama_model_loader: - kv   7:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   8:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv   9:                    gemma3.embedding_length u32              = 5376
llama_model_loader: - kv  10:                         gemma3.block_count u32              = 62
llama_model_loader: - kv  11:                 gemma3.feed_forward_length u32              = 21504
llama_model_loader: - kv  12:                gemma3.attention.head_count u32              = 32
llama_model_loader: - kv  13:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                gemma3.attention.key_length u32              = 128
llama_model_loader: - kv  15:              gemma3.attention.value_length u32              = 128
llama_model_loader: - kv  16:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  17:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  18:             gemma3.attention.head_count_kv u32              = 16
llama_model_loader: - kv  19:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  20:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  24:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  33:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                          general.file_type u32              = 12
llama_model_loader: - kv  36:                      quantize.imatrix.file str              = gemma-3-27b-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = unsloth_calibration_gemma-3-27b-it.txt
llama_model_loader: - kv  38:             quantize.imatrix.entries_count i32              = 434
llama_model_loader: - kv  39:              quantize.imatrix.chunks_count i32              = 663
llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q3_K:  208 tensors
llama_model_loader: - type q4_K:  122 tensors
llama_model_loader: - type q5_K:   59 tensors
llama_model_loader: - type q6_K:    6 tensors
llama_model_loader: - type iq3_xxs:   10 tensors
llama_model_loader: - type iq3_s:   10 tensors
llama_model_loader: - type iq4_xs:   20 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Medium
print_info: file size   = 12.76 GiB (4.06 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5376
print_info: n_layer          = 62
print_info: n_head           = 32
print_info: n_head_kv        = 16
print_info: n_rot            = 128
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 7.7e-02
print_info: n_ff             = 21504
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 27B
print_info: model params     = 27.01 B
print_info: general.name     = Gemma-3-27B-It
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 106 '<end_of_turn>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors:        CUDA0 model buffer size = 13063.79 MiB
load_tensors:   CPU_Mapped model buffer size =  1102.77 MiB
..........srv  log_server_r: request: GET / 192.168.0.7 503
....srv  log_server_r: request: GET /favicon.ico 192.168.0.7 503
.......................................................srv  log_server_r: request: GET / 192.168.0.7 503
.srv  log_server_r: request: GET /favicon.ico 192.168.0.7 503
............srv  log_server_r: request: GET / 192.168.0.7 503
.srv  log_server_r: request: GET /favicon.ico 192.168.0.7 503
....
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = true
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 8192 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =   340.00 MiB
llama_kv_cache_unified: size =  340.00 MiB (  8192 cells,  10 layers,  1/ 1 seqs), K (q8_0):  170.00 MiB, V (q8_0):  170.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =   331.50 MiB
llama_kv_cache_unified: size =  331.50 MiB (  1536 cells,  52 layers,  1/ 1 seqs), K (q8_0):  165.75 MiB, V (q8_0):  165.75 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:      CUDA0 compute buffer size =   522.62 MiB
llama_context:  CUDA_Host compute buffer size =    29.51 MiB
llama_context: graph nodes  = 2613
llama_context: graph splits = 2
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_model_loader: model name:   Gemma-3-27B-It
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    439
clip_model_loader: n_kv:         21

clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          gemma3
load_hparams: n_embd:             1152
load_hparams: n_head:             16
load_hparams: n_ff:               4304
load_hparams: n_layer:            27
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     5376

--- vision hparams ---
load_hparams: image_size:         896
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  4
load_hparams: n_wa_pattern:       0

load_hparams: model size:         817.98 MiB
load_hparams: metadata size:      0.15 MiB
srv  log_server_r: request: GET / 192.168.0.7 503
srv  log_server_r: request: GET /favicon.ico 192.168.0.7 503
srv  log_server_r: request: GET / 192.168.0.7 503
srv  log_server_r: request: GET /favicon.ico 192.168.0.7 503
srv  log_server_r: request: GET / 192.168.0.7 503
srv  log_server_r: request: GET /favicon.ico 192.168.0.7 503
alloc_compute_meta:      CUDA0 compute buffer size =  1132.00 MiB
alloc_compute_meta:        CPU compute buffer size =     9.19 MiB
srv    load_model: loaded multimodal model, 'C:\Temp\mmproj-BF16.gguf'
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
main: model loaded
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {%- if item['type'] == 'image' -%}
                {{ '<start_of_image>' }}
            {%- elif item['type'] == 'text' -%}
                {{ item['text'] | trim }}
            {%- endif -%}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://192.168.0.7:8083 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET / 192.168.0.7 200
srv  log_server_r: request: GET /favicon.ico 192.168.0.7 404
srv  log_server_r: request: GET /props 192.168.0.7 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 272
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 5, n_tokens = 5, progress = 0.018382
slot update_slots: id  0 | task 0 | kv cache rm [5, end)
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 5935 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 92 ms
srv  process_chun: image processed in 6027 ms
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 272, n_tokens = 11, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 272, n_tokens = 11
slot      release: id  0 | task 0 | stop processing: n_past = 1537, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    6450.65 ms /   272 tokens (   23.72 ms per token,    42.17 tokens per second)
       eval time =   44240.74 ms /  1266 tokens (   34.95 ms per token,    28.62 tokens per second)
      total time =   50691.39 ms /  1538 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.0.7 200
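
As a sanity check on the log, several of the reported numbers can be recomputed from the hyperparameters printed above. The sketch below assumes that the Gemma 3 projector reduces the patch grid by `proj_scale_factor` on each side, and that a q8_0 block stores 32 values in 34 bytes (32 int8 values plus one fp16 scale). The function names are illustrative. The results match the log, which suggests the model and cache were set up as configured; it does not identify the cause of the repetition.

```python
Q8_0_BLOCK_ELEMS = 32  # values per q8_0 block
Q8_0_BLOCK_BYTES = 34  # 32 int8 values + one fp16 scale
MIB = 1024 * 1024


def image_tokens(image_size: int, patch_size: int, proj_scale_factor: int) -> int:
    """Number of embeddings per image after patching and pooling (assumed layout)."""
    patches_per_side = image_size // patch_size
    pooled_per_side = patches_per_side // proj_scale_factor
    return pooled_per_side ** 2


def kv_q8_0_mib(cells: int, n_embd_gqa: int, layers: int) -> float:
    """Size in MiB of one q8_0 K (or V) cache with the given shape."""
    elems = cells * n_embd_gqa * layers
    return elems // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES / MIB


# image_size = 896, patch_size = 14, proj_scale_factor = 4 (vision hparams in the log)
n_image = image_tokens(896, 14, 4)

# Prompt: 5 tokens before the image, the image itself, 11 tokens after it.
n_prompt = 5 + n_image + 11

# Non-SWA cache: 8192 cells, 10 layers; SWA cache: 1536 cells, 52 layers.
k_full = kv_q8_0_mib(8192, 2048, 10)
k_swa = kv_q8_0_mib(1536, 2048, 52)

print(n_image, n_prompt, k_full, k_swa)
```

These values correspond to `n_tokens_batch = 256`, `n_prompt_tokens = 272`, `K (q8_0): 170.00 MiB`, and `K (q8_0): 165.75 MiB` in the log.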
