Description
Hello! I've run into an issue with the visual grounding task: the model frequently outputs meaningless bounding boxes. I'm testing the deepseek-vl2-small model. Below are the test code and a failing case.
Test Code
# vl_gpt, vl_chat_processor, tokenizer, img_info, and image_path are
# initialized beforehand, following the official DeepSeek-VL2 inference example.
expression = img_info['caption']
prompt = f"<image>\n<|ref|>{expression}<|/ref|>."
conversation = [
    {
        "role": "<|User|>",
        "content": prompt,
        "images": [image_path],
    },
    {"role": "<|Assistant|>", "content": ""},
]
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)
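For reference, this is the post-processing I use on the decoded answer. It is a minimal sketch, not code from the official repo: `extract_boxes` is a hypothetical helper, and it assumes the `<|det|>[[x1, y1, x2, y2]]<|/det|>` coordinates are normalized to the 0-999 range (as in the official grounding examples) before being scaled to pixels.

```python
import re

def extract_boxes(answer: str, img_w: int, img_h: int):
    """Parse <|det|>...<|/det|> from a model answer into pixel boxes.

    Assumption: coordinates are integers normalized to [0, 999].
    Boxes that do not have exactly four coordinates are skipped.
    """
    det = re.search(r"<\|det\|>(.*?)<\|/det\|>", answer, re.DOTALL)
    if det is None:
        return []
    boxes = []
    for quad in re.findall(r"\[([\d\s,]+?)\]", det.group(1)):
        coords = [int(c) for c in quad.split(",")]
        if len(coords) == 4:
            x1, y1, x2, y2 = coords
            boxes.append((x1 * img_w // 999, y1 * img_h // 999,
                          x2 * img_w // 999, y2 * img_h // 999))
    return boxes
```

With a well-formed response this yields usable pixel boxes, but on the failure cases below the parsed coordinates are far outside any valid range.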
Wrong Case
Prompt:
<image>\n<|ref|>The hat which is white.<|/ref|>.
Response:
<|ref|>The hat which is white. .<|/ref|><|det|>[[2, 159, 4, 43958, 970]]<|/det|>
Sometimes, the response is even more nonsensical, such as:
<|ref|>Small and silver, this mirror gle.<|/ref|><|det|>[[0, 60, 30, 999999, 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999
There are many similar errors where the model outputs nonsensical bounding boxes. I would appreciate any guidance on how to resolve this issue.
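As a stopgap while debugging, I currently filter out structurally invalid boxes before using them; the failure cases above have five elements or coordinates far beyond the expected range. This is a hedged sketch (`is_valid_box` is my own hypothetical check), assuming the 0-999 coordinate normalization from the official examples:

```python
def is_valid_box(coords, lo=0, hi=999):
    """True iff coords is four in-range integers forming a non-degenerate box.

    Assumption: coordinates are normalized to [0, 999]; anything outside that,
    or a box with the wrong number of elements, is treated as a bad generation.
    """
    return (len(coords) == 4
            and all(isinstance(c, int) and lo <= c <= hi for c in coords)
            and coords[0] <= coords[2]
            and coords[1] <= coords[3])
```

Both failure cases shown above are rejected by this check (the first has five elements, the second has coordinates like 999999), so filtering works, but it obviously doesn't fix the underlying generation problem.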
Thank you!