Description
Hello! I've run into an issue with the visual grounding task: the model frequently outputs meaningless bounding boxes. I'm testing the deepseek-vl2-small model. Below are the test code and a failing case.
Test Code
# vl_gpt, vl_chat_processor, tokenizer, img_info, and image_path are
# initialized beforehand, following the official DeepSeek-VL2 inference example.
expression = img_info['caption']
prompt = f"<image>\n<|ref|>{expression}<|/ref|>."
conversation = [
    {
        "role": "<|User|>",
        "content": prompt,
        "images": [image_path],
    },
    {"role": "<|Assistant|>", "content": ""},
]
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)
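For reference, this is the post-processing I use on the decoded answer. It is a minimal sketch, not code from the official repo: `extract_boxes` is a hypothetical helper, and it assumes the `<|det|>[[x1, y1, x2, y2]]<|/det|>` coordinates are normalized to the 0-999 range (as in the official grounding examples) before being scaled to pixels.

```python
import re

def extract_boxes(answer: str, img_w: int, img_h: int):
    """Parse <|det|>...<|/det|> from a model answer into pixel boxes.

    Assumption: coordinates are integers normalized to [0, 999].
    Boxes that do not have exactly four coordinates are skipped.
    """
    det = re.search(r"<\|det\|>(.*?)<\|/det\|>", answer, re.DOTALL)
    if det is None:
        return []
    boxes = []
    for quad in re.findall(r"\[([\d\s,]+?)\]", det.group(1)):
        coords = [int(c) for c in quad.split(",")]
        if len(coords) == 4:
            x1, y1, x2, y2 = coords
            boxes.append((x1 * img_w // 999, y1 * img_h // 999,
                          x2 * img_w // 999, y2 * img_h // 999))
    return boxes
```

With a well-formed response this yields usable pixel boxes, but on the failure cases below the parsed coordinates are far outside any valid range.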
Wrong Case
Prompt:
<image>\n<|ref|>The hat which is white.<|/ref|>.
Response:
<|ref|>The hat which is white. .<|/ref|><|det|>[[2, 159, 4, 43958, 970]]<|/det|>
Sometimes, the response is even more nonsensical, such as:
<|ref|>Small and silver, this mirror gle.<|/ref|><|det|>[[0, 60, 30, 999999, 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999
There are many similar errors where the model outputs nonsensical bounding boxes. I would appreciate any guidance on how to resolve this issue.
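As a stopgap while debugging, I currently filter out structurally invalid boxes before using them; the failure cases above have five elements or coordinates far beyond the expected range. This is a hedged sketch (`is_valid_box` is my own hypothetical check), assuming the 0-999 coordinate normalization from the official examples:

```python
def is_valid_box(coords, lo=0, hi=999):
    """True iff coords is four in-range integers forming a non-degenerate box.

    Assumption: coordinates are normalized to [0, 999]; anything outside that,
    or a box with the wrong number of elements, is treated as a bad generation.
    """
    return (len(coords) == 4
            and all(isinstance(c, int) and lo <= c <= hi for c in coords)
            and coords[0] <= coords[2]
            and coords[1] <= coords[3])
```

Both failure cases shown above are rejected by this check (the first has five elements, the second has coordinates like 999999), so filtering works, but it obviously doesn't fix the underlying generation problem.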
Thank you!