
Incorrect Output in Visual Grounding Task #33

Open
sleepyshep opened this issue Jan 9, 2025 · 1 comment
Hello! I've encountered an issue with the visual grounding task where the model frequently outputs meaningless bounding boxes. I'm testing the deepseek-vl2-small model. Below are the test code and a failing case.

Test Code

# referring expression for the object to ground
expression = img_info['caption']
# grounding prompt: wrap the expression in <|ref|> ... <|/ref|> tags
prompt = f"<image>\n<|ref|>{expression}<|/ref|>."
conversation = [
    {
        "role": "<|User|>",
        "content": prompt,
        "images": [image_path],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load the image(s) referenced in the conversation and batch the inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)

# fuse the image features with the text embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# greedy decoding (do_sample=False)
outputs = vl_gpt.language.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Wrong Case

Image:
2350216

Prompt:
<image>\n<|ref|>The hat which is white.<|/ref|>.

Response:
<|ref|>The hat which is white. .<|/ref|><|det|>[[2, 159, 4, 43958, 970]]<|/det|>

Sometimes, the response is even more nonsensical, such as:
<|ref|>Small and silver, this mirror gle.<|/ref|><|det|>[[0, 60, 30, 999999, 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999

There are many similar errors where the model outputs nonsensical bounding boxes. I would appreciate any guidance on how to resolve this issue.
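
For reference, here is a minimal sketch of how the <|det|> span could be parsed and sanity-checked, assuming the coordinates are meant to be integers normalized to the 0-999 range (as in the expected responses below). The parse_detections helper is illustrative, not part of the DeepSeek-VL2 API:

import ast
import re

DET_PATTERN = re.compile(r"<\|det\|>(.*?)<\|/det\|>", re.DOTALL)

def parse_detections(answer: str):
    """Extract boxes from a <|det|>[[x1, y1, x2, y2], ...]<|/det|> span."""
    match = DET_PATTERN.search(answer)
    if match is None:
        return []  # no closed <|det|> span, e.g. the runaway-digit output above
    try:
        boxes = ast.literal_eval(match.group(1).strip())
    except (ValueError, SyntaxError):
        return []  # malformed list
    # keep only boxes with exactly four coordinates inside the expected 0-999 range
    return [b for b in boxes
            if len(b) == 4 and all(isinstance(v, int) and 0 <= v <= 999 for v in b)]

print(parse_detections(answer))  # returns [] for the broken outputs shown above
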

Thank you!

deepML2020 commented Jan 9, 2025

Hello, I've also encountered the same problem when I use the small version.
<|User|>:
<|ref|>The giraffe at the back.<|/ref|>.

DeepSeek-VL2-small response:
<|Assistant|>: <|ref|>The giraffe at back.<|/ref|><|det|>[[555560, 269, 9617895, 173171, 199999, 19999, 19, 99, 999, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 。<|end▁of▁sentence|>

Expected response:
<|Assistant|>: <|ref|>The giraffe at the back.<|/ref|><|det|>[[580, 270, 999, 900]]<|/det|><|end▁of▁sentence|>

[Screenshot: visual_grounding_1]
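
As a side note, assuming the <|det|> coordinates are normalized to the 0-999 grid (as the expected response suggests), mapping a box back to pixel coordinates would look roughly like the sketch below; the denormalize_box helper name is mine, not from the repo:

from PIL import Image

def denormalize_box(box, image_path):
    """Map a [x1, y1, x2, y2] box from the 0-999 grid to pixel coordinates."""
    width, height = Image.open(image_path).size
    x1, y1, x2, y2 = box
    return [round(x1 / 999 * width), round(y1 / 999 * height),
            round(x2 / 999 * width), round(y2 / 999 * height)]

# e.g. denormalize_box([580, 270, 999, 900], image_path) for the giraffe image above
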
