
Conversation

@onuralpszr
Contributor

@onuralpszr onuralpszr commented Feb 19, 2025

πŸš€ from_vlm now has Google gemini 2D spatial understanding support for Detection class 🎯

Open in Colab

✨ New Functionality:

@onuralpszr onuralpszr self-assigned this Feb 19, 2025
@onuralpszr onuralpszr requested a review from SkalskiP as a code owner February 19, 2025 06:26
@onuralpszr onuralpszr changed the title Feature/gemini object detection πŸš€ from_vlm now has Google gemini 2D spatial understanding support for Detection class 🎯 Feb 19, 2025
@onuralpszr
Contributor Author

onuralpszr commented Mar 22, 2025

cc @SkalskiP friendly ping for review please

@soumik12345
Contributor

Hi @onuralpszr, I was unable to reproduce the results using the attached Colab; the bounding boxes don't look correct.

Result with gemini-2.0-flash πŸ‘‡
[screenshot]

Result with gemini-2.5-flash πŸ‘‡
[screenshot]

Result with gemini-2.5-pro πŸ‘‡
[screenshot]

@onuralpszr
Contributor Author

> Hi @onuralpszr, I was unable to reproduce the results using the attached Colab; the bounding boxes don't look correct. [...]

Let me re-check and rework it; it's been a while.

@onuralpszr onuralpszr changed the title πŸš€ from_vlm now has Google gemini 2D spatial understanding support for Detection class 🎯 WIP - πŸš€ from_vlm now has Google gemini 2D spatial understanding support for Detection class 🎯 Jul 9, 2025
@onuralpszr onuralpszr force-pushed the feature/gemini-object-detection branch from a9947cb to 1742537 Compare July 9, 2025 16:23
@onuralpszr onuralpszr changed the title WIP - πŸš€ from_vlm now has Google gemini 2D spatial understanding support for Detection class 🎯 πŸš€ from_vlm now has Google gemini 2D spatial understanding support for Detection class 🎯 Jul 9, 2025
@onuralpszr
Contributor Author

[screenshots: updated detection results]

@onuralpszr
Contributor Author

cc @SkalskiP @soumik12345 the fixes are in and new result screenshots are attached. I also updated the Colab so it's easy to test multiple different Gemini models.

Contributor

@soumik12345 soumik12345 left a comment

LGTM!

@onuralpszr onuralpszr merged commit 096b9a5 into develop Jul 10, 2025
24 checks passed
Comment on lines +24 to +29
GOOGLE_GEMINI_2_0 = "gemini_2_0"
GOOGLE_GEMINI_2_0_FLASH_LITE = "gemini_2_0_flash_lite"
GOOGLE_GEMINI_2_0_FLASH = "gemini_2_0_flash"
GOOGLE_GEMINI_2_5 = "gemini_2_5"
GOOGLE_GEMINI_2_5_FLASH_PREVIEW = "gemini_2_5_flash_preview"
GOOGLE_GEMINI_2_5_PRO_PREVIEW = "gemini_2_5_pro_preview"
Collaborator

Can we add just GOOGLE_GEMINI_2_0 and GOOGLE_GEMINI_2_5, i.e. two models instead of six? It looks like there is no difference in processing.
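
For reference, the trimmed-down enum would then be roughly the following (a sketch only, assuming the `VLM` enum referenced later in the diff is a plain `Enum`, reusing the values shown above):

```python
from enum import Enum


class VLM(Enum):
    # One entry per Gemini generation; a single parser handles both.
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
```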

Comment on lines +360 to +362
Parse and scale bounding boxes from Google Gemini style JSON output.
https://aistudio.google.com/
https://ai.google.dev/gemini-api/docs/vision?lang=python
Collaborator

replace this with:

Parse and scale bounding boxes from Google Gemini style JSON output.

Collaborator

Include an example of such JSON in the docs so people can actually see what it looks like:

[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]

Comment on lines +849 to +892
```python
from google import genai
from google.genai import types
import supervision as sv
from PIL import Image

IMAGE = Image.open(<SOURCE_IMAGE_PATH>)
GENAI_CLIENT = genai.Client(api_key=<API_KEY>)

prompt = "Detect the 2d bounding boxes of objects in the image."
system_instructions = '''
Return bounding boxes as a JSON array with labels and ids. Never return masks or code fencing. Limit to 25 objects.
If an object is present multiple times, name them according to their unique characteristic (colors, size, position, unique characteristics, etc.).
'''
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]
response = GENAI_CLIENT.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[prompt, IMAGE],
    config=types.GenerateContentConfig(
        system_instruction=system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    ),
)
detections = sv.Detections.from_lmm(
    sv.LMM.GOOGLE_GEMINI_2_0,
    response.text,
    resolution_wh=(IMAGE.size[0], IMAGE.size[1]),
)
detections.xyxy
# array([[250., 250., 750., 750.]])
detections.class_id
# array([0])
detections.data
# {'class_name': ['cat', 'dog']}
```
Collaborator

Let's make this code snippet a lot shorter. Instead of showing the whole process of acquiring the Gemini output, let's start with the actual response, just like we did with paligemma above.
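
A sketch of what the shortened snippet could look like, starting straight from a hard-coded response (the response text, resolution, and expected outputs reuse the values from the unit test further down; treat them as illustrative, not as the final docstring):

````python
import supervision as sv

# Gemini returns box_2d as [y1, x1, y2, x2] normalized to 0-1000.
response_text = """```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```"""

detections = sv.Detections.from_vlm(
    sv.VLM.GOOGLE_GEMINI_2_0,
    response_text,
    resolution_wh=(640, 480),
)
detections.xyxy
# array([[ 12.8,   4.8,  76.8,  52.8],
#        [ 64. ,  24. , 128. ,  72. ]])
detections.data
# {'class_name': array(['cat', 'dog'], dtype='<U3')}
````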

return result.astype(float)


def normalized_xyxy_to_absolute_xyxy(
Collaborator

Please rename this function to denormalize_boxes to match the existing clip_boxes, pad_boxes, and scale_boxes.

# array([0])
```
"""
Collaborator

from_lmm is actually deprecated; while working on the from_lmm docs changes, make sure to copy the docstring to from_vlm with all the proper changes.

Collaborator

Include a qwen_2_5_vl example as well.

Comment on lines +355 to +358
def from_google_gemini(
    result: str,
    resolution_wh: Tuple[int, int],
) -> Tuple[np.ndarray, np.ndarray]:
Collaborator

This API is inconsistent with the from_paligemma and from_qwen_2_5_vl implementations.

Currently, it is not possible to resolve class_id values. To address this, we should allow users to optionally provide a classes: Optional[List[str]] = None argument. If this argument is given, we should attempt to resolve the class_id for each detection, following the same approach as in from_paligemma and from_qwen_2_5_vl.

The function should return Tuple[np.ndarray, Optional[np.ndarray], np.ndarray], consistent with the return type of from_paligemma and from_qwen_2_5_vl, where Optional[np.ndarray] corresponds to class_id. This value should be None if classes is not provided, and an np.ndarray if it is.
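
A minimal sketch of how that could look, assuming box_2d comes back as [y1, x1, y2, x2] normalized to 0-1000 (as exercised by the test at the bottom of this review); the parsing details here are placeholders, not the PR's actual implementation:

```python
import json
import re
from typing import List, Optional, Tuple

import numpy as np


def from_google_gemini(
    result: str,
    resolution_wh: Tuple[int, int],
    classes: Optional[List[str]] = None,
) -> Tuple[np.ndarray, Optional[np.ndarray], np.ndarray]:
    # Drop the optional markdown json fence and parse the detection list.
    payload = re.sub(r"```(?:json)?", "", result).strip()
    items = json.loads(payload)

    w, h = resolution_wh
    # box_2d is [y1, x1, y2, x2] normalized to 0-1000; convert to absolute xyxy.
    xyxy = np.array(
        [
            [d["box_2d"][1] * w, d["box_2d"][0] * h,
             d["box_2d"][3] * w, d["box_2d"][2] * h]
            for d in items
        ],
        dtype=float,
    ) / 1000
    class_name = np.array([d["label"] for d in items])

    class_id = None
    if classes is not None:
        # Keep only labels present in `classes` and resolve their ids,
        # mirroring from_paligemma / from_qwen_2_5_vl.
        mask = np.array([name in classes for name in class_name], dtype=bool)
        xyxy, class_name = xyxy[mask], class_name[mask]
        class_id = np.array([classes.index(name) for name in class_name])

    return xyxy, class_id, class_name
```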

        or vlm == VLM.GOOGLE_GEMINI_2_5_FLASH_PREVIEW
        or vlm == VLM.GOOGLE_GEMINI_2_5_PRO_PREVIEW
    ):
        xyxy, class_name = from_google_gemini(result, **kwargs)
Collaborator

Once from_google_gemini is updated, please make sure to propagate the class_id values into the Detections object.
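
For reference, with the values from the test below and classes=["cat", "dog"], the propagation boils down to passing class_id when building the Detections object (a sketch with hard-coded values, not the PR's actual call site):

```python
import numpy as np
import supervision as sv

# Hypothetical parsed output for classes=["cat", "dog"].
xyxy = np.array([[12.8, 4.8, 76.8, 52.8], [64.0, 24.0, 128.0, 72.0]])
class_id = np.array([0, 1])
class_name = np.array(["cat", "dog"])

detections = sv.Detections(
    xyxy=xyxy,
    class_id=class_id,
    data={"class_name": class_name},
)
```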

Comment on lines +455 to +505
"""
Convert normalized xyxy coordinates to absolute XYXY coordinates. By default, assumes
normalized values are between 0 and 1, but supports custom ranges via normalization_factor parameter.
Args:
normalized_xyxy (np.ndarray): A numpy array of shape `(N, 4)` where each row contains
normalized coordinates in format `(x1, y1, x2, y2)` with values between 0 and normalization_factor.
resolution_wh (Tuple[int, int]): A tuple of the form `(width, height)` representing
the target resolution.
normalization_factor (float): The maximum value of the normalization range. For example:
- normalization_factor=1.0 means input coordinates are normalized between 0 and 1
- normalization_factor=100.0 means input coordinates are normalized between 0 and 100
- normalization_factor=1000.0 means input coordinates are normalized between 0 and 1000
Returns:
np.ndarray: A numpy array of shape `(N, 4)` containing the absolute coordinates
in format `(x1, y1, x2, y2)`.
Examples:
```python
import numpy as np
import supervision as sv
# Example with default normalization (0-1)
normalized_xyxy = np.array([
[0.1, 0.2, 0.5, 0.6],
[0.3, 0.4, 0.7, 0.8]
])
resolution_wh = (100, 200)
sv.normalized_xyxy_to_absolute_xyxy(normalized_xyxy, resolution_wh)
# array([
# [ 10., 40., 50., 120.],
# [ 30., 80., 70., 160.]
# ])
# Example with custom normalization (0-100)
normalized_xyxy = np.array([
[10., 20., 50., 60.],
[30., 40., 70., 80.]
])
sv.normalized_xyxy_to_absolute_xyxy(normalized_xyxy, resolution_wh, max_value=100.0)
# array([
# [ 10., 40., 50., 120.],
# [ 30., 80., 70., 160.]
# ])
```
""" # noqa E501 // docs
width, height = resolution_wh
result = normalized_xyxy.copy()

result[[0, 2]] = (result[[0, 2]] * width) / normalization_factor
result[[1, 3]] = (result[[1, 3]] * height) / normalization_factor

return result


Collaborator

Please update the docstrings for both from_lmm and from_vlm to include examples demonstrating how to use the classes argument, similar to what we did for from_paligemma. I also noticed that our Qwen2.5VL example is missing the classes argument. Let’s add it there as well.
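
For the Gemini docstrings, such a classes example could look roughly like this once the argument is wired through (a speculative sketch reusing the test values below; the exact behaviour depends on the changes requested above):

````python
import supervision as sv

response_text = """```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```"""

detections = sv.Detections.from_vlm(
    sv.VLM.GOOGLE_GEMINI_2_0,
    response_text,
    resolution_wh=(640, 480),
    classes=["cat", "dog"],
)
detections.class_id
# array([0, 1])
detections.data
# {'class_name': array(['cat', 'dog'], dtype='<U3')}
````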

Comment on lines +362 to +377
def test_from_google_gemini() -> None:
    result = """```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```"""
    resolution_wh = (640, 480)
    xyxy, class_name = from_google_gemini(
        result=result,
        resolution_wh=resolution_wh,
    )
    np.testing.assert_array_equal(
        xyxy, np.array([[12.8, 4.8, 76.8, 52.8], [64.0, 24.0, 128.0, 72.0]])
    )
    np.testing.assert_array_equal(class_name, np.array(["cat", "dog"]))
Collaborator

Please parametrize this test to cover cases both with and without the classes argument. You can use test_from_paligemma as a reference for how to structure these scenarios.
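
A sketch of the parametrized version; the import path and the three-value return (including class_id) are assumptions based on the review comments above:

````python
import numpy as np
import pytest

from supervision.detection.vlm import from_google_gemini  # import path assumed

RESULT = """```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```"""


@pytest.mark.parametrize(
    "classes, expected_class_id",
    [
        (None, None),                        # without classes: class_id stays None
        (["cat", "dog"], np.array([0, 1])),  # with classes: ids are resolved
    ],
)
def test_from_google_gemini(classes, expected_class_id) -> None:
    xyxy, class_id, class_name = from_google_gemini(
        result=RESULT,
        resolution_wh=(640, 480),
        classes=classes,
    )
    np.testing.assert_array_equal(
        xyxy, np.array([[12.8, 4.8, 76.8, 52.8], [64.0, 24.0, 128.0, 72.0]])
    )
    np.testing.assert_array_equal(class_name, np.array(["cat", "dog"]))
    if expected_class_id is None:
        assert class_id is None
    else:
        np.testing.assert_array_equal(class_id, expected_class_id)
````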
