from_vlm now has Google gemini 2D spatial understanding support for Detection class #1792
Conversation
cc @SkalskiP friendly ping for review please

Hi @onuralpszr, I was unable to reproduce the results using the attached Colab; the bounding boxes don't look correct. Result with gemini-2.0-flash:

Let me re-check; it's been a while.
Force-pushed a9947cb to 1742537 ("…enums for feature models to come")
cc @SkalskiP @soumik12345, fixes are added and new result pictures are also added. I also updated the Colab for easily testing multiple different Gemini models.
soumik12345
left a comment
LGTM!
```python
GOOGLE_GEMINI_2_0 = "gemini_2_0"
GOOGLE_GEMINI_2_0_FLASH_LITE = "gemini_2_0_flash_lite"
GOOGLE_GEMINI_2_0_FLASH = "gemini_2_0_flash"
GOOGLE_GEMINI_2_5 = "gemini_2_5"
GOOGLE_GEMINI_2_5_FLASH_PREVIEW = "gemini_2_5_flash_preview"
GOOGLE_GEMINI_2_5_PRO_PREVIEW = "gemini_2_5_pro_preview"
```
Can we just add GOOGLE_GEMINI_2_0 and GOOGLE_GEMINI_2_5, i.e. 2 models instead of 6? It looks like there is no difference in processing.
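A minimal sketch of what the reduced enum could look like (hypothetical; assumes the four variant-specific members are dropped because all variants share the same parsing path):

```python
from enum import Enum


class VLM(Enum):
    # Hypothetical reduced enum: one member per Gemini family,
    # since the parsing logic is identical across model variants.
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
```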
```python
Parse and scale bounding boxes from Google Gemini style JSON output.
https://aistudio.google.com/
https://ai.google.dev/gemini-api/docs/vision?lang=python
```
replace this with:
Parse and scale bounding boxes from Google Gemini style JSON output.
Include example of such JSON in docs so people can actually see how it looks:
```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```

```python
from google import genai
from google.genai import types
import supervision as sv
from PIL import Image

IMAGE = Image.open(<SOURCE_IMAGE_PATH>)
GENAI_CLIENT = genai.Client(api_key=<API_KEY>)

prompt = <PROMPT>  # note: the original snippet used `prompt` without defining it

system_instructions = '''
Return bounding boxes as a JSON array with labels and ids. Never return masks or code fencing. Limit to 25 objects.
If an object is present multiple times, name them according to their unique characteristic (colors, size, position, unique characteristics, etc..).
'''

safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

response = GENAI_CLIENT.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[prompt, IMAGE],
    config=types.GenerateContentConfig(
        system_instruction=system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    ),
)

detections = sv.Detections.from_lmm(
    sv.LMM.GOOGLE_GEMINI_2_0,
    response.text,
    resolution_wh=(IMAGE.size[0], IMAGE.size[1]),
)

detections.xyxy
# array([[250., 250., 750., 750.]])
detections.class_id
# array([0])
detections.data
# {'class_name': ['cat', 'dog']}
```
Let's make this code snippet a lot shorter. Instead of showing the whole process of acquiring the Gemini output, let's start with the actual response, just like we did with paligemma above.
```python
    return result.astype(float)


def normalized_xyxy_to_absolute_xyxy(
```
Please rename this function to `denormalize_boxes` to match the existing `clip_boxes`, `pad_boxes` and `scale_boxes`.
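A sketch of what the renamed helper could look like (hypothetical signature, mirroring the function under review; this version scales all four columns in one vectorized step):

```python
from typing import Tuple

import numpy as np


def denormalize_boxes(
    normalized_xyxy: np.ndarray,
    resolution_wh: Tuple[int, int],
    normalization_factor: float = 1.0,
) -> np.ndarray:
    # Scale x columns by width and y columns by height, then undo
    # the normalization range (e.g. 0-1, 0-100, 0-1000).
    w, h = resolution_wh
    scale = np.array([w, h, w, h], dtype=float)
    return normalized_xyxy * scale / normalization_factor
```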
````python
    # array([0])
    ```
    """
````
`from_lmm` is actually deprecated; while working on the `from_lmm` docs changes, make sure to copy the docstring to `from_vlm` with all proper changes.
Include a `qwen_2_5_vl` example as well.
```python
def from_google_gemini(
    result: str,
    resolution_wh: Tuple[int, int],
) -> Tuple[np.ndarray, np.ndarray]:
```
This API is inconsistent with the `from_paligemma` and `from_qwen_2_5_vl` implementations.

Currently, it is not possible to resolve `class_id` values. To address this, we should allow users to optionally provide a `classes: Optional[List[str]] = None` argument. If this argument is given, we should attempt to resolve the `class_id` for each detection, following the same approach as in `from_paligemma` and `from_qwen_2_5_vl`.

The function should return `Tuple[np.ndarray, Optional[np.ndarray], np.ndarray]`, consistent with the return type of `from_paligemma` and `from_qwen_2_5_vl`, where `Optional[np.ndarray]` corresponds to `class_id`. This value should be `None` if `classes` is not provided, and an `np.ndarray` if it is.
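A sketch of what the proposed signature could look like. The `box_2d` layout (`[y_min, x_min, y_max, x_max]` on a 0-1000 scale) is taken from the PR's test case; the filter-then-resolve behaviour for `classes` mirrors `from_paligemma` and is an assumption, not the final implementation:

```python
import json
import re
from typing import List, Optional, Tuple

import numpy as np


def from_google_gemini(
    result: str,
    resolution_wh: Tuple[int, int],
    classes: Optional[List[str]] = None,
) -> Tuple[np.ndarray, Optional[np.ndarray], np.ndarray]:
    w, h = resolution_wh
    # Strip the optional ```json ... ``` fencing Gemini tends to add.
    text = re.sub(r"```json|```", "", result).strip()
    items = json.loads(text)

    boxes = np.array([item["box_2d"] for item in items], dtype=float)
    class_name = np.array([item["label"] for item in items])

    # Reorder [ymin, xmin, ymax, xmax] -> [x1, y1, x2, y2], then scale
    # from the 0-1000 normalized range to absolute pixels.
    xyxy = boxes[:, [1, 0, 3, 2]] / 1000.0 * np.array([w, h, w, h])

    class_id = None
    if classes is not None:
        # Keep only detections whose label appears in classes, then
        # resolve each label to its index (as from_paligemma does).
        mask = np.array([name in classes for name in class_name])
        xyxy, class_name = xyxy[mask], class_name[mask]
        class_id = np.array([classes.index(name) for name in class_name])

    return xyxy, class_id, class_name
```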
```python
    or vlm == VLM.GOOGLE_GEMINI_2_5_FLASH_PREVIEW
    or vlm == VLM.GOOGLE_GEMINI_2_5_PRO_PREVIEW
):
    xyxy, class_name = from_google_gemini(result, **kwargs)
```
Once `from_google_gemini` is updated, please make sure to propagate the `class_id` values into the `Detections` object.
| """ | ||
| Convert normalized xyxy coordinates to absolute XYXY coordinates. By default, assumes | ||
| normalized values are between 0 and 1, but supports custom ranges via normalization_factor parameter. | ||
| Args: | ||
| normalized_xyxy (np.ndarray): A numpy array of shape `(N, 4)` where each row contains | ||
| normalized coordinates in format `(x1, y1, x2, y2)` with values between 0 and normalization_factor. | ||
| resolution_wh (Tuple[int, int]): A tuple of the form `(width, height)` representing | ||
| the target resolution. | ||
| normalization_factor (float): The maximum value of the normalization range. For example: | ||
| - normalization_factor=1.0 means input coordinates are normalized between 0 and 1 | ||
| - normalization_factor=100.0 means input coordinates are normalized between 0 and 100 | ||
| - normalization_factor=1000.0 means input coordinates are normalized between 0 and 1000 | ||
| Returns: | ||
| np.ndarray: A numpy array of shape `(N, 4)` containing the absolute coordinates | ||
| in format `(x1, y1, x2, y2)`. | ||
| Examples: | ||
| ```python | ||
| import numpy as np | ||
| import supervision as sv | ||
| # Example with default normalization (0-1) | ||
| normalized_xyxy = np.array([ | ||
| [0.1, 0.2, 0.5, 0.6], | ||
| [0.3, 0.4, 0.7, 0.8] | ||
| ]) | ||
| resolution_wh = (100, 200) | ||
| sv.normalized_xyxy_to_absolute_xyxy(normalized_xyxy, resolution_wh) | ||
| # array([ | ||
| # [ 10., 40., 50., 120.], | ||
| # [ 30., 80., 70., 160.] | ||
| # ]) | ||
| # Example with custom normalization (0-100) | ||
| normalized_xyxy = np.array([ | ||
| [10., 20., 50., 60.], | ||
| [30., 40., 70., 80.] | ||
| ]) | ||
| sv.normalized_xyxy_to_absolute_xyxy(normalized_xyxy, resolution_wh, max_value=100.0) | ||
| # array([ | ||
| # [ 10., 40., 50., 120.], | ||
| # [ 30., 80., 70., 160.] | ||
| # ]) | ||
| ``` | ||
| """ # noqa E501 // docs | ||
| width, height = resolution_wh | ||
| result = normalized_xyxy.copy() | ||
|
|
||
| result[[0, 2]] = (result[[0, 2]] * width) / normalization_factor | ||
| result[[1, 3]] = (result[[1, 3]] * height) / normalization_factor | ||
|
|
||
| return result | ||
|
|
||
|
|
Please update the docstrings for both `from_lmm` and `from_vlm` to include examples demonstrating how to use the `classes` argument, similar to what we did for `from_paligemma`. I also noticed that our Qwen2.5VL example is missing the `classes` argument. Let's add it there as well.
````python
def test_from_google_gemini() -> None:
    result = """```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```"""
    resolution_wh = (640, 480)
    xyxy, class_name = from_google_gemini(
        result=result,
        resolution_wh=resolution_wh,
    )
    np.testing.assert_array_equal(
        xyxy, np.array([[12.8, 4.8, 76.8, 52.8], [64.0, 24.0, 128.0, 72.0]])
    )
    np.testing.assert_array_equal(class_name, np.array(["cat", "dog"]))
````
Please parametrize this test to cover cases both with and without the classes argument. You can use test_from_paligemma as a reference for how to structure these scenarios.
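For reference, the expected values in this test follow directly from Gemini's `box_2d` convention (`[y_min, x_min, y_max, x_max]` on a 0-1000 scale), which can be checked independently:

```python
import numpy as np

# Recompute the expected xyxy values from the test's inputs.
w, h = 640, 480
boxes_2d = np.array([[10, 20, 110, 120], [50, 100, 150, 200]], dtype=float)

# Reorder [ymin, xmin, ymax, xmax] -> [x1, y1, x2, y2], then scale
# each coordinate from the 0-1000 range to absolute pixels.
xyxy = boxes_2d[:, [1, 0, 3, 2]] / 1000.0 * np.array([w, h, w, h])
```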










from_vlm now has Google gemini 2D spatial understanding support for Detection class

New Functionality:

- `supervision/detection/core.py`: Added support for `from_google_gemini` and included an example in the `from_lmm` method documentation. [1] [2] [3] [4]
- `supervision/detection/vlm.py`: Added `GOOGLE_GEMINI_2_0` to `LMM` and `VLM` enums, and implemented the `from_google_gemini` function. [1] [2] [3] [4]