from_vlm now has Google gemini 2D spatial understanding support for Detection class #1792
Conversation
cc @SkalskiP friendly ping for review please

Hi @onuralpszr, I was unable to reproduce the results using the attached Colab; the bounding boxes don't look correct. Result with gemini-2.0-flash:

Let me re-check; it's been a while.
Force-pushed a9947cb to 1742537 ("…enums for feature models to come")
cc @SkalskiP @soumik12345, fixes are added and new result pictures are also added. I also updated the Colab for easily testing multiple different Gemini models.
soumik12345
left a comment
LGTM!
```python
GOOGLE_GEMINI_2_0 = "gemini_2_0"
GOOGLE_GEMINI_2_0_FLASH_LITE = "gemini_2_0_flash_lite"
GOOGLE_GEMINI_2_0_FLASH = "gemini_2_0_flash"
GOOGLE_GEMINI_2_5 = "gemini_2_5"
GOOGLE_GEMINI_2_5_FLASH_PREVIEW = "gemini_2_5_flash_preview"
GOOGLE_GEMINI_2_5_PRO_PREVIEW = "gemini_2_5_pro_preview"
```
Can we just add GOOGLE_GEMINI_2_0 and GOOGLE_GEMINI_2_5, i.e. 2 models instead of 6? It looks like there is no difference in processing.
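A minimal sketch of what the reduced enum could look like (hypothetical; assumes the four variant-specific members are dropped because all variants share the same parsing path):

```python
from enum import Enum


class VLM(Enum):
    # Hypothetical reduced enum: one member per Gemini family,
    # since the parsing logic is identical across model variants.
    GOOGLE_GEMINI_2_0 = "gemini_2_0"
    GOOGLE_GEMINI_2_5 = "gemini_2_5"
```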
```python
Parse and scale bounding boxes from Google Gemini style JSON output.
https://aistudio.google.com/
https://ai.google.dev/gemini-api/docs/vision?lang=python
```
replace this with:
Parse and scale bounding boxes from Google Gemini style JSON output.
Include example of such JSON in docs so people can actually see how it looks:
```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```

```python
from google import genai
from google.genai import types
import supervision as sv
from PIL import Image

IMAGE = Image.open(<SOURCE_IMAGE_PATH>)
GENAI_CLIENT = genai.Client(api_key=<API_KEY>)

prompt = <PROMPT>  # note: the original snippet used `prompt` without defining it

system_instructions = '''
Return bounding boxes as a JSON array with labels and ids. Never return masks or code fencing. Limit to 25 objects.
If an object is present multiple times, name them according to their unique characteristic (colors, size, position, unique characteristics, etc..).
'''

safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

response = GENAI_CLIENT.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[prompt, IMAGE],
    config=types.GenerateContentConfig(
        system_instruction=system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    ),
)

detections = sv.Detections.from_lmm(
    sv.LMM.GOOGLE_GEMINI_2_0,
    response.text,
    resolution_wh=(IMAGE.size[0], IMAGE.size[1]),
)

detections.xyxy
# array([[250., 250., 750., 750.]])
detections.class_id
# array([0])
detections.data
# {'class_name': ['cat', 'dog']}
```
Let's make this code snippet a lot shorter. Instead of showing the whole process of acquiring the Gemini output, let's start with the actual response, just like we did with paligemma above.
```python
    return result.astype(float)


def normalized_xyxy_to_absolute_xyxy(
```
Please rename this function to `denormalize_boxes` to match the existing `clip_boxes`, `pad_boxes` and `scale_boxes`.
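A sketch of what the renamed helper could look like (hypothetical signature, mirroring the function under review; this version scales all four columns in one vectorized step):

```python
from typing import Tuple

import numpy as np


def denormalize_boxes(
    normalized_xyxy: np.ndarray,
    resolution_wh: Tuple[int, int],
    normalization_factor: float = 1.0,
) -> np.ndarray:
    # Scale x columns by width and y columns by height, then undo
    # the normalization range (e.g. 0-1, 0-100, 0-1000).
    w, h = resolution_wh
    scale = np.array([w, h, w, h], dtype=float)
    return normalized_xyxy * scale / normalization_factor
```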
````python
    # array([0])
    ```
    """
````
`from_lmm` is actually deprecated; while working on the `from_lmm` docs changes, make sure to copy the docstring to `from_vlm` with all proper changes.
Include a `qwen_2_5_vl` example as well.
```python
def from_google_gemini(
    result: str,
    resolution_wh: Tuple[int, int],
) -> Tuple[np.ndarray, np.ndarray]:
```
This API is inconsistent with the `from_paligemma` and `from_qwen_2_5_vl` implementations.

Currently, it is not possible to resolve `class_id` values. To address this, we should allow users to optionally provide a `classes: Optional[List[str]] = None` argument. If this argument is given, we should attempt to resolve the `class_id` for each detection, following the same approach as in `from_paligemma` and `from_qwen_2_5_vl`.

The function should return `Tuple[np.ndarray, Optional[np.ndarray], np.ndarray]`, consistent with the return type of `from_paligemma` and `from_qwen_2_5_vl`, where `Optional[np.ndarray]` corresponds to `class_id`. This value should be `None` if `classes` is not provided, and an `np.ndarray` if it is.
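A sketch of what the proposed signature could look like. The `box_2d` layout (`[y_min, x_min, y_max, x_max]` on a 0-1000 scale) is taken from the PR's test case; the filter-then-resolve behaviour for `classes` mirrors `from_paligemma` and is an assumption, not the final implementation:

```python
import json
import re
from typing import List, Optional, Tuple

import numpy as np


def from_google_gemini(
    result: str,
    resolution_wh: Tuple[int, int],
    classes: Optional[List[str]] = None,
) -> Tuple[np.ndarray, Optional[np.ndarray], np.ndarray]:
    w, h = resolution_wh
    # Strip the optional ```json ... ``` fencing Gemini tends to add.
    text = re.sub(r"```json|```", "", result).strip()
    items = json.loads(text)

    boxes = np.array([item["box_2d"] for item in items], dtype=float)
    class_name = np.array([item["label"] for item in items])

    # Reorder [ymin, xmin, ymax, xmax] -> [x1, y1, x2, y2], then scale
    # from the 0-1000 normalized range to absolute pixels.
    xyxy = boxes[:, [1, 0, 3, 2]] / 1000.0 * np.array([w, h, w, h])

    class_id = None
    if classes is not None:
        # Keep only detections whose label appears in classes, then
        # resolve each label to its index (as from_paligemma does).
        mask = np.array([name in classes for name in class_name])
        xyxy, class_name = xyxy[mask], class_name[mask]
        class_id = np.array([classes.index(name) for name in class_name])

    return xyxy, class_id, class_name
```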
```python
    or vlm == VLM.GOOGLE_GEMINI_2_5_FLASH_PREVIEW
    or vlm == VLM.GOOGLE_GEMINI_2_5_PRO_PREVIEW
):
    xyxy, class_name = from_google_gemini(result, **kwargs)
```
Once `from_google_gemini` is updated, please make sure to propagate the `class_id` values into the `Detections` object.
| """ | ||
| Convert normalized xyxy coordinates to absolute XYXY coordinates. By default, assumes | ||
| normalized values are between 0 and 1, but supports custom ranges via normalization_factor parameter. | ||
| Args: | ||
| normalized_xyxy (np.ndarray): A numpy array of shape `(N, 4)` where each row contains | ||
| normalized coordinates in format `(x1, y1, x2, y2)` with values between 0 and normalization_factor. | ||
| resolution_wh (Tuple[int, int]): A tuple of the form `(width, height)` representing | ||
| the target resolution. | ||
| normalization_factor (float): The maximum value of the normalization range. For example: | ||
| - normalization_factor=1.0 means input coordinates are normalized between 0 and 1 | ||
| - normalization_factor=100.0 means input coordinates are normalized between 0 and 100 | ||
| - normalization_factor=1000.0 means input coordinates are normalized between 0 and 1000 | ||
| Returns: | ||
| np.ndarray: A numpy array of shape `(N, 4)` containing the absolute coordinates | ||
| in format `(x1, y1, x2, y2)`. | ||
| Examples: | ||
| ```python | ||
| import numpy as np | ||
| import supervision as sv | ||
| # Example with default normalization (0-1) | ||
| normalized_xyxy = np.array([ | ||
| [0.1, 0.2, 0.5, 0.6], | ||
| [0.3, 0.4, 0.7, 0.8] | ||
| ]) | ||
| resolution_wh = (100, 200) | ||
| sv.normalized_xyxy_to_absolute_xyxy(normalized_xyxy, resolution_wh) | ||
| # array([ | ||
| # [ 10., 40., 50., 120.], | ||
| # [ 30., 80., 70., 160.] | ||
| # ]) | ||
| # Example with custom normalization (0-100) | ||
| normalized_xyxy = np.array([ | ||
| [10., 20., 50., 60.], | ||
| [30., 40., 70., 80.] | ||
| ]) | ||
| sv.normalized_xyxy_to_absolute_xyxy(normalized_xyxy, resolution_wh, max_value=100.0) | ||
| # array([ | ||
| # [ 10., 40., 50., 120.], | ||
| # [ 30., 80., 70., 160.] | ||
| # ]) | ||
| ``` | ||
| """ # noqa E501 // docs | ||
| width, height = resolution_wh | ||
| result = normalized_xyxy.copy() | ||
|
|
||
| result[[0, 2]] = (result[[0, 2]] * width) / normalization_factor | ||
| result[[1, 3]] = (result[[1, 3]] * height) / normalization_factor | ||
|
|
||
| return result | ||
|
|
||
|
|
Please update the docstrings for both `from_lmm` and `from_vlm` to include examples demonstrating how to use the `classes` argument, similar to what we did for `from_paligemma`. I also noticed that our Qwen2.5VL example is missing the `classes` argument. Let's add it there as well.
````python
def test_from_google_gemini() -> None:
    result = """```json
[
    {"box_2d": [10, 20, 110, 120], "label": "cat"},
    {"box_2d": [50, 100, 150, 200], "label": "dog"}
]
```"""
    resolution_wh = (640, 480)
    xyxy, class_name = from_google_gemini(
        result=result,
        resolution_wh=resolution_wh,
    )
    np.testing.assert_array_equal(
        xyxy, np.array([[12.8, 4.8, 76.8, 52.8], [64.0, 24.0, 128.0, 72.0]])
    )
    np.testing.assert_array_equal(class_name, np.array(["cat", "dog"]))
````
Please parametrize this test to cover cases both with and without the classes argument. You can use test_from_paligemma as a reference for how to structure these scenarios.
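For reference, the expected values in this test follow directly from Gemini's `box_2d` convention (`[y_min, x_min, y_max, x_max]` on a 0-1000 scale), which can be checked independently:

```python
import numpy as np

# Recompute the expected xyxy values from the test's inputs.
w, h = 640, 480
boxes_2d = np.array([[10, 20, 110, 120], [50, 100, 150, 200]], dtype=float)

# Reorder [ymin, xmin, ymax, xmax] -> [x1, y1, x2, y2], then scale
# each coordinate from the 0-1000 range to absolute pixels.
xyxy = boxes_2d[:, [1, 0, 3, 2]] / 1000.0 * np.array([w, h, w, h])
```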










from_vlm now has Google gemini 2D spatial understanding support for Detection class

New Functionality:

- `supervision/detection/core.py`: Added support for `from_google_gemini` and included an example in the `from_lmm` method documentation. [1] [2] [3] [4]
- `supervision/detection/vlm.py`: Added `GOOGLE_GEMINI_2_0` to `LMM` and `VLM` enums, and implemented the `from_google_gemini` function. [1] [2] [3] [4]