
Conversation

@jacobsela jacobsela commented Apr 10, 2025

Change log

  1. Formalizes and officially documents the SampleCollection.to_torch() interface for optimized model inference
  2. Generalizes FiftyOneTorchDataset by adding a new GetItem class for loading model inputs for samples
  3. Adds a GetItem implementation for TorchImageModel
  4. Refactors methods like apply_model() and compute_embeddings() to use FiftyOneTorchDataset by default when possible
  5. Allows any model to opt into the new implementation by implementing SupportsGetItem
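Items 2–5 describe a common pattern: a small object that knows which sample fields it needs and how to turn a sample dict into a model input. The sketch below is illustrative only — the class and method names follow the PR description, but this is not the actual FiftyOne implementation:

```python
# Illustrative skeleton of the GetItem pattern (hypothetical, simplified;
# the real classes live in fiftyone/utils/torch.py).
class GetItem:
    """Converts a dict of sample fields into a model input."""

    def __init__(self, field_mapping=None):
        # Default every required field to an identity mapping (k -> k),
        # then apply any user-supplied overrides.
        self.field_mapping = {f: f for f in self.required_fields}
        if field_mapping:
            for k, v in field_mapping.items():
                if k not in self.required_fields:
                    raise ValueError(f"'{k}' is not a required field")
                self.field_mapping[k] = v

    @property
    def required_fields(self):
        return ["filepath"]

    def sample_dict_to_input(self, sample_dict):
        raise NotImplementedError("subclasses must implement this")

    def __call__(self, sample_dict):
        return self.sample_dict_to_input(sample_dict)


class PathGetItem(GetItem):
    """Toy subclass that just returns the (possibly remapped) filepath."""

    def sample_dict_to_input(self, sample_dict):
        return sample_dict[self.field_mapping["filepath"]]
```

With `field_mapping={"filepath": "alt_path"}`, the loader would read the `alt_path` field of each sample instead of `filepath`.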

Example usage

The results of apply_model are the same before and after this code change:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("cifar10", split="test")

model = foz.load_zoo_model("clip-vit-base32-torch", classes=["truck", "dog"])

dataset.apply_model(model, label_field="predictions", batch_size=64)

Benchmarked throughput on OSS to confirm there is no slowdown

Using same code as above with clip-vit-base32-torch:

  • Before: ~1.1k samples / second
  • After: ~1.4k samples / second

coderabbitai bot commented Apr 10, 2025

Walkthrough

This pull request introduces a new optional parameter field_mapping to several functions in the core models file, updating their signatures and documentation. The data loader flow now passes field_mapping to a newly introduced error-handling class for batch collation, and the dataset creation logic is enhanced to select between different generation methods based on model capabilities. In addition, new classes (GetItem and its subclass ImageGetItem) are added to standardize sample-to-input conversion within the Torch utilities. A new test suite validates the behavior of GetItem and its field mapping features.

Changes

File(s) Change Summary
fiftyone/core/models.py - Added optional parameter field_mapping to functions: apply_model, compute_embeddings, _apply_image_model_data_loader, _make_data_loader, and related internal methods.
- Introduced SupportsGetItem mixin class for models supporting build_get_item.
- Added ErrorHandlingCollate class for batch collation with error handling modes.
- Refactored dataset creation to use model-specific build_get_item method when available.
- Updated DataLoader to set pin_memory=True, persistent_workers=False, and conditionally set worker_init_fn.
- Unified sample field selection logic before inference or embedding computations.
- Adjusted control flow to handle video frame processing with image models.
fiftyone/utils/torch.py - Added abstract base class GetItem for converting sample dictionaries to model inputs, managing required fields, field mappings, and validation.
- Added subclass ImageGetItem specialized for image loading and transformation with support for raw inputs and half precision.
- Extended TorchImageModel with build_get_item method returning an ImageGetItem instance.
- Modified FiftyOneTorchDataset to accept a GetItem instance and added support for vectorized field caching with two batch preparation modes (_prepare_batch_db and _prepare_batch_vectorized).
- Replaced __getitem__ with __getitems__ to support batch indexing and unified single-item access.
- Enhanced error handling during batch preparation.
tests/unittests/get_item_tests.py - Added comprehensive unit tests for GetItem class and its behavior, including required fields management, field mapping validation, sample compatibility checks, and mixin usage.
tests/unittests/torch_dataset_tests.py - Replaced previous get-item function with GetItem subclasses IdentityGetItem and IntIdentityGetItem.
- Updated tests to use these classes, renamed tests to reflect vectorization focus, and adjusted expected outputs accordingly.
fiftyone/core/collections.py - Added a new to_torch method to SampleCollection for creating FiftyOneTorchDataset instances with explicit parameters for vectorization and failure skipping.
- Removed an older to_torch method with a simpler signature.
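The `__getitems__` batch-access protocol mentioned in the torch.py changes is worth illustrating: recent PyTorch fetchers call `__getitems__(indices)` when a dataset defines it, so a whole batch can be prepared at once (e.g., with a single database query) instead of one `__getitem__` call per index. A hypothetical, self-contained sketch of the idea:

```python
# Hypothetical sketch of the __getitems__ batch protocol; not the actual
# FiftyOneTorchDataset code.
class BatchDataset:
    def __init__(self, records):
        self.records = records

    def __getitems__(self, indices):
        # one vectorized lookup for the whole batch
        return [self.records[i] for i in indices]

    def __getitem__(self, index):
        # single-item access unified through the batch path
        return self.__getitems__([index])[0]
```

This mirrors the PR's "replaced __getitem__ with __getitems__" change: single-item access becomes a special case of batch access.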

Sequence Diagram(s)

sequenceDiagram
    participant S as Sample
    participant AM as apply_model()
    participant DL as _make_data_loader()
    participant DS as Dataset Creator
    participant EC as ErrorHandlingCollate
    participant M as Model

    S->>AM: Provide sample (+ field_mapping)
    AM->>DL: Invoke _make_data_loader(field_mapping)
    DL->>DS: Create dataset (check for SupportsGetItem)
    DS->>M: Use model.build_get_item(field_mapping) to create GetItem
    DS->>EC: Pass data for batch collation with error handling
    EC->>M: Process batch for inference
sequenceDiagram
    participant T as TorchImageModel
    participant IG as ImageGetItem
    participant S as Sample Dict
    participant MI as Model Input

    T->>IG: build_get_item(transform, raw_inputs, half_precision, field_mapping)
    S->>IG: Provide sample dictionary
    IG->>IG: Validate required keys and apply transforms
    IG->>MI: Return processed input for the model

Possibly related PRs

Suggested reviewers

  • manushreegangwar

Poem

In the code garden I hop along,
Mapping fields where they belong.
Errors handled with gentle grace,
Data flows in a seamless race.
With GetItem's art and insight so keen,
My rabbit heart sings for changes seen!
🐇✨ Happy code, happy dreams!

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
fiftyone/utils/torch.py (3)

258-275: Docstring provides good context, but consider examples.
The docstring thoroughly describes the intended usage. However, consider adding a simple usage example for clarity, especially showing how to implement sample_dict_to_input.


361-374: Sample validation could be expensive for large collections.
The method calls samples.first() and tries a forward pass on one sample. This is usually enough to detect major issues but be mindful of performance in tight loops. Otherwise, logic is sound.
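The fail-fast validation described here can be sketched generically (a hypothetical stand-in using plain dicts; the real method operates on FiftyOne samples via samples.first()):

```python
# Hedged sketch of first-sample validation: check only the first sample's
# fields and attempt one trial conversion, trading thoroughness for speed.
def validate_compatible_samples(samples, get_item):
    first = samples[0]  # the real code uses samples.first()
    missing = [f for f in get_item.required_fields if f not in first]
    if missing:
        raise ValueError(f"samples are missing required fields: {missing}")
    get_item(first)  # trial conversion; raises if the fields are malformed
```

Later samples with inconsistent fields would not be caught here, which is the trade-off the comment describes.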


640-656: Subclass docstring is clear, but consider extended usage details.
ImageGetItem clarifies usage for image-based inputs. A simple demonstration of transforming and returning custom data fields might help future maintainers.

fiftyone/core/models.py (1)

836-837: Extended field_mapping usage in embedding pipelines.
The optional parameter is now threaded into data loader creation for embedding operations, maintaining consistency across the codebase. Documenting edge cases (e.g., conflicting fields) would be helpful.

Also applies to: 1003-1004, 1122-1122

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9305909 and 65d3d71.

📒 Files selected for processing (2)
  • fiftyone/core/models.py (10 hunks)
  • fiftyone/utils/torch.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py

341-341: Loop control variable v not used within loop body

(B007)


673-673: Local variable imgs is assigned to but never used

Remove assignment to unused variable imgs

(F841)


673-673: Undefined name imgs

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-app
  • GitHub Check: build / build
  • GitHub Check: lint / eslint
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (13)
fiftyone/utils/torch.py (9)

26-26: No immediate issues with the new import.
The added import for torchvision.transforms.functional looks standard and consistent with existing code.


38-38: Import usage appears fine.
Importing fiftyone.core.validation under a shorter alias (foval) is clean and consistent with the rest of the style.


277-280: Initialization is concise.
The constructor properly sets the field_mapping via the property setter. No immediate issues found.


281-290: Abstract method placeholders.
The sample_dict_to_input method raises NotImplementedError, which is appropriate for an abstract base. Ensure you have unit tests in subclasses to confirm actual implementations override this method.

Do you plan to write a dedicated test for each subclass to verify correct input handling?


291-293: Wrapper call is straightforward.
Calling self.sample_dict_to_input(sample) in __call__ is a clear and idiomatic approach.


294-316: Required fields logic is robust.
The property and its setter handle duplicates carefully and provide helpful validation. This should prevent accidental field mismatches. Looks good.


318-352: Field mapping property setter.
Enforcing that mapping keys belong to required_fields is a good safety measure. The approach of defaulting to identity (k -> k) is also intuitive.



353-360: Field mapping updates are consistent.
update_field_mapping() ensures newly introduced required fields have default mappings. No red flags here.


748-754: Factory method for DataLoader usage.
_build_get_item returns a specialized ImageGetItem, enabling consistent data loading for Torch-based models. Implementation looks good.

fiftyone/core/models.py (4)

64-65: Added field_mapping parameter documentation.
The new parameter is clearly introduced and described, ensuring that users know how to map sample fields to model inputs. Documentation is concise.

Also applies to: 109-110


257-257: Passing field_mapping through the data loader path.
The call sites _apply_image_model_data_loader and subsequent usage in _make_data_loader handle the new parameter consistently, preserving backward compatibility.

Also applies to: 417-418


698-757: ErrorHandlingCollate class logic.
This new collate function gracefully handles batch-level exceptions and returns a single Exception or the stacked data. The approach is clear. Just ensure any partial data is properly handled or logged when multiple errors occur.
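The behavior described here can be sketched with a minimal, pure-Python stand-in (hypothetical; the real class collates torch tensors and supports several error-handling modes):

```python
# Hedged sketch of an error-handling collate: if any item in the batch is an
# Exception (from a failed load), surface it or skip it instead of crashing
# the DataLoader worker.
class ErrorHandlingCollate:
    def __init__(self, skip_failures=False):
        self.skip_failures = skip_failures

    def __call__(self, batch):
        errors = [b for b in batch if isinstance(b, Exception)]
        if errors and not self.skip_failures:
            return errors[0]  # propagate the first error to the caller
        good = [b for b in batch if not isinstance(b, Exception)]
        return good  # the real implementation would stack these into tensors
```

Returning the exception (rather than raising inside the worker) lets the main process decide whether to abort or log and continue.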


759-811: _make_data_loader method.

  1. Integrates field_mapping cleanly with _build_get_item.
  2. Provides a fallback to TorchImageDataset if _build_get_item isn't available.
  3. The use_fotd logic is straightforward.
    Overall, the enhancements appear robust.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/unittests/get_item_tests.py (2)

37-46: Remove redundant method call on line 42

There appears to be a redundant call to add_required_fields("baz") on line 42 that doesn't include any assertion. The same method is called again on line 44 followed by an assertion, making line 42 unnecessary.

        gi = DummyGetItem()
        self.assertEqual(set(gi.required_fields), set(["foo", "bar"]))
        gi.add_required_fields("baz")
        self.assertEqual(set(gi.required_fields), set(["foo", "bar", "baz"]))
-       gi.add_required_fields("baz")

        gi.add_required_fields("baz")
        self.assertEqual(set(gi.required_fields), set(["foo", "bar", "baz"]))

82-82: Remove unnecessary f-string prefix

The string doesn't contain any placeholders, so the f prefix is unnecessary.

-       samples = [fo.Sample(filepath=f"sample.jpg")]
+       samples = [fo.Sample(filepath="sample.jpg")]

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 65d3d71 and 70d2a95.

📒 Files selected for processing (1)
  • tests/unittests/get_item_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/unittests/get_item_tests.py

82-82: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-app
  • GitHub Check: lint / eslint
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: build / build
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: e2e / test-e2e
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: build
🔇 Additional comments (1)
tests/unittests/get_item_tests.py (1)

1-117: Well-structured test suite for the new GetItem class

This test suite thoroughly covers the functionality of the new GetItem class, including initialization, field mapping, validation, and extensibility with mixins. The tests are comprehensive and will help ensure the reliability of this new component.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
fiftyone/utils/torch.py (3)

258-275: Fix minor grammatical inconsistencies in the docstring.

  1. Replace "It's :method:fiftyone.utils.torch.GetItem.samples_dict_to_input" with "Its :meth:sample_dict_to_input".
  2. Update the method name reference from "samples_dict_to_input" to "sample_dict_to_input" to match the actual function signature.
-    It's :method:`fiftyone.utils.torch.GetItem.samples_dict_to_input` method
+    Its :meth:`sample_dict_to_input` method

277-290: Clarify the documentation wording for sample_dict_to_input.

In the docstring at line 282, consider rephrasing "Return model input from a list if samples' dicts" for clarity. For example:

-        """Return model input from a list if samples' dicts
+        """Return model input from a sample's dictionary or a sample view

341-341: Remove unused loop variable v.

Since v is never used inside the loop, consider renaming it to _ to avoid confusion and to comply with best practices:

-for k, v in value.items():
+for k, _ in value.items():

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 70d2a95 and e5446f1.

📒 Files selected for processing (1)
  • fiftyone/utils/torch.py (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (1)
fiftyone/core/models.py (1)
  • transforms (2078-2082)
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py

341-341: Loop control variable v not used within loop body

(B007)

⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: build / build
  • GitHub Check: e2e / test-e2e
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test / test-app
  • GitHub Check: build
🔇 Additional comments (9)
fiftyone/utils/torch.py (9)

26-26: No issues found for the new import.


38-38: Import looks good.


291-317: Logic for handling required_fields looks good.


318-340: Field mapping setup is properly structured.


342-357: Field mapping validation flow is correct.


361-374: Sample compatibility check is straightforward.


640-657: Header logic for ImageGetItem is well-organized.


658-675: Confirm that transform always yields a tensor before calling .half().

If transform might return PIL images or arrays, .half() would fail. Ensure that your transforms consistently output torch.Tensor objects here, or add a conditional check.
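The guard the comment asks for can be sketched as follows (a hedged, duck-typed stand-in so the sketch is self-contained; in the actual code the check would be `isinstance(img, torch.Tensor)`):

```python
# Hypothetical guard: only call .half() when the transform actually produced
# a tensor-like object, so PIL images and arrays pass through unchanged.
def maybe_half(img, use_half_precision):
    if use_half_precision and hasattr(img, "half"):
        return img.half()
    return img
```

This keeps the half-precision conversion a no-op for any transform output that does not support it.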


747-755: Method _build_get_item nicely integrates with ImageGetItem.

@jacobsela jacobsela force-pushed the torch-image-model-get-item branch from e5c605f to d038d68 Compare April 16, 2025 15:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
fiftyone/utils/torch.py (1)

673-673: Half-precision conversion should check if img is a tensor.

Line 673 converts img to half precision without checking if it's a tensor, which could cause errors with some inputs. This was fixed in a previous commit (mentioned in the past review comments), but I'm flagging it to ensure it stays fixed.

🧹 Nitpick comments (2)
tests/unittests/get_item_tests.py (1)

79-95: Remove the unnecessary f-string prefix from the filepath.

The test uses f"sample.jpg" but the f-string prefix is unnecessary since there are no placeholders.

-        samples = [fo.Sample(filepath=f"sample.jpg")]
+        samples = [fo.Sample(filepath="sample.jpg")]

fiftyone/utils/torch.py (1)

318-353: Unused variable in field mapping setter loop.

On line 341, the loop variable v is not used in the body of the loop. This doesn't affect functionality but could be cleaned up for better readability.

-        for k, v in value.items():
-            if k not in self.required_fields:
+        for k in value:
+            if k not in self.required_fields:

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e5446f1 and d038d68.

📒 Files selected for processing (3)
  • fiftyone/core/models.py (10 hunks)
  • fiftyone/utils/torch.py (5 hunks)
  • tests/unittests/get_item_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py

341-341: Loop control variable v not used within loop body

(B007)

tests/unittests/get_item_tests.py

82-82: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: e2e / test-e2e
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: build / build
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test / test-app
  • GitHub Check: lint / eslint
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: build
🔇 Additional comments (15)
tests/unittests/get_item_tests.py (4)

8-17: This class extends GetItem to test field mapping and required fields management.

The DummyGetItem implementation demonstrates how to properly subclass GetItem by specifying required fields and implementing the abstract sample_dict_to_input method. This is a good test utility class.


20-46: Test coverage is comprehensive for required_fields functionality.

Good test cases covering initialization, querying and extending required fields, including repeated additions (lines: 42-45) which validate that duplicates are properly handled.


47-77: Field mapping validation tests are thorough.

Good test coverage for field mapping default behavior, validation, and partial updates. The tests verify that field mappings properly reject invalid configurations and handle partial mappings by preserving defaults.


96-113: Mixin test verifies extensibility of the GetItem framework.

This test effectively demonstrates how to use mixins to extend the GetItem class, specifically how additional required fields are added and reflected in subclasses. This confirms that the inheritance model works as expected.

fiftyone/utils/torch.py (5)

258-290: Well-designed base class with clear documentation.

The GetItem class provides a robust abstraction for extracting model inputs from samples, with a clear API and thorough documentation explaining the role of required fields and field mappings.


291-317: Required fields implementation is clean and maintainable.

The use of property getters/setters ensures that required fields are properly managed and provides appropriate lazily-initialized defaults when needed.


354-374: Good error handling in validate_compatible_samples method.

Effective validation approach that fails fast with meaningful error messages when samples aren't compatible with the GetItem requirements. The TODO comment on line 365 should be addressed in future work.

Is there a planned timeline for implementing the type checking mentioned in the TODO comment on line 365?


640-676: ImageGetItem implementation follows single responsibility principle.

The ImageGetItem class is a clean implementation of the GetItem interface specific to image models. It handles loading images, applying transforms, and manages half-precision conversion when needed.


747-755: Good integration of GetItem with TorchImageModel.

The _build_get_item method provides a clean way to create an ImageGetItem instance configured with the model's transforms and settings. This enables the model to leverage the flexible sample-to-input conversion provided by the GetItem framework.

fiftyone/core/models.py (6)

52-110: Clear documentation for the field_mapping parameter.

The field_mapping parameter is well-documented in apply_model, explaining its purpose and relationship to the GetItem class.


247-257: Proper propagation of field_mapping to data loader creation.

The field_mapping parameter is correctly passed from _apply_image_model_data_loader to _make_data_loader, ensuring that field mappings are properly used when loading data for model application.


699-757: Well-structured ErrorHandlingCollate class.

The new ErrorHandlingCollate class encapsulates batch collation error handling logic that was previously inline, making it more reusable and testable. The class handles different collation strategies based on whether batches are ragged or numpy-based.


759-810: Improved data loader creation with GetItem integration.

The _make_data_loader function now intelligently selects between using the model's _build_get_item method or falling back to TorchImageDataset based on model capabilities. The addition of worker initialization and explicit setting of persistent_workers=False helps avoid resource leaks.


991-1000: Proper propagation of field_mapping to compute_embeddings.

The field_mapping parameter is correctly passed from _compute_image_embeddings_data_loader to _make_data_loader, ensuring consistent behavior between model prediction and embedding computation.


813-822:


❓ Verification inconclusive

Warning about skip_failures implementation in _make_fo_torch_dataset.

The code currently has a comment stating skip_failures needs to be implemented, but it is passing the parameter directly to to_torch(). Ensure that this parameter is actually being handled by the dataset creation method.


🏁 Script executed:

#!/bin/bash
# Check if samples.to_torch() accepts and handles skip_failures parameter

# Search for the method definition
rg --type python "def to_torch" -A 10

Length of output: 70


Attention: Manual Verification Required for skip_failures Parameter Handling

The code still passes the skip_failures parameter directly to samples.to_torch(), but our automated search for the to_torch() definition was inconclusive due to file-type recognition issues. Please confirm that the to_torch() method actually implements handling for skip_failures.

  • Location: fiftyone/core/models.py (lines 813–822)
  • Action Required:
    • Manually inspect the implementation of to_torch() to ensure that the skip_failures flag is processed appropriately.
    • If necessary, update the dataset creation logic to correctly handle failure cases as hinted by the inline comment.

To aid further verification, you could run this shell script from your repository root:

#!/bin/bash
# Re-run the search for the `to_torch` method definition in Python files
fd -e py | xargs rg "def to_torch\(" -A 15

Please verify the output manually to ensure the skip_failures parameter is managed as expected.


@manushreegangwar manushreegangwar left a comment


Mostly looks good. Few comments. I will test the changes on my end this week.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
fiftyone/utils/torch.py (3)

341-342: Simplify dictionary key check.

The .keys() call is unnecessary when checking for key existence in a dictionary.

-        for k in value.keys():
+        for k in value:


660-662: Consider using field_mapping for filepath lookup.

The filepath is accessed directly with a hardcoded key rather than using the field mapping mechanism, which is one of the key features of the GetItem class.

-        img = _load_image(
-            # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
-            sample_dict["filepath"],
+        # Use the field mapping to get the correct field name
+        filepath_field = self.field_mapping.get("filepath", "filepath") 
+        img = _load_image(
+            sample_dict[filepath_field],

749-749: Consider using property accessor for transforms.

As noted in a previous review, there's inconsistency between using self._transforms and self.transforms in the codebase. This can be addressed in a future refactor.

-            transform=self._transforms,
+            transform=self.transforms,
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d038d68 and 0d037e7.

📒 Files selected for processing (2)
  • fiftyone/core/models.py (10 hunks)
  • fiftyone/utils/torch.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • fiftyone/core/models.py
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py

341-341: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: build / build
  • GitHub Check: lint / eslint
  • GitHub Check: test / test-app
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (5)
fiftyone/utils/torch.py (5)

258-374: Good addition of the GetItem abstract class for standardized sample conversion.

This well-designed abstraction provides a clean interface for converting sample dictionaries to model inputs, with clear field management and validation capabilities.

🧰 Tools
🪛 Ruff (0.8.2)

341-341: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


369-371: Note: Compatibility check only validates against the first sample.

This validation approach is a reasonable heuristic for quick error detection, but won't catch inconsistencies in later samples. This is a good trade-off between thoroughness and performance.
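The first-sample heuristic being described can be pictured with a small torch-free sketch; `validate_first` and the toy `get_item` below are hypothetical illustrations, not FiftyOne APIs:

```python
def validate_first(samples, get_item):
    """Run get_item on the first sample only, as an early compatibility check.

    Cheap to run before spawning workers, but incompatibilities in later
    samples will still only surface during iteration.
    """
    it = iter(samples)
    try:
        first = next(it)
    except StopIteration:
        return  # empty collection: nothing to validate
    get_item(first)  # raises (e.g. KeyError) if the sample is incompatible


# A toy converter that requires a "filepath" key
get_item = lambda d: d["filepath"]

# Passes: only the first sample is checked, even though the second would fail
validate_first([{"filepath": "a.jpg"}, {"other": 1}], get_item)
```

The second dict above would still error at iteration time, which is exactly the trade-off the comment notes.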


640-676: Good implementation of ImageGetItem for TorchImageModels.

The implementation correctly handles image loading, transformation, and tensor conversion with support for half precision.


747-754: New method to build image GetItem instances.

This internal method provides a clean way to create ImageGetItem instances configured with the model's parameters.


756-766: Good public interface for building GetItem instances.

This method provides a clean public interface for creating GetItem instances with optional parameter overrides.
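The override pattern can be sketched in pure Python; `ToyGetItem` and `ToyModel` are illustrative stand-ins for the interfaces this PR adds, not the actual FiftyOne classes:

```python
class ToyGetItem:
    """Stand-in for a GetItem: converts a sample dict into a model input."""

    def __init__(self, required_keys, field_mapping=None):
        self.required_keys = tuple(required_keys)
        # By default, each required key reads the field of the same name
        self.field_mapping = field_mapping or {k: k for k in self.required_keys}

    def __call__(self, sample_dict):
        return {k: sample_dict[self.field_mapping[k]] for k in self.required_keys}


class ToyModel:
    def build_get_item(self, field_mapping=None):
        # The model fixes which keys it needs; callers may remap field names
        return ToyGetItem(["filepath"], field_mapping=field_mapping)


model = ToyModel()
get_item = model.build_get_item(field_mapping={"filepath": "alt_path"})
inputs = get_item({"alt_path": "/images/001.jpg"})
```

Here the caller remaps the model's `filepath` key onto a differently named sample field, which is the kind of parameter override the public method enables.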

Contributor

@manushreegangwar manushreegangwar left a comment

Tested on teams

19.5s for 10000 samples

@jacobsela
Contributor Author

jacobsela commented Apr 23, 2025

@manushreegangwar This doesn't work out of the box in teams as shown in the demo. Please see the branch on teams that I DMed you. You should see substantially more speedup.

Also, please limit the discussion here to the code changes or performance in OSS. Teams performance can be discussed on that branch.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/unittests/get_item_tests.py (1)

46-48: Use set comparison instead of Counter

Since add_required_fields should handle deduplication internally, it's clearer to verify this using set comparison rather than Counter.

        self.assertEqual(
-            Counter(gi.required_fields), Counter(["foo", "bar", "baz"])
+            set(gi.required_fields), set(["foo", "bar", "baz"])
        )

This addresses the previous reviewer's comment about checking against gi.required_fields since add_required_fields is expected to handle the set part.

🧹 Nitpick comments (2)
tests/unittests/get_item_tests.py (2)

43-46: Remove redundant call to add_required_fields

There's a duplicate call to gi.add_required_fields("baz") on consecutive lines. While the test correctly verifies that duplicate fields aren't added multiple times, one of these calls is redundant for the test itself.

        gi.add_required_fields("baz")
-
-        gi.add_required_fields("baz")
        self.assertEqual(
            Counter(gi.required_fields), Counter(["foo", "bar", "baz"])
        )

85-85: Remove unnecessary f-string

This line uses an f-string without any placeholders, which is unnecessary.

-        samples = [fo.Sample(filepath=f"sample.jpg")]
+        samples = [fo.Sample(filepath="sample.jpg")]
🧰 Tools
🪛 Ruff (0.8.2)

85-85: f-string without any placeholders

Remove extraneous f prefix

(F541)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d037e7 and d3e51d7.

📒 Files selected for processing (1)
  • tests/unittests/get_item_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/unittests/get_item_tests.py

85-85: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (1)
tests/unittests/get_item_tests.py (1)

1-120: Overall test coverage looks comprehensive

This test file provides excellent coverage of the GetItem class functionality, including testing:

  • Basic operation
  • Field management and validation
  • Field mapping capabilities
  • Compatibility validation
  • Extension through mixins

The tests are well-structured and cover both normal usage patterns and edge cases.

🧰 Tools
🪛 Ruff (0.8.2)

85-85: f-string without any placeholders

Remove extraneous f prefix

(F541)

@manushreegangwar
Contributor

Performance on OSS with cifar-10 dataset and clip:
on this branch: 1.5K samples/s
on develop: 981.9 samples/s

@jacobsela jacobsela requested review from a team as code owners May 2, 2025 18:40
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
fiftyone/utils/torch.py (3)

261-262: Fix the method name in documentation.

The documentation refers to samples_dict_to_input method, but the actual implemented method name is sample_dict_to_input (singular, not plural). This inconsistency could confuse users.

-    It's :method:`fiftyone.utils.torch.GetItem.samples_dict_to_input` method
+    Its :method:`fiftyone.utils.torch.GetItem.sample_dict_to_input` method

366-366: Address the TODO comment before release.

This TODO comment about adding type checking for fields should be addressed before the final release, or converted to a tracked issue if it's meant for future work.


658-658: Consider making the required field name configurable.

The filepath field is hardcoded as a required field. For greater flexibility, consider making this configurable, allowing users to specify alternative field names when their data structure differs.

 def __init__(
     self,
     transform=None,
     raw_inputs=False,
     using_half_precision=False,
     use_numpy=False,
+    image_field="filepath",
     **kwargs,
 ):
     super().__init__(**kwargs)
     self.transform = transform
     self.raw_inputs = raw_inputs
     self.using_half_precision = using_half_precision
     self.use_numpy = use_numpy
-    self.add_required_fields("filepath")
+    self.image_field = image_field
+    self.add_required_fields(image_field)

And then update the sample_dict_to_input method:

 def sample_dict_to_input(self, sample_dict):
     img = _load_image(
-        # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
-        sample_dict["filepath"],
+        sample_dict[self.field_mapping.get(self.image_field, self.image_field)],
         use_numpy=self.use_numpy,
         force_rgb=True,
     )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d3e51d7 and e255c3c.

📒 Files selected for processing (1)
  • fiftyone/utils/torch.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py

342-342: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test / test-app
  • GitHub Check: build / build
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: lint / eslint
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (7)
fiftyone/utils/torch.py (7)

278-280: Good implementation of super().init.

The super call is correctly implemented to avoid MRO (Method Resolution Order) issues when this class is used with mixins, as noted in a previous comment.


302-317: Well-implemented field management.

The design for adding required fields is robust, handling both string and list inputs, and ensures uniqueness through the use of a set. Good practice to maintain tuple immutability for the stored fields.


331-352: Verify field mapping validation behavior.

The field mapping setter correctly validates that keys are in required fields, but it might be worth adding a test case to confirm the behavior when field_mapping is None and when mixins have already set some mappings.

🧰 Tools
🪛 Ruff (0.8.2)

342-342: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


369-374: Effective validation heuristic.

The validation approach is pragmatic - checking compatibility with the first sample provides a good balance between early error detection and performance. As noted in previous comments, this doesn't guarantee compatibility for all samples, but is a reasonable trade-off.


673-674: Bug fix for half precision application.

This code correctly applies half precision to the image tensor, addressing a bug mentioned in previous reviews where an undefined variable imgs was referenced.


748-755: Good implementation of GetItem factory method.

The _build_get_item method correctly constructs an ImageGetItem instance with the model's configuration settings. This implementation enables the performance improvements targeted by this PR.


757-766: Well-designed public interface.

The public build_get_item method provides a clean interface for users to customize the GetItem instance, supporting the documented feature that allows TorchImageModels to benefit from more efficient data loading.

@jacobsela jacobsela requested a review from brimoor May 2, 2025 21:30
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
fiftyone/core/models.py (4)

253-253: Simplify the default None parameter in kwargs.get

The default value of None is redundant as it's already the default return value for dict.get() when the key is not found.

-                field_mapping=kwargs.get("field_mapping", None),
+                field_mapping=kwargs.get("field_mapping"),
🧰 Tools
🪛 Ruff (0.8.2)

253-253: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)


810-818: Clean implementation of FiftyOneTorchDataset creation

This new function creates a torch dataset using the model's build_get_item method, which aligns with the PR objective of integrating with the new GetItem abstraction. The implementation is concise and focused.

However, consider adding a brief docstring to explain the purpose and parameters of this function for better maintainability.

 def _make_fo_torch_dataset(
     samples,
     model,
     skip_failures,
     field_mapping=None,
 ):
+    """Creates a FiftyOneTorchDataset using the model's GetItem interface.
+    
+    Args:
+        samples: a SampleCollection
+        model: a model that implements build_get_item()
+        skip_failures: whether to skip failed item processing
+        field_mapping: optional mapping from sample fields to model inputs
+        
+    Returns:
+        a FiftyOneTorchDataset
+    """
     get_item = model.build_get_item(field_mapping=field_mapping)
     dataset = samples.to_torch(get_item=get_item, skip_failures=skip_failures)
     return dataset

993-993: Simplify the default None parameter in kwargs.get

Similar to the earlier instance, the default value of None is redundant here.

-                field_mapping=kwargs.get("field_mapping", None),
+                field_mapping=kwargs.get("field_mapping"),
🧰 Tools
🪛 Ruff (0.8.2)

993-993: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)


775-784: Excellent conditional dataset creation strategy

This adds support for models that implement the build_get_item method to leverage the more efficient FiftyOneTorchDataset. This is the core enhancement that enables the performance improvements mentioned in the PR objectives.

It would be helpful to add a comment explaining what FOTD stands for (presumably FiftyOneTorchDataset) for better code readability.

-    use_fotd = hasattr(model, "build_get_item")
+    # Check if model supports FiftyOneTorchDataset (FOTD) via GetItem interface
+    use_fotd = hasattr(model, "build_get_item") 
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e255c3c and 89c4e7e.

📒 Files selected for processing (1)
  • fiftyone/core/models.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/models.py

253-253: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)


993-993: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)

⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: lint / eslint
  • GitHub Check: test / test-app
  • GitHub Check: build / build
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (4)
fiftyone/core/models.py (4)

695-754: Well-structured error handling for batch collation

The implementation of ErrorHandlingCollate provides a robust approach to handling errors during batch processing with three distinct collation strategies. This is a good abstraction that centralizes error handling logic and supports different batch processing needs.

The use of a static method for handle_errors is good for reuse across the collation strategies, though as mentioned in previous code reviews, consider alternatives to staticmethod for better testability in the future.


756-807: Good refactoring of data loader with enhanced configuration

The refactored _make_data_loader function now provides better flexibility by:

  1. Supporting a conditional dataset creation strategy based on model capabilities
  2. Using the new error handling collation
  3. Setting optimal DataLoader parameters like pin_memory=True
  4. Using appropriate worker initialization based on dataset type

The explicit setting of persistent_workers=False with a comment explaining its importance is particularly good practice.


799-807: Optimal DataLoader configuration

Setting pin_memory=True can significantly improve data transfer performance with GPU models, and the explicit declaration of persistent_workers=False with comments explaining the rationale is excellent practice.

The conditional worker initialization function is also a nice touch, ensuring that any specialized initialization is only applied when needed.


764-770:


✅ Verification successful

Verify behavior when ragged_batches is True

The new code path correctly handles the model's ragged_batches property, but let's ensure we understand the end-to-end implications when this is set to True. When ragged_batches=True, _ragged_batches collation will be used, which simply returns the batch as-is without stacking.


🏁 Script executed:

#!/bin/bash
# Check model implementations for ragged_batches usage patterns

# Find definition of ragged_batches in model classes
echo "Models defining ragged_batches property:"
rg -A 3 -B 3 "\s+def ragged_batches" --glob "*.py"

# Find where ragged_batches might be set
echo -e "\nPlaces where ragged_batches might be set:"
rg "\.ragged_batches\s*=" --glob "*.py"

# Find usage of ragged_batches in conditions
echo -e "\nConditional uses of ragged_batches:"
rg "if.*ragged_batches" --glob "*.py"

Length of output: 4362


🏁 Script executed:

#!/bin/bash
# Locate the definition of ErrorHandlingCollate and inspect how ragged_batches is handled
rg -R "class ErrorHandlingCollate" -A50 --glob "*.py"

Length of output: 460


🏁 Script executed:

#!/bin/bash
# Find and inspect the definition of ErrorHandlingCollate
rg "class ErrorHandlingCollate" -A50 --glob "*.py"

# Search for any helper functions selected based on ragged_batches
rg "def _ragged_batches" -A30 --glob "*.py"

Length of output: 4202


ragged_batches path verified
Inspection of ErrorHandlingCollate confirms that when ragged_batches=True, its constructor binds _ragged_batches as the collate function, which simply runs error handling and returns the batch list unchanged (no stacking or numpy conversion). Downstream code must accept a list of tensors rather than a stacked tensor, but no changes to the current implementation are required.
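The dispatch described above can be pictured with a small torch-free sketch; the names here are illustrative and this is not the actual ErrorHandlingCollate implementation:

```python
def make_collate_fn(ragged_batches):
    """Return a collate function matching the described dispatch."""

    def ragged(batch):
        # Ragged path: the batch list is returned unchanged, so downstream
        # code must accept a list of per-sample items, not a stacked array
        return batch

    def stacked(batch):
        # Stacked path: all items must share a common length/shape
        if len({len(item) for item in batch}) != 1:
            raise ValueError("cannot stack a ragged batch")
        return [list(item) for item in batch]  # stand-in for torch.stack

    return ragged if ragged_batches else stacked


collate = make_collate_fn(ragged_batches=True)
out = collate([[1, 2], [3, 4, 5]])  # returned as-is, no stacking
```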

Contributor

@brimoor brimoor left a comment

@jacobsela I know you've been working on this for awhile and you want to wrap it up. At the same time, as this is a core interface, I want to make sure we're happy with the interface precedent that we're adding here.

I'm happy to chat live if helpful to finalize any design details so we can get this merged and done ASAP.

@jacobsela jacobsela requested a review from brimoor May 5, 2025 23:36
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
fiftyone/core/models.py (3)

253-253: Simplify the parameters by removing the redundant default value.

The None default in kwargs.get("field_mapping", None) is redundant since get() already returns None when the key doesn't exist.

-                field_mapping=kwargs.get("field_mapping", None),
+                field_mapping=kwargs.get("field_mapping"),
🧰 Tools
🪛 Ruff (0.8.2)

253-253: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)


774-777: Consider using a ternary operator for cleaner code.

This conditional assignment could be simplified with a ternary operator.

-    if model.has_collate_fn:
-        user_collate_fn = model.collate_fn
-    else:
-        user_collate_fn = None
+    user_collate_fn = model.collate_fn if model.has_collate_fn else None
🧰 Tools
🪛 Ruff (0.8.2)

774-777: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)


1007-1007: Same simplification applies here.

As with line 253, the None default is redundant.

-                field_mapping=kwargs.get("field_mapping", None),
+                field_mapping=kwargs.get("field_mapping"),
🧰 Tools
🪛 Ruff (0.8.2)

1007-1007: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)

fiftyone/utils/torch.py (1)

590-629: Good implementation of GetItem for image models.

The ImageGetItem class provides a concrete implementation for image models with appropriate options for transformations and precision.

The filepath key is hardcoded on line 610, as noted in the comment. While this preserves existing behavior, consider making this configurable in a future update for better flexibility:

-            # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
-            d["filepath"],
+            d[self.field_mapping.get("filepath", "filepath")],
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ce1bbe and 5bb33ac.

📒 Files selected for processing (2)
  • fiftyone/core/models.py (7 hunks)
  • fiftyone/utils/torch.py (8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (4)
tests/unittests/get_item_tests.py (1)
  • required_keys (14-15)
fiftyone/core/models.py (6)
  • required_keys (2400-2402)
  • SupportsGetItem (2395-2408)
  • TorchModelMixin (2353-2392)
  • LogitsMixin (2143-2170)
  • Model (2015-2140)
  • build_get_item (2404-2408)
tests/unittests/torch_dataset_tests.py (2)
  • required_keys (24-25)
  • required_keys (37-38)
fiftyone/core/view.py (1)
  • make_optimized_select_view (1950-2022)
🪛 Ruff (0.8.2)
fiftyone/core/models.py

253-253: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)


774-777: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)


1007-1007: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)

Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")

(SIM910)

⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-app
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: build / build
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (13)
fiftyone/core/models.py (5)

695-759: Well-structured collation class with clear separation of concerns.

The new ErrorHandlingCollate class provides a clean interface for handling different collation strategies (ragged batches, numpy stacking, or default PyTorch collation) with proper error handling. This supports the throughput improvements mentioned in the PR objectives.


762-789: Data loader function now supports GetItem abstraction.

The updated _make_data_loader function now provides support for the new SupportsGetItem mixin, appropriately configuring data loading based on model capabilities. The field_mapping parameter provides the desired flexibility.

🧰 Tools
🪛 Ruff (0.8.2)

774-777: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)


789-821: Data loader configuration optimizes for performance.

The explicit configuration of pin_memory=True and persistent_workers=False with conditional worker initialization is a good optimization. The comment about avoiding "zombies" when only iterating the dataset once is helpful context.


824-832: Clean implementation of FiftyOneTorchDataset creation.

The _make_fo_torch_dataset function elegantly delegates to the model's build_get_item method and the sample collection's to_torch method, making the code extensible and maintainable.


2395-2409: Well-designed mixin for extending model capabilities.

The SupportsGetItem mixin provides a clean interface for models to integrate with the new FiftyOneTorchDataset. The property required_keys effectively delegates to the GetItem instance, avoiding duplication, and the build_get_item method provides the necessary customization point through the field_mapping parameter.

This is the key enabler for the performance improvements mentioned in the PR objectives, allowing TorchImageModels to benefit from more efficient data loading via FOTD.
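The delegation being praised here can be illustrated schematically; these sketch classes approximate, but are not, the real SupportsGetItem/ImageGetItem:

```python
class SupportsGetItemSketch:
    """Mixin contract: subclasses say how to build their GetItem."""

    def build_get_item(self, field_mapping=None):
        raise NotImplementedError("subclasses must return a GetItem-like object")

    @property
    def required_keys(self):
        # Delegate to the GetItem so the field list is defined in one place
        return self.build_get_item().required_keys


class ImageGetItemSketch:
    required_keys = ("filepath",)

    def __init__(self, field_mapping=None):
        self.field_mapping = field_mapping or {"filepath": "filepath"}

    def __call__(self, sample_dict):
        return sample_dict[self.field_mapping["filepath"]]


class ImageModelSketch(SupportsGetItemSketch):
    def build_get_item(self, field_mapping=None):
        return ImageGetItemSketch(field_mapping=field_mapping)


model = ImageModelSketch()
```

Because `required_keys` is computed from the built GetItem, a model never declares its field list twice.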

fiftyone/utils/torch.py (8)

257-324: Well-designed abstraction for model input preparation.

The GetItem abstract class establishes a clean interface for converting sample dictionaries to model inputs, encapsulating the extraction logic and field mapping functionality.

However, I have a minor observation about line 275:

-        super().__init__(**kwargs)
+        # Initialize any subclass attributes with kwargs

The super() call is unnecessary in a base class without explicit inheritance, but it's likely there to support mixins as mentioned in the PR context.


631-637: Appropriate inheritance for TorchImageModel.

Adding fom.SupportsGetItem to the inheritance chain enables the model to properly integrate with the new GetItem-based data pipeline.


712-720: Good implementation of build_get_item method.

The method correctly configures an ImageGetItem with the model's transform and precision settings, enabling efficient data pipeline integration.

Note: Based on the past review comments, there's a future refactoring opportunity to standardize on either self._transforms or self.transforms property throughout the codebase, but this is appropriately maintained for consistency with the existing implementation.


1644-1766: Enhanced dataset implementation with GetItem integration.

The refactored FiftyOneTorchDataset now leverages the GetItem abstraction for more structured sample processing, with added support for vectorized field caching.

However, there's a potential improvement for line 1750-1751:

-        self.cached_fields = None
-        self.cache_field_names = self.field_mapping.values()
+        self.cache_field_names = list(self.field_mapping.values())
+        self.cached_fields = None

Using list() ensures we're not working with a view that could change if the underlying dictionary is modified. Also, for clarity, it's better to set cache_field_names before cached_fields.

🧰 Tools
🪛 Ruff (0.8.2)

1730-1730: No explicit stacklevel keyword argument found

(B028)


1864-1882: Well-structured batch preparation with error handling.

The method provides good per-field and per-sample error handling, allowing for graceful failure during batch preparation from the database.


1883-1891: Efficient vectorized batch preparation.

This method efficiently retrieves pre-cached fields for batch preparation, providing a performance optimization path.


1893-1895: Streamlined single-item access.

By having __getitem__ call __getitems__ with a single index, the code maintains consistency in batch preparation logic between single-item and multi-item access patterns.
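The access pattern reads roughly like this generic sketch (not the FiftyOneTorchDataset code itself):

```python
class BatchFirstDataset:
    """Batch preparation is the primary path; single access is a batch of one."""

    def __init__(self, items, get_item):
        self._items = items
        self._get_item = get_item

    def __getitems__(self, indices):
        # All preparation logic lives here and is shared by both entry points
        return [self._get_item(self._items[i]) for i in indices]

    def __getitem__(self, index):
        # Single-item access delegates to the batch path
        return self.__getitems__([index])[0]


ds = BatchFirstDataset([1, 2, 3], get_item=lambda x: x * 10)
```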


1896-1910: Unified batch retrieval implementation.

The __getitems__ method correctly dispatches to the appropriate batch preparation method based on whether vectorization is enabled, and handles errors at both the sample dictionary and item processing levels.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
fiftyone/utils/torch.py (5)

275-275: Super call exists without an explicit parent class.

The call to super().__init__(**kwargs) is technically valid since all classes inherit from object, but it's unusual to see it when no explicit parent class is defined. Based on previous comments, I understand this is to avoid MRO bugs with mixins.

Consider adding a brief comment explaining why this super call exists to avoid confusing future maintainers:

def __init__(self, field_mapping=None, **kwargs):
-    super().__init__(**kwargs)
+    # Call to super() ensures proper MRO resolution when used with mixins
+    super().__init__(**kwargs)
    self.field_mapping = field_mapping

305-318: Field mapping validation could be more comprehensive.

The current implementation validates that each provided key is in required_keys, but doesn't verify that all required keys are actually mapped.

Consider adding a check to ensure all required keys are present in the field mapping:

def field_mapping(self, value):
    if not hasattr(self, "_field_mapping_dict"):
        self._field_mapping_dict = {}
    if value is None:  # generally on init
        value = {k: k for k in self.required_keys}
    if not isinstance(value, dict):
        raise ValueError("field_mapping must be a dictionary.")

+    # Ensure all required keys are present
+    missing_keys = set(self.required_keys) - set(value.keys())
+    if missing_keys:
+        raise ValueError(f"Missing required keys in field_mapping: {missing_keys}")

    for k, v in value.items():
        if k not in self.required_keys:
            raise ValueError(
                f"Field '{k}' is not in the required keys: {self.required_keys}"
            )
        self._field_mapping_dict[k] = v

313-313: The value.items() iteration could be optimized.

As noted in the past review comments, only keys are used in this validation loop, not values.

Replace value.items() with value.keys() since the values aren't used in this loop:

-    for k, v in value.items():
+    for k in value.keys():
        if k not in self.required_keys:
            raise ValueError(
                f"Field '{k}' is not in the required keys: {self.required_keys}"
            )
        self._field_mapping_dict[k] = v

609-610: Consider making "filepath" field configurable rather than hardcoding it.

The current implementation hardcodes "filepath" as the key to access the image path, but this limits flexibility.

Consider making the field name configurable:

def __call__(self, d):
    img = _load_image(
-        # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
-        d["filepath"],
+        d[self.field_mapping.get("filepath", "filepath")],
        use_numpy=self.use_numpy,
        force_rgb=True,
    )

714-714: Use self.transforms property instead of directly accessing self._transforms.

As noted in past comments, some models override the transforms property but not self._transforms, which could lead to inconsistent behavior.

Use the property to ensure consistent behavior:

def build_get_item(self, field_mapping=None):
    return ImageGetItem(
-        transform=self._transforms,
+        transform=self.transforms,
        raw_inputs=self.config.raw_inputs,
        using_half_precision=self._using_half_precision,
        use_numpy=False,
        field_mapping=field_mapping,
    )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5bb33ac and f2b9f70.

📒 Files selected for processing (1)
  • fiftyone/utils/torch.py (10 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: modified-files
  • GitHub Check: build
🔇 Additional comments (9)
fiftyone/utils/torch.py (9)

257-325: Well-designed abstract class for standardizing sample-to-input conversion.

The GetItem class provides a clean interface for defining how to convert sample dictionaries to model inputs, with proper field mapping validation. It's a good abstraction that will help standardize this functionality across the codebase.

The approach helps separate model logic from data loading, improving maintainability.
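A compressed sketch of the interface being described (names and validation details are simplified relative to the real class):

```python
class GetItemSketch:
    """Base: declares required keys and validates the field mapping."""

    required_keys = ()

    def __init__(self, field_mapping=None):
        if field_mapping is None:
            # Default: every required key maps to a field of the same name
            field_mapping = {k: k for k in self.required_keys}
        for key in field_mapping:
            if key not in self.required_keys:
                raise ValueError(f"'{key}' is not a required key")
        self.field_mapping = field_mapping

    def __call__(self, sample_dict):
        raise NotImplementedError


class LabelGetItem(GetItemSketch):
    required_keys = ("label",)

    def __call__(self, sample_dict):
        return sample_dict[self.field_mapping["label"]]


gi = LabelGetItem(field_mapping={"label": "ground_truth"})
value = gi({"ground_truth": "cat"})
```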


590-629: The ImageGetItem implementation is appropriately tailored for images.

This concrete implementation of GetItem handles image loading and preprocessing according to the model's requirements. The half-precision handling is properly guarded with type checking to avoid errors on non-tensor objects.


631-637: Good integration with the existing class hierarchy.

Adding fom.SupportsGetItem to the inheritance list allows TorchImageModel to advertise its support for the new GetItem interface, making the implementation discoverable through the type system.


712-720: Well-implemented factory method to build the appropriate GetItem.

The build_get_item method properly configures an ImageGetItem instance with the model's transforms and settings, ensuring consistent behavior between direct model usage and the new dataset-based approach.


1734-1735: Good field selection optimization.

Selecting only the fields needed by the GetItem instance is an important optimization, especially for datasets with many fields. This reduces the amount of data loaded from the database and improves performance.


1804-1805: Critical fix for database connection handling in multiprocessing.

Disconnecting the database before establishing new connections in worker processes is essential to prevent deadlocks, especially with MongoDB, which is not fork-safe.


1869-1887: Well-designed error handling for batch preparation.

The error handling in _prepare_batch_db properly catches and reports errors when loading fields, which is essential for debugging and robustness, especially when running with multiple workers.
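The per-sample behavior described here can be sketched as follows; the function and flag names are assumptions for illustration, not FiftyOne's actual implementation:

```python
def prepare_batch(loaders, skip_failures=True):
    # loaders: one callable per sample in the batch
    batch = []
    for load in loaders:
        try:
            batch.append(load())
        except Exception as e:
            if not skip_failures:
                raise
            batch.append(e)  # surface the error object in place of the sample
    return batch


def good():
    return {"filepath": "a.png"}


def bad():
    raise ValueError("missing field")


out = prepare_batch([good, bad])
print(type(out[0]), type(out[1]))  # <class 'dict'> <class 'ValueError'>
```

Replacing the failed sample with its error object lets the collate step decide later whether to drop it or re-raise, which keeps worker processes alive through bad samples.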


1888-1897: Efficient vectorized batch preparation.

The _prepare_batch_vectorized method provides an efficient alternative to loading data from the database repeatedly. Pre-caching fields can significantly improve performance for datasets that fit in memory.


1898-1914: Good implementation of batch indexing.

The new __getitems__ method properly dispatches to the appropriate batch preparation method and applies the GetItem conversion to each sample. This enables efficient batch processing without sacrificing flexibility.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
fiftyone/core/models.py (1)

806-859: Refactored data loader creation with unified error handling.

The refactored _make_data_loader function now:

  1. Uses the new ErrorHandlingCollate class for consistent error handling
  2. Conditionally creates datasets based on model capabilities
  3. Configures appropriate worker initialization functions
  4. Sets optimized DataLoader parameters like pin_memory

I noticed there's an opportunity to use a ternary operator for the user_collate_fn assignment as suggested by static analysis.

-    if model.has_collate_fn:
-        user_collate_fn = model.collate_fn
-    else:
-        user_collate_fn = None
+    user_collate_fn = model.collate_fn if model.has_collate_fn else None
🧰 Tools
🪛 Ruff (0.8.2)

822-825: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between f2b9f70 and c597f97.

📒 Files selected for processing (1)
  • fiftyone/core/models.py (27 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/models.py

822-825: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test / test-app
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build / build
  • GitHub Check: build
🔇 Additional comments (8)
fiftyone/core/models.py (8)

155-160: Support for field mapping is a nice enhancement.

This new parameter allows models to customize how sample fields are mapped to model inputs, giving more flexibility to model implementers. Good use of the kwargs dictionary to avoid adding new parameters to the function signature.


168-171: Better control flow with explicit video frame processing flag.

The new process_video_frames flag makes the code more readable and maintainable by clearly indicating when an image model is being applied to video frames. This improves the architecture by making the control flow more explicit and easier to follow.

Also applies to: 213-213, 226-226, 972-975, 1010-1010, 1017-1017, 1525-1525, 1584-1584


172-175: Extended data loader support for models implementing SupportsGetItem.

The use_data_loader condition now correctly includes models implementing the new SupportsGetItem interface, but properly excludes video frame processing which requires special handling.

Also applies to: 976-979, 1527-1530


332-337: Consistent field selection across methods.

The refactored field selection pattern ensures consistent behavior across all methods, explicitly filtering fields when samples implement SamplesMixin or including only the needed fields for specialized operations. This reduces redundancy and improves maintainability.

Also applies to: 379-384, 448-453, 511-516, 578-583, 649-654, 1062-1062, 1106-1106, 1174-1174, 1234-1234, 1300-1300, 1370-1370, 1639-1639, 1754-1754


740-804: New ErrorHandlingCollate class improves batch processing robustness.

This well-designed class centralizes batch collation logic with three strategies:

  1. Handling ragged batches (variable-sized tensors)
  2. Using numpy stacking when appropriate
  3. Using default PyTorch collation

The error handling mechanism gracefully handles exceptions based on the skip_failures flag. This is essential for production systems that need to process large datasets with potentially problematic samples.
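A toy model of those three strategies might look like this; all names and behavior here are illustrative stand-ins, not FiftyOne's actual `ErrorHandlingCollate`:

```python
def error_handling_collate(batch, ragged_batches=False, skip_failures=True):
    # errors injected upstream are re-raised or dropped per skip_failures
    errors = [b for b in batch if isinstance(b, Exception)]
    if errors and not skip_failures:
        raise errors[0]
    batch = [b for b in batch if not isinstance(b, Exception)]

    if ragged_batches:
        return batch  # variable-sized inputs: leave as a plain list

    # stand-in for numpy/torch stacking of fixed-size inputs
    return list(zip(*batch))


print(error_handling_collate([[1, 2], [3, 4]]))  # [(1, 3), (2, 4)]
print(error_handling_collate([[1], [2, 3]], ragged_batches=True))  # [[1], [2, 3]]
```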


837-841: Key implementation for GetItem integration with torch datasets.

This critical block enables the performance improvements mentioned in the PR objectives. For models that support GetItem:

  1. The model can build its own GetItem instance with custom field mapping
  2. The dataset is created with to_torch() which likely enables more efficient data loading
  3. Worker initialization is properly configured for multiprocessing

This change is what enables the throughput improvement from 1.1k to 1.4k samples per second.


2442-2456: New SupportsGetItem mixin enables efficient and customizable data loading.

This well-designed mixin class:

  1. Marks models compatible with the new structured data retrieval mechanism
  2. Defines a clear API for building GetItem instances
  3. Provides access to required keys for validation
  4. Makes the integration with FiftyOneTorchDataset seamless

This is a clean, object-oriented approach to extending model capabilities without breaking existing code.


856-859: Improved DataLoader configuration.

Setting pin_memory=True can significantly improve performance when transferring data to CUDA devices by using pinned memory. Setting persistent_workers=False avoids keeping workers alive between data iterations which is appropriate for most inference scenarios. The conditional setting of worker_init_fn ensures proper worker initialization when needed.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
fiftyone/utils/torch.py (1)

257-324: Well-designed abstraction, but consider simplifying field_mapping initialization

The GetItem class provides a good abstraction for converting sample dictionaries to model inputs. It enables consistent field selection and handling across different model types.

However, the field_mapping property's initialization could be simplified:

-    @property
-    def field_mapping(self):
-        """
-        a dictionary mapping the fields in the sample to the input fields for the model
-        if not defined, it's assumed that the fields are the same as the `required_keys`
-        of the `GetItem` instance.
-
-        When this is updated, the behavior of __call__ should be updated accordingly.
-        This is what allows the user to specify the fields in the sample that are use
-        as input to the model.
-        """
-        if not hasattr(self, "_field_mapping_dict"):
-            self._field_mapping_dict = {k: k for k in self.required_keys}
-        return self._field_mapping_dict
+    @property
+    def field_mapping(self):
+        """
+        a dictionary mapping the fields in the sample to the input fields for the model
+        if not defined, it's assumed that the fields are the same as the `required_keys`
+        of the `GetItem` instance.
+
+        When this is updated, the behavior of __call__ should be updated accordingly.
+        This is what allows the user to specify the fields in the sample that are use
+        as input to the model.
+        """
+        return self._field_mapping_dict

And initialize self._field_mapping_dict directly in __init__:

     def __init__(self, field_mapping=None, **kwargs):
         super().__init__(**kwargs)
+        self._field_mapping_dict = {k: k for k in self.required_keys}
         self.field_mapping = field_mapping

This would make the code more straightforward and avoid the need for hasattr checks.

fiftyone/core/models.py (3)

766-777: Consider making this a class method

Since handle_errors doesn't use any instance attributes, consider converting it to a @classmethod rather than a @staticmethod to improve readability and allow for more natural subclassing.

-    @staticmethod
-    def handle_errors(batch):
+    @classmethod
+    def handle_errors(cls, batch):

828-831: Use a ternary operator for clarity

Replace the if-else block with a more concise ternary expression.

-    if model.has_collate_fn:
-        user_collate_fn = model.collate_fn
-    else:
-        user_collate_fn = None
+    user_collate_fn = model.collate_fn if model.has_collate_fn else None
🧰 Tools
🪛 Ruff (0.8.2)

828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)


857-865: DataLoader configuration improvements

The addition of pin_memory=True will improve data transfer to GPU. Setting persistent_workers=False is a reasonable default but consider making this configurable for long-running workloads.

Consider adding a parameter to allow configuring persistent_workers for scenarios with continuous inference on large datasets, where the worker startup overhead could be significant.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b35d9e1 and 2dfd701.

📒 Files selected for processing (2)
  • fiftyone/core/models.py (27 hunks)
  • fiftyone/utils/torch.py (10 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/models.py

828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)

⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: test / test-app
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build / build
  • GitHub Check: build
🔇 Additional comments (18)
fiftyone/utils/torch.py (11)

13-13: LGTM: Improved typing imports

The addition of specific type imports (Any, Optional, List) is good practice and enhances type checking capabilities.


590-629: LGTM: Clean implementation of ImageGetItem

The ImageGetItem class provides a clean, specific implementation for image models with appropriate error handling and half-precision support.

I especially appreciate the proper tensor type checking on line 621 before applying half-precision, which avoids potential errors when img might not be a tensor.


631-633: LGTM: Added support for GetItem interface

Good addition of the SupportsGetItem mixin to TorchImageModel, which aligns with the PR objective to enable models to benefit from more efficient data loading.


712-720: LGTM: Well-implemented GetItem builder method

The build_get_item method properly constructs an ImageGetItem with all the necessary configuration from the model. This enables the data loading optimization mentioned in the PR objectives.


1707-1711: LGTM: Improved dataset initialization with GetItem

The updated parameter signature using a strongly-typed GetItem instance and the new vectorize flag enables the performance improvements mentioned in the PR objectives.


1727-1729: Strong improvement: Field selection optimization

This code now selects only the needed fields from samples using the field mapping from GetItem, which is a significant performance optimization when datasets have many fields.


1742-1751: LGTM: Efficient field caching strategy

The conditional caching of fields based on the vectorize flag provides a good balance between memory usage and performance. When enabled, this should significantly improve throughput as mentioned in the PR objectives.
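The caching idea can be sketched in miniature: pre-load each needed field into an in-memory column, then build batch dicts by index. Class and method names below echo the review but are illustrative only:

```python
class VectorizedCache:
    def __init__(self, samples, field_names):
        # one values()-style pass per field; columns share sample order
        self.cached_fields = {
            f: [s[f] for s in samples] for f in field_names
        }

    def prepare_batch(self, indices):
        # no database round trip: read straight from the cached columns
        return [
            {f: col[i] for f, col in self.cached_fields.items()}
            for i in indices
        ]


samples = [
    {"filepath": "a.png", "label": "cat"},
    {"filepath": "b.png", "label": "dog"},
]
cache = VectorizedCache(samples, ["filepath"])
print(cache.prepare_batch([1]))  # [{'filepath': 'b.png'}]
```

The memory/speed trade-off is visible here: only the cached columns are held in RAM, which is why caching just the fields named by the `GetItem` matters.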


1796-1798: Important fix for worker safety

Adding the database disconnect before initializing workers is crucial for preventing connection issues in multiprocessing environments. This is a good safety enhancement.


1858-1890: LGTM: Well-implemented batch preparation

The code now offers two different batch preparation strategies (_prepare_batch_db and _prepare_batch_vectorized) depending on whether fields are cached. The implementation includes proper error handling for individual fields, which is a good defensive practice.


1891-1893: Simple but effective delegation pattern

The single-item getter now delegates to the batch getter, which is a good code reuse pattern. This ensures consistent behavior between single-item and batch accesses.
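The delegation pattern can be illustrated with a toy dataset (newer PyTorch `DataLoader` versions dispatch to `__getitems__` when a dataset defines it); the class below is a hypothetical stand-in, not `FiftyOneTorchDataset` itself:

```python
class BatchFirstDataset:
    def __init__(self, rows, get_item):
        self.rows = rows          # e.g. sample dicts
        self.get_item = get_item  # per-sample conversion, as with GetItem

    def __getitems__(self, indices):
        # one batch-preparation pass, then per-sample conversion
        batch = [self.rows[i] for i in indices]
        return [self.get_item(d) for d in batch]

    def __getitem__(self, index):
        # single access delegates to the batch path
        return self.__getitems__([index])[0]


ds = BatchFirstDataset([{"x": 1}, {"x": 2}], get_item=lambda d: d["x"] * 10)
print(ds[1], ds.__getitems__([0, 1]))  # 20 [10, 20]
```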


1894-1907: LGTM: Robust batch retrieval implementation

The __getitems__ method properly dispatches to the appropriate batch preparation method based on the vectorize flag and handles errors consistently. It's a clean implementation that should deliver the performance improvements targeted by this PR.

fiftyone/core/models.py (7)

155-172: Enhanced model capability detection with field mapping support

This code introduces structured handling of field mappings based on model capabilities. It separates concerns between field mapping (for SupportsGetItem models) and needed fields (for SamplesMixin models) with clean logic branches.


746-765: Good implementation of the error-handling collate class

The ErrorHandlingCollate class provides a clean abstraction for handling different collation strategies with proper error management. It intelligently switches between ragged batches, numpy stacking, and default collation based on model requirements.


812-842: Robust data loader configuration

The refactored _make_data_loader function now properly handles field mappings and model-specific requirements. This change is central to the PR's performance improvements.

🧰 Tools
🪛 Ruff (0.8.2)

828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)


833-838: Comprehensive error handling collation setup

The collate function is well-configured with all necessary parameters based on model requirements.


843-856: Clean conditional dataset creation

This section efficiently switches between the new GetItem-based approach and the traditional torch dataset based on model capabilities.


972-981: Consistent field mapping implementation in compute_embeddings

This follows the same pattern as in apply_model, ensuring consistent behavior across different entry points.


2448-2462: Well-designed SupportsGetItem mixin interface

The mixin cleanly defines the interface for models that support the new GetItem approach, with a clear contract for implementation.

However, it may be helpful to add a short docstring example showing how to implement the build_get_item method for clarity.
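To make the contract concrete, a toy implementation might look like the following. Everything here is a hypothetical sketch: a real model would return a `fiftyone.utils.torch.GetItem` subclass from `build_get_item()`, and `LowercaseGetItem`/`MyModel` are illustrative names only:

```python
class LowercaseGetItem:
    required_keys = ["filepath"]

    def __init__(self, field_mapping=None):
        self.field_mapping = {k: k for k in self.required_keys}
        if field_mapping:
            self.field_mapping.update(field_mapping)

    def __call__(self, d):
        # stand-in for loading + preprocessing the image at d["filepath"]
        return d[self.field_mapping["filepath"]].lower()


class MyModel:
    """Model advertising GetItem support, per the mixin contract above."""

    def build_get_item(self, field_mapping=None):
        # package the model's own preprocessing into a GetItem instance
        return LowercaseGetItem(field_mapping=field_mapping)

    @property
    def required_keys(self):
        return self.build_get_item().required_keys


model = MyModel()
print(model.required_keys)  # ['filepath']
print(model.build_get_item()({"filepath": "A.PNG"}))  # a.png
```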

@jacobsela jacobsela requested review from brimoor and kaixi-wang May 9, 2025 04:27
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
fiftyone/utils/torch.py (1)

590-629: Good implementation of image model input processing with proper error handling.

The ImageGetItem class correctly handles image loading, transformation, and half-precision conversion with proper type checking. The implementation addresses previously identified issues with non-tensor inputs.

Note that the comment on line 609 indicates that the filepath field is "hardcoded", which might be worth making configurable in the future for more flexibility.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2dfd701 and e7c4f2b.

📒 Files selected for processing (1)
  • fiftyone/utils/torch.py (10 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: test / test-app
  • GitHub Check: build / build
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (8)
fiftyone/utils/torch.py (8)

257-324: Well-designed base abstraction for model inputs.

The GetItem class defines a clean interface for transforming sample dictionaries into model inputs, with proper validation and sensible defaults. The field mapping system provides flexible field name mapping with strong validation.


631-633: Proper integration of the GetItem support in the model hierarchy.

Adding SupportsGetItem to the class inheritance hierarchy enables the model to work with the new data loading mechanism while maintaining backward compatibility.


712-720: Well-implemented GetItem factory method.

The build_get_item method correctly configures an ImageGetItem with the model's transforms, precision settings, and optional field mapping. This provides a standardized way to create model-specific item processors.


1707-1729: Integrated field selection optimization in constructor.

The updated constructor correctly extracts the field mapping from the GetItem instance and performs early field selection to limit the data loaded from the database. This addresses the performance concern mentioned in previous review comments.

🧰 Tools
🪛 Ruff (0.8.2)

1723-1723: No explicit stacklevel keyword argument found

(B028)


1742-1760: Added efficient vectorized data loading support.

The vectorize mode provides a performance optimization by preloading required fields into memory. The implementation properly handles field caching while maintaining compatibility with the existing dataset interface.


1797-1798: Fixed database connection handling for multiprocessing safety.

Adding explicit database disconnection in worker initialization addresses the potential deadlock issues when using fork-based multiprocessing, as highlighted in previous review comments.


1858-1890: Well-implemented batch preparation methods with error handling.

The implementation provides two batch preparation strategies with proper error handling:

  1. _prepare_batch_db: Loads samples from the database on demand
  2. _prepare_batch_vectorized: Uses cached fields for faster access

The error handling approach respects the skip_failures flag and provides informative error messages.


1891-1907: Improved batch access with efficient dispatching.

The implementation unifies single-item and batch access through the __getitems__ method, which properly dispatches to the appropriate batch preparation method based on the vectorize flag. This provides consistent behavior while enabling performance optimizations.

@brimoor brimoor changed the title apply_model: use FOTD for TorchImageModels Use FiftyOneTorchDataset when applying TorchImageModels May 9, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (9)
fiftyone/core/collections.py (1)

9586-9624: Well-designed integration with PyTorch data loading.

The new to_torch() method provides a clean interface to create PyTorch datasets from FiftyOne collections, leveraging the new GetItem abstraction. The implementation is straightforward and the docstring is comprehensive.

This change aligns with the PR objective to improve data loading performance for TorchImageModels by using the more efficient FiftyOneTorchDataset.

Consider adding a usage example in the docstring to demonstrate how to create a GetItem instance and use it with this method, particularly for common model types. This would help users understand how to leverage this optimization path.

fiftyone/core/models.py (4)

746-809: Well-structured batch collation error handling

The ErrorHandlingCollate class effectively encapsulates different collation strategies with robust error handling. It supports ragged batches, numpy stacking, and default PyTorch collation.

Consider using a more concise expression:

-        if self.user_collate_fn is None:
-            self.user_collate_fn = tud.dataloader.default_collate
+        self.user_collate_fn = self.user_collate_fn or tud.dataloader.default_collate

812-865: Enhanced data loader creation supporting GetItem models

The function now intelligently handles models implementing SupportsGetItem by creating a dataset through the model's build_get_item method. This enables more efficient data loading paths.

Consider using a ternary operator as suggested by static analysis:

-    if model.has_collate_fn:
-        user_collate_fn = model.collate_fn
-    else:
-        user_collate_fn = None
+    user_collate_fn = model.collate_fn if model.has_collate_fn else None
🧰 Tools
🪛 Ruff (0.8.2)

828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)


2448-2478: Well-designed SupportsGetItem mixin

This mixin provides a clean interface for models to integrate with the new FiftyOneTorchDataset. The required_keys property and build_get_item method establish a clear contract for implementations.

The required_keys property creates a new GetItem instance on each access, which could be inefficient if called frequently. Consider caching the instance or the keys:

def __init__(self):
    self._get_item_cache = None
    self._required_keys_cache = None
    
@property
def required_keys(self):
    """The required keys that must be provided to methods like
    :func:`apply_model` and :func:`compute_embeddings` during inference.
    """
    if self._required_keys_cache is None:
        self._required_keys_cache = self.build_get_item().required_keys
    return self._required_keys_cache

2458-2464: Potential inefficiency in required_keys implementation

The required_keys property creates a new GetItem instance on every access by calling build_get_item() without parameters, which could be inefficient if accessed frequently.

Consider caching the result or implementing a more efficient approach to avoid creating unnecessary objects:

@property
def required_keys(self):
    """The required keys that must be provided to methods like
    :func:`apply_model` and :func:`compute_embeddings` during inference.
    """
    if not hasattr(self, '_required_keys_cache'):
        self._required_keys_cache = self.build_get_item().required_keys
    return self._required_keys_cache
fiftyone/utils/torch.py (4)

13-13: Clean up unused import statements.

The static analysis shows that Any, Optional, and List are imported but never used in this file.

-from typing import Any, Optional, List
+from typing import Any  # Import specific types only when needed

If these types will be used in future iterations, consider adding a # noqa: F401 comment to suppress the linter warning.

🧰 Tools
🪛 Ruff (0.8.2)

13-13: typing.Any imported but unused

Remove unused import

(F401)


13-13: typing.Optional imported but unused

Remove unused import

(F401)


13-13: typing.List imported but unused

Remove unused import

(F401)


316-326: Consider additional validation for field mapping values.

The setter only validates that keys exist in required_keys, but doesn't verify that the mapped field names actually exist in the sample collection. This could lead to runtime errors when the dataset doesn't contain the expected fields.

You could add a validate_field_names method that gets called when a dataset is bound to the GetItem instance:

def validate_field_names(self, sample_collection):
    """Validates that all fields referenced in field_mapping exist in the collection.
    
    Args:
        sample_collection: A SampleCollection instance
    
    Raises:
        ValueError: If any mapped field doesn't exist in the collection
    """
    for key, field_name in self.field_mapping.items():
        if not sample_collection.has_field(field_name):
            raise ValueError(f"Field mapping references non-existent field '{field_name}' for key '{key}'")

1736-1737: Consider implementing the TODO for further optimization.

Loading all fields via a single values() call could improve performance by reducing the number of database queries.

-        # @todo load all fields via a single `values()` call
-        for field_name in fields_to_load:
-            self.cached_fields[field_name] = self._load_field(
-                samples, field_name, local_process_group=local_process_group
-            )
+        if fields_to_load:
+            field_values = samples.values(*fields_to_load)
+            for i, field_name in enumerate(fields_to_load):
+                self.cached_fields[field_name] = self._load_field_from_values(
+                    field_values[i], local_process_group=local_process_group
+                )

This would require adding a new helper method to convert values to a serialized list.


1850-1852: Refine error handling for individual fields.

Currently, if loading one field fails, the entire sample is replaced with an error object. Consider a more granular approach that allows partial field failures.

-                    d = error
-                    break
+                    # Either set just this field to an error or add it to a list of errors
+                    d[key] = None  # Set failed field to None
+                    if not hasattr(d, "_errors"):
+                        d._errors = []
+                    d._errors.append((key, error))  # Track the error for later handling

This would allow the GetItem to still process samples with partial field failures, with appropriate error handling.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e7c4f2b and 9ad62f4.

📒 Files selected for processing (3)
  • fiftyone/core/collections.py (1 hunks)
  • fiftyone/core/models.py (27 hunks)
  • fiftyone/utils/torch.py (11 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/core/collections.py (1)
fiftyone/utils/torch.py (1)
  • FiftyOneTorchDataset (1664-1888)
🪛 Ruff (0.8.2)
fiftyone/core/models.py

828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)

fiftyone/utils/torch.py

13-13: typing.Any imported but unused

Remove unused import

(F401)


13-13: typing.Optional imported but unused

Remove unused import

(F401)


13-13: typing.List imported but unused

Remove unused import

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: build / build
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (15)
fiftyone/core/models.py (7)

155-172: Effective model capability detection logic

The code intelligently handles field mapping based on model implementation capabilities, distinguishing between models that implement SupportsGetItem, SamplesMixin, or both.


174-176: Good addition of process_video_frames flag

This flag clearly indicates when an image model is being applied to video frames, making the code more readable and maintainable.


178-181: Condition updated to support new model types

The updated condition properly accounts for models implementing SupportsGetItem in addition to the existing TorchModelMixin check.


184-185: Improved warning message clarity

The simplified warning message is more concise while maintaining the necessary information about ignoring the unsupported parameter.


843-856: Clean conditional dataset creation

This section elegantly handles the creation of appropriate datasets based on model capabilities, properly passing necessary parameters and setting worker initialization functions.


972-981: Consistent field_mapping handling in compute_embeddings

The implementation correctly mirrors the field_mapping handling from apply_model, ensuring consistency across different model application functions.


1531-1536: Consistent process_video_frames use in compute_patch_embeddings

The same approach for detecting video frame processing is applied here, maintaining consistency with other functions.

fiftyone/utils/torch.py (8)

257-278: Well-designed abstraction for model input preparation.

The GetItem abstract class provides a clean interface for standardizing how sample data is converted to model inputs. The inheritance pattern with super().__init__(**kwargs) helps avoid method resolution order issues when used with mixins.


593-628: Good implementation of ImageGetItem with clear responsibility.

This implementation provides all the necessary functionality for handling image data, including transformations and half-precision support when applicable.


642-644: Proper handling of half-precision for tensor types.

The code correctly checks if the image is a torch tensor before applying half-precision, avoiding potential AttributeError exceptions when working with PIL Images or numpy arrays.


733-741: Good TorchImageModel integration with GetItem abstraction.

The build_get_item method effectively bridges the TorchImageModel implementation with the new GetItem abstraction, reusing existing model properties and configurations.


1696-1701: Effective field selection optimization.

Selecting only the required fields from the sample collection is an important optimization that reduces the amount of data loaded from the database, particularly for large datasets with many fields.


1763-1780: Good handling of database connections in worker processes.

The explicit disconnection of MongoDB clients in worker initialization addresses potential deadlock issues with fork-based multiprocessing, as recommended in the PyMongo documentation.
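The fork-safety pattern being praised can be sketched as a small factory; here `disconnect` stands in for FiftyOne's internal database teardown (the function names are illustrative, not the actual API):

```python
def make_worker_init_fn(disconnect):
    """Build a DataLoader worker_init_fn that drops the MongoDB client
    inherited from the parent process, so the driver lazily opens a
    fresh, post-fork connection in each worker."""

    def worker_init(worker_id):
        disconnect()  # sockets inherited across fork() are unsafe to reuse
        # ...any other per-worker setup (seeding, logging) would go here...

    return worker_init
```

The factory shape lets the same initializer be configured once and handed to every `DataLoader(..., worker_init_fn=...)` call.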


1869-1886: Efficient batch processing implementation.

Implementing __getitem__ in terms of __getitems__ creates a consistent interface for both single-item and batch retrieval, while the conditional dispatch to either database or vectorized batch preparation provides flexibility.
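The dispatch structure described above can be illustrated with a torch-free sketch (class and method names are illustrative):

```python
class BatchFirstDataset:
    """Sketch: single-item access delegates to the batch method, so both
    access paths share one code path."""

    def __init__(self, records, vectorized=False):
        self.records = records
        self.vectorized = vectorized

    def __getitems__(self, indices):
        # Conditional dispatch: in-memory cache vs. (stubbed) batched db fetch
        if self.vectorized:
            return [self.records[i] for i in indices]
        return self._fetch_from_db(indices)

    def __getitem__(self, index):
        # Implemented in terms of __getitems__ for a consistent interface
        return self.__getitems__([index])[0]

    def _fetch_from_db(self, indices):
        # Stand-in for a single batched database query
        return [self.records[i] for i in indices]
```

PyTorch's default collation will prefer `__getitems__` when a dataset defines it, which is why routing `__getitem__` through it keeps batch retrieval as the primary path.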


2541-2572: Improved utility functions for distributed training.

The enhanced docstrings and implementation improvements in the distributed training utility functions make the code more maintainable and easier to understand.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
fiftyone/core/models.py (1)

812-865: Enhanced data loader creation with SupportsGetItem support

The _make_data_loader function has been updated to check if a model implements SupportsGetItem and use the appropriate dataset type. The implementation includes performance optimizations like setting pin_memory=True and configuring worker initialization when needed.

Suggestion for a minor code simplification:

-    if model.has_collate_fn:
-        user_collate_fn = model.collate_fn
-    else:
-        user_collate_fn = None
+    user_collate_fn = model.collate_fn if model.has_collate_fn else None
🧰 Tools
🪛 Ruff (0.8.2)

828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)

fiftyone/utils/torch.py (2)

8-15: Consider removing unused typing imports.

The static analysis hints indicate that Any, Optional, and List are imported but not used in the code.

-from typing import Any, Optional, List

Since none of the three names are used, the import line can be removed entirely.

🧰 Tools
🪛 Ruff (0.8.2)

15-15: typing.Any imported but unused

Remove unused import

(F401)


15-15: typing.Optional imported but unused

Remove unused import

(F401)


15-15: typing.List imported but unused

Remove unused import

(F401)


315-326: Field mapping setter only validates keys but not completeness.

The current implementation validates that provided keys are supported but doesn't check if all required keys are present in the final mapping after updates. This could lead to silent failures if a required key is missing.

Consider adding validation to ensure all required keys have mappings:

def field_mapping(self, value):
    if value is None:
        return

    for k, v in value.items():
        if k not in self.required_keys:
            raise ValueError(
                f"Unknown key '{k}'. The supported keys are {self.required_keys}"
            )

        self._field_mapping[k] = v
+
+    # Ensure all required keys have mappings
+    missing_keys = set(self.required_keys) - set(self._field_mapping.keys())
+    if missing_keys:
+        raise ValueError(f"Missing field mappings for required keys: {missing_keys}")
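The proposed validation can be exercised as a self-contained sketch (the `FieldMapper` class is a toy stand-in for the setter logic under review, not FiftyOne's actual class):

```python
class FieldMapper:
    """Toy stand-in illustrating key validation plus a completeness check."""

    def __init__(self, required_keys):
        self.required_keys = list(required_keys)
        self._field_mapping = {}

    def set_field_mapping(self, value):
        if value is None:
            return

        for k, v in value.items():
            if k not in self.required_keys:
                raise ValueError(
                    f"Unknown key '{k}'. The supported keys are {self.required_keys}"
                )

            self._field_mapping[k] = v

        # Proposed extra check: fail loudly if any required key is unmapped
        missing = set(self.required_keys) - set(self._field_mapping)
        if missing:
            raise ValueError(
                f"Missing field mappings for required keys: {missing}"
            )
```

With the completeness check, an incomplete mapping raises immediately rather than failing silently at load time.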
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad62f4 and c4405f7.

📒 Files selected for processing (3)
  • fiftyone/core/models.py (27 hunks)
  • fiftyone/utils/torch.py (11 hunks)
  • tests/unittests/get_item_tests.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unittests/get_item_tests.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (5)
tests/unittests/get_item_tests.py (1)
  • required_keys (14-15)
fiftyone/core/models.py (6)
  • required_keys (2458-2462)
  • SupportsGetItem (2448-2475)
  • TorchModelMixin (2406-2445)
  • LogitsMixin (2196-2223)
  • Model (2068-2193)
  • build_get_item (2464-2475)
tests/unittests/torch_dataset_tests.py (2)
  • required_keys (24-25)
  • required_keys (37-38)
fiftyone/core/odm/database.py (1)
  • _disconnect (249-253)
fiftyone/core/view.py (1)
  • make_optimized_select_view (1950-2022)
🪛 Ruff (0.8.2)
fiftyone/core/models.py

828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)

fiftyone/utils/torch.py

15-15: typing.Any imported but unused

Remove unused import

(F401)


15-15: typing.Optional imported but unused

Remove unused import

(F401)


15-15: typing.List imported but unused

Remove unused import

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: lint / eslint
  • GitHub Check: test / test-app
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: build / build
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: build
🔇 Additional comments (24)
fiftyone/core/models.py (10)

155-177: New check for SupportsGetItem and field mapping handling added

The code introduces the concept of SupportsGetItem models and handles field mapping appropriately. This is a well-structured implementation that clearly separates concerns between different model types (with SupportsGetItem, SamplesMixin, or both).


179-181: Updated condition for using data loader

The condition for using a data loader has been expanded to include SupportsGetItem models while excluding video frame processing. This change aligns with the PR objective of using FiftyOneTorchDataset with appropriate models.


746-809: New ErrorHandlingCollate class improves batch handling

This new class provides a robust way to handle errors during batch collation with different strategies (ragged batches, numpy stacking, or default PyTorch collation). The implementation is clean and follows good OOP principles with appropriate error handling.

The static method design for handle_errors is appropriate here as it's a utility function that doesn't require instance state.


843-856: Clean conditional dataset creation based on model type

The code correctly creates either a FiftyOneTorchDataset using the model's build_get_item method or falls back to the traditional TorchImageDataset based on the model's capabilities. This is a well-implemented strategy pattern.


972-981: Consistent handling of field mapping and video frames in compute_embeddings

The changes ensure that compute_embeddings handles field mapping and video frame processing consistently with the apply_model function. This maintains a unified approach across the main model application functions.


1169-1181: Field mapping parameter added to embedding function

The updated _compute_image_embeddings_data_loader function now accepts the field_mapping parameter and passes it to the data loader, ensuring consistency with the main model application patterns.


1531-1536: Consistent implementation in compute_patch_embeddings

The same pattern for checking video frames and data loader usage is applied to compute_patch_embeddings, maintaining consistency across all model application functions in the codebase.


2448-2476: New SupportsGetItem interface enables improved dataset integration

This new interface is the core addition that enables models to work with FiftyOneTorchDataset by implementing a build_get_item method. The implementation is clean with well-documented methods.

The required_keys property intelligently delegates to the GetItem instance, avoiding code duplication.


2457-2463: Smart implementation of required_keys property

The property provides a convenient way to access the required keys from the GetItem instance without duplicating the logic. This is a good example of proper delegation.


1-2520: Overall assessment of the implementation

The changes effectively implement support for the FiftyOneTorchDataset when applying TorchImageModels through the introduction of the SupportsGetItem interface and proper handling of field mapping. The code is well-structured, maintains consistency across similar functions, and includes performance optimizations.

The implementation aligns perfectly with the PR objectives and the reported performance improvements (from 1.1k to 1.4k samples per second).

I've verified that the PR correctly implements:

  1. A new interface (SupportsGetItem) for models to integrate with FiftyOneTorchDataset
  2. Field mapping support for all relevant functions
  3. Improved batch handling with the new ErrorHandlingCollate class
  4. Consistent implementation across all model application functions

The code quality is high with good error handling, performance considerations, and clean abstractions.

🧰 Tools
🪛 Ruff (0.8.2)

203-203: Do not use bare except

(E722)


530-533: Use ternary operator frames = etaf.FrameRange(*sample.support) if is_clips else None instead of if-else-block

Replace if-else-block with frames = etaf.FrameRange(*sample.support) if is_clips else None

(SIM108)


597-600: Use ternary operator frames = etaf.FrameRange(*sample.support) if is_clips else None instead of if-else-block

Replace if-else-block with frames = etaf.FrameRange(*sample.support) if is_clips else None

(SIM108)


666-669: Use ternary operator frames = etaf.FrameRange(*sample.support) if is_clips else None instead of if-else-block

Replace if-else-block with frames = etaf.FrameRange(*sample.support) if is_clips else None

(SIM108)


828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block

Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None

(SIM108)


1254-1257: Use ternary operator frames = etaf.FrameRange(*sample.support) if is_clips else None instead of if-else-block

Replace if-else-block with frames = etaf.FrameRange(*sample.support) if is_clips else None

(SIM108)


1284-1287: Use ternary operator embeddings = np.stack(embeddings) if embeddings else None instead of if-else-block

Replace if-else-block with embeddings = np.stack(embeddings) if embeddings else None

(SIM108)


1320-1323: Use ternary operator frames = etaf.FrameRange(*sample.support) if is_clips else None instead of if-else-block

Replace if-else-block with frames = etaf.FrameRange(*sample.support) if is_clips else None

(SIM108)


1355-1358: Use ternary operator embeddings = np.stack(embeddings) if embeddings else None instead of if-else-block

Replace if-else-block with embeddings = np.stack(embeddings) if embeddings else None

(SIM108)


1387-1390: Use ternary operator frames = etaf.FrameRange(*sample.support) if is_clips else None instead of if-else-block

Replace if-else-block with frames = etaf.FrameRange(*sample.support) if is_clips else None

(SIM108)


1840-1843: Use ternary operator frames = etaf.FrameRange(*sample.support) if is_clips else None instead of if-else-block

Replace if-else-block with frames = etaf.FrameRange(*sample.support) if is_clips else None

(SIM108)

fiftyone/utils/torch.py (14)

257-306: Well-designed abstract base class for data loading.

The new GetItem class provides a clean interface for standardizing sample-to-input conversion. The abstract method pattern ensures that subclasses implement the required functionality while providing common field mapping logic.

A minor observation: Consider adding a method to validate that all required keys have corresponding fields in the mapping after updates, not just that the provided keys are valid.


592-627: Good implementation of GetItem for image models.

The ImageGetItem class provides a concrete implementation for handling image data with appropriate options for transforms, precision, and output format.

Note that this class provides a hard-coded value for force_rgb=True in the _load_image call. If there's a need to support grayscale or RGBA images in the future, consider making this configurable through the constructor.


628-649: Safe handling of half precision conversion.

The implementation correctly checks if the image is a tensor before applying half precision, avoiding potential errors with non-tensor inputs.


732-740: TorchImageModel now supports the GetItem interface.

The build_get_item method provides a clean way to create an ImageGetItem instance configured with the model's transforms and precision settings.

Note that per the comment in line 734, this is using self._transforms which might differ from self.transforms in subclasses as noted in a prior review comment. This is acknowledged as a potential separate refactor.


1664-1696: Enhanced FiftyOneTorchDataset with vectorization support.

The refactored dataset class now supports both on-demand and vectorized data loading strategies, allowing efficient trade-offs between memory usage and performance.

The vectorization approach provides significant performance benefits by pre-loading required fields, which aligns well with the PR's objective to improve throughput from 1.1k to 1.4k samples per second.
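The memory-for-throughput trade-off can be illustrated with a torch-free sketch (names and structure are illustrative, not the actual FiftyOneTorchDataset internals):

```python
class VectorizedFieldCache:
    """Sketch: preload only the fields a model needs into memory, so
    per-sample access avoids any database round trip."""

    def __init__(self, samples, required_fields):
        # One pass over the collection caches each needed field as a column;
        # fields the model does not use are never materialized
        self._columns = {
            f: [s[f] for s in samples] for f in required_fields
        }

    def __len__(self):
        return len(next(iter(self._columns.values()), []))

    def __getitem__(self, idx):
        # Serving a sample is now pure in-memory dict assembly
        return {f: col[idx] for f, col in self._columns.items()}
```

The on-demand strategy keeps memory flat at the cost of a query per batch; this vectorized strategy pays one upfront pass in exchange for constant-time item access.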


1697-1721: Optimized field selection and caching.

The implementation selects only the specific fields needed for the model, reducing database load. The vectorize option enables preloading fields for better performance.


1726-1753: Efficient field caching with distributed training support.

The implementation handles field caching and distributed training scenarios well, with separate code paths for local and distributed execution.

A note for future optimization: There's a TODO comment at line 1735 about loading all fields via a single values() call, which could further improve performance.


1754-1779: Well-implemented worker initialization with database connection handling.

The worker initialization properly disconnects and reconnects to the database to avoid multiprocessing issues, addressing a common pitfall in distributed environments.

This implementation aligns with MongoDB's recommendation for handling connections in multiprocessing environments and should prevent deadlocks that could occur when forking processes with active database connections.


1835-1867: Efficient batch preparation strategies.

The implementation provides two batch preparation strategies: database-driven (_prepare_batch_db) and vectorized (_prepare_batch_vectorized), allowing for flexibility in performance optimization based on dataset size and memory constraints.


1868-1885: Extended dataset interface with batch retrieval support.

The addition of __getitems__ enables efficient batch retrieval, with single-item access delegating to it. This provides a consistent interface while supporting batch operations.

This approach is more efficient than repeated calls to __getitem__ for batch retrieval, particularly when using the database-driven strategy.


2470-2484: Changed logging level for serialization steps.

Logging for serialization steps has been changed from info to debug level, which is more appropriate for these detailed operations that don't require regular user attention.


2540-2571: Improved docstrings for distributed training utilities.

The docstrings for get_local_size and related functions have been enhanced with clear descriptions of their purpose and return values.


2573-2604: Well-implemented local rank and process group utilities.

The functions for managing distributed training process groups and ranks have clear implementations with appropriate fallbacks for non-distributed scenarios.


2606-2629: Enhanced local scatter implementation with improved docstring.

The local_scatter function now has a clearer docstring explaining its purpose and parameters.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
fiftyone/utils/torch.py (1)

15-15: Remove unused typing imports.

The imported types Any, Optional, and List from the typing module aren't used in this file.

-from typing import Any, Optional, List
🧰 Tools
🪛 Ruff (0.8.2)

15-15: typing.Any imported but unused

Remove unused import

(F401)


15-15: typing.Optional imported but unused

Remove unused import

(F401)


15-15: typing.List imported but unused

Remove unused import

(F401)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4405f7 and fa00d88.

📒 Files selected for processing (1)
  • fiftyone/utils/torch.py (11 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (5)
tests/unittests/torch_dataset_tests.py (2)
  • required_keys (24-25)
  • required_keys (37-38)
tests/unittests/get_item_tests.py (1)
  • required_keys (14-15)
fiftyone/core/models.py (4)
  • required_keys (2458-2462)
  • SupportsGetItem (2448-2475)
  • Model (2068-2193)
  • build_get_item (2464-2475)
fiftyone/core/odm/database.py (1)
  • _disconnect (249-253)
fiftyone/core/view.py (1)
  • make_optimized_select_view (1950-2022)
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py

15-15: typing.Any imported but unused

Remove unused import

(F401)


15-15: typing.Optional imported but unused

Remove unused import

(F401)


15-15: typing.List imported but unused

Remove unused import

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-windows / test-python (windows-latest, 3.12)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
  • GitHub Check: test-windows / test-python (windows-latest, 3.11)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
  • GitHub Check: test-windows / test-python (windows-latest, 3.10)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
  • GitHub Check: test-windows / test-python (windows-latest, 3.9)
  • GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
  • GitHub Check: lint / eslint
  • GitHub Check: test / test-app
  • GitHub Check: build / build
  • GitHub Check: e2e / test-e2e
  • GitHub Check: build
🔇 Additional comments (7)
fiftyone/utils/torch.py (7)

257-308: LGTM: Well-designed base class for model input loading.

The GetItem abstract base class provides a clear interface for converting sample data dictionaries to model inputs, with good handling of field mapping between model keys and dataset fields.


592-649: LGTM: Clean implementation of image input loading.

The ImageGetItem class properly implements the GetItem interface for images, with good handling of different input formats, transformations, and tensor conversion. The half-precision check is correctly applied only to torch tensors.


651-739: LGTM: Proper implementation of SupportsGetItem.

TorchImageModel now correctly implements the SupportsGetItem interface, providing a build_get_item method that creates a properly configured ImageGetItem instance. This integration enables the model to work efficiently with FiftyOneTorchDataset.


1664-1775: LGTM: Well-designed dataset refactoring for performance.

The FiftyOneTorchDataset refactoring properly uses the new GetItem abstraction and adds vectorized field loading options. The implementation includes proper field selection optimization, batch preparation methods, and error handling.


1775-1776: Good addition: Database connection handling for multiprocessing.

The explicit database disconnection in worker initialization prevents potential deadlocks in forked processes, as recommended in the PyMongo documentation. This is crucial for reliable operation with multiple workers.


1839-1892: LGTM: Effective batch preparation implementation.

The implementation provides two strategies for batch preparation: database-driven and vectorized (memory-cached). This gives users the flexibility to choose between memory efficiency and performance based on their dataset size and available resources.


2566-2612: Documentation improvement for utility functions.

The improved docstrings for distributed training utility functions make the code more maintainable and easier to understand for contributors.

Contributor

@brimoor brimoor left a comment

@jacobsela LGTM! I cleaned up the documentation for us 😄

Also, please squash merge to make it easier to cherry-pick this into release/v1.6.0

@jacobsela jacobsela enabled auto-merge (squash) May 9, 2025 22:36
@jacobsela jacobsela merged commit 36ae261 into develop May 9, 2025
19 checks passed
@jacobsela jacobsela deleted the torch-image-model-get-item branch May 9, 2025 22:53