Use FiftyOneTorchDataset when applying TorchImageModels
#5711
Conversation
Walkthrough: This pull request introduces a new optional `field_mapping` parameter for mapping sample fields to model inputs when applying Torch image models.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant S as Sample
    participant AM as apply_model()
    participant DL as _make_data_loader()
    participant DS as Dataset Creator
    participant EC as ErrorHandlingCollate
    participant M as Model
    S->>AM: Provide sample (+ field_mapping)
    AM->>DL: Invoke _make_data_loader(field_mapping)
    DL->>DS: Create dataset (check for SupportsGetItem)
    DS->>M: Use model.build_get_item(field_mapping) to create GetItem
    DS->>EC: Pass data for batch collation with error handling
    EC->>M: Process batch for inference
```

```mermaid
sequenceDiagram
    participant T as TorchImageModel
    participant IG as ImageGetItem
    participant S as Sample Dict
    participant MI as Model Input
    T->>IG: build_get_item(transform, raw_inputs, half_precision, field_mapping)
    S->>IG: Provide sample dictionary
    IG->>IG: Validate required keys and apply transforms
    IG->>MI: Return processed input for the model
```
Actionable comments posted: 2
🧹 Nitpick comments (4)
fiftyone/utils/torch.py (3)
258-275: Docstring provides good context, but consider examples.

The docstring thoroughly describes the intended usage. However, consider adding a simple usage example for clarity, especially showing how to implement `sample_dict_to_input`.
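Such an example might look like the following sketch. The `LabelGetItem` subclass and the minimal stand-in base class are hypothetical illustrations of the interface this review describes, not the actual fiftyone implementation:

```python
# Minimal stand-in for the GetItem base class described in this PR, so the
# sketch is self-contained; the real class lives in fiftyone.utils.torch.
class GetItem:
    required_fields = []

    def __init__(self, field_mapping=None):
        # default to the identity mapping (k -> k) for unmapped fields
        self.field_mapping = {f: f for f in self.required_fields}
        if field_mapping:
            self.field_mapping.update(field_mapping)

    def __call__(self, sample_dict):
        return self.sample_dict_to_input(sample_dict)

    def sample_dict_to_input(self, sample_dict):
        raise NotImplementedError


class LabelGetItem(GetItem):
    """Hypothetical subclass that reads a text label from each sample."""

    required_fields = ["label"]

    def sample_dict_to_input(self, sample_dict):
        # resolve the logical field name to the actual sample field
        field = self.field_mapping["label"]
        return sample_dict[field].upper()


gi = LabelGetItem(field_mapping={"label": "ground_truth"})
print(gi({"ground_truth": "cat"}))  # -> CAT
```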
361-374: Sample validation could be expensive for large collections.

The method calls `samples.first()` and tries a forward pass on one sample. This is usually enough to detect major issues, but be mindful of performance in tight loops. Otherwise, the logic is sound.
640-656: Subclass docstring is clear, but consider extended usage details.

`ImageGetItem` clarifies usage for image-based inputs. A simple demonstration of transforming and returning custom data fields might help future maintainers.

fiftyone/core/models.py (1)

836-837: Extended `field_mapping` usage in embedding pipelines.

The optional parameter is now threaded into data loader creation for embedding operations, maintaining consistency across the codebase. Documenting edge cases (e.g., conflicting fields) would be helpful.

Also applies to: 1003-1004, 1122-1122
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- fiftyone/core/models.py (10 hunks)
- fiftyone/utils/torch.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py
341-341: Loop control variable v not used within loop body
(B007)
673-673: Local variable imgs is assigned to but never used
Remove assignment to unused variable imgs
(F841)
673-673: Undefined name imgs
(F821)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-app
- GitHub Check: build / build
- GitHub Check: lint / eslint
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (13)
fiftyone/utils/torch.py (9)
26-26: No immediate issues with the new import.

The added import for `torchvision.transforms.functional` looks standard and consistent with existing code.
38-38: Import usage appears fine.

Importing `fiftyone.core.validation` under a shorter alias (`foval`) is clean and consistent with the rest of the style.
277-280: Initialization is concise.

The constructor properly sets the `field_mapping` via the property setter. No immediate issues found.
281-290: Abstract method placeholders.

The `sample_dict_to_input` method raises `NotImplementedError`, which is appropriate for an abstract base. Ensure you have unit tests in subclasses to confirm actual implementations override this method.

Do you plan to write a dedicated test for each subclass to verify correct input handling?
291-293: Wrapper call is straightforward.

Calling `self.sample_dict_to_input(sample)` in `__call__` is a clear and idiomatic approach.
294-316: Required fields logic is robust.
The property and its setter handle duplicates carefully and provide helpful validation. This should prevent accidental field mismatches. Looks good.
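The add-with-dedup behavior praised here can be sketched standalone (class and method names mirror this review's description; this is an illustration, not the library code):

```python
class RequiredFieldsMixin:
    """Tracks an ordered, duplicate-free tuple of required field names."""

    def __init__(self):
        self._required_fields = ()

    @property
    def required_fields(self):
        return self._required_fields

    def add_required_fields(self, fields):
        # accept a single name or a list; silently skip duplicates
        if isinstance(fields, str):
            fields = [fields]
        current = list(self._required_fields)
        for f in fields:
            if f not in current:
                current.append(f)
        self._required_fields = tuple(current)


rf = RequiredFieldsMixin()
rf.add_required_fields(["foo", "bar"])
rf.add_required_fields("baz")
rf.add_required_fields("baz")  # duplicate: no effect
print(rf.required_fields)  # -> ('foo', 'bar', 'baz')
```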
318-352: Field mapping property setter.

Enforcing that mapping keys belong to `required_fields` is a good safety measure. The approach of defaulting to identity (k -> k) is also intuitive.
353-360: Field mapping updates are consistent.

`update_field_mapping()` ensures newly introduced required fields have default mappings. No red flags here.
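A standalone sketch of the validate-then-default-to-identity behavior described in these two comments (the function name is an assumption, not the library's API):

```python
def resolve_field_mapping(required_fields, mapping=None):
    """Validate a user mapping, then fill gaps with the identity (k -> k)."""
    mapping = dict(mapping or {})
    for k in mapping:
        # reject keys that are not declared required fields
        if k not in required_fields:
            raise ValueError(f"'{k}' is not a required field")
    for f in required_fields:
        # any unmapped required field maps to itself
        mapping.setdefault(f, f)
    return mapping


print(resolve_field_mapping(["filepath", "label"], {"label": "gt"}))
# -> {'label': 'gt', 'filepath': 'filepath'}
```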
748-754: Factory method for DataLoader usage.

`_build_get_item` returns a specialized `ImageGetItem`, enabling consistent data loading for Torch-based models. Implementation looks good.

fiftyone/core/models.py (4)
64-65: Added `field_mapping` parameter documentation.

The new parameter is clearly introduced and described, ensuring that users know how to map sample fields to model inputs. Documentation is concise.

Also applies to: 109-110
257-257: Passing `field_mapping` through the data loader path.

The call sites `_apply_image_model_data_loader` and subsequent usage in `_make_data_loader` handle the new parameter consistently, preserving backward compatibility.

Also applies to: 417-418
698-757: `ErrorHandlingCollate` class logic.

This new collate function gracefully handles batch-level exceptions and returns a single `Exception` or the stacked data. The approach is clear. Just ensure any partial data is properly handled or logged when multiple errors occur.
759-811: `_make_data_loader` method.

- Integrates `field_mapping` cleanly with `_build_get_item`.
- Provides a fallback to `TorchImageDataset` if `_build_get_item` isn't available.
- The `use_fotd` logic is straightforward.

Overall, the enhancements appear robust.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/unittests/get_item_tests.py (2)
37-46: Remove redundant method call on line 42

There appears to be a redundant call to `add_required_fields("baz")` on line 42 that doesn't include any assertion. The same method is called again on line 44 followed by an assertion, making line 42 unnecessary.

```diff
 gi = DummyGetItem()
 self.assertEqual(set(gi.required_fields), set(["foo", "bar"]))
 gi.add_required_fields("baz")
 self.assertEqual(set(gi.required_fields), set(["foo", "bar", "baz"]))
-gi.add_required_fields("baz")
 gi.add_required_fields("baz")
 self.assertEqual(set(gi.required_fields), set(["foo", "bar", "baz"]))
```
82-82: Remove unnecessary f-string prefix

The string doesn't contain any placeholders, so the `f` prefix is unnecessary.

```diff
- samples = [fo.Sample(filepath=f"sample.jpg")]
+ samples = [fo.Sample(filepath="sample.jpg")]
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/unittests/get_item_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/unittests/get_item_tests.py
82-82: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-app
- GitHub Check: lint / eslint
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: build / build
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: e2e / test-e2e
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: build
🔇 Additional comments (1)
tests/unittests/get_item_tests.py (1)
1-117: Well-structured test suite for the new GetItem class

This test suite thoroughly covers the functionality of the new `GetItem` class, including initialization, field mapping, validation, and extensibility with mixins. The tests are comprehensive and will help ensure the reliability of this new component.
Actionable comments posted: 0
🧹 Nitpick comments (3)
fiftyone/utils/torch.py (3)
258-275: Fix minor grammatical inconsistencies in the docstring.

- Replace "It's :method:`fiftyone.utils.torch.GetItem.samples_dict_to_input`" with "Its :meth:`sample_dict_to_input`".
- Update the method name reference from "samples_dict_to_input" to "sample_dict_to_input" to match the actual function signature.

```diff
- It's :method:`fiftyone.utils.torch.GetItem.samples_dict_to_input` method
+ Its :meth:`sample_dict_to_input` method
```
277-290: Clarify the documentation wording for `sample_dict_to_input`.

In the docstring at line 282, consider rephrasing "Return model input from a list if samples' dicts" for clarity. For example:

```diff
- """Return model input from a list if samples' dicts
+ """Return model input from a sample's dictionary or a sample view
```
341-341: Remove unused loop variable `v`.

Since `v` is never used inside the loop, consider renaming it to `_` to avoid confusion and to comply with best practices:

```diff
-for k, v in value.items():
+for k, _ in value.items():
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- fiftyone/utils/torch.py (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (1)
fiftyone/core/models.py (1)
`transforms` (2078-2082)
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py
341-341: Loop control variable v not used within loop body
(B007)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: build
🔇 Additional comments (9)
fiftyone/utils/torch.py (9)
26-26: No issues found for the new import.
38-38: Import looks good.
291-317: Logic for handling `required_fields` looks good.
318-340: Field mapping setup is properly structured.
342-357: Field mapping validation flow is correct.
361-374: Sample compatibility check is straightforward.
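The first-sample heuristic can be sketched as follows (standalone; the function signature and field names are assumptions based on this review):

```python
def validate_first_sample(sample_dicts, field_mapping, convert):
    """Fail fast by checking only the first sample.

    Verifies that every mapped field exists, then runs one conversion so
    obvious errors surface before the full run, at the cost of missing
    problems that only appear in later samples.
    """
    first = sample_dicts[0]
    for logical, actual in field_mapping.items():
        if actual not in first:
            raise ValueError(
                f"sample is missing field '{actual}' (mapped from '{logical}')"
            )
    return convert(first)  # may raise if the conversion itself fails


out = validate_first_sample(
    [{"filepath": "img.jpg"}],
    {"filepath": "filepath"},
    lambda d: d["filepath"],
)
print(out)  # -> img.jpg
```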
640-657: Header logic for `ImageGetItem` is well-organized.

658-675: Confirm that `transform` always yields a tensor before calling `.half()`.

If `transform` might return PIL images or arrays, `.half()` would fail. Ensure that your transforms consistently output `torch.Tensor` objects here, or add a conditional check.

747-755: Method `_build_get_item` nicely integrates with `ImageGetItem`.
Force-pushed: compare e5c605f to d038d68
Actionable comments posted: 0
♻️ Duplicate comments (1)
fiftyone/utils/torch.py (1)

673-673: Half-precision conversion should check if img is a tensor.

Line 673 converts `img` to half precision without checking if it's a tensor, which could cause errors with some inputs. This was fixed in a previous commit (mentioned in the past review comments), but I'm flagging it to ensure it stays fixed.
🧹 Nitpick comments (2)
tests/unittests/get_item_tests.py (1)

79-95: Field validation test should use a more realistic filepath.

The test uses `f"sample.jpg"`, but the f-string prefix is unnecessary since there are no placeholders.

```diff
- samples = [fo.Sample(filepath=f"sample.jpg")]
+ samples = [fo.Sample(filepath="sample.jpg")]
```
fiftyone/utils/torch.py (1)

318-353: Unused variable in field mapping setter loop.

On line 341, the loop variable `v` is not used in the body of the loop. This doesn't affect functionality but could be cleaned up for better readability.

```diff
- for k, v in value.items():
-     if k not in self.required_fields:
+ for k in value:
+     if k not in self.required_fields:
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- fiftyone/core/models.py (10 hunks)
- fiftyone/utils/torch.py (5 hunks)
- tests/unittests/get_item_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py
341-341: Loop control variable v not used within loop body
(B007)
tests/unittests/get_item_tests.py
82-82: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: e2e / test-e2e
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: build / build
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: lint / eslint
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: build
🔇 Additional comments (15)
tests/unittests/get_item_tests.py (4)
8-17: This class extends GetItem to test field mapping and required fields management.

The `DummyGetItem` implementation demonstrates how to properly subclass `GetItem` by specifying required fields and implementing the abstract `sample_dict_to_input` method. This is a good test utility class.
20-46: Test coverage is comprehensive for required_fields functionality.

Good test cases covering initialization, querying, and extending required fields, including repeated additions (lines 42-45), which validate that duplicates are properly handled.
47-77: Field mapping validation tests are thorough.

Good test coverage for field mapping default behavior, validation, and partial updates. The tests verify that field mappings properly reject invalid configurations and handle partial mappings by preserving defaults.
96-113: Mixin test verifies extensibility of the GetItem framework.

This test effectively demonstrates how to use mixins to extend the GetItem class, specifically how additional required fields are added and reflected in subclasses. This confirms that the inheritance model works as expected.
fiftyone/utils/torch.py (5)
258-290: Well-designed base class with clear documentation.

The `GetItem` class provides a robust abstraction for extracting model inputs from samples, with a clear API and thorough documentation explaining the role of required fields and field mappings.
291-317: Required fields implementation is clean and maintainable.

The use of property getters/setters ensures that required fields are properly managed and provides appropriate lazily-initialized defaults when needed.
354-374: Good error handling in validate_compatible_samples method.

Effective validation approach that fails fast with meaningful error messages when samples aren't compatible with the GetItem requirements. The TODO comment on line 365 should be addressed in future work.

Is there a planned timeline for implementing the type checking mentioned in the TODO comment on line 365?
640-676: ImageGetItem implementation follows single responsibility principle.

The `ImageGetItem` class is a clean implementation of the `GetItem` interface specific to image models. It handles loading images, applying transforms, and manages half-precision conversion when needed.
747-755: Good integration of GetItem with TorchImageModel.

The `_build_get_item` method provides a clean way to create an `ImageGetItem` instance configured with the model's transforms and settings. This enables the model to leverage the flexible sample-to-input conversion provided by the `GetItem` framework.

fiftyone/core/models.py (6)
52-110: Clear documentation for the field_mapping parameter.

The `field_mapping` parameter is well-documented in `apply_model`, explaining its purpose and relationship to the `GetItem` class.
247-257: Proper propagation of field_mapping to data loader creation.

The `field_mapping` parameter is correctly passed from `_apply_image_model_data_loader` to `_make_data_loader`, ensuring that field mappings are properly used when loading data for model application.
699-757: Well-structured ErrorHandlingCollate class.

The new `ErrorHandlingCollate` class encapsulates batch collation error handling logic that was previously inline, making it more reusable and testable. The class handles different collation strategies based on whether batches are ragged or numpy-based.
759-810: Improved data loader creation with GetItem integration.

The `_make_data_loader` function now intelligently selects between using the model's `_build_get_item` method or falling back to `TorchImageDataset` based on model capabilities. The addition of worker initialization and explicit setting of `persistent_workers=False` helps avoid resource leaks.
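The capability-based dispatch described here can be sketched as follows (the return labels and stub model are hypothetical illustrations, not the actual fiftyone code):

```python
def choose_loader_path(model, field_mapping=None):
    """Pick the dataset path based on what the model supports.

    Models exposing build_get_item() take the new FiftyOneTorchDataset
    route; anything else falls back to the legacy image dataset.
    """
    if hasattr(model, "build_get_item"):
        get_item = model.build_get_item(field_mapping=field_mapping)
        return "fo_torch_dataset", get_item
    return "legacy_image_dataset", None


class FakeModel:
    # stand-in for a model that supports the new GetItem protocol
    def build_get_item(self, field_mapping=None):
        return field_mapping or {}


print(choose_loader_path(FakeModel(), {"filepath": "fp"}))
# -> ('fo_torch_dataset', {'filepath': 'fp'})
print(choose_loader_path(object())[0])  # -> legacy_image_dataset
```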
991-1000: Proper propagation of field_mapping to compute_embeddings.

The `field_mapping` parameter is correctly passed from `_compute_image_embeddings_data_loader` to `_make_data_loader`, ensuring consistent behavior between model prediction and embedding computation.
813-822: Warning about skip_failures implementation in _make_fo_torch_dataset.

❓ Verification inconclusive

The code currently has a comment stating `skip_failures` needs to be implemented, but it is passing the parameter directly to `to_torch()`. Ensure that this parameter is actually being handled by the dataset creation method.

🏁 Script executed:

```shell
#!/bin/bash
# Check if samples.to_torch() accepts and handles skip_failures parameter
# Search for the method definition
rg --type python "def to_torch" -A 10
```

Length of output: 70

Attention: Manual verification required for `skip_failures` parameter handling

The code still passes the `skip_failures` parameter directly to `samples.to_torch()`, but our automated search for the `to_torch()` definition was inconclusive due to file-type recognition issues. Please confirm that the `to_torch()` method actually implements handling for `skip_failures`.

- Location: `fiftyone/core/models.py` (lines 813-822)
- Action required:
  - Manually inspect the implementation of `to_torch()` to ensure that the `skip_failures` flag is processed appropriately.
  - If necessary, update the dataset creation logic to correctly handle failure cases as hinted by the inline comment.

To aid further verification, you could run this shell script from your repository root:

```shell
#!/bin/bash
# Re-run the search for the `to_torch` method definition in Python files
fd -e py | xargs rg "def to_torch\(" -A 15
```

Please verify the output manually to ensure the `skip_failures` parameter is managed as expected.
manushreegangwar
left a comment
Mostly looks good. Few comments. I will test the changes on my end this week.
Actionable comments posted: 0
🧹 Nitpick comments (3)
fiftyone/utils/torch.py (3)
341-342: Simplify dictionary key check.

The `.keys()` call is unnecessary when checking for key existence in a dictionary.

```diff
- for k in value.keys():
+ for k in value:
```
660-662: Consider using field_mapping for filepath lookup.

The filepath is accessed directly with a hardcoded key rather than using the field mapping mechanism, which is one of the key features of the GetItem class.

```diff
- img = _load_image(
-     # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
-     sample_dict["filepath"],
+ # Use the field mapping to get the correct field name
+ filepath_field = self.field_mapping.get("filepath", "filepath")
+ img = _load_image(
+     sample_dict[filepath_field],
```
749-749: Consider using property accessor for transforms.

As noted in a previous review, there's inconsistency between using `self._transforms` and `self.transforms` in the codebase. This can be addressed in a future refactor.

```diff
- transform=self._transforms,
+ transform=self.transforms,
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- fiftyone/core/models.py (10 hunks)
- fiftyone/utils/torch.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- fiftyone/core/models.py
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py
341-341: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: build / build
- GitHub Check: lint / eslint
- GitHub Check: test / test-app
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (5)
fiftyone/utils/torch.py (5)
258-374: Good addition of the GetItem abstract class for standardized sample conversion.

This well-designed abstraction provides a clean interface for converting sample dictionaries to model inputs, with clear field management and validation capabilities.
369-371: Note: Compatibility check only validates against the first sample.

This validation approach is a reasonable heuristic for quick error detection, but won't catch inconsistencies in later samples. This is a good trade-off between thoroughness and performance.
640-676: Good implementation of ImageGetItem for TorchImageModels.

The implementation correctly handles image loading, transformation, and tensor conversion with support for half precision.
747-754: New method to build image GetItem instances.

This internal method provides a clean way to create ImageGetItem instances configured with the model's parameters.
756-766: Good public interface for building GetItem instances.

This method provides a clean public interface for creating GetItem instances with optional parameter overrides.
manushreegangwar
left a comment
Tested on teams: 19.5s for 10000 samples

@manushreegangwar This doesn't work out of the box in teams as shown in the demo. Please see the branch on teams that I DMed you. You should see substantially more speedup. Also, please limit the discussion here to the code changes or performance in OSS. Teams performance can be discussed on that branch.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/unittests/get_item_tests.py (1)
46-48: Use set comparison instead of Counter

Since `add_required_fields` should handle deduplication internally, it's clearer to verify this using set comparison rather than Counter.

```diff
 self.assertEqual(
-    Counter(gi.required_fields), Counter(["foo", "bar", "baz"])
+    set(gi.required_fields), set(["foo", "bar", "baz"])
 )
```

This addresses the previous reviewer's comment about checking against `gi.required_fields`, since `add_required_fields` is expected to handle the `set` part.
🧹 Nitpick comments (2)
tests/unittests/get_item_tests.py (2)
43-46: Remove redundant call to `add_required_fields`

There's a duplicate call to `gi.add_required_fields("baz")` on consecutive lines. While the test correctly verifies that duplicate fields aren't added multiple times, one of these calls is redundant for the test itself.

```diff
 gi.add_required_fields("baz")
-
-gi.add_required_fields("baz")
 self.assertEqual(
     Counter(gi.required_fields), Counter(["foo", "bar", "baz"])
 )
```
85-85: Remove unnecessary f-string

This line uses an f-string without any placeholders, which is unnecessary.

```diff
- samples = [fo.Sample(filepath=f"sample.jpg")]
+ samples = [fo.Sample(filepath="sample.jpg")]
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/unittests/get_item_tests.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/unittests/get_item_tests.py
85-85: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (1)
tests/unittests/get_item_tests.py (1)
1-120: Overall test coverage looks comprehensive

This test file provides excellent coverage of the `GetItem` class functionality, including testing:
- Basic operation
- Field management and validation
- Field mapping capabilities
- Compatibility validation
- Extension through mixins
The tests are well-structured and cover both normal usage patterns and edge cases.
Performance on OSS with
Actionable comments posted: 0
🧹 Nitpick comments (3)
fiftyone/utils/torch.py (3)
261-262: Fix the method name in documentation.

The documentation refers to the `samples_dict_to_input` method, but the actual implemented method name is `sample_dict_to_input` (singular, not plural). This inconsistency could confuse users.

```diff
- It's :method:`fiftyone.utils.torch.GetItem.samples_dict_to_input` method
+ Its :method:`fiftyone.utils.torch.GetItem.sample_dict_to_input` method
```
366-366: Address the TODO comment before release.

This TODO comment about adding type checking for fields should be addressed before the final release, or converted to a tracked issue if it's meant for future work.
658-658: Consider making the required field name configurable.

The `filepath` field is hardcoded as a required field. For greater flexibility, consider making this configurable, allowing users to specify alternative field names when their data structure differs.

```diff
 def __init__(
     self,
     transform=None,
     raw_inputs=False,
     using_half_precision=False,
     use_numpy=False,
+    image_field="filepath",
     **kwargs,
 ):
     super().__init__(**kwargs)
     self.transform = transform
     self.raw_inputs = raw_inputs
     self.using_half_precision = using_half_precision
     self.use_numpy = use_numpy
-    self.add_required_fields("filepath")
+    self.image_field = image_field
+    self.add_required_fields(image_field)
```

And then update the `sample_dict_to_input` method:

```diff
 def sample_dict_to_input(self, sample_dict):
     img = _load_image(
-        # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
-        sample_dict["filepath"],
+        sample_dict[self.field_mapping.get(self.image_field, self.image_field)],
         use_numpy=self.use_numpy,
         force_rgb=True,
     )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- fiftyone/utils/torch.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py
342-342: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test / test-app
- GitHub Check: build / build
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: lint / eslint
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (7)
fiftyone/utils/torch.py (7)
278-280: Good implementation of super().init.The super call is correctly implemented to avoid MRO (Method Resolution Order) issues when this class is used with mixins, as noted in a previous comment.
302-317: Well-implemented field management.

The design for adding required fields is robust, handling both string and list inputs, and ensures uniqueness through the use of a set. Good practice to maintain tuple immutability for the stored fields.
331-352: Verify field mapping validation behavior.

The field mapping setter correctly validates that keys are in required fields, but it might be worth adding a test case to confirm the behavior when field_mapping is None and when mixins have already set some mappings.
369-374: Effective validation heuristic.

The validation approach is pragmatic: checking compatibility with the first sample provides a good balance between early error detection and performance. As noted in previous comments, this doesn't guarantee compatibility for all samples, but is a reasonable trade-off.
673-674: Bug fix for half precision application.

This code correctly applies half precision to the image tensor, addressing a bug mentioned in previous reviews where an undefined variable `imgs` was referenced.
748-755: Good implementation of GetItem factory method.

The `_build_get_item` method correctly constructs an `ImageGetItem` instance with the model's configuration settings. This implementation enables the performance improvements targeted by this PR.
757-766: Well-designed public interface.

The public `build_get_item` method provides a clean interface for users to customize the GetItem instance, supporting the documented feature that allows TorchImageModels to benefit from more efficient data loading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
fiftyone/core/models.py (4)
253-253: Simplify the default None parameter in kwargs.get. The default value of `None` is redundant, as it's already the default return value for `dict.get()` when the key is not found.

```diff
- field_mapping=kwargs.get("field_mapping", None),
+ field_mapping=kwargs.get("field_mapping"),
```

🧰 Tools
🪛 Ruff (0.8.2)
253-253: Use `kwargs.get("field_mapping")` instead of `kwargs.get("field_mapping", None)`. Replace `kwargs.get("field_mapping", None)` with `kwargs.get("field_mapping")` (SIM910)
810-818: Clean implementation of FiftyOneTorchDataset creation. This new function creates a torch dataset using the model's `build_get_item` method, which aligns with the PR objective of integrating with the new `GetItem` abstraction. The implementation is concise and focused. However, consider adding a brief docstring to explain the purpose and parameters of this function for better maintainability.

```diff
 def _make_fo_torch_dataset(
     samples,
     model,
     skip_failures,
     field_mapping=None,
 ):
+    """Creates a FiftyOneTorchDataset using the model's GetItem interface.
+
+    Args:
+        samples: a SampleCollection
+        model: a model that implements build_get_item()
+        skip_failures: whether to skip failed item processing
+        field_mapping: optional mapping from sample fields to model inputs
+
+    Returns:
+        a FiftyOneTorchDataset
+    """
     get_item = model.build_get_item(field_mapping=field_mapping)
     dataset = samples.to_torch(get_item=get_item, skip_failures=skip_failures)
     return dataset
```
993-993: Simplify the default None parameter in kwargs.get. Similar to the earlier instance, the default value of `None` is redundant here.

```diff
- field_mapping=kwargs.get("field_mapping", None),
+ field_mapping=kwargs.get("field_mapping"),
```

🧰 Tools
🪛 Ruff (0.8.2)
993-993: Use `kwargs.get("field_mapping")` instead of `kwargs.get("field_mapping", None)`. Replace `kwargs.get("field_mapping", None)` with `kwargs.get("field_mapping")` (SIM910)
775-784: Excellent conditional dataset creation strategy. This adds support for models that implement the `build_get_item` method to leverage the more efficient `FiftyOneTorchDataset`. This is the core enhancement that enables the performance improvements mentioned in the PR objectives. It would be helpful to add a comment explaining what FOTD stands for (presumably FiftyOneTorchDataset) for better code readability.

```diff
- use_fotd = hasattr(model, "build_get_item")
+ # Check if model supports FiftyOneTorchDataset (FOTD) via GetItem interface
+ use_fotd = hasattr(model, "build_get_item")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fiftyone/core/models.py (6 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/models.py
253-253: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)
Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")
(SIM910)
993-993: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)
Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")
(SIM910)
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: lint / eslint
- GitHub Check: test / test-app
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (4)
fiftyone/core/models.py (4)
695-754: Well-structured error handling for batch collation. The implementation of `ErrorHandlingCollate` provides a robust approach to handling errors during batch processing with three distinct collation strategies. This is a good abstraction that centralizes error handling logic and supports different batch processing needs. The use of a static method for `handle_errors` is good for reuse across the collation strategies, though as mentioned in previous code reviews, consider alternatives to `staticmethod` for better testability in the future.
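The pattern described above can be sketched roughly as follows (a hedged, simplified sketch: `np.stack` stands in for PyTorch's default collate, and the real class in `fiftyone/core/models.py` may use different names and signatures):

```python
import numpy as np


class CollateSketch:
    """Sketch of a collate wrapper that filters errors before combining."""

    def __init__(self, ragged_batches=False, skip_failures=True):
        self.skip_failures = skip_failures
        # ragged batches are returned as-is; otherwise items are stacked
        # (np.stack stands in for torch's default_collate here)
        self.collate = self._ragged if ragged_batches else self._stack

    @staticmethod
    def filter_errors(batch, skip_failures):
        errors = [b for b in batch if isinstance(b, Exception)]
        if errors and not skip_failures:
            raise errors[0]
        return [b for b in batch if not isinstance(b, Exception)]

    def _ragged(self, batch):
        # variable-sized items: return the filtered list unchanged
        return self.filter_errors(batch, self.skip_failures)

    def _stack(self, batch):
        # uniform items: stack into a single array along a new leading axis
        return np.stack(self.filter_errors(batch, self.skip_failures))

    def __call__(self, batch):
        return self.collate(batch)
```

Binding the strategy once in `__init__` (rather than branching per batch) keeps the per-batch hot path cheap, which matters at data-loader throughput.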
756-807: Good refactoring of data loader with enhanced configuration. The refactored `_make_data_loader` function now provides better flexibility by:
- Supporting a conditional dataset creation strategy based on model capabilities
- Using the new error handling collation
- Setting optimal DataLoader parameters like `pin_memory=True`
- Using appropriate worker initialization based on dataset type

The explicit setting of `persistent_workers=False` with a comment explaining its importance is particularly good practice.
799-807: Optimal DataLoader configuration. Setting `pin_memory=True` can significantly improve data transfer performance with GPU models, and the explicit declaration of `persistent_workers=False` with comments explaining the rationale is excellent practice. The conditional worker initialization function is also a nice touch, ensuring that any specialized initialization is only applied when needed.
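As a rough illustration, the configuration being praised can be captured by a small hypothetical helper (`make_loader_kwargs` and its parameters are inventions for this sketch; the real code passes these keywords directly to `torch.utils.data.DataLoader`):

```python
def make_loader_kwargs(supports_get_item, num_workers, worker_init_fn=None):
    """Assemble DataLoader keyword arguments per the pattern discussed above."""
    kwargs = {
        "num_workers": num_workers,
        "pin_memory": True,           # pinned host memory speeds GPU transfers
        "persistent_workers": False,  # workers exit after one pass; no zombies
    }
    # only GetItem-backed datasets need the special worker initialization
    if supports_get_item and worker_init_fn is not None:
        kwargs["worker_init_fn"] = worker_init_fn
    return kwargs
```

For one-shot inference over a dataset, `persistent_workers=False` is the safer default because the worker pool is torn down as soon as iteration ends.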
764-770: Details
✅ Verification successful
Verify behavior when ragged_batches is True
The new code path correctly handles the model's `ragged_batches` property, but let's ensure we understand the end-to-end implications when this is set to `True`. When `ragged_batches=True`, `_ragged_batches` collation will be used, which simply returns the batch as-is without stacking.
🏁 Script executed:
```shell
#!/bin/bash
# Check model implementations for ragged_batches usage patterns

# Find definition of ragged_batches in model classes
echo "Models defining ragged_batches property:"
rg -A 3 -B 3 "\s+def ragged_batches" --glob "*.py"

# Find where ragged_batches might be set
echo -e "\nPlaces where ragged_batches might be set:"
rg "\.ragged_batches\s*=" --glob "*.py"

# Find usage of ragged_batches in conditions
echo -e "\nConditional uses of ragged_batches:"
rg "if.*ragged_batches" --glob "*.py"
```

Length of output: 4362
🏁 Script executed:
```shell
#!/bin/bash
# Locate the definition of ErrorHandlingCollate and inspect how ragged_batches is handled
rg -R "class ErrorHandlingCollate" -A50 --glob "*.py"
```

Length of output: 460
🏁 Script executed:
```shell
#!/bin/bash
# Find and inspect the definition of ErrorHandlingCollate
rg "class ErrorHandlingCollate" -A50 --glob "*.py"

# Search for any helper functions selected based on ragged_batches
rg "def _ragged_batches" -A30 --glob "*.py"
```

Length of output: 4202
ragged_batches path verified
Inspection of `ErrorHandlingCollate` confirms that when `ragged_batches=True`, its constructor binds `_ragged_batches` as the collate function, which simply runs error handling and returns the batch list unchanged (no stacking or numpy conversion). Downstream code must accept a list of tensors rather than a stacked tensor, but no changes to the current implementation are required.
brimoor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacobsela I know you've been working on this for a while and you want to wrap it up. At the same time, as this is a core interface, I want to make sure we're happy with the interface precedent that we're adding here.
I'm happy to chat live if helpful to finalize any design details so we can get this merged and done ASAP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
fiftyone/core/models.py (3)
253-253: Simplify the parameters by removing the redundant default value. The `None` default in `kwargs.get("field_mapping", None)` is redundant since `get()` already returns `None` when the key doesn't exist.

```diff
- field_mapping=kwargs.get("field_mapping", None),
+ field_mapping=kwargs.get("field_mapping"),
```

🧰 Tools
🪛 Ruff (0.8.2)
253-253: Use `kwargs.get("field_mapping")` instead of `kwargs.get("field_mapping", None)`. Replace `kwargs.get("field_mapping", None)` with `kwargs.get("field_mapping")` (SIM910)
774-777: Consider using a ternary operator for cleaner code. This conditional assignment could be simplified with a ternary operator.

```diff
- if model.has_collate_fn:
-     user_collate_fn = model.collate_fn
- else:
-     user_collate_fn = None
+ user_collate_fn = model.collate_fn if model.has_collate_fn else None
```

🧰 Tools
🪛 Ruff (0.8.2)
774-777: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of if-else-block. Replace if-else-block with `user_collate_fn = model.collate_fn if model.has_collate_fn else None` (SIM108)
1007-1007: Same simplification applies here. As with line 253, the `None` default is redundant.

```diff
- field_mapping=kwargs.get("field_mapping", None),
+ field_mapping=kwargs.get("field_mapping"),
```

🧰 Tools
🪛 Ruff (0.8.2)
1007-1007: Use `kwargs.get("field_mapping")` instead of `kwargs.get("field_mapping", None)`. Replace `kwargs.get("field_mapping", None)` with `kwargs.get("field_mapping")` (SIM910)
fiftyone/utils/torch.py (1)
590-629: Good implementation of GetItem for image models. The `ImageGetItem` class provides a concrete implementation for image models with appropriate options for transformations and precision. The filepath key is hardcoded on line 610, as noted in the comment. While this preserves existing behavior, consider making this configurable in a future update for better flexibility:

```diff
- # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
- d["filepath"],
+ d[self.field_mapping.get("filepath", "filepath")],
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fiftyone/core/models.py (7 hunks)
fiftyone/utils/torch.py (8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (4)
tests/unittests/get_item_tests.py (1)
- `required_keys` (14-15)

fiftyone/core/models.py (6)

- `required_keys` (2400-2402)
- `SupportsGetItem` (2395-2408)
- `TorchModelMixin` (2353-2392)
- `LogitsMixin` (2143-2170)
- `Model` (2015-2140)
- `build_get_item` (2404-2408)

tests/unittests/torch_dataset_tests.py (2)

- `required_keys` (24-25)
- `required_keys` (37-38)

fiftyone/core/view.py (1)

- `make_optimized_select_view` (1950-2022)
🪛 Ruff (0.8.2)
fiftyone/core/models.py
253-253: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)
Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")
(SIM910)
774-777: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block
Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None
(SIM108)
1007-1007: Use kwargs.get("field_mapping") instead of kwargs.get("field_mapping", None)
Replace kwargs.get("field_mapping", None) with kwargs.get("field_mapping")
(SIM910)
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-app
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (13)
fiftyone/core/models.py (5)
695-759: Well-structured collation class with clear separation of concerns. The new `ErrorHandlingCollate` class provides a clean interface for handling different collation strategies (ragged batches, numpy stacking, or default PyTorch collation) with proper error handling. This supports the throughput improvements mentioned in the PR objectives.
762-789: Data loader function now supports GetItem abstraction. The updated `_make_data_loader` function now provides support for the new `SupportsGetItem` mixin, appropriately configuring data loading based on model capabilities. The `field_mapping` parameter provides the desired flexibility.

🧰 Tools
🪛 Ruff (0.8.2)
774-777: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of if-else-block. Replace if-else-block with `user_collate_fn = model.collate_fn if model.has_collate_fn else None` (SIM108)
789-821: Data loader configuration optimizes for performance. The explicit configuration of `pin_memory=True` and `persistent_workers=False` with conditional worker initialization is a good optimization. The comment about avoiding "zombies" when only iterating the dataset once is helpful context.
824-832: Clean implementation of FiftyOneTorchDataset creation. The `_make_fo_torch_dataset` function elegantly delegates to the model's `build_get_item` method and the sample collection's `to_torch` method, making the code extensible and maintainable.
2395-2409: Well-designed mixin for extending model capabilities. The `SupportsGetItem` mixin provides a clean interface for models to integrate with the new `FiftyOneTorchDataset`. The property `required_keys` effectively delegates to the `GetItem` instance, avoiding duplication, and the `build_get_item` method provides the necessary customization point through the `field_mapping` parameter. This is the key enabler for the performance improvements mentioned in the PR objectives, allowing TorchImageModels to benefit from more efficient data loading via FOTD.

fiftyone/utils/torch.py (8)
257-324: Well-designed abstraction for model input preparation. The `GetItem` abstract class establishes a clean interface for converting sample dictionaries to model inputs, encapsulating the extraction logic and field mapping functionality. However, I have a minor observation about line 275:

```diff
- super().__init__(**kwargs)
+ # Initialize any subclass attributes with kwargs
```

The `super()` call is unnecessary in a base class without explicit inheritance, but it's likely there to support mixins as mentioned in the PR context.
631-637: Appropriate inheritance for TorchImageModel. Adding `fom.SupportsGetItem` to the inheritance chain enables the model to properly integrate with the new GetItem-based data pipeline.
712-720: Good implementation of build_get_item method. The method correctly configures an `ImageGetItem` with the model's transform and precision settings, enabling efficient data pipeline integration. Note: based on the past review comments, there's a future refactoring opportunity to standardize on either `self._transforms` or the `self.transforms` property throughout the codebase, but this is appropriately maintained for consistency with the existing implementation.
1644-1766: Enhanced dataset implementation with GetItem integration. The refactored `FiftyOneTorchDataset` now leverages the `GetItem` abstraction for more structured sample processing, with added support for vectorized field caching. However, there's a potential improvement for lines 1750-1751:

```diff
- self.cached_fields = None
- self.cache_field_names = self.field_mapping.values()
+ self.cache_field_names = list(self.field_mapping.values())
+ self.cached_fields = None
```

Using `list()` ensures we're not working with a view that could change if the underlying dictionary is modified. Also, for clarity, it's better to set `cache_field_names` before `cached_fields`.

🧰 Tools
🪛 Ruff (0.8.2)
1730-1730: No explicit `stacklevel` keyword argument found (B028)
1864-1882: Well-structured batch preparation with error handling.The method provides good per-field and per-sample error handling, allowing for graceful failure during batch preparation from the database.
1883-1891: Efficient vectorized batch preparation.This method efficiently retrieves pre-cached fields for batch preparation, providing a performance optimization path.
1893-1895: Streamlined single-item access. By having `__getitem__` call `__getitems__` with a single index, the code maintains consistency in batch preparation logic between single-item and multi-item access patterns.
1896-1910: Unified batch retrieval implementation. The `__getitems__` method correctly dispatches to the appropriate batch preparation method based on whether vectorization is enabled, and handles errors at both the sample dictionary and item processing levels.
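The delegation pattern described in the last two comments can be sketched as a standalone toy (assuming a plain list of sample dicts rather than a FiftyOne-backed dataset; names are hypothetical):

```python
class BatchDatasetSketch:
    """Toy dataset where single-item access reuses the batch path."""

    def __init__(self, records, get_item):
        self.records = records    # stand-in for database-backed samples
        self.get_item = get_item  # converts a sample dict to a model input

    def __getitems__(self, indices):
        # one batch-preparation code path for both single and multi access
        sample_dicts = [self.records[i] for i in indices]
        return [self.get_item(d) for d in sample_dicts]

    def __getitem__(self, index):
        return self.__getitems__([index])[0]

    def __len__(self):
        return len(self.records)
```

PyTorch's DataLoader will call `__getitems__` with a whole batch of indices when the method exists, so keeping `__getitem__` as a thin wrapper avoids two divergent preparation paths.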
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
fiftyone/utils/torch.py (5)
275-275: Super call exists without an explicit parent class. The call to `super().__init__(**kwargs)` is technically valid since all classes inherit from `object`, but it's unusual to see it when no explicit parent class is defined. Based on previous comments, I understand this is to avoid MRO bugs with mixins. Consider adding a brief comment explaining why this super call exists to avoid confusing future maintainers:

```diff
 def __init__(self, field_mapping=None, **kwargs):
-    super().__init__(**kwargs)
+    # Call to super() ensures proper MRO resolution when used with mixins
+    super().__init__(**kwargs)
     self.field_mapping = field_mapping
```
305-318: Field mapping validation could be more comprehensive. The current implementation validates that each provided key is in `required_keys`, but doesn't verify that all required keys are actually mapped. Consider adding a check to ensure all required keys are present in the field mapping:

```diff
 def field_mapping(self, value):
     if not hasattr(self, "_field_mapping_dict"):
         self._field_mapping_dict = {}
     if value is None:
         # generally on init
         value = {k: k for k in self.required_keys}
     if not isinstance(value, dict):
         raise ValueError("field_mapping must be a dictionary.")
+    # Ensure all required keys are present
+    missing_keys = set(self.required_keys) - set(value.keys())
+    if missing_keys:
+        raise ValueError(f"Missing required keys in field_mapping: {missing_keys}")
     for k, v in value.items():
         if k not in self.required_keys:
             raise ValueError(
                 f"Field '{k}' is not in the required keys: {self.required_keys}"
             )
         self._field_mapping_dict[k] = v
```
313-313: The `value.items()` iteration could be optimized. As noted in the past review comments, only keys are used in this validation loop, not values. Replace `value.items()` with `value.keys()` since the values aren't used in this loop:

```diff
- for k, v in value.items():
+ for k in value.keys():
      if k not in self.required_keys:
          raise ValueError(
              f"Field '{k}' is not in the required keys: {self.required_keys}"
          )
      self._field_mapping_dict[k] = v
```
609-610: Consider making the "filepath" field configurable rather than hardcoding it. The current implementation hardcodes "filepath" as the key to access the image path, but this limits flexibility. Consider making the field name configurable:

```diff
 def __call__(self, d):
     img = _load_image(
-        # hardcoded because that's how it was in fiftyone.core.models._make_data_loader
-        d["filepath"],
+        d[self.field_mapping.get("filepath", "filepath")],
         use_numpy=self.use_numpy,
         force_rgb=True,
     )
```
714-714: Use the `self.transforms` property instead of directly accessing `self._transforms`. As noted in past comments, some models override the `transforms` property but not `self._transforms`, which could lead to inconsistent behavior. Use the property to ensure consistent behavior:

```diff
 def build_get_item(self, field_mapping=None):
     return ImageGetItem(
-        transform=self._transforms,
+        transform=self.transforms,
         raw_inputs=self.config.raw_inputs,
         using_half_precision=self._using_half_precision,
         use_numpy=False,
         field_mapping=field_mapping,
     )
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fiftyone/utils/torch.py (10 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: modified-files
- GitHub Check: build
🔇 Additional comments (9)
fiftyone/utils/torch.py (9)
257-325: Well-designed abstract class for standardizing sample-to-input conversion. The `GetItem` class provides a clean interface for defining how to convert sample dictionaries to model inputs, with proper field mapping validation. It's a good abstraction that will help standardize this functionality across the codebase. The approach helps separate model logic from data loading, improving maintainability.
590-629: The `ImageGetItem` implementation is appropriately tailored for images. This concrete implementation of `GetItem` handles image loading and preprocessing according to the model's requirements. The half-precision handling is properly guarded with type checking to avoid errors on non-tensor objects.
631-633: Good integration with the existing class hierarchy. Adding `fom.SupportsGetItem` to the inheritance list allows `TorchImageModel` to advertise its support for the new `GetItem` interface, making the implementation discoverable through the type system.
712-720: Well-implemented factory method to build the appropriate GetItem. The `build_get_item` method properly configures an `ImageGetItem` instance with the model's transforms and settings, ensuring consistent behavior between direct model usage and the new dataset-based approach.
1734-1735: Good field selection optimization. Selecting only the fields needed by the `GetItem` instance is an important optimization, especially for datasets with many fields. This reduces the amount of data loaded from the database and improves performance.
1804-1805: Critical fix for database connection handling in multiprocessing.Disconnecting the database before establishing new connections in worker processes is essential to prevent deadlocks, especially with MongoDB which is not fork-safe.
1869-1887: Well-designed error handling for batch preparation. The error handling in `_prepare_batch_db` properly catches and reports errors when loading fields, which is essential for debugging and robustness, especially when running with multiple workers.
1888-1897: Efficient vectorized batch preparation. The `_prepare_batch_vectorized` method provides an efficient alternative to loading data from the database repeatedly. Pre-caching fields can significantly improve performance for datasets that fit in memory.
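A minimal sketch of the trade-off (hypothetical names; the real `_prepare_batch_vectorized` reads from FiftyOne's database, not an in-memory list):

```python
class VectorizedCacheSketch:
    """Sketch: bulk-cache field values up front vs. per-item lookups."""

    def __init__(self, samples, field_names, vectorize=False):
        self.samples = samples               # stand-in for a sample collection
        self.field_names = list(field_names)
        self.cached_fields = None
        if vectorize:
            # one pass per field up front, instead of per-sample reads later
            self.cached_fields = {
                f: [s[f] for s in samples] for f in self.field_names
            }

    def prepare_batch(self, indices):
        if self.cached_fields is not None:
            # vectorized path: pure in-memory indexing, no backend round-trips
            return [
                {f: self.cached_fields[f][i] for f in self.field_names}
                for i in indices
            ]
        # fallback path: read each field of each sample on demand
        return [
            {f: self.samples[i][f] for f in self.field_names} for i in indices
        ]
```

Both paths return the same batch of sample dicts; the vectorized path trades upfront memory for cheaper per-batch access, which is why it only pays off when the cached fields fit in memory.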
1898-1914: Good implementation of batch indexing. The new `__getitems__` method properly dispatches to the appropriate batch preparation method and applies the `GetItem` conversion to each sample. This enables efficient batch processing without sacrificing flexibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fiftyone/core/models.py (1)
806-859: Refactored data loader creation with unified error handling. The refactored `_make_data_loader` function now:
- Uses the new `ErrorHandlingCollate` class for consistent error handling
- Conditionally creates datasets based on model capabilities
- Configures appropriate worker initialization functions
- Sets optimized DataLoader parameters like `pin_memory`

I noticed there's an opportunity to use a ternary operator for the `user_collate_fn` assignment, as suggested by static analysis.

```diff
- if model.has_collate_fn:
-     user_collate_fn = model.collate_fn
- else:
-     user_collate_fn = None
+ user_collate_fn = model.collate_fn if model.has_collate_fn else None
```

🧰 Tools
🪛 Ruff (0.8.2)
822-825: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of if-else-block. Replace if-else-block with `user_collate_fn = model.collate_fn if model.has_collate_fn else None` (SIM108)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fiftyone/core/models.py (27 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/models.py
822-825: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block
Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None
(SIM108)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test / test-app
- GitHub Check: e2e / test-e2e
- GitHub Check: build / build
- GitHub Check: build
🔇 Additional comments (8)
fiftyone/core/models.py (8)
155-160: Support for field mapping is a nice enhancement. This new parameter allows models to customize how sample fields are mapped to model inputs, giving more flexibility to model implementers. Good use of the kwargs dictionary to avoid adding new parameters to the function signature.
168-171: Better control flow with explicit video frame processing flag. The new `process_video_frames` flag makes the code more readable and maintainable by clearly indicating when an image model is being applied to video frames. This improves the architecture by making the control flow more explicit and easier to follow.

Also applies to: 213-213, 226-226, 972-975, 1010-1010, 1017-1017, 1525-1525, 1584-1584
172-175: Extended data loader support for models implementing SupportsGetItem. The `use_data_loader` condition now correctly includes models implementing the new `SupportsGetItem` interface, but properly excludes video frame processing, which requires special handling.

Also applies to: 976-979, 1527-1530
332-337: Consistent field selection across methods. The refactored field selection pattern ensures consistent behavior across all methods, explicitly filtering fields when samples implement `SamplesMixin` or including only the needed fields for specialized operations. This reduces redundancy and improves maintainability.

Also applies to: 379-384, 448-453, 511-516, 578-583, 649-654, 1062-1062, 1106-1106, 1174-1174, 1234-1234, 1300-1300, 1370-1370, 1639-1639, 1754-1754
740-804: New ErrorHandlingCollate class improves batch processing robustness. This well-designed class centralizes batch collation logic with three strategies:
- Handling ragged batches (variable-sized tensors)
- Using numpy stacking when appropriate
- Using default PyTorch collation
The error handling mechanism gracefully handles exceptions based on the `skip_failures` flag. This is essential for production systems that need to process large datasets with potentially problematic samples.
837-841: Key implementation for GetItem integration with torch datasets. This critical block enables the performance improvements mentioned in the PR objectives. For models that support GetItem:
- The model can build its own GetItem instance with custom field mapping
- The dataset is created with to_torch() which likely enables more efficient data loading
- Worker initialization is properly configured for multiprocessing
This change is what enables the throughput improvement from 1.1k to 1.4k samples per second.
2442-2456: New SupportsGetItem mixin enables efficient and customizable data loading. This well-designed mixin class:
- Marks models compatible with the new structured data retrieval mechanism
- Defines a clear API for building GetItem instances
- Provides access to required keys for validation
- Makes the integration with FiftyOneTorchDataset seamless
This is a clean, object-oriented approach to extending model capabilities without breaking existing code.
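To illustrate the duck-typed contract described above, here is a hedged sketch (`FakeGetItem` and `FakeModel` are inventions for this example; the real `SupportsGetItem` lives in `fiftyone/core/models.py` and its exact API may differ):

```python
class SupportsGetItemSketch:
    """Sketch of a mixin marking models that can build a GetItem."""

    def build_get_item(self, field_mapping=None):
        raise NotImplementedError("subclasses must implement build_get_item()")

    @property
    def required_keys(self):
        # delegate to the GetItem instance, avoiding duplication
        return self.build_get_item().required_keys


class FakeGetItem:
    required_keys = ("filepath",)

    def __init__(self, field_mapping=None):
        self.field_mapping = field_mapping or {k: k for k in self.required_keys}

    def __call__(self, d):
        # convert a sample dict to a model input (here: just the path)
        return d[self.field_mapping["filepath"]]


class FakeModel(SupportsGetItemSketch):
    def build_get_item(self, field_mapping=None):
        return FakeGetItem(field_mapping=field_mapping)
```

A data-loader factory can then select the efficient path with a simple `hasattr(model, "build_get_item")` capability check, as the review notes.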
856-859: Improved DataLoader configuration. Setting `pin_memory=True` can significantly improve performance when transferring data to CUDA devices by using pinned memory. Setting `persistent_workers=False` avoids keeping workers alive between data iterations, which is appropriate for most inference scenarios. The conditional setting of `worker_init_fn` ensures proper worker initialization when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
fiftyone/utils/torch.py (1)
257-324: Well-designed abstraction, but consider simplifying field_mapping initialization. The `GetItem` class provides a good abstraction for converting sample dictionaries to model inputs. It enables consistent field selection and handling across different model types. However, the `field_mapping` property's initialization could be simplified:

```diff
-    @property
-    def field_mapping(self):
-        """
-        a dictionary mapping the fields in the sample to the input fields for the model
-        if not defined, it's assumed that the fields are the same as the `required_keys`
-        of the `GetItem` instance.
-
-        When this is updated, the behavior of __call__ should be updated accordingly.
-        This is what allows the user to specify the fields in the sample that are use
-        as input to the model.
-        """
-        if not hasattr(self, "_field_mapping_dict"):
-            self._field_mapping_dict = {k: k for k in self.required_keys}
-        return self._field_mapping_dict
+    @property
+    def field_mapping(self):
+        """
+        a dictionary mapping the fields in the sample to the input fields for the model
+        if not defined, it's assumed that the fields are the same as the `required_keys`
+        of the `GetItem` instance.
+
+        When this is updated, the behavior of __call__ should be updated accordingly.
+        This is what allows the user to specify the fields in the sample that are use
+        as input to the model.
+        """
+        return self._field_mapping_dict
```

And initialize `self._field_mapping_dict` directly in `__init__`:

```diff
 def __init__(self, field_mapping=None, **kwargs):
     super().__init__(**kwargs)
+    self._field_mapping_dict = {k: k for k in self.required_keys}
     self.field_mapping = field_mapping
```

This would make the code more straightforward and avoid the need for `hasattr` checks.

fiftyone/core/models.py (3)
766-777: Consider making this a class method. Since `handle_errors` doesn't use any instance attributes, consider converting it to a `@classmethod` rather than a `@staticmethod` to improve readability and allow for more natural subclassing.

```diff
- @staticmethod
- def handle_errors(batch):
+ @classmethod
+ def handle_errors(cls, batch):
```
828-831: Use a ternary operator for clarity. Replace the if-else block with a more concise ternary expression.

```diff
- if model.has_collate_fn:
-     user_collate_fn = model.collate_fn
- else:
-     user_collate_fn = None
+ user_collate_fn = model.collate_fn if model.has_collate_fn else None
```

🧰 Tools
🪛 Ruff (0.8.2)
828-831: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of if-else-block. Replace if-else-block with `user_collate_fn = model.collate_fn if model.has_collate_fn else None` (SIM108)
857-865: DataLoader configuration improvements. The addition of `pin_memory=True` will improve data transfer to the GPU. Setting `persistent_workers=False` is a reasonable default, but consider making this configurable for long-running workloads. Consider adding a parameter to allow configuring `persistent_workers` for scenarios with continuous inference on large datasets, where the worker startup overhead could be significant.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fiftyone/core/models.py (27 hunks)
fiftyone/utils/torch.py (10 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
fiftyone/core/models.py
828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block
Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None
(SIM108)
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: test / test-app
- GitHub Check: e2e / test-e2e
- GitHub Check: build / build
- GitHub Check: build
🔇 Additional comments (18)
fiftyone/utils/torch.py (11)
13-13: LGTM: Improved typing imports

The addition of specific type imports (`Any`, `Optional`, `List`) is good practice and enhances type checking capabilities.
590-629: LGTM: Clean implementation of ImageGetItem

The `ImageGetItem` class provides a clean, specific implementation for image models with appropriate error handling and half-precision support. I especially appreciate the proper tensor type checking on line 621 before applying half-precision, which avoids potential errors when `img` might not be a tensor.
631-633: LGTM: Added support for GetItem interface

Good addition of the `SupportsGetItem` mixin to `TorchImageModel`, which aligns with the PR objective to enable models to benefit from more efficient data loading.
712-720: LGTM: Well-implemented GetItem builder method

The `build_get_item` method properly constructs an `ImageGetItem` with all the necessary configuration from the model. This enables the data loading optimization mentioned in the PR objectives.
1707-1711: LGTM: Improved dataset initialization with GetItem

The updated parameter signature using a strongly-typed `GetItem` instance and the new `vectorize` flag enables the performance improvements mentioned in the PR objectives.
1727-1729: Strong improvement: Field selection optimization

This code now selects only the needed fields from samples using the field mapping from the `GetItem`, which is a significant performance optimization when datasets have many fields.
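The field-selection idea is easy to sketch outside of FiftyOne. The following is an illustrative stand-in, not FiftyOne's API: the function name `select_needed_fields` and the plain-dict sample are assumptions.

```python
def select_needed_fields(sample_dict, field_mapping):
    """Project a raw sample dict down to only the fields the model needs.

    field_mapping maps a GetItem's required keys to dataset field names,
    e.g. {"filepath": "filepath", "label": "ground_truth"}; all other
    fields are dropped before the sample leaves the database layer.
    """
    return {key: sample_dict[field] for key, field in field_mapping.items()}

sample = {"filepath": "/data/001.jpg", "ground_truth": "cat", "uniqueness": 0.93}
mapping = {"filepath": "filepath", "label": "ground_truth"}
model_input = select_needed_fields(sample, mapping)
```

With many fields per sample, this projection is what keeps the per-batch payload small.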
1742-1751: LGTM: Efficient field caching strategy

The conditional caching of fields based on the `vectorize` flag provides a good balance between memory usage and performance. When enabled, this should significantly improve throughput as mentioned in the PR objectives.
1796-1798: Important fix for worker safety

Adding the database disconnect before initializing workers is crucial for preventing connection issues in multiprocessing environments. This is a good safety enhancement.
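The fork-safety pattern being praised can be sketched without any database at all; this is a hedged stand-in (the `LazyConnection` class is hypothetical, not FiftyOne's code): connection handles inherited across `fork()` must be dropped so each worker opens its own connection lazily.

```python
class LazyConnection:
    """Stand-in for a lazily-created database client (e.g. a MongoClient)."""

    def __init__(self):
        self._client = None

    def disconnect(self):
        # forget any socket/handle inherited from the parent process
        self._client = None

    def client(self):
        if self._client is None:
            self._client = object()  # stand-in for opening a real connection
        return self._client

conn = LazyConnection()
parent_client = conn.client()  # connection created in the parent process

def worker_init(worker_id):
    # runs once in each DataLoader worker before any data is fetched
    conn.disconnect()

worker_init(0)
assert conn.client() is not parent_client  # worker gets a fresh client
```

Sharing a forked socket between processes is the classic cause of the deadlocks the review alludes to; dropping it in `worker_init_fn` sidesteps the problem.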
1858-1890: LGTM: Well-implemented batch preparation

The code now offers two different batch preparation strategies (`_prepare_batch_db` and `_prepare_batch_vectorized`) depending on whether fields are cached. The implementation includes proper error handling for individual fields, which is a good defensive practice.
1891-1893: Simple but effective delegation pattern

The single-item getter now delegates to the batch getter, which is a good code reuse pattern. This ensures consistent behavior between single-item and batch accesses.
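The two patterns described above — dispatching on a vectorize flag and delegating `__getitem__` to the batch path — can be sketched with a toy dataset (simplified, not FiftyOne's implementation; all names here are illustrative):

```python
class ToyDataset:
    def __init__(self, records, vectorize=False):
        self.vectorize = vectorize
        if vectorize:
            self.cache = list(records)  # fields preloaded into memory
        else:
            self.db = list(records)  # stands in for on-demand DB reads

    def _prepare_batch_db(self, indices):
        return [self.db[i] for i in indices]

    def _prepare_batch_vectorized(self, indices):
        return [self.cache[i] for i in indices]

    def __getitems__(self, indices):
        # single dispatch point: choose the strategy once per batch
        if self.vectorize:
            return self._prepare_batch_vectorized(indices)
        return self._prepare_batch_db(indices)

    def __getitem__(self, index):
        # single-item access reuses the batch path for consistency
        return self.__getitems__([index])[0]

ds = ToyDataset(["a", "b", "c"], vectorize=True)
```

PyTorch's default collation will call `__getitems__` when a dataset defines it, so batch and single-item reads stay behaviorally identical.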
1894-1907: LGTM: Robust batch retrieval implementation

The `__getitems__` method properly dispatches to the appropriate batch preparation method based on the `vectorize` flag and handles errors consistently. It's a clean implementation that should deliver the performance improvements targeted by this PR.

fiftyone/core/models.py (7)
155-172: Enhanced model capability detection with field mapping support

This code introduces structured handling of field mappings based on model capabilities. It separates concerns between field mapping (for `SupportsGetItem` models) and needed fields (for `SamplesMixin` models) with clean logic branches.
746-765: Good implementation of the error-handling collate class

The `ErrorHandlingCollate` class provides a clean abstraction for handling different collation strategies with proper error management. It intelligently switches between ragged batches, numpy stacking, and default collation based on model requirements.
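A torch-free sketch of the error-handling collate idea (hedged: the real `ErrorHandlingCollate` lives in `fiftyone/core/models.py`; this toy version only shows the shape of the pattern): failed items are separated from good ones before collation runs, so a single bad sample cannot poison the whole batch.

```python
class ErrorHandlingCollate:
    def __init__(self, collate_fn=None, ragged_batches=False):
        # fall back to a trivial "collate" when none is supplied
        self.collate_fn = collate_fn or (lambda items: list(items))
        self.ragged_batches = ragged_batches

    @staticmethod
    def handle_errors(batch):
        # split the batch into successfully-loaded items and exceptions
        good = [b for b in batch if not isinstance(b, Exception)]
        errors = [b for b in batch if isinstance(b, Exception)]
        return good, errors

    def __call__(self, batch):
        good, errors = self.handle_errors(batch)
        if self.ragged_batches:
            return good, errors  # leave items unstacked
        return self.collate_fn(good), errors

collate = ErrorHandlingCollate()
items, errors = collate([1, ValueError("corrupt sample"), 2])
```

In the real pipeline the `collate_fn` branch would stack tensors (or numpy arrays), while the ragged branch keeps variable-shaped inputs as a list.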
812-842: Robust data loader configuration

The refactored `_make_data_loader` function now properly handles field mappings and model-specific requirements. This change is central to the PR's performance improvements.

🧰 Tools
🪛 Ruff (0.8.2)
828-831: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of `if`-`else` block (SIM108)
833-838: Comprehensive error handling collation setup

The collate function is well-configured with all necessary parameters based on model requirements.
843-856: Clean conditional dataset creation

This section efficiently switches between the new `GetItem`-based approach and the traditional torch dataset based on model capabilities.
972-981: Consistent field mapping implementation in compute_embeddings

This follows the same pattern as in `apply_model`, ensuring consistent behavior across different entry points.
2448-2462: Well-designed SupportsGetItem mixin interface

The mixin cleanly defines the interface for models that support the new `GetItem` approach, with a clear contract for implementation. However, it may be helpful to add a short docstring example showing how to implement the `build_get_item` method for clarity.
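The kind of docstring example the review asks for might look like the following. This is a hypothetical minimal implementation of the contract, not FiftyOne's code: the toy `GetItem` and `MyModel` classes, and the default `{"filepath": "filepath"}` mapping, are assumptions.

```python
class GetItem:
    """Callable that turns a sample dict into a model input."""

    def __init__(self, field_mapping=None):
        # default: the "filepath" key reads the "filepath" field
        self.field_mapping = field_mapping or {"filepath": "filepath"}

    @property
    def required_keys(self):
        return list(self.field_mapping.keys())

    def __call__(self, d):
        # in a real model this would load and transform the image
        return d[self.field_mapping["filepath"]]

class MyModel:
    """Toy model showing how build_get_item might be implemented."""

    def build_get_item(self, field_mapping=None):
        # hand DataLoader workers a lightweight, picklable callable
        # instead of the full model
        return GetItem(field_mapping=field_mapping)

get_item = MyModel().build_get_item()
```

A custom mapping, e.g. `build_get_item({"filepath": "frames.filepath"})`, would then read from a different field without touching the model.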
Actionable comments posted: 0
🧹 Nitpick comments (1)
fiftyone/utils/torch.py (1)
590-629: Good implementation of image model input processing with proper error handling.

The `ImageGetItem` class correctly handles image loading, transformation, and half-precision conversion with proper type checking. The implementation addresses previously identified issues with non-tensor inputs. Note that the comment on line 609 indicates a "hardcoded" filepath field, which might be worth making configurable in the future for more flexibility.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- fiftyone/utils/torch.py (10 hunks)
🔇 Additional comments (8)
fiftyone/utils/torch.py (8)
257-324: Well-designed base abstraction for model inputs.

The `GetItem` class defines a clean interface for transforming sample dictionaries into model inputs, with proper validation and sensible defaults. The field mapping system provides flexible field name mapping with strong validation.
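The validation behavior described above can be sketched as follows (a minimal stand-in assembled from the setter shown elsewhere in this review; the single `"filepath"` required key is an assumption): unknown keys are rejected loudly, and each required key defaults to an identity mapping.

```python
class GetItem:
    required_keys = ["filepath"]

    def __init__(self, field_mapping=None):
        # default: each required key maps to a field of the same name
        self._field_mapping = {k: k for k in self.required_keys}
        self.field_mapping = field_mapping

    @property
    def field_mapping(self):
        return self._field_mapping

    @field_mapping.setter
    def field_mapping(self, value):
        if value is None:
            return

        for k, v in value.items():
            if k not in self.required_keys:
                raise ValueError(
                    f"Unknown key '{k}'; supported keys are {self.required_keys}"
                )
            self._field_mapping[k] = v

gi = GetItem(field_mapping={"filepath": "frames.filepath"})
```

Because unset keys keep their identity defaults, callers only need to override the fields that actually differ.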
631-633: Proper integration of the GetItem support in the model hierarchy.

Adding `SupportsGetItem` to the class inheritance hierarchy enables the model to work with the new data loading mechanism while maintaining backward compatibility.
712-720: Well-implemented GetItem factory method.

The `build_get_item` method correctly configures an `ImageGetItem` with the model's transforms, precision settings, and optional field mapping. This provides a standardized way to create model-specific item processors.
1707-1729: Integrated field selection optimization in constructor.

The updated constructor correctly extracts the field mapping from the `GetItem` instance and performs early field selection to limit the data loaded from the database. This addresses the performance concern mentioned in previous review comments.
🧰 Tools
🪛 Ruff (0.8.2)
1723-1723: No explicit `stacklevel` keyword argument found (B028)
1742-1760: Added efficient vectorized data loading support.

The vectorize mode provides a performance optimization by preloading required fields into memory. The implementation properly handles field caching while maintaining compatibility with the existing dataset interface.
1797-1798: Fixed database connection handling for multiprocessing safety.

Adding explicit database disconnection in worker initialization addresses the potential deadlock issues when using fork-based multiprocessing, as highlighted in previous review comments.
1858-1890: Well-implemented batch preparation methods with error handling.

The implementation provides two batch preparation strategies with proper error handling:

- `_prepare_batch_db`: loads samples from the database on demand
- `_prepare_batch_vectorized`: uses cached fields for faster access

The error handling approach respects the `skip_failures` flag and provides informative error messages.
1891-1907: Improved batch access with efficient dispatching.

The implementation unifies single-item and batch access through the `__getitems__` method, which properly dispatches to the appropriate batch preparation method based on the `vectorize` flag. This provides consistent behavior while enabling performance optimizations.
Changed the title from "apply_model: use FOTD for TorchImageModels" to "Use FiftyOneTorchDataset when applying TorchImageModels"
Actionable comments posted: 0
🧹 Nitpick comments (9)
fiftyone/core/collections.py (1)
9586-9624: Well-designed integration with PyTorch data loading.

The new `to_torch()` method provides a clean interface to create PyTorch datasets from FiftyOne collections, leveraging the new `GetItem` abstraction. The implementation is straightforward and the docstring is comprehensive. This change aligns with the PR objective to improve data loading performance for `TorchImageModel`s by using the more efficient `FiftyOneTorchDataset`.

Consider adding a usage example in the docstring to demonstrate how to create a `GetItem` instance and use it with this method, particularly for common model types. This would help users understand how to leverage this optimization path.

fiftyone/core/models.py (4)
746-809: Well-structured batch collation error handling

The `ErrorHandlingCollate` class effectively encapsulates different collation strategies with robust error handling. It supports ragged batches, numpy stacking, and default PyTorch collation.

Consider using a ternary operator for more concise code:

```diff
- if self.user_collate_fn is None:
-     self.user_collate_fn = tud.dataloader.default_collate
+ self.user_collate_fn = self.user_collate_fn or tud.dataloader.default_collate
```
812-865: Enhanced data loader creation supporting GetItem models

The function now intelligently handles models implementing `SupportsGetItem` by creating a dataset through the model's `build_get_item` method. This enables more efficient data loading paths.

Consider using a ternary operator as suggested by static analysis:

```diff
- if model.has_collate_fn:
-     user_collate_fn = model.collate_fn
- else:
-     user_collate_fn = None
+ user_collate_fn = model.collate_fn if model.has_collate_fn else None
```

🧰 Tools
🪛 Ruff (0.8.2)
828-831: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of `if`-`else` block (SIM108)
2448-2478: Well-designed SupportsGetItem mixin

This mixin provides a clean interface for models to integrate with the new `FiftyOneTorchDataset`. The `required_keys` property and `build_get_item` method establish a clear contract for implementations.

The `required_keys` property creates a new `GetItem` instance on each access, which could be inefficient if called frequently. Consider caching the instance or the keys:

```python
def __init__(self):
    self._get_item_cache = None
    self._required_keys_cache = None

@property
def required_keys(self):
    """The required keys that must be provided to methods like
    :func:`apply_model` and :func:`compute_embeddings` during inference.
    """
    if self._required_keys_cache is None:
        self._required_keys_cache = self.build_get_item().required_keys

    return self._required_keys_cache
```
2458-2464: Potential inefficiency in required_keys implementation

The `required_keys` property creates a new `GetItem` instance on every access by calling `build_get_item()` without parameters, which could be inefficient if accessed frequently.

Consider caching the result or implementing a more efficient approach to avoid creating unnecessary objects:

```python
@property
def required_keys(self):
    """The required keys that must be provided to methods like
    :func:`apply_model` and :func:`compute_embeddings` during inference.
    """
    if not hasattr(self, "_required_keys_cache"):
        self._required_keys_cache = self.build_get_item().required_keys

    return self._required_keys_cache
```

fiftyone/utils/torch.py (4)
13-13: Clean up unused import statements.

The static analysis shows that `Any`, `Optional`, and `List` are imported but never used in this file.

```diff
-from typing import Any, Optional, List
+from typing import Any  # Import specific types only when needed
```

If these types will be used in future iterations, consider adding a `# noqa: F401` comment to suppress the linter warning.

🧰 Tools
🪛 Ruff (0.8.2)
13-13: `typing.Any` imported but unused. Remove unused import (F401)
13-13: `typing.Optional` imported but unused. Remove unused import (F401)
13-13: `typing.List` imported but unused. Remove unused import (F401)
316-326: Consider additional validation for field mapping values.

The setter only validates that keys exist in `required_keys`, but doesn't verify that the mapped field names actually exist in the sample collection. This could lead to runtime errors when the dataset doesn't contain the expected fields.

You could add a `validate_field_names` method that gets called when a dataset is bound to the GetItem instance:

```python
def validate_field_names(self, sample_collection):
    """Validates that all fields referenced in field_mapping exist in the
    collection.

    Args:
        sample_collection: a SampleCollection instance

    Raises:
        ValueError: if any mapped field doesn't exist in the collection
    """
    for key, field_name in self.field_mapping.items():
        if not sample_collection.has_field(field_name):
            raise ValueError(
                f"Field mapping references non-existent field "
                f"'{field_name}' for key '{key}'"
            )
```
1736-1737: Consider implementing the TODO for further optimization.

Loading all fields via a single `values()` call could improve performance by reducing the number of database queries.

```diff
- # @todo load all fields via a single `values()` call
- for field_name in fields_to_load:
-     self.cached_fields[field_name] = self._load_field(
-         samples, field_name, local_process_group=local_process_group
-     )
+ if fields_to_load:
+     field_values = samples.values(*fields_to_load)
+     for i, field_name in enumerate(fields_to_load):
+         self.cached_fields[field_name] = self._load_field_from_values(
+             field_values[i], local_process_group=local_process_group
+         )
```

This would require adding a new helper method to convert values to a serialized list.
1850-1852: Refine error handling for individual fields.

Currently, if loading one field fails, the entire sample is replaced with an error object. Consider a more granular approach that allows partial field failures.

```diff
- d = error
- break
+ # Either set just this field to an error or add it to a list of errors
+ d[key] = None  # Set failed field to None
+ if not hasattr(d, "_errors"):
+     d._errors = []
+ d._errors.append((key, error))  # Track the error for later handling
```

This would allow the GetItem to still process samples with partial field failures, with appropriate error handling.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- fiftyone/core/collections.py (1 hunks)
- fiftyone/core/models.py (27 hunks)
- fiftyone/utils/torch.py (11 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/core/collections.py (1)
- fiftyone/utils/torch.py (1): `FiftyOneTorchDataset` (1664-1888)
🪛 Ruff (0.8.2)
fiftyone/core/models.py
828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block
Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None
(SIM108)
fiftyone/utils/torch.py
13-13: typing.Any imported but unused
Remove unused import
(F401)
13-13: typing.Optional imported but unused
Remove unused import
(F401)
13-13: typing.List imported but unused
Remove unused import
(F401)
🔇 Additional comments (15)
fiftyone/core/models.py (7)
155-172: Effective model capability detection logic

The code intelligently handles field mapping based on model implementation capabilities, distinguishing between models that implement `SupportsGetItem`, `SamplesMixin`, or both.
174-176: Good addition of process_video_frames flag

This flag clearly indicates when an image model is being applied to video frames, making the code more readable and maintainable.
178-181: Condition updated to support new model types

The updated condition properly accounts for models implementing `SupportsGetItem` in addition to the existing `TorchModelMixin` check.
184-185: Improved warning message clarity

The simplified warning message is more concise while maintaining the necessary information about ignoring the unsupported parameter.
843-856: Clean conditional dataset creation

This section elegantly handles the creation of appropriate datasets based on model capabilities, properly passing necessary parameters and setting worker initialization functions.
972-981: Consistent field_mapping handling in compute_embeddings

The implementation correctly mirrors the `field_mapping` handling from `apply_model`, ensuring consistency across different model application functions.
1531-1536: Consistent process_video_frames use in compute_patch_embeddings

The same approach for detecting video frame processing is applied here, maintaining consistency with other functions.
fiftyone/utils/torch.py (8)
257-278: Well-designed abstraction for model input preparation.

The `GetItem` class provides a clean interface for standardizing how sample data is converted to model inputs. The inheritance pattern with `super().__init__(**kwargs)` helps avoid method resolution order issues when used with mixins.
593-628: Good implementation of GetItem with clear responsibility.

This implementation provides all the necessary functionality for handling image data, including transformations and half-precision support when applicable.
642-644: Proper handling of half-precision for tensor types.

The code correctly checks if the image is a torch tensor before applying half-precision, avoiding potential AttributeError exceptions when working with PIL Images or numpy arrays.
733-741: Good TorchImageModel integration with GetItem abstraction.

The `build_get_item` method effectively bridges the TorchImageModel implementation with the new GetItem abstraction, reusing existing model properties and configurations.
1696-1701: Effective field selection optimization.

Selecting only the required fields from the sample collection is an important optimization that reduces the amount of data loaded from the database, particularly for large datasets with many fields.
1763-1780: Good handling of database connections in worker processes.

The explicit disconnection of MongoDB clients in worker initialization addresses potential deadlock issues with fork-based multiprocessing, as recommended in the PyMongo documentation.
1869-1886: Efficient batch processing implementation.

Implementing `__getitem__` in terms of `__getitems__` creates a consistent interface for both single-item and batch retrieval, while the conditional dispatch to either database or vectorized batch preparation provides flexibility.
2541-2572: Improved utility functions for distributed training.

The enhanced docstrings and implementation improvements in the distributed training utility functions make the code more maintainable and easier to understand.
Actionable comments posted: 0
🧹 Nitpick comments (3)
fiftyone/core/models.py (1)
812-865: Enhanced data loader creation with `SupportsGetItem` support

The `_make_data_loader` function has been updated to check if a model implements `SupportsGetItem` and use the appropriate dataset type. The implementation includes performance optimizations like setting `pin_memory=True` and configuring worker initialization when needed.

Suggestion for a minor code simplification:

```diff
- if model.has_collate_fn:
-     user_collate_fn = model.collate_fn
- else:
-     user_collate_fn = None
+ user_collate_fn = model.collate_fn if model.has_collate_fn else None
```

🧰 Tools
🪛 Ruff (0.8.2)
828-831: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of `if`-`else` block (SIM108)
fiftyone/utils/torch.py (2)
8-15: Consider removing unused typing imports.

The static analysis hints indicate that `Any`, `Optional`, and `List` are imported but not used in the code.

```diff
-from typing import Any, Optional, List
+import typing
```

Or, if you prefer keeping explicit imports, remove the unused ones:

```diff
-from typing import Any, Optional, List
```

🧰 Tools
🪛 Ruff (0.8.2)
15-15: `typing.Any` imported but unused. Remove unused import (F401)
15-15: `typing.Optional` imported but unused. Remove unused import (F401)
15-15: `typing.List` imported but unused. Remove unused import (F401)
315-326: Field mapping setter only validates keys but not completeness.

The current implementation validates that provided keys are supported but doesn't check if all required keys are present in the final mapping after updates. This could lead to silent failures if a required key is missing.

Consider adding validation to ensure all required keys have mappings:

```diff
  def field_mapping(self, value):
      if value is None:
          return

      for k, v in value.items():
          if k not in self.required_keys:
              raise ValueError(
                  f"Unknown key '{k}'. The supported keys are {self.required_keys}"
              )

          self._field_mapping[k] = v
+
+     # Ensure all required keys have mappings
+     missing_keys = set(self.required_keys) - set(self._field_mapping.keys())
+     if missing_keys:
+         raise ValueError(f"Missing field mappings for required keys: {missing_keys}")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- fiftyone/core/models.py (27 hunks)
- fiftyone/utils/torch.py (11 hunks)
- tests/unittests/get_item_tests.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unittests/get_item_tests.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (5)
- tests/unittests/get_item_tests.py (1): `required_keys` (14-15)
- fiftyone/core/models.py (6): `required_keys` (2458-2462), `SupportsGetItem` (2448-2475), `TorchModelMixin` (2406-2445), `LogitsMixin` (2196-2223), `Model` (2068-2193), `build_get_item` (2464-2475)
- tests/unittests/torch_dataset_tests.py (2): `required_keys` (24-25), `required_keys` (37-38)
- fiftyone/core/odm/database.py (1): `_disconnect` (249-253)
- fiftyone/core/view.py (1): `make_optimized_select_view` (1950-2022)
🪛 Ruff (0.8.2)
fiftyone/core/models.py
828-831: Use ternary operator user_collate_fn = model.collate_fn if model.has_collate_fn else None instead of if-else-block
Replace if-else-block with user_collate_fn = model.collate_fn if model.has_collate_fn else None
(SIM108)
fiftyone/utils/torch.py
15-15: typing.Any imported but unused
Remove unused import
(F401)
15-15: typing.Optional imported but unused
Remove unused import
(F401)
15-15: typing.List imported but unused
Remove unused import
(F401)
🔇 Additional comments (24)
fiftyone/core/models.py (10)
155-177: New check for `SupportsGetItem` and field mapping handling added

The code introduces the concept of `SupportsGetItem` models and handles field mapping appropriately. This is a well-structured implementation that clearly separates concerns between different model types (with SupportsGetItem, SamplesMixin, or both).
179-181: Updated condition for using data loader

The condition for using a data loader has been expanded to include `SupportsGetItem` models while excluding video frame processing. This change aligns with the PR objective of using `FiftyOneTorchDataset` with appropriate models.
746-809: New `ErrorHandlingCollate` class improves batch handling

This new class provides a robust way to handle errors during batch collation with different strategies (ragged batches, numpy stacking, or default PyTorch collation). The implementation is clean and follows good OOP principles with appropriate error handling.

The static method design for `handle_errors` is appropriate here as it's a utility function that doesn't require instance state.
843-856: Clean conditional dataset creation based on model type

The code correctly creates either a `FiftyOneTorchDataset` using the model's `build_get_item` method or falls back to the traditional `TorchImageDataset` based on the model's capabilities. This is a well-implemented strategy pattern.
972-981: Consistent handling of field mapping and video frames in `compute_embeddings`

The changes ensure that `compute_embeddings` handles field mapping and video frame processing consistently with the `apply_model` function. This maintains a unified approach across the main model application functions.
1169-1181: Field mapping parameter added to embedding function

The updated `_compute_image_embeddings_data_loader` function now accepts the `field_mapping` parameter and passes it to the data loader, ensuring consistency with the main model application patterns.
1531-1536: Consistent implementation in `compute_patch_embeddings`

The same pattern for checking video frames and data loader usage is applied to `compute_patch_embeddings`, maintaining consistency across all model application functions in the codebase.
2448-2476: New `SupportsGetItem` interface enables improved dataset integration

This new interface is the core addition that enables models to work with `FiftyOneTorchDataset` by implementing a `build_get_item` method. The implementation is clean with well-documented methods. The `required_keys` property intelligently delegates to the `GetItem` instance, avoiding code duplication.
2457-2463: Smart implementation of `required_keys` property

The property provides a convenient way to access the required keys from the `GetItem` instance without duplicating the logic. This is a good example of proper delegation.
1-2520: Overall assessment of the implementation

The changes effectively implement support for the `FiftyOneTorchDataset` when applying `TorchImageModel`s through the introduction of the `SupportsGetItem` interface and proper handling of field mapping. The code is well-structured, maintains consistency across similar functions, and includes performance optimizations. The implementation aligns perfectly with the PR objectives and the reported performance improvements (from 1.1k to 1.4k samples per second).
I've verified that the PR correctly implements:
- A new interface (`SupportsGetItem`) for models to integrate with `FiftyOneTorchDataset`
- Field mapping support for all relevant functions
- Improved batch handling with the new `ErrorHandlingCollate` class
- Consistent implementation across all model application functions
The code quality is high with good error handling, performance considerations, and clean abstractions.
🧰 Tools
🪛 Ruff (0.8.2)
203-203: Do not use bare `except` (E722)
530-533, 597-600, 666-669, 1254-1257, 1320-1323, 1387-1390, 1840-1843: Use ternary operator `frames = etaf.FrameRange(*sample.support) if is_clips else None` instead of `if`-`else` block (SIM108)
828-831: Use ternary operator `user_collate_fn = model.collate_fn if model.has_collate_fn else None` instead of `if`-`else` block (SIM108)
1284-1287, 1355-1358: Use ternary operator `embeddings = np.stack(embeddings) if embeddings else None` instead of `if`-`else` block (SIM108)
fiftyone/utils/torch.py (14)
257-306: Well-designed abstract base class for data loading.

The new `GetItem` class provides a clean interface for standardizing sample-to-input conversion. The abstract method pattern ensures that subclasses implement the required functionality while providing common field mapping logic.

A minor observation: consider adding a method to validate that all required keys have corresponding fields in the mapping after updates, not just that the provided keys are valid.
592-627: Good implementation of GetItem for image models.

The `ImageGetItem` class provides a concrete implementation for handling image data with appropriate options for transforms, precision, and output format. Note that this class provides a hard-coded value for `force_rgb=True` in the `_load_image` call. If there's a need to support grayscale or RGBA images in the future, consider making this configurable through the constructor.
628-649: Safe handling of half precision conversion.

The implementation correctly checks if the image is a tensor before applying half precision, avoiding potential errors with non-tensor inputs.
732-740: TorchImageModel now supports the GetItem interface.

The `build_get_item` method provides a clean way to create an `ImageGetItem` instance configured with the model's transforms and precision settings. Note that, per the comment on line 734, this uses `self._transforms`, which might differ from `self.transforms` in subclasses as noted in a prior review comment. This is acknowledged as a potential separate refactor.
1664-1696: Enhanced FiftyOneTorchDataset with vectorization support.

The refactored dataset class now supports both on-demand and vectorized data loading strategies, allowing efficient trade-offs between memory usage and performance.

The vectorization approach provides significant performance benefits by pre-loading required fields, which aligns well with the PR's objective to improve throughput from 1.1k to 1.4k samples per second.

1697-1721: Optimized field selection and caching.

The implementation selects only the specific fields needed for the model, reducing database load. The vectorize option enables preloading fields for better performance.
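The on-demand vs. vectorized trade-off can be sketched with plain dictionaries (the `FakeBackend` below is a hypothetical stand-in for the database; FiftyOne's actual implementation differs):

```python
class FakeBackend:
    """Hypothetical stand-in for a database of samples."""

    def __init__(self, samples):
        self._samples = samples
        self.queries = 0

    def fetch(self, sample_id, fields):
        # One round trip per sample
        self.queries += 1
        s = self._samples[sample_id]
        return {f: s[f] for f in fields}

    def fetch_all(self, fields):
        # One bulk round trip for the whole collection
        self.queries += 1
        return {
            sid: {f: s[f] for f in fields}
            for sid, s in self._samples.items()
        }


backend = FakeBackend(
    {i: {"filepath": f"/img/{i}.png", "label": i % 2} for i in range(4)}
)

# On-demand: one query per sample (low memory, more round trips)
on_demand = [backend.fetch(i, ["filepath"]) for i in range(4)]

# Vectorized: preload only the needed fields up front, then index from RAM
cache = backend.fetch_all(["filepath"])
vectorized = [cache[i] for i in range(4)]

assert on_demand == vectorized
print(backend.queries)  # 5: four on-demand fetches + one bulk fetch
```

Both strategies yield identical batches; vectorization trades memory (the cache) for far fewer round trips, which is where the throughput gain comes from.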
1726-1753: Efficient field caching with distributed training support.

The implementation handles field caching and distributed training scenarios well, with separate code paths for local and distributed execution.

A note for future optimization: There's a TODO comment at line 1735 about loading all fields via a single `values()` call, which could further improve performance.
1754-1779: Well-implemented worker initialization with database connection handling.

The worker initialization properly disconnects and reconnects to the database to avoid multiprocessing issues, addressing a common pitfall in distributed environments.

This implementation aligns with MongoDB's recommendation for handling connections in multiprocessing environments and should prevent deadlocks that could occur when forking processes with active database connections.
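The underlying pattern is the standard DataLoader `worker_init_fn` hook: each forked worker discards any connection inherited from the parent and lazily opens its own. A minimal sketch with a hypothetical module-level `_connection` global (PyMongo's actual guidance is to construct `MongoClient` after forking, never share one across the fork):

```python
# Hypothetical module-level connection, created in the parent process
_connection = None


def _get_connection():
    global _connection
    if _connection is None:
        _connection = object()  # stand-in for MongoClient(...)
    return _connection


def worker_init_fn(worker_id):
    """Passed as torch.utils.data.DataLoader(worker_init_fn=...).

    Runs once in each worker process: drop the connection inherited
    from the parent so the worker lazily opens a fresh one.
    """
    global _connection
    _connection = None


parent_conn = _get_connection()
worker_init_fn(worker_id=0)       # simulate a worker starting up
worker_conn = _get_connection()   # fresh connection, not the parent's
print(parent_conn is worker_conn)  # False
```

Sharing a socket-backed client across forked processes is what causes the deadlocks mentioned above; resetting it in the init hook guarantees each worker owns its connection.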
1835-1867: Efficient batch preparation strategies.

The implementation provides two batch preparation strategies: database-driven (`_prepare_batch_db`) and vectorized (`_prepare_batch_vectorized`), allowing for flexibility in performance optimization based on dataset size and memory constraints.
1868-1885: Extended dataset interface with batch retrieval support.

The addition of `__getitems__` enables efficient batch retrieval, with single-item access delegating to it. This provides a consistent interface while supporting batch operations.

This approach is more efficient than repeated calls to `__getitem__` for batch retrieval, particularly when using the database-driven strategy.
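Recent PyTorch versions' default fetcher will call a map-style dataset's `__getitems__` (note the plural) when it exists, so a dataset can serve a whole batch in one round trip and have single-item access delegate to the batch path. A toy illustration of the delegation pattern (not FiftyOne's actual class):

```python
class BatchFirstDataset:
    def __init__(self, records):
        self._records = records

    def __getitems__(self, indices):
        # One "bulk fetch" for the whole batch; in the real dataset
        # this is a single database query instead of len(indices) queries
        return [self._records[i] for i in indices]

    def __getitem__(self, index):
        # Single-item access delegates to the batch path
        return self.__getitems__([index])[0]

    def __len__(self):
        return len(self._records)


ds = BatchFirstDataset(["a", "b", "c"])
print(ds.__getitems__([0, 2]))  # ['a', 'c']
print(ds[1])                    # b
```

Delegating `__getitem__` to `__getitems__` keeps one code path for data loading, so both DataLoader batching and direct indexing behave identically.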
2470-2484: Changed logging level for serialization steps.

Logging for serialization steps has been changed from info to debug level, which is more appropriate for these detailed operations that don't require regular user attention.

2540-2571: Improved docstrings for distributed training utilities.

The docstrings for `get_local_size` and related functions have been enhanced with clear descriptions of their purpose and return values.

2573-2604: Well-implemented local rank and process group utilities.

The functions for managing distributed training process groups and ranks have clear implementations with appropriate fallbacks for non-distributed scenarios.

2606-2629: Enhanced local scatter implementation with improved docstring.

The `local_scatter` function now has a clearer docstring explaining its purpose and parameters.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fiftyone/utils/torch.py (1)
15-15: Remove unused typing imports.

The imported types `Any`, `Optional`, and `List` from the `typing` module aren't used in this file.

-from typing import Any, Optional, List
+from typing import Any  # Remove if not needed elsewhere in future changes
🪛 Ruff (0.8.2)
15-15: `typing.Any` imported but unused

Remove unused import (F401)

15-15: `typing.Optional` imported but unused

Remove unused import (F401)

15-15: `typing.List` imported but unused

Remove unused import (F401)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fiftyone/utils/torch.py (11 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fiftyone/utils/torch.py (5)
tests/unittests/torch_dataset_tests.py (2): `required_keys` (24-25), `required_keys` (37-38)
tests/unittests/get_item_tests.py (1): `required_keys` (14-15)
fiftyone/core/models.py (4): `required_keys` (2458-2462), `SupportsGetItem` (2448-2475), `Model` (2068-2193), `build_get_item` (2464-2475)
fiftyone/core/odm/database.py (1): `_disconnect` (249-253)
fiftyone/core/view.py (1): `make_optimized_select_view` (1950-2022)
🪛 Ruff (0.8.2)
fiftyone/utils/torch.py
15-15: typing.Any imported but unused
Remove unused import
(F401)
15-15: typing.Optional imported but unused
Remove unused import
(F401)
15-15: typing.List imported but unused
Remove unused import
(F401)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-windows / test-python (windows-latest, 3.12)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.12)
- GitHub Check: test-windows / test-python (windows-latest, 3.11)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.11)
- GitHub Check: test-windows / test-python (windows-latest, 3.10)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.10)
- GitHub Check: test-windows / test-python (windows-latest, 3.9)
- GitHub Check: test / test-python (ubuntu-latest-m, 3.9)
- GitHub Check: lint / eslint
- GitHub Check: test / test-app
- GitHub Check: build / build
- GitHub Check: e2e / test-e2e
- GitHub Check: build
🔇 Additional comments (7)
fiftyone/utils/torch.py (7)
257-308: LGTM: Well-designed base class for model input loading.

The `GetItem` abstract base class provides a clear interface for converting sample data dictionaries to model inputs, with good handling of field mapping between model keys and dataset fields.

592-649: LGTM: Clean implementation of image input loading.

The `ImageGetItem` class properly implements the `GetItem` interface for images, with good handling of different input formats, transformations, and tensor conversion. The half-precision check is correctly applied only to torch tensors.

651-739: LGTM: Proper implementation of SupportsGetItem.

`TorchImageModel` now correctly implements the `SupportsGetItem` interface, providing a `build_get_item` method that creates a properly configured `ImageGetItem` instance. This integration enables the model to work efficiently with `FiftyOneTorchDataset`.

1664-1775: LGTM: Well-designed dataset refactoring for performance.

The `FiftyOneTorchDataset` refactoring properly uses the new `GetItem` abstraction and adds vectorized field loading options. The implementation includes proper field selection optimization, batch preparation methods, and error handling.

1775-1776: Good addition: Database connection handling for multiprocessing.

The explicit database disconnection in worker initialization prevents potential deadlocks in forked processes, as recommended in the PyMongo documentation. This is crucial for reliable operation with multiple workers.

1839-1892: LGTM: Effective batch preparation implementation.

The implementation provides two strategies for batch preparation: database-driven and vectorized (memory-cached). This gives users the flexibility to choose between memory efficiency and performance based on their dataset size and available resources.

2566-2612: Documentation improvement for utility functions.

The improved docstrings for distributed training utility functions make the code more maintainable and easier to understand for contributors.
brimoor
left a comment
@jacobsela LGTM! I cleaned up the documentation for us 😄
Also, please squash merge to make it easier to cherry-pick this into release/v1.6.0
Change log

- New `SampleCollection.to_torch()` interface for optimized model inference
- Refactored `FiftyOneTorchDataset` by adding a new `GetItem` class for loading model inputs for samples
- Implemented `GetItem` for `TorchImageModel`
- Updated `apply_model()` and `compute_embeddings()` to use `FiftyOneTorchDataset` by default when possible, for models that implement `SupportsGetItem`

Example usage

The results of `apply_model` are the same before and after this code change.

Benchmarked throughput on OSS to make sure there isn't slowdown.

Using same code as above with `clip-vit-base32-torch`:
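The benchmark referenced above boils down to timing inference over a fixed number of samples and reporting samples per second. A generic sketch of that measurement (the `process_batch` callable is a hypothetical stand-in for model inference, not the PR's actual benchmark code):

```python
import time


def measure_throughput(process_batch, batches):
    """Returns samples/second for a batch-processing callable."""
    start = time.perf_counter()
    n = 0
    for batch in batches:
        process_batch(batch)
        n += len(batch)
    elapsed = time.perf_counter() - start
    return n / elapsed


# Stand-in workload: 10 batches of 32 "samples" each
batches = [[0] * 32 for _ in range(10)]
sps = measure_throughput(lambda b: sum(b), batches)
print(f"{sps:.0f} samples/sec")
```

Using `time.perf_counter()` (a monotonic high-resolution clock) rather than `time.time()` avoids skew from system clock adjustments during the run.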