fix(ops): prevent shared mutable _default_kwargs pollution across operator instances#926
Conversation
…rator instances Fix two bugs in face/motion detection operators: 1. `self.extra_kwargs = self._default_kwargs` creates a reference to the class-level dict instead of a copy. When one instance overrides kwargs (e.g., scaleFactor=2.0), it mutates the class dict, silently polluting all subsequent instances with incorrect defaults. Fixed by using .copy() in all 5 affected operators. 2. ImageFaceCountFilter.compute_stats_single wraps detect_faces in a try/except that logs but swallows exceptions. The face_counts dict is then incomplete, causing a secondary KeyError crash with no useful context. Removed the try/except to let errors propagate naturally, matching the pattern used by the sister operator ImageFaceRatioFilter.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves two critical bugs affecting face detection and video processing operators. The primary goal was to ensure that operator instances maintain their intended default configurations independently and that error conditions are properly propagated, preventing silent data corruption and obscure error messages. These changes enhance the robustness and predictability of the data processing pipeline. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly addresses two important bugs: the pollution of shared mutable _default_kwargs across operator instances and a swallowed exception that led to misleading errors. The fixes are sound and improve the robustness of the operators. I've added a few suggestions to refactor the keyword argument handling in a more concise and Pythonic way, which applies to all five modified files.
| self.extra_kwargs = self._default_kwargs.copy() | ||
| for key in kwargs: | ||
| if key in self.extra_kwargs: | ||
| self.extra_kwargs[key] = kwargs[key] |
There was a problem hiding this comment.
While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.
self.extra_kwargs = self._default_kwargs.copy()
self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)There was a problem hiding this comment.
Good suggestion — updated all 5 operators to use dict.update() with a generator expression. Thanks!
| self.extra_kwargs = self._default_kwargs.copy() | ||
| for key in kwargs: | ||
| if key in self.extra_kwargs: | ||
| self.extra_kwargs[key] = kwargs[key] |
There was a problem hiding this comment.
While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.
self.extra_kwargs = self._default_kwargs.copy()
self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)| self.extra_kwargs = self._default_kwargs.copy() | ||
| for key in kwargs: | ||
| if key in self.extra_kwargs: | ||
| self.extra_kwargs[key] = kwargs[key] |
There was a problem hiding this comment.
While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.
self.extra_kwargs = self._default_kwargs.copy()
self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)| self.extra_kwargs = self._default_kwargs.copy() | ||
| for key in kwargs: | ||
| if key in self.extra_kwargs: | ||
| self.extra_kwargs[key] = kwargs[key] |
There was a problem hiding this comment.
While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.
self.extra_kwargs = self._default_kwargs.copy()
self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)| self.extra_kwargs = self._default_kwargs.copy() | ||
| for key in kwargs: | ||
| if key in self.extra_kwargs: | ||
| self.extra_kwargs[key] = kwargs[key] |
There was a problem hiding this comment.
While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.
self.extra_kwargs = self._default_kwargs.copy()
self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)Replace explicit for-loop with dict.update() generator expression for more concise and Pythonic kwargs filtering in all 5 operators. Signed-off-by: dubin555 <dubin555@gmail.com>
Collapse multi-line dict.update() calls to single line to satisfy the project's pre-commit formatter.
Summary
Fix two bugs affecting face detection operators:
Shared mutable
_default_kwargspollution — Five operators assignself.extra_kwargs = self._default_kwargs, which creates a reference to the class-level dict rather than a copy. When one instance overrides kwargs (e.g.,scaleFactor=2.0), it mutates the class dict, silently corrupting defaults for all subsequent instances of that operator class within the same process.Swallowed exception causing misleading
KeyErrorinImageFaceCountFilter.compute_stats_single— A try/except block arounddetect_faces()catches and logs the original error but continues execution. The subsequent dict comprehension then crashes with an unhelpfulKeyErrorfor any image key that wasn't populated, hiding the real failure cause.Problem
Bug 1 — All five affected operators share this pattern:
If a user creates
ImageFaceCountFilter(scaleFactor=2.0)followed byImageFaceCountFilter(), the second instance inheritsscaleFactor=2.0instead of the intended default1.1. In pipelines with multiple face-related operators, one operator's custom settings contaminate another's defaults.Bug 2 —
image_face_count_filter.pywraps detection in try/except:The sister operator
ImageFaceRatioFilterdoes not have this try/except around the same logic, confirming it was unintentional.Fix
self.extra_kwargs = self._default_kwargs.copy()in all 5 operators. Shallow copy is sufficient since all values are primitives.Affected files:
data_juicer/ops/filter/image_face_count_filter.pydata_juicer/ops/filter/image_face_ratio_filter.pydata_juicer/ops/filter/video_motion_score_filter.pydata_juicer/ops/mapper/image_face_blur_mapper.pydata_juicer/ops/mapper/video_face_blur_mapper.pyTest plan
tests/ops/filter/test_image_face_count_filter.py— 5 passedtests/ops/filter/test_image_face_ratio_filter.py— 6 passedtests/ops/mapper/test_image_face_blur_mapper.py— 6 passedtests/ops/filter/test_video_motion_score_filter.py— 12 passed.copy()