Skip to content

fix(ops): prevent shared mutable _default_kwargs pollution across operator instances#926

Open
dubin555 wants to merge 3 commits intodatajuicer:mainfrom
dubin555:oss-scout/verify-fb-data_juicer-ops-filter-image_face_count_
Open

fix(ops): prevent shared mutable _default_kwargs pollution across operator instances#926
dubin555 wants to merge 3 commits intodatajuicer:mainfrom
dubin555:oss-scout/verify-fb-data_juicer-ops-filter-image_face_count_

Conversation

@dubin555
Copy link
Contributor

@dubin555 dubin555 commented Mar 3, 2026

Summary

Fix two bugs affecting face detection operators:

  1. Shared mutable _default_kwargs pollution — Five operators assign self.extra_kwargs = self._default_kwargs, which creates a reference to the class-level dict rather than a copy. When one instance overrides kwargs (e.g., scaleFactor=2.0), it mutates the class dict, silently corrupting defaults for all subsequent instances of that operator class within the same process.

  2. Swallowed exception causing misleading KeyError in ImageFaceCountFilter.compute_stats_single — A try/except block around detect_faces() catches and logs the original error but continues execution. The subsequent dict comprehension then crashes with an unhelpful KeyError for any image key that wasn't populated, hiding the real failure cause.

Problem

Bug 1 — All five affected operators share this pattern:

self.extra_kwargs = self._default_kwargs  # reference, not copy
for key in kwargs:
    self.extra_kwargs[key] = kwargs[key]  # mutates class-level dict

If a user creates ImageFaceCountFilter(scaleFactor=2.0) followed by ImageFaceCountFilter(), the second instance inherits scaleFactor=2.0 instead of the intended default 1.1. In pipelines with multiple face-related operators, one operator's custom settings contaminate another's defaults.

Bug 2image_face_count_filter.py wraps detection in try/except:

try:
    face_counts[key] = detect_faces(img, **self.extra_kwargs)
except Exception as e:
    logger.warning(...)
    # execution continues, face_counts is incomplete

return {key: face_counts[key] for key in loaded_image_keys}  # KeyError

The sister operator ImageFaceRatioFilter does not have this try/except around the same logic, confirming it was unintentional.

Fix

  • Bug 1: self.extra_kwargs = self._default_kwargs.copy() in all 5 operators. Shallow copy is sufficient since all values are primitives.
  • Bug 2: Remove the try/except block entirely, letting exceptions propagate with their original context.

Affected files:

  • data_juicer/ops/filter/image_face_count_filter.py
  • data_juicer/ops/filter/image_face_ratio_filter.py
  • data_juicer/ops/filter/video_motion_score_filter.py
  • data_juicer/ops/mapper/image_face_blur_mapper.py
  • data_juicer/ops/mapper/video_face_blur_mapper.py

Test plan

  • tests/ops/filter/test_image_face_count_filter.py — 5 passed
  • tests/ops/filter/test_image_face_ratio_filter.py — 6 passed
  • tests/ops/mapper/test_image_face_blur_mapper.py — 6 passed
  • tests/ops/filter/test_video_motion_score_filter.py — 12 passed
  • Reproduction script confirms class-level dict is no longer mutated after .copy()
  • Reproduction script confirms detection errors propagate with original context

…rator instances

Fix two bugs in face/motion detection operators:

1. `self.extra_kwargs = self._default_kwargs` creates a reference to the
   class-level dict instead of a copy. When one instance overrides kwargs
   (e.g., scaleFactor=2.0), it mutates the class dict, silently polluting
   all subsequent instances with incorrect defaults. Fixed by using .copy()
   in all 5 affected operators.

2. ImageFaceCountFilter.compute_stats_single wraps detect_faces in a
   try/except that logs but swallows exceptions. The face_counts dict is
   then incomplete, causing a secondary KeyError crash with no useful
   context. Removed the try/except to let errors propagate naturally,
   matching the pattern used by the sister operator ImageFaceRatioFilter.
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves two critical bugs affecting face detection and video processing operators. The primary goal was to ensure that operator instances maintain their intended default configurations independently and that error conditions are properly propagated, preventing silent data corruption and obscure error messages. These changes enhance the robustness and predictability of the data processing pipeline.

Highlights

  • Prevented shared mutable default keyword arguments pollution: Addressed a bug where five operators were assigning a reference to a class-level mutable dictionary (_default_kwargs) instead of a copy. This led to unintended mutation of defaults across different instances of the same operator class.
  • Fixed swallowed exception in ImageFaceCountFilter: Removed a try/except block in ImageFaceCountFilter.compute_stats_single that was catching and logging exceptions during face detection but continuing execution, leading to misleading KeyError exceptions later on.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • data_juicer/ops/filter/image_face_count_filter.py
    • Changed self.extra_kwargs = self._default_kwargs to self.extra_kwargs = self._default_kwargs.copy() to prevent mutable default pollution.
    • Removed a try/except block around detect_faces() to allow exceptions to propagate naturally.
  • data_juicer/ops/filter/image_face_ratio_filter.py
    • Changed self.extra_kwargs = self._default_kwargs to self.extra_kwargs = self._default_kwargs.copy() to prevent mutable default pollution.
  • data_juicer/ops/filter/video_motion_score_filter.py
    • Changed self.extra_kwargs = self._default_kwargs to self.extra_kwargs = self._default_kwargs.copy() to prevent mutable default pollution.
  • data_juicer/ops/mapper/image_face_blur_mapper.py
    • Changed self.extra_kwargs = self._default_kwargs to self.extra_kwargs = self._default_kwargs.copy() to prevent mutable default pollution.
  • data_juicer/ops/mapper/video_face_blur_mapper.py
    • Changed self.extra_kwargs = self._default_kwargs to self.extra_kwargs = self._default_kwargs.copy() to prevent mutable default pollution.
Activity
  • The author has confirmed that tests/ops/filter/test_image_face_count_filter.py passed all 5 tests.
  • The author has confirmed that tests/ops/filter/test_image_face_ratio_filter.py passed all 6 tests.
  • The author has confirmed that tests/ops/mapper/test_image_face_blur_mapper.py passed all 6 tests.
  • The author has confirmed that tests/ops/filter/test_video_motion_score_filter.py passed all 12 tests.
  • Reproduction scripts were used to confirm that class-level dictionaries are no longer mutated after the .copy() fix.
  • Reproduction scripts were used to confirm that detection errors now propagate with their original context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses two important bugs: the pollution of shared mutable _default_kwargs across operator instances and a swallowed exception that led to misleading errors. The fixes are sound and improve the robustness of the operators. I've added a few suggestions to refactor the keyword argument handling in a more concise and Pythonic way, which applies to all five modified files.

Comment on lines 70 to 73
self.extra_kwargs = self._default_kwargs.copy()
for key in kwargs:
if key in self.extra_kwargs:
self.extra_kwargs[key] = kwargs[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.

        self.extra_kwargs = self._default_kwargs.copy()
        self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion — updated all 5 operators to use dict.update() with a generator expression. Thanks!

Comment on lines 70 to 73
self.extra_kwargs = self._default_kwargs.copy()
for key in kwargs:
if key in self.extra_kwargs:
self.extra_kwargs[key] = kwargs[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.

        self.extra_kwargs = self._default_kwargs.copy()
        self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)

Comment on lines 118 to 121
self.extra_kwargs = self._default_kwargs.copy()
for key in kwargs:
if key in self.extra_kwargs:
self.extra_kwargs[key] = kwargs[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.

        self.extra_kwargs = self._default_kwargs.copy()
        self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)

Comment on lines 86 to 89
self.extra_kwargs = self._default_kwargs.copy()
for key in kwargs:
if key in self.extra_kwargs:
self.extra_kwargs[key] = kwargs[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.

        self.extra_kwargs = self._default_kwargs.copy()
        self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)

Comment on lines 92 to 95
self.extra_kwargs = self._default_kwargs.copy()
for key in kwargs:
if key in self.extra_kwargs:
self.extra_kwargs[key] = kwargs[key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While using .copy() is the correct fix for the mutable default argument issue, the subsequent loop for updating kwargs can be made more concise and Pythonic. You can use a generator expression with dict.update for a cleaner implementation that achieves the same result.

        self.extra_kwargs = self._default_kwargs.copy()
        self.extra_kwargs.update((k, v) for k, v in kwargs.items() if k in self.extra_kwargs)

Replace explicit for-loop with dict.update() generator expression
for more concise and Pythonic kwargs filtering in all 5 operators.

Signed-off-by: dubin555 <dubin555@gmail.com>
Collapse multi-line dict.update() calls to single line to satisfy
the project's pre-commit formatter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant