
perf: cache redundant sum() calls in repetition filters #924

Open
dubin555 wants to merge 1 commit into datajuicer:main from dubin555:perf/filter-redundant-sum-computation


@dubin555

Summary

Eliminate redundant sum() computations in CharacterRepetitionFilter and WordRepetitionFilter.

Problem

character_repetition_filter.py (lines 71-74)

# BEFORE: sum(freq_char_ngrams) computed twice — O(n) each time
samples_stats[idx][StatsKeys.char_rep_ratio] = (
    (sum(freq_char_ngrams[:num_rep_char_ngrams]) / sum(freq_char_ngrams))
    if sum(freq_char_ngrams) != 0
    else 0.0
)

Additional issues on lines 65-66:

  • sorted(list(values)): list() is redundant since sorted() accepts any iterable
  • len([el for el in freq_char_ngrams if el == 1]) — creates a temporary list just to count; list.count(1) is more efficient
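A minimal sketch of why both cleanups are behavior-preserving. The function names and the sample n-grams are hypothetical, and the `reverse=True` sort order is an assumption, since the code at lines 65-66 is not quoted in this description:

```python
from collections import Counter


def singles_before(freq):
    # Original style: extra list() copy, then a temporary list built just to be counted.
    values = sorted(list(freq.values()), reverse=True)
    return values, len([el for el in values if el == 1])


def singles_after(freq):
    # sorted() accepts any iterable and already returns a new list;
    # list.count(1) counts matches without building a temporary list.
    values = sorted(freq.values(), reverse=True)
    return values, values.count(1)


# Hypothetical n-gram frequencies, for illustration only.
freq = Counter(["ab", "bc", "ab", "cd", "ab", "bc"])
assert singles_before(freq) == singles_after(freq)
```

Both versions still scan the list once, so the gain is a smaller constant factor and less temporary allocation rather than a change in complexity.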

word_repetition_filter.py (lines 106-107)

# BEFORE: sum(freq_word_ngrams) computed twice — O(n) each time
samples_stats[idx][StatsKeys.word_rep_ratio] = (
    (sum(rep_more_than_one) / sum(freq_word_ngrams)) if sum(freq_word_ngrams) != 0 else 0.0
)

Fix

Cache the sum() result into a local variable total:

total = sum(freq_char_ngrams)
... / total if total != 0 else 0.0
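As a self-contained illustration of the cached-total pattern (a sketch only; `repetition_ratio` and its parameters are hypothetical names, not functions from the repository):

```python
def repetition_ratio(freqs, num_rep):
    """Share of the total frequency held by the first num_rep entries of freqs."""
    # Compute the denominator once and reuse it for both the
    # zero check and the division.
    total = sum(freqs)
    return sum(freqs[:num_rep]) / total if total != 0 else 0.0


# Example with made-up frequencies sorted in descending order.
ratio = repetition_ratio([3, 2, 1], 1)
empty = repetition_ratio([], 0)
```

The conditional expression binds more loosely than division, so the expression reads as `(numerator / total) if total != 0 else 0.0`, and the division is never evaluated when `total` is zero.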

Test plan

  • Verify syntax with python -m py_compile
  • Run pre-commit checks

🤖 Generated with Claude Code

character_repetition_filter.py:
- Cache sum(freq_char_ngrams) into local variable `total` to avoid
  computing O(n) sum twice (once for zero-check, once for division)
- Remove redundant list() in sorted() call
- Use list.count(1) instead of len([el for el in freq_char_ngrams if el == 1])

word_repetition_filter.py:
- Cache sum(freq_word_ngrams) into local variable `total` to avoid
  computing O(n) sum twice

Co-Authored-By: Claude <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @dubin555, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the performance of repetition filters by optimizing how frequency sums are calculated. It addresses redundant sum() calls in both character and word repetition filters, caching these results to improve efficiency. Additionally, it refines list processing in the character repetition filter for further optimization.

Highlights

  • Performance Optimization in Character Repetition Filter: Eliminated redundant sum() calls and optimized list processing for character n-gram frequencies, improving efficiency.
  • Performance Optimization in Word Repetition Filter: Cached the sum() of word n-gram frequencies to prevent redundant computations, enhancing performance.


Changelog
  • data_juicer/ops/filter/character_repetition_filter.py
    • Cached the result of sum(freq_char_ngrams) into a local variable total to avoid re-computation.
    • Removed the redundant list() constructor when sorting freq_char_ngrams.values().
    • Replaced a list comprehension for counting elements equal to 1 with the more efficient list.count(1) method.
  • data_juicer/ops/filter/word_repetition_filter.py
    • Cached the result of sum(freq_word_ngrams) into a local variable total to avoid re-computation.
Activity
  • The author dubin555 created this pull request.
  • The author verified syntax with python -m py_compile.
  • The author ran pre-commit checks.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively improves performance by eliminating redundant sum() calls in CharacterRepetitionFilter and WordRepetitionFilter, caching the result in a local variable. The changes also include other good optimizations, such as removing a redundant list() conversion and using the more efficient list.count() method. I've added one suggestion in word_repetition_filter.py for a further minor performance gain by using a generator expression, which aligns with the goal of this PR. Overall, these are solid improvements to the codebase.

@yxdyc yxdyc requested a review from HYLcool February 27, 2026 04:31
@dubin555
Contributor Author

dubin555 commented Mar 2, 2026

Thanks for the review and approval! Agreed — I'll leave the code as-is and skip the generator expression change. Readability matters more here.
