perf: cache redundant sum() calls in repetition filters by dubin555 · Pull Request #924 · datajuicer/data-juicer

dubin555 · 2026-02-26T14:45:10Z

Summary

Eliminate redundant sum() computations in CharacterRepetitionFilter and WordRepetitionFilter.

Problem

character_repetition_filter.py (lines 71-74)

# BEFORE: sum(freq_char_ngrams) computed twice — O(n) each time
samples_stats[idx][StatsKeys.char_rep_ratio] = (
    (sum(freq_char_ngrams[:num_rep_char_ngrams]) / sum(freq_char_ngrams))
    if sum(freq_char_ngrams) != 0
    else 0.0
)

Additional issues on lines 65-66:

- sorted(list(values)): list() is redundant, since sorted() accepts any iterable
- len([el for el in freq_char_ngrams if el == 1]): builds a temporary list just to count its elements; list.count(1) counts in a single pass without the extra allocation
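A minimal sketch of the two cleanups above. The sample n-grams, the descending sort order, and the helper names `sorted_freqs` and `count_singletons` are assumptions made for illustration and are not taken from the PR.

```python
from collections import Counter


def sorted_freqs(ngram_counts):
    # sorted() accepts any iterable, so wrapping .values() in list() is unnecessary.
    # Descending order is an assumption made for this illustration.
    return sorted(ngram_counts.values(), reverse=True)


def count_singletons(freqs):
    # list.count(1) counts matches in one pass without building a temporary list.
    return freqs.count(1)


# Hypothetical character bigram counts.
ngram_counts = Counter(["ab", "bc", "ab", "cd", "ab"])
freqs = sorted_freqs(ngram_counts)
singletons = count_singletons(freqs)
```

Both forms return the same values as the originals; only the intermediate list objects are avoided.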

word_repetition_filter.py (lines 106-107)

# BEFORE: sum(freq_word_ngrams) computed twice — O(n) each time
samples_stats[idx][StatsKeys.word_rep_ratio] = (
    (sum(rep_more_than_one) / sum(freq_word_ngrams)) if sum(freq_word_ngrams) != 0 else 0.0
)

Fix

Cache the sum() result into a local variable total:

total = sum(freq_char_ngrams)
... / total if total != 0 else 0.0
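The same pattern can be sketched as two standalone helpers, one per filter. The function names are hypothetical, and the definition of `rep_more_than_one` (frequencies greater than 1) is an assumption, since the PR description does not show how it is built.

```python
def char_rep_ratio(freq_char_ngrams, num_rep_char_ngrams):
    # Sum once, then reuse the result for both the zero check and the division.
    total = sum(freq_char_ngrams)
    return sum(freq_char_ngrams[:num_rep_char_ngrams]) / total if total != 0 else 0.0


def word_rep_ratio(freq_word_ngrams):
    # Assumption: rep_more_than_one holds the frequencies greater than 1.
    rep_more_than_one = [f for f in freq_word_ngrams if f > 1]
    total = sum(freq_word_ngrams)
    return sum(rep_more_than_one) / total if total != 0 else 0.0
```

For an empty or all-zero frequency list both helpers return 0.0, matching the guard in the original expressions.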

Test plan

Verify syntax with python -m py_compile
Run pre-commit checks
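py_compile only confirms that the files compile, so a behavioral check may also be worth having. The following is a rough sketch of one, not part of the PR: `ratio_before` and `ratio_after` are hypothetical stand-ins for the expressions before and after the change, compared on randomly generated frequency lists.

```python
import random


def ratio_before(freqs, k):
    # Original form: sum(freqs) is evaluated for the check and again for the division.
    return (sum(freqs[:k]) / sum(freqs)) if sum(freqs) != 0 else 0.0


def ratio_after(freqs, k):
    # Cached form: sum(freqs) is evaluated once.
    total = sum(freqs)
    return sum(freqs[:k]) / total if total != 0 else 0.0


def check_equivalence(trials=200, seed=0):
    # Compare both forms on random descending lists, including empty and all-zero ones.
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 20)
        freqs = sorted((rng.randint(0, 5) for _ in range(size)), reverse=True)
        k = rng.randint(0, len(freqs))
        if ratio_before(freqs, k) != ratio_after(freqs, k):
            return False
    return True
```

Because both forms perform the same arithmetic on the same operands, the results should be identical rather than merely close, so an exact comparison is used.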

🤖 Generated with Claude Code

character_repetition_filter.py:
- Cache sum(freq_char_ngrams) into local variable `total` to avoid computing O(n) sum twice (once for zero-check, once for division)
- Remove redundant list() in sorted() call
- Use list.count(1) instead of len([el for el if el == 1])

word_repetition_filter.py:
- Cache sum(freq_word_ngrams) into local variable `total` to avoid computing O(n) sum twice

Co-Authored-By: Claude <noreply@anthropic.com>

gemini-code-assist · 2026-02-26T14:45:25Z

Summary of Changes

Hello @dubin555, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the performance of repetition filters by optimizing how frequency sums are calculated. It addresses redundant sum() calls in both character and word repetition filters, caching these results to improve efficiency. Additionally, it refines list processing in the character repetition filter for further optimization.

Highlights

Performance Optimization in Character Repetition Filter: Eliminated redundant sum() calls and optimized list processing for character n-gram frequencies, improving efficiency.
Performance Optimization in Word Repetition Filter: Cached the sum() of word n-gram frequencies to prevent redundant computations, enhancing performance.

Changelog

data_juicer/ops/filter/character_repetition_filter.py
- Cached the result of sum(freq_char_ngrams) into a local variable total to avoid re-computation.
- Removed the redundant list() constructor when sorting freq_char_ngrams.values().
- Replaced a list comprehension for counting elements equal to 1 with the more efficient list.count(1) method.
data_juicer/ops/filter/word_repetition_filter.py
- Cached the result of sum(freq_word_ngrams) into a local variable total to avoid re-computation.

Activity

The author dubin555 created this pull request.
The author verified syntax with python -m py_compile.
The author ran pre-commit checks.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request effectively improves performance by eliminating redundant sum() calls in CharacterRepetitionFilter and WordRepetitionFilter, caching the result in a local variable. The changes also include other good optimizations, such as removing a redundant list() conversion and using the more efficient list.count() method. I've added one suggestion in word_repetition_filter.py for a further minor performance gain by using a generator expression, which aligns with the goal of this PR. Overall, these are solid improvements to the codebase.

data_juicer/ops/filter/word_repetition_filter.py

dubin555 · 2026-03-02T15:10:24Z

Thanks for the review and approval! Agreed — I'll leave the code as-is and skip the generator expression change. Readability matters more here.

dubin555 temporarily deployed to Testing February 26, 2026 14:45 — with GitHub Actions Inactive

gemini-code-assist bot reviewed Feb 26, 2026

yxdyc requested a review from HYLcool February 27, 2026 04:31

yxdyc approved these changes Feb 27, 2026

dubin555 wants to merge 1 commit into datajuicer:main from dubin555:perf/filter-redundant-sum-computation