
feat(mapper): add custom tokenizer support to RemoveRepeatSentencesMapper #925

Open
JohnGiorgi wants to merge 1 commit into datajuicer:main from JohnGiorgi:jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper

Conversation

@JohnGiorgi
Contributor

@JohnGiorgi JohnGiorgi commented Feb 27, 2026

Summary

Add a tokenizer parameter to RemoveRepeatSentencesMapper that allows users to override the built-in regex-based sentence splitter with a custom tokenizer.

Motivation: The built-in split_sentence regex treats every . followed by a non-quote character as a sentence boundary. This incorrectly splits text containing decimal numbers (e.g. "2.5 kg"), abbreviations, or version numbers. When these fragments are independently deduplicated, the resulting text is corrupted. A custom tokenizer (e.g. NLTK's sent_tokenize) handles these cases correctly.
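The mis-split described above can be reproduced with a minimal stand-in for the built-in splitter (illustrative only; `naive_split` is not the actual `split_sentence` implementation, just a regex with the same boundary rule):

```python
import re

def naive_split(text):
    # Illustrative stand-in for the behavior described above: split after
    # every "." that is not followed by a quote character.
    return [s for s in re.split(r'(?<=\.)(?!["\'])', text) if s]

print(naive_split("The package weighs 2.5 kg. Ship it."))
# The decimal point is treated as a sentence boundary:
# ['The package weighs 2.', '5 kg.', ' Ship it.']
```

Once `'The package weighs 2.'` and `'5 kg.'` are deduplicated as independent sentences, the reassembled text no longer matches the original.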

The tokenizer parameter accepts:

  • A Python callable (str) -> list[str] (for API usage), e.g. nltk.sent_tokenize
  • A lambda string (for YAML configs), e.g. "lambda text: __import__('nltk').sent_tokenize(text)"
  • None (default) to preserve existing behavior

Lambda strings are validated using ast.parse, following the same pattern as PythonLambdaMapper.

Usage

Via Python API:

from nltk.tokenize import sent_tokenize

op = RemoveRepeatSentencesMapper(tokenizer=sent_tokenize)

Via YAML config:

- remove_repeat_sentences_mapper:
    tokenizer: "lambda text: __import__('nltk').sent_tokenize(text)"

Changes

  • data_juicer/ops/mapper/remove_repeat_sentences_mapper.py

    • Add tokenizer parameter to __init__ (default None, fully backward-compatible)
    • Add _create_tokenizer static method for parsing/validating lambda strings (same ast.parse pattern as PythonLambdaMapper)
    • Add _wrap_tokenizer helper (see design note below)
    • In process_batched, replace split_sentence(line) with self._tokenize(line) (1-line change — the rest of process_batched is untouched)
  • tests/ops/mapper/test_remove_repeat_sentences_mapper.py

    • test_custom_tokenizer_callable: verifies dedup works with a callable tokenizer
    • test_custom_tokenizer_lambda_str: verifies lambda strings from YAML configs work
    • test_custom_tokenizer_preserves_decimals: verifies text with decimal numbers (e.g. "2.5 kg") is not corrupted

All existing tests pass unchanged.

Design note: _wrap_tokenizer

The existing split_sentence function works by inserting \n into the text and splitting on it, which means its output fragments preserve the original inter-sentence whitespace as a leading space (e.g. ["Hello.", " Goodbye."]). The downstream dedup loop relies on this: it concatenates kept sentences with new_sent += sentence, and the leading space provides correct spacing.

Custom tokenizers like nltk.sent_tokenize return clean tokens without leading whitespace (e.g. ["Hello.", "Goodbye."]). Without adjustment, new_sent += sentence would produce "Hello.Goodbye.".

Rather than restructuring process_batched (which would make the diff harder to review), _wrap_tokenizer normalizes custom tokenizer output to match split_sentence's convention by prepending a space to every sentence after the first. This keeps process_batched as a 1-line diff.
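The normalization can be sketched as follows (a minimal standalone version; the actual helper in the PR may differ in naming and details):

```python
def wrap_tokenizer(tokenizer):
    """Sketch of the normalization described above: prepend a space to
    every sentence after the first, matching split_sentence's convention
    so that `new_sent += sentence` re-joins the text with correct spacing."""
    def tokenize(text):
        sentences = tokenizer(text)
        return [s if i == 0 else " " + s for i, s in enumerate(sentences)]
    return tokenize

# A clean tokenizer (stand-in for nltk.sent_tokenize) returns no leading spaces:
clean = lambda text: ["Hello.", "Goodbye."]
print("".join(wrap_tokenizer(clean)("Hello. Goodbye.")))  # "Hello. Goodbye."
```

Without the wrapper, concatenating the clean tokens would yield "Hello.Goodbye.".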

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the RemoveRepeatSentencesMapper by introducing support for custom sentence tokenizers. This change improves the accuracy of sentence splitting, particularly for texts containing numbers or abbreviations, and offers greater flexibility for users to define their own tokenization logic. The implementation ensures backward compatibility and proper handling of whitespace for seamless integration.

Highlights

  • Custom Tokenizer Support: Introduced a tokenizer parameter to RemoveRepeatSentencesMapper allowing users to provide custom sentence splitting logic, addressing limitations of the default regex-based splitter.
  • Flexible Tokenizer Input: The tokenizer parameter now accepts Python callables, lambda strings (for YAML configurations), or None to retain the original behavior.
  • Improved Sentence Splitting Accuracy: Custom tokenizers can correctly handle cases like decimal numbers, abbreviations, and version numbers, which were previously mis-split by the default regex, preventing text corruption.
  • Whitespace Preservation: A new _wrap_tokenizer helper function was added to ensure custom tokenizer outputs maintain consistent whitespace formatting with the original split_sentence function, simplifying integration.


Changelog
  • data_juicer/ops/mapper/remove_repeat_sentences_mapper.py
    • Imported the ast module for parsing lambda strings.
    • Added a new _wrap_tokenizer function to adjust custom tokenizer output for consistent whitespace.
    • Introduced a tokenizer parameter to the __init__ method.
    • Implemented a static method _create_tokenizer to parse and validate tokenizer lambda strings.
    • Modified the process_batched method to use the new _tokenize attribute, which dynamically selects the appropriate tokenizer.
  • tests/ops/mapper/test_remove_repeat_sentences_mapper.py
    • Added test_custom_tokenizer_callable to verify functionality with callable tokenizers.
    • Added test_custom_tokenizer_lambda_str to test lambda string tokenizers from YAML.
    • Added test_custom_tokenizer_preserves_decimals to confirm correct handling of decimal numbers.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a tokenizer parameter to RemoveRepeatSentencesMapper, enabling the use of custom sentence tokenizers. This is a valuable enhancement for correctly handling text with decimal numbers or abbreviations. The implementation is well-structured, backward-compatible, and includes corresponding unit tests. However, there is a significant security concern regarding the use of eval() on user-provided lambda strings from configuration, which could allow for arbitrary code execution. I have provided a detailed comment on this issue.

Comment on lines +87 to +98
def _create_tokenizer(tokenizer_str: str):
    """Parse and validate a tokenizer lambda string."""
    try:
        node = ast.parse(tokenizer_str, mode="eval")
        if not isinstance(node.body, ast.Lambda):
            raise ValueError("Input string must be a valid lambda function.")
        if len(node.body.args.args) != 1:
            raise ValueError("Lambda function must have exactly one argument.")
        compiled_code = compile(node, "<string>", "eval")
        return eval(compiled_code, {"__builtins__": __builtins__})
    except Exception as e:
        raise ValueError(f"Invalid tokenizer lambda: {e}")
Contributor


security (high severity)

The use of eval with __builtins__ on a user-provided string from a configuration file introduces a potential security vulnerability. If the configuration can be controlled by an untrusted user, this could allow for arbitrary code execution. The example in the tests (__import__('nltk')) demonstrates that this is intended to be powerful, but it also opens the door for malicious code.

For example, a malicious user could provide a lambda like "lambda text: __import__('os').system('rm -rf /')".

Consider restricting the environment in which the lambda is evaluated. If full flexibility is required, this security risk should be very clearly documented so that users understand that the configuration files must be from a trusted source.
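One possible shape for this restriction (a hypothetical sketch, not part of this PR; `create_tokenizer_restricted` and its `allowed` parameter are invented names) is to evaluate the lambda against an explicit allow-list instead of full `__builtins__`:

```python
import ast

def create_tokenizer_restricted(tokenizer_str, allowed=None):
    """Hypothetical variant of _create_tokenizer: evaluate the lambda with
    an explicit allow-list of names instead of the full __builtins__."""
    node = ast.parse(tokenizer_str, mode="eval")
    if not isinstance(node.body, ast.Lambda):
        raise ValueError("Input string must be a valid lambda function.")
    if len(node.body.args.args) != 1:
        raise ValueError("Lambda function must have exactly one argument.")
    env = {"__builtins__": {}}  # no built-ins unless explicitly allowed
    env.update(allowed or {})
    return eval(compile(node, "<string>", "eval"), env)

# Methods on the argument still work without any built-ins:
split = create_tokenizer_restricted("lambda text: text.split('. ')")

# But __import__ is unavailable, so the os.system payload fails with a
# NameError when the lambda is called:
danger = create_tokenizer_restricted("lambda text: __import__('os').system('id')")
```

The trade-off is that legitimate configs like the `__import__('nltk')` example in this PR would also need `__import__` (or the tokenizer itself) passed in via `allowed`, which is why clearly documenting the trust model may be the more practical option here.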

Contributor Author

@JohnGiorgi JohnGiorgi Feb 27, 2026


This follows the existing pattern used by PythonLambdaMapper._create_lambda, which uses the same ast.parse → compile → eval(compiled_code, {"__builtins__": __builtins__}) approach.

def _create_lambda(self, lambda_str: str):
    # Parse input string into an AST and check for a valid lambda function
    try:
        node = ast.parse(lambda_str, mode="eval")
        # Check if the body of the expression is a lambda
        if not isinstance(node.body, ast.Lambda):
            raise ValueError("Input string must be a valid lambda function.")
        # Check that the lambda has exactly one argument
        if len(node.body.args.args) != 1:
            raise ValueError("Lambda function must have exactly one argument.")
        # Compile the AST to code
        compiled_code = compile(node, "<string>", "eval")
        # Safely evaluate the compiled code allowing built-in functions
        func = eval(compiled_code, {"__builtins__": __builtins__})
        return func
    except Exception as e:
        raise ValueError(f"Invalid lambda function: {e}")

The trust model is the same: YAML configs are authored by the user running the pipeline.

@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 79575cf to 79a7d76 Compare February 27, 2026 15:48
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 79a7d76 to fe52c38 Compare February 27, 2026 15:50
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from fe52c38 to 3b47d82 Compare February 27, 2026 15:51
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 3b47d82 to 1497066 Compare February 27, 2026 15:53
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 1497066 to 6226ce5 Compare February 27, 2026 15:54
feat(mapper): add custom tokenizer support to RemoveRepeatSentencesMapper

The built-in regex sentence splitter treats every period followed by a
non-quote character as a sentence boundary, which incorrectly splits
text containing decimal numbers (e.g. "2.5 kg"), abbreviations, and
version numbers. When these fragments are independently deduplicated,
the resulting text is corrupted.

Add a `tokenizer` parameter that accepts a custom sentence tokenizer
to override the default regex splitter. The tokenizer can be:

- A Python callable (for API usage), e.g. `nltk.sent_tokenize`
- A lambda string (for YAML configs), e.g.
  `"lambda text: __import__('nltk').sent_tokenize(text)"`
- None (default) to preserve existing behavior

Lambda strings are validated using `ast.parse`, following the same
pattern as `PythonLambdaMapper`.

Made-with: Cursor
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 6226ce5 to 58cc253 Compare February 28, 2026 13:28
