
feat(mapper): add custom tokenizer support to RemoveRepeatSentencesMapper #925

Open
JohnGiorgi wants to merge 1 commit into datajuicer:main from JohnGiorgi:jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper

Conversation

@JohnGiorgi
Contributor

@JohnGiorgi JohnGiorgi commented Feb 27, 2026

Summary

Add a tokenizer parameter to RemoveRepeatSentencesMapper that allows users to override the built-in regex-based sentence splitter with a custom tokenizer.

Motivation: The built-in split_sentence regex treats every . followed by a non-quote character as a sentence boundary. This incorrectly splits text containing decimal numbers (e.g. "2.5 kg"), abbreviations, or version numbers. When these fragments are independently deduplicated, the resulting text is corrupted. A custom tokenizer (e.g. NLTK's sent_tokenize) handles these cases correctly.
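The mis-split described above can be reproduced with a minimal stand-in for the built-in splitter (illustrative only; `naive_split` is not the actual `split_sentence` implementation, just a regex with the same boundary rule):

```python
import re

def naive_split(text):
    # Illustrative stand-in for the behavior described above: split after
    # every "." that is not followed by a quote character.
    return [s for s in re.split(r'(?<=\.)(?!["\'])', text) if s]

print(naive_split("The package weighs 2.5 kg. Ship it."))
# The decimal point is treated as a sentence boundary:
# ['The package weighs 2.', '5 kg.', ' Ship it.']
```

Once `'The package weighs 2.'` and `'5 kg.'` are deduplicated as independent sentences, the reassembled text no longer matches the original.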

The tokenizer parameter accepts:

  • A Python callable (str) -> list[str] (for API usage), e.g. nltk.sent_tokenize
  • A lambda string (for YAML configs), e.g. "lambda text: __import__('nltk').sent_tokenize(text)"
  • None (default) to preserve existing behavior

Lambda strings are validated using ast.parse, following the same pattern as PythonLambdaMapper.

Usage

Via Python API:

from nltk.tokenize import sent_tokenize

op = RemoveRepeatSentencesMapper(tokenizer=sent_tokenize)

Via YAML config:

- remove_repeat_sentences_mapper:
    tokenizer: "lambda text: __import__('nltk').sent_tokenize(text)"

Changes

  • data_juicer/ops/mapper/remove_repeat_sentences_mapper.py

    • Add tokenizer parameter to __init__ (default None, fully backward-compatible)
    • Add _create_tokenizer static method for parsing/validating lambda strings (same ast.parse pattern as PythonLambdaMapper)
    • Add _wrap_tokenizer helper (see design note below)
    • In process_batched, replace split_sentence(line) with self._tokenize(line) (1-line change — the rest of process_batched is untouched)
  • tests/ops/mapper/test_remove_repeat_sentences_mapper.py

    • test_custom_tokenizer_callable: verifies dedup works with a callable tokenizer
    • test_custom_tokenizer_lambda_str: verifies lambda strings from YAML configs work
    • test_custom_tokenizer_preserves_decimals: verifies text with decimal numbers (e.g. "2.5 kg") is not corrupted

All existing tests pass unchanged.

Design note: _wrap_tokenizer

The existing split_sentence function works by inserting \n into the text and splitting on it, which means its output fragments preserve the original inter-sentence whitespace as a leading space (e.g. ["Hello.", " Goodbye."]). The downstream dedup loop relies on this: it concatenates kept sentences with new_sent += sentence, and the leading space provides correct spacing.

Custom tokenizers like nltk.sent_tokenize return clean tokens without leading whitespace (e.g. ["Hello.", "Goodbye."]). Without adjustment, new_sent += sentence would produce "Hello.Goodbye.".

Rather than restructuring process_batched (which would make the diff harder to review), _wrap_tokenizer normalizes custom tokenizer output to match split_sentence's convention by prepending a space to every sentence after the first. This keeps process_batched as a 1-line diff.
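The normalization can be sketched as follows (a minimal standalone version; the actual helper in the PR may differ in naming and details):

```python
def wrap_tokenizer(tokenizer):
    """Sketch of the normalization described above: prepend a space to
    every sentence after the first, matching split_sentence's convention
    so that `new_sent += sentence` re-joins the text with correct spacing."""
    def tokenize(text):
        sentences = tokenizer(text)
        return [s if i == 0 else " " + s for i, s in enumerate(sentences)]
    return tokenize

# A clean tokenizer (stand-in for nltk.sent_tokenize) returns no leading spaces:
clean = lambda text: ["Hello.", "Goodbye."]
print("".join(wrap_tokenizer(clean)("Hello. Goodbye.")))  # "Hello. Goodbye."
```

Without the wrapper, concatenating the clean tokens would yield "Hello.Goodbye.".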

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the RemoveRepeatSentencesMapper by introducing support for custom sentence tokenizers. This change improves the accuracy of sentence splitting, particularly for texts containing numbers or abbreviations, and offers greater flexibility for users to define their own tokenization logic. The implementation ensures backward compatibility and proper handling of whitespace for seamless integration.

Highlights

  • Custom Tokenizer Support: Introduced a tokenizer parameter to RemoveRepeatSentencesMapper allowing users to provide custom sentence splitting logic, addressing limitations of the default regex-based splitter.
  • Flexible Tokenizer Input: The tokenizer parameter now accepts Python callables, lambda strings (for YAML configurations), or None to retain the original behavior.
  • Improved Sentence Splitting Accuracy: Custom tokenizers can correctly handle cases like decimal numbers, abbreviations, and version numbers, which were previously mis-split by the default regex, preventing text corruption.
  • Whitespace Preservation: A new _wrap_tokenizer helper function was added to ensure custom tokenizer outputs maintain consistent whitespace formatting with the original split_sentence function, simplifying integration.


Changelog
  • data_juicer/ops/mapper/remove_repeat_sentences_mapper.py
    • Imported the ast module for parsing lambda strings.
    • Added a new _wrap_tokenizer function to adjust custom tokenizer output for consistent whitespace.
    • Introduced a tokenizer parameter to the __init__ method.
    • Implemented a static method _create_tokenizer to parse and validate tokenizer lambda strings.
    • Modified the process_batched method to use the new _tokenize attribute, which dynamically selects the appropriate tokenizer.
  • tests/ops/mapper/test_remove_repeat_sentences_mapper.py
    • Added test_custom_tokenizer_callable to verify functionality with callable tokenizers.
    • Added test_custom_tokenizer_lambda_str to test lambda string tokenizers from YAML.
    • Added test_custom_tokenizer_preserves_decimals to confirm correct handling of decimal numbers.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a tokenizer parameter to RemoveRepeatSentencesMapper, enabling the use of custom sentence tokenizers. This is a valuable enhancement for correctly handling text with decimal numbers or abbreviations. The implementation is well-structured, backward-compatible, and includes corresponding unit tests. However, there is a significant security concern regarding the use of eval() on user-provided lambda strings from configuration, which could allow for arbitrary code execution. I have provided a detailed comment on this issue.

Comment on lines +87 to +98
def _create_tokenizer(tokenizer_str: str):
    """Parse and validate a tokenizer lambda string."""
    try:
        node = ast.parse(tokenizer_str, mode="eval")
        if not isinstance(node.body, ast.Lambda):
            raise ValueError("Input string must be a valid lambda function.")
        if len(node.body.args.args) != 1:
            raise ValueError("Lambda function must have exactly one argument.")
        compiled_code = compile(node, "<string>", "eval")
        return eval(compiled_code, {"__builtins__": __builtins__})
    except Exception as e:
        raise ValueError(f"Invalid tokenizer lambda: {e}")
Contributor


security (high severity)

The use of eval with __builtins__ on a user-provided string from a configuration file introduces a potential security vulnerability. If the configuration can be controlled by an untrusted user, this could allow for arbitrary code execution. The example in the tests (__import__('nltk')) demonstrates that this is intended to be powerful, but it also opens the door for malicious code.

For example, a malicious user could provide a lambda like "lambda text: __import__('os').system('rm -rf /')".

Consider restricting the environment in which the lambda is evaluated. If full flexibility is required, this security risk should be very clearly documented so that users understand that the configuration files must be from a trusted source.
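One possible shape for this restriction (a hypothetical sketch, not part of this PR; `create_tokenizer_restricted` and its `allowed` parameter are invented names) is to evaluate the lambda against an explicit allow-list instead of full `__builtins__`:

```python
import ast

def create_tokenizer_restricted(tokenizer_str, allowed=None):
    """Hypothetical variant of _create_tokenizer: evaluate the lambda with
    an explicit allow-list of names instead of the full __builtins__."""
    node = ast.parse(tokenizer_str, mode="eval")
    if not isinstance(node.body, ast.Lambda):
        raise ValueError("Input string must be a valid lambda function.")
    if len(node.body.args.args) != 1:
        raise ValueError("Lambda function must have exactly one argument.")
    env = {"__builtins__": {}}  # no built-ins unless explicitly allowed
    env.update(allowed or {})
    return eval(compile(node, "<string>", "eval"), env)

# Methods on the argument still work without any built-ins:
split = create_tokenizer_restricted("lambda text: text.split('. ')")

# But __import__ is unavailable, so the os.system payload fails with a
# NameError when the lambda is called:
danger = create_tokenizer_restricted("lambda text: __import__('os').system('id')")
```

The trade-off is that legitimate configs like the `__import__('nltk')` example in this PR would also need `__import__` (or the tokenizer itself) passed in via `allowed`, which is why clearly documenting the trust model may be the more practical option here.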

Contributor Author

@JohnGiorgi JohnGiorgi Feb 27, 2026


This follows the existing pattern used by PythonLambdaMapper._create_lambda, which uses the same ast.parse → compile → eval(compiled_code, {"__builtins__": __builtins__}) approach.

def _create_lambda(self, lambda_str: str):
    # Parse input string into an AST and check for a valid lambda function
    try:
        node = ast.parse(lambda_str, mode="eval")
        # Check if the body of the expression is a lambda
        if not isinstance(node.body, ast.Lambda):
            raise ValueError("Input string must be a valid lambda function.")
        # Check that the lambda has exactly one argument
        if len(node.body.args.args) != 1:
            raise ValueError("Lambda function must have exactly one argument.")
        # Compile the AST to code
        compiled_code = compile(node, "<string>", "eval")
        # Safely evaluate the compiled code allowing built-in functions
        func = eval(compiled_code, {"__builtins__": __builtins__})
        return func
    except Exception as e:
        raise ValueError(f"Invalid lambda function: {e}")

The trust model is the same: YAML configs are authored by the user running the pipeline.

@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 79575cf to 79a7d76 Compare February 27, 2026 15:48
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 79a7d76 to fe52c38 Compare February 27, 2026 15:50
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from fe52c38 to 3b47d82 Compare February 27, 2026 15:51
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 3b47d82 to 1497066 Compare February 27, 2026 15:53
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 1497066 to 6226ce5 Compare February 27, 2026 15:54
feat(mapper): add custom tokenizer support to RemoveRepeatSentencesMapper

The built-in regex sentence splitter treats every period followed by a
non-quote character as a sentence boundary, which incorrectly splits
text containing decimal numbers (e.g. "2.5 kg"), abbreviations, and
version numbers. When these fragments are independently deduplicated,
the resulting text is corrupted.

Add a `tokenizer` parameter that accepts a custom sentence tokenizer
to override the default regex splitter. The tokenizer can be:

- A Python callable (for API usage), e.g. `nltk.sent_tokenize`
- A lambda string (for YAML configs), e.g.
  `"lambda text: __import__('nltk').sent_tokenize(text)"`
- None (default) to preserve existing behavior

Lambda strings are validated using `ast.parse`, following the same
pattern as `PythonLambdaMapper`.

Made-with: Cursor
@JohnGiorgi JohnGiorgi force-pushed the jg/add-custom-tokenizer-to-remove-repeat-sentences-mapper branch from 6226ce5 to 58cc253 Compare February 28, 2026 13:28
