feat(mapper): add custom tokenizer support to RemoveRepeatSentencesMapper #925
Code Review
This pull request introduces a tokenizer parameter to RemoveRepeatSentencesMapper, enabling the use of custom sentence tokenizers. This is a valuable enhancement for correctly handling text with decimal numbers or abbreviations. The implementation is well-structured, backward-compatible, and includes corresponding unit tests. However, there is a significant security concern regarding the use of eval() on user-provided lambda strings from configuration, which could allow for arbitrary code execution. I have provided a detailed comment on this issue.
```python
def _create_tokenizer(tokenizer_str: str):
    """Parse and validate a tokenizer lambda string."""
    try:
        node = ast.parse(tokenizer_str, mode="eval")
        if not isinstance(node.body, ast.Lambda):
            raise ValueError("Input string must be a valid lambda function.")
        if len(node.body.args.args) != 1:
            raise ValueError("Lambda function must have exactly one argument.")
        compiled_code = compile(node, "<string>", "eval")
        return eval(compiled_code, {"__builtins__": __builtins__})
    except Exception as e:
        raise ValueError(f"Invalid tokenizer lambda: {e}")
```
The use of eval with __builtins__ on a user-provided string from a configuration file introduces a potential security vulnerability. If the configuration can be controlled by an untrusted user, this could allow for arbitrary code execution. The example in the tests (__import__('nltk')) demonstrates that this is intended to be powerful, but it also opens the door for malicious code.
For example, a malicious user could provide a lambda like "lambda text: __import__('os').system('rm -rf /')".
Consider restricting the environment in which the lambda is evaluated. If full flexibility is required, this security risk should be very clearly documented so that users understand that the configuration files must be from a trusted source.
This follows the existing pattern used by `PythonLambdaMapper._create_lambda`, which uses the same `ast.parse` → `compile` → `eval(compiled_code, {"__builtins__": __builtins__})` approach (see `data_juicer/ops/mapper/python_lambda_mapper.py`, lines 38 to 57 at `fa4d7b2`). The trust model is the same: YAML configs are authored by the user running the pipeline.
…pper
The built-in regex sentence splitter treats every period followed by a
non-quote character as a sentence boundary, which incorrectly splits
text containing decimal numbers (e.g. "2.5 kg"), abbreviations, and
version numbers. When these fragments are independently deduplicated,
the resulting text is corrupted.
Add a `tokenizer` parameter that accepts a custom sentence tokenizer
to override the default regex splitter. The tokenizer can be:
- A Python callable (for API usage), e.g. `nltk.sent_tokenize`
- A lambda string (for YAML configs), e.g.
`"lambda text: __import__('nltk').sent_tokenize(text)"`
- None (default) to preserve existing behavior
Lambda strings are validated using `ast.parse`, following the same
pattern as `PythonLambdaMapper`.
Made-with: Cursor
Summary
Add a `tokenizer` parameter to `RemoveRepeatSentencesMapper` that allows users to override the built-in regex-based sentence splitter with a custom tokenizer.

Motivation: The built-in `split_sentence` regex treats every `.` followed by a non-quote character as a sentence boundary. This incorrectly splits text containing decimal numbers (e.g. `"2.5 kg"`), abbreviations, or version numbers. When these fragments are independently deduplicated, the resulting text is corrupted. A custom tokenizer (e.g. NLTK's `sent_tokenize`) handles these cases correctly.

The `tokenizer` parameter accepts:
- A callable `(str) -> list[str]` (for API usage), e.g. `nltk.sent_tokenize`
- A lambda string (for YAML configs), e.g. `"lambda text: __import__('nltk').sent_tokenize(text)"`
- `None` (default) to preserve existing behavior

Lambda strings are validated using `ast.parse`, following the same pattern as `PythonLambdaMapper`.

Usage
Via Python API:
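The code block that belonged under this heading did not survive extraction. A minimal sketch of the idea: any callable with signature `(str) -> list[str]` can be passed as the tokenizer. The mapper import path and constructor call below are assumptions from the PR description, and the regex function stands in for `nltk.sent_tokenize` so the sketch runs without NLTK.

```python
import re


# Stand-in tokenizer: split only on sentence punctuation followed by
# whitespace, so decimal numbers like "2.5 kg" are never broken apart.
def sent_tokenize(text):
    return re.split(r"(?<=[.!?])\s+", text)


# Hypothetical mapper construction per the PR description (requires
# data-juicer to be installed; the import path is an assumption):
# from data_juicer.ops.mapper.remove_repeat_sentences_mapper import (
#     RemoveRepeatSentencesMapper)
# op = RemoveRepeatSentencesMapper(tokenizer=sent_tokenize)

print(sent_tokenize("It weighs 2.5 kg. It weighs 2.5 kg."))
# -> ['It weighs 2.5 kg.', 'It weighs 2.5 kg.']
```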
Via YAML config:
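The YAML snippet under this heading was also lost. A hypothetical sketch, assuming data-juicer's usual `process`-list config layout; the lambda string is quoted so YAML passes it through verbatim:

```yaml
# Hypothetical config sketch; op and key names follow the PR description.
process:
  - remove_repeat_sentences_mapper:
      tokenizer: "lambda text: __import__('nltk').sent_tokenize(text)"
```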
Changes
- `data_juicer/ops/mapper/remove_repeat_sentences_mapper.py`
  - Add a `tokenizer` parameter to `__init__` (default `None`, fully backward-compatible)
  - Add a `_create_tokenizer` static method for parsing/validating lambda strings (same `ast.parse` pattern as `PythonLambdaMapper`)
  - Add a `_wrap_tokenizer` helper (see design note below)
  - In `process_batched`, replace `split_sentence(line)` with `self._tokenize(line)` (a 1-line change; the rest of `process_batched` is untouched)
- `tests/ops/mapper/test_remove_repeat_sentences_mapper.py`
  - `test_custom_tokenizer_callable`: verifies dedup works with a callable tokenizer
  - `test_custom_tokenizer_lambda_str`: verifies lambda strings from YAML configs work
  - `test_custom_tokenizer_preserves_decimals`: verifies text with decimal numbers (e.g. `"2.5 kg"`) is not corrupted

All existing tests pass unchanged.
Design note: `_wrap_tokenizer`

The existing `split_sentence` function works by inserting `\n` into the text and splitting on it, so its output fragments preserve the original inter-sentence whitespace as a leading space (e.g. `["Hello.", " Goodbye."]`). The downstream dedup loop relies on this: it concatenates kept sentences with `new_sent += sentence`, and the leading space provides correct spacing.

Custom tokenizers like `nltk.sent_tokenize` return clean tokens without leading whitespace (e.g. `["Hello.", "Goodbye."]`). Without adjustment, `new_sent += sentence` would produce `"Hello.Goodbye."`.

Rather than restructuring `process_batched` (which would make the diff harder to review), `_wrap_tokenizer` normalizes custom tokenizer output to match `split_sentence`'s convention by prepending a space to every sentence after the first. This keeps `process_batched` as a 1-line diff.