perf: optimize TokenNumFilter with batch tokenization #929

JohnGiorgi wants to merge 1 commit into datajuicer:main
Conversation
Switch `TokenNumFilter` from per-sample `tokenizer.tokenize()` to batched `tokenizer(texts, add_special_tokens=False)`, and enable `_batched_op` for framework-level batching. This yields ~1.15x throughput improvement on 100K samples (1,350 → 1,555 samples/sec) by:

- Amortizing `get_model()` and Python call overhead across batches of 1000
- Returning int IDs directly instead of converting to string tokens
- Enabling the framework's batched map path via `_batched_op = True`

Token counts are identical between the old and new approaches, verified on 100K synthetic samples. `add_special_tokens=False` matches the behavior of the previous `tokenizer.tokenize()` call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
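The before/after shapes of the two tokenizer calls can be illustrated with a stub tokenizer. This is a whitespace stand-in for the real Hugging Face tokenizer the filter fetches via `get_model()`; the class and its token IDs are invented for the example, but the two call signatures mirror the ones contrasted in this PR.

```python
class StubTokenizer:
    """Whitespace 'tokenizer' mimicking the two call styles compared in this PR."""

    def tokenize(self, text):
        # Old per-sample path: returns string tokens.
        return text.split()

    def __call__(self, texts, add_special_tokens=False):
        # New batched path: returns int IDs for each text in one call.
        return {"input_ids": [[hash(t) for t in text.split()] for text in texts]}


tokenizer = StubTokenizer()
samples = ["hello world", "a b c"]

# Old: one tokenize() call (and one model lookup) per sample.
old_counts = [len(tokenizer.tokenize(t)) for t in samples]

# New: one batched call; count the int IDs directly.
encoded = tokenizer(samples, add_special_tokens=False)
new_counts = [len(ids) for ids in encoded["input_ids"]]

assert old_counts == new_counts == [2, 3]
```

Both paths produce the same counts; the batched one simply makes one Python call per batch instead of one per sample.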
Activity
Code Review
This pull request effectively optimizes TokenNumFilter by introducing batched tokenization, which is a significant performance improvement. The implementation is clean and follows existing patterns in the codebase. I have one suggestion to improve the robustness of the new batched implementation by handling non-string inputs, which could be a regression from the previous version.
```python
for idx, stat in enumerate(samples_stats):
    if StatsKeys.num_token not in stat:
        indices.append(idx)
        texts.append(samples_list[idx])
```
The previous implementation used get_words_from_document, which likely handled non-string or None values in the text field gracefully (e.g., by treating them as empty documents, resulting in a token count of 0). This new batched implementation directly appends samples_list[idx] to the texts list. If samples_list contains any non-string values, the tokenizer call will raise an error. To improve robustness and prevent potential regressions, it's important to handle non-string inputs. A safe way to do this is to assign a token count of 0 to non-string samples, mimicking the likely behavior of the previous implementation.
```diff
-for idx, stat in enumerate(samples_stats):
-    if StatsKeys.num_token not in stat:
-        indices.append(idx)
-        texts.append(samples_list[idx])
+for idx, stat in enumerate(samples_stats):
+    if StatsKeys.num_token not in stat:
+        text = samples_list[idx]
+        if isinstance(text, str):
+            indices.append(idx)
+            texts.append(text)
+        else:
+            # Treat non-string samples as having 0 tokens
+            samples_stats[idx][StatsKeys.num_token] = 0
```
I don't think this is true:
`data-juicer/data_juicer/ops/common/helper_func.py`, lines 117 to 138 at `3152ad3`
I also think there is an upstream filter for empty strings:
`data-juicer/data_juicer/format/formatter.py`, lines 190 to 204 at `3152ad3`
Summary
Switch `TokenNumFilter` from per-sample `tokenizer.tokenize()` to batched `tokenizer(texts, add_special_tokens=False)`, and enable `_batched_op = True`.

Problem

`TokenNumFilter` processes each sample individually via `compute_stats_single`, calling `tokenizer.tokenize(text)` through the `get_words_from_document` wrapper. This has three sources of overhead:

- `_batched_op` is not set, so the framework processes one sample at a time (batch_size=1)
- `tokenize()` returns subword strings (e.g. `['▁Hello', '▁world']`); the string conversion is wasted when we only need a count
- `get_model()` lookup + `get_words_from_document()` wrapper called per sample

Fix

- Set `_batched_op = True` to enable framework batching (batches of 1000, matching `DEFAULT_BATCH_SIZE`)
- `compute_stats_single` → `compute_stats_batched`: collect untokenized texts, call `tokenizer(texts, add_special_tokens=False)` once per batch, store `len(input_ids)` per sample
- `process_single` → `process_batched`: list comprehension over batch stats
- Remove the `get_words_from_document` import

Follows the established `_batched_op` pattern used by other filters (e.g. `WordsNumFilter`, `TextLengthFilter`). `add_special_tokens=False` is critical for backward compatibility: `tokenizer.tokenize()` does not add special tokens, so we match that behavior. Token counts verified identical on 100K synthetic samples.

Impact
~1.15x throughput on 100K synthetic paragraph-length samples (1,350 → 1,555 samples/sec). Not massive, but makes a difference at scale, e.g. this is projected to save ~49min for a dataset of size 500_000K examples.
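The batched compute path described under Fix can be sketched as follows. This is a hedged sketch, not the exact source: `NUM_TOKEN` stands in for `StatsKeys.num_token`, the function signatures and the `min_num`/`max_num` filter bounds are assumptions, and only the overall flow (collect untokenized texts, one tokenizer call per batch, write counts back) is taken from the PR text.

```python
NUM_TOKEN = "num_token"  # stand-in for StatsKeys.num_token (assumed key name)

def compute_stats_batched(samples_list, samples_stats, tokenizer):
    """Fill in token counts for samples whose stats lack them (sketch)."""
    indices, texts = [], []
    for idx, stat in enumerate(samples_stats):
        if NUM_TOKEN not in stat:  # skip samples with cached stats
            indices.append(idx)
            texts.append(samples_list[idx])

    if texts:
        # One batched call; no special tokens, matching tokenize() behavior.
        encoded = tokenizer(texts, add_special_tokens=False)
        for idx, input_ids in zip(indices, encoded["input_ids"]):
            samples_stats[idx][NUM_TOKEN] = len(input_ids)
    return samples_stats

def process_batched(samples_stats, min_num=10, max_num=10000):
    """Keep-decision per sample; reduces to a list comprehension (sketch)."""
    return [min_num <= stat[NUM_TOKEN] <= max_num for stat in samples_stats]
```

A sample whose stats already contain `NUM_TOKEN` is never re-tokenized, which is why the indices of untokenized texts are tracked separately from the batch.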
Test plan

- `pytest tests/ops/filter/test_token_num_filter.py` passes
- Verified token counts are identical between the old `tokenizer.tokenize()` and the new `tokenizer(texts, add_special_tokens=False)['input_ids']`
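The parity check above hinges on `add_special_tokens=False`: `tokenize()` never adds special tokens, while a tokenizer's `__call__` typically does by default. The stub below is invented for illustration (real Hugging Face tokenizers behave analogously, with model-specific BOS/EOS IDs), and shows how omitting the flag would inflate counts.

```python
class StubTokenizer:
    """Fake tokenizer: __call__ adds BOS/EOS unless add_special_tokens=False."""

    BOS, EOS = 0, 1

    def tokenize(self, text):
        return text.split()  # never adds special tokens

    def __call__(self, texts, add_special_tokens=True):
        out = []
        for text in texts:
            ids = [10 + i for i, _ in enumerate(text.split())]
            if add_special_tokens:
                ids = [self.BOS] + ids + [self.EOS]
            out.append(ids)
        return {"input_ids": out}


tok = StubTokenizer()
text = "hello batched world"

# Counts agree only when special tokens are suppressed:
assert len(tok.tokenize(text)) == 3
assert len(tok([text], add_special_tokens=False)["input_ids"][0]) == 3

# With the default, the count is inflated by BOS and EOS:
assert len(tok([text])["input_ids"][0]) == 5
```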