fix(ops): fix NlpaugEnMapper only augmenting first sample in batch #927
Conversation
The `process_batched` method used `samples[self.text_key][0]`, which only augmented the first sample in each batch. With the default batch_size of 1000, this meant 999 out of every 1000 samples silently passed through without augmentation. The field replication logic was also broken, multiplying the entire original metadata list by the text count instead of replicating per sample. Fix: iterate over all samples in the batch, augment each individually, and correctly replicate metadata fields per sample.
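The fixed iteration pattern can be sketched in isolation. This is a simplified stand-in, not the actual data-juicer mapper: `fake_augment` is a hypothetical augmenter, and the field names are illustrative.

```python
from copy import deepcopy

def fake_augment(text, aug_num):
    # Hypothetical stand-in for an nlpaug augmenter: produce aug_num variants.
    return [f"{text} [aug {i}]" for i in range(aug_num)]

def process_batched(samples, text_key="text", aug_num=1,
                    keep_original_sample=True):
    # The buggy version read samples[text_key][0]; here we iterate over
    # ALL samples in the batch.
    res = {key: [] for key in samples}
    for i, text in enumerate(samples[text_key]):
        texts = ([text] if keep_original_sample else []) \
            + fake_augment(text, aug_num)
        res[text_key].extend(texts)
        # Replicate every other field per sample, next to that sample's
        # texts, instead of bulk-multiplying the whole list.
        for key in samples:
            if key != text_key:
                res[key].extend(deepcopy(samples[key][i]) for _ in texts)
    return res

batch = {"text": ["a", "b", "c"],
         "meta": [{"id": 0}, {"id": 1}, {"id": 2}]}
out = process_batched(batch)
print(len(out["text"]), len(out["meta"]))  # 6 6
```

With `aug_num=1` and `keep_original_sample=True`, every input sample contributes two output rows, and each row keeps its own sample's metadata.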
Code Review
This pull request correctly fixes a critical bug in NlpaugEnMapper where only the first sample in a batch was being processed. The logic is properly updated to iterate over all samples, and a related metadata replication bug is also fixed. The addition of new regression tests is a great way to ensure the correctness of the fix and prevent this issue from recurring. I have a couple of minor suggestions to improve the maintainability of the new test code.
    print(f"Input samples: 3")
    print(f"Output texts: {num_texts}")
    print(f"Output metas: {num_metas}")
    print(f"Expected texts (correct): 6 (3 original + 3 augmented)")
    print(f"Texts: {result['text']}")
While these print statements can be helpful for debugging, they add noise to the test suite's output. It's a best practice in unit testing to rely on descriptive assertion failure messages for diagnostics rather than printing to stdout. This keeps test runs clean. This suggestion also applies to the other tests in this file.
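An illustrative unittest-style snippet (not the project's actual test code) showing how a descriptive assertion message can carry the same diagnostics the prints did:

```python
import unittest

class AugmentCountTest(unittest.TestCase):
    # Illustrative only: the failure message reports the counts, so no
    # print to stdout is needed and passing runs stay quiet.
    def test_counts(self):
        num_texts, expected = 6, 6
        self.assertEqual(
            num_texts, expected,
            f"Expected {expected} texts (3 original + 3 augmented), "
            f"got {num_texts}")

suite = unittest.TestLoader().loadTestsFromTestCase(AugmentCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```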
Good point, thanks! Removed the print statements from all three tests — the assertion messages provide sufficient diagnostics on failure. Also replaced the hardcoded text list with a reference to samples['text'] to avoid duplication. Both fixed in the latest push.
    for original_text in ['The quick brown fox jumps over the lazy dog',
                          'Machine learning is transforming the world today',
                          'Natural language processing enables computers to understand text']:
        self.assertIn(original_text, result['text'],
                      f"Original text missing from output: {original_text}")
The list of original texts is hardcoded here, which duplicates the samples['text'] list defined earlier in the test. To improve maintainability and prevent potential inconsistencies if the test data is modified, it's better to reference the samples['text'] variable directly.
Suggested change:

    -    for original_text in ['The quick brown fox jumps over the lazy dog',
    -                          'Machine learning is transforming the world today',
    -                          'Natural language processing enables computers to understand text']:
    -        self.assertIn(original_text, result['text'],
    -                      f"Original text missing from output: {original_text}")
    +    for original_text in samples['text']:
    +        self.assertIn(original_text, result['text'],
    +                      f"Original text missing from output: {original_text}")
Fixed — now iterating over samples['text'] directly instead of duplicating the list. Thanks for the suggestion.
Remove print statements from tests to keep test output clean. Replace hardcoded text list with reference to samples['text'].
Merge multi-line function call into single line to satisfy black formatter requirements. Signed-off-by: dubin555 <dubin555@gmail.com>
Summary
Fix a bug in `NlpaugEnMapper.process_batched()` where `samples[self.text_key][0]` hardcodes processing of only the first sample in each batch. With the default `batch_size=1000`, this means ~99.9% of samples silently pass through without augmentation.

Also fix broken metadata replication that used bulk list multiplication (`res_samples[key] * len(...)`) instead of correct per-sample replication.

Problem
The method takes only the first element from the batch's text list. The comment `# batch_size = 1` reveals the author assumed single-sample batches, but the class declares `_batched_op = True` and the framework default is `batch_size=1000`.

Existing tests didn't catch this because `NestedDataset.map()` overrides batch_size to 1 by default (at `dj_dataset.py:378`), so the unit test path always sends single-sample batches. The production path through `Mapper.run()` uses `self.batch_size`, which defaults to 1000.

Two sub-bugs:
Fix
`deepcopy` import

Before fix (batch of 3 samples, `aug_num=1`, `keep_original_sample=True`):

After fix:
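The correct output size for that batch works out as simple arithmetic (the 6-text expectation matches the regression test's assertion):

```python
n_samples, aug_num, keep_original = 3, 1, True

# Each input sample contributes its original (if kept) plus aug_num variants.
per_sample = (1 if keep_original else 0) + aug_num
expected_texts = n_samples * per_sample
print(expected_texts)  # 6: 3 originals + 3 augmented
```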
Tests
Added 3 regression tests covering multi-sample batch processing:

- `test_batch_only_augments_first_sample`: verifies all samples in a batch are augmented
- `test_batch_without_keep_original`: verifies correct behavior when `keep_original_sample=False`
- `test_batch_sequential_mode`: verifies sequential augmentation mode with batches

All 12 existing tests continue to pass: