
fix(ops): fix NlpaugEnMapper only augmenting first sample in batch #927

Open

dubin555 wants to merge 3 commits into datajuicer:main from
dubin555:oss-scout/verify-fb-data_juicer-ops-mapper-nlpaug_en_mapper-

Conversation


@dubin555 dubin555 commented Mar 3, 2026

Summary

Fix a bug in NlpaugEnMapper.process_batched() where samples[self.text_key][0] hardcodes processing of only the first sample in each batch. With the default batch_size=1000, this means ~99.9% of samples silently pass through without augmentation.

Also fix broken metadata replication that used bulk list multiplication (res_samples[key] * len(...)) instead of correct per-sample replication.

Problem

# Line 149 - BEFORE (wrong)
texts_to_aug = samples[self.text_key][0]  # only first sample

The method takes only the first element from the batch's text list. The comment # batch_size = 1 reveals the author assumed single-sample batches, but the class declares _batched_op = True and the framework default is batch_size=1000.

Existing tests didn't catch this because NestedDataset.map() overrides batch_size to 1 by default (at dj_dataset.py:378), so the unit test path always sends single-sample batches. The production path through Mapper.run() uses self.batch_size which defaults to 1000.

Two sub-bugs:

  1. Only the first sample is augmented; all others are silently dropped/ignored
  2. Metadata replication is broken: multiplying the entire original metadata list by the new text count produces N×M entries instead of correct per-sample replication
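The second sub-bug can be reproduced concretely with plain Python lists. This is a toy illustration, not the actual mapper code, and the variable names are made up for the example:

```python
# Toy reproduction of the replication sub-bug with plain lists; the
# variable names are illustrative, not the actual mapper fields.
metas = ["m1", "m2", "m3"]     # one metadata entry per original sample

# BEFORE (broken): multiply the whole list by the output-text count.
# With 4 output texts (3 originals + 1 augmented under the bug), this
# yields 12 interleaved entries instead of one meta per output text.
broken = metas * 4
print(len(broken))  # 12

# AFTER (correct): replicate each sample's meta once per text derived
# from that sample (here: original + 1 augmentation = 2 texts each).
fixed = [m for m in metas for _ in range(2)]
print(len(fixed))   # 6
```

Note that the broken version also scrambles the pairing: `broken[1]` is `"m2"` even though output text 1 came from sample 0.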

Fix

  • Replace single-sample processing with a loop over all samples in the batch
  • Augment each sample individually and collect results
  • Replicate metadata fields per-sample rather than via bulk list multiplication
  • Remove now-unused deepcopy import
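A simplified sketch of the fixed batching logic, assuming a stand-in `augment` callable and illustrative field names rather than the real data-juicer implementation:

```python
# Simplified sketch of the fixed process_batched logic. `augment` is a
# stand-in for the nlpaug augmenter call and the field names are
# illustrative; this is not the actual data-juicer code.
def process_batched(samples, augment, aug_num=1, keep_original=True):
    out_texts, out_metas = [], []
    # Loop over every sample in the batch instead of samples["text"][0].
    for text, meta in zip(samples["text"], samples["meta"]):
        aug_texts = [augment(text) for _ in range(aug_num)]
        new_texts = ([text] if keep_original else []) + aug_texts
        out_texts.extend(new_texts)
        # Replicate metadata once per text derived from THIS sample.
        out_metas.extend([meta] * len(new_texts))
    return {"text": out_texts, "meta": out_metas}

batch = {"text": ["a", "b", "c"], "meta": [1, 2, 3]}
result = process_batched(batch, augment=str.upper)
print(len(result["text"]), len(result["meta"]))  # 6 6
```

With a batch of 3 and `aug_num=1`, this produces the 6/6 counts shown in the "After fix" output above, and each metadata entry stays aligned with the text it describes.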

Before fix (batch of 3 samples, aug_num=1, keep_original_sample=True):

Output texts: 4  (3 originals + 1 augmented from first sample only)
Output metas: 12 (4 × 3 = broken replication)

After fix:

Output texts: 6  (3 originals + 3 augmented, one per sample)
Output metas: 6  (correctly replicated per-sample)

Tests

Added 3 regression tests covering multi-sample batch processing:

  • test_batch_only_augments_first_sample — verifies all samples in a batch are augmented
  • test_batch_without_keep_original — verifies correct behavior when keep_original_sample=False
  • test_batch_sequential_mode — verifies sequential augmentation mode with batches
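The shape of such a regression test can be sketched as follows, using a stand-in batch augmenter in place of the real NlpaugEnMapper (the function name and the `[aug]` suffix are illustrative only):

```python
import unittest

# Stand-in for a batched augmenter: every input text must contribute
# output, which is exactly what the original bug violated.
def fake_augment_batch(texts, aug_num=1, keep_original=True):
    out = []
    for t in texts:
        if keep_original:
            out.append(t)
        out.extend([t + " [aug]"] * aug_num)
    return out

class TestBatchAugmentation(unittest.TestCase):
    def test_every_sample_is_augmented(self):
        texts = ["one", "two", "three"]
        result = fake_augment_batch(texts)
        # 3 originals + 3 augmented: every sample contributed output.
        self.assertEqual(len(result), 6,
                         f"expected 6 output texts, got {len(result)}")
        for t in texts:
            self.assertIn(t, result)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBatchAugmentation)
assert unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```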

All 12 existing tests continue to pass:

tests/ops/mapper/test_nlpaug_en_mapper.py: 12 passed
tests/ops/mapper/test_nlpaug_en_mapper_batch_bug.py: 3 passed

…irst

The process_batched method used `samples[self.text_key][0]` which only
augmented the first sample in each batch. With the default batch_size
of 1000, this meant 999 out of every 1000 samples silently passed
through without augmentation. The field replication logic was also
broken, multiplying the entire original metadata list by the text
count instead of per-sample replication.

Fix: iterate over all samples in the batch, augment each individually,
and correctly replicate metadata fields per-sample.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a significant bug in the NlpaugEnMapper component that prevented proper batch processing and metadata handling during text augmentation. The changes ensure that all samples within a batch are correctly augmented and their associated metadata is accurately replicated, resolving a silent data loss issue and improving the reliability of the data augmentation pipeline.

Highlights

  • Batch Processing Fix: Resolved a critical bug in NlpaugEnMapper.process_batched() where only the first sample in a batch was augmented, leading to silent omission of most samples in production.
  • Metadata Replication Correction: Fixed an issue where metadata was incorrectly replicated using bulk list multiplication, now ensuring accurate per-sample replication.
  • Comprehensive Regression Tests: Introduced three new regression tests to validate the corrected batch processing behavior, including scenarios for keep_original_sample and sequential augmentation mode.


Changelog
  • data_juicer/ops/mapper/nlpaug_en_mapper.py
    • Removed the unused deepcopy import.
    • Rewrote the process_batched method to iterate through each sample in the input batch, ensuring every sample is augmented.
    • Corrected the logic for replicating metadata fields to align with the number of augmented texts generated for each original sample.
  • tests/ops/mapper/test_nlpaug_en_mapper_batch_bug.py
    • Added a new test file dedicated to reproducing and verifying the fix for the batch processing bug.
    • Included test_batch_only_augments_first_sample to confirm all samples are augmented.
    • Added test_batch_without_keep_original to check behavior when original samples are not retained.
    • Implemented test_batch_sequential_mode to ensure sequential augmentation works correctly with batches.
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a critical bug in NlpaugEnMapper where only the first sample in a batch was being processed. The logic is properly updated to iterate over all samples, and a related metadata replication bug is also fixed. The addition of new regression tests is a great way to ensure the correctness of the fix and prevent this issue from recurring. I have a couple of minor suggestions to improve the maintainability of the new test code.

Comment on lines +41 to +45
print(f"Input samples: 3")
print(f"Output texts: {num_texts}")
print(f"Output metas: {num_metas}")
print(f"Expected texts (correct): 6 (3 original + 3 augmented)")
print(f"Texts: {result['text']}")
Contributor


medium

While these print statements can be helpful for debugging, they add noise to the test suite's output. It's a best practice in unit testing to rely on descriptive assertion failure messages for diagnostics rather than printing to stdout. This keeps test runs clean. This suggestion also applies to the other tests in this file.

Contributor Author


Good point, thanks! Removed the print statements from all three tests — the assertion messages provide sufficient diagnostics on failure. Also replaced the hardcoded text list with a reference to samples['text'] to avoid duplication. Both fixed in the latest push.

Comment on lines +48 to +52
for original_text in ['The quick brown fox jumps over the lazy dog',
                      'Machine learning is transforming the world today',
                      'Natural language processing enables computers to understand text']:
    self.assertIn(original_text, result['text'],
                  f"Original text missing from output: {original_text}")
Contributor


medium

The list of original texts is hardcoded here, which duplicates the samples['text'] list defined earlier in the test. To improve maintainability and prevent potential inconsistencies if the test data is modified, it's better to reference the samples['text'] variable directly.

Suggested change
for original_text in ['The quick brown fox jumps over the lazy dog',
                      'Machine learning is transforming the world today',
                      'Natural language processing enables computers to understand text']:
    self.assertIn(original_text, result['text'],
                  f"Original text missing from output: {original_text}")

for original_text in samples['text']:
    self.assertIn(original_text, result['text'],
                  f"Original text missing from output: {original_text}")

Contributor Author


Fixed — now iterating over samples['text'] directly instead of duplicating the list. Thanks for the suggestion.

Remove print statements from tests to keep test output clean.
Replace hardcoded text list with reference to samples['text'].
Merge multi-line function call into single line to satisfy black
formatter requirements.

Signed-off-by: dubin555 <dubin555@gmail.com>