
fix(ops): fix NlpaugEnMapper only augmenting first sample in batch #927

Open

dubin555 wants to merge 3 commits into datajuicer:main from
dubin555:oss-scout/verify-fb-data_juicer-ops-mapper-nlpaug_en_mapper-

Conversation


@dubin555 dubin555 commented Mar 3, 2026

Summary

Fix a bug in NlpaugEnMapper.process_batched() where samples[self.text_key][0] hardcodes processing of only the first sample in each batch. With the default batch_size=1000, this means ~99.9% of samples silently pass through without augmentation.

Also fix broken metadata replication that used bulk list multiplication (res_samples[key] * len(...)) instead of correct per-sample replication.

Problem

# Line 149 - BEFORE (wrong)
texts_to_aug = samples[self.text_key][0]  # only first sample

The method takes only the first element from the batch's text list. The comment # batch_size = 1 reveals the author assumed single-sample batches, but the class declares _batched_op = True and the framework default is batch_size=1000.

Existing tests didn't catch this because NestedDataset.map() overrides batch_size to 1 by default (at dj_dataset.py:378), so the unit test path always sends single-sample batches. The production path through Mapper.run() uses self.batch_size which defaults to 1000.

Two sub-bugs:

  1. Only the first sample is augmented; all others are silently dropped/ignored
  2. Metadata replication is broken: multiplying the entire original metadata list by the new text count produces N×M entries instead of correct per-sample replication
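The second sub-bug can be reproduced concretely with plain Python lists. This is a toy illustration, not the actual mapper code, and the variable names are made up for the example:

```python
# Toy reproduction of the replication sub-bug with plain lists; the
# variable names are illustrative, not the actual mapper fields.
metas = ["m1", "m2", "m3"]     # one metadata entry per original sample

# BEFORE (broken): multiply the whole list by the output-text count.
# With 4 output texts (3 originals + 1 augmented under the bug), this
# yields 12 interleaved entries instead of one meta per output text.
broken = metas * 4
print(len(broken))  # 12

# AFTER (correct): replicate each sample's meta once per text derived
# from that sample (here: original + 1 augmentation = 2 texts each).
fixed = [m for m in metas for _ in range(2)]
print(len(fixed))   # 6
```

Note that the broken version also scrambles the pairing: `broken[1]` is `"m2"` even though output text 1 came from sample 0.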

Fix

  • Replace single-sample processing with a loop over all samples in the batch
  • Augment each sample individually and collect results
  • Replicate metadata fields per-sample rather than via bulk list multiplication
  • Remove now-unused deepcopy import
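A simplified sketch of the fixed batching logic, assuming a stand-in `augment` callable and illustrative field names rather than the real data-juicer implementation:

```python
# Simplified sketch of the fixed process_batched logic. `augment` is a
# stand-in for the nlpaug augmenter call and the field names are
# illustrative; this is not the actual data-juicer code.
def process_batched(samples, augment, aug_num=1, keep_original=True):
    out_texts, out_metas = [], []
    # Loop over every sample in the batch instead of samples["text"][0].
    for text, meta in zip(samples["text"], samples["meta"]):
        aug_texts = [augment(text) for _ in range(aug_num)]
        new_texts = ([text] if keep_original else []) + aug_texts
        out_texts.extend(new_texts)
        # Replicate metadata once per text derived from THIS sample.
        out_metas.extend([meta] * len(new_texts))
    return {"text": out_texts, "meta": out_metas}

batch = {"text": ["a", "b", "c"], "meta": [1, 2, 3]}
result = process_batched(batch, augment=str.upper)
print(len(result["text"]), len(result["meta"]))  # 6 6
```

With a batch of 3 and `aug_num=1`, this produces the 6/6 counts shown in the "After fix" output above, and each metadata entry stays aligned with the text it describes.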

Before fix (batch of 3 samples, aug_num=1, keep_original_sample=True):

Output texts: 4  (3 originals + 1 augmented from first sample only)
Output metas: 12 (4 × 3 = broken replication)

After fix:

Output texts: 6  (3 originals + 3 augmented, one per sample)
Output metas: 6  (correctly replicated per-sample)

Tests

Added 3 regression tests covering multi-sample batch processing:

  • test_batch_only_augments_first_sample — verifies all samples in a batch are augmented
  • test_batch_without_keep_original — verifies correct behavior when keep_original_sample=False
  • test_batch_sequential_mode — verifies sequential augmentation mode with batches
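The shape of such a regression test can be sketched as follows, using a stand-in batch augmenter in place of the real NlpaugEnMapper (the function name and the `[aug]` suffix are illustrative only):

```python
import unittest

# Stand-in for a batched augmenter: every input text must contribute
# output, which is exactly what the original bug violated.
def fake_augment_batch(texts, aug_num=1, keep_original=True):
    out = []
    for t in texts:
        if keep_original:
            out.append(t)
        out.extend([t + " [aug]"] * aug_num)
    return out

class TestBatchAugmentation(unittest.TestCase):
    def test_every_sample_is_augmented(self):
        texts = ["one", "two", "three"]
        result = fake_augment_batch(texts)
        # 3 originals + 3 augmented: every sample contributed output.
        self.assertEqual(len(result), 6,
                         f"expected 6 output texts, got {len(result)}")
        for t in texts:
            self.assertIn(t, result)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBatchAugmentation)
assert unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```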

All 12 existing tests continue to pass:

tests/ops/mapper/test_nlpaug_en_mapper.py: 12 passed
tests/ops/mapper/test_nlpaug_en_mapper_batch_bug.py: 3 passed

…irst

The process_batched method used `samples[self.text_key][0]` which only
augmented the first sample in each batch. With the default batch_size
of 1000, this meant 999 out of every 1000 samples silently passed
through without augmentation. The field replication logic was also
broken, multiplying the entire original metadata list by the text
count instead of per-sample replication.

Fix: iterate over all samples in the batch, augment each individually,
and correctly replicate metadata fields per-sample.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a significant bug in the NlpaugEnMapper component that prevented proper batch processing and metadata handling during text augmentation. The changes ensure that all samples within a batch are correctly augmented and their associated metadata is accurately replicated, resolving a silent data loss issue and improving the reliability of the data augmentation pipeline.

Highlights

  • Batch Processing Fix: Resolved a critical bug in NlpaugEnMapper.process_batched() where only the first sample in a batch was augmented, leading to silent omission of most samples in production.
  • Metadata Replication Correction: Fixed an issue where metadata was incorrectly replicated using bulk list multiplication, now ensuring accurate per-sample replication.
  • Comprehensive Regression Tests: Introduced three new regression tests to validate the corrected batch processing behavior, including scenarios for keep_original_sample and sequential augmentation mode.


Changelog
  • data_juicer/ops/mapper/nlpaug_en_mapper.py
    • Removed the unused deepcopy import.
    • Rewrote the process_batched method to iterate through each sample in the input batch, ensuring every sample is augmented.
    • Corrected the logic for replicating metadata fields to align with the number of augmented texts generated for each original sample.
  • tests/ops/mapper/test_nlpaug_en_mapper_batch_bug.py
    • Added a new test file dedicated to reproducing and verifying the fix for the batch processing bug.
    • Included test_batch_only_augments_first_sample to confirm all samples are augmented.
    • Added test_batch_without_keep_original to check behavior when original samples are not retained.
    • Implemented test_batch_sequential_mode to ensure sequential augmentation works correctly with batches.
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a critical bug in NlpaugEnMapper where only the first sample in a batch was being processed. The logic is properly updated to iterate over all samples, and a related metadata replication bug is also fixed. The addition of new regression tests is a great way to ensure the correctness of the fix and prevent this issue from recurring. I have a couple of minor suggestions to improve the maintainability of the new test code.

Comment on lines +41 to +45
print(f"Input samples: 3")
print(f"Output texts: {num_texts}")
print(f"Output metas: {num_metas}")
print(f"Expected texts (correct): 6 (3 original + 3 augmented)")
print(f"Texts: {result['text']}")
Contributor


medium

While these print statements can be helpful for debugging, they add noise to the test suite's output. It's a best practice in unit testing to rely on descriptive assertion failure messages for diagnostics rather than printing to stdout. This keeps test runs clean. This suggestion also applies to the other tests in this file.

Contributor Author


Good point, thanks! Removed the print statements from all three tests — the assertion messages provide sufficient diagnostics on failure. Also replaced the hardcoded text list with a reference to samples['text'] to avoid duplication. Both fixed in the latest push.

Comment on lines +48 to +52
for original_text in ['The quick brown fox jumps over the lazy dog',
                      'Machine learning is transforming the world today',
                      'Natural language processing enables computers to understand text']:
    self.assertIn(original_text, result['text'],
                  f"Original text missing from output: {original_text}")
Contributor


medium

The list of original texts is hardcoded here, which duplicates the samples['text'] list defined earlier in the test. To improve maintainability and prevent potential inconsistencies if the test data is modified, it's better to reference the samples['text'] variable directly.

Suggested change
for original_text in ['The quick brown fox jumps over the lazy dog',
                      'Machine learning is transforming the world today',
                      'Natural language processing enables computers to understand text']:
    self.assertIn(original_text, result['text'],
                  f"Original text missing from output: {original_text}")

for original_text in samples['text']:
    self.assertIn(original_text, result['text'],
                  f"Original text missing from output: {original_text}")

Contributor Author


Fixed — now iterating over samples['text'] directly instead of duplicating the list. Thanks for the suggestion.

Remove print statements from tests to keep test output clean.
Replace hardcoded text list with reference to samples['text'].
Merge multi-line function call into single line to satisfy black
formatter requirements.

Signed-off-by: dubin555 <dubin555@gmail.com>