[NewOp] Add generate_challenging_qa_mapper based on MindGYM principles #703
Bat-Reality wants to merge 4 commits into datajuicer:main from
Conversation
```python
from data_juicer.utils.lazy_loader import LazyLoader
from data_juicer.utils.model_utils import get_model, prepare_model

torch = LazyLoader('torch', 'torch')
```
`torch` and `vllm` can be imported from `model_utils` instead of re-declaring the lazy loaders here.
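For context, the `LazyLoader` pattern defers the actual import until the first attribute access on the proxy, which is why re-declaring it per operator is redundant. A minimal self-contained sketch of the idea (not data-juicer's actual implementation):

```python
import importlib


class LazyLoader:
    """Minimal sketch of a lazy module loader: the real import is
    deferred until the first attribute access on the proxy object."""

    def __init__(self, local_name, module_name):
        self._local_name = local_name
        self._module_name = module_name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not found on the proxy itself,
        # i.e. anything that should be forwarded to the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attr)


math = LazyLoader('math', 'math')  # nothing imported yet
print(math.sqrt(16.0))  # the import happens here, on first use; prints 4.0
```

Declaring the proxy once in a shared module (as `model_utils` does) keeps heavyweight imports like `torch` and `vllm` out of every operator's import path.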
```python
OP_NAME = 'generate_challenging_qa_mapper'


def retry_on_error(func, max_retries=5, delay=1):
```
Consider using an existing third-party retry library, or moving `retry_on_error` into a shared util module so other operators can reuse it.
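If a third-party dependency (e.g. `tenacity`) is undesirable, a small reusable decorator in a util module covers the same ground. This sketch reshapes `retry_on_error` as a decorator factory, so its signature differs from the PR's helper; it is an illustration, not the project's code:

```python
import functools
import time


def retry_on_error(max_retries=5, delay=1):
    """Retry the wrapped callable up to max_retries times, sleeping
    `delay` seconds between attempts; re-raise the last error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:  # broad on purpose: sketch only
                    last_exc = exc
                    time.sleep(delay)
            raise last_exc
        return wrapper
    return decorator
```

Placed in a util module, the operator's model-calling method can simply be decorated with `@retry_on_error(max_retries=5, delay=1)` instead of wiring the retry loop inline.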
Please merge the latest main branch and run pre-commit locally.
805b8f6 to 7f2c203
/gemini review
Code Review
This pull request introduces a new generate_challenging_qa_mapper operator, a significant feature for generating reasoning-focused QA pairs using a multi-turn conversation with a language model. The implementation is well-structured. However, I've identified a few issues that need attention. There's a critical error handling gap that could lead to a crash if the model's output is malformed. Additionally, the GPU configuration is hardcoded, which could cause runtime errors or suboptimal performance in different environments. I've also provided suggestions to improve code clarity and enhance the new test case with assertions.
```python
qa = self.extract_json(qa[0].outputs[0].text)
qa["thinking"] = multihop[0].outputs[0].text
```
If extract_json returns None because the model output is malformed, the subsequent line qa["thinking"] = ... will raise a TypeError, causing the process to crash. It's crucial to check if qa is None and handle this case gracefully, for instance, by raising a ValueError to trigger the retry mechanism with a more informative error message.
Suggested change:

```diff
- qa = self.extract_json(qa[0].outputs[0].text)
- qa["thinking"] = multihop[0].outputs[0].text
+ qa = self.extract_json(qa[0].outputs[0].text)
+ if qa is None:
+     raise ValueError("Failed to extract valid JSON from model output.")
+ qa["thinking"] = multihop[0].outputs[0].text
```
| """ | ||
| super().__init__(*args, **kwargs) | ||
| self.hf_model = hf_model | ||
| self.model_key = prepare_model(model_type="huggingface", pretrained_model_name_or_path=hf_model) |
```python
result = op.process(deepcopy(sample))
print(f'Output results: {result}')
```
This test runs the operator but doesn't have any assertions to verify the output. A test should validate the behavior of the code. Please add assertions to check that the returned result dictionary contains the expected keys from the generated QA pair. This will make the test more meaningful and robust.
Suggested change:

```diff
- result = op.process(deepcopy(sample))
- print(f'Output results: {result}')
+ result = op.process(deepcopy(sample))
+ self.assertIn('background_document', result)
+ self.assertIn('reasoning_category', result)
+ self.assertIn('sub_questions', result)
+ self.assertIn('relationship_category', result)
+ self.assertIn('multihop_question', result)
+ self.assertIn('multihop_answer', result)
+ self.assertIn('thinking', result)
+ print(f'Output results: {result}')
```
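A self-contained illustration of that assertion pattern, using a stand-in operator (`FakeQAOp` is hypothetical; the real test would construct the actual mapper and a real sample):

```python
import unittest
from copy import deepcopy


class FakeQAOp:
    """Stand-in for generate_challenging_qa_mapper: returns the sample
    augmented with the keys the real operator is expected to produce."""

    KEYS = ('background_document', 'reasoning_category', 'sub_questions',
            'relationship_category', 'multihop_question', 'multihop_answer',
            'thinking')

    def process(self, sample):
        sample.update({key: f'<{key}>' for key in self.KEYS})
        return sample


class TestGenerateChallengingQAMapper(unittest.TestCase):
    def test_output_keys(self):
        op = FakeQAOp()
        sample = {'text': 'background document'}
        result = op.process(deepcopy(sample))
        # Assert on every expected output key, not just on a print().
        for key in FakeQAOp.KEYS:
            self.assertIn(key, result)
```

Asserting on the key set keeps the test cheap and deterministic even though the generated text itself is model-dependent.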
8a37b7a to b7e2654
I ran some tests locally and have a few points of feedback.

Errors found during testing:

Other suggestions for improvement:
Introduces a novel QA generation module based on self-challenging mechanisms, designed to autonomously synthesize high-quality reasoning-focused question-answer pairs, inspired by the MindGYM paper.