[NewOp] Add generate_challenging_qa_mapper based on MindGYM principles #703
Bat-Reality wants to merge 4 commits into datajuicer:main from
Conversation
```python
from data_juicer.utils.lazy_loader import LazyLoader
from data_juicer.utils.model_utils import get_model, prepare_model

torch = LazyLoader('torch', 'torch')
```
`torch` and `vllm` can be imported from `model_utils` instead of re-declaring the lazy loaders here.
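For context, the `LazyLoader` pattern defers the actual import until the first attribute access on the proxy, which is why re-declaring it per operator is redundant. A minimal self-contained sketch of the idea (not data-juicer's actual implementation):

```python
import importlib


class LazyLoader:
    """Minimal sketch of a lazy module loader: the real import is
    deferred until the first attribute access on the proxy object."""

    def __init__(self, local_name, module_name):
        self._local_name = local_name
        self._module_name = module_name
        self._module = None

    def __getattr__(self, attr):
        # Called only for attributes not found on the proxy itself,
        # i.e. anything that should be forwarded to the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, attr)


math = LazyLoader('math', 'math')  # nothing imported yet
print(math.sqrt(16.0))  # the import happens here, on first use; prints 4.0
```

Declaring the proxy once in a shared module (as `model_utils` does) keeps heavyweight imports like `torch` and `vllm` out of every operator's import path.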
```python
OP_NAME = 'generate_challenging_qa_mapper'


def retry_on_error(func, max_retries=5, delay=1):
```
Consider using an existing third-party retry library, or moving `retry_on_error` into a shared util module so other operators can reuse it.
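If a third-party dependency (e.g. `tenacity`) is undesirable, a small reusable decorator in a util module covers the same ground. This sketch reshapes `retry_on_error` as a decorator factory, so its signature differs from the PR's helper; it is an illustration, not the project's code:

```python
import functools
import time


def retry_on_error(max_retries=5, delay=1):
    """Retry the wrapped callable up to max_retries times, sleeping
    `delay` seconds between attempts; re-raise the last error."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:  # broad on purpose: sketch only
                    last_exc = exc
                    time.sleep(delay)
            raise last_exc
        return wrapper
    return decorator
```

Placed in a util module, the operator's model-calling method can simply be decorated with `@retry_on_error(max_retries=5, delay=1)` instead of wiring the retry loop inline.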
Please merge the latest main branch and run pre-commit locally.
805b8f6 to 7f2c203
/gemini review
Code Review
This pull request introduces a new generate_challenging_qa_mapper operator, a significant feature for generating reasoning-focused QA pairs using a multi-turn conversation with a language model. The implementation is well-structured. However, I've identified a few issues that need attention. There's a critical error handling gap that could lead to a crash if the model's output is malformed. Additionally, the GPU configuration is hardcoded, which could cause runtime errors or suboptimal performance in different environments. I've also provided suggestions to improve code clarity and enhance the new test case with assertions.
```python
qa = self.extract_json(qa[0].outputs[0].text)
qa["thinking"] = multihop[0].outputs[0].text
```
If extract_json returns None because the model output is malformed, the subsequent line qa["thinking"] = ... will raise a TypeError, causing the process to crash. It's crucial to check if qa is None and handle this case gracefully, for instance, by raising a ValueError to trigger the retry mechanism with a more informative error message.
Suggested change:

```diff
- qa = self.extract_json(qa[0].outputs[0].text)
- qa["thinking"] = multihop[0].outputs[0].text
+ qa = self.extract_json(qa[0].outputs[0].text)
+ if qa is None:
+     raise ValueError("Failed to extract valid JSON from model output.")
+ qa["thinking"] = multihop[0].outputs[0].text
```
| """ | ||
| super().__init__(*args, **kwargs) | ||
| self.hf_model = hf_model | ||
| self.model_key = prepare_model(model_type="huggingface", pretrained_model_name_or_path=hf_model) |
```python
result = op.process(deepcopy(sample))
print(f'Output results: {result}')
```
This test runs the operator but doesn't have any assertions to verify the output. A test should validate the behavior of the code. Please add assertions to check that the returned result dictionary contains the expected keys from the generated QA pair. This will make the test more meaningful and robust.
Suggested change:

```diff
- result = op.process(deepcopy(sample))
- print(f'Output results: {result}')
+ result = op.process(deepcopy(sample))
+ self.assertIn('background_document', result)
+ self.assertIn('reasoning_category', result)
+ self.assertIn('sub_questions', result)
+ self.assertIn('relationship_category', result)
+ self.assertIn('multihop_question', result)
+ self.assertIn('multihop_answer', result)
+ self.assertIn('thinking', result)
+ print(f'Output results: {result}')
```
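A self-contained illustration of that assertion pattern, using a stand-in operator (`FakeQAOp` is hypothetical; the real test would construct the actual mapper and a real sample):

```python
import unittest
from copy import deepcopy


class FakeQAOp:
    """Stand-in for generate_challenging_qa_mapper: returns the sample
    augmented with the keys the real operator is expected to produce."""

    KEYS = ('background_document', 'reasoning_category', 'sub_questions',
            'relationship_category', 'multihop_question', 'multihop_answer',
            'thinking')

    def process(self, sample):
        sample.update({key: f'<{key}>' for key in self.KEYS})
        return sample


class TestGenerateChallengingQAMapper(unittest.TestCase):
    def test_output_keys(self):
        op = FakeQAOp()
        sample = {'text': 'background document'}
        result = op.process(deepcopy(sample))
        # Assert on every expected output key, not just on a print().
        for key in FakeQAOp.KEYS:
            self.assertIn(key, result)
```

Asserting on the key set keeps the test cheap and deterministic even though the generated text itself is model-dependent.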
8a37b7a to b7e2654
I ran some tests locally and have a few points of feedback.

Errors found during testing:

Other suggestions for improvement:
Introduces a novel QA generation module based on self-challenging mechanisms, designed to autonomously synthesize high-quality reasoning-focused question-answer pairs, inspired by the MindGYM paper.