[WIP] arXiv/PDF to Markdown mappers + dj-op one-shot runner#917
[WIP] arXiv/PDF to Markdown mappers + dj-op one-shot runner#917
Conversation
Summary of ChangesHello @yxdyc, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances Data-Juicer's document processing capabilities by introducing two new mappers for converting academic papers and general PDF content into structured Markdown. It also provides a convenient command-line utility for quickly testing individual operators, streamlining development and debugging workflows. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces two useful new mappers, arxiv_to_markdown_mapper and pdf_to_markdown_mapper, for converting arXiv papers and PDF files into Markdown format. It also adds a convenient one-shot CLI tool, dj-op, for quickly testing operators. The implementation is well-structured, including new functionality, documentation, and unit tests. My review focuses on a few areas for improvement: enhancing security by using HTTPS for API calls, reducing code duplication to improve maintainability, increasing clarity in the documentation, and fixing a bug in the new CLI tool's output handling logic.
| if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()): | ||
| result = out | ||
| else: | ||
| # Batched return: dict of lists -> take first | ||
| result = {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()} |
There was a problem hiding this comment.
The logic to detect if an operator's output is batched is not robust. It incorrectly identifies any sample containing a list as a batched output. For instance, a sample like {'text': '...', 'tokens': ['a', 'b']} would be mishandled, leading to data truncation (e.g., tokens becoming just ['a']). A more reliable approach is to check if all values in the output dictionary are lists of length 1.
| if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()): | |
| result = out | |
| else: | |
| # Batched return: dict of lists -> take first | |
| result = {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()} | |
| # Check if the output is a batched sample (dict of lists of size 1) | |
| is_batched = (isinstance(out, dict) and out and | |
| all(isinstance(v, list) and len(v) == 1 for v in out.values())) | |
| if is_batched: | |
| # Batched return: dict of lists -> take first | |
| result = {k: v[0] for k, v in out.items()} | |
| else: | |
| result = out |
| ARXIV_PDF_URL_TEMPLATE = "https://arxiv.org/pdf/{arxiv_id}.pdf" | ||
| ARXIV_ABS_URL_TEMPLATE = "https://arxiv.org/abs/{arxiv_id}" | ||
| ARXIV_HTML_URL_TEMPLATE = "https://arxiv.org/html/{arxiv_id}" | ||
| ARXIV_API_QUERY = "http://export.arxiv.org/api/query?id_list={arxiv_id}" |
There was a problem hiding this comment.
| def _pdf_to_markdown_pdfplumber(pdf_bytes: bytes) -> str: | ||
| """Convert PDF bytes to plain text with minimal structure using pdfplumber.""" | ||
| try: | ||
| with io.BytesIO(pdf_bytes) as f: | ||
| with pdfplumber.open(f) as pdf: | ||
| parts = [] | ||
| for i, page in enumerate(pdf.pages): | ||
| tables = page.find_tables() | ||
| for table in tables: | ||
| page = page.outside_bbox(table.bbox) | ||
| text = page.extract_text() | ||
| if not text: | ||
| continue | ||
| page_num = str(page.page_number) | ||
| if text.rstrip().endswith(page_num): | ||
| text = text.rstrip()[: -len(page_num)] | ||
| if text.strip(): | ||
| parts.append(f"## Page {i + 1}\n\n{text.strip()}") | ||
| return "\n\n".join(parts) if parts else "" | ||
| except Exception as e: | ||
| logger.warning(f"pdfplumber failed to parse PDF: {e}") | ||
| return "" | ||
|
|
||
|
|
||
| def _pdf_to_markdown_mineru(pdf_bytes: bytes, keep_images: bool = False) -> str: | ||
| """Convert PDF bytes to structured Markdown using MinerU (magic-pdf).""" | ||
| try: | ||
| from magic_pdf.data.data_reader_writer import FileBasedDataWriter | ||
| from magic_pdf.data.dataset import PymuDocDataset | ||
| from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze | ||
| from magic_pdf.config.enums import SupportedPdfParseMethod | ||
| except ImportError as e: | ||
| logger.warning(f"magic-pdf not available, fallback to pdfplumber: {e}") | ||
| return _pdf_to_markdown_pdfplumber(pdf_bytes) | ||
|
|
||
| with tempfile.TemporaryDirectory() as tmpdir: | ||
| image_dir = os.path.join(tmpdir, "images") | ||
| os.makedirs(image_dir, exist_ok=True) | ||
| image_writer = FileBasedDataWriter(image_dir) | ||
|
|
||
| try: | ||
| ds = PymuDocDataset(pdf_bytes) | ||
| if ds.classify() == SupportedPdfParseMethod.OCR: | ||
| infer_result = ds.apply(doc_analyze, ocr=True) | ||
| pipe_result = infer_result.pipe_ocr_mode(image_writer) | ||
| else: | ||
| infer_result = ds.apply(doc_analyze, ocr=False) | ||
| pipe_result = infer_result.pipe_txt_mode(image_writer) | ||
|
|
||
| image_dir_basename = "images" | ||
| md_content = pipe_result.get_markdown(image_dir_basename) | ||
| if not keep_images and md_content: | ||
| md_content = re.sub(r"!\[[^\]]*\]\([^)]+\)", "", md_content) | ||
| md_content = re.sub(r"\n{3,}", "\n\n", md_content).strip() | ||
| return md_content or _pdf_to_markdown_pdfplumber(pdf_bytes) | ||
| except Exception as e: | ||
| logger.warning(f"MinerU conversion failed, fallback to pdfplumber: {e}") | ||
| return _pdf_to_markdown_pdfplumber(pdf_bytes) |
There was a problem hiding this comment.
The helper functions _pdf_to_markdown_pdfplumber and _pdf_to_markdown_mineru are nearly identical to those in arxiv_to_markdown_mapper.py. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, it would be beneficial to refactor this shared logic into a common utility module, such as data_juicer/utils/pdf_utils.py.
| | `keep_images` | bool | `False` | Keep image refs in Markdown (mineru only). 是否保留图片引用。 | | ||
| | `timeout` | int | `60` | Request timeout (seconds). 请求超时(秒)。 | | ||
| | `download_delay` | float | `1.0` | Delay before each download (rate limiting). 每次下载前延迟(秒)。 | | ||
| | `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 | |
There was a problem hiding this comment.
The default value for output_key is documented as text_key, which is the name of a parameter. This could be confusing for users. The operator's __init__ method shows the default is None, which then falls back to the value of self.text_key. It would be clearer to document the default as None and explain in the description that it defaults to the value of text_key.
| | `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 | | |
| | `output_key` | str | `None` | Field to write Markdown to. Defaults to the value of `text_key`. 写入 Markdown 的字段名, 默认为 `text_key` 的值。 | |
| | `pdf_key` | str | `"pdf"` | Field containing PDF path (str) or bytes. 存 PDF 路径或字节的字段名。 | | ||
| | `backend` | str | `"mineru"` | `mineru` (推荐,MinerU 高精度) 或 `pdfplumber` (内置回退). 转换后端。 | | ||
| | `keep_images` | bool | `False` | Keep image refs in Markdown (mineru only). 是否保留图片引用。 | | ||
| | `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 | |
There was a problem hiding this comment.
The default value for output_key is documented as text_key, which is the name of a parameter. This could be confusing for users. The operator's __init__ method shows the default is None, which then falls back to the value of self.text_key. It would be clearer to document the default as None and explain in the description that it defaults to the value of text_key.
| | `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 | | |
| | `output_key` | str | `None` | Field to write Markdown to. Defaults to the value of `text_key`. 写入 Markdown 的字段名, 默认为 `text_key` 的值。 | |
New operators
Tooling
Docs & defaults