[WIP] arXiv/PDF to Markdown mappers + dj-op one-shot runner #917

Open
yxdyc wants to merge 1 commit into main from feature/arxiv-pdf-markdown-from-main

yxdyc (Collaborator) commented Feb 14, 2026

New operators

  • arxiv_to_markdown_mapper: Converts an arXiv paper ID (or URL) into a single structured Markdown document. Backends: mineru (default, recommended), pdfplumber, crawl4ai. With crawl4ai, tries the HTML full-text page first, then the abstract page, then PDF.
  • pdf_to_markdown_mapper: Converts PDF from a field (path or bytes) to Markdown; supports mineru (recommended) and pdfplumber.
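
The ID handling and the crawl4ai fallback order described above can be sketched roughly as follows. This is an illustration, not the PR's implementation: the URL templates are the ones quoted in the diff, while `normalize_arxiv_id`, `crawl4ai_candidate_urls`, and the regex (which only covers new-style IDs such as `2501.14755`) are assumptions made for this example.

```python
import re

# URL templates as quoted in this PR's diff.
ARXIV_HTML_URL_TEMPLATE = "https://arxiv.org/html/{arxiv_id}"
ARXIV_ABS_URL_TEMPLATE = "https://arxiv.org/abs/{arxiv_id}"
ARXIV_PDF_URL_TEMPLATE = "https://arxiv.org/pdf/{arxiv_id}.pdf"

# Hypothetical pattern: new-style IDs only, e.g. 2501.14755 or 2501.14755v2.
_NEW_STYLE_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def normalize_arxiv_id(value: str) -> str:
    """Return the bare arXiv ID (keeping a version suffix) from an ID or URL."""
    match = _NEW_STYLE_ID.search(value.strip())
    if match is None:
        raise ValueError(f"not a recognizable new-style arXiv ID: {value!r}")
    return match.group(0)


def crawl4ai_candidate_urls(arxiv_id: str) -> list[str]:
    """List the pages to try, in the order the description gives for crawl4ai."""
    return [
        ARXIV_HTML_URL_TEMPLATE.format(arxiv_id=arxiv_id),  # HTML full text first
        ARXIV_ABS_URL_TEMPLATE.format(arxiv_id=arxiv_id),   # then the abstract page
        ARXIV_PDF_URL_TEMPLATE.format(arxiv_id=arxiv_id),   # then the PDF
    ]
```

Returning the candidates as an ordered list keeps the fallback policy in one place, so a caller can simply iterate until one source yields content.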

Tooling

  • dj-op: One-shot CLI to run a single operator on one sample (e.g. dj-op arxiv_to_markdown_mapper '{"arxiv_id":"2501.14755"}').
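
For readers unfamiliar with the one-shot pattern, here is a minimal sketch of what a runner with the same calling shape (`<op_name> '<json sample>'`) might do. It does not use Data-Juicer at all: `OPS`, `upper_text_op`, and `run_once` are hypothetical stand-ins, and the real `dj-op` presumably resolves operators through Data-Juicer's own registry.

```python
import argparse
import json


def upper_text_op(sample: dict) -> dict:
    """Stand-in operator used only for this sketch."""
    return {**sample, "text": sample.get("text", "").upper()}


# Hypothetical registry; the real tool would look operators up in Data-Juicer.
OPS = {"upper_text_op": upper_text_op}


def run_once(argv: list[str]) -> dict:
    """Parse '<op_name> <json sample>' and apply the operator to that one sample."""
    parser = argparse.ArgumentParser(prog="dj-op-sketch")
    parser.add_argument("op_name", choices=sorted(OPS))
    parser.add_argument("sample", help="one sample as a JSON object")
    args = parser.parse_args(argv)

    sample = json.loads(args.sample)
    if not isinstance(sample, dict):
        parser.error("sample must be a JSON object")
    return OPS[args.op_name](sample)
```

Calling `run_once(["upper_text_op", '{"text": "hi"}'])` mirrors the shell form `dj-op-sketch upper_text_op '{"text": "hi"}'`.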

Docs & defaults

  • Docs recommend MinerU, with pdfplumber as the fallback. Optional dependencies: py-data-juicer[document] (magic-pdf, crawl4ai).

Co-authored-by: Cursor <cursoragent@cursor.com>

yxdyc added the dj:op label (issues/PRs about some specific OPs) on Feb 14, 2026

@gemini-code-assist (Contributor)

Summary of Changes

Hello @yxdyc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Data-Juicer's document processing capabilities by introducing two new mappers for converting academic papers and general PDF content into structured Markdown. It also provides a convenient command-line utility for quickly testing individual operators, streamlining development and debugging workflows.

Highlights

  • New arXiv to Markdown Mapper: Introduced arxiv_to_markdown_mapper to convert arXiv paper IDs or URLs into structured Markdown documents. It supports multiple backends including MinerU (recommended), pdfplumber, and crawl4ai, with intelligent fallback mechanisms.
  • New PDF to Markdown Mapper: Added pdf_to_markdown_mapper for converting PDF content (either file paths or raw bytes) into structured Markdown. This mapper also supports MinerU (recommended) and pdfplumber for conversion.
  • New One-Shot Operator Runner CLI: Developed a new command-line interface tool, dj-op, which allows users to run any single Data-Juicer operator on a single sample, facilitating quick testing and debugging.
  • Enhanced Documentation and Dependencies: Updated the project's documentation to include detailed descriptions and usage examples for the new mappers. A new document extra dependency group was added to pyproject.toml to manage optional dependencies like magic-pdf and crawl4ai.

Changelog
  • README.md
    • Added example usage for the new dj-op CLI tool.
  • data_juicer/ops/mapper/__init__.py
    • Imported and registered the new ArxivToMarkdownMapper and PdfToMarkdownMapper.
  • data_juicer/ops/mapper/arxiv_to_markdown_mapper.py
    • Implemented the ArxivToMarkdownMapper with logic for fetching arXiv metadata, downloading PDFs, and converting content using various backends (MinerU, pdfplumber, crawl4ai).
  • data_juicer/ops/mapper/pdf_to_markdown_mapper.py
    • Implemented the PdfToMarkdownMapper to convert PDF bytes or file paths to Markdown using MinerU or pdfplumber.
  • docs/operators/mapper/arxiv_to_markdown_mapper.md
    • Created documentation for the arxiv_to_markdown_mapper, detailing its parameters, dependencies, and usage.
  • docs/operators/mapper/pdf_to_markdown_mapper.md
    • Created documentation for the pdf_to_markdown_mapper, detailing its parameters, dependencies, and usage.
  • pyproject.toml
    • Defined a new document extra dependency group for PDF/arXiv conversion tools.
    • Added the dj-op entry point for the new CLI tool.
  • tests/ops/mapper/test_arxiv_to_markdown_mapper.py
    • Added unit tests for the ArxivToMarkdownMapper, covering ID normalization and basic processing.
  • tests/ops/mapper/test_pdf_to_markdown_mapper.py
    • Added unit tests for the PdfToMarkdownMapper, including PDF byte handling and basic conversion.
  • tools/run_op.py
    • Created the dj-op command-line utility to execute Data-Juicer operators with specified samples and arguments.
Activity
  • The pull request is currently marked as Work In Progress (WIP), indicating active development by the author.
  • No specific review comments or other activities have been recorded yet.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces two useful new mappers, arxiv_to_markdown_mapper and pdf_to_markdown_mapper, for converting arXiv papers and PDF files into Markdown format. It also adds a convenient one-shot CLI tool, dj-op, for quickly testing operators. The implementation is well-structured, including new functionality, documentation, and unit tests. My review focuses on a few areas for improvement: enhancing security by using HTTPS for API calls, reducing code duplication to improve maintainability, increasing clarity in the documentation, and fixing a bug in the new CLI tool's output handling logic.

Comment on lines +103 to +107
if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()):
    result = out
else:
    # Batched return: dict of lists -> take first
    result = {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()}

high

The logic to detect if an operator's output is batched is not robust. It incorrectly identifies any sample containing a list as a batched output. For instance, a sample like {'text': '...', 'tokens': ['a', 'b']} would be mishandled, leading to data truncation (e.g., tokens becoming just 'a', the first element of the list). A more reliable approach is to check if all values in the output dictionary are lists of length 1.

Suggested change
- if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()):
-     result = out
- else:
-     # Batched return: dict of lists -> take first
-     result = {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()}
+ # Check if the output is a batched sample (dict of lists of size 1)
+ is_batched = (isinstance(out, dict) and out and
+               all(isinstance(v, list) and len(v) == 1 for v in out.values()))
+ if is_batched:
+     # Batched return: dict of lists -> take first
+     result = {k: v[0] for k, v in out.items()}
+ else:
+     result = out
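
To make the difference concrete, the two checks can be put side by side in a small standalone sketch. The function names `unwrap_naive` and `unwrap_strict` and the sample dictionaries are made up for illustration; the bodies follow the original and suggested snippets above.

```python
def unwrap_naive(out: dict) -> dict:
    """Original logic: any list value makes the output look batched."""
    if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()):
        return out
    return {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()}


def unwrap_strict(out: dict) -> dict:
    """Suggested logic: batched only if every value is a one-element list."""
    is_batched = (
        isinstance(out, dict)
        and bool(out)
        and all(isinstance(v, list) and len(v) == 1 for v in out.values())
    )
    return {k: v[0] for k, v in out.items()} if is_batched else out


# A single (unbatched) sample that happens to contain a list field.
single = {"text": "hello", "tokens": ["a", "b"]}
# The same sample wrapped as a batch of size 1.
batched = {"text": ["hello"], "tokens": [["a", "b"]]}

naive_result = unwrap_naive(single)    # 'tokens' is cut down to "a"
strict_single = unwrap_strict(single)  # returned unchanged
strict_batch = unwrap_strict(batched)  # unwrapped to the single sample
```

The stricter check is still a heuristic: a single sample whose every field is a one-element list (for example `{"tokens": ["a"]}`) is indistinguishable from a batch of size 1, so it would still be unwrapped.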

ARXIV_PDF_URL_TEMPLATE = "https://arxiv.org/pdf/{arxiv_id}.pdf"
ARXIV_ABS_URL_TEMPLATE = "https://arxiv.org/abs/{arxiv_id}"
ARXIV_HTML_URL_TEMPLATE = "https://arxiv.org/html/{arxiv_id}"
ARXIV_API_QUERY = "http://export.arxiv.org/api/query?id_list={arxiv_id}"

medium

For security and to follow best practices, API requests should be made over HTTPS instead of HTTP. The arXiv API supports HTTPS.

Suggested change
- ARXIV_API_QUERY = "http://export.arxiv.org/api/query?id_list={arxiv_id}"
+ ARXIV_API_QUERY = "https://export.arxiv.org/api/query?id_list={arxiv_id}"
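
One way to keep a plain-HTTP URL from slipping back in is to build the query through a small guard. The sketch below is only an illustration: `ARXIV_API_BASE`, `require_https`, and `build_api_query` are hypothetical names, and only the endpoint itself comes from the diff.

```python
from urllib.parse import urlencode, urlsplit

# HTTPS form of the endpoint from the suggested change.
ARXIV_API_BASE = "https://export.arxiv.org/api/query"


def require_https(url: str) -> str:
    """Return url unchanged, or raise if it does not use the https scheme."""
    if urlsplit(url).scheme != "https":
        raise ValueError(f"refusing non-HTTPS URL: {url}")
    return url


def build_api_query(arxiv_id: str) -> str:
    """Build the metadata query URL for one arXiv ID (hypothetical helper)."""
    return require_https(f"{ARXIV_API_BASE}?{urlencode({'id_list': arxiv_id})}")
```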

Comment on lines +55 to +112
def _pdf_to_markdown_pdfplumber(pdf_bytes: bytes) -> str:
    """Convert PDF bytes to plain text with minimal structure using pdfplumber."""
    try:
        with io.BytesIO(pdf_bytes) as f:
            with pdfplumber.open(f) as pdf:
                parts = []
                for i, page in enumerate(pdf.pages):
                    tables = page.find_tables()
                    for table in tables:
                        page = page.outside_bbox(table.bbox)
                    text = page.extract_text()
                    if not text:
                        continue
                    page_num = str(page.page_number)
                    if text.rstrip().endswith(page_num):
                        text = text.rstrip()[: -len(page_num)]
                    if text.strip():
                        parts.append(f"## Page {i + 1}\n\n{text.strip()}")
                return "\n\n".join(parts) if parts else ""
    except Exception as e:
        logger.warning(f"pdfplumber failed to parse PDF: {e}")
        return ""


def _pdf_to_markdown_mineru(pdf_bytes: bytes, keep_images: bool = False) -> str:
    """Convert PDF bytes to structured Markdown using MinerU (magic-pdf)."""
    try:
        from magic_pdf.data.data_reader_writer import FileBasedDataWriter
        from magic_pdf.data.dataset import PymuDocDataset
        from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
        from magic_pdf.config.enums import SupportedPdfParseMethod
    except ImportError as e:
        logger.warning(f"magic-pdf not available, fallback to pdfplumber: {e}")
        return _pdf_to_markdown_pdfplumber(pdf_bytes)

    with tempfile.TemporaryDirectory() as tmpdir:
        image_dir = os.path.join(tmpdir, "images")
        os.makedirs(image_dir, exist_ok=True)
        image_writer = FileBasedDataWriter(image_dir)

        try:
            ds = PymuDocDataset(pdf_bytes)
            if ds.classify() == SupportedPdfParseMethod.OCR:
                infer_result = ds.apply(doc_analyze, ocr=True)
                pipe_result = infer_result.pipe_ocr_mode(image_writer)
            else:
                infer_result = ds.apply(doc_analyze, ocr=False)
                pipe_result = infer_result.pipe_txt_mode(image_writer)

            image_dir_basename = "images"
            md_content = pipe_result.get_markdown(image_dir_basename)
            if not keep_images and md_content:
                md_content = re.sub(r"!\[[^\]]*\]\([^)]+\)", "", md_content)
                md_content = re.sub(r"\n{3,}", "\n\n", md_content).strip()
            return md_content or _pdf_to_markdown_pdfplumber(pdf_bytes)
        except Exception as e:
            logger.warning(f"MinerU conversion failed, fallback to pdfplumber: {e}")
            return _pdf_to_markdown_pdfplumber(pdf_bytes)

medium

The helper functions _pdf_to_markdown_pdfplumber and _pdf_to_markdown_mineru are nearly identical to those in arxiv_to_markdown_mapper.py. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, it would be beneficial to refactor this shared logic into a common utility module, such as data_juicer/utils/pdf_utils.py.
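
A shared module along those lines could look roughly like the sketch below. It is one possible shape, not the PR's code: `strip_image_refs` reuses the two regular expressions from the quoted MinerU helper, while `convert_with_fallback` and the `Converter` alias are hypothetical names, and the backends are passed in as plain callables so the example runs without pdfplumber or magic-pdf installed.

```python
import logging
import re
from typing import Callable, Sequence

logger = logging.getLogger(__name__)

# A backend is any callable that turns PDF bytes into Markdown text.
Converter = Callable[[bytes], str]


def strip_image_refs(markdown: str) -> str:
    """Drop Markdown image references and collapse runs of blank lines."""
    markdown = re.sub(r"!\[[^\]]*\]\([^)]+\)", "", markdown)
    return re.sub(r"\n{3,}", "\n\n", markdown).strip()


def convert_with_fallback(pdf_bytes: bytes, converters: Sequence[Converter]) -> str:
    """Try each converter in order and return the first non-empty result."""
    for convert in converters:
        try:
            result = convert(pdf_bytes)
        except Exception as exc:  # broad catch, mirroring the helpers in the PR
            name = getattr(convert, "__name__", repr(convert))
            logger.warning("%s failed, trying next backend: %s", name, exc)
            continue
        if result:
            return result
    return ""
```

Both mappers could then call something like `convert_with_fallback(pdf_bytes, [mineru_backend, pdfplumber_backend])`, where the two backend names are placeholders for the existing helpers, instead of each carrying its own copy of the fallback logic.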

| `keep_images` | bool | `False` | Keep image refs in Markdown (mineru only). 是否保留图片引用。 |
| `timeout` | int | `60` | Request timeout (seconds). 请求超时(秒)。 |
| `download_delay` | float | `1.0` | Delay before each download (rate limiting). 每次下载前延迟(秒)。 |
| `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |

medium

The default value for output_key is documented as text_key, which is the name of a parameter. This could be confusing for users. The operator's __init__ method shows the default is None, which then falls back to the value of self.text_key. It would be clearer to document the default as None and explain in the description that it defaults to the value of text_key.

Suggested change
- | `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |
+ | `output_key` | str | `None` | Field to write Markdown to. Defaults to the value of `text_key`. 写入 Markdown 的字段名, 默认为 `text_key` 的值。 |
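
The fallback the comment describes (a `None` default resolved to `text_key` at construction time) is a common pattern. A minimal sketch, using a hypothetical `MarkdownMapperSketch` class rather than the PR's actual operator:

```python
from typing import Optional


class MarkdownMapperSketch:
    """Hypothetical stand-in showing only the output_key fallback."""

    def __init__(self, text_key: str = "text", output_key: Optional[str] = None):
        self.text_key = text_key
        # None means "write to the same field that text_key names".
        self.output_key = output_key if output_key is not None else text_key

    def write(self, sample: dict, markdown: str) -> dict:
        """Store the converted Markdown under the resolved output field."""
        sample[self.output_key] = markdown
        return sample
```

Documenting the default as `None` matches the signature a user sees, while the description explains where the value ends up.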

| `pdf_key` | str | `"pdf"` | Field containing PDF path (str) or bytes. 存 PDF 路径或字节的字段名。 |
| `backend` | str | `"mineru"` | `mineru` (recommended; MinerU, high accuracy) or `pdfplumber` (built-in fallback). Conversion backend. |
| `keep_images` | bool | `False` | Keep image refs in Markdown (mineru only). 是否保留图片引用。 |
| `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |

medium

The default value for output_key is documented as text_key, which is the name of a parameter. This could be confusing for users. The operator's __init__ method shows the default is None, which then falls back to the value of self.text_key. It would be clearer to document the default as None and explain in the description that it defaults to the value of text_key.

Suggested change
- | `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |
+ | `output_key` | str | `None` | Field to write Markdown to. Defaults to the value of `text_key`. 写入 Markdown 的字段名, 默认为 `text_key` 的值。 |
