[WIP] arXiv/PDF to Markdown mappers + dj-op one-shot runner by yxdyc · Pull Request #917 · datajuicer/data-juicer

yxdyc · 2026-02-14T07:42:45Z

New operators

arxiv_to_markdown_mapper: Converts an arXiv paper ID (or URL) into a single structured Markdown document. Backends: mineru (default, recommended), pdfplumber, crawl4ai. With crawl4ai, tries the HTML full-text page first, then the abstract page, then PDF.
pdf_to_markdown_mapper: Converts PDF from a field (path or bytes) to Markdown; supports mineru (recommended) and pdfplumber.
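
The crawl4ai fallback order described above (HTML full text, then abstract page, then PDF) can be sketched as follows. This is a minimal illustration, not the PR's implementation: the URL templates are the ones quoted later in the review, while `candidate_urls`, `first_nonempty`, and the injected `fetch` callable are hypothetical names introduced here.

```python
from typing import Callable, Iterable, List, Optional

# URL templates as quoted in the review of arxiv_to_markdown_mapper.py.
ARXIV_PDF_URL_TEMPLATE = "https://arxiv.org/pdf/{arxiv_id}.pdf"
ARXIV_ABS_URL_TEMPLATE = "https://arxiv.org/abs/{arxiv_id}"
ARXIV_HTML_URL_TEMPLATE = "https://arxiv.org/html/{arxiv_id}"


def candidate_urls(arxiv_id: str) -> List[str]:
    """Hypothetical helper: URLs in the order the description lists them."""
    return [
        ARXIV_HTML_URL_TEMPLATE.format(arxiv_id=arxiv_id),
        ARXIV_ABS_URL_TEMPLATE.format(arxiv_id=arxiv_id),
        ARXIV_PDF_URL_TEMPLATE.format(arxiv_id=arxiv_id),
    ]


def first_nonempty(
    urls: Iterable[str], fetch: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the first non-empty fetch result; an exception counts as a miss."""
    for url in urls:
        try:
            content = fetch(url)
        except Exception:
            continue  # try the next, less structured source
        if content:
            return content
    return None
```

Passing `fetch` in as a parameter keeps the ordering logic testable without network access; a real backend would supply its own fetcher.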

Tooling

dj-op: One-shot CLI to run a single operator on one sample (e.g. dj-op arxiv_to_markdown_mapper '{"arxiv_id":"2501.14755"}').

Docs & defaults

Docs recommend MinerU, with pdfplumber as the fallback. Optional dependencies: py-data-juicer[document] (magic-pdf, crawl4ai).

Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist · 2026-02-14T07:43:09Z

Summary of Changes

Hello @yxdyc, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Data-Juicer's document processing capabilities by introducing two new mappers for converting academic papers and general PDF content into structured Markdown. It also provides a convenient command-line utility for quickly testing individual operators, streamlining development and debugging workflows.

Highlights

New arXiv to Markdown Mapper: Introduced arxiv_to_markdown_mapper to convert arXiv paper IDs or URLs into structured Markdown documents. It supports multiple backends including MinerU (recommended), pdfplumber, and crawl4ai, with intelligent fallback mechanisms.
New PDF to Markdown Mapper: Added pdf_to_markdown_mapper for converting PDF content (either file paths or raw bytes) into structured Markdown. This mapper also supports MinerU (recommended) and pdfplumber for conversion.
New One-Shot Operator Runner CLI: Developed a new command-line interface tool, dj-op, which allows users to run any single Data-Juicer operator on a single sample, facilitating quick testing and debugging.
Enhanced Documentation and Dependencies: Updated the project's documentation to include detailed descriptions and usage examples for the new mappers. A new document extra dependency group was added to pyproject.toml to manage optional dependencies like magic-pdf and crawl4ai.

Changelog

README.md
- Added example usage for the new dj-op CLI tool.
data_juicer/ops/mapper/__init__.py
- Imported and registered the new ArxivToMarkdownMapper and PdfToMarkdownMapper.
data_juicer/ops/mapper/arxiv_to_markdown_mapper.py
- Implemented the ArxivToMarkdownMapper with logic for fetching arXiv metadata, downloading PDFs, and converting content using various backends (MinerU, pdfplumber, crawl4ai).
data_juicer/ops/mapper/pdf_to_markdown_mapper.py
- Implemented the PdfToMarkdownMapper to convert PDF bytes or file paths to Markdown using MinerU or pdfplumber.
docs/operators/mapper/arxiv_to_markdown_mapper.md
- Created documentation for the arxiv_to_markdown_mapper, detailing its parameters, dependencies, and usage.
docs/operators/mapper/pdf_to_markdown_mapper.md
- Created documentation for the pdf_to_markdown_mapper, detailing its parameters, dependencies, and usage.
pyproject.toml
- Defined a new document extra dependency group for PDF/arXiv conversion tools.
- Added the dj-op entry point for the new CLI tool.
tests/ops/mapper/test_arxiv_to_markdown_mapper.py
- Added unit tests for the ArxivToMarkdownMapper, covering ID normalization and basic processing.
tests/ops/mapper/test_pdf_to_markdown_mapper.py
- Added unit tests for the PdfToMarkdownMapper, including PDF byte handling and basic conversion.
tools/run_op.py
- Created the dj-op command-line utility to execute Data-Juicer operators with specified samples and arguments.

Activity

The pull request is currently marked as Work In Progress (WIP), indicating active development by the author.
No specific review comments or other activities have been recorded yet.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces two useful new mappers, arxiv_to_markdown_mapper and pdf_to_markdown_mapper, for converting arXiv papers and PDF files into Markdown format. It also adds a convenient one-shot CLI tool, dj-op, for quickly testing operators. The implementation is well-structured, including new functionality, documentation, and unit tests. My review focuses on a few areas for improvement: enhancing security by using HTTPS for API calls, reducing code duplication to improve maintainability, increasing clarity in the documentation, and fixing a bug in the new CLI tool's output handling logic.

gemini-code-assist · 2026-02-14T07:46:11Z

tools/run_op.py

+    if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()):
+        result = out
+    else:
+        # Batched return: dict of lists -> take first
+        result = {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()}


The logic that detects whether an operator's output is batched is not robust: it treats any sample containing a list as batched output. For instance, a sample like {'text': '...', 'tokens': ['a', 'b']} would be mishandled and truncated (tokens becomes just 'a'). A more reliable approach is to check whether all values in the output dictionary are lists of length 1.

Suggested change

-    if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()):
-        result = out
-    else:
-        # Batched return: dict of lists -> take first
-        result = {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()}
+    # Check if the output is a batched sample (dict of lists of size 1)
+    is_batched = (isinstance(out, dict) and out and
+                  all(isinstance(v, list) and len(v) == 1 for v in out.values()))
+    if is_batched:
+        # Batched return: dict of lists -> take first
+        result = {k: v[0] for k, v in out.items()}
+    else:
+        result = out
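
To make the difference concrete, here is a small standalone comparison of the two checks. It is a sketch for illustration only: `original_logic` and `unwrap_single_sample` are names invented here, and the bodies mirror the original and suggested snippets rather than the actual contents of tools/run_op.py.

```python
from typing import Any


def original_logic(out: dict) -> dict:
    """The check as originally written: any list value triggers unwrapping."""
    if isinstance(out, dict) and not any(isinstance(v, list) for v in out.values()):
        return out
    return {k: (v[0] if isinstance(v, list) else v) for k, v in out.items()}


def unwrap_single_sample(out: Any) -> Any:
    """Suggested check: unwrap only a non-empty dict whose values are all length-1 lists."""
    is_batched = (
        isinstance(out, dict)
        and bool(out)
        and all(isinstance(v, list) and len(v) == 1 for v in out.values())
    )
    if is_batched:
        return {k: v[0] for k, v in out.items()}
    return out


sample = {"text": "hello", "tokens": ["a", "b"]}
print(original_logic(sample))        # the tokens list is collapsed to its first element
print(unwrap_single_sample(sample))  # the sample comes back unchanged
print(unwrap_single_sample({"text": ["hello"], "tokens": [["a", "b"]]}))
```

The stricter check is still a heuristic: an unbatched sample whose every field happens to be a one-element list would be unwrapped as well, so an explicit flag from the caller may be safer where one is available.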

gemini-code-assist · 2026-02-14T07:46:12Z

data_juicer/ops/mapper/arxiv_to_markdown_mapper.py

+ARXIV_PDF_URL_TEMPLATE = "https://arxiv.org/pdf/{arxiv_id}.pdf"
+ARXIV_ABS_URL_TEMPLATE = "https://arxiv.org/abs/{arxiv_id}"
+ARXIV_HTML_URL_TEMPLATE = "https://arxiv.org/html/{arxiv_id}"
+ARXIV_API_QUERY = "http://export.arxiv.org/api/query?id_list={arxiv_id}"


For security and to follow best practices, API requests should be made over HTTPS instead of HTTP. The arXiv API supports HTTPS.

Suggested change

-ARXIV_API_QUERY = "http://export.arxiv.org/api/query?id_list={arxiv_id}"
+ARXIV_API_QUERY = "https://export.arxiv.org/api/query?id_list={arxiv_id}"
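
One way to keep a plain-HTTP URL from creeping back in is to validate the scheme where the URL is built. The sketch below assumes nothing about the mapper beyond the `ARXIV_API_QUERY` template; `require_https` and `build_api_url` are hypothetical helpers, not functions from the PR.

```python
from urllib.parse import urlsplit

# Template with the suggested HTTPS scheme.
ARXIV_API_QUERY = "https://export.arxiv.org/api/query?id_list={arxiv_id}"


def require_https(url: str) -> str:
    """Hypothetical guard: reject any URL that is not served over HTTPS."""
    if urlsplit(url).scheme != "https":
        raise ValueError(f"refusing non-HTTPS URL: {url}")
    return url


def build_api_url(arxiv_id: str) -> str:
    """Fill in the template and confirm the result uses HTTPS."""
    return require_https(ARXIV_API_QUERY.format(arxiv_id=arxiv_id))
```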

gemini-code-assist · 2026-02-14T07:46:12Z

data_juicer/ops/mapper/pdf_to_markdown_mapper.py

+def _pdf_to_markdown_pdfplumber(pdf_bytes: bytes) -> str:
+    """Convert PDF bytes to plain text with minimal structure using pdfplumber."""
+    try:
+        with io.BytesIO(pdf_bytes) as f:
+            with pdfplumber.open(f) as pdf:
+                parts = []
+                for i, page in enumerate(pdf.pages):
+                    tables = page.find_tables()
+                    for table in tables:
+                        page = page.outside_bbox(table.bbox)
+                    text = page.extract_text()
+                    if not text:
+                        continue
+                    page_num = str(page.page_number)
+                    if text.rstrip().endswith(page_num):
+                        text = text.rstrip()[: -len(page_num)]
+                    if text.strip():
+                        parts.append(f"## Page {i + 1}\n\n{text.strip()}")
+                return "\n\n".join(parts) if parts else ""
+    except Exception as e:
+        logger.warning(f"pdfplumber failed to parse PDF: {e}")
+        return ""
+
+
+def _pdf_to_markdown_mineru(pdf_bytes: bytes, keep_images: bool = False) -> str:
+    """Convert PDF bytes to structured Markdown using MinerU (magic-pdf)."""
+    try:
+        from magic_pdf.data.data_reader_writer import FileBasedDataWriter
+        from magic_pdf.data.dataset import PymuDocDataset
+        from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+        from magic_pdf.config.enums import SupportedPdfParseMethod
+    except ImportError as e:
+        logger.warning(f"magic-pdf not available, fallback to pdfplumber: {e}")
+        return _pdf_to_markdown_pdfplumber(pdf_bytes)
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        image_dir = os.path.join(tmpdir, "images")
+        os.makedirs(image_dir, exist_ok=True)
+        image_writer = FileBasedDataWriter(image_dir)
+
+        try:
+            ds = PymuDocDataset(pdf_bytes)
+            if ds.classify() == SupportedPdfParseMethod.OCR:
+                infer_result = ds.apply(doc_analyze, ocr=True)
+                pipe_result = infer_result.pipe_ocr_mode(image_writer)
+            else:
+                infer_result = ds.apply(doc_analyze, ocr=False)
+                pipe_result = infer_result.pipe_txt_mode(image_writer)
+
+            image_dir_basename = "images"
+            md_content = pipe_result.get_markdown(image_dir_basename)
+            if not keep_images and md_content:
+                md_content = re.sub(r"!\[[^\]]*\]\([^)]+\)", "", md_content)
+                md_content = re.sub(r"\n{3,}", "\n\n", md_content).strip()
+            return md_content or _pdf_to_markdown_pdfplumber(pdf_bytes)
+        except Exception as e:
+            logger.warning(f"MinerU conversion failed, fallback to pdfplumber: {e}")
+            return _pdf_to_markdown_pdfplumber(pdf_bytes)


The helper functions _pdf_to_markdown_pdfplumber and _pdf_to_markdown_mineru are nearly identical to those in arxiv_to_markdown_mapper.py. To improve maintainability and adhere to the DRY (Don't Repeat Yourself) principle, it would be beneficial to refactor this shared logic into a common utility module, such as data_juicer/utils/pdf_utils.py.
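
A rough sketch of what such a shared module could look like follows. Everything here is an assumption made for illustration: the registry, `register_backend`, and `pdf_to_markdown` are hypothetical, and the two converters are stubs standing in for the real pdfplumber and MinerU helpers so that the example runs without either dependency. Unlike the PR, where the MinerU helper falls back internally, this variant centralizes the fallback in one dispatcher.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)

Converter = Callable[[bytes], str]
_BACKENDS: Dict[str, Converter] = {}


def register_backend(name: str) -> Callable[[Converter], Converter]:
    """Decorator that records a converter under a backend name."""
    def decorator(func: Converter) -> Converter:
        _BACKENDS[name] = func
        return func
    return decorator


@register_backend("pdfplumber")
def _pdfplumber_stub(pdf_bytes: bytes) -> str:
    # Stub standing in for _pdf_to_markdown_pdfplumber.
    return f"plain text ({len(pdf_bytes)} bytes)"


@register_backend("mineru")
def _mineru_stub(pdf_bytes: bytes) -> str:
    # Stub standing in for _pdf_to_markdown_mineru; simulates a missing dependency.
    raise ImportError("magic-pdf not installed")


def pdf_to_markdown(
    pdf_bytes: bytes, backend: str = "mineru", fallback: str = "pdfplumber"
) -> str:
    """Run the requested backend; use the fallback on failure or empty output."""
    if backend not in _BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    try:
        result = _BACKENDS[backend](pdf_bytes)
    except Exception as exc:
        logger.warning("%s failed, falling back to %s: %s", backend, fallback, exc)
        result = ""
    if result or backend == fallback:
        return result
    return _BACKENDS[fallback](pdf_bytes)
```

Both mappers could then import the single dispatcher, so a change to the fallback policy would need to be made in only one place.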

gemini-code-assist · 2026-02-14T07:46:12Z

docs/operators/mapper/arxiv_to_markdown_mapper.md

+| `keep_images` | bool | `False` | Keep image refs in Markdown (mineru only). 是否保留图片引用。 |
+| `timeout` | int | `60` | Request timeout (seconds). 请求超时（秒）。 |
+| `download_delay` | float | `1.0` | Delay before each download (rate limiting). 每次下载前延迟（秒）。 |
+| `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |


The default value for output_key is documented as text_key, which is the name of a parameter. This could be confusing for users. The operator's __init__ method shows the default is None, which then falls back to the value of self.text_key. It would be clearer to document the default as None and explain in the description that it defaults to the value of text_key.

Suggested change

-| `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |
+| `output_key` | str | `None` | Field to write Markdown to. Defaults to the value of `text_key`. 写入 Markdown 的字段名, 默认为 `text_key` 的值。 |
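
The fallback behaviour described in this comment can be illustrated with a minimal stand-in class. `MarkdownMapperSketch` and its `write` method are hypothetical and only mimic the `None`-then-`text_key` default the review attributes to the operator's `__init__`; they are not the operator's code.

```python
from typing import Optional


class MarkdownMapperSketch:
    """Hypothetical stand-in showing the output_key default resolution."""

    def __init__(self, text_key: str = "text", output_key: Optional[str] = None):
        self.text_key = text_key
        # None means "write to the same field as text_key".
        self.output_key = output_key if output_key is not None else self.text_key

    def write(self, sample: dict, markdown: str) -> dict:
        """Store the converted Markdown under the resolved output field."""
        sample[self.output_key] = markdown
        return sample
```

Documenting the default as `None` matches the signature, while the description explains the effective field name a user will see.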

gemini-code-assist · 2026-02-14T07:46:12Z

docs/operators/mapper/pdf_to_markdown_mapper.md

+| `pdf_key` | str | `"pdf"` | Field containing PDF path (str) or bytes. 存 PDF 路径或字节的字段名。 |
+| `backend` | str | `"mineru"` | `mineru` (推荐，MinerU 高精度) 或 `pdfplumber` (内置回退). 转换后端。 |
+| `keep_images` | bool | `False` | Keep image refs in Markdown (mineru only). 是否保留图片引用。 |
+| `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |


The default value for output_key is documented as text_key, which is the name of a parameter. This could be confusing for users. The operator's __init__ method shows the default is None, which then falls back to the value of self.text_key. It would be clearer to document the default as None and explain in the description that it defaults to the value of text_key.

Suggested change

-| `output_key` | str | `text_key` | Field to write Markdown to. 写入 Markdown 的字段名。 |
+| `output_key` | str | `None` | Field to write Markdown to. Defaults to the value of `text_key`. 写入 Markdown 的字段名, 默认为 `text_key` 的值。 |

WIP

12aa536

Co-authored-by: Cursor <cursoragent@cursor.com>

yxdyc requested a deployment to Testing February 14, 2026 07:42 — with GitHub Actions Waiting

yxdyc added the dj:op issues/PRs about some specific OPs label Feb 14, 2026

gemini-code-assist bot reviewed Feb 14, 2026

yxdyc wants to merge 1 commit into main from feature/arxiv-pdf-markdown-from-main
