Commit 44659cc

feat: add latex_figure_context_extractor_mapper operator
Add a new mapper operator that extracts figures and their citing context from LaTeX source. It parses figure/figure*/wrapfigure environments, handles subfigure environments and \subfigure/\subfloat commands, and finds prose paragraphs that cite each figure via \ref/\cref/\autoref.

One input paper row fans out into N output figure rows (one per figure or subfigure). Samples without figures are dropped.

Output fields: images, caption, label, citing_paragraphs, parent_caption, parent_label.

Includes:
- Operator implementation with recursive nested-brace regex support
- Config entry in config_all.yaml
- Registration in mapper __init__.py
- Comprehensive unit tests (21 test cases)
- Operator documentation (EN/CN)

Made-with: Cursor
1 parent ae290f7 commit 44659cc
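
The parsing step described in the commit message can be illustrated with a minimal, standalone sketch. It is not the operator's actual code: the names `read_braced`, `command_arg`, and `extract_figures` are hypothetical, subfigures are ignored, and a simple brace counter stands in for the recursive nested-brace regex the commit mentions (Python's standard `re` module has no recursive patterns).

```python
import re

# Matches figure, figure*, and wrapfigure environments (no subfigure handling here).
FIGURE_ENV = re.compile(
    r"\\begin\{(figure\*?|wrapfigure)\}(.*?)\\end\{\1\}",
    re.DOTALL,
)


def read_braced(text, open_idx):
    """Return the content of the brace group that opens at text[open_idx]."""
    depth = 0
    i = open_idx
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[open_idx + 1 : i]
        i += 1
    return None  # unbalanced braces


def command_arg(body, command):
    """First braced argument of \\command in body, skipping one optional [..] argument."""
    m = re.search(r"\\" + re.escape(command) + r"\s*(?:\[[^\]]*\])?\s*\{", body)
    if m is None:
        return None
    return read_braced(body, m.end() - 1)


def extract_figures(latex):
    """Return one dict per figure environment with images, caption, and label."""
    figures = []
    for m in FIGURE_ENV.finditer(latex):
        body = m.group(2)
        figures.append(
            {
                "images": re.findall(
                    r"\\includegraphics\s*(?:\[[^\]]*\])?\s*\{([^}]*)\}", body
                ),
                "caption": command_arg(body, "caption"),
                "label": command_arg(body, "label"),
            }
        )
    return figures


SAMPLE = r"""
\begin{figure}
  \includegraphics[width=\linewidth]{plots/loss.pdf}
  \caption{Training loss for \textbf{model {A}} over time.}
  \label{fig:loss}
\end{figure}
"""

figures = extract_figures(SAMPLE)
```

The brace counter keeps the caption intact even when it contains nested groups such as `\textbf{model {A}}`, which a non-recursive pattern like `\{[^}]*\}` would cut short.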

6 files changed, +1227 -1 lines changed

data_juicer/config/config_all.yaml

Lines changed: 8 additions & 0 deletions
@@ -198,6 +198,14 @@ process:
       model_params: {} # Parameters for initializing the API model.
       sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
   - expand_macro_mapper: # expand macro definitions in Latex text.
+  - latex_figure_context_extractor_mapper: # Extract figures and their citing context from LaTeX source.
+      citation_commands: ['\ref', '\cref', '\Cref', '\autoref'] # LaTeX reference commands to search for citing paragraphs.
+      paragraph_separator: '\n\n' # Pattern for splitting LaTeX text into paragraphs.
+      caption_key: 'caption' # Output field name for the figure caption.
+      label_key: 'label' # Output field name for the LaTeX label.
+      context_key: 'citing_paragraphs' # Output field name for citing paragraphs.
+      parent_caption_key: 'parent_caption' # Output field name for the parent figure's caption (subfigures only).
+      parent_label_key: 'parent_label' # Output field name for the parent figure's label (for grouping subfigures).
   - extract_entity_attribute_mapper: # Extract attributes for given entities from the text.
       api_model: 'gpt-4o' # API model name.
       query_entities: ["孙悟空", "猪八戒"] # Entity list to be queried.
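
The `citation_commands`, `paragraph_separator`, and `context_key` settings in the hunk above suggest how citing paragraphs might be located and attached to each figure row. The following is a rough sketch under stated assumptions, not the operator's implementation: the helper names `find_citing_paragraphs` and `fan_out` are hypothetical, and `paragraph_separator` is assumed to be a regular expression (the config comment only calls it a "pattern").

```python
import re

# Defaults mirror the config entry; treating the separator as a regex is an assumption.
DEFAULT_COMMANDS = (r"\ref", r"\cref", r"\Cref", r"\autoref")


def find_citing_paragraphs(
    latex, label, citation_commands=DEFAULT_COMMANDS, paragraph_separator=r"\n\n"
):
    """Return the paragraphs of latex that reference label via one of the commands."""
    commands = "|".join(re.escape(c) for c in citation_commands)
    # A reference may list several comma-separated keys, e.g. \cref{fig:a,fig:b}.
    pattern = re.compile(r"(?:" + commands + r")\*?\s*\{([^}]*)\}")
    hits = []
    for para in re.split(paragraph_separator, latex):
        keys = {
            key.strip()
            for m in pattern.finditer(para)
            for key in m.group(1).split(",")
        }
        if label in keys:
            hits.append(para.strip())
    return hits


def fan_out(latex, figures, label_key="label", context_key="citing_paragraphs"):
    """Produce one output row per figure, adding its citing paragraphs."""
    return [
        {**fig, context_key: find_citing_paragraphs(latex, fig[label_key])}
        for fig in figures
    ]


PAPER = (
    "As \\cref{fig:loss,fig:acc} shows, training converges.\n\n"
    "This paragraph cites nothing.\n\n"
    "Figure~\\ref{fig:acc} reports accuracy."
)

rows = fan_out(PAPER, [{"label": "fig:loss"}, {"label": "fig:acc"}])
```

In this sketch one paper string yields two rows: the row for `fig:loss` carries only the first paragraph, while the row for `fig:acc` carries the first and third.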

data_juicer/ops/mapper/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -47,6 +47,7 @@
 from .imgdiff_difference_caption_generator_mapper import (
     Difference_Caption_Generator_Mapper,
 )
+from .latex_figure_context_extractor_mapper import LatexFigureContextExtractorMapper
 from .mllm_mapper import MllmMapper
 from .nlpaug_en_mapper import NlpaugEnMapper
 from .nlpcda_zh_mapper import NlpcdaZhMapper
@@ -159,6 +160,7 @@
     "ImageSegmentMapper",
     "ImageTaggingMapper",
     "ImageTaggingVLMMapper",
+    "LatexFigureContextExtractorMapper",
     "MllmMapper",
     "NlpaugEnMapper",
     "NlpcdaZhMapper",
