Commit 44659cc
feat: add latex_figure_context_extractor_mapper operator
Add a new mapper operator that extracts figures and their citing context
from LaTeX source. It parses figure/figure*/wrapfigure environments,
handles subfigure environments and \subfigure/\subfloat commands, and
finds prose paragraphs that cite each figure via \ref/\cref/\autoref.
One input paper row fans out into N output figure rows (one per figure
or subfigure). Samples without figures are dropped.
Output fields: images, caption, label, citing_paragraphs,
parent_caption, parent_label.
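
The fan-out described above can be illustrated with a minimal sketch. This is not the operator's actual code: the function name `extract_figure_rows` and all regex names are hypothetical, subfigures are not handled (so `parent_caption` and `parent_label` stay empty), and captions are assumed to contain no nested braces.

```python
import re

# Hypothetical patterns for illustration; the real operator may differ.
FIGURE_ENV = re.compile(
    r"\\begin\{(figure\*?|wrapfigure)\}(.*?)\\end\{\1\}", re.DOTALL
)
IMAGE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]*)\}")
CAPTION = re.compile(r"\\caption\{([^{}]*)\}")  # flat captions only
LABEL = re.compile(r"\\label\{([^}]*)\}")
REF = re.compile(r"\\(?:ref|cref|autoref)\{([^}]*)\}")


def extract_figure_rows(latex: str) -> list:
    """Turn one paper's LaTeX source into one dict per figure."""
    # Prose is whatever remains once the figure environments are removed.
    prose = FIGURE_ENV.sub("", latex)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", prose) if p.strip()]

    rows = []
    for match in FIGURE_ENV.finditer(latex):
        body = match.group(2)
        caption = CAPTION.search(body)
        label = LABEL.search(body)
        label_text = label.group(1) if label else ""

        # A paragraph cites the figure if any \ref/\cref/\autoref key
        # (keys may be comma-separated) equals the figure's label.
        citing = []
        for paragraph in paragraphs:
            keys = [
                key.strip()
                for group in REF.findall(paragraph)
                for key in group.split(",")
            ]
            if label_text and label_text in keys:
                citing.append(paragraph)

        rows.append({
            "images": IMAGE.findall(body),
            "caption": caption.group(1) if caption else "",
            "label": label_text,
            "citing_paragraphs": citing,
            "parent_caption": "",  # only meaningful for subfigures
            "parent_label": "",
        })
    # An empty list corresponds to a sample that would be dropped.
    return rows
```

In this sketch a paper with N figure environments yields N dicts, and a paper with none yields an empty list, mirroring the "samples without figures are dropped" behavior.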
Includes:
- Operator implementation with recursive nested-brace regex support
- Config entry in config_all.yaml
- Registration in mapper __init__.py
- Comprehensive unit tests (21 test cases)
- Operator documentation (EN/CN)
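
The commit does not show how the nested-brace support is implemented. As an illustration only, one way to match balanced braces with Python's standard `re` module (which has no true recursion) is to build the pattern up to a fixed nesting depth. The names `nested_brace_pattern` and `find_caption` and the default depth of 3 are assumptions made for this sketch.

```python
import re


def nested_brace_pattern(depth: int = 3) -> str:
    """Build a regex for text with balanced braces nested up to `depth`."""
    # Innermost level: text containing no braces at all.
    pattern = r"[^{}]*"
    for _ in range(depth):
        # Each level accepts non-brace characters or a balanced {...}
        # group whose content matches the level below.
        pattern = r"(?:[^{}]|\{" + pattern + r"\})*"
    return pattern


# Captures the full argument of \caption{...}, including inner groups.
CAPTION = re.compile(r"\\caption\{(" + nested_brace_pattern() + r")\}")


def find_caption(latex: str):
    """Return the first caption's text, or None if there is none."""
    match = CAPTION.search(latex)
    return match.group(1) if match else None
```

A flat pattern such as `[^}]*` would stop at the first closing brace inside `\textbf{...}`; the layered pattern keeps going until the brace that closes `\caption`. Escaped braces (`\{`, `\}`) and nesting deeper than the chosen depth are not handled here.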
Made-with: Cursor
1 parent ae290f7, commit 44659cc
File tree
6 files changed, +1227 -1 lines changed
- data_juicer
- config
- ops/mapper
- docs
- operators/mapper
- tests/ops/mapper