Commit ebe11fb
feat: add latex_figure_context_extractor_mapper operator

Add a new mapper operator that extracts figures and their citing context
from LaTeX source. It parses figure/figure*/wrapfigure environments,
handles subfigure environments and \subfigure/\subfloat commands, and
finds prose paragraphs that cite each figure via \ref/\cref/\autoref.

One input paper row fans out into N output figure rows (one per figure
or subfigure). Samples without figures are dropped.

Output fields: images, caption, label, citing_paragraphs,
parent_caption, parent_label.

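
The fan-out described above can be illustrated with a minimal sketch. This is not the operator's code: `extract_figure_rows`, `cited_labels`, and the regexes are hypothetical names invented for illustration. The sketch handles only flat captions and top-level figure/figure*/wrapfigure environments (no subfigures), and it assumes `parent_caption` and `parent_label` are empty for top-level figures.

```python
import re
from typing import Dict, List, Set

# All names below are hypothetical; they only illustrate the described behaviour.
FIGURE_ENV = re.compile(
    r"\\begin\{(figure\*?|wrapfigure)\}(.*?)\\end\{\1\}", re.DOTALL
)
GRAPHICS = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^{}]+)\}")
CAPTION = re.compile(r"\\caption\{([^{}]*)\}")  # flat captions only in this sketch
LABEL = re.compile(r"\\label\{([^{}]+)\}")
REF = re.compile(r"\\(?:ref|cref|autoref)\{([^{}]+)\}")


def cited_labels(paragraph: str) -> Set[str]:
    """Collect every label cited in a paragraph; \\cref may list several keys."""
    keys: Set[str] = set()
    for match in REF.finditer(paragraph):
        keys.update(key.strip() for key in match.group(1).split(","))
    return keys


def extract_figure_rows(latex: str) -> List[Dict]:
    """Return one row per top-level figure; an empty list means 'drop the sample'."""
    # Strip figure bodies first so a figure is never counted as citing itself.
    prose = FIGURE_ENV.sub("", latex)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", prose) if p.strip()]

    rows = []
    for env in FIGURE_ENV.finditer(latex):
        body = env.group(2)
        label = LABEL.search(body)
        caption = CAPTION.search(body)
        label_text = label.group(1) if label else ""
        rows.append({
            "images": GRAPHICS.findall(body),
            "caption": caption.group(1) if caption else "",
            "label": label_text,
            "citing_paragraphs": [
                p for p in paragraphs if label_text and label_text in cited_labels(p)
            ],
            "parent_caption": "",  # assumed empty for non-subfigure rows
            "parent_label": "",
        })
    return rows


sample = r"""
Intro paragraph without citations.

As shown in Figure~\ref{fig:arch}, the model has two parts.

\begin{figure}
\includegraphics[width=\linewidth]{arch.png}
\caption{Model architecture.}
\label{fig:arch}
\end{figure}

Results are discussed later.
"""

rows = extract_figure_rows(sample)
```

Here `rows` holds a single entry for `fig:arch` whose `citing_paragraphs` contains only the paragraph with `\ref{fig:arch}`. A paper with no figure environment yields an empty list, which corresponds to the sample being dropped.
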
Includes:
- Operator implementation with recursive nested-brace regex support
- Config entry in config_all.yaml
- Registration in mapper __init__.py
- Comprehensive unit tests (21 test cases)
- Operator documentation (EN/CN)

1 parent ae290f7, commit ebe11fb
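
Captions often contain nested braces (for example `\textbf{...}` inside `\caption{...}`), which is why the list above mentions nested-brace support. The sketch below shows one way to read a balanced brace group. It deliberately uses a plain depth counter instead of a recursive regex, so it is a swapped-in technique and not the approach taken in this commit. `read_braced` and `extract_caption` are hypothetical helper names, and the optional short form `\caption[short]{long}` is not handled.

```python
from typing import Optional


def read_braced(text: str, open_idx: int) -> str:
    """Return the content of the balanced brace group opening at text[open_idx]."""
    if open_idx >= len(text) or text[open_idx] != "{":
        raise ValueError("expected '{' at open_idx")
    depth = 0
    i = open_idx
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Skip the escaped character so \{ and \} do not affect the depth.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[open_idx + 1:i]
        i += 1
    raise ValueError("unbalanced braces")


def extract_caption(body: str) -> Optional[str]:
    """Return the full caption text, nested groups included, or None if absent."""
    marker = "\\caption{"
    start = body.find(marker)
    if start == -1:
        return None
    # The group opens at the last character of the marker.
    return read_braced(body, start + len(marker) - 1)


nested = extract_caption(
    r"\caption{Accuracy of \textbf{our {nested} model} vs.\ baseline.}\label{fig:acc}"
)
escaped = extract_caption(r"\caption{Set \{a, b\} size}")
```

A flat pattern such as `\\caption\{([^{}]*)\}` would fail to match the first caption at all, whereas the depth counter returns the full text up to the matching closing brace.
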
File tree
5 files changed, +1265 -0 lines changed
- data_juicer
- config
- ops/mapper
- docs/operators/mapper
- tests/ops/mapper