Commit 3b47d82
feat(mapper): add custom tokenizer support to RemoveRepeatSentencesMapper
The built-in regex sentence splitter treats every period followed by a
non-quote character as a sentence boundary, which incorrectly splits
text containing decimal numbers (e.g. "2.5 kg"), abbreviations, and
version numbers. When these fragments are independently deduplicated,
the resulting text is corrupted.
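
The failure mode described above can be reproduced with a small sketch. The regex below is a stand-in written for illustration (the mapper's actual pattern is not shown in this commit message), and `naive_split` and `dedup_fragments` are hypothetical helper names. It splits after every period that is followed by a non-quote character, then drops repeated fragments.

```python
import re

# Assumed stand-in for the built-in splitter: break after any period
# that is followed by a character other than a quote.
NAIVE_BOUNDARY = re.compile(r'(?<=\.)(?=[^"\'])')


def naive_split(text):
    """Split text at every period followed by a non-quote character."""
    return NAIVE_BOUNDARY.split(text)


def dedup_fragments(text):
    """Drop fragments whose stripped form was already seen, then rejoin."""
    seen = set()
    kept = []
    for fragment in naive_split(text):
        key = fragment.strip()
        if key in seen:
            continue
        seen.add(key)
        kept.append(fragment)
    return "".join(kept)


text = "Use 2.5 kg. Then add 3.5 kg."
print(naive_split(text))      # the decimals are cut in half
print(dedup_fragments(text))  # the second "5 kg." is dropped as a "repeat"
```

With this input the two decimals each produce a "5 kg." fragment, so the second one is treated as a duplicate and the output ends in a dangling "3.".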
Add a `tokenizer` parameter that accepts a custom sentence tokenizer
to override the default regex splitter. The tokenizer can be:
- A Python callable (for API usage), e.g. `nltk.sent_tokenize`
- A lambda string (for YAML configs), e.g.
`"lambda text: __import__('nltk').sent_tokenize(text)"`
- None (default) to preserve existing behavior
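
As a rough illustration of the callable form, any function that takes a string and returns a list of sentences should fit. The tokenizer below is a hypothetical, dependency-free example (not part of this commit) that only splits when sentence-ending punctuation is followed by whitespace and an uppercase letter, so decimals and version numbers stay intact. The constructor call in the trailing comment assumes the `tokenizer` keyword described above.

```python
import re

# Hypothetical tokenizer: a boundary is sentence-ending punctuation,
# then whitespace, then an uppercase letter. "2.5" never matches
# because no whitespace follows the period.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def simple_sent_tokenize(text):
    """Return a list of sentences without splitting inside numbers."""
    return [s for s in SENTENCE_BOUNDARY.split(text) if s]


sentences = simple_sent_tokenize("Use 2.5 kg. Then add 3.5 kg.")
print(sentences)

# Assumed usage, based on the parameter named in this commit message:
#   op = RemoveRepeatSentencesMapper(tokenizer=simple_sent_tokenize)
```

This heuristic still mis-splits abbreviations followed by a capitalized word (for example "Dr. Smith"), which is one reason a trained tokenizer such as `nltk.sent_tokenize` may be the better choice.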
Lambda strings are validated using `ast.parse`, following the same
pattern as `PythonLambdaMapper`.
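
A minimal sketch of what `ast.parse`-based validation of a lambda string can look like is given here. It is an assumption about the general approach, not a copy of the code in this commit or in `PythonLambdaMapper`, and `load_lambda_tokenizer` is a hypothetical name. The string is parsed as a single expression, rejected unless it is a one-argument lambda, and only then compiled and evaluated.

```python
import ast


def load_lambda_tokenizer(lambda_str):
    """Validate a lambda string and return the callable it defines."""
    # mode="eval" accepts exactly one expression; statements such as
    # "import os" raise SyntaxError here.
    tree = ast.parse(lambda_str, mode="eval")

    # The top-level expression must be a lambda, not a call or a literal.
    if not isinstance(tree.body, ast.Lambda):
        raise ValueError("tokenizer string must be a lambda expression")

    # A sentence tokenizer takes exactly one positional argument (the text).
    if len(tree.body.args.args) != 1:
        raise ValueError("tokenizer lambda must take exactly one argument")

    # Evaluating the expression creates the function object; the lambda
    # body itself runs only when the tokenizer is called.
    return eval(compile(tree, "<tokenizer>", "eval"))


tokenize = load_lambda_tokenizer("lambda text: text.split('. ')")
print(tokenize("First part. Second part"))
```

This check constrains the shape of the expression only. The lambda body can still run arbitrary code when called, so lambda strings in YAML configs should come from trusted sources.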
Made-with: Cursor

1 parent fa4d7b2, commit 3b47d82

2 files changed (+89, -1):
- data_juicer/ops/mapper
- tests/ops/mapper