Commit fe16ee2

Nat Taylor committed: minor tweaks to the tokenization regex
1 parent cd74573 commit fe16ee2

4 files changed: +1387 -624 lines changed

README.md

Lines changed: 9 additions & 1 deletion
@@ -14,4 +14,12 @@ Basically create an instance of `Teaser()` then pass it either a URL or a text/t
 - I tried to carefully document the class, but it needs more detail. This is coming soon.
 - (Obviously) This relies on the source text having some good sentences that summarize it. Without that, our summary will suck.
 - Based on https://github.com/xiaoxu193/PyTeaser based on http://www.textteaser.com/
-- What would make this a lot better? Tweaking the scoring, duh!
+- What would make this a lot better? Tweaking the scoring, duh!
+
+##TODO##
+- Add synonyms to the headline list
+- Try to do some NLP like:
+-- Stemming: https://github.com/camspiers/porter-stemmer
+-- Morphology: https://github.com/heromantor/phpmorphy
+-- WordNet: http://www.foxsurfer.com/wordnet/
+

class.phpteaser.php

Lines changed: 3 additions & 2 deletions
@@ -212,7 +212,7 @@ function splitSentences($text) {
     $re = '/# Split sentences on whitespace between them.
     (?<=                # Begin positive lookbehind.
       [.!?]             # Either an end of sentence punct,
-    | [.!?][\'"]        # or end of sentence punct and quote.
+    | [.!?][\'"]        # or end of sentence punct and quote.
     )                   # End positive lookbehind.
     (?<!                # Begin negative lookbehind.
       Mr\.              # Skip either "Mr."
@@ -223,10 +223,11 @@ function splitSentences($text) {
     | Prof\.            # or "Prof.",
     | Sr\.              # or "Sr.",
     | T\.V\.A\.         # or "T.V.A.",
+    | [A-Z]\.           # or Middle Initial
                         # or... (you get the idea).
     )                   # End negative lookbehind.
     \s+                 # Split on whitespace between sentences.
-    /ix';
+    /x';

     $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
     return $sentences;
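
Reviewer note: to see what the two regex changes do together, here is a reduced sketch of the splitter as it stands after this commit. The function name `splitSentencesSketch` is made up for illustration, and the abbreviation list contains only the entries visible in the diff (the real pattern has more between the two hunks). Dropping the `i` modifier looks necessary for the new alternative: under case-insensitive matching, `[A-Z]\.` would also match a lowercase letter before a period and suppress nearly every split.

```php
<?php
// Reduced sketch of the post-commit splitter. Hypothetical helper name;
// only abbreviations shown in the diff are listed here.
function splitSentencesSketch(string $text): array
{
    $re = '/
        (?<=                # Positive lookbehind:
          [.!?]             # end of sentence punctuation,
        | [.!?][\'"]        # or punctuation followed by a quote.
        )
        (?<!                # Negative lookbehind, no split after:
          Mr\.              # "Mr.",
        | Prof\.            # "Prof.",
        | Sr\.              # "Sr.",
        | T\.V\.A\.         # "T.V.A.",
        | [A-Z]\.           # or a single capital letter plus period.
        )
        \s+                 # Split on the whitespace itself.
        /x';                // No "i" modifier, so [A-Z] is uppercase-only.

    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// A middle initial no longer ends a sentence:
print_r(splitSentencesSketch('He met John F. Kennedy today. It was great.'));
// ['He met John F. Kennedy today.', 'It was great.']

// Trade-off: a sentence that ends in a single capital is not split either.
print_r(splitSentencesSketch('We chose plan B. It worked.'));
// ['We chose plan B. It worked.']
```

Two side effects are worth keeping in mind. Any sentence ending in a lone capital letter ("plan B.") now merges with the next one. And because matching is case-sensitive again, lowercase variants such as "mr." are no longer protected by the abbreviation list.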
