Commit fa404a2

updating informal algorithm description
anwala committed
1 parent 172d177, commit fa404a2

2 files changed, +4 -3 lines changed

.dockerignore (+1)
@@ -5,6 +5,7 @@ __pycache__
 .gitignore
 
 Dockerfile
+informal_alg.txt
 LICENSE
 README.md
 setup.py

informal_alg.txt (+3 -3)
@@ -1,4 +1,4 @@
-Highly informal algorithm description of sumgram
+Highly informal algorithm description of Sumgram (2020-07-15)
 
 Step 1
 1. Add plain_text into doc_lst
@@ -7,7 +7,7 @@ Step 1
 
 Step 2
 1. extract_top_ngrams(): Extract top n-grams from text, top is defined by top DF (for multiple documents) or top TF (for single documents)
-2. pos_glue_split_ngrams(): If Stanford CoreNLP server (POS tagger) is active: replace children subset ngrams (e.g., "national hurricane") with superset parent multi-word proper noun (e.g., "national hurricane center") extracted by multi_word_proper_nouns(). Subset means overlap is 1.0 and match order is preserved (e.g., "hurricane national" is NOT subset of "national hurricane center" since even though overlap is 1.0, but match is out of order)
+2. pos_glue_split_ngrams(): If Stanford CoreNLP server (POS tagger) is active: replace children subset ngrams (e.g., "national hurricane") with superset. parent multi-word proper noun (e.g., "national hurricane center") extracted by extract_proper_nouns(). Subset means overlap is 1.0 and match order is preserved (e.g., "hurricane national" is NOT subset of "national hurricane center" since even though overlap is 1.0, but match is out of order)
 3. mvg_window_glue_split_ngrams()
 4. rm_subset_top_ngrams()
 
@@ -41,7 +41,7 @@ for ngram in top_ngrams[:x]
 both = 'federal emergency management agency'
 
 if k = 2,
-left = 'the federal emergency management',#comma counts as word
+left = 'the federal emergency management',#commas counts as words
 right = 'emergency management agency,'
 both = 'the federal emergency management agency,'
 '''
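
The k-word window shown in the last hunk can be sketched in Python as follows. This is a minimal illustration, not the code in the repository: the function names (tokenize, join_tokens, expand_window), the regex tokenizer, and the example sentence are all assumptions. The only behaviour taken from the diff is that punctuation such as a comma counts as its own word, and that left, right, and both are built by extending the n-gram by k tokens.

```python
import re


def tokenize(text):
    # Assumed tokenizer: words and punctuation marks are separate tokens,
    # so a comma counts as one word, as the diff comment states.
    return re.findall(r"\w+|[^\w\s]", text)


def join_tokens(tokens):
    # Re-join tokens, attaching punctuation to the preceding word.
    out = ""
    for tok in tokens:
        if out and re.fullmatch(r"\w+", tok):
            out += " " + tok
        else:
            out += tok
    return out


def expand_window(sentence, ngram, k):
    """Extend the first occurrence of ngram by k tokens to the left,
    to the right, and on both sides (hypothetical helper)."""
    tokens = tokenize(sentence.lower())
    target = tokenize(ngram.lower())
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            start = max(0, i - k)
            end = min(len(tokens), i + n + k)
            return {
                "left": join_tokens(tokens[start:i + n]),
                "right": join_tokens(tokens[i:end]),
                "both": join_tokens(tokens[start:end]),
            }
    return None  # ngram does not occur in the sentence


# Hypothetical sentence chosen to reproduce the values in the diff.
sentence = "Officials said the federal emergency management agency, along with states, responded."
print(expand_window(sentence, "emergency management", 1)["both"])
print(expand_window(sentence, "emergency management", 2))
```

With k = 2 the comma after "agency" consumes one of the two right-hand slots, which is why right ends in 'agency,' rather than reaching the next word.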

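The subset rule in Step 2, item 2 (overlap of 1.0 with match order preserved) can be sketched like this. The names overlap, in_order, is_subset, and glue_to_parent are hypothetical and are not taken from the repository. The sketch also assumes that "order preserved" means the child's words appear in the parent in the same relative order, not necessarily adjacent; the description does not say whether adjacency is required.

```python
def overlap(child_toks, parent_toks):
    # Fraction of the child's distinct tokens that also occur in the parent.
    child_set = set(child_toks)
    return len(child_set & set(parent_toks)) / len(child_set)


def in_order(child_toks, parent_toks):
    # True if child_toks is a subsequence of parent_toks.
    # "tok in remaining" advances the iterator, so order is enforced.
    remaining = iter(parent_toks)
    return all(tok in remaining for tok in child_toks)


def is_subset(child, parent):
    """Child n-gram is a subset of the parent if overlap is 1.0
    and the match order is preserved (assumed reading of the rule)."""
    c = child.lower().split()
    p = parent.lower().split()
    if not c:
        return False
    return overlap(c, p) == 1.0 and in_order(c, p)


def glue_to_parent(ngrams, proper_nouns):
    # Hypothetical helper: replace each child n-gram with the first
    # multi-word proper noun it is a subset of, dropping duplicates.
    result = []
    for ng in ngrams:
        parent = next((pn for pn in proper_nouns if is_subset(ng, pn)), ng)
        if parent not in result:
            result.append(parent)
    return result


print(is_subset("national hurricane", "national hurricane center"))  # True
print(is_subset("hurricane national", "national hurricane center"))  # False
print(glue_to_parent(["national hurricane", "storm surge"],
                     ["national hurricane center"]))
```

The second call returns False even though every word of the child occurs in the parent, which matches the "hurricane national" counterexample in the description.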