As far as reviews are concerned, the official Goodreads API typically gets you a maximum of 300 short excerpts. Goodreads does not use this API on its own website; it is a side project. The website displays reviews through other mechanisms (AJAX endpoints in this case), and the Toolbox programs use these mechanisms too. They have their own limitations: you cannot see all reviews, but you can search a book's reviews by keyword and/or filter them by number of stars, age, etc. Toolbox programs such as savreviews.pl or likeminded.pl use the filters and also run a dictionary against this search in order to collect reviews.
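The dictionary approach can be pictured as a loop over search terms whose hits are merged by review ID. The following is a minimal Python sketch of that idea, not the Toolbox's actual Perl code: `search_reviews` is a hypothetical stand-in for one keyword search against a book's reviews, `collect_reviews` is an illustrative name, and the review IDs and texts are made-up sample data.

```python
from typing import Callable, Dict, Iterable

# Made-up sample data: review ID -> review text (illustration only).
SAMPLE_REVIEWS = {
    1: "the waves and the lighthouse",
    2: "a slow but beautiful novel",
    3: "beautiful prose, and the ending stays with you",
}


def search_reviews(term: str) -> Dict[int, str]:
    """Hypothetical keyword search: returns reviews whose text contains the term."""
    return {rid: text for rid, text in SAMPLE_REVIEWS.items() if term in text}


def collect_reviews(
    dictionary: Iterable[str],
    search: Callable[[str], Dict[int, str]],
) -> Dict[int, str]:
    """Run every dictionary term as a search and merge the hits by review ID."""
    collected: Dict[int, str] = {}
    for term in dictionary:
        # Keying by review ID means a review found by several terms is kept once.
        collected.update(search(term))
    return collected


found = collect_reviews(["the", "beautiful"], search_reviews)
print(sorted(found))
```

No single term finds every sample review here, but the union of both searches does, which is why a longer dictionary tends to raise coverage at the cost of more searches.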
Dictionary | Lines | Minutes | "To the Lighthouse" (5514 text reviews) | "Mrs Dalloway" (7376 text reviews) |
---|---|---|---|---|
none (filters only) | - | | 948 or 17% | untested |
gram-en-l.lst | 3349 | 111 | 3057 or 55% | untested |
gram-en-s.lst | 390 | | untested | untested |
word-en-1k.lst | 1000 | 33 | 4962 or 90% | 6413 or 87% |
word-en-s.lst | 114 | | untested | untested |
gram-en-s,word-en-1k.lst | 1390 | | untested | untested |
gram-en-l,word-en-1k.lst | 4349 | 144 | 5127 or 93% | 6715 or 91% |
The counts contain no duplicate reviewers, but could theoretically contain duplicate reviews posted by different members, which Goodreads would count too.
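The percentages in the table, and the trade-off between coverage and search time, can be recomputed with a few lines of Python. This sketch only reuses the "To the Lighthouse" figures from the table; the function names are illustrative, and it assumes the Minutes column is the total search time of a run.

```python
# Figures copied from the table above for "To the Lighthouse" (5514 text reviews).
TOTAL_TEXT_REVIEWS = 5514

runs = {
    "gram-en-l.lst": {"minutes": 111, "reviews": 3057},
    "word-en-1k.lst": {"minutes": 33, "reviews": 4962},
    "gram-en-l,word-en-1k.lst": {"minutes": 144, "reviews": 5127},
}


def coverage_percent(found: int, total: int) -> int:
    """Share of all text reviews that a run collected, rounded to whole percent."""
    return round(100 * found / total)


def reviews_per_minute(found: int, minutes: int) -> float:
    """Average number of reviews collected per minute of search time."""
    return found / minutes


for name, run in runs.items():
    pct = coverage_percent(run["reviews"], TOTAL_TEXT_REVIEWS)
    rate = reviews_per_minute(run["reviews"], run["minutes"])
    print(f"{name}: {pct}% coverage, {rate:.1f} reviews per minute")
```

Read this way, the combined dictionary adds a few percentage points of coverage over word-en-1k.lst while taking several times as long.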
File names follow the pattern ${TYPE4LETTERCODE}-${LANGUAGE2LETTERCODE}-${SIZE}.lst, where the size l means large dictionaries, s means small dictionaries, 1k means 1000 lines and 3k means 3000 lines; the extension lst means "list". Lists are ASCII files with one word per line. A comma denotes combined dictionaries, e.g., gram-en-l,word-en-1k.lst.
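To make the naming scheme concrete, here is a small Python sketch that splits a dictionary file name into its parts, including comma-combined names. `parse_dictionary_name` is a hypothetical helper written for this illustration and is not part of the Toolbox; it assumes that only the last name in a combination carries the .lst extension, as in the example above.

```python
from typing import Dict, List


def parse_dictionary_name(filename: str) -> List[Dict[str, str]]:
    """Split a (possibly combined) dictionary file name into its parts."""
    if not filename.endswith(".lst"):
        raise ValueError(f"expected a .lst file name, got {filename!r}")
    stem = filename[: -len(".lst")]
    parts = []
    # A comma joins several dictionaries; only the last one carries the extension.
    for name in stem.split(","):
        fields = name.split("-")
        if len(fields) != 3:
            raise ValueError(f"expected TYPE-LANGUAGE-SIZE, got {name!r}")
        type_code, language, size = fields
        parts.append({"type": type_code, "language": language, "size": size})
    return parts


print(parse_dictionary_name("gram-en-l,word-en-1k.lst"))
```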
Smaller dictionaries are usually a subset of the larger ones, so you should start testing with the smaller ones. Since all Toolbox programs cache their results for some days, switching to a larger dictionary afterwards will not waste time re-downloading results that are already present.
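Because a smaller list is usually a subset of a larger one, the extra work after switching is roughly the set of terms that have not been searched yet. The Python sketch below illustrates this with made-up miniature word lists; `new_terms` is a hypothetical helper, and the Toolbox's real cache works on downloaded results rather than on term lists.

```python
from typing import List


def new_terms(small: List[str], large: List[str]) -> List[str]:
    """Terms of the larger dictionary that the smaller one has not searched yet."""
    already_searched = set(small)
    # Keep the order of the larger list, which puts its preferred terms first.
    return [term for term in large if term not in already_searched]


# Made-up miniature dictionaries (illustration only).
small_list = ["the", "and", "ing"]
large_list = ["the", "and", "ing", "ion", "ent"]

print(new_terms(small_list, large_list))
```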
Most frequent English n-grams first.
Most frequent English trigrams from gram-en-l.lst, tested against Harry Potter reviews: I only saved trigrams which led to 10-30 unique(!) hits, best first. Appended are the most frequent English trigrams which are not already present in the Harry Potter set. Works better with a larger set of available reviews. Randomization yielded no improvements (rather the opposite). Often seems as good as the whole gram-en-l.lst.
Most frequent English words first. Performed better than the n-gram-based dictionaries.
Few more results than word-en-1k.lst alone, but far more search time (1000 vs. 4349 lines).
A symlink to any of the other dictionary files. Toolbox programs default to this dictionary-symlink, so you can change it for all programs at once.
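The construction of the small trigram dictionary described above (keep trigrams with 10-30 unique hits, best first, then append frequent trigrams that are not yet present) can be sketched in Python as follows. This is an illustration under stated assumptions, not the script that produced the file: `build_small_dictionary` is a hypothetical name, the hit counts are made up, and "best first" is read here as "most unique hits first".

```python
from typing import Dict, List


def build_small_dictionary(
    unique_hits: Dict[str, int],
    frequent_trigrams: List[str],
    min_hits: int = 10,
    max_hits: int = 30,
) -> List[str]:
    """Keep trigrams with min_hits..max_hits unique hits, best first, then
    append the remaining frequent trigrams in their original order."""
    selected = [t for t, hits in unique_hits.items() if min_hits <= hits <= max_hits]
    # "Best first" is read here as "most unique hits first" (an assumption).
    selected.sort(key=lambda t: unique_hits[t], reverse=True)
    present = set(selected)
    for trigram in frequent_trigrams:
        if trigram not in present:
            selected.append(trigram)
            present.add(trigram)
    return selected


# Made-up hit counts (illustration only).
hits = {"the": 30, "ing": 12, "and": 45, "ion": 25, "xyz": 2}
frequent = ["the", "and", "ing", "ion"]
print(build_small_dictionary(hits, frequent))
```

In this toy run, "and" falls outside the 10-30 window but is still appended at the end because it appears in the frequency list, while "xyz" is dropped entirely.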