This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

Dictionaries

Purpose

As far as reviews are concerned, the official Goodreads API typically returns a maximum of 300 short excerpts. Goodreads does not use this API on its own website; the API is a side project. The website displays reviews through other mechanisms (AJAX endpoints in this case), and the Toolbox programs use those same mechanisms. They have limitations of their own: you cannot list all reviews of a book, but you can search a book's reviews by keyword and/or filter them by number of stars, age, etc. Toolbox programs such as savreviews.pl or likeminded.pl apply these filters and additionally run a dictionary against the keyword search in order to collect reviews.
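The keyword-and-filter approach can be sketched roughly as follows. This is only an illustration under assumptions: `collect_reviews`, the `search` callback and the review records with an `id` key are hypothetical names, the real AJAX endpoints are not modeled, and this is not the Toolbox's actual code.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical shape of one search hit: a dict with at least an "id" key.
Review = Dict[str, str]


def collect_reviews(
    search: Callable[[str, int], List[Review]],
    words: Iterable[str],
    star_filters: Iterable[int] = (1, 2, 3, 4, 5),
) -> Dict[str, Review]:
    """Union the hits of many keyword/star searches, keyed by review id."""
    stars_list = list(star_filters)  # reused for every word
    found: Dict[str, Review] = {}
    for word in words:
        for stars in stars_list:
            for review in search(word, stars):
                # setdefault keeps the first copy, so overlapping hits
                # from different keywords are stored only once.
                found.setdefault(review["id"], review)
    return found


# Stand-in for the real search: a tiny in-memory corpus matched by word.
corpus = [
    {"id": "r1", "stars": "5", "text": "the lighthouse is wonderful"},
    {"id": "r2", "stars": "3", "text": "not the easiest read"},
    {"id": "r3", "stars": "5", "text": "wonderful prose"},
]


def fake_search(word: str, stars: int) -> List[Review]:
    return [
        r for r in corpus
        if r["stars"] == str(stars) and word in r["text"].split()
    ]


reviews = collect_reviews(fake_search, ["the", "wonderful"])
print(sorted(reviews))
```

Each additional dictionary line costs one more round of searches, which is why the number of lines drives the run time.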

Results

| Dictionary | Lines | Minutes | "To the Lighthouse" (5514 text reviews) | "Mrs Dalloway" (7376 text reviews) |
|---|---|---|---|---|
| none (filters only) | - | | 948 or 17% | untested |
| gram-en-l.lst | 3349 | 111 | 3057 or 55% | untested |
| gram-en-s.lst | 390 | | untested | untested |
| word-en-1k.lst | 1000 | 33 | 4962 or 90% | 6413 or 87% |
| word-en-s.lst | 114 | | untested | untested |
| gram-en-s,word-en-1k.lst | 1390 | | untested | untested |
| gram-en-l,word-en-1k.lst | 4349 | 144 | 5127 or 93% | 6715 or 91% |

The results contain no duplicate reviewers, but they could theoretically contain duplicate reviews posted by different members; Goodreads would count those too.
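The percentages in the results are the number of collected reviews divided by the book's total number of text reviews. As a small worked example, the following reproduces the "To the Lighthouse" figures and also derives reviews per minute; the function names are made up for this illustration, and the numbers are copied from the results.

```python
# Total text reviews for "To the Lighthouse", as stated in the results.
TOTAL_REVIEWS = 5514

# Lines, minutes and collected reviews per dictionary, copied from the results.
runs = {
    "gram-en-l.lst": {"lines": 3349, "minutes": 111, "found": 3057},
    "word-en-1k.lst": {"lines": 1000, "minutes": 33, "found": 4962},
    "gram-en-l,word-en-1k.lst": {"lines": 4349, "minutes": 144, "found": 5127},
}


def coverage_percent(found: int, total: int) -> int:
    """Share of all text reviews that a run collected, as a whole percent."""
    return round(100 * found / total)


def reviews_per_minute(found: int, minutes: int) -> float:
    """Average number of collected reviews per minute of search time."""
    return found / minutes


for name, run in runs.items():
    pct = coverage_percent(run["found"], TOTAL_REVIEWS)
    rate = reviews_per_minute(run["found"], run["minutes"])
    print(f"{name}: {pct}% coverage, {rate:.1f} reviews/minute")
```

Seen this way, the combined dictionary finds 165 more reviews than word-en-1k.lst alone (5127 vs 4962) at the cost of 111 additional minutes.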

Naming Conventions

File names follow the pattern ${TYPE4LETTERCODE}-${LANGUAGE2LETTERCODE}-${SIZE}.lst. The size is l for large dictionaries, s for small dictionaries, 1k for 1000 lines and 3k for 3000 lines; the extension lst means "list". Lists are ASCII files with one word per line. A comma denotes combined dictionaries, e.g., gram-en-l,word-en-1k.lst.
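To make the naming scheme concrete, here is a minimal sketch of a parser for such file names. The function name `parse_dictionary_name` and the returned keys are assumptions for this example; the accepted sizes are only the four named in the text.

```python
import re
from typing import Dict, List

# One dictionary name without extension, e.g. "gram-en-l":
# 4-letter type code, 2-letter language code, size (l, s, 1k or 3k).
PART = re.compile(r"(?P<type>[a-z]{4})-(?P<lang>[a-z]{2})-(?P<size>l|s|1k|3k)")


def parse_dictionary_name(filename: str) -> List[Dict[str, str]]:
    """Split a (possibly combined) dictionary file name into its parts."""
    if not filename.endswith(".lst"):
        raise ValueError(f"not a .lst file: {filename!r}")
    stem = filename[: -len(".lst")]
    parts = []
    # A comma joins the names of combined dictionaries.
    for part in stem.split(","):
        match = PART.fullmatch(part)
        if match is None:
            raise ValueError(f"does not follow the naming scheme: {part!r}")
        parts.append(match.groupdict())
    return parts


print(parse_dictionary_name("gram-en-l,word-en-1k.lst"))
```

A name such as dict.lst does not follow the scheme, so this sketch rejects it with a `ValueError`.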

Smaller dictionaries are usually subsets of the larger ones, so start testing with the smaller ones. Because all Toolbox programs cache their results for some days, switching to a larger dictionary afterwards does not waste time downloading results that are already present.
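The subset relation can be illustrated with a short sketch that lists which entries a larger dictionary adds on top of a smaller one. The helper `new_entries` is hypothetical and only demonstrates the idea; it does not show how the Toolbox cache itself works.

```python
from typing import Iterable, List


def new_entries(smaller: Iterable[str], larger: Iterable[str]) -> List[str]:
    """Return entries of `larger` that are not in `smaller`, in original order."""
    seen = {line.strip() for line in smaller if line.strip()}
    added = []
    for line in larger:
        entry = line.strip()
        # Skip blank lines, entries already covered, and repeats.
        if entry and entry not in seen:
            seen.add(entry)
            added.append(entry)
    return added


# Tiny made-up lists, one entry per line as in the .lst files.
small = ["the\n", "and\n"]
large = ["the\n", "of\n", "and\n", "to\n"]
print(new_entries(small, large))
```

With real files, the two arguments could simply be open file handles, since iterating over a text file yields its lines.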

File: gram-en-l.lst

Most frequent English n-grams first.

File: gram-en-s.lst

The most frequent English trigrams from gram-en-l.lst, tested against Harry Potter reviews: I saved only trigrams that led to 10-30 unique(!) hits, best first, and then appended the most frequent English trigrams that are not already present in the Harry Potter set. It works better with a larger set of available reviews. Randomizing the order yielded no improvement (rather the opposite). It often seems as good as the whole gram-en-l.lst.

File: word-en-1k.lst

Most frequent English words first. Performed better than the n-gram-based dictionaries.

File: word-en-s.lst

Parts of speech

File: gram-en-l,word-en-1k.lst

Slightly more results than word-en-1k.lst alone, but far more search time (1000 vs 4349 lines).

File: dict.lst

A symlink to any one of the other dictionary files. Toolbox programs default to this symlink, so you can change the dictionary for all programs at once.
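Switching the default dictionary then amounts to repointing the symlink. The sketch below shows one way to do that on a system that supports symlinks; the function name `point_dict_at` and the temporary link name are assumptions for this example.

```python
import os


def point_dict_at(directory: str, target_name: str, link_name: str = "dict.lst") -> None:
    """Make `link_name` in `directory` a symlink to `target_name`."""
    target_path = os.path.join(directory, target_name)
    if not os.path.isfile(target_path):
        raise FileNotFoundError(target_path)

    link_path = os.path.join(directory, link_name)
    tmp_path = link_path + ".tmp"  # hypothetical scratch name
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)

    # Create the new link under a temporary name, then rename it over the
    # old one, so there is never a moment without a dict.lst.
    os.symlink(target_name, tmp_path)  # relative link within the directory
    os.replace(tmp_path, link_path)
```

For example, `point_dict_at(".", "word-en-1k.lst")` would make dict.lst in the current directory refer to word-en-1k.lst.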