Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should the annotations not be stored in postgres? #666

Open
nicolasfranck opened this issue Apr 23, 2020 · 3 comments
Open

Should the annotations not be stored in postgres? #666

nicolasfranck opened this issue Apr 23, 2020 · 3 comments

Comments

@nicolasfranck
Copy link
Contributor

Right now the AnnotationServer only stores its data in Elasticsearch.

From my experience, indexes are designed for speed, not for longterm
storage, so they expect data to be stored elsewhere too.

Wouldn't it be better to store the annotation in postgres first, and then
index them in elasticsearch? Besides: on the edit page,
annotations are only loaded for a certain document_id and filepart_id,
so there annotations can be loaded from postgres to reflect
the current state (there is a lapse in time before records become visible in ES).
In this case, ES would only be necessary for more advanced functionality
like statistics, facets..

Of course, this implies a way to "reindex" the data from postgres into ES,
possible using a command line script, or a task.

@rsimon
Copy link
Member

rsimon commented Apr 23, 2020

ES brands itself as, basically, a NoSQL store, too. We decided initially that the advantages far outweigh the problems that reindexing and keeping things in sync would cause. (We've also operated the system for 3+ years now and are very happy with ElasticSearch as "permanent storage". Methods for backup and restore are also pretty well advanced, although knock on wood we haven't had the need to use them so far.)

The decision for ElasticSearch was based on three factors:

  • Annotations are inherently deeply nested objects, so working with ElasticSearch is a really good fit (although I'm aware that there's a JSON datatype in Postgres, too.)
  • Annotations are the kind of data that are inserted at high speed, and ElasticSearch is really good and convenient for this. But again, not the main deciding factor.
  • Most importantly: there's a whole bunch of places where we need to run fast aggregation queries on annotation statistics (histograms over time, distinct place URIs, etc.) I still find the ElasticSearch query pretty much unmatched in terms of simplicity. In other words: this saved us writing code, no more, no less ;-)

I'd need to check, but I think we commit changes to the index for instant updates when annotations are involved (but not batch inserts, e.g. when managing the gazetteers). So the time lapse isn't an issue for us in practice.

That being said: I'd definitely do some things in a different way if I got to redesign the system from scratch. Most importantly, separate out the storage backend, and make this a "real" W3C-compliant annotation server. Alas, that's a task for the future ;-)

@ghost
Copy link

ghost commented Apr 27, 2020

Just thought I'd mention, as it seems relevant: Our developer (at TLCMap, UON) Zongwen Fan is working on enabling users to add places not found in the gazetteer etc. The intention is these would end up in some sort of user contributed Gazetteer which could also be used in searches, and which can be exported and sent back to people who look after the cannonical gazetteer as a way of crowd sourcing identification of historical places. Zongwen noted that the places added don't go into the gazetteer but just into elastic search, and we weren't clear on what the approach should be to solve this overall issue of users needing to be able to add places that aren't already there. We'll figure something out, but just thought I'd bring it up as timely and relevant to discussion/planning on this issue.

@rsimon
Copy link
Member

rsimon commented Apr 27, 2020

Hi @bpascoe,

perhaps we can make a separate issue about this at some point. A possible approach would probably be to create a separate (empty) gazetteer first. Then you can add places by creating yourself an EntityImporter and using its importRecord method. This way, you don't need to deal with any of the specifics of how the gazetteer is organized internally. The place will just get added, as long as you make sure that your EntityRecord carries all the required information.

If you expose some kind of API route to the UI, you can add places individually that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants