Move Hypercane from MongoDB to PostgreSQL for storage and caching #65

Open
shawnmjones opened this issue Feb 3, 2022 · 0 comments
Labels: enhancement (New feature or request)

shawnmjones commented Feb 3, 2022

Hypercane uses MongoDB for caching memento content, headers, and derived data. It also uses PostgreSQL as part of its Web User Interface (WUI). Rather than having to install/maintain multiple databases for different purposes, we want to move Hypercane to PostgreSQL for the following reasons.

  1. MongoDB does not install "easily" for some users. I installed it on macOS with Homebrew but had to reinstall it after issues. On Ubuntu/RHEL, the admin needs to add a third-party yum/apt repository to install it. Almost every distro includes PostgreSQL.
  2. I've had issues dumping and restoring MongoDB data across versions and systems. Sometimes the BSON is corrupted. I'm sure I was supposed to do something on one end or the other, but SQL databases seem to have an easier time with this.
  3. As Hypercane has matured, I've saved more derived data in the database. The ability to query this with SQL is becoming more and more attractive over time. Such standardization may provide third-party tools with another interface for easy analysis.
  4. A point in favor of MongoDB is that I can shove any data we want into a record without worrying about creating standardized fields. We could achieve something similar with planned foreign keys and relations in SQL, at the expense of planning time and schema changes. The truth is that function calls in the code have to correspond to database actions, so we will write some queries either way. Moving to PostgreSQL will require a schema change for each new derived value that we want to store (see the schema sketch after this list).
  5. For space reasons, a user may want to clear out the memento content and keep the derived data. With MongoDB, we have to save the parts we like, get rid of the whole record, and create a new record. With SQL and a decently designed schema, we can delete the records from the table storing the content.
  6. Another point in favor of MongoDB is its ability to expire records, which we do not currently use but should. PostgreSQL does not natively support this as far as I can tell, but I can achieve something similar with triggers (see the trigger sketch after this list).
  7. MongoDB has a BSON size limit of 16MB unless I switch to GridFS. PostgreSQL has a maximum field size of 1GB. Currently, Hypercane discards anything over 16MB, which means that some images and other binary files are skipped rather than processed.
  8. Some claim that MongoDB is faster than PostgreSQL, but some studies show that PostgreSQL has caught up. (We need to add these links.) Performance depends on indexing, table structure, and the queries used in the test. We can likely get comparable performance with good database creation scripts.
  9. Some web archiving folks suggested storing the data as WARC/WAT/etc. and maintaining a CDX instead. This was a good suggestion when we were only caching content, but it does not work as well for querying the derived data. If we store derived data in the CDX, it becomes a table.🙂
  10. The choice of MongoDB came from needing to handle concurrent writes. Writing a separate WARC for each memento addresses concurrency but creates many files, and generating a CDX afterward must be timed well. Alternatives, like SQLite, don't handle concurrent writes well. Database engines, like PostgreSQL or MongoDB, manage concurrency with their own caching, checkpointing, and optimization.
  11. Thanks to the pilot, we have a better idea of the type of data we should store in the database, meaning that we have a better data model moving forward.
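
To make points 3–5 concrete, here is a rough sketch of how the cached content and the derived data could live in separate tables, so content rows can be purged for space while derived data stays queryable. All table and column names below are placeholders for illustration, not a final design.

```sql
-- Hypothetical sketch only; names are placeholders, not the final schema.
CREATE TABLE memento (
    urim        TEXT PRIMARY KEY,           -- URI of the memento (URI-M)
    memento_dt  TIMESTAMPTZ,                -- Memento-Datetime header
    fetched_at  TIMESTAMPTZ DEFAULT now()   -- when Hypercane cached it
);

-- Raw content and headers, kept in their own table so they can be purged.
CREATE TABLE memento_content (
    urim     TEXT PRIMARY KEY REFERENCES memento (urim) ON DELETE CASCADE,
    headers  JSONB,
    body     BYTEA                          -- PostgreSQL allows up to 1GB per field
);

-- Derived values, one row per (memento, measure), added as Hypercane computes them.
CREATE TABLE derived_value (
    urim   TEXT REFERENCES memento (urim) ON DELETE CASCADE,
    name   TEXT,                            -- which derived measure this row holds
    value  JSONB,
    PRIMARY KEY (urim, name)
);

-- Point 5: reclaim space but keep derived data by deleting only content rows, e.g.:
-- DELETE FROM memento_content WHERE urim IN
--     (SELECT urim FROM memento WHERE fetched_at < now() - interval '30 days');
```

This also gets at point 3: a third-party tool could run something like `SELECT name, value FROM derived_value WHERE urim = '…'` without going through Hypercane at all.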
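
For point 6, here is a minimal sketch of trigger-based expiry, assuming the hypothetical memento_content table above gains an expires_at column. The trigger simply sweeps out expired rows whenever new content is inserted; a scheduled cleanup job is another option.

```sql
-- Hypothetical sketch only; assumes the memento_content table sketched above.
ALTER TABLE memento_content ADD COLUMN expires_at TIMESTAMPTZ;

-- Delete any rows whose expiry time has passed.
CREATE OR REPLACE FUNCTION expire_memento_content() RETURNS trigger AS $$
BEGIN
    DELETE FROM memento_content WHERE expires_at < now();
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- Sweep expired rows each time new content is cached
-- (EXECUTE FUNCTION requires PostgreSQL 11+).
CREATE TRIGGER expire_memento_content_trigger
    AFTER INSERT ON memento_content
    FOR EACH STATEMENT
    EXECUTE FUNCTION expire_memento_content();
```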

With all of this in mind, I will be using this issue to document ER diagrams and other insights as I experiment with this change.
