Document storage explanation #25

danielballan · 2024-12-11T22:58:46Z

This is a rough draft of a statement from "the project" about the vision for moving Bluesky document storage to a layout better optimized for data access.

prjemian · 2024-12-12T00:35:16Z

Is this limited to PostgreSQL and SQLite, or only demonstrated on these two? Are any capabilities used in the interface unique to these two and not possible using other SQL servers, such as MySQL, MariaDB, or Oracle?

danielballan · 2024-12-12T11:37:52Z

We use the SQLAlchemy library, which abstracts over a variety of SQL dialects. Those two (PG and SQLite) are the only ones we test against. They were chosen because at present they are generally considered the most robust in their respective domains of client-server and embedded relational databases. Other SQL dialects could in principle be supported if they have sufficient support for JSON, particularly indexing on keys in JSON columns.
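
To make "indexing on keys in JSON columns" concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` module rather than SQLAlchemy, purely to stay dependency-free, and it assumes a SQLite build with the JSON functions available. The table name `runs`, the column `metadata`, and the key `sample` are hypothetical and are not taken from the project's actual schema.

```python
import json
import sqlite3


def build_demo_db():
    """Create an in-memory database with a JSON text column and an index on one key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, metadata TEXT)")
    # Expression index on a single key inside the JSON column.
    # This is the capability another SQL dialect would need to offer.
    conn.execute(
        "CREATE INDEX idx_runs_sample "
        "ON runs (json_extract(metadata, '$.sample'))"
    )
    rows = [
        (json.dumps({"sample": "Cu", "plan_name": "scan"}),),
        (json.dumps({"sample": "Ni", "plan_name": "count"}),),
        (json.dumps({"sample": "Cu", "plan_name": "count"}),),
    ]
    conn.executemany("INSERT INTO runs (metadata) VALUES (?)", rows)
    return conn


def find_by_sample(conn, sample):
    """Return the ids of rows whose JSON metadata has the given 'sample' value."""
    cur = conn.execute(
        "SELECT id FROM runs "
        "WHERE json_extract(metadata, '$.sample') = ? ORDER BY id",
        (sample,),
    )
    return [row[0] for row in cur]
```

PostgreSQL offers a comparable capability through its own JSON operators and index types, which is where an abstraction layer such as SQLAlchemy helps.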

kivel

Great stuff! I can't wait to try this back home.

kivel · 2024-12-12T22:24:15Z

docs/explanations/document-storage.md

+For the first time in the ten-year history of the Bluesky project, the Bluesky
+core developers will soon recommend a change in how data and metadata from
+Bluesky documents should be stored.


I'd prefer an absolute time, like "in the first quarter of 2025", over "will soon recommend".

Or maybe refer to the last paragraph of this document.

kivel · 2024-12-12T22:29:52Z

docs/explanations/document-storage.md

+non-optimal for _batch reads_ and _random access_. These are critical
+shortcomings in a data store.
+
+In order to access a portion data from MongoDB as a table or an array, we


Suggested change

-In order to access a portion data from MongoDB as a table or an array, we
+In order to access a portion of data from MongoDB as a table or an array, we

kivel · 2024-12-12T22:36:02Z

docs/explanations/document-storage.md

+effectively take a "transpose" of the Event documents to build a columnar
+representation of the data. The implementation is fairly complex, and thus
+expensive to debug and maintain. And the operation imposes a performance cost
+that becomes noticeable beyond ~100 events.


A benchmark demonstrating the benefits of the SQL database over MongoDB for data retrieval would be nice. I assume it's not too hard to compare the two by writing an identical dataset to both backends and fetching it from each. External factors need to be considered: for example, comparing a local MongoDB against a remote/central PostgreSQL would be unfair due to latencies.
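
For readers unfamiliar with the "transpose" mentioned in the quoted passage, the following is a minimal sketch of the idea, not the project's actual implementation. It assumes each Event document is a dict with `seq_num`, `time`, and `data` keys, and that all Events passed in share the same data keys. The function name `transpose_events` is hypothetical.

```python
def transpose_events(events):
    """Turn row-oriented Event documents into a column-oriented dict of lists.

    Assumes every event has the same keys in its "data" dict, as Events
    that share a descriptor do.
    """
    columns = {"seq_num": [], "time": []}
    for event in events:
        columns["seq_num"].append(event["seq_num"])
        columns["time"].append(event["time"])
        # Each data key becomes its own column.
        for key, value in event["data"].items():
            columns.setdefault(key, []).append(value)
    return columns
```

Every read of a table or array repeats this per-event work, whereas a columnar store could hand back the columns directly. A benchmark like the one requested above could time this step on identical data for both backends.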

kivel · 2024-12-12T22:45:06Z

docs/explanations/document-storage.md

+2. Ingest them into the new storage, just as if they were a live experiment.
+3. Stream documents from the new storage and validate semantic fidelity.
+
+Of course, it possible to test this offline and evaluate the performance and


Suggested change

-Of course, it possible to test this offline and evaluate the performance and
+Of course, it will be possible to test this offline and evaluate the performance and
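
The validation step in the quoted list ("stream documents from the new storage and validate semantic fidelity") could be sketched as below. This is only an illustration under stated assumptions: document streams are taken to be sequences of `(name, doc)` pairs, and "semantic fidelity" is interpreted here as equality after a JSON round trip, so that a tuple and a list with the same items count as equal. The helper names `normalize` and `first_mismatch` are hypothetical.

```python
import json


def normalize(doc):
    """Round-trip a document through JSON so tuples and lists compare equal."""
    return json.loads(json.dumps(doc, sort_keys=True))


def first_mismatch(original, replayed):
    """Return the index of the first differing (name, doc) pair, or None if the streams match."""
    original = list(original)
    replayed = list(replayed)
    for i, ((name_a, doc_a), (name_b, doc_b)) in enumerate(zip(original, replayed)):
        if name_a != name_b or normalize(doc_a) != normalize(doc_b):
            return i
    # One stream ended early: report the first index that is missing.
    if len(original) != len(replayed):
        return min(len(original), len(replayed))
    return None
```

A stricter or looser definition of fidelity (for example, tolerating floating-point rounding) would change only `normalize`.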

danielballan added 3 commits December 11, 2024 17:27

Draft document storage explanation.

c5ad7c9

Move explanation

ee952ba

Note about ADBC

9d6d847

danielballan requested review from genematx and tacaswell December 11, 2024 23:04

Delete words

f9e7141

danielballan requested a review from stuartcampbell December 11, 2024 23:15

danielballan added 4 commits December 11, 2024 18:25

Add note on SQLite

f696520

Tighten writing

65d95b5

More wording tweaks

f19ed05

remove words

5b498c1

fix word

690efaa

kivel reviewed Dec 12, 2024
