Any guidance when/what to use partitions for, vs key prefixes? #80

jamesmunns · 2024-09-23T16:51:10Z

Hey there! I'm looking at using fjall, and am wondering if there is any guidance around how to use partitions vs prefixed keys.

I have a relatively structured set of data right now, something like this (note: I'm using vaguely handlebars-like syntax for wildcards and regex-like syntax for access, let me know if it's not clear what I mean):

I have N devices I am keeping data for
- Each device has a list of log entries (let's say they are basically strings)
- Each device has some metadata (let's say it's generally a struct, though the struct might have some array-like fields)
- Each device also has arbitrary data it might send, each with a unique "kind" ID

Originally I was planning on just formatting all the keys something like this:

{device_id}/metadata -> the metadata of that device
{device_id}/logs/{uuidv7} -> all the logs from the device, sorted by timestamp (via uuidv7)
{device_id}/{message_kind}/{uuidv7} -> all the arbitrary data kinds from the device, where all messages with the common {device_id}/{message_kind}/ prefix are the same type of data

I'm not sure where to draw the lines of "when to use key prefixes in the same partition", or "when to use separate partitions", particularly when the partitions might be "dynamic". I bet the answer is "it depends on your access patterns".

I could go "partition maximal", so for example each instance of {device_id}/metadata is a partition (so if I have 10 device IDs, I have 10 partitions), or even {device_id}/{message_kind}/ being unique partitions (so if I have 10 device IDs, each with 10 unique message kinds, then I'd have 100 partitions). I'd imagine this would work best when I want to have access patterns like "get me the latest 10 messages of a specific message kind from a specific device", because it doesn't have to skip over a lot of messages of other kinds or from other devices.

That being said, I'd imagine if I ever wanted to do some kind of access pattern like "show me the most recent 100 logs from all devices", e.g. .*/logs/.* or whatever, this might be harder to do because I need to scan all the various partitions, then aggregate/sort/combine the messages to the most recent 100.

I just wanted to check that this understanding is correct, and if you have any suggestions on how to "think in fjall" when it comes to organizing data to play as nice as possible. My scale here isn't crazy (maybe dozens of devices, with dozens of message kinds, but many thousand log/message entries each), so it's probably not super important I get this right, right now. But I am curious! And it would be good to know how to better use fjall, if I do end up using it for "bigger" things in the future!

Thanks for fjall, the 2.0 release announcement came just as I was wondering how I would turn my Arc<Mutex<AllOfTheData>> into a persistent + loadable store, so the timing was perfect :)

Answered by marvin-j97

marvin-j97 · 2024-09-23T19:35:16Z

Maintainer

Hi,

I bet the answer is "it depends on your access patterns".

That's for sure, so I can't give you a definitive answer on what's best.

I'm not sure where to draw the lines of "when to use key prefixes in the same partition", or "when to use separate partitions", particularly when the partitions might be "dynamic"

You can think of a single partition as a single database table/index in an RDBMS (the table itself being just the primary index).

Personally here, I would create 3 partitions:

dev: {device_id} => metadata struct
dev_log: {device_id}/{uuidv7} => log
dev_msg: {device_id}/{msg_kind}/{uuidv7} => msg

That way you also don't have to specify /logs/ or /metadata/ in the key. Shorter keys are always better, to the point where shortening something like /metadata/ to /m/ would actually have an impact in a big data system.
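A minimal sketch of key builders for that three-partition layout. The function names are hypothetical, and it assumes device IDs and message kinds never contain `/`, and that UUIDv7 values are written in their canonical lowercase hex form so that string order follows creation time.

```rust
// Hypothetical helpers: one key builder per partition from the layout above.
// Keys are plain strings here; any byte sequence would work the same way.

/// Key for the `dev` partition: one metadata row per device.
fn dev_key(device_id: &str) -> String {
    device_id.to_string()
}

/// Key for the `dev_log` partition: logs of one device, ordered by UUIDv7.
fn dev_log_key(device_id: &str, uuid_v7: &str) -> String {
    format!("{device_id}/{uuid_v7}")
}

/// Key for the `dev_msg` partition: messages grouped by device, then kind.
fn dev_msg_key(device_id: &str, msg_kind: &str, uuid_v7: &str) -> String {
    format!("{device_id}/{msg_kind}/{uuid_v7}")
}
```

Because the partition name already says what kind of row it holds, none of the keys needs a `/logs/` or `/metadata` segment.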

I'd imagine this would work best when I want to have access patterns like "get me the latest 10 messages of a specific message kind from a specific device", because it doesn't have to skip over a lot of messages of other kinds or from other devices.

Even in a single partition, this can be efficiently queried using:

for kv in partition.prefix("{device_id}/{message_kind}/").rev().take(10) { ... }
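To see why this stays cheap in a shared keyspace, here is a rough model of the same query using a `BTreeMap` as a stand-in for a partition. This is not fjall's implementation; `prefix_upper_bound` and `latest_with_prefix` are hypothetical names. The point is that a prefix scan is a bounded range over sorted keys, so reversing it and taking 10 touches only the tail of that range.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Smallest byte string greater than every key starting with `prefix`,
/// or `None` if no such bound exists (empty or all-0xFF prefix).
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(last) = end.pop() {
        if last < u8::MAX {
            end.push(last + 1);
            return Some(end);
        }
    }
    None
}

/// Up to `n` values whose key starts with `prefix`, highest key first.
/// The `BTreeMap` is only a stand-in for an ordered partition.
fn latest_with_prefix<'a>(
    partition: &'a BTreeMap<Vec<u8>, String>,
    prefix: &[u8],
    n: usize,
) -> Vec<&'a String> {
    let upper = match prefix_upper_bound(prefix) {
        Some(end) => Bound::Excluded(end),
        None => Bound::Unbounded,
    };
    partition
        .range((Bound::Included(prefix.to_vec()), upper))
        .rev() // walk backwards from the end of the prefix range
        .take(n)
        .map(|(_, v)| v)
        .collect()
}
```

Note that the trailing `/` in the prefix matters: `dev1/temp/` does not match keys of a device named `dev10`.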

That being said, I'd imagine if I ever wanted to do some kind of access pattern like "show me the most recent 100 logs from all devices", e.g. .*/logs/.* or whatever, this might be harder to do because I need to scan all the various partitions, then aggregate/sort/combine the messages to the most recent 100.

Writing your query using a left star wildcard more or less already tells you your query may need a full table scan, which is bad obviously.

For this access pattern it would be best to keep a separate partition (global_log?) and write every log message to it as well (you can use a WriteBatch or WriteTransaction), but formatted using a different row key: {uuidv7} => log msg, and then just query:

for kv in global_log.iter().rev().take(100) { ... }

Duplicating data sounds a bit mad, but that's exactly what a secondary index à la DynamoDB would do - and it optimizes locality for exactly the query we need. Maybe it also demonstrates why joins can actually be quite expensive compared to denormalization.
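A small sketch of that double write, again with `BTreeMap`s standing in for the two partitions. `LogStore` and its methods are hypothetical names; in a real store the two inserts would go into one atomic batch or transaction so the index cannot drift from the primary data.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for two partitions.
#[derive(Default)]
struct LogStore {
    dev_log: BTreeMap<String, String>,    // {device_id}/{uuidv7} => log
    global_log: BTreeMap<String, String>, // {uuidv7} => log
}

impl LogStore {
    /// Writes one log line under both keys. In a real store these two
    /// inserts would go into one atomic batch or transaction.
    fn append(&mut self, device_id: &str, uuid_v7: &str, line: &str) {
        self.dev_log
            .insert(format!("{device_id}/{uuid_v7}"), line.to_string());
        self.global_log
            .insert(uuid_v7.to_string(), line.to_string());
    }

    /// Newest `n` log lines across all devices, newest first.
    fn latest_global(&self, n: usize) -> Vec<&str> {
        self.global_log
            .values()
            .rev()
            .take(n)
            .map(String::as_str)
            .collect()
    }
}
```

The cross-device query never looks at `dev_log`, so its cost depends on `n` and not on the number of devices.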

Other considerations

How many partitions are too many? Hard to say. Partitions do add overhead, but a few tens of thousands should be feasible. In general, creating a partition per device/user/whatever may not be a good choice if you don't know an upper bound on how many partitions may be created.

A single partition, using your row key design, could absolutely work here too, as you know your data set is not crazy large, but it would prevent you from adjusting partition configurations based on the data stored: data with different access patterns or lifetime requirements probably should be stored in different partitions. And every partition can be configured to match its contained data.

For instance, if you decide you want to truncate the history of device log messages, you could create a separate partition (device_logs) and use FIFO compaction with a max size of XX GB - that will automatically drop old data over time to cap disk space usage (https://github.com/fjall-rs/fjall/tree/main/examples/rolling-log).

Or your device_messages may get a lot of write traffic, so you could increase the write buffer (max_memtable_size) for that partition only, to increase write performance.

I just wanted to check that this understanding is correct, and if you have any suggestions on how to "think in fjall" when it comes to organizing data to play as nice as possible.

Basically anything that applies to LevelDB, RocksDB, Bigtable, DynamoDB, should apply here, too. In short, you always want to design your row key in such a way as to optimize locality for your queries.

Bigtable (chapter 2: Data Model): https://static.googleusercontent.com/media/research.google.com/de//archive/bigtable-osdi06.pdf
Bigtable row key design: https://cloud.google.com/bigtable/docs/schema-design
Bigtable timeseries design: https://cloud.google.com/bigtable/docs/schema-design-time-series
Single table design: https://aws.amazon.com/de/blogs/database/single-table-vs-multi-table-design-in-amazon-dynamodb/
https://www.datadoghq.com/blog/engineering/timeseries-indexing-at-scale/
Wide column design: https://marvin-j97.github.io/smoltable/guides/wide-column-intro/
KV design in an SQL database: https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/

3 replies

jamesmunns Sep 23, 2024
Author

Thank you! I don't actually have any operational knowledge (incl choosing architecture/layout of data for) of any of the stores you mentioned, but that unlocks a lot of reading material to read :)

I think the most succinct questions I should have asked were "what tradeoffs are there between the ratio of partitions to prefixes", or "how many of each is too many", or "ops notes if one or both would be very big in total counts".

I think you answered all of this, and I appreciate it!

Is there anywhere like design docs or an FAQ where this guidance could be added?

marvin-j97 Sep 23, 2024
Maintainer

Is there anywhere like design docs or an FAQ where this guidance could be added?

I will surely write a blog post (https://fjall-rs.github.io/) about some KV design stuff at some point. The problem with creating some basic guide is that KV stores are so foundational to everything that you simply need to build some intuition for how to design row keys to lay out data on disk with locality in mind.

To demonstrate, take your data set as an example, but imagine you stored ~10 billion devices instead, each having a name, some metadata and an operating system ID as properties stored as JSON:

If you wanted to count every device grouped by operating system, suddenly my advice of using a single partition for device metadata would actually not be that great. In this case, creating a new partition (dev_os) where only the operating system ID is stored per device ID would greatly benefit such an OLAP-style scan. This essentially gives rise to the idea of locality groups, where you split your "rows" by their columns so that scans read less data, also explained here (https://marvin-j97.github.io/smoltable/guides/locality-groups/). These are all the nitty-gritty details an RDBMS tries to handle for you, e.g. partitioning: https://learn.microsoft.com/en-us/sql/relational-databases/partitions/partitioned-tables-and-indexes.
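As a rough illustration of the `dev_os` idea, the sketch below counts devices per operating system by scanning only a narrow map of device ID to OS ID. The `BTreeMap` is a stand-in for the partition, `count_by_os` is a hypothetical name, and representing the OS ID as a `u32` is an assumption.

```rust
use std::collections::BTreeMap;

/// Counts devices per operating system by scanning a narrow `dev_os`
/// stand-in (device_id => os_id) rather than full metadata rows.
fn count_by_os(dev_os: &BTreeMap<String, u32>) -> BTreeMap<u32, usize> {
    let mut counts = BTreeMap::new();
    for os_id in dev_os.values() {
        *counts.entry(*os_id).or_insert(0) += 1;
    }
    counts
}
```

The scan still visits every device, but each row is a few bytes instead of a full JSON document, so far less data has to be read and decoded.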

So yeah, it's hard to create some kind of guide or FAQ, it would probably be easier to write a book.

marvin-j97 Sep 24, 2024
Maintainer

Oh and I forgot

how many of each is too many

The number of unique prefixes doesn't have an impact. A table with 1'000x1'000 prefixed tuples should perform just like one with 1x1'000'000 tuples, because the keys are really just byte arrays. The prefixes just carry meaning that we read into them, and they allow packing related data together physically.
