Any guidance when/what to use partitions for, vs key prefixes? #80
Hey there! I'm looking at using fjall, and am wondering if there is any guidance around how to use partitions vs prefixed keys. I have a relatively structured set of data right now, something like this (note: I'm using vaguely handlebars-like syntax for wildcards and regex-like syntax for access, let me know if it's not clear what I mean):
Originally I was planning on just formatting all the keys something like this:
I'm not sure where to draw the lines of "when to use key prefixes in the same partition", or "when to use separate partitions", particularly when the partitions might be "dynamic". I bet the answer is "it depends on your access patterns".
I could go "partition maximal", so for example each instance of That being said, I'd imagine if I ever wanted to do some kind of access pattern like "show me the most recent 100 logs from all devices", e.g.
I just wanted to check that this understanding is correct, and if you have any suggestions on how to "think in fjall" when it comes to organizing data to play as nice as possible. My scale here isn't crazy (maybe dozens of devices, with dozens of message kinds, but many thousand log/message entries each), so it's probably not super important I get this right, right now. But I am curious! And it would be good to know how to better use fjall, if I do end up using it for "bigger" things in the future!
Thanks for fjall, the 2.0 release announcement came just as I was wondering how I would turn my
Replies: 1 comment 3 replies
Hi,
That's for sure, so I can't give you a definitive answer on what's best.
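You can think of a single partition as a single database table/index in an RDBMS (the table itself being just the primary index). Personally here, I would create 3 partitions:
dev: {device_id} => metadata struct
dev_log: {device_id}/{uuidv7} => log
dev_msg: {device_id}/{msg_kind}/{uuidv7} => msg
That way you also don't have to specify /logs/ or /metadata/ in the key. Shorter k…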
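A minimal sketch of how keys for the dev_log ({device_id}/{uuidv7}) and dev_msg ({device_id}/{msg_kind}/{uuidv7}) layouts could be built. The helper names are hypothetical and nothing here is fjall API. A zero-padded counter stands in for the UUIDv7 (both sort in creation order), and device IDs and message kinds are assumed not to contain `/`.

```rust
/// Key for the `dev_log` layout: {device_id}/{seq}.
/// `seq` is a stand-in for a UUIDv7; zero-padding to 20 digits makes
/// lexicographic order match numeric order for any u64.
fn dev_log_key(device_id: &str, seq: u64) -> String {
    format!("{device_id}/{seq:020}")
}

/// Key for the `dev_msg` layout: {device_id}/{msg_kind}/{seq}.
fn dev_msg_key(device_id: &str, msg_kind: &str, seq: u64) -> String {
    format!("{device_id}/{msg_kind}/{seq:020}")
}

fn main() {
    let mut keys = vec![
        dev_msg_key("dev-b", "temp", 1),
        dev_msg_key("dev-a", "temp", 10),
        dev_msg_key("dev-a", "temp", 9),
        dev_msg_key("dev-a", "gps", 3),
    ];
    // Sorting mimics how an ordered key-value store lays the keys out:
    // grouped by device, then by message kind, then by creation order.
    keys.sort();
    for key in &keys {
        println!("{key}");
    }
    println!("{}", dev_log_key("dev-a", 42));
}
```

Because the most significant part of the key comes first, all entries for one device (and one message kind) end up next to each other, which is what makes prefix queries cheap.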
Even in a single partition, this can be efficiently queried using:
    for kv in partition.prefix("{device_id}/{message_kind}/").rev().take(10) { ... }
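To illustrate why a reversed prefix scan like this is cheap, here is a rough sketch that uses a standard-library `BTreeMap` as a stand-in for an ordered partition. It is not fjall code: `prefix_end` and `newest_with_prefix` are hypothetical helpers, and the keys and values are made up for the example.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Smallest key that sorts after every key starting with `prefix`,
/// or None if there is no such key (empty or all-0xFF prefix).
fn prefix_end(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(last) = end.pop() {
        if last < 0xFF {
            end.push(last + 1);
            return Some(end);
        }
    }
    None
}

/// Up to `n` values stored under `prefix`, newest (largest key) first.
/// Only the keys inside the prefix range are visited.
fn newest_with_prefix<'a>(
    partition: &'a BTreeMap<Vec<u8>, String>,
    prefix: &[u8],
    n: usize,
) -> Vec<&'a String> {
    let upper = match prefix_end(prefix) {
        Some(end) => Bound::Excluded(end),
        None => Bound::Unbounded,
    };
    partition
        .range((Bound::Included(prefix.to_vec()), upper))
        .rev()
        .take(n)
        .map(|(_, value)| value)
        .collect()
}

fn main() {
    let mut dev_msg: BTreeMap<Vec<u8>, String> = BTreeMap::new();
    for seq in 0..25u64 {
        dev_msg.insert(format!("dev-a/temp/{seq:020}").into_bytes(), format!("temp #{seq}"));
        dev_msg.insert(format!("dev-a/gps/{seq:020}").into_bytes(), format!("gps #{seq}"));
    }
    // The 10 most recent "temp" messages of dev-a, newest first.
    for msg in newest_with_prefix(&dev_msg, b"dev-a/temp/", 10) {
        println!("{msg}");
    }
}
```

The scan starts at the end of the prefix range and stops after `n` items, so its cost depends on `n` rather than on the total number of keys.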
Writing your query using a left star wildcard more or less already tells you that it may need a full table scan, which is obviously bad. For this access pattern it would be best to keep a separate partition (global_log in the snippet below):
    for kv in global_log.iter().rev().take(100) { ... }
Duplicating data sounds a bit mad, but that's exactly what a secondary index à la DynamoDB would do, and it optimizes locality for exactly the query we need. Maybe it also demonstrates why joins can actually be quite expensive compared to denormalization.
Other considerations
How many partitions are too many? Hard to say. Partitions do add overhead, but a couple of tens of thousands should be feasible. In general, creating a partition per device/user/whatever may not be a good choice if you don't know an upper bound on how many partitions may be created.
A single partition, using your row key design, could absolutely work here too, as you know your data set is not crazy large, but it would prevent you from adjusting partition configurations based on the data stored: data with different access patterns or lifetime requirements should probably be stored in different partitions, and every partition can be configured to match the data it contains. For instance, if you decide you want to truncate the history of device log messages, you could create a separate partition ( Or your
Basically anything that applies to LevelDB, RocksDB, Bigtable or DynamoDB should apply here, too. In short, you always want to design your row key in such a way as to optimize locality for your queries.
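As a rough illustration of the denormalized global log idea, the sketch below writes every log line twice: once keyed by device and once keyed by time. Two `BTreeMap`s stand in for the dev_log and global_log partitions; this is not fjall code. The `Logs` type, its methods, and the `{seq}/{device_id}` key layout for the global log are assumptions made for the example, with a zero-padded counter again standing in for a UUIDv7.

```rust
use std::collections::BTreeMap;

/// Two ordered maps standing in for two partitions.
#[derive(Default)]
struct Logs {
    /// {device_id}/{seq} => log line
    dev_log: BTreeMap<String, String>,
    /// {seq}/{device_id} => log line (duplicate, ordered by time across devices)
    global_log: BTreeMap<String, String>,
}

impl Logs {
    /// Writes the same log line under both key layouts.
    /// In a real store the two writes would ideally be applied atomically.
    fn append(&mut self, device_id: &str, seq: u64, line: &str) {
        self.dev_log
            .insert(format!("{device_id}/{seq:020}"), line.to_string());
        self.global_log
            .insert(format!("{seq:020}/{device_id}"), format!("{device_id}: {line}"));
    }

    /// The `n` most recent log lines across all devices, newest first.
    /// No filtering is needed: the key order already is the query order.
    fn recent(&self, n: usize) -> Vec<&str> {
        self.global_log
            .values()
            .rev()
            .take(n)
            .map(String::as_str)
            .collect()
    }
}

fn main() {
    let mut logs = Logs::default();
    logs.append("dev-a", 1, "boot");
    logs.append("dev-b", 2, "boot");
    logs.append("dev-a", 3, "ready");
    for line in logs.recent(100) {
        println!("{line}");
    }
}
```

The extra write costs some space, but "most recent N across all devices" becomes a short reverse scan instead of a scan over every device's logs.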