
Is it possible to have read replicas backed by same set of SST files in the remote storage? #116

sanayak opened this issue Apr 9, 2021 · 10 comments

sanayak commented Apr 9, 2021

Rocksdb-cloud provides a way to put the SSTs on remote storage for durability and to separate compute from storage. As I understand it, the main benefit of using rocksdb-cloud is being able to scale compute independently of storage.
We have one instance of rocksdb-cloud running, and it puts the SST files on remote storage such as S3. If we cannot serve the read requests from that one instance because of its compute limitations, is it possible to have read replicas that are backed by the same set of SSTs in the remote storage, so that read requests can be load balanced across the read replicas to achieve the desired throughput?


dhruba commented Apr 9, 2021

You would have to clone the original db and use the clone on another machine. Cloning the db means that the same set of cloud SST files is used by the other instance as well. One example here: https://github.com/rockset/rocksdb-cloud/blob/master/cloud/examples/clone_example.cc


sanayak commented Apr 9, 2021

Thanks @dhruba for the prompt response. A few more questions related to this:

  • Does that mean the clone db shares the same SST files written by the main instance and does not create its own set of SST files throughout its lifetime?
  • How does the clone db get the WAL/memtable entries to serve read requests?
  • Is this clone db writable? If yes, will all the written keys be flushed to the same set of SST files owned by the main instance?


dhruba commented Apr 9, 2021

Hi @sanayak,

When you open a rocksdb-cloud instance, you specify a source cloud path and a destination cloud path. rocksdb-cloud will use all the files in those two directories to open the database. All new files produced by the db are written out to the destination cloud path only.

If you have multiple instances using the same database, then one will be the primary instance and one will be the replica instance. The primary instance opens the db with (srcpath=A, destpath=A). New writes to the primary are written to the cloud bucket path called A.

If you want a replica, open the same rocksdb-cloud database with (srcpath=A, destpath=B). The replica starts off as a clone of the primary database at the time the db is opened from path A. Any files created by the compaction process on the replica database are written to path B. The replica db does not automatically follow the updates from the primary db; it is an independent clone and has its own life. You can write to the clone if you want.

If you want the replica db to follow the updates in the primary db, then periodically close the replica db and re-clone it from the primary by reopening it with (srcpath=A, destpath=C).
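
To make the srcpath/destpath split concrete, here is a rough sketch of opening a replica in the spirit of the clone_example.cc linked above. The bucket name, object paths, region, and local directory are placeholders, and the NewAwsEnv/DBCloud::Open signatures should be verified against the headers in your checkout:

```cpp
#include <cassert>

#include "rocksdb/cloud/cloud_env_options.h"
#include "rocksdb/cloud/db_cloud.h"
#include "rocksdb/options.h"

using namespace rocksdb;

// Open a replica that clones the primary's data: srcpath = "pathA",
// destpath = "pathB". All values below are placeholders for illustration.
DBCloud* OpenReplica() {
  CloudEnvOptions cloud_env_options;
  CloudEnv* cenv = nullptr;
  Status s = CloudEnv::NewAwsEnv(
      Env::Default(),
      "my-bucket", "pathA", "us-west-2",  // source: the primary's files (A)
      "my-bucket", "pathB", "us-west-2",  // destination: this replica's new files (B)
      cloud_env_options, nullptr, &cenv);
  assert(s.ok());

  Options options;
  options.env = cenv;  // route all file I/O through the cloud env

  DBCloud* db = nullptr;
  // Local scratch directory, no persistent cache in this sketch.
  s = DBCloud::Open(options, "/tmp/replica_db", "", 0, &db);
  assert(s.ok());
  return db;
}
```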


sanayak commented Apr 16, 2021

Thanks a lot @dhruba for the detailed explanation.

A couple of follow-up questions:

  1. Is it possible to have hot data in local storage (SSD) and cold data in remote cloud storage based on the timestamp of the insertion/modification of the record? For example, we might define data as hot for 7 days, during which it is accessed very frequently; after that period, it can be moved to inexpensive storage. Does any such mechanism exist in rocksdb/rocksdb-cloud today to address this scenario, or do we need to build it in the application layer itself? Any pointers in this direction would be helpful.

  2. Does auto-compaction work with rocksdb-cloud? In our experiment we had to trigger compaction manually with rocksdb-cloud; just wondering if we are missing any configuration needed to make auto-compaction work.


dhruba commented Apr 16, 2021

  1. Is it possible to have hot data in local storage (SSD) and cold data in remote cloud storage?
    -- You have to build this in your app.

  2. Does auto-compaction work with rocksdb-cloud?
    -- Yes, definitely. We use that feature in our production clusters.
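
As an aside not from the thread: auto compaction in stock RocksDB (which rocksdb-cloud builds on) is governed by the standard disable_auto_compactions option, so one thing worth checking is that it has not been switched off or left disabled after a bulk load. A minimal sketch:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace rocksdb;

void EnsureAutoCompaction(DB* db) {
  // Auto compaction runs unless this standard RocksDB option disables it
  // (the default is false, i.e. auto compaction enabled).
  Options opts = db->GetOptions();
  if (opts.disable_auto_compactions) {
    // It can be flipped at runtime without reopening the DB.
    db->SetOptions({{"disable_auto_compactions", "false"}});
  }
}
```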


sanayak commented Apr 16, 2021

Thanks @dhruba.
From the documentation, I understand that there are two policies for moving data from the local store to the remote store:

  1. Keep all the SST files in remote storage as and when they are closed, with no SSTs in local storage.
  2. Keep the subset of SST files that were written recently in local storage (potentially L0 and L1, maybe).

Just curious whether any mechanism exists to move SSTs across storage tiers (from local storage to remote storage) based on the time they were flushed/closed. Is it possible to define such tiering policies with rocksdb-cloud? That would approximately simulate hot vs. cold data semantics from the rocksdb perspective.
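
For reference (my assumption, not confirmed in this thread): policy 1 appears to map onto the keep_local_sst_files flag in CloudEnvOptions, which controls whether local copies of SST files are retained alongside the cloud copies; a time- or level-based subset policy would, per the discussion below, have to live in the application layer. A sketch, with the field name to be verified against cloud_env_options.h:

```cpp
#include "rocksdb/cloud/cloud_env_options.h"

using namespace rocksdb;

CloudEnvOptions MakeCloudEnvOptions(bool keep_sst_copies_locally) {
  CloudEnvOptions copts;
  // false (policy 1): SST files live only in the cloud bucket once closed.
  // true: a local copy of every SST file is kept as well.
  // NOTE: field name assumed from cloud_env_options.h; verify on your version.
  copts.keep_local_sst_files = keep_sst_copies_locally;
  return copts;
}
```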


dhruba commented Apr 18, 2021

if there exists any mechanism to move the SSTs from local storage to remote storage based on the time it got flushed/closed.

You might be able to use CheckPointToCloud() to manually copy all local SST files to the cloud: https://github.com/rockset/rocksdb-cloud/blob/master/include/rocksdb/cloud/db_cloud.h#L55
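
A rough usage sketch, under the assumption (to be verified against the db_cloud.h line linked above) that the method takes a destination BucketOptions plus a CheckpointToCloudOptions struct; the names here are assumptions, not confirmed:

```cpp
#include "rocksdb/cloud/db_cloud.h"

using namespace rocksdb;

// Hypothetical helper: push the current local files of an open DBCloud
// instance to a cloud bucket. Parameter types (BucketOptions,
// CheckpointToCloudOptions) are assumptions -- check db_cloud.h.
Status PushCheckpoint(DBCloud* db, const BucketOptions& destination) {
  CheckpointToCloudOptions copt;  // default options
  return db->CheckpointToCloud(destination, copt);
}
```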


sanayak commented Apr 19, 2021

Thanks @dhruba. Looking at the implementation of CheckPointToCloud() in https://github.com/rockset/rocksdb-cloud/blob/master/cloud/db_cloud_impl.cc, it moves all the SST files to cloud storage. I think it would need a different implementation if we need to filter out SSTs that do not qualify for copying (i.e. SSTs that were created very recently).


IwfWcf commented Apr 1, 2022

Hi @dhruba,

I noticed that in the Rockset read-replica setup, the read replica tails new updates from the log. Does that mean every read replica does compaction too? Wouldn't that be a waste of work? Is there a way to avoid that by having the read replica share the same files with the primary after the primary's compaction, without reopening the read replica?


dhruba commented Apr 9, 2022

You can configure your rocksdb-cloud instances so that your replicas do not need to do compaction. The replica instance can periodically re-open the db with ephemeral_resync_on_open = true.

https://github.com/rockset/rocksdb-cloud/blob/master/include/rocksdb/cloud/cloud_env_options.h#L301
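
A minimal sketch of that configuration, with the field name taken from the cloud_env_options.h line linked above and everything else a placeholder:

```cpp
#include "rocksdb/cloud/cloud_env_options.h"

using namespace rocksdb;

CloudEnvOptions MakeEphemeralReplicaOptions() {
  CloudEnvOptions copts;
  // On (re)open, resync this ephemeral replica's file set from the source
  // bucket so it picks up the primary's compaction output instead of
  // compacting locally (see cloud_env_options.h linked above).
  copts.ephemeral_resync_on_open = true;
  return copts;
}
```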
