
Is it possible to have read replicas backed by same set of SST files in the remote storage? #116

sanayak opened this issue Apr 9, 2021 · 10 comments

sanayak commented Apr 9, 2021

Rocksdb-cloud provides a way to put the SSTs on remote storage for durability and to separate compute from storage. As I understand it, the main benefit of using rocksdb-cloud is being able to scale compute independently of storage.
We have one instance of rocksdb-cloud running, and it puts the SST files on remote storage such as S3. If we cannot serve the read requests from that one instance because of its compute limitations, is it possible to have read replicas that are backed by the same set of SSTs in the remote storage, so that read requests can be load balanced across the read replicas to achieve the desired throughput?


dhruba commented Apr 9, 2021

You would have to clone the original db and use the clone on another machine. Cloning the db means that the same set of cloud SST files is used by the other instance as well. One example here: https://github.com/rockset/rocksdb-cloud/blob/master/cloud/examples/clone_example.cc


sanayak commented Apr 9, 2021

Thanks @dhruba for the prompt response. A few more questions related to this:

  • Does that mean the clone db shares the same SST files written by the main instance and does not create its own set of SST files throughout its lifetime?
  • How does the clone db get the WAL/memtable entries to serve read requests?
  • Is this clone db writable? If yes, will all the written keys be flushed to the same set of SST files owned by the main instance?


dhruba commented Apr 9, 2021

Hi @sanayak,

When you open a rocksdb-cloud instance, you specify a source cloud path and a destination cloud path. rocksdb-cloud will use all the files in those two directories to open the database. All new files produced by the db are written out to the destination cloud path only.

If you have multiple instances using the same database, then one will be the primary instance and one will be the replica instance. The primary instance opens the db with (srcpath=A, destpath=A). New writes to the primary are written to the cloud bucket path called A.

If you want a replica, open the same rocksdb-cloud database with (srcpath=A, destpath=B). The replica starts off as a clone of the primary database at the time the db is opened from path A. Any files created by the compaction process on the replica database are written to path B. The replica db does not automatically follow the updates from the primary db; it is an independent clone and has its own life. You can write to the clone if you want.

If you want the replica db to follow the updates in the primary db, then periodically close the replica db and re-clone it from the primary by reopening it with (srcpath=A, destpath=C).
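
To make the srcpath/destpath split concrete, here is a rough sketch of opening a replica in the spirit of the clone_example.cc linked above. The bucket name, object paths, region, and local directory are placeholders, and the NewAwsEnv/DBCloud::Open signatures should be verified against the headers in your checkout:

```cpp
#include <cassert>

#include "rocksdb/cloud/cloud_env_options.h"
#include "rocksdb/cloud/db_cloud.h"
#include "rocksdb/options.h"

using namespace rocksdb;

// Open a replica that clones the primary's data: srcpath = "pathA",
// destpath = "pathB". All values below are placeholders for illustration.
DBCloud* OpenReplica() {
  CloudEnvOptions cloud_env_options;
  CloudEnv* cenv = nullptr;
  Status s = CloudEnv::NewAwsEnv(
      Env::Default(),
      "my-bucket", "pathA", "us-west-2",  // source: the primary's files (A)
      "my-bucket", "pathB", "us-west-2",  // destination: this replica's new files (B)
      cloud_env_options, nullptr, &cenv);
  assert(s.ok());

  Options options;
  options.env = cenv;  // route all file I/O through the cloud env

  DBCloud* db = nullptr;
  // Local scratch directory, no persistent cache in this sketch.
  s = DBCloud::Open(options, "/tmp/replica_db", "", 0, &db);
  assert(s.ok());
  return db;
}
```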


sanayak commented Apr 16, 2021

Thanks a lot @dhruba for the detailed explanation.

A couple of follow-up questions:

  1. Is it possible to have hot data in local storage (SSD) and cold data in remote cloud storage based on the timestamp of the insertion/modification of the record? For example, we might define data as hot for 7 days, during which it is accessed very frequently; after that period, it can be moved to inexpensive storage. Does any such mechanism exist in rocksdb/rocksdb-cloud today to address this scenario, or do we need to build it in the application layer itself? Any pointers in this direction would be helpful.

  2. Does auto-compaction work with rocksdb-cloud? In our experiment we had to trigger compaction manually with rocksdb-cloud; just wondering if we are missing any configuration needed to make auto-compaction work.


dhruba commented Apr 16, 2021

  1. Is it possible to have hot data in local storage (SSD) and cold data in remote cloud storage?
    -- You have to build this in your app.

  2. Does auto-compaction work with rocksdb-cloud?
    -- Yes, definitely. We use that feature in our production clusters.
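
As an aside not from the thread: auto compaction in stock RocksDB (which rocksdb-cloud builds on) is governed by the standard disable_auto_compactions option, so one thing worth checking is that it has not been switched off or left disabled after a bulk load. A minimal sketch:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace rocksdb;

void EnsureAutoCompaction(DB* db) {
  // Auto compaction runs unless this standard RocksDB option disables it
  // (the default is false, i.e. auto compaction enabled).
  Options opts = db->GetOptions();
  if (opts.disable_auto_compactions) {
    // It can be flipped at runtime without reopening the DB.
    db->SetOptions({{"disable_auto_compactions", "false"}});
  }
}
```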


sanayak commented Apr 16, 2021

Thanks @dhruba.
From the documentation, I understand that there are two policies for moving data from the local store to the remote store:

  1. Keep all the SST files in remote storage as and when they are closed, with no SSTs in local storage.
  2. Keep the subset of SST files that were written recently in local storage (potentially L0 and L1, maybe).

Just curious whether any mechanism exists to move SSTs across storage tiers (from local storage to remote storage) based on the time they were flushed/closed. Is it possible to define such tiering policies with rocksdb-cloud? That would approximately simulate hot vs. cold data semantics from the rocksdb perspective.
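
For reference (my assumption, not confirmed in this thread): policy 1 appears to map onto the keep_local_sst_files flag in CloudEnvOptions, which controls whether local copies of SST files are retained alongside the cloud copies; a time- or level-based subset policy would, per the discussion below, have to live in the application layer. A sketch, with the field name to be verified against cloud_env_options.h:

```cpp
#include "rocksdb/cloud/cloud_env_options.h"

using namespace rocksdb;

CloudEnvOptions MakeCloudEnvOptions(bool keep_sst_copies_locally) {
  CloudEnvOptions copts;
  // false (policy 1): SST files live only in the cloud bucket once closed.
  // true: a local copy of every SST file is kept as well.
  // NOTE: field name assumed from cloud_env_options.h; verify on your version.
  copts.keep_local_sst_files = keep_sst_copies_locally;
  return copts;
}
```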


dhruba commented Apr 18, 2021

if there exists any mechanism to move the SSTs from local storage to remote storage based on the time it got flushed/closed.

You might be able to use CheckPointToCloud() to manually copy all local SST files to the cloud: https://github.com/rockset/rocksdb-cloud/blob/master/include/rocksdb/cloud/db_cloud.h#L55
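
A rough usage sketch, under the assumption (to be verified against the db_cloud.h line linked above) that the method takes a destination BucketOptions plus a CheckpointToCloudOptions struct; the names here are assumptions, not confirmed:

```cpp
#include "rocksdb/cloud/db_cloud.h"

using namespace rocksdb;

// Hypothetical helper: push the current local files of an open DBCloud
// instance to a cloud bucket. Parameter types (BucketOptions,
// CheckpointToCloudOptions) are assumptions -- check db_cloud.h.
Status PushCheckpoint(DBCloud* db, const BucketOptions& destination) {
  CheckpointToCloudOptions copt;  // default options
  return db->CheckpointToCloud(destination, copt);
}
```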


sanayak commented Apr 19, 2021

Thanks @dhruba. Looking at the implementation of CheckPointToCloud() in https://github.com/rockset/rocksdb-cloud/blob/master/cloud/db_cloud_impl.cc, it moves all the SST files to cloud storage. I think it would need a different implementation if we need to filter out SSTs that do not qualify for copying (i.e. SSTs that were created very recently).


IwfWcf commented Apr 1, 2022

Hi @dhruba,

I noticed that in the Rockset read-replica setup, the read replica tails new updates from the log. Does that mean every read replica does compaction too? Wouldn't that be a waste of work? Is there a way to avoid that by having the read replica share the same files with the primary after the primary's compaction, without reopening the read replica?


dhruba commented Apr 9, 2022

You can configure your rocksdb-cloud instances so that your replicas do not need to do compaction. The replica instance can periodically re-open the db with ephemeral_resync_on_open = true.

https://github.com/rockset/rocksdb-cloud/blob/master/include/rocksdb/cloud/cloud_env_options.h#L301
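
A minimal sketch of that configuration, with the field name taken from the cloud_env_options.h line linked above and everything else a placeholder:

```cpp
#include "rocksdb/cloud/cloud_env_options.h"

using namespace rocksdb;

CloudEnvOptions MakeEphemeralReplicaOptions() {
  CloudEnvOptions copts;
  // On (re)open, resync this ephemeral replica's file set from the source
  // bucket so it picks up the primary's compaction output instead of
  // compacting locally (see cloud_env_options.h linked above).
  copts.ephemeral_resync_on_open = true;
  return copts;
}
```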
