[#25722] yb-admin command to clear rocksdb block cache #28733
// without permanently restricting its size. | ||
auto original_capacity = block_cache->GetCapacity(); | ||
block_cache->SetCapacity(0); | ||
block_cache->SetCapacity(original_capacity); |
Inspired by:
yugabyte-db/src/yb/rocksdb/db/readahead_test.cc
Lines 228 to 234 in 915c743
void PurgeBlockCache() { | |
auto* block_cache = table_factory_->table_options().block_cache.get(); | |
auto capacity = block_cache->GetCapacity(); | |
block_cache->SetCapacity(0); | |
block_cache->SetCapacity(capacity); | |
LOG(INFO) << "Purged block cache"; | |
} |
#include "yb/tserver/tserver_admin.service.h" | ||
#include "yb/tserver/tserver_service.service.h" | ||
|
||
namespace rocksdb { |
Needed for a forward declaration.
|
||
message ClearCacheResponsePB { | ||
optional TabletServerErrorPB error = 1; | ||
optional uint64 cache_capacity_bytes = 2; |
This field is mainly for logging purposes.
repeated RbsInfo rbs_infos = 2; | ||
} | ||
|
||
message ClearCacheRequestPB {} |
If you want to expand this command in the future (e.g. adding a flag to exclude system tables, targeting specific tablets, etc.), this empty request message is a good starting point.
return Status::OK(); | ||
} | ||
|
||
const auto clear_cache_args = "[<timeout_in_seconds>] (default 20)"; |
Followed @arybochkin's comment on #26078 (comment)
@rthallamko3 can I get a reviewer for this PR please? |
@ttyusupov , Can you help review the changes? |
const ClearCacheRequestPB* req, ClearCacheResponsePB* resp, rpc::RpcContext context) { | ||
LOG(INFO) << "Received ClearCache RPC request from " << context.requestor_string(); | ||
|
||
TabletPeers tablet_peers = server_->tablet_manager()->GetTabletPeers(); |
We don't have to do that for every peer, because the RocksDB block cache is shared across all tablets:
tablet::TabletOptions tablet_options_; |
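To illustrate the point about sharing, here is a minimal sketch under stated assumptions. ToyCache, ToyTablet, and PurgeDistinctCaches are hypothetical names made up for this example; they are not YugabyteDB or RocksDB types. The sketch only shows that when every tablet holds a pointer to the same cache object, purging each distinct cache once is enough.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a block cache (not the rocksdb::Cache interface).
struct ToyCache {
  size_t usage = 0;
  int purge_count = 0;

  void Purge() {
    usage = 0;
    ++purge_count;
  }
};

// Hypothetical stand-in for a tablet with a regular DB and an intents DB.
struct ToyTablet {
  std::shared_ptr<ToyCache> regular_db_cache;
  std::shared_ptr<ToyCache> intents_db_cache;
};

// Purges each distinct cache object exactly once, however many tablets
// reference it. Returns the number of distinct caches purged.
size_t PurgeDistinctCaches(const std::vector<ToyTablet>& tablets) {
  std::unordered_set<ToyCache*> seen;
  for (const auto& tablet : tablets) {
    for (const auto& cache : {tablet.regular_db_cache, tablet.intents_db_cache}) {
      // insert().second is true only the first time a pointer is seen.
      if (cache && seen.insert(cache.get()).second) {
        cache->Purge();
        assert(cache->usage == 0);
      }
    }
  }
  return seen.size();
}
```

With a single shared cache, the loop above purges once regardless of tablet count, which is why a per-peer traversal is redundant in that case.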
REGISTER_COMMAND(flush_table); | ||
REGISTER_COMMAND(flush_table_by_id); | ||
REGISTER_COMMAND(flush_sys_catalog); | ||
REGISTER_COMMAND(clear_cache); |
clear_block_cache would be a better name, to avoid confusion with other caches.
// Setting the cache capacity to 0 forces the cache to evict all stored entries, effectively clearing its contents. | ||
// Immediately restoring the capacity to its original value allows the cache to resume normal operation | ||
// without permanently restricting its size. | ||
auto original_capacity = block_cache->GetCapacity(); |
nit: const auto
} | ||
|
||
const auto clear_cache_args = "[<timeout_in_seconds>] (default 20)"; | ||
|
It would be good to add an automated test to yb-admin_client-test.cc that will:
- create a table
- put some data into it
- flush the table to disk (so we have SST files)
- run a scan query over the whole table to load SST blocks into the block cache
- make sure block cache usage is above expected level
- clear block cache
- make sure block cache usage is zero
if (servers.empty()) { | ||
return STATUS(IllegalState, "No tablet servers found in cluster"); | ||
} | ||
LOG(INFO) << "Found " << servers.size() << " tablet servers" << endl; |
We discussed internally that we shouldn't use LOG(INFO) in the yb-admin tool going forward.
Resolves #25722
Changes Made:
The implementation touches the whole request flow: CLI, client side, RPC, and server backend. I added the new clear_cache command to yb-admin and implemented the corresponding client-side logic, which broadcasts cache clearing requests to all tablet servers in the cluster.
On the server side, I created a new ClearCache RPC and implemented the core cache clearing functionality, which traverses all tablets on each tserver to purge the RocksDB block caches for both the regular and intents databases.
How to Purge
The cache clearing mechanism uses the SetCapacity(0) → SetCapacity(original) pattern to purge all cached blocks while preserving the original cache configuration. On each tablet server, the implementation iterates through all tablet peers, retrieves their associated tablets, and clears the block caches of both regular_db (main data storage) and intents_db (transaction intents storage).
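The capacity toggle can be illustrated with a small, self-contained sketch. ToyLruCache and PurgeCache are hypothetical names invented for this example; the class is a simplified LRU cache, not the rocksdb::Cache implementation, and it ignores details such as entries that are pinned by readers (which a real cache may not be able to evict immediately).

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy LRU cache used only to demonstrate the purge pattern.
class ToyLruCache {
 public:
  explicit ToyLruCache(size_t capacity) : capacity_(capacity) {}

  size_t GetCapacity() const { return capacity_; }
  size_t GetUsage() const { return usage_; }

  // Shrinking the capacity evicts least-recently-used entries until usage fits.
  void SetCapacity(size_t capacity) {
    capacity_ = capacity;
    EvictToFit();
  }

  // Inserts a block with the given charge (size in bytes).
  void Insert(const std::string& key, size_t charge) {
    Erase(key);
    lru_.emplace_front(key, charge);
    index_[key] = lru_.begin();
    usage_ += charge;
    EvictToFit();
  }

  bool Contains(const std::string& key) const { return index_.count(key) != 0; }

 private:
  using Entry = std::pair<std::string, size_t>;

  void Erase(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return;
    usage_ -= it->second->second;
    lru_.erase(it->second);
    index_.erase(it);
  }

  void EvictToFit() {
    while (usage_ > capacity_ && !lru_.empty()) {
      const Entry& victim = lru_.back();
      usage_ -= victim.second;
      index_.erase(victim.first);
      lru_.pop_back();
    }
  }

  size_t capacity_;
  size_t usage_ = 0;
  std::list<Entry> lru_;  // Front is the most recently used entry.
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};

// Capacity 0 forces eviction of everything; restoring the original capacity
// lets the cache fill up again. Returns the preserved capacity.
size_t PurgeCache(ToyLruCache* cache) {
  assert(cache != nullptr);
  const auto original_capacity = cache->GetCapacity();
  cache->SetCapacity(0);
  cache->SetCapacity(original_capacity);
  return original_capacity;
}
```

After PurgeCache returns, the toy cache reports zero usage and its original capacity, so later inserts behave as they did before the purge.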
Testing:
✅ I validated the behaviour from the locally running multi-node servers:
./build/latest/bin/yb-admin clear_cache
and verified the results from both the tserver logs and the metrics.
Here is the trimmed log from the locally running cluster when purging block caches:
Here is the screenshot from the Grafana showing the following metrics: