kenju
Contributor

@kenju kenju commented Sep 29, 2025

Resolves #25722

Changes Made:

The implementation touches the whole request flow: the CLI, the client side, the RPC layer, and the server backend. I added a new clear_cache command to yb-admin and implemented the corresponding client-side logic, which broadcasts cache-clearing requests to all tablet servers in the cluster.

On the server side, I created a new ClearCache RPC service and implemented the core cache-clearing functionality, which traverses all tablets on each tserver and purges the RocksDB block caches for both the regular and intents databases.

How to Purge

The cache-clearing mechanism uses a SetCapacity(0) → SetCapacity(original) pattern to purge all cached blocks while preserving the original cache configuration. For each tablet server, the implementation iterates through all tablet peers, retrieves their associated tablets, and clears the block caches of both the regular_db (main data storage) and intents_db (transaction intents storage).
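
To make the pattern concrete, here is a minimal sketch against a toy LRU cache. ToyBlockCache and PurgeCache are hypothetical names for illustration only; this is not rocksdb::Cache and not the code in this PR. It only models the behaviour the pattern relies on: shrinking the capacity evicts entries, and restoring it leaves an empty cache with the original size limit. A real cache may additionally keep entries that readers currently hold references to, which the toy does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy LRU cache, a stand-in for a block cache (NOT the real rocksdb::Cache).
class ToyBlockCache {
 public:
  explicit ToyBlockCache(std::size_t capacity) : capacity_(capacity) {}

  std::size_t GetCapacity() const { return capacity_; }
  std::size_t GetUsage() const { return usage_; }
  bool Contains(const std::string& key) const { return index_.count(key) > 0; }

  // Shrinking the capacity evicts least-recently-inserted entries until the
  // cache fits again, so a capacity of 0 evicts everything.
  void SetCapacity(std::size_t capacity) {
    capacity_ = capacity;
    EvictToFit();
  }

  // Inserts a block and charges its size against the capacity.
  void Insert(const std::string& key, std::string block) {
    Erase(key);
    usage_ += block.size();
    lru_.emplace_front(key, std::move(block));
    index_[key] = lru_.begin();
    EvictToFit();
  }

 private:
  using Entry = std::pair<std::string, std::string>;

  void Erase(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return;
    usage_ -= it->second->second.size();
    lru_.erase(it->second);
    index_.erase(it);
  }

  void EvictToFit() {
    while (usage_ > capacity_ && !lru_.empty()) {
      usage_ -= lru_.back().second.size();
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
  }

  std::size_t capacity_;
  std::size_t usage_ = 0;
  std::list<Entry> lru_;
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};

// The purge pattern: drop the capacity to zero, then restore it.
// Returns the capacity that was preserved.
std::size_t PurgeCache(ToyBlockCache* cache) {
  const std::size_t original_capacity = cache->GetCapacity();
  cache->SetCapacity(0);
  cache->SetCapacity(original_capacity);
  return original_capacity;
}
```

After PurgeCache returns, the toy cache reports zero usage and its original capacity, so later inserts are accepted as before.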

Testing:

✅ I validated the behaviour on a locally running multi-node cluster:

  • Set up a 2-node cluster using ./bin/yugabyted start
  • Inserted test data to populate the block caches
  • Successfully executed ./build/latest/bin/yb-admin clear_cache and verified the results in both the tserver logs and the metrics

Here is the trimmed log from the locally running cluster while purging the block caches:

I0929 07:20:06.320581 1865035776 tablet_service.cc:2610] Received ClearCache RPC request from 127.0.0.1:51992
I0929 07:20:06.320622 1865035776 tablet_service.cc:2614] Found 18 tablet peers on this tserver
I0929 07:20:06.320639 1865035776 tablet_service.cc:2596] Purged regular block cache for tablet c81ac388637d4c7c8b5d8886c218a661 (capacity: 9895604649 bytes)
I0929 07:20:06.320647 1865035776 tablet_service.cc:2596] Purged intents block cache for tablet c81ac388637d4c7c8b5d8886c218a661 (capacity: 9895604649 bytes)
I0929 07:20:06.320874 1865035776 tablet_service.cc:2596] Purged regular block cache for tablet 2d385b1548704559be8ee72196cdf735 (capacity: 9895604649 bytes)
I0929 07:20:06.320879 1865035776 tablet_service.cc:2596] Purged intents block cache for tablet 2d385b1548704559be8ee72196cdf735 (capacity: 9895604649 bytes)
(...)
W0929 07:20:06.320926 1865035776 tablet_service.cc:2623] Failed to get tablet 7dfe5ede85d841ffb5cd390ab6fedc43: Illegal state (yb/tablet/tablet_peer.cc:864): Tablet not running: tablet object 7dfe5ede85d841ffb5cd390ab6fedc43 has invalid state kDestroyed
W0929 07:20:06.320931 1865035776 tablet_service.cc:2623] Failed to get tablet 6611490ba739471c9e1e488bcbcb6777: Illegal state (yb/tablet/tablet_peer.cc:864): Tablet not running: tablet object 6611490ba739471c9e1e488bcbcb6777 has invalid state kDestroyed
I0929 07:20:06.320937 1865035776 tablet_service.cc:2596] Purged regular block cache for tablet 57bafd8ed1f848b29e358e857a1d6977 (capacity: 9895604649 bytes)
I0929 07:20:06.320940 1865035776 tablet_service.cc:2596] Purged intents block cache for tablet 57bafd8ed1f848b29e358e857a1d6977 (capacity: 9895604649 bytes)
I0929 07:20:06.320943 1865035776 tablet_service.cc:2649] Successfully cleared cache on 10 tablets
I0929 07:20:06.320950 1865035776 tablet_service.cc:2653] Successfully responded to ClearCache RPC
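
The log above shows the handler warning about tablets it cannot obtain (state kDestroyed) and continuing with the rest instead of failing the whole request. A minimal sketch of that skip-and-continue loop follows. ToyTabletPeer, ToyState, ClearSummary, and ClearRunningTablets are hypothetical names, and the state check is a simplification of whatever the real tablet lookup does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified tablet states; the real state machine has more.
enum class ToyState { kRunning, kDestroyed };

// Hypothetical stand-in for a tablet peer.
struct ToyTabletPeer {
  ToyTabletPeer(std::string id, ToyState s)
      : tablet_id(std::move(id)), state(s) {}
  std::string tablet_id;
  ToyState state;
  bool purged = false;
};

// What the handler could report back after walking all peers.
struct ClearSummary {
  std::size_t peers_seen = 0;
  std::size_t cleared = 0;
  std::vector<std::string> skipped_ids;
};

// Purges every running tablet; tablets that are not running are recorded
// and skipped rather than aborting the whole operation.
ClearSummary ClearRunningTablets(std::vector<ToyTabletPeer>* peers) {
  ClearSummary summary;
  summary.peers_seen = peers->size();
  for (auto& peer : *peers) {
    if (peer.state != ToyState::kRunning) {
      summary.skipped_ids.push_back(peer.tablet_id);
      continue;
    }
    peer.purged = true;  // placeholder for the actual cache purge
    ++summary.cleared;
  }
  return summary;
}
```

Returning the skipped IDs alongside the counts keeps the warning information available to the caller without making a destroyed tablet an error.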

Here is a Grafana screenshot showing the following metrics:

block_cache_single_touch_usage{job="yugabytedb-tserver"}
block_cache_multi_touch_usage{job="yugabytedb-tserver"}
[Screenshot: Grafana graph of the two metrics above]

@CLAassistant

CLAassistant commented Sep 29, 2025

CLA assistant check
All committers have signed the CLA.

@kenju kenju changed the title from "[#25722] Implemented yb-admin command to clear the block cache" to "[#25722] yb-admin command to clear rocksdb block cache" on Sep 29, 2025
// without permanently restricting its size.
auto original_capacity = block_cache->GetCapacity();
block_cache->SetCapacity(0);
block_cache->SetCapacity(original_capacity);
Contributor Author

inspired by

void PurgeBlockCache() {
  auto* block_cache = table_factory_->table_options().block_cache.get();
  auto capacity = block_cache->GetCapacity();
  block_cache->SetCapacity(0);
  block_cache->SetCapacity(capacity);
  LOG(INFO) << "Purged block cache";
}

#include "yb/tserver/tserver_admin.service.h"
#include "yb/tserver/tserver_service.service.h"

namespace rocksdb {
Contributor Author

needed for the forward declaration


message ClearCacheResponsePB {
  optional TabletServerErrorPB error = 1;
  optional uint64 cache_capacity_bytes = 2;
Contributor Author

this field is more for logging purposes

repeated RbsInfo rbs_infos = 2;
}

message ClearCacheRequestPB {}
Contributor Author

If you want to expand this command in the future (e.g. adding a flag to exclude system tables, targeting specific tablets, etc.), this is a good starting point.

return Status::OK();
}

const auto clear_cache_args = "[<timeout_in_seconds>] (default 20)";
Contributor Author

Followed @arybochkin's comment on #26078 (comment)

@kenju
Contributor Author

kenju commented Oct 1, 2025

@rthallamko3 can I get a reviewer for this PR please?

@rthallamko3 rthallamko3 requested a review from ttyusupov October 1, 2025 19:26
@rthallamko3
Contributor

@ttyusupov, can you help review the changes?

const ClearCacheRequestPB* req, ClearCacheResponsePB* resp, rpc::RpcContext context) {
LOG(INFO) << "Received ClearCache RPC request from " << context.requestor_string();

TabletPeers tablet_peers = server_->tablet_manager()->GetTabletPeers();
Contributor

We don't have to do that for every peer because the RocksDB block cache is shared across all tablets:

tablet::TabletOptions tablet_options_;
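
A small sketch of why a single purge is enough when the cache object is shared. The types below (ToyCache, ToyTabletOptions, ToyTablet) and the block_cache member name are hypothetical stand-ins, not the actual tablet::TabletOptions layout. The assumption being illustrated is that every tablet's regular and intents DB hold a pointer to the same cache instance.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for a block cache; counts how often it is purged.
struct ToyCache {
  explicit ToyCache(std::size_t usage_bytes) : usage(usage_bytes) {}
  void Purge() {
    usage = 0;
    ++purge_calls;
  }
  std::size_t usage;
  int purge_calls = 0;
};

// Hypothetical server-level options holding the single shared cache.
struct ToyTabletOptions {
  std::shared_ptr<ToyCache> block_cache;
};

// Assumption: both DBs of every tablet point at the same shared cache.
struct ToyTablet {
  std::shared_ptr<ToyCache> regular_db_cache;
  std::shared_ptr<ToyCache> intents_db_cache;
};

std::vector<ToyTablet> MakeTablets(const ToyTabletOptions& options,
                                   std::size_t n) {
  std::vector<ToyTablet> tablets;
  for (std::size_t i = 0; i < n; ++i) {
    tablets.push_back(ToyTablet{options.block_cache, options.block_cache});
  }
  return tablets;
}

// Per-peer approach: purges the same object 2 * tablets.size() times.
void PurgePerTablet(const std::vector<ToyTablet>& tablets) {
  for (const auto& tablet : tablets) {
    tablet.regular_db_cache->Purge();
    tablet.intents_db_cache->Purge();
  }
}

// Shared-cache approach: one purge through the server-level handle.
void PurgeOnce(const ToyTabletOptions& options) {
  if (options.block_cache) options.block_cache->Purge();
}
```

With nine toy tablets, PurgePerTablet calls Purge eighteen times on the same object, while PurgeOnce produces the same end state with one call.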

REGISTER_COMMAND(flush_table);
REGISTER_COMMAND(flush_table_by_id);
REGISTER_COMMAND(flush_sys_catalog);
REGISTER_COMMAND(clear_cache);
Contributor

clear_block_cache would be a better name, to avoid confusion with other caches

// Setting the cache capacity to 0 forces the cache to evict all stored entries, effectively clearing its contents.
// Immediately restoring the capacity to its original value allows the cache to resume normal operation
// without permanently restricting its size.
auto original_capacity = block_cache->GetCapacity();
Contributor

nit: const auto

}

const auto clear_cache_args = "[<timeout_in_seconds>] (default 20)";

Contributor

It would be good to add an automated test to yb-admin_client-test.cc that will:

  • create a table
  • put some data into it
  • flush the table to disk (so we have SST files)
  • run a scan query over the whole table to load SST blocks into the block cache
  • make sure block cache usage is above the expected level
  • clear the block cache
  • make sure block cache usage is zero
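
The steps above can be sketched against a toy in-memory model to show the shape of the assertions. ToyTable, ScenarioResult, and RunClearCacheScenario are hypothetical names; a real test in yb-admin_client-test.cc would use the project's own test cluster and metrics instead. The final re-scan is an extra sanity check, not part of the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Toy table: writes land in a memtable, Flush() moves them to "SST" storage,
// and scans load SST rows into a block cache. Not the real storage engine.
class ToyTable {
 public:
  void Put(const std::string& key, const std::string& value) {
    memtable_[key] = value;
  }

  void Flush() {
    for (auto& kv : memtable_) sst_[kv.first] = std::move(kv.second);
    memtable_.clear();
  }

  // Scans every row; rows read from SST storage are charged to the cache.
  std::size_t ScanAll() {
    std::size_t rows = 0;
    for (const auto& kv : sst_) {
      if (cache_.emplace(kv.first, kv.second).second) {
        cache_usage_ += kv.second.size();
      }
      ++rows;
    }
    for (const auto& kv : memtable_) {
      if (sst_.count(kv.first) == 0) ++rows;
    }
    return rows;
  }

  std::size_t BlockCacheUsage() const { return cache_usage_; }

  void ClearBlockCache() {
    cache_.clear();
    cache_usage_ = 0;
  }

 private:
  std::map<std::string, std::string> memtable_;
  std::map<std::string, std::string> sst_;
  std::map<std::string, std::string> cache_;
  std::size_t cache_usage_ = 0;
};

struct ScenarioResult {
  std::size_t usage_after_scan;
  std::size_t usage_after_clear;
  std::size_t rows_after_clear;
};

// Follows the listed steps: create, put, flush, scan, measure, clear, measure.
ScenarioResult RunClearCacheScenario(std::size_t num_rows,
                                     std::size_t value_size) {
  ToyTable table;
  for (std::size_t i = 0; i < num_rows; ++i) {
    table.Put("row" + std::to_string(i), std::string(value_size, 'v'));
  }
  table.Flush();
  table.ScanAll();
  ScenarioResult result;
  result.usage_after_scan = table.BlockCacheUsage();
  table.ClearBlockCache();
  result.usage_after_clear = table.BlockCacheUsage();
  result.rows_after_clear = table.ScanAll();  // data must still be readable
  return result;
}
```

The flush step matters in this model for the same reason as in the list: rows still in the memtable never touch the cache, so a scan before flushing would leave usage at zero and the test would prove nothing.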

if (servers.empty()) {
  return STATUS(IllegalState, "No tablet servers found in cluster");
}
LOG(INFO) << "Found " << servers.size() << " tablet servers" << endl;
Contributor

Discussed internally that we shouldn't be using LOG(INFO) for yb-admin tool going forward


Development

Successfully merging this pull request may close these issues.

[DocDB] yb-admin command to clear the block cache
