Cherry-pick TiKV related changes to 8.10.fb #364
Conversation
Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
compaction_filter: add bottommost_level into context (tikv#160). Signed-off-by: qupeng <[email protected]>; Signed-off-by: tabokie <[email protected]>
add range for compaction filter context (tikv#192). Signed-off-by: qupeng <[email protected]>; Signed-off-by: tabokie <[email protected]>
allow no_io for VersionSet::GetTableProperties (tikv#211). Signed-off-by: qupeng <[email protected]>; Signed-off-by: tabokie <[email protected]>
expose seqno from compaction filter and iterator (tikv#215): supports accessing `seqno` for every key/value pair in a compaction filter or iterator, which helps enhance GC in TiKV's compaction filter. Signed-off-by: qupeng <[email protected]>; Signed-off-by: tabokie <[email protected]>
allow to query DB stall status (tikv#226): adds a new property, is-write-stalled, to query whether the column family is in a write stall. TiKV runs a compaction filter for GC in which DB::write is called, so knowing whether the DB instance is stalled lets it skip creating more compaction filter instances and save resources. Signed-off-by: qupeng <[email protected]>; Signed-off-by: tabokie <[email protected]>
Fix compatibility issue with Titan. Signed-off-by: v01dstar <[email protected]>
filter deletion in compaction filter (tikv#344): also delays the buffer initialization of a writable file to the first actual write. Signed-off-by: tabokie <[email protected]>
Adjustments for compatibility with 8.10.facebook. Signed-off-by: v01dstar <[email protected]>
Adjust tikv related changes with upstream. Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
Ref tikv#277. When the iterator reads keys in reverse order, each Prev() call costs O(log n) time, so I added a prev pointer to every node in the skiplist to improve Prev(). Signed-off-by: Little-Wallace [email protected]. Implemented new virtual functions: `InsertWithHintConcurrently`, `FindRandomEntry`. Signed-off-by: tabokie <[email protected]>; Signed-off-by: v01dstar <[email protected]>
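To illustrate the idea in the commit above (purely a sketch, not the fork's InlineSkipList code): keeping a backward link on every lowest-level node turns Prev() into a single pointer dereference instead of a fresh O(log n) descent from the head.

```cpp
// Illustrative sketch only; names do not match RocksDB's InlineSkipList.
struct Node {
  int key;
  Node* next;  // level-0 forward link
  Node* prev;  // extra backward link, maintained when the node is inserted
};

// With the prev link, stepping backwards is a constant-time operation:
inline Node* Prev(Node* n) { return n->prev; }

// Without it, a reverse scan has to re-seek from the list head for every
// step, which is where the O(log n) cost per Prev() comes from.
```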
Add WAL write duration metric UCP tikv/tikv#6541 Signed-off-by: Wangweizhen <[email protected]> Signed-off-by: tabokie <[email protected]> Signed-off-by: v01dstar <[email protected]>
I want to use the rocksdb::WriteBatch format to encode key-value pairs in TiKV, and I need a more efficient method to copy data from Entry to WriteBatch directly so that I can avoid the CPU cost of decoding. Signed-off-by: Little-Wallace <[email protected]>; Signed-off-by: tabokie <[email protected]>; Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
Implement multi-batch write. Signed-off-by: v01dstar <[email protected]>
Fix SIGABRT caused by uninitialized mutex (tikv#296) (tikv#298):
* Fix SIGABRT caused by uninitialized mutex
* Use spinlock instead of mutex to reduce writer ctor cost
* Update db/write_thread.h
Signed-off-by: Wenbo Zhang <[email protected]>; Co-authored-by: Xinye Tao <[email protected]>
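As a generic illustration of the "spinlock instead of mutex" design choice mentioned above (not the actual code in db/write_thread.h), a minimal spinlock built on std::atomic_flag is trivially cheap to construct, which is what reduces the writer constructor cost:

```cpp
// Generic sketch, not the fork's implementation.
#include <atomic>

class SpinLock {
 public:
  void lock() {
    // Busy-wait until the flag is cleared. Construction involves no syscalls
    // and no pthread_mutex_init, unlike a full-blown mutex.
    while (flag_.test_and_set(std::memory_order_acquire)) {
    }
  }
  void unlock() { flag_.clear(std::memory_order_release); }

 private:
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```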
Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
Signed-off-by: hillium <[email protected]> Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
* Add copy constructor for ColumnFamilyHandleImpl Signed-off-by: Yang Zhang <[email protected]>
* return sequence number of writes
* fix compile error
Signed-off-by: 5kbpers <[email protected]>; Signed-off-by: tabokie <[email protected]>
…nFlushBegin event (tikv#300):
* add largest seqno of memtable
* add test
* address comments
* format
* memtable info
Signed-off-by: 5kbpers <[email protected]>; Signed-off-by: Yang Zhang <[email protected]>
A callback that is called after write succeeds and changes have been applied to memtable. Titan change: tikv/titan#270 Signed-off-by: tabokie <[email protected]> Signed-off-by: Yang Zhang <[email protected]>
Summary: Modify the existing write buffer manager to support multiple instances. Previously, a flush was triggered before user writes if `ShouldFlush()` returned true; in the multiple-instance context this would cause flushing for all DBs that are undergoing writes. In this patch, column families are registered in a shared linked list inside the write buffer manager. When the flush condition is triggered, the column family with the highest score in this list is chosen and flushed. The score can be either size or age. The flush condition calculation is also changed to exclude immutable memtables, because RocksDB schedules a flush every time an immutable memtable is generated; they will eventually be evicted from memory as long as flush bandwidth doesn't bottleneck.

Test plan:
- Unit test cases
- Trigger flush of largest/oldest memtable in another DB
- Resolve flush condition by destroying CF/DB
- Dynamically change flush threshold
- Manual test with insert, update, and read-write workloads ([script](https://gist.github.com/tabokie/d38d27dc3843946c7813ab7bafd0f753))

Signed-off-by: tabokie <[email protected]>; Signed-off-by: Yang Zhang <[email protected]>
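A rough sketch of the selection policy described above, purely illustrative and not the fork's WriteBufferManager code: registered column families are scored by size or age, and only the highest-scoring one is flushed.

```cpp
// Illustrative only; the real multi-instance WriteBufferManager differs.
#include <cstdint>
#include <list>

struct RegisteredCF {
  uint64_t mutable_memtable_bytes;  // immutable memtables are excluded
  uint64_t oldest_entry_age_us;     // age-based score alternative
};

// Pick the single best flush candidate across all registered CFs (which may
// belong to different DB instances sharing this manager).
RegisteredCF* PickFlushCandidate(std::list<RegisteredCF>& cfs, bool by_age) {
  RegisteredCF* best = nullptr;
  uint64_t best_score = 0;
  for (auto& cf : cfs) {
    uint64_t score =
        by_age ? cf.oldest_entry_age_us : cf.mutable_memtable_bytes;
    if (best == nullptr || score > best_score) {
      best = &cf;
      best_score = score;
    }
  }
  return best;  // the caller schedules a flush for this CF only
}
```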
* fix bug of using post write callback with empty batch
* fix nullptr
Signed-off-by: tabokie <[email protected]>
Add support to merge multiple DBs that have no overlapping data (tombstones included). Memtables are frozen and then referenced by the target DB. Table files are hard linked with new file numbers into the target DB.

After the merge, the sequence numbers of memtables and L0 files will appear out-of-order compared to a single DB. But for any given user key, the ordering still holds, because there is only one unique source DB that contains the key, and that source DB's ordering is inherited by the target DB.

If source and target instances share the same block cache, the target instance can reuse the cache. This is done by cloning the table readers of the source instances into the target instance; because the cache key is stored in the table reader, reads after the merge can still retrieve the source instances' blocks via the old cache key.

Under a release build, it takes 8ms to merge a 25GB DB (500 files) into another. Signed-off-by: tabokie <[email protected]>
* exclude uninitialized files when estimating compression ratio
* add comment
* fix flaky test
Signed-off-by: tabokie <[email protected]>
* hook delete dir in encrypted env
* add a comment
Signed-off-by: tabokie <[email protected]>
* add toggle
* protect underflow
* fix build
* remove deadline and add penalty for l0 files
* fix build
* consider compaction trigger
Signed-off-by: tabokie <[email protected]>
Also added a new option to detect whether manual compaction is disabled. In practice we use this to avoid blocking on flushing a tablet that will be destroyed shortly after. Signed-off-by: tabokie <[email protected]>
…heckpoint (tikv#338):
* fix renaming encrypted directory
* fix build
* patch test manager
* fix build
* check compaction paused during checkpoint
* add comment
Signed-off-by: tabokie <[email protected]>
And delay the buffer initialization of a writable file to the first actual write. Signed-off-by: tabokie <[email protected]>
Signed-off-by: Spade A <[email protected]>
Signed-off-by: SpadeA-Tang <[email protected]>
Force-pushed from 2e011bf to 3ea895b
Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: v01dstar <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
    return;
  }

  ++total_requests_[pri];
It looks like TiKV ran into a segfault on this line. I'm still trying to understand why, but I guess it's related to the addition of new IO priorities.
Here's the evidence:
- segfault information:
[Mon Jun 24 10:59:14 2024] apply-1[2562783]: segfault at 7f68952f40a0 ip 000055bd6eacc09a sp 00007f6c1d621f90 error 6 in tikv-server[55bd69e00000+6052000]
- The static address of the segfault code can be calculated as ip (000055bd6eacc09a) - base_addr_of_tikv (55bd69e00000) = 0x4ccc09a.
- With the help of gdb, we can locate the code of that address.
$ gdb tikv-server
(gdb) info line *0x4ccc09a
Line 172 of "/workspace/.cargo/git/checkouts/rust-rocksdb-9e01d192e8b6561d/af14652/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc"
starts at address 0x4ccc093 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+147>
and ends at 0x4ccc0a2 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+162>.
RocksDB introduced more types of priorities (User, Mid, etc.) since 8.x, while WriteAmpRateLimiter only considered 3 of them. I thought it could still work, but maybe this can cause some problems.
One interesting thing: the segfault seemed to only happen on Linux machines; I wasn't able to reproduce it on Mac, so it could be arch/compiler related.
Also, I managed to get a coredump of the segfault on Linux.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005595b5961efa in rocksdb::WriteAmpBasedRateLimiter::Request (this=0x7ffa1e214000, bytes=46541, pri=<optimized out>, stats=0x0)
at /root/tabokie/packages/cargo/.cargo/git/checkouts/rust-rocksdb-da3ff04b5d606849/ca1f1dd/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc:172
172 ++total_requests_[pri];
It shows that `total_requests_` was initialized with a length of 4 as expected. But the `pri` was optimized out.
(gdb) print total_requests_
$1 = {1407, 16, 274, 0}
(gdb) print pri
$2 = <optimized out>
Given the definition of IOPriority, the only way for it to cause a segfault is when `pri` equals `IO_TOTAL`, but I don't think that's how we expect `pri` to be used...
enum IOPriority {
IO_LOW = 0,
IO_MID = 1,
IO_HIGH = 2,
IO_USER = 3,
IO_TOTAL = 4
};
Still investigating...
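To make the suspected failure mode concrete, here is a minimal, self-contained illustration (not the actual WriteAmpBasedRateLimiter code) of why a `pri` of `IO_TOTAL`, or a garbage value, would index past the end of the 4-element counter array:

```cpp
// Minimal illustration, not the real rate limiter code.
#include <cstdint>
#include <cstdio>

enum IOPriority { IO_LOW = 0, IO_MID = 1, IO_HIGH = 2, IO_USER = 3, IO_TOTAL = 4 };

struct RateLimiterLike {
  int64_t total_requests_[IO_TOTAL] = {0, 0, 0, 0};  // valid indices: 0..3
  void Request(IOPriority pri) {
    // If pri arrives as IO_TOTAL (4) -- or as garbage from an uninitialized
    // field -- this writes past the end of the array: undefined behavior
    // that can surface as the SIGSEGV shown above.
    ++total_requests_[pri];
  }
};

int main() {
  RateLimiterLike rl;
  rl.Request(IO_HIGH);  // fine: index 2
  // rl.Request(IO_TOTAL);  // out of bounds: the suspected crash scenario
  std::printf("%lld\n", static_cast<long long>(rl.total_requests_[IO_HIGH]));
}
```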
Found the problem! Turns out that one constructor for `Writer` did not initialize `rate_limiter_priority`. With this one-line fix, the segfault problem went away.
We might want to check how the bug was introduced and whether there could be other similar problems.
Update: The upstream 8.10.fb branch does not have this problem (db/write_thread.h), so it's likely an oversight when we cherry-picked the commits.
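For illustration only, a fix of this shape simply gives the field a defined value in the constructor's initializer list. The snippet below is a sketch, not the actual diff in db/write_thread.h, and the default value shown is an assumption:

```cpp
// Sketch of the kind of one-line fix described above; illustrative only.
// IOPriority matches the enum quoted earlier in the thread.
enum IOPriority { IO_LOW = 0, IO_MID = 1, IO_HIGH = 2, IO_USER = 3, IO_TOTAL = 4 };

struct Writer {
  IOPriority rate_limiter_priority;

  // Before: this constructor left rate_limiter_priority uninitialized, so the
  // rate limiter could be asked to index its per-priority arrays with a
  // garbage value.
  // After: initialize it explicitly (IO_HIGH here is only a placeholder; the
  // value chosen in the real fix is not shown in this thread).
  Writer() : rate_limiter_priority(IO_HIGH) {}
};
```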
👍
Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: v01dstar <[email protected]>
6.29 (last TiKV base) diff: facebook/rocksdb@6.29.fb...tikv:rocksdb:6.29.tikv
- `LOW`, `MID`, `HIGH`, `USER` instead of just `LOW` and `HIGH`; the write amplification rate limiter's tuning logic may need to adjust to this change.
- 23c8635 <- removed (this is a bug) `base_background_compactions` related code due to "Remove unused API base_background_compactions" (facebook/rocksdb#9462)
- `ROCKSDB_LITE` macro use
- `rocksdb::BatchWrite()` (with some custom code). Whenever `rocksdb::BatchWrite` changes, multi-batch write's implementation should change accordingly; this requires human intervention.
- `ROCKSDB_LITE` macro

Already exist in upstream (fb):
No longer needed:
Need triage:
To be verified:
Complications: