
Cherry-pick TiKV related changes to 8.10.fb #364

Closed
wants to merge 38 commits

Conversation


@v01dstar commented Apr 11, 2024

6.29 (last TiKV base) diff: facebook/rocksdb@6.29.fb...tikv:rocksdb:6.29.tikv

Already exist in upstream (fb):

No longer needed:

Need triage:

To be verified:

Complications:

  • WriteBufferManager has changed significantly upstream
  • SST file epoch numbers were introduced; instance merge needs to accommodate that change.
  • A write stall behavior change introduced by the multi-instance support project caused some tests to fail.
  • RocksDB now uses the C++17 standard

v01dstar and others added 8 commits February 6, 2024 23:21
Signed-off-by: v01dstar <[email protected]>
Signed-off-by: v01dstar <[email protected]>
compaction_filter: add bottommost_level into context (tikv#160)

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

add range for compaction filter context (tikv#192)

* add range for compaction filter context

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

allow no_io for VersionSet::GetTableProperties (tikv#211)

* allow no_io for VersionSet::GetTableProperties

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

expose seqno from compaction filter and iterator (tikv#215)

This PR supports accessing the `seqno` of every key/value pair in a compaction filter or iterator.
This helps enhance GC in TiKV's compaction filter.

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>

allow to query DB stall status (tikv#226)

This PR adds a new property, `is-write-stalled`, to query whether the column family is in a write stall.

In TiKV, a compaction filter used for GC calls DB::Write. If we can query whether the DB instance is stalled, we can skip creating more compaction filter instances to save resources.

Signed-off-by: qupeng <[email protected]>
Signed-off-by: tabokie <[email protected]>
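A minimal sketch of the pattern this commit enables, with hypothetical names (the helper below is illustrative, not the actual TiKV code): consult the stall status before spawning another GC compaction filter that writes back to the DB.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: decide whether to create another GC compaction filter
// instance, given a query of the column family's write-stall status (e.g. the
// new "is-write-stalled" property). Skipping while stalled saves resources,
// because the filter itself issues DB::Write calls.
inline bool ShouldSpawnGcFilter(const std::function<bool()>& is_write_stalled) {
  return !is_write_stalled();
}
```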

Fix compatibility issue with Titan

Signed-off-by: v01dstar <[email protected]>

filter deletion in compaction filter (tikv#344)

And delay the buffer initialization of writable file to first actual write.

---------

Signed-off-by: tabokie <[email protected]>

Adjustments for compatibility with 8.10.facebook

Signed-off-by: v01dstar <[email protected]>

Adjust tikv related changes with upstream

Signed-off-by: v01dstar <[email protected]>
Ref tikv#277

When the iterator reads keys in reverse order, each Prev() call costs O(log n). So I added a prev pointer to every node in the skiplist to improve the Prev() function.

Signed-off-by: Little-Wallace [email protected]
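The idea can be sketched with a toy node type (names are illustrative; the real change lives in RocksDB's skiplist): each node carries a back pointer, so stepping backwards is a single pointer chase instead of an O(log n) re-search from the head.

```cpp
#include <cassert>
#include <cstddef>

// Toy sketch: a node that stores a `prev` pointer in addition to its forward
// link, so Prev() is O(1) rather than an O(log n) search from the head.
struct ToyNode {
  int key = 0;
  ToyNode* next = nullptr;
  ToyNode* prev = nullptr;  // the extra per-node pointer this change adds
};

// Link `n` after `tail`, maintaining both directions.
inline ToyNode* LinkAfter(ToyNode* tail, ToyNode* n) {
  n->prev = tail;
  if (tail != nullptr) tail->next = n;
  return n;
}
```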

Implemented new virtual functions:
- `InsertWithHintConcurrently`
- `FindRandomEntry`

Signed-off-by: tabokie <[email protected]>
Signed-off-by: v01dstar <[email protected]>
Add WAL write duration metric

UCP tikv/tikv#6541

Signed-off-by: Wangweizhen <[email protected]>
Signed-off-by: tabokie <[email protected]>
Signed-off-by: v01dstar <[email protected]>
I want to use the rocksdb::WriteBatch format to encode TiKV's key-value pairs, and I need a more efficient method to copy data from Entry to WriteBatch directly, so that I can avoid the CPU cost of decoding.

Signed-off-by: Little-Wallace <[email protected]>
Signed-off-by: tabokie <[email protected]>
Signed-off-by: v01dstar <[email protected]>
v01dstar and others added 6 commits May 19, 2024 00:12
Signed-off-by: v01dstar <[email protected]>
Implement multi batches write

Signed-off-by: v01dstar <[email protected]>

Fix SIGABRT caused by uninitialized mutex (tikv#296) (tikv#298)

* Fix SIGABRT caused by uninitialized mutex

Signed-off-by: Wenbo Zhang <[email protected]>

* Use spinlock instead of mutex to reduce writer ctor cost

Signed-off-by: Wenbo Zhang <[email protected]>

* Update db/write_thread.h

Co-authored-by: Xinye Tao <[email protected]>
Signed-off-by: Wenbo Zhang <[email protected]>

Co-authored-by: Xinye Tao <[email protected]>
Signed-off-by: Wenbo Zhang <[email protected]>

Co-authored-by: Xinye Tao <[email protected]>
Signed-off-by: v01dstar <[email protected]>
YuJuncen and others added 4 commits May 21, 2024 01:18
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
v01dstar and others added 9 commits May 21, 2024 19:25
* Add copy constructor for ColumnFamilyHandleImpl

Signed-off-by: Yang Zhang <[email protected]>
* return sequence number of writes

Signed-off-by: 5kbpers <[email protected]>

* fix compile error

Signed-off-by: 5kbpers <[email protected]>
Signed-off-by: tabokie <[email protected]>
…nFlushBegin event (tikv#300)

* add largest seqno of memtable

Signed-off-by: 5kbpers <[email protected]>

* add test

Signed-off-by: 5kbpers <[email protected]>

* address comment

Signed-off-by: 5kbpers <[email protected]>

* address comment

Signed-off-by: 5kbpers <[email protected]>

* format

Signed-off-by: 5kbpers <[email protected]>

* memtable info

Signed-off-by: 5kbpers <[email protected]>

Signed-off-by: 5kbpers <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
A callback that is invoked after a write succeeds and its changes have been applied to the memtable.

Titan change: tikv/titan#270

Signed-off-by: tabokie <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Summary:

Modify existing write buffer manager to support multiple instances.

Previously, a flush was triggered before user writes if `ShouldFlush()` returned
true. But in the multi-instance context, this causes flushes for all DBs that
are undergoing writes.

In this patch, column families are registered in a shared linked list inside
the write buffer manager. When the flush condition is triggered, the column
family with the highest score in this list is chosen and flushed. The score
can be either size or age.

The flush condition calculation is also changed to exclude immutable memtables.
This is because RocksDB schedules a flush every time an immutable memtable is
generated, so they will eventually be evicted from memory provided flush
bandwidth isn't a bottleneck.

Test plan:

- Unit test cases
  - Trigger flush of largest/oldest memtable in another DB
  - Resolve the flush condition by destroying a CF/DB
  - Dynamically change flush threshold
- Manual test insert, update, read-write workload, [script](https://gist.github.com/tabokie/d38d27dc3843946c7813ab7bafd0f753).

Signed-off-by: tabokie <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
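The selection step described above can be sketched as follows, with hypothetical names (the real implementation keeps column families on a shared linked list inside the write buffer manager):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: score each registered column family by size or age and
// pick the highest-scoring one to flush when the shared threshold is hit.
struct CfCandidate {
  uint64_t mutable_memtable_bytes;  // "size" score (immutables excluded)
  uint64_t oldest_entry_age_secs;   // "age" score
};

enum class ScoreBy { kSize, kAge };

// Returns the index of the column family to flush, or -1 if none registered.
inline int PickCfToFlush(const std::vector<CfCandidate>& cfs, ScoreBy by) {
  if (cfs.empty()) return -1;
  auto score = [by](const CfCandidate& c) {
    return by == ScoreBy::kSize ? c.mutable_memtable_bytes
                                : c.oldest_entry_age_secs;
  };
  int best = 0;
  for (int i = 1; i < static_cast<int>(cfs.size()); ++i) {
    if (score(cfs[i]) > score(cfs[best])) best = i;
  }
  return best;
}
```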
* fix bug of using post write callback with empty batch

Signed-off-by: tabokie <[email protected]>

* fix nullptr

Signed-off-by: tabokie <[email protected]>

Signed-off-by: tabokie <[email protected]>
Add support to merge multiple DBs that have no overlapping data (tombstone included).

Memtables are frozen and then referenced by the target DB. Table files are hard linked
with new file numbers into the target DB. After merge, the sequence numbers of memtables
and L0 files will appear out-of-order compared to a single DB. But for any given user
key, the ordering still holds because there will only be one unique source DB that
contains the key and the source DB's ordering is inherited by the target DB.

If the source and target instances share the same block cache, the target
instance can reuse cached blocks. This is done by cloning the table readers of
the source instances into the target instance. Because the cache key is stored
in the table reader, reads after the merge can still retrieve the source
instances' blocks via the old cache keys.

Under release build, it takes 8ms to merge a 25GB DB (500 files) into another.

Signed-off-by: tabokie <[email protected]>
* exclude uninitialized files when estimating compression ratio

Signed-off-by: tabokie <[email protected]>

* add comment

Signed-off-by: tabokie <[email protected]>

* fix flaky test

Signed-off-by: tabokie <[email protected]>

---------

Signed-off-by: tabokie <[email protected]>
* hook delete dir in encrypted env

Signed-off-by: tabokie <[email protected]>

* add a comment

Signed-off-by: tabokie <[email protected]>

---------

Signed-off-by: tabokie <[email protected]>
tabokie and others added 7 commits May 21, 2024 23:49
* add toggle

Signed-off-by: tabokie <[email protected]>

* protect underflow

Signed-off-by: tabokie <[email protected]>

* fix build

Signed-off-by: tabokie <[email protected]>

* remove deadline and add penalty for l0 files

Signed-off-by: tabokie <[email protected]>

* fix build

Signed-off-by: tabokie <[email protected]>

* consider compaction trigger

Signed-off-by: tabokie <[email protected]>

---------

Signed-off-by: tabokie <[email protected]>
Also added a new option to detect whether manual compaction is disabled. In practice we use this to avoid blocking on flushing a tablet that will be destroyed shortly after.

---------

Signed-off-by: tabokie <[email protected]>
…heckpoint (tikv#338)

* fix renaming encrypted directory

Signed-off-by: tabokie <[email protected]>

* fix build

Signed-off-by: tabokie <[email protected]>

* patch test manager

Signed-off-by: tabokie <[email protected]>

* fix build

Signed-off-by: tabokie <[email protected]>

* check compaction paused during checkpoint

Signed-off-by: tabokie <[email protected]>

* add comment

Signed-off-by: tabokie <[email protected]>

---------

Signed-off-by: tabokie <[email protected]>
And delay the buffer initialization of writable file to first actual write.

---------

Signed-off-by: tabokie <[email protected]>
@v01dstar force-pushed the 8.10-tikv branch 5 times, most recently from 2e011bf to 3ea895b on May 31, 2024 01:37
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: v01dstar <[email protected]>
return;
}

++total_requests_[pri];
Member

It looks like tikv ran into a segfault on this line. I'm still trying to understand why, but I guess it's related to the addition of new IO priorities.

Here's the evidence:

  1. Segfault information:
[Mon Jun 24 10:59:14 2024] apply-1[2562783]: segfault at 7f68952f40a0 ip 000055bd6eacc09a sp 00007f6c1d621f90 error 6 in tikv-server[55bd69e00000+6052000]
  2. The static address of the faulting code can be calculated as ip (000055bd6eacc09a) - base_addr_of_tikv (55bd69e00000) = 0x4ccc09a.
  3. With the help of gdb, we can locate the code at that address.
$ gdb tikv-server
(gdb) info line *0x4ccc09a
Line 172 of "/workspace/.cargo/git/checkouts/rust-rocksdb-9e01d192e8b6561d/af14652/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc"                                                                                               
   starts at address 0x4ccc093 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+147>                                                                                                                            
   and ends at 0x4ccc0a2 <rocksdb::WriteAmpBasedRateLimiter::Request(long, rocksdb::Env::IOPriority, rocksdb::Statistics*)+162>.
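The arithmetic in step 2 can be checked directly (constants copied from the log above):

```cpp
#include <cassert>
#include <cstdint>

// Subtract the module's load base from the faulting instruction pointer to
// get the static (on-file) address that gdb's `info line` can resolve.
constexpr uint64_t kFaultIp = 0x000055bd6eacc09aULL;   // `ip` from dmesg
constexpr uint64_t kLoadBase = 0x000055bd69e00000ULL;  // tikv-server base
constexpr uint64_t kStaticAddr = kFaultIp - kLoadBase;
static_assert(kStaticAddr == 0x4ccc09aULL, "matches the address given to gdb");
```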

@v01dstar (Author) Jun 24, 2024

RocksDB has introduced more priority types (User, Mid, etc.) since 8.x, while WriteAmpBasedRateLimiter only considered 3 of them. I thought it could still work, but maybe this causes some problems.

@hbisheng (Member) Jun 25, 2024

One interesting thing: the segfault seemed to happen only on Linux machines; I wasn't able to reproduce it on Mac. So it could be arch/compiler related.

Also, I managed to get a coredump of the segfault on Linux.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005595b5961efa in rocksdb::WriteAmpBasedRateLimiter::Request (this=0x7ffa1e214000, bytes=46541, pri=<optimized out>, stats=0x0)
    at /root/tabokie/packages/cargo/.cargo/git/checkouts/rust-rocksdb-da3ff04b5d606849/ca1f1dd/librocksdb_sys/rocksdb/utilities/rate_limiters/write_amp_based_rate_limiter.cc:172
172	  ++total_requests_[pri];

It shows that total_requests_ was initialized with a length of 4 as expected, but pri was optimized out.

(gdb) print total_requests_
$1 = {1407, 16, 274, 0}
(gdb) print pri
$2 = <optimized out>

Given the definition of IOPriority, the only way for it to cause a segfault is when pri equals IO_TOTAL, but I don't think that's how we expect pri to be used...

  enum IOPriority {
    IO_LOW = 0,
    IO_MID = 1,
    IO_HIGH = 2,
    IO_USER = 3,
    IO_TOTAL = 4
  };

Still investigating...
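The suspected failure mode can be sketched with a simplified stand-in for the rate limiter's counters (this is illustrative, not the actual RocksDB code): `total_requests_` has `IO_TOTAL` (4) slots, so indexing with any `pri >= IO_TOTAL` writes past the end of the array.

```cpp
#include <cassert>

// Simplified stand-in for the rate limiter's per-priority counters. Indexing
// with pri == IO_TOTAL (or any garbage value >= 4) is out-of-bounds and
// undefined behavior; a bounds check makes the hazard explicit.
enum IOPriority { IO_LOW = 0, IO_MID = 1, IO_HIGH = 2, IO_USER = 3, IO_TOTAL = 4 };

struct RateLimiterSketch {
  long total_requests_[IO_TOTAL] = {0, 0, 0, 0};

  // Guarded version of `++total_requests_[pri]`: returns false instead of
  // corrupting memory when pri is out of range.
  bool Request(int pri) {
    if (pri < 0 || pri >= IO_TOTAL) return false;
    ++total_requests_[pri];
    return true;
  }
};
```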

@hbisheng (Member) Jun 26, 2024

Found the problem! It turns out that one constructor for Writer did not initialize rate_limiter_priority. With this one-line fix, the segfault problem went away.

We might want to check how the bug was introduced and whether there could be other similar problems.
Update: The upstream 8.10.fb branch does not have this problem (db/write_thread.h), so it's likely an oversight when we cherry-picked the commits.
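This class of bug is easy to reproduce in miniature (names are illustrative, not the actual Writer code): one constructor overload omits a member, leaving it holding garbage, and a default member initializer fixes every overload at once.

```cpp
#include <cassert>

// Illustrative sketch: before the fix, the default constructor left the
// priority member uninitialized (garbage); with a default member initializer,
// every constructor overload starts from a valid value.
struct WriterSketch {
  int rate_limiter_priority = 2;  // the one-line fix: a valid default

  WriterSketch() {}  // previously this overload forgot to set the member
  explicit WriterSketch(int pri) : rate_limiter_priority(pri) {}
};
```

A default member initializer is safer than patching one init list, because any future constructor overload also inherits the valid default.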

Member

👍

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: v01dstar <[email protected]>
@ti-chi-bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) Sep 22, 2024
@v01dstar closed this Oct 21, 2024