Add PutObject Ring Buffer #19605
Merged
Replace the `io.Pipe` from streamingBitrotWriter -> CreateFile with a fixed-size ring buffer. This adds an output buffer for encoded shards to be written to disk - potentially via RPC. It removes blocking when `(*streamingBitrotWriter).Write` is called to write hashes and data.

With current settings the write looks like this:

```
                                                                                        Outbound
┌───────────────────┐           ┌────────────────┐               ┌───────────────┐                     ┌────────────────┐
│                   │   Parr.   │                │  (http body)  │               │                     │                │
│   Bitrot Hash     │   Write   │      Pipe      │     Read      │  HTTP buffer  │   Write (syscall)   │   TCP Buffer   │
│   Erasure Shard   │ ────────► │  (unbuffered)  │ ────────────► │   (64K Max)   │ ──────────────────► │     (4MB)      │
│                   │           │                │   (io.Copy)   │               │                     │                │
└───────────────────┘           └────────────────┘               └───────────────┘                     └────────────────┘
```

We write a hash (32 bytes). Since the pipe is unbuffered, the write blocks until the 32 bytes have been delivered to the TCP buffer and the next Read hits the pipe. Then we write the shard data. This will typically be bigger than 64KB, so it blocks until 2 blocks have been read from the pipe.

When we insert a ring buffer:

```
                                                                                        Outbound
┌───────────────────┐           ┌────────────────┐               ┌───────────────┐                     ┌────────────────┐
│                   │           │                │  (http body)  │               │                     │                │
│   Bitrot Hash     │   Write   │  Ring Buffer   │     Read      │  HTTP buffer  │   Write (syscall)   │   TCP Buffer   │
│   Erasure Shard   │ ────────► │     (2MB)      │ ────────────► │   (64K Max)   │ ──────────────────► │     (4MB)      │
│                   │           │                │   (io.Copy)   │               │                     │                │
└───────────────────┘           └────────────────┘               └───────────────┘                     └────────────────┘
```

The hash+shard will fit within the ring buffer, so writes will not block - they complete after a memcopy. Reads will be able to fill the 64KB buffer whenever there is data for it. If the network is congested, the ring buffer will fill up, and all syscalls will operate on full buffers. Only when the ring buffer is full will erasure coding start blocking.
Since there is always "space" to write output data, we remove the parallel writing: we are always writing to memory now, and the goroutine synchronization overhead probably isn't worth taking. If output was blocked in the existing code, we would still wait for it to unblock in parallelWriter, so it would make no difference there - except that the ring buffer now smooths out the load.

There are some micro-optimizations we could look at later. The biggest is probably that in most cases we could encode directly into the ring buffer - when we are not at a boundary. "Force filling" the Read requests (i.e. blocking until a full read can be completed) could also be investigated, and might allow concurrent memcopy on read and write. But if this isn't better by itself, I don't think it is worth it overall.

Pending: Actual testing.
klauspost force-pushed the putobject-ringbuffer branch from f2a0b26 to 0bea34c on April 24, 2024 16:46
(fixed...)

I am backlogged on testing this, will have to take internal help on this.

@harshavardhana No worries. It is not going anywhere.
Description
The 2MB is a bit big for my liking - but we don't have 128K-256K buffers, which is what I would probably go for otherwise.
For now also vendor https://github.com/smallnest/ringbuffer
How to test this PR?
Test PutObject/Multipart with bigger files.