Add arguments to specify GET and PUT part size independently #949

Merged
merged 6 commits into from
Jul 30, 2024

@crrow crrow commented Jul 21, 2024

Description of change

  1. Introduce a new CLI flag --multipart-upload-threshold to specify the part size for multipart uploads during PUT operations.
  2. Ensure that the object client interprets multipart-upload-threshold as the part size for initiating multipart uploads in PUT operations.

Relevant issues: #762

Does this change impact existing behavior?

No. This flag is optional; when only the existing CLI arguments are given, multipart-upload-threshold is set to the same value as part-size.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

@crrow crrow force-pushed the feat/separate-part-size-config branch from a490fce to 349e742 Compare July 21, 2024 08:02

@dannycjones dannycjones left a comment

Thanks for working on this, @crrow!

The changes here don't match the direction we want to go. multipart_upload_threshold actually isn't used by Mountpoint right now, since we do not specify an object size in advance. Effectively, the threshold is never evaluated for Mountpoint's use case.

Allowing a value to be specified at the command line for uploads is what we want. Let me share what I think it should look like.

I think we should introduce two new arguments: --read-part-size <SIZE> and --write-part-size <SIZE>. When we create a meta request for GetObject or PutObject, we should then set that part size on the meta request. The meta request is a way for the AWS CRT to present some nice simple S3 request to us, while under the hood the CRT will change that into multiple GetObject requests in parallel or a multi-part upload when writing.
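A minimal sketch of how the two proposed arguments could resolve against the existing `--part-size` value. The `PartSizeArgs` and `ResolvedPartSizes` types, the fallback order, and the 8 MiB default are assumptions made for illustration; none of them are taken from the Mountpoint source.

```rust
/// Hypothetical holder for the size options discussed above (values in bytes).
#[derive(Debug, Default, Clone, Copy)]
pub struct PartSizeArgs {
    /// Existing shared option, applied to both directions when set.
    pub part_size: Option<u64>,
    /// Proposed option for GetObject requests.
    pub read_part_size: Option<u64>,
    /// Proposed option for PutObject requests.
    pub write_part_size: Option<u64>,
}

/// Assumed fallback when no option is given; not a value from this thread.
pub const ASSUMED_DEFAULT_PART_SIZE: u64 = 8 * 1024 * 1024;

/// Part sizes after applying the fallback rules.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct ResolvedPartSizes {
    pub read: u64,
    pub write: u64,
}

impl PartSizeArgs {
    /// A direction-specific value wins; otherwise the shared value is used,
    /// and the assumed default applies when nothing was specified.
    pub fn resolve(&self) -> ResolvedPartSizes {
        let shared = self.part_size.unwrap_or(ASSUMED_DEFAULT_PART_SIZE);
        ResolvedPartSizes {
            read: self.read_part_size.unwrap_or(shared),
            write: self.write_part_size.unwrap_or(shared),
        }
    }
}
```

With this shape, an invocation that only passes the existing shared option behaves as before, which is one way to keep the new arguments optional.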

I hadn't realised, but we already have part-size made available on the MetaRequestOptions struct (see below). That means we don't have to worry about writing Rust-to-C bindings.

/// Set the part size of this request
pub fn part_size(&mut self, part_size: u64) -> &mut Self {
    // SAFETY: we aren't moving out of the struct.
    let options = unsafe { Pin::get_unchecked_mut(Pin::as_mut(&mut self.0)) };
    options.inner.part_size = part_size;
    self
}
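As a rough illustration of how such a setter might be used per request, the sketch below picks a part size based on the operation and applies it to the request options. `SketchMetaRequestOptions`, `Operation`, and `options_for` are hypothetical stand-ins; the real `MetaRequestOptions` wraps pinned CRT structures and is not reproduced here.

```rust
/// Hypothetical stand-in for the request options type; it only models the
/// part-size field and the chaining setter quoted above.
#[derive(Debug, Default)]
pub struct SketchMetaRequestOptions {
    part_size: Option<u64>,
}

impl SketchMetaRequestOptions {
    pub fn new() -> Self {
        Self::default()
    }

    /// Set the part size of this request, returning `&mut Self` for chaining.
    pub fn part_size(&mut self, part_size: u64) -> &mut Self {
        self.part_size = Some(part_size);
        self
    }

    /// Read back the configured part size (illustration only).
    pub fn configured_part_size(&self) -> Option<u64> {
        self.part_size
    }
}

/// The two request kinds discussed in this thread.
#[derive(Debug, Clone, Copy)]
pub enum Operation {
    GetObject,
    PutObject,
}

/// Build options for one request, choosing the part size by direction.
pub fn options_for(op: Operation, read_part_size: u64, write_part_size: u64) -> SketchMetaRequestOptions {
    let size = match op {
        Operation::GetObject => read_part_size,
        Operation::PutObject => write_part_size,
    };
    let mut options = SketchMetaRequestOptions::new();
    options.part_size(size);
    options
}
```

The point of the sketch is only that the choice happens at request-creation time, so no new bindings are needed beyond the existing setter.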

Please do feel free to reach out on this, happy to discuss any questions.

crrow commented Jul 24, 2024

@dannycjones

Oh, I kinda understand: we need to set the part size for each put/get operation. So do we need to change the S3CrtClient so that it holds these arguments? Or change the object-client trait so that the caller can set the part-size proactively?

And, should we deprecate the original part size now that we're providing per-request control?

@dannycjones

Hey @crrow,

Oh, I kinda understand: we need to set the part size for each put/get operation. So do we need to change the S3CrtClient so that it holds these arguments? Or change the object-client trait so that the caller can set the part-size proactively?

I was chatting with @monthonk, and we think the former makes sense: we modify the client to have separate part sizes for GET and PUT, which can then be read when creating those requests. That's probably the best option to keep things simple.

should we deprecate the original part size now that we're providing per-request control?

One thing that would be nice to keep is the original part_size(usize) setter method alongside some new setters, to minimize breaking changes for now (since this is used by other projects like https://github.com/awslabs/s3-connector-for-pytorch).

Meanwhile, the inner fields on S3CrtClientInner can be replaced and I'd also recommend replacing the part_size() on ObjectClient with separate getters for GET and PUT methods respectively.
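One way the suggestion could look, as a sketch only: the original `part_size(usize)` setter stays and sets both directions, new setters override one direction each, and the client trait exposes separate getters. `SketchClientConfig`, `SketchObjectClient`, `SketchClient`, and the 8 MiB default are hypothetical and are not the real Mountpoint types or values.

```rust
/// Assumed default for illustration; not a value from this thread.
const ASSUMED_DEFAULT_PART_SIZE: usize = 8 * 1024 * 1024;

/// Hypothetical client configuration holding one part size per direction.
#[derive(Debug, Clone)]
pub struct SketchClientConfig {
    read_part_size: usize,
    write_part_size: usize,
}

impl SketchClientConfig {
    pub fn new() -> Self {
        Self {
            read_part_size: ASSUMED_DEFAULT_PART_SIZE,
            write_part_size: ASSUMED_DEFAULT_PART_SIZE,
        }
    }

    /// Original setter, kept so existing callers compile: sets both sizes.
    #[must_use]
    pub fn part_size(mut self, part_size: usize) -> Self {
        self.read_part_size = part_size;
        self.write_part_size = part_size;
        self
    }

    /// New setter: part size used for GET requests only.
    #[must_use]
    pub fn read_part_size(mut self, size: usize) -> Self {
        self.read_part_size = size;
        self
    }

    /// New setter: part size used for PUT requests only.
    #[must_use]
    pub fn write_part_size(mut self, size: usize) -> Self {
        self.write_part_size = size;
        self
    }
}

/// Hypothetical trait with separate getters replacing a single `part_size()`.
pub trait SketchObjectClient {
    fn read_part_size(&self) -> usize;
    fn write_part_size(&self) -> usize;
}

/// Hypothetical client that reads its part sizes from the configuration.
pub struct SketchClient {
    config: SketchClientConfig,
}

impl SketchClient {
    pub fn new(config: SketchClientConfig) -> Self {
        Self { config }
    }
}

impl SketchObjectClient for SketchClient {
    fn read_part_size(&self) -> usize {
        self.config.read_part_size
    }

    fn write_part_size(&self) -> usize {
        self.config.write_part_size
    }
}
```

In this sketch, calling `part_size(n)` and then `write_part_size(m)` leaves reads at `n` and writes at `m`, so the order of setter calls matters.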

@crrow crrow force-pushed the feat/separate-part-size-config branch from 349e742 to 38e143e Compare July 27, 2024 09:17
@crrow crrow force-pushed the feat/separate-part-size-config branch from 38e143e to 4366608 Compare July 27, 2024 09:20

@dannycjones dannycjones left a comment

Super close, just one comment.

Since #956 is now merged, can you make the change suggested and then merge from / rebase on main?

mountpoint-s3/src/cli.rs (outdated; resolved)
@crrow crrow force-pushed the feat/separate-part-size-config branch from 945a2c8 to 4a9f783 Compare July 29, 2024 16:35
@dannycjones dannycjones self-requested a review July 29, 2024 17:47

@dannycjones dannycjones left a comment

LGTM. Thanks for the contribution, @crrow!

I'll merge this to main now.

@dannycjones dannycjones added this pull request to the merge queue Jul 30, 2024
Merged via the queue into awslabs:main with commit 0fff132 Jul 30, 2024
23 checks passed