
Conversation

jorgee
Contributor

@jorgee jorgee commented Aug 28, 2025

close #4995
In SDK v1, S3 transfers are managed by the transfer manager and a pool of threads that performs the transfers with a sync client. In SDK v2, the transfer manager uses an async client, and the pool of threads is only used for preliminary work, not for the transfers themselves. However, the steps that precede a publishDir transfer (checking whether the path is a directory, etc.) still use the sync client, producing the same timeout errors.

This PR limits the resources used by S3 transfers when virtual threads are enabled. A semaphore with as many permits as the client's max connections is added to Nextflow's S3 client, which prevents too many concurrent calls from reaching the S3 client.
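
A minimal sketch of the idea (class and method names are illustrative, not the actual code in this PR): the calling virtual thread simply blocks until one of the permits, sized to the client's max connections, becomes available.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/**
 * Illustrative only: caps how many virtual threads can call the
 * S3 client at once, one permit per allowed connection.
 */
class S3ConnectionLimiter {

    private final Semaphore permits;

    S3ConnectionLimiter(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    <T> T withPermit(Callable<T> call) throws Exception {
        permits.acquire();   // blocking a virtual thread here is cheap
        try {
            return call.call();
        }
        finally {
            permits.release();
        }
    }
}
```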

Tested with a pipeline of 15 tasks, each generating 1000 files of 25 MB. It produced timeout errors in both 25.04 (SDK v1) and master (SDK v2); they disappear with this PR.

TODO

  • Memory issues? -> Limit the maximum number of concurrent publishDir transfers, as in FilePorter
  • Tune the semaphore permits to a value larger than max connections?


netlify bot commented Aug 28, 2025

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit f099b8b
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/68b708b60949fb000889fc92
😎 Deploy Preview https://deploy-preview-6369--nextflow-docs-staging.netlify.app

@bentsherman bentsherman self-requested a review August 28, 2025 15:56
@bentsherman bentsherman changed the title Limiting S3 sync client connections when using virtual threads Limit S3 client connections when using virtual threads Aug 28, 2025
@bentsherman
Member

I have removed the config options we discussed and updated the docs to show the desired state.

  • The sync client should use the semaphore to limit virtual threads based on the maxConnections. All calls to the sync client are wrapped with runWithPermit to enforce this (see the sketch after this list).

  • The async client should limit itself internally based on either the maxConnections or the target throughput. Neither of these settings is specified by default in AwsS3Config, so the default behavior should be that the async client uses a target throughput of 10 Gbps.

  • The sync client uses a default maxConnections of 50 to ensure that it always has a connection limit
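
A rough sketch of how those points could fit together, assuming a simplified facade (not the actual Nextflow classes) and the default of 50 when aws.client.maxConnections is not set:

```java
import java.util.concurrent.Semaphore;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

class LimitedSyncS3 {

    // fallback when no max connections setting is provided, as described above
    private static final int DEFAULT_MAX_CONNECTIONS = 50;

    private final S3Client client;
    private final Semaphore permits;

    LimitedSyncS3(S3Client client, Integer maxConnections) {
        this.client = client;
        this.permits = new Semaphore(maxConnections != null ? maxConnections : DEFAULT_MAX_CONNECTIONS);
    }

    // every sync call is wrapped the same way, e.g. a metadata lookup:
    HeadObjectResponse headObject(String bucket, String key) throws InterruptedException {
        permits.acquire();
        try {
            return client.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build());
        }
        finally {
            permits.release();
        }
    }
}
```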

@bentsherman bentsherman added this to the 25.10 milestone Aug 28, 2025
@jorgee
Contributor Author

jorgee commented Sep 1, 2025

As mentioned last week, I got some OutOfMemory issues when using virtual threads with small heap sizes (<2 GB). So I inspected a heap dump from a run with 15K file uploads and, as you can see in the image below, the main memory consumption comes from two parts:
[image: heap dump showing the two main memory consumers]

  1. The XML deserializer used internally by the AWS sync client stores the XMLInputFactory in a thread local, so an instance of it is replicated in each thread (see the sketch after this list). These objects keep the XML responses of the requests, so their retained size is more than 500 MB in this case. For the scenario in the issue (~60K transfers), it could be 2 GB.

  2. The async request handling used in the transfer manager client also consumes a lot of heap space with a large number of transfers. In the case of the figure below, it is doing a bucket-to-bucket copy, but since the object size exceeds the multipart threshold, each copy is converted into several multipart copy requests. For 15K copies it creates more than 50K async requests; each request uses about 12 KB, so that is roughly 600 MB. In the scenario of the issue (60K), it could consume another 2 GB.

[image: heap dump from the bucket-to-bucket copy case]
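
A toy illustration (not the SDK's actual code) of why a thread-local cache is problematic with one virtual thread per task: the initializer runs once per thread, so 15K tasks on 15K virtual threads produce 15K cached factory instances.

```java
import javax.xml.stream.XMLInputFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalCacheDemo {

    static final AtomicInteger created = new AtomicInteger();

    // mimics a library caching an XML parser factory per thread
    static final ThreadLocal<XMLInputFactory> FACTORY =
            ThreadLocal.withInitial(() -> {
                created.incrementAndGet();
                return XMLInputFactory.newFactory();
            });

    public static void main(String[] args) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 15_000; i++) {
                executor.submit(() -> FACTORY.get());  // each task runs on its own virtual thread
            }
        }
        // with a fixed platform-thread pool the count stays at the pool size;
        // with virtual threads it grows to one instance per task
        System.out.println("factories created: " + created.get());
    }
}
```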

As these are internal to the AWS SDK implementation, I don't think we can fix them directly. So the easier option would be to limit the number of concurrent publish transfers, as is done in the FilePorter (a sketch follows below); we could hit the same issues with other providers as well. However, if we don't want to limit it, other options that could work are moving the S3 sync operations to the async client and adding a limit on the number of concurrent async operations submitted to it, to bound the number of AsyncHandlers.
@bentsherman, @pditommaso, what do you think?
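
To make the FilePorter-style option concrete, here is a sketch under stated assumptions (the maxPublishTransfers value and class names are hypothetical, not an existing Nextflow option): publish tasks still run on virtual threads, but only a bounded number of transfers are in flight at once, which also bounds the SDK objects they retain.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

/** Illustrative cap on concurrent publish transfers, similar in spirit to FilePorter. */
class PublishLimiter {

    private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
    private final Semaphore inFlight;

    PublishLimiter(int maxPublishTransfers) {   // hypothetical setting
        this.inFlight = new Semaphore(maxPublishTransfers);
    }

    void publish(Runnable transfer) {
        executor.submit(() -> {
            try {
                inFlight.acquire();             // wait for a free transfer slot
                try {
                    transfer.run();             // the actual S3 upload/copy
                }
                finally {
                    inFlight.release();
                }
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }
}
```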

@pditommaso
Member

not sure it's worth optimising for virtual threads until they are properly supported by the AWS SDK

@bentsherman
Member

@jorgee is it possible to use a sync client for S3 transfers? then we could apply the semaphore to it and ensure that there are no more than a few hundred concurrent requests by limiting aws.client.maxConnections.

Right now it seems that the async client is allocating thousands of request handlers even with a smaller max concurrency, is that right?

It seems like we should move towards sync clients anyway so that we can just control it with virtual threads + semaphore.

It would be nice if the AWS SDK could provide the target throughput functionality using virtual threads, but we also need to find some kind of solution now, since many customers have been asking for it and we have no control over the AWS timeline.

@jorgee
Contributor Author

jorgee commented Sep 2, 2025

The async client is mandatory for the S3 transfer manager in AWS SDK v2. We would have to manage the multipart logic in Nextflow again, and I think in SDK v1 we also handled some transfers with the S3 transfer manager, so we would also have to rewrite those to not use the transfer manager.

@bentsherman
Member

Well, I don't know if we should go that far. I doubt we'll be able to manage the transfers as well as the transfer manager.

I have an idea that I'd like to try. I'll push it if it works

@bentsherman
Member

I managed to wrap the S3 transfer manager in a synchronous interface. It just calls the async client and immediately waits for the result. I also wrapped these calls in the semaphore limiter so that the async client is controlled directly by aws.client.maxConnections.

This should ensure that the async client doesn't create too many requests when using virtual threads.

I also kept the target throughput setting so that users can control the maximum throughput, but it will still be limited by maxConnections.

If AWS ever improves the SDK to support virtual threads natively, we need only remove the wrapper class while the config options remain the same. This way we don't have to wait for AWS to give us a solution to customer issues.
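
Roughly, such a wrapper could look like the sketch below (illustrative, not the exact class in this PR); it holds a permit for the full duration of each transfer, so at most maxConnections transfers are in flight regardless of how many virtual threads call in:

```java
import java.nio.file.Path;
import java.util.concurrent.Semaphore;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.CompletedFileUpload;
import software.amazon.awssdk.transfer.s3.model.UploadFileRequest;

/** Illustrative synchronous facade over the async transfer manager. */
class SyncTransferFacade {

    private final S3TransferManager transferManager;
    private final Semaphore permits;

    SyncTransferFacade(S3TransferManager transferManager, int maxConnections) {
        this.transferManager = transferManager;
        this.permits = new Semaphore(maxConnections);
    }

    CompletedFileUpload upload(String bucket, String key, Path source) throws InterruptedException {
        permits.acquire();
        try {
            UploadFileRequest request = UploadFileRequest.builder()
                    .putObjectRequest(b -> b.bucket(bucket).key(key))
                    .source(source)
                    .build();
            // submit to the async client and block until the transfer completes
            return transferManager.uploadFile(request).completionFuture().join();
        }
        finally {
            permits.release();
        }
    }
}
```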

@bentsherman
Member

@jorgee can you test the PR with your benchmark and see if the OOM errors are resolved?

@jorgee
Contributor Author

jorgee commented Sep 2, 2025

Will do. I will also try to test with this AWS SDK v2 preview version: aws/aws-sdk-java-v2#6268
