Skip to content

Conversation

@imsayari404
Copy link
Contributor

@imsayari404 imsayari404 commented Aug 31, 2025

Description

AWS SDK v1 is in maintenance mode and will be deprecated. Migrating to AWS SDK v2 ensures long-term support, improved performance, async client support, and better compatibility with future AWS features.

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... 
* ... 

Hive Connector Changes
* ... 
* ... 

If release note is NOT required, use:

== NO RELEASE NOTE ==

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Aug 31, 2025
@imsayari404 imsayari404 force-pushed the prestos3filesystem branch 2 times, most recently from 5983801 to 4785c37 Compare September 10, 2025 09:51
Copy link
Member

@imjalpreet imjalpreet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imsayari404, thank you. I'll be reviewing the current changes this week, but can you please look into the test failures?

Also, have you been able to run any manual integration tests with AWS S3?

@imsayari404 imsayari404 force-pushed the prestos3filesystem branch 5 times, most recently from 674d3e3 to 8daef65 Compare October 9, 2025 09:07
@imsayari404 imsayari404 changed the title Migrate presto-hive PrestoS3FileSystem to AWS SDK v2 for better compatibility chore(Hive): Migrate presto-hive PrestoS3FileSystem to AWS SDK v2 for better compatibility Oct 9, 2025
@imsayari404 imsayari404 force-pushed the prestos3filesystem branch 3 times, most recently from 0985501 to 9767dbe Compare October 13, 2025 05:41
Copy link
Member

@imjalpreet imjalpreet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @imsayari404! I have done a partial review and will add more in my second round.

AWS_EXEC_READ(ObjectCannedACL.AWS_EXEC_READ),
BUCKET_OWNER_FULL_CONTROL(ObjectCannedACL.BUCKET_OWNER_FULL_CONTROL),
BUCKET_OWNER_READ(ObjectCannedACL.BUCKET_OWNER_READ),
// LOG_DELIVERY_WRITE is not available in AWS SDK v2, removing it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's document this change

Comment on lines 73 to 81
synchronized S3Client getS3Client(Configuration config, HiveClientConfig clientConfig)
{
if (s3Client != null) {
return s3Client;
}

s3Client = buildS3Client(config, clientConfig);
return s3Client;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As PrestoS3SelectClient uses selectObjectContent and in V2, 'selectObjectContent()' is available only on the S3AsyncClient, I don't think we need this anymore, as this class PrestoS3ClientFactory is only used by PrestoS3SelectClient.

return s3AsyncClient;
}

private S3Client buildS3Client(Configuration config, HiveClientConfig clientConfig)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Comment on lines 190 to 191
.retryPolicy(retryPolicyBuilder -> retryPolicyBuilder
.numRetries(maxErrorRetries))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the AWS SDK transition docs, we should use ClientOverrideConfiguration.builder().retryStrategy(b -> b.maxAttempts(...)) when upgrading from ClientConfiguration().withMaxErrorRetry(...).

Comment on lines 189 to 199
.overrideConfiguration(builder -> builder
.retryPolicy(retryPolicyBuilder -> retryPolicyBuilder
.numRetries(maxErrorRetries))
.apiCallTimeout(java.time.Duration.ofMillis(socketTimeout.toMillis()))
.apiCallAttemptTimeout(java.time.Duration.ofMillis(connectTimeout.toMillis()))
.putAdvancedOption(SdkAdvancedClientOption.USER_AGENT_PREFIX, userAgentPrefix)
.putAdvancedOption(SdkAdvancedClientOption.USER_AGENT_SUFFIX, s3UserAgentSuffix))
.httpClientBuilder(NettyNioAsyncHttpClient.builder()
.maxConcurrency(maxConnections)
.connectionTimeout(java.time.Duration.ofMillis(connectTimeout.toMillis()))
.readTimeout(java.time.Duration.ofMillis(socketTimeout.toMillis())));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create a ClientOverrideConfiguration.builder() as a separate variable for better readability.

filesToMount.put("hive_s3_insert_overwrite/hadoop-core-site.xml", hadoopCoreSitePath);
if (isSslEnabledTest) {
try {
// Copy dynamically generated keystore files into target/test-classes so that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert this?

Path trustStoreTarget = targetDir.resolve("truststore.jks");

synchronized (SSL_LOCK) {
// Copy freshly generated keystores, replacing if they exist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's revert


try {
this.minIOContainer.start();
log.info("MinIO started on port: " + minIOContainer.getMinioApiEndpoint().getPort());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add this as debug log

Comment on lines 152 to 158
S3Client testClient = S3Client.builder()
.endpointOverride(URI.create("http://localhost:" + minIOContainer.getMinioApiEndpoint().getPort()))
.region(Region.US_EAST_1)
.credentialsProvider(StaticCredentialsProvider.create(
AwsBasicCredentials.create(ACCESS_KEY, SECRET_KEY)))
.forcePathStyle(true)
.build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a utility method for this, as the client is being created in multiple methods


testClient.listBuckets();
testClient.close();
log.info("MinIO is ready");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep all these test logs as debug logs

Copy link
Member

@imjalpreet imjalpreet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another round of review

private static final MediaType OCTET_STREAM_MEDIA_TYPE = MediaType.create("application", "octet-stream");
private static final Set<String> GLACIER_STORAGE_CLASSES = ImmutableSet.of(Glacier.toString(), DeepArchive.toString());
private static final Set<String> GLACIER_STORAGE_CLASSES = ImmutableSet.of(
StorageClass.GLACIER.toString(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static import

private static final Set<String> GLACIER_STORAGE_CLASSES = ImmutableSet.of(Glacier.toString(), DeepArchive.toString());
private static final Set<String> GLACIER_STORAGE_CLASSES = ImmutableSet.of(
StorageClass.GLACIER.toString(),
StorageClass.DEEP_ARCHIVE.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: static import

StorageClass.GLACIER.toString(),
StorageClass.DEEP_ARCHIVE.toString());

// Configuration fields
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think this comment is required

this.s3IamRole = conf.get(S3_IAM_ROLE, defaults.getS3IamRole());
this.s3IamRoleSessionName = conf.get(S3_IAM_ROLE_SESSION_NAME, defaults.getS3IamRoleSessionName());

// Validation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not required

// Initialize clients
this.credentialsProvider = createAwsCredentialsProvider(uri, conf);
this.s3 = createAmazonS3Client(conf, configuration);
this.s3 = createS3Client(conf, connectTimeout, socketTimeout, maxConnections,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are already passing conf as a method argument, let's move the extraction of the other method arguments(connectTimeout and so on) to the createS3Client utility method itself. They are not being used anywhere else.

@Override
public int read()
{
// This stream is wrapped with BufferedInputStream, so this method should never be called
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: revert

Comment on lines 1238 to 1216
case 404: // NOT_FOUND
throw new FileNotFoundException("File does not exist: " + path);
case HTTP_FORBIDDEN:
case HTTP_BAD_REQUEST:
case 403: // FORBIDDEN
case 400: // BAD_REQUEST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we not using static constants anymore?

Comment on lines 1219 to 1224
if (in instanceof S3ObjectInputStream) {
((S3ObjectInputStream) in).abort();
}
else {
in.close();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not valid anymore? I think we can still use abort

Comment on lines 1361 to 1536
switch (storageClass) {
case STANDARD:
return StorageClass.STANDARD;
case REDUCED_REDUNDANCY:
return StorageClass.REDUCED_REDUNDANCY;
case GLACIER:
return StorageClass.GLACIER;
case DEEP_ARCHIVE:
return StorageClass.DEEP_ARCHIVE;
case INTELLIGENT_TIERING:
return StorageClass.INTELLIGENT_TIERING;
default:
return StorageClass.STANDARD;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on Presto's documentation we only support STANDARD and INTELLIGENT_TIERING

Comment on lines 1404 to 1412
/**
* Helper function used to work around the fact that if you use an S3 bucket with an '_' that java.net.URI
* behaves differently and sets the host value to null whereas S3 buckets without '_' have a properly
* set host field. '_' is only allowed in S3 bucket names in us-east-1.
*
* @param uri The URI from which to extract a host value.
* @return The host value where uri.getAuthority() is used when uri.getHost() returns null as long as no UserInfo is present.
* @throws IllegalArgumentException If the bucket can not be determined from the URI.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please revert

@imjalpreet imjalpreet changed the title chore(Hive): Migrate presto-hive PrestoS3FileSystem to AWS SDK v2 for better compatibility feat(connector): Migrate presto-hive PrestoS3FileSystem to AWS SDK v2 for better compatibility Oct 24, 2025
imsayari404 and others added 8 commits November 20, 2025 09:15
1. SdkClient Unable to marshall request to JSON: Key cannot be empty
2. IO unencrypted-content-length header is not set on an encrypted object
3. NoClassDefFound net/jpountz/lz4/LZ4Factory
4. org.apache.hadoop.fs.UnsupportedFileSystemException No FileSystem for scheme "s3"

Co-authored-by: Sayari Mukherjee <[email protected]>
Co-authored-by: Sayari Mukherjee <[email protected]>
@imsayari404 imsayari404 force-pushed the prestos3filesystem branch 2 times, most recently from 7ec29fa to 2ec201c Compare November 24, 2025 07:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants