Fix read from multiple s3 regions #1453
nit: since the `oss` scheme uses this path, does it also support `resolve_s3_region`?
Thank you @kevinjqliu. This is a really good catch. I didn't find much information regarding support of OSS regions by `pyarrow.fs.resolve_s3_region`, but I tried it on my end and it doesn't seem to work: it throws an error complaining that the bucket cannot be found.

This could be a problem, especially if the same bucket name is used by both Aliyun and AWS. In that case the user-provided bucket region for Aliyun could be wrongly overwritten (by the resolved AWS one).
I separated the `oss` path from `s3` for now, as I'm not sure if we want to tackle `oss` yet (and I feel we probably want to treat the two protocols differently?). I also broke the `_initialize_fs` code chunk into smaller blocks to make it a bit easier to modify in the future.
I don't think it's supported; the underlying call looks for the `x-amz-bucket-region` header, which I don't think Aliyun will set: https://github.com/apache/arrow/blob/48d5151b87f1b8f977344c7ac20cb0810e46f733/cpp/src/arrow/filesystem/s3fs.cc#L660

Since we're using both scheme and bucket to cache the FS, this should be fine, right? For the case of `oss://bucket` and `s3://bucket`.
Yeah, let's just deal with s3 for now. BTW, fsspec splits the constructor per fs; I think it looks pretty clean.
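To illustrate the caching point above, a minimal sketch of keying a filesystem cache by `(scheme, bucket)` so that `oss://bucket` and `s3://bucket` get distinct instances. The names `fs_for` and `_fs_cache` are hypothetical, not the PR's actual code, and the constructor is a placeholder:

```python
from urllib.parse import urlparse

# Cache keyed by (scheme, bucket) so identically named buckets on
# different services do not collide.
_fs_cache: dict = {}

def fs_for(location: str) -> object:
    parsed = urlparse(location)
    key = (parsed.scheme, parsed.netloc)
    if key not in _fs_cache:
        # Placeholder constructor; the real code would build e.g. a
        # pyarrow S3FileSystem configured for this scheme and bucket.
        _fs_cache[key] = object()
    return _fs_cache[key]
```

With this keying, two paths in the same bucket share one filesystem instance, while the same bucket name under a different scheme gets its own.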
Thank you for checking that, I should have looked at it :-)

Yes, there is no issue after the change now. What I was thinking of is the `oss://bucket` scenario (ignoring the caching behavior). If the bucket used by `oss` also exists in AWS, then the previous version (before your comment) would try to resolve the bucket and use the result to overwrite the default setting. That would be wrong, because an `oss` bucket region cannot be resolved using pyarrow.

I updated the test case to take this into account and also added an integration test for a multiple-filesystem read.
What about doing this lookup only when the region is not provided explicitly? I think this will do another call to S3.
Thank you Fokko. My understanding is that the problem occurs when the provided `region` doesn't match the data file's bucket region, which makes the file read fail in pyarrow. By overwriting the bucket region (falling back to the provided region), we make sure the real bucket region where a data file is stored takes precedence. (This function is cached when using `fs_by_scheme`, so it will only be called for new buckets that weren't resolved previously, to save calls to S3.)
I think there are these 3 cases we're worried about:

We have 3 options here:
1. Use `region` when provided, fall back to `resolve_s3_region`
2. Always use `resolve_s3_region`
3. Use `resolve_s3_region`, fall back to `region`

Option 1 is difficult since we don't know that the provided `region` is wrong until we try to use the FileIO.

The code above uses option 2, which will always make an extra call to S3 to get the correct bucket region. This extra call to S3 is cached though, so `resolve_s3_region` is only called once per bucket. This is similar to the `cache_regions` option for `s3fs.core.S3FileSystem`.

I like option 3: we can resolve the bucket region and fall back to the provided `region`. It might be confusing to the end user when a `region` is specified but the FileIO uses a different region, so let's add a warning for that.

Something like this:
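The code block that originally followed "Something like this" is not preserved in this transcript. As a rough sketch of option 3 (resolve first, fall back to the provided region, warn on mismatch): the function name and the injectable `resolver` parameter are hypothetical, added here so the fallback logic can be exercised without network access:

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

def resolve_bucket_region(
    bucket: str,
    provided_region: Optional[str] = None,
    resolver: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Option 3: prefer the resolved bucket region, fall back to the provided one."""
    if resolver is None:
        # Default to pyarrow's resolver; imported lazily so this sketch
        # does not require pyarrow at import time.
        from pyarrow.fs import resolve_s3_region
        resolver = resolve_s3_region
    try:
        resolved = resolver(bucket)
    except OSError:
        # Resolution failed (e.g. a non-AWS endpoint such as oss);
        # fall back to the user-provided region.
        return provided_region
    if provided_region and provided_region != resolved:
        logger.warning(
            "Provided region %s does not match resolved region %s for bucket %s; using %s",
            provided_region, resolved, bucket, resolved,
        )
    return resolved
```

The warning covers the confusing case where a `region` is configured but the FileIO ends up using a different one.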
Thanks for elaborating on this. I want to make sure that the user is aware of it, and I think we do that right with the warning.

For some additional context: for Java we don't have this issue, because when you try to query the wrong region, the AWS SDK returns an HTTP 301 redirect to the correct region. This introduces another round trip, but that's okay. The PyArrow implementation (which I believe uses the AWS C++ SDK underneath) just throws an error that it got a 301. We saw that in the past, for example here: #515 (comment).
nice! I think this solves #1041 as well.
Yep, it is :-)
Nit: I noticed that we first pass in the path here, and the IO as a second argument. For `_read_all_delete_files` it is the other way around. How about unifying this?