
[bug] read from multiple s3 regions #1279

Closed

kevinjqliu opened this issue Oct 31, 2024 · 9 comments

@kevinjqliu (Contributor) commented Oct 31, 2024

Similar to #1041

Apache Iceberg version

None

Please describe the bug 🐞

Problem

I want to read files from multiple S3 regions. For example, my metadata files are in us-west-2 but my data files are in us-east-1. This is not currently possible.

Context

Reading a file in pyarrow requires a location and a file system implementation, fs. For example, location="s3://blah/foo.parquet" and fs=S3FileSystem.

def new_input(self, location: str) -> PyArrowFile:
    """Get a PyArrowFile instance to read bytes from the file at the given location.

    Args:
        location (str): A URI or a path to a local file.

    Returns:
        PyArrowFile: A PyArrowFile instance for the given location.
    """
    scheme, netloc, path = self.parse_location(location)
    return PyArrowFile(
        fs=self.fs_by_scheme(scheme, netloc),
        location=location,
        path=path,
        buffer_size=int(self.properties.get(BUFFER_SIZE, ONE_MEGABYTE)),
    )

The fs is used to access the files in S3, and it is initialized with the region given by the S3_REGION property in the S3 configuration.

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }

        if proxy_uri := self.properties.get(S3_PROXY_URI):
            client_kwargs["proxy_options"] = proxy_uri

        if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
            client_kwargs["connect_timeout"] = float(connect_timeout)

        return S3FileSystem(**client_kwargs)

This means only one S3 region can be used per FileIO.

Possible Solution

Create multiple instances of S3FileSystem, one for each region, and fetch the corresponding instance based on the location. pyarrow.fs.resolve_s3_region(bucket) can determine the correct region.
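
A minimal sketch of the idea (the bucket names are hypothetical, and credential wiring is omitted):

from pyarrow.fs import S3FileSystem, resolve_s3_region

# Hypothetical bucket names; resolve_s3_region looks up each bucket's
# actual region with a small HTTP request.
buckets = ["metadata-bucket", "data-bucket"]
bucket_regions = {bucket: resolve_s3_region(bucket) for bucket in buckets}

# One region-specific filesystem per distinct region.
filesystems = {region: S3FileSystem(region=region) for region in set(bucket_regions.values())}

# At read time, pick the filesystem that matches the file's bucket.
fs = filesystems[bucket_regions["data-bucket"]]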

@kevinjqliu (Contributor Author)

There may be a similar issue for GCS/Azure, since we only cache one instance of each FileSystem.

@danhphan commented Nov 9, 2024

Hey @kevinjqliu, I would be happy to work on this task. Thanks!

@kevinjqliu (Contributor Author)

@danhphan assigned to you! LMK if you have any questions

@danhphan

Thanks @kevinjqliu, I'm reading the codebase.

Could you please give me an example of the expected unit tests for this feature? For instance, suppose we create the following s3_fileio with "s3.region": "us-east-1" in the session_properties, and then create an input_file on the warehouse S3 bucket, which is actually located in the "eu-central-1" region. What should the expected results be?

session_properties: Properties = {
    "s3.endpoint": "http://localhost:9000",
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
    "s3.region": "us-east-1",
    "s3.session-token": "s3.session-token",
    **UNIFIED_AWS_SESSION_PROPERTIES,
}

s3_fileio = PyArrowFileIO(properties=session_properties)
print(s3_fileio.properties['s3.region']) #--> us-east-1

filename = str(uuid.uuid4())
input_file = s3_fileio.new_input(location=f"s3://warehouse/{filename}")
print(pyarrow.fs.resolve_s3_region('warehouse')) #--> eu-central-1

output_file = s3_fileio.new_output(location=f"s3://foo/{filename}")
print(pyarrow.fs.resolve_s3_region('foo')) #--> us-east-1

I'm thinking that maybe in the def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem from your comment above, we could assign the value of "region" in client_kwargs based on the value of netloc (the S3 bucket), but I'm not sure if that is the right direction.

Like: "region": pyarrow.fs.resolve_s3_region(netloc),

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }
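
A rough sketch of how that substitution could look as a small helper (the helper name and the fallback-to-configured-region behavior are just my assumptions):

from typing import Optional

from pyarrow.fs import resolve_s3_region


def _resolve_region(netloc: Optional[str], configured_region: Optional[str]) -> Optional[str]:
    # Sketch only: prefer the bucket's actual region, and fall back to the
    # configured property when resolution fails (for example against a
    # custom endpoint such as MinIO).
    if netloc:
        try:
            return resolve_s3_region(netloc)
        except OSError:
            pass
    return configured_region

client_kwargs["region"] could then be set with _resolve_region(netloc, get_first_property_value(self.properties, S3_REGION, AWS_REGION)).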

Thank you.

@kevinjqliu (Contributor Author) commented Nov 10, 2024

what should be the expected results?

Given two files in different regions, I want to read them transparently, without knowing which region they belong to.
Currently, we create a single PyArrow FS for S3 that is region-specific. We could create many region-specific PyArrow FS instances, or create them on the fly.

pyarrow.fs.resolve_s3_region can help determine which region to use. However, we cannot currently override the endpoint for minio in tests. See apache/arrow#43713
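
A sketch of the "create them on the fly" option, caching one filesystem per bucket (the helper name is hypothetical, and credential/endpoint wiring is omitted):

from functools import lru_cache

from pyarrow.fs import S3FileSystem, resolve_s3_region


@lru_cache
def _fs_for_bucket(bucket: str) -> S3FileSystem:
    # One cached, region-pinned filesystem per bucket, created lazily the
    # first time a file in that bucket is accessed.
    return S3FileSystem(region=resolve_s3_region(bucket))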

"s3.region": "us-east-1",

Perhaps we also need to think about configuration. Setting the region property here assumes that the FileIO is specific to a single region.

@danhphan

Yes @kevinjqliu, it seems that I'm still not able to fully understand the requirements for this change.

I think I will need more time to read the code, and maybe try some simpler tasks first to learn more about the codebase.

Please un-assign me from the task for now, in case anyone else can pick it up. Thank you.

@danhphan danhphan removed their assignment Nov 20, 2024
@jiakai-li (Contributor)

Hey @kevinjqliu, can I work on this issue if it still needs to be worked on? Thanks!

@kevinjqliu (Contributor Author)

yes! assigned to you

@kevinjqliu (Contributor Author)

Closed by #1453
