[bug] read from multiple s3 regions #1279

Closed
@kevinjqliu

Description

Similar to #1041

Apache Iceberg version

None

Please describe the bug 🐞

Problem

I want to read files from multiple S3 regions. For example, my metadata files are in us-west-2 but my data files are in us-east-1. This is currently not possible.

Context

Reading a file in pyarrow requires a location and a file system implementation, fs. For example, location="s3://blah/foo.parquet" and fs=S3FileSystem. new_input builds both from the location string:

def new_input(self, location: str) -> PyArrowFile:
    """Get a PyArrowFile instance to read bytes from the file at the given location.

    Args:
        location (str): A URI or a path to a local file.

    Returns:
        PyArrowFile: A PyArrowFile instance for the given location.
    """
    scheme, netloc, path = self.parse_location(location)
    return PyArrowFile(
        fs=self.fs_by_scheme(scheme, netloc),
        location=location,
        path=path,
        buffer_size=int(self.properties.get(BUFFER_SIZE, ONE_MEGABYTE)),
    )

The fs is used to access the files in S3, and it is initialized with the given S3_REGION according to the S3 configuration:

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }

        if proxy_uri := self.properties.get(S3_PROXY_URI):
            client_kwargs["proxy_options"] = proxy_uri

        if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
            client_kwargs["connect_timeout"] = float(connect_timeout)

        return S3FileSystem(**client_kwargs)

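Because the region comes from a single property, every S3FileSystem is created for the same region. This means only one S3 region is allowed.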

Possible Solution

Create multiple instances of S3FileSystem, one for each region, and fetch the corresponding instance based on the location. pyarrow.fs.resolve_s3_region(bucket) can determine the correct region.
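
A rough sketch of that idea, not the actual PyIceberg change: keep one file system per region and pick it from the bucket in the location. The class name RegionAwareS3Cache and its parameters (default_region, resolve_region, make_fs) are hypothetical names made up for illustration; only pyarrow.fs.resolve_s3_region and pyarrow.fs.S3FileSystem are real pyarrow APIs. The fallback to a default region when resolution fails is also an assumption.

```python
from typing import Any, Callable, Dict, Optional
from urllib.parse import urlparse


def _resolve_with_pyarrow(bucket: str) -> str:
    # Real pyarrow helper; it performs a network request to find the bucket's region.
    from pyarrow.fs import resolve_s3_region

    return resolve_s3_region(bucket)


def _make_pyarrow_fs(region: str, **client_kwargs: Any) -> Any:
    from pyarrow.fs import S3FileSystem

    return S3FileSystem(region=region, **client_kwargs)


class RegionAwareS3Cache:
    """Hypothetical helper: one S3 file system per region, chosen by bucket."""

    def __init__(
        self,
        default_region: Optional[str] = None,
        resolve_region: Callable[[str], str] = _resolve_with_pyarrow,
        make_fs: Callable[..., Any] = _make_pyarrow_fs,
        **client_kwargs: Any,
    ) -> None:
        self._default_region = default_region
        self._resolve_region = resolve_region
        self._make_fs = make_fs
        self._client_kwargs = client_kwargs
        self._region_by_bucket: Dict[str, str] = {}
        self._fs_by_region: Dict[str, Any] = {}

    def _region_for(self, bucket: str) -> str:
        # Resolve each bucket once and remember the answer.
        if bucket not in self._region_by_bucket:
            try:
                region = self._resolve_region(bucket)
            except OSError:
                # Assumption: fall back to the configured region if lookup fails.
                if self._default_region is None:
                    raise
                region = self._default_region
            self._region_by_bucket[bucket] = region
        return self._region_by_bucket[bucket]

    def fs_for(self, location: str) -> Any:
        """Return the file system for the region that hosts this location's bucket."""
        bucket = urlparse(location).netloc
        region = self._region_for(bucket)
        if region not in self._fs_by_region:
            self._fs_by_region[region] = self._make_fs(region, **self._client_kwargs)
        return self._fs_by_region[region]
```

With the defaults, RegionAwareS3Cache().fs_for("s3://blah/foo.parquet") would resolve the bucket's region over the network and return an S3FileSystem for it. In PyArrowFileIO the same lookup could presumably sit behind fs_by_scheme(scheme, netloc), since netloc already carries the bucket name.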
