Added force virtual addressing configuration for S3, Alibaba OSS protocol to use PyArrowFileIO #1392

Merged 7 commits on Dec 9, 2024
Changes from 1 commit
1 change: 1 addition & 0 deletions mkdocs/docs/configuration.md
@@ -115,6 +115,7 @@ For the FileIO there are several configuration options available:
| s3.region | us-west-2 | Sets the region of the bucket |
| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
| s3.force-virtual-addressing | False | Configure the style of requests. Set `False` to use path-style request and `True` for virtual-hosted-style request. |
Contributor

nit: perhaps just copy/paste S3FileSystem docs
https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html

force_virtual_addressing bool, default False 
Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access.
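
The rule quoted above can be restated as a small predicate. This is only a sketch of the documented behaviour, not PyArrow's implementation; the helper name `uses_virtual_addressing` and the example endpoint are hypothetical.

```python
from typing import Optional


def uses_virtual_addressing(force_virtual_addressing: bool, endpoint_override: Optional[str]) -> bool:
    """Restate the documented rule (hypothetical helper, not part of PyArrow)."""
    if force_virtual_addressing:
        # True: virtual addressing is always enabled.
        return True
    # False: virtual addressing is only enabled when endpoint_override is empty.
    return not endpoint_override


# No endpoint override (plain AWS S3): virtual addressing is used either way.
print(uses_virtual_addressing(False, None))
# A custom endpoint falls back to path-style unless the flag is set.
print(uses_virtual_addressing(False, "https://storage.example.com"))
print(uses_virtual_addressing(True, "https://storage.example.com"))
```

Read this way, the flag only changes behaviour when an endpoint override is configured, which is the non-AWS case the docs mention.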


<!-- markdown-link-check-enable-->

2 changes: 2 additions & 0 deletions pyiceberg/io/__init__.py
@@ -74,6 +74,7 @@
S3_SIGNER_ENDPOINT_DEFAULT = "v1/aws/s3/sign"
S3_ROLE_ARN = "s3.role-arn"
S3_ROLE_SESSION_NAME = "s3.role-session-name"
S3_FORCE_VIRTUAL_ADDRESSING = "s3.force-virtual-addressing"
HDFS_HOST = "hdfs.host"
HDFS_PORT = "hdfs.port"
HDFS_USER = "hdfs.user"
@@ -304,6 +305,7 @@ def delete(self, location: Union[str, InputFile, OutputFile]) -> None:
"s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
"s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
"s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
"oss": [ARROW_FILE_IO],
"gs": [ARROW_FILE_IO],
"file": [ARROW_FILE_IO, FSSPEC_FILE_IO],
"hdfs": [ARROW_FILE_IO],
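
The mapping above associates each URI scheme with the FileIO implementations that may serve it, and this change registers `oss` with the PyArrow implementation only. A minimal sketch of how such a lookup could work, assuming a trimmed copy of the mapping; `candidate_file_ios` is a hypothetical helper, not PyIceberg's own function:

```python
from typing import Dict, List
from urllib.parse import urlparse

ARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"
FSSPEC_FILE_IO = "pyiceberg.io.fsspec.FsspecFileIO"

# Trimmed copy of the mapping shown in the diff, for illustration only.
SCHEMA_TO_FILE_IO: Dict[str, List[str]] = {
    "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "oss": [ARROW_FILE_IO],
    "gs": [ARROW_FILE_IO],
}


def candidate_file_ios(location: str) -> List[str]:
    """Return the FileIO class paths registered for the location's scheme."""
    scheme = urlparse(location).scheme
    return SCHEMA_TO_FILE_IO.get(scheme, [])


print(candidate_file_ios("oss://my-bucket/warehouse/table/data.parquet"))
```

An unregistered scheme yields an empty list in this sketch; how the real code handles that case is not shown in the diff.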
6 changes: 5 additions & 1 deletion pyiceberg/io/pyarrow.py
@@ -108,6 +108,7 @@
S3_ROLE_SESSION_NAME,
S3_SECRET_ACCESS_KEY,
S3_SESSION_TOKEN,
S3_FORCE_VIRTUAL_ADDRESSING,
FileIO,
InputFile,
InputStream,
@@ -350,7 +351,7 @@ def parse_location(location: str) -> Tuple[str, str, str]:
return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}"

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
- if scheme in {"s3", "s3a", "s3n"}:
+ if scheme in {"s3", "s3a", "s3n", "oss", "r2"}:
Contributor

what is oss? I've never heard of it. And does S3FileSystem support both oss and r2?

Contributor Author

@helmiazizm, Dec 3, 2024

oss is the protocol for Alibaba Cloud Object Storage Service, and it's compatible with the S3 API as long as the URL is in virtual-hosted style.
Screenshot 2024-12-03 093309
From my quick test it looks like the class does support OSS, but I'm not sure yet about R2.

Contributor Author

Pulling R2 out. It uses the account ID to build the URL, which is quite different from how S3 natively sets up the virtual-hosted style.

Contributor

Thanks for testing it out! Do you mind editing the PR description to mention that this PR only supports oss?
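
For readers unfamiliar with the two addressing styles discussed in this thread, here is a rough sketch of how the request URLs differ. The helper names are hypothetical and the OSS endpoint is an assumed example value, so treat this as an illustration rather than what `S3FileSystem` does internally.

```python
def path_style_url(endpoint: str, bucket: str, key: str) -> str:
    """Bucket goes in the path: https://<endpoint>/<bucket>/<key>."""
    return f"https://{endpoint}/{bucket}/{key}"


def virtual_hosted_url(endpoint: str, bucket: str, key: str) -> str:
    """Bucket goes in the host name: https://<bucket>.<endpoint>/<key>."""
    return f"https://{bucket}.{endpoint}/{key}"


# Assumed example endpoint for an OSS region; substitute your own.
oss_endpoint = "oss-cn-hangzhou.aliyuncs.com"
print(virtual_hosted_url(oss_endpoint, "my-bucket", "warehouse/data.parquet"))
print(path_style_url(oss_endpoint, "my-bucket", "warehouse/data.parquet"))
```

Per the comment above, OSS accepts S3-compatible requests only in the first form, which is why the flag matters once a custom endpoint is configured.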

from pyarrow.fs import S3FileSystem

client_kwargs: Dict[str, Any] = {
@@ -373,6 +374,9 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSyste
if session_name := get_first_property_value(self.properties, S3_ROLE_SESSION_NAME, AWS_ROLE_SESSION_NAME):
client_kwargs["session_name"] = session_name

if self.properties.get(S3_FORCE_VIRTUAL_ADDRESSING):
client_kwargs["force_virtual_addressing"] = property_as_bool(self.properties, S3_FORCE_VIRTUAL_ADDRESSING, False)

return S3FileSystem(**client_kwargs)
elif scheme in ("hdfs", "viewfs"):
from pyarrow.fs import HadoopFileSystem
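
As a standalone illustration of the property handling in this hunk, the sketch below turns the string-valued `s3.force-virtual-addressing` property into a boolean keyword argument. `as_bool` and `build_client_kwargs` are hypothetical stand-ins written for this example; they are not PyIceberg's `property_as_bool` or its `_initialize_fs`, and the accepted truthy spellings are an assumption.

```python
from typing import Any, Dict

S3_FORCE_VIRTUAL_ADDRESSING = "s3.force-virtual-addressing"


def as_bool(properties: Dict[str, str], name: str, default: bool) -> bool:
    """Hypothetical stand-in: look a property up by key and parse it as a bool."""
    value = properties.get(name)
    if value is None:
        return default
    return str(value).strip().lower() in {"true", "1", "yes"}


def build_client_kwargs(properties: Dict[str, str]) -> Dict[str, Any]:
    """Collect only the keyword argument discussed here."""
    client_kwargs: Dict[str, Any] = {}
    if S3_FORCE_VIRTUAL_ADDRESSING in properties:
        # The lookup uses the property key, not the property's value.
        client_kwargs["force_virtual_addressing"] = as_bool(properties, S3_FORCE_VIRTUAL_ADDRESSING, False)
    return client_kwargs


print(build_client_kwargs({"s3.force-virtual-addressing": "True"}))
print(build_client_kwargs({}))
```

When the property is absent, the keyword is left out entirely so the filesystem's own default applies.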