-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added force virtual addressing configuration for S3, Alibaba OSS protocol to use PyArrowFileIO #1392
Conversation
@@ -373,6 +374,9 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSyste | |||
if session_name := get_first_property_value(self.properties, S3_ROLE_SESSION_NAME, AWS_ROLE_SESSION_NAME): | |||
client_kwargs["session_name"] = session_name | |||
|
|||
if force_virtual_addressing := self.properties.get(S3_FORCE_VIRTUAL_ADDRESSING): | |||
client_kwargs["force_virtual_addressing"] = property_as_bool(self.properties, force_virtual_addressing, False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
force_virtual_addressing
docs
https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html
pyiceberg/io/pyarrow.py
Outdated
@@ -350,7 +351,7 @@ def parse_location(location: str) -> Tuple[str, str, str]: | |||
return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}" | |||
|
|||
def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem: | |||
if scheme in {"s3", "s3a", "s3n"}: | |||
if scheme in {"s3", "s3a", "s3n", "oss", "r2"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is oss
? I've never heard of it. And does S3FileSystem support both oss and r2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oss
is protocol for Alibaba Cloud Object Storage Service, and it's compatible with S3 API as long as the URL is in virtual address style.
From my quick test it looks like the class does support OSS, but not sure yet about R2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plugging out R2. It uses account ID to create the URL, which is quite different from how S3 natively set the virtual hosted style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for testing it out! do you mind editing the PR description and mention this PR only supports oss
mkdocs/docs/configuration.md
Outdated
@@ -115,6 +115,7 @@ For the FileIO there are several configuration options available: | |||
| s3.region | us-west-2 | Sets the region of the bucket | | |||
| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. | | |||
| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. | | |||
| s3.force-virtual-addressing | False | Configure the style of requests. Set `False` to use path-style request and `True` for virtual-hosted-style request. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps just copy/paste S3FileSystem docs
https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html
force_virtual_addressing bool, default False
Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access.
reopened #21, and added a few comments on this PR. Thanks for the contribution! |
@@ -350,7 +351,7 @@ def parse_location(location: str) -> Tuple[str, str, str]: | |||
return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}" | |||
|
|||
def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem: | |||
if scheme in {"s3", "s3a", "s3n"}: | |||
if scheme in {"s3", "s3a", "s3n", "oss"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets also update the docs to mention oss
https://py.iceberg.apache.org/configuration/#fileio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also add oss
to fsspec?
iceberg-python/pyiceberg/io/fsspec.py
Lines 245 to 255 in 68e17af
SCHEME_TO_FS = { | |
"": _file, | |
"file": _file, | |
"s3": _s3, | |
"s3a": _s3, | |
"s3n": _s3, | |
"abfs": _adls, | |
"abfss": _adls, | |
"gs": _gs, | |
"gcs": _gs, | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like theres https://github.com/fsspec/ossfs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like theres https://github.com/fsspec/ossfs
The project hasn't been maintained for almost a year now, and when I tried it on DuckDB it returns an unknown dependency error (?). I'd say it's probably safer to use s3fs than ossfs (not that I've tried it, though). I'll try making a new pull request for that later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, thanks for looking into it! lets keep the scope just pyarrow for now
looks like theres a linter issue, you can run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thanks for the contribution!
I've tried to run |
@helmiazizm that's a intermittent issue with the CI, I've retrigger the run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for updating the docs as well @helmiazizm 🙌
* Added force virtual addressing configuration for S3. Also added oss and r2 protocol. * Rewrote force virtual addressing as written in PyArrow documentation * Added the missing r2 key value in schema_to_file_io * Removed R2 protocol for now * Linter fix * Updated documentation for OSS support * Another linter fix
Added the option to enforce virtual hosted style request for S3 PyArrow IO as it already has the support by PyArrow. Also added
oss://
protocol into the list of S3 PyArrow scheme. This was addressed in #21 before.