Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added force virtual addressing configuration for S3, Alibaba OSS protocol to use PyArrowFileIO #1392

Merged
merged 7 commits into from
Dec 9, 2024

Conversation

helmiazizm
Copy link
Contributor

@helmiazizm helmiazizm commented Dec 2, 2024

Added the option to enforce virtual hosted style request for S3 PyArrow IO as it already has the support by PyArrow. Also added oss:// protocol into the list of S3 PyArrow scheme. This was addressed in #21 before.

@@ -373,6 +374,9 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSyste
if session_name := get_first_property_value(self.properties, S3_ROLE_SESSION_NAME, AWS_ROLE_SESSION_NAME):
client_kwargs["session_name"] = session_name

if force_virtual_addressing := self.properties.get(S3_FORCE_VIRTUAL_ADDRESSING):
client_kwargs["force_virtual_addressing"] = property_as_bool(self.properties, force_virtual_addressing, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -350,7 +351,7 @@ def parse_location(location: str) -> Tuple[str, str, str]:
return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}"

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
if scheme in {"s3", "s3a", "s3n"}:
if scheme in {"s3", "s3a", "s3n", "oss", "r2"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is oss? I've never heard of it. And does S3FileSystem support both oss and r2?

Copy link
Contributor Author

@helmiazizm helmiazizm Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oss is protocol for Alibaba Cloud Object Storage Service, and it's compatible with S3 API as long as the URL is in virtual address style.
Screenshot 2024-12-03 093309
From my quick test it looks like the class does support OSS, but not sure yet about R2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plugging out R2. It uses account ID to create the URL, which is quite different from how S3 natively set the virtual hosted style.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for testing it out! do you mind editing the PR description and mention this PR only supports oss

pyiceberg/io/__init__.py Show resolved Hide resolved
@@ -115,6 +115,7 @@ For the FileIO there are several configuration options available:
| s3.region | us-west-2 | Sets the region of the bucket |
| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
| s3.force-virtual-addressing | False | Configure the style of requests. Set `False` to use path-style request and `True` for virtual-hosted-style request. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps just copy/paste S3FileSystem docs
https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html

force_virtual_addressing bool, default False 
Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access.

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Dec 2, 2024

reopened #21, and added a few comments on this PR. Thanks for the contribution!

@helmiazizm helmiazizm changed the title Added force virtual addressing configuration for S3 Added force virtual addressing configuration for S3, oss and r2 protocol to use PyArrowFileIO Dec 3, 2024
@helmiazizm helmiazizm changed the title Added force virtual addressing configuration for S3, oss and r2 protocol to use PyArrowFileIO Added force virtual addressing configuration for S3, Alibaba OSS protocol to use PyArrowFileIO Dec 3, 2024
@@ -350,7 +351,7 @@ def parse_location(location: str) -> Tuple[str, str, str]:
return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}"

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
if scheme in {"s3", "s3a", "s3n"}:
if scheme in {"s3", "s3a", "s3n", "oss"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets also update the docs to mention oss
https://py.iceberg.apache.org/configuration/#fileio

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also add oss to fsspec?

SCHEME_TO_FS = {
"": _file,
"file": _file,
"s3": _s3,
"s3a": _s3,
"s3n": _s3,
"abfs": _adls,
"abfss": _adls,
"gs": _gs,
"gcs": _gs,
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@helmiazizm helmiazizm Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like theres https://github.com/fsspec/ossfs

The project hasn't been maintained for almost a year now, and when I tried it on DuckDB it returns an unknown dependency error (?). I'd say it's probably safer to use s3fs than ossfs (not that I've tried it, though). I'll try making a new pull request for that later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, thanks for looking into it! lets keep the scope just pyarrow for now

@kevinjqliu
Copy link
Contributor

looks like theres a linter issue, you can run make lint locally to resolve it

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! thanks for the contribution!

@helmiazizm
Copy link
Contributor Author

I've tried to run make lint on python3.9 and it succeed, yet the test still failed. Is there any other thing I need to change?

@kevinjqliu
Copy link
Contributor

@helmiazizm that's a intermittent issue with the CI, I've retrigger the run

@kevinjqliu kevinjqliu requested a review from Fokko December 5, 2024 18:29
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for updating the docs as well @helmiazizm 🙌

@Fokko Fokko merged commit d82f8f7 into apache:main Dec 9, 2024
8 checks passed
@helmiazizm helmiazizm deleted the add-s3-virtual-addressing branch December 9, 2024 14:58
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 24, 2024
* Added force virtual addressing configuration for S3. Also added oss and r2 protocol.

* Rewrote force virtual addressing as written in PyArrow documentation

* Added the missing r2 key value in schema_to_file_io

* Removed R2 protocol for now

* Linter fix

* Updated documentation for OSS support

* Another linter fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants