add arg to enable blob chunking and allow custom chunk sizes #137

Open · wants to merge 1 commit into main
Conversation

brianwcook

Blob chunking is necessary for uploading very large artifacts. Quay.io, for example, can store very large artifacts but will not accept a single blob PUT over 20 GB. Chunking support existed in the code, but there was no way to activate it: the code comments indicated that it would be enabled automatically for blob sizes over 1024 bytes, but no code actually did that.

This change adds arguments to all the relevant functions, starting with OrasClient.push, to allow explicit enablement of blob chunking. It also increases the default chunk size significantly: the current 1024-byte chunk size only ensures that the data transfer is extremely slow (around 10 KB/s in my testing), while the new value results in transfer speeds closer to 10 MB/s. Finally, the chunk size is user-specifiable via an argument, since larger chunk sizes can result in faster uploads in some cases.

To test:

```python
import oras.client

client = oras.client.OrasClient()
client.push(
    files=["random.txt"],
    target="quay.io/bcook/tuneablechunks-b:1",
    do_chunked=True,
    chunk_size=10000000,
)
```

@vsoch (Contributor) left a comment

Thanks @dev-zero - the chunked upload is not currently exposed because it was flaky for different registries. If you want to move forward, you'll want to rebase against master (the extra provider.py/client.py files were refactored) and then I added some comments to address. Thanks!

@@ -6,3 +6,4 @@ oras.egg-info/
env
__pycache__
.python-version
/venv
Contributor: I don't think you need the leading slash for venv in the local directory.

@@ -250,8 +253,13 @@ def upload_blob(
blob, container, layer, refresh_headers=refresh_headers
)
else:
print("chunked upload ON.")
Contributor: If you want to add logging, please use the oras.logger.
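The print call flagged above could be swapped for a logger call. A minimal sketch using the stdlib logging module — assuming oras.logger exposes a standard logging.Logger (the exact attribute name in oras-py may differ):

```python
import logging

# Stand-in for the project's logger; the review asks for oras.logger,
# which is assumed here to expose a standard logging.Logger.
logger = logging.getLogger("oras")
logging.basicConfig(level=logging.DEBUG)


def upload_blob(do_chunked: bool = False):
    if do_chunked:
        # Instead of: print("chunked upload ON.")
        logger.debug("chunked upload ON")
```

This keeps the diagnostic but lets callers control verbosity through the usual logging configuration instead of unconditional stdout writes.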

@@ -219,20 +219,23 @@ def upload_blob(
container: container_type,
layer: dict,
do_chunked: bool = False,
chunk_size=None,
Contributor: This should default to the default chunk size instead of None. If it's a shared value, put it in defaults.py and reference it from both / several function inputs.
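A minimal sketch of that suggestion — one shared constant referenced as the parameter default. The constant name and the simplified function body are illustrative, not the project's actual code:

```python
# In defaults.py (illustrative), the shared value lives in one place:
DEFAULT_CHUNKSIZE = 16777216  # 16 MiB

# Functions then reference it directly in their signatures instead of
# accepting None and resolving a fallback inside the body:
def upload_blob(
    blob: str,
    do_chunked: bool = False,
    chunk_size: int = DEFAULT_CHUNKSIZE,
) -> int:
    # The real function performs the POST/PATCH/PUT sequence; here we
    # just return the effective chunk size to show the default wiring.
    return chunk_size
```

With this shape there is no `if chunk_size is None:` branch to maintain, and every function that takes a chunk size stays in sync through the single constant.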

"""

default_chunk_size = 16777216
Contributor: Ditto here - this should be the default size, from defaults.py, in the function.

@@ -250,8 +253,13 @@ def upload_blob(
blob, container, layer, refresh_headers=refresh_headers
Contributor: Please see the comment above:

 # Chunked for large, otherwise POST and PUT
# This is currently disabled unless the user asks for it, as
# it doesn't seem to work for all registries

Chunked was not working well for different registries so I am hesitant to add it back.

@@ -585,7 +601,9 @@ def chunked_upload(
# Read the blob in chunks, for each do a patch
start = 0
with open(blob, "rb") as fd:
for chunk in oras.utils.read_in_chunks(fd):
for chunk in oras.utils.read_in_chunks(fd, chunk_size=chunk_size):
print("uploading chunk starting at " + str(start))
Contributor: Please use the oras.logger here.
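The loop in the hunk above depends on a chunked reader. A minimal sketch of such a generator, assuming a signature like the `oras.utils.read_in_chunks(fd, chunk_size=...)` call shown in the diff (the real helper lives in oras.utils and may differ):

```python
import io


def read_in_chunks(file_object, chunk_size: int = 16777216):
    """Yield successive chunks from a file-like object opened in binary mode."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


# Each chunk would be sent as one PATCH, advancing the start offset;
# here we just record the byte ranges a Content-Range header would carry.
fd = io.BytesIO(b"a" * 25)
offsets = []
start = 0
for chunk in read_in_chunks(fd, chunk_size=10):
    offsets.append((start, start + len(chunk) - 1))
    start += len(chunk)
```

The final chunk is simply whatever remains, so the last range is shorter than `chunk_size` unless the blob size is an exact multiple.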


:param blob: path to blob to upload
:type blob: str
:param container: parsed container URI
:type container: oras.container.Container or str
:param layer: dict from oras.oci.NewLayer
:type layer: dict
:param do_chunked: if true use chunked upload. This allows upload of larger oci artifacts.
Contributor: If the caller wants to do a chunked upload, they can use the function directly.

@@ -689,6 +707,10 @@ def push(self, *args, **kwargs) -> requests.Response:
:type target: str
:param refresh_headers: if true or None, headers are refreshed
:type refresh_headers: bool
:param do_chunked: if true do chunked blob upload
Contributor: This would better be exposed through a separate push_chunked or similar function, if it's a good idea to add it back.

Author: I don't understand this comment. It would require duplicating ~150 lines of code which would be otherwise unchanged. Is that really what you want?

Author: Alternatively, if you like the separate-function approach better than using an argument, I can rename the existing 'push' to '_push' and create new, small 'push' and 'push_chunked'. Let me know how I should proceed.
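A sketch of that alternative, with hypothetical names and a stubbed body standing in for the ~150 shared lines — a private `_push` carries the common logic and two small public entry points select the mode:

```python
class OrasClient:
    def _push(self, files, target, do_chunked=False, chunk_size=16777216):
        # The ~150 shared lines would live here unchanged; return the
        # selected mode for illustration only.
        return ("chunked", chunk_size) if do_chunked else ("single-put", None)

    def push(self, files, target):
        """Plain single POST+PUT upload (the default, safe path)."""
        return self._push(files, target, do_chunked=False)

    def push_chunked(self, files, target, chunk_size=16777216):
        """Chunked (POST, PATCH+, PUT) upload for very large artifacts."""
        return self._push(files, target, do_chunked=True, chunk_size=chunk_size)
```

This keeps the public API split the reviewer asked for without duplicating the shared body.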

Contributor: The initial design I had was fairly simple:

oras-py/oras/provider.py

Lines 147 to 170 in 07ae3a1

```python
def _upload_blob(
    self, blob: str, container: Union[str, oras.container.Container], layer: dict
) -> requests.Response:
    """
    Prepare and upload a blob.

    Sizes > 1024 are uploaded via a chunked approach (post, patch+, put)
    and <= 1024 is a single post then put.

    :param blob: path to blob to upload
    :type blob: str
    :param container: parsed container URI
    :type container: oras.container.Container or str
    :param layer: dict from oras.oci.NewLayer
    :type layer: dict
    """
    blob = os.path.abspath(blob)
    container = self.get_container(container)
    size = layer["size"]

    # Chunked for large, otherwise POST and PUT
    if size < 1024:
        return self._put_upload(blob, container, layer)
    return self._chunked_upload(blob, container, layer)
```

The dispatch was based on size. I think we'd want to try a similar approach, with a boolean that defaults to False. For my comment above, I do think we want a clean separation of the two functionalities (akin to how it was done in the link above).
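A sketch of what that could look like after the change — the original size-based branch replaced by an explicit opt-in flag that defaults to False (names and the stubbed return values are illustrative):

```python
def upload_blob(blob: str, layer: dict, do_chunked: bool = False) -> str:
    # Chunked for large artifacts, otherwise a single POST then PUT.
    # Chunking stays disabled unless the caller asks for it, since it
    # does not work reliably across all registries.
    if do_chunked:
        return "chunked"     # would dispatch to the chunked_upload path
    return "single-put"      # would dispatch to the put_upload path
```

The default keeps today's behavior for every existing caller; only users who explicitly pass `do_chunked=True` hit the experimental path.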

@brianwcook (Author)

I understand about the flaky registry behavior. We have emerging use cases with file sizes that are too large to upload in a single PUT. I am working with Quay.io engineering to ensure that chunked upload works correctly (it currently does not), but one impediment is the lack of client support for chunked upload needed to reproduce errors. Do you have a convention for marking something experimental or unstable? I am not suggesting that chunked upload be automatically enabled for any push; the user will need to purposefully select that mode.
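Absent a project-specific convention, one common way to mark a path as experimental is an explicit runtime warning when it is selected. A sketch using the stdlib warnings module — this is not an existing oras-py mechanism, just one possible convention:

```python
import warnings


def push(files, target, do_chunked: bool = False):
    if do_chunked:
        # Warn once per call site that the opt-in path is experimental.
        warnings.warn(
            "chunked upload is experimental and may not work with all registries",
            UserWarning,
            stacklevel=2,
        )
    # ... rest of push would follow unchanged
```

Callers see the caveat the first time they opt in, and can silence it with the standard warnings filters if they accept the risk.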

@isinyaaa isinyaaa mentioned this pull request Aug 19, 2024