add arg to enable blob chunking and allow custom chunk sizes #137

Open · wants to merge 1 commit into main
Conversation

brianwcook

Blob chunking is necessary for uploading very large artifacts. Quay.io, for example, can store very large artifacts but will not accept a single blob PUT over 20 GB. Chunking support existed in the code, but there was no way to activate it: the code comments indicated that it would be enabled automatically for blob sizes over 1024 bytes, but no code actually did that.

This change adds arguments to all the relevant functions, starting with OrasClient.push, to allow explicit enablement of blob chunking. It also increases the default chunk size significantly: the current 1024-byte chunk size only ensures that the data transfer is extremely slow (around 10 KB/s in my testing), while the new value results in transfer speeds closer to 10 MB/s. Finally, the chunk size is user-specifiable via an argument, since larger chunk sizes can result in faster uploads in some cases.

To test:

```python
import oras.client

client = oras.client.OrasClient()
client.push(
    files=["random.txt"],
    target="quay.io/bcook/tuneablechunks-b:1",
    do_chunked=True,
    chunk_size=10000000,
)
```

@vsoch (Contributor) left a comment

Thanks @dev-zero - the chunked upload is not currently exposed because it was flaky for different registries. If you want to move forward, you'll want to rebase against master (the extra provider.py/client.py files were refactored) and then I added some comments to address. Thanks!

@@ -6,3 +6,4 @@ oras.egg-info/
env
__pycache__
.python-version
/venv
Contributor: I don't think you need the leading slash for venv in the local directory.

@@ -250,8 +253,13 @@ def upload_blob(
blob, container, layer, refresh_headers=refresh_headers
)
else:
print("chunked upload ON.")
Contributor: If you want to add logging, please use the oras.logger.
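The print call flagged above could be swapped for a logger call. A minimal sketch using the stdlib logging module — assuming oras.logger exposes a standard logging.Logger (the exact attribute name in oras-py may differ):

```python
import logging

# Stand-in for the project's logger; the review asks for oras.logger,
# which is assumed here to expose a standard logging.Logger.
logger = logging.getLogger("oras")
logging.basicConfig(level=logging.DEBUG)


def upload_blob(do_chunked: bool = False):
    if do_chunked:
        # Instead of: print("chunked upload ON.")
        logger.debug("chunked upload ON")
```

This keeps the diagnostic but lets callers control verbosity through the usual logging configuration instead of unconditional stdout writes.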

@@ -219,20 +219,23 @@ def upload_blob(
container: container_type,
layer: dict,
do_chunked: bool = False,
chunk_size=None,
Contributor: This should default to the default chunk size instead of None. If it's a shared value, put it in defaults.py and reference it from both / several function inputs.
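A minimal sketch of that suggestion — one shared constant referenced as the parameter default. The constant name and the simplified function body are illustrative, not the project's actual code:

```python
# In defaults.py (illustrative), the shared value lives in one place:
DEFAULT_CHUNKSIZE = 16777216  # 16 MiB

# Functions then reference it directly in their signatures instead of
# accepting None and resolving a fallback inside the body:
def upload_blob(
    blob: str,
    do_chunked: bool = False,
    chunk_size: int = DEFAULT_CHUNKSIZE,
) -> int:
    # The real function performs the POST/PATCH/PUT sequence; here we
    # just return the effective chunk size to show the default wiring.
    return chunk_size
```

With this shape there is no `if chunk_size is None:` branch to maintain, and every function that takes a chunk size stays in sync through the single constant.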

"""

default_chunk_size = 16777216
Contributor: Ditto here - this should be the default size, from defaults.py, in the function.

@@ -250,8 +253,13 @@ def upload_blob(
blob, container, layer, refresh_headers=refresh_headers
Contributor: Please see the comment above:

 # Chunked for large, otherwise POST and PUT
# This is currently disabled unless the user asks for it, as
# it doesn't seem to work for all registries

Chunked was not working well for different registries so I am hesitant to add it back.

@@ -585,7 +601,9 @@ def chunked_upload(
# Read the blob in chunks, for each do a patch
start = 0
with open(blob, "rb") as fd:
for chunk in oras.utils.read_in_chunks(fd):
for chunk in oras.utils.read_in_chunks(fd, chunk_size=chunk_size):
print("uploading chunk starting at " + str(start))
Contributor: Please use the oras.logger here.
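The loop in the hunk above depends on a chunked reader. A minimal sketch of such a generator, assuming a signature like the `oras.utils.read_in_chunks(fd, chunk_size=...)` call shown in the diff (the real helper lives in oras.utils and may differ):

```python
import io


def read_in_chunks(file_object, chunk_size: int = 16777216):
    """Yield successive chunks from a file-like object opened in binary mode."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


# Each chunk would be sent as one PATCH, advancing the start offset;
# here we just record the byte ranges a Content-Range header would carry.
fd = io.BytesIO(b"a" * 25)
offsets = []
start = 0
for chunk in read_in_chunks(fd, chunk_size=10):
    offsets.append((start, start + len(chunk) - 1))
    start += len(chunk)
```

The final chunk is simply whatever remains, so the last range is shorter than `chunk_size` unless the blob size is an exact multiple.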


:param blob: path to blob to upload
:type blob: str
:param container: parsed container URI
:type container: oras.container.Container or str
:param layer: dict from oras.oci.NewLayer
:type layer: dict
:param do_chunked: if true use chunked upload. This allows upload of larger oci artifacts.
Contributor: If the caller wants to do a chunked upload, they can use the function directly.

@@ -689,6 +707,10 @@ def push(self, *args, **kwargs) -> requests.Response:
:type target: str
:param refresh_headers: if true or None, headers are refreshed
:type refresh_headers: bool
:param do_chunked: if true do chunked blob upload
Contributor: This would better be exposed through a separate push_chunked or similar function, if it's a good idea to add it back.

Author: I don't understand this comment. It would require duplicating ~150 lines of code which would be otherwise unchanged. Is that really what you want?

Author: Alternatively, if you like the separate-function approach better than using an argument, I can rename the existing 'push' to '_push' and create new, small 'push' and 'push_chunked'. Let me know how I should proceed.
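A sketch of that alternative, with hypothetical names and a stubbed body standing in for the ~150 shared lines — a private `_push` carries the common logic and two small public entry points select the mode:

```python
class OrasClient:
    def _push(self, files, target, do_chunked=False, chunk_size=16777216):
        # The ~150 shared lines would live here unchanged; return the
        # selected mode for illustration only.
        return ("chunked", chunk_size) if do_chunked else ("single-put", None)

    def push(self, files, target):
        """Plain single POST+PUT upload (the default, safe path)."""
        return self._push(files, target, do_chunked=False)

    def push_chunked(self, files, target, chunk_size=16777216):
        """Chunked (POST, PATCH+, PUT) upload for very large artifacts."""
        return self._push(files, target, do_chunked=True, chunk_size=chunk_size)
```

This keeps the public API split the reviewer asked for without duplicating the shared body.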

Contributor: The initial design I had was fairly simple:

oras-py/oras/provider.py

Lines 147 to 170 in 07ae3a1

```python
def _upload_blob(
    self, blob: str, container: Union[str, oras.container.Container], layer: dict
) -> requests.Response:
    """
    Prepare and upload a blob.

    Sizes > 1024 are uploaded via a chunked approach (post, patch+, put)
    and <= 1024 is a single post then put.

    :param blob: path to blob to upload
    :type blob: str
    :param container: parsed container URI
    :type container: oras.container.Container or str
    :param layer: dict from oras.oci.NewLayer
    :type layer: dict
    """
    blob = os.path.abspath(blob)
    container = self.get_container(container)
    size = layer["size"]

    # Chunked for large, otherwise POST and PUT
    if size < 1024:
        return self._put_upload(blob, container, layer)
    return self._chunked_upload(blob, container, layer)
```

The dispatch was based on size. I think we'd want to try a similar approach, with a boolean that defaults to False. For my comment above, I do think we want a clean separation of the two functionalities (akin to how it was done in the link above).
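A sketch of what that could look like after the change — the original size-based branch replaced by an explicit opt-in flag that defaults to False (names and the stubbed return values are illustrative):

```python
def upload_blob(blob: str, layer: dict, do_chunked: bool = False) -> str:
    # Chunked for large artifacts, otherwise a single POST then PUT.
    # Chunking stays disabled unless the caller asks for it, since it
    # does not work reliably across all registries.
    if do_chunked:
        return "chunked"     # would dispatch to the chunked_upload path
    return "single-put"      # would dispatch to the put_upload path
```

The default keeps today's behavior for every existing caller; only users who explicitly pass `do_chunked=True` hit the experimental path.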

@brianwcook (Author)

I understand about the flaky registry behavior. We have emerging use cases with file sizes that are too large to upload in a single PUT. I am working with Quay.io engineering to ensure that chunked upload works correctly (it currently does not), but one impediment is the lack of client support for chunked upload needed to reproduce errors. Do you have a convention for marking something experimental or unstable? I am not suggesting that chunked upload be automatically enabled for any push; the user will need to purposefully select that mode.
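Absent a project-specific convention, one common way to mark a path as experimental is an explicit runtime warning when it is selected. A sketch using the stdlib warnings module — this is not an existing oras-py mechanism, just one possible convention:

```python
import warnings


def push(files, target, do_chunked: bool = False):
    if do_chunked:
        # Warn once per call site that the opt-in path is experimental.
        warnings.warn(
            "chunked upload is experimental and may not work with all registries",
            UserWarning,
            stacklevel=2,
        )
    # ... rest of push would follow unchanged
```

Callers see the caveat the first time they opt in, and can silence it with the standard warnings filters if they accept the risk.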

@isinyaaa isinyaaa mentioned this pull request Aug 19, 2024