add arg to enable blob chunking and allow custom chunk sizes #137
base: main
Thanks @dev-zero - the chunked upload is not currently exposed because it was flaky for different registries. If you want to move forward, you'll want to rebase against master (the extra provider.py/client.py files were refactored) and then I added some comments to address. Thanks!
```diff
@@ -6,3 +6,4 @@ oras.egg-info/
 env
 __pycache__
 .python-version
+/venv
```
I don't think you need the leading slash for venv in the local directory.
```diff
@@ -250,8 +253,13 @@ def upload_blob(
                 blob, container, layer, refresh_headers=refresh_headers
             )
         else:
+            print("chunked upload ON.")
```
If you want to add logging, please use the oras.logger.
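As a sketch of the substitution (using the stdlib `logging` module as a stand-in for the package logger, so the helper name here is illustrative, not the oras API):

```python
import logging

# Stand-in for oras.logger: a package-level logger instead of bare print()
logger = logging.getLogger("oras")

def announce_chunked_upload():
    # Replaces print("chunked upload ON.") so verbosity is controlled
    # by the logging configuration rather than always written to stdout
    logger.debug("Chunked upload enabled")
```

This way the message is silenced by default and only appears when debug logging is turned on.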
```diff
@@ -219,20 +219,23 @@ def upload_blob(
         container: container_type,
         layer: dict,
         do_chunked: bool = False,
+        chunk_size=None,
```
This should default to the default chunk size instead of None. If it's a shared value, put it in defaults.py and reference it from both / several function inputs.
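One way to wire this up, as a sketch (the constant name and the defaults.py layout are assumptions, not the library's actual API):

```python
# In defaults.py (sketch): one shared constant for every call site
DEFAULT_CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB

# In provider.py (sketch): the parameter carries a real default, not None
def upload_blob(blob, container, layer, do_chunked=False, chunk_size=DEFAULT_CHUNK_SIZE):
    # ... upload logic would go here; returning chunk_size just to
    # illustrate that callers who pass nothing get the shared default
    return chunk_size
```

Callers can still override `chunk_size` explicitly, but there is a single place to change the default.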
""" | ||
|
||
default_chunk_size = 16777216 |
Ditto here - this should be the default size, from defaults.py, in the function.
```diff
@@ -250,8 +253,13 @@ def upload_blob(
                 blob, container, layer, refresh_headers=refresh_headers
```
Please see the comment above:

```python
# Chunked for large, otherwise POST and PUT
# This is currently disabled unless the user asks for it, as
# it doesn't seem to work for all registries
```

Chunked was not working well for different registries, so I am hesitant to add it back.
```diff
@@ -585,7 +601,9 @@ def chunked_upload(
         # Read the blob in chunks, for each do a patch
         start = 0
         with open(blob, "rb") as fd:
-            for chunk in oras.utils.read_in_chunks(fd):
+            for chunk in oras.utils.read_in_chunks(fd, chunk_size=chunk_size):
+                print("uploading chunk starting at " + str(start))
```
Please use the oras.logger here.
```diff
         :param blob: path to blob to upload
         :type blob: str
         :param container: parsed container URI
         :type container: oras.container.Container or str
         :param layer: dict from oras.oci.NewLayer
         :type layer: dict
+        :param do_chunked: if true use chunked upload. This allows upload of larger oci artifacts.
```
If the caller wants to do a chunked upload, they can use the function directly.
```diff
@@ -689,6 +707,10 @@ def push(self, *args, **kwargs) -> requests.Response:
         :type target: str
         :param refresh_headers: if true or None, headers are refreshed
         :type refresh_headers: bool
+        :param do_chunked: if true do chunked blob upload
```
This would be better exposed through a separate push_chunked or similar function, if it's a good idea to add it back.
I don't understand this comment. It would require duplicating ~150 lines of code that would be otherwise unchanged. Is that really what you want?
Alternatively, if you prefer the separate-function approach over an argument, I can rename the existing 'push' to '_push' and create new, small 'push' and 'push_chunked' wrappers. Let me know how I should proceed.
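A minimal sketch of that wrapper layout (names and the dict return value are illustrative only; the real `_push` would hold the existing ~150-line implementation):

```python
class OrasClient:
    def _push(self, *args, do_chunked=False, chunk_size=None, **kwargs):
        # The existing push implementation would move here unchanged;
        # returning the mode here only to make the sketch observable
        return {"do_chunked": do_chunked, "chunk_size": chunk_size}

    def push(self, *args, **kwargs):
        # Default path: a single POST + PUT per blob
        return self._push(*args, do_chunked=False, **kwargs)

    def push_chunked(self, *args, chunk_size=16 * 1024 * 1024, **kwargs):
        # Opt-in path for artifacts too large for a single PUT
        return self._push(*args, do_chunked=True, chunk_size=chunk_size, **kwargs)
```

Both public entry points stay thin, so no implementation code is duplicated.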
The initial design I had was fairly simple:
Lines 147 to 170 in 07ae3a1
```python
def _upload_blob(
    self, blob: str, container: Union[str, oras.container.Container], layer: dict
) -> requests.Response:
    """
    Prepare and upload a blob.

    Sizes > 1024 are uploaded via a chunked approach (post, patch+, put)
    and <= 1024 is a single post then put.

    :param blob: path to blob to upload
    :type blob: str
    :param container: parsed container URI
    :type container: oras.container.Container or str
    :param layer: dict from oras.oci.NewLayer
    :type layer: dict
    """
    blob = os.path.abspath(blob)
    container = self.get_container(container)
    size = layer["size"]

    # Chunked for large, otherwise POST and PUT
    if size < 1024:
        return self._put_upload(blob, container, layer)
    return self._chunked_upload(blob, container, layer)
```
I understand about the flaky registry behavior. We have emerging use cases with file sizes that are too large to upload in one PUT. I am working with Quay.io engineering to ensure that chunked upload works correctly (it currently does not), but one impediment is the lack of client support for chunked upload to reproduce errors. Do you have a convention for marking something experimental or unstable? I am not suggesting that chunked upload be automatically enabled for any push; the user will need to purposefully select that mode.
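For reproducing registry errors, the per-chunk request metadata is typically where registries disagree. A sketch of the header math, assuming the OCI distribution spec convention of an inclusive `<start>-<end>` Content-Range on each PATCH (the helper name is hypothetical):

```python
def chunk_headers(start: int, chunk: bytes) -> dict:
    # Per the OCI distribution spec, the Content-Range end offset is
    # inclusive, so a 4-byte chunk at offset 0 covers bytes 0-3
    end = start + len(chunk) - 1
    return {
        "Content-Range": f"{start}-{end}",
        "Content-Length": str(len(chunk)),
    }
```

Off-by-one disagreements in exactly this range arithmetic are a common source of the flaky behavior across registries.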
Blob chunking is necessary for uploading very large artifacts. Quay.io, for example, can store very large artifacts but will not accept a single blob PUT over 20 GB. Chunking support existed in the code but there was no way to activate it: the code comments indicated that it would be automatically enabled for blob sizes over 1024 bytes, but no code to do that actually existed.
This code adds args to all the relevant functions, starting with OrasClient.push, to allow explicit enablement of blob chunking. It also increases the default chunk size significantly: the 1024-byte chunk size currently present only ensures that the data transfer will be extremely slow (for me it tested at around 10 KB/s), while the new value results in transfer speeds of more like 10 MB/s. Finally, the chunk size is also user-specifiable via an argument, since larger chunk sizes can result in faster uploads in some cases.
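To illustrate why the chunk size dominates throughput, here is a minimal stand-in for a `read_in_chunks`-style generator (a sketch for this description, not the actual oras.utils implementation):

```python
import io

def read_in_chunks(fd, chunk_size=16 * 1024 * 1024):
    # Yield successive chunk_size-byte reads until EOF
    while True:
        data = fd.read(chunk_size)
        if not data:
            return
        yield data

# Each chunk costs one PATCH round trip: a 20 GB blob at 1024-byte chunks
# means roughly 20 million requests, versus ~1200 at 16 MiB chunks.
sizes = [len(c) for c in read_in_chunks(io.BytesIO(b"x" * 10), chunk_size=4)]
```

The per-request overhead, not raw bandwidth, is what made the old 1024-byte default so slow.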
To test:

```python
import oras.client

client = oras.client.OrasClient()
client.push(files=["random.txt"], target="quay.io/bcook/tuneablechunks-b:1", do_chunked=True, chunk_size=10000000)
```