When writing some contiguous elements to an uncompressed array (or chunk of an array), it is possible to use a simple seek-and-write directly on disk. This operation should be nearly free, and nearly independent of the chunk size. However, the cost seems to scale with the chunk size, just as for compressed arrays; I assume this means that under the hood, zarr is reading the entire chunk into memory and writing it all back to disk again.
Here is a simple script to showcase the effect:
```python
import zarr
import numpy as np
import time
import os
import shutil

# Sizes to test
array_sizes = [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]

# Data to write: an array of 10 random floats
data_to_write = np.random.rand(10)

# Results storage
results = []

for size in array_sizes:
    print(f"Testing array size: {size}")

    # Define paths for Zarr arrays
    uncompressed_path = f'zarr_uncompressed_{size}.zarr'
    compressed_path = f'zarr_compressed_{size}.zarr'

    # Remove existing arrays if they exist
    for path in [uncompressed_path, compressed_path]:
        if os.path.exists(path):
            shutil.rmtree(path)

    # Create uncompressed Zarr array
    uncompressed_array = zarr.open(
        uncompressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),   # Single chunk covering the entire array
        dtype='float64',
        compressor=None   # No compression
    )

    # Create compressed Zarr array (zarr's default compressor)
    compressed_array = zarr.open(
        compressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),   # Single chunk
        dtype='float64'   # Using default compression
    )

    # Initialize arrays with zeros
    uncompressed_array[:] = 0.0
    compressed_array[:] = 0.0

    # Define the index range for writing
    start_index = size // 2          # Middle of the array
    end_index = start_index + 10

    # Measure write speed for uncompressed array
    start_time = time.time()
    uncompressed_array[start_index:end_index] = data_to_write
    uncompressed_write_time = time.time() - start_time

    # Measure write speed for compressed array
    start_time = time.time()
    compressed_array[start_index:end_index] = data_to_write
    compressed_write_time = time.time() - start_time

    # Store the results
    results.append({
        'array_size': size,
        'uncompressed_write_time': uncompressed_write_time,
        'compressed_write_time': compressed_write_time
    })

    print(f"Uncompressed write time: {uncompressed_write_time:.6f} seconds")
    print(f"Compressed write time: {compressed_write_time:.6f} seconds\n")
```
> I assume this means that under the hood, zarr is reading the entire chunk into memory and writing it all back to disk again.
Yes, this is what zarr is doing, and as you discovered, it leaves a lot of performance on the table. I think you ran these tests with zarr-python version 2, is that correct? With zarr-python 3 about to be released, I don't think we will be applying performance optimizations to zarr-python 2, but this kind of improvement would absolutely be in scope for the v3 codebase.
Would it be possible for you to try this with zarr-python 3? (You can get the beta release by installing `zarr==3.0.0b2`.) zarr-python 3 adds an additional degree of freedom in the form of the sharding codec, which can pack multiple independently addressable chunks inside a single file (a "shard"). This storage layout is designed for parallel reads, but if the chunks are not compressed, I think it could support parallel writes as well, which would be of great interest for many applications. Nobody is actively working on this now, but I think it would be very exciting.
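For reference, a rough sketch of what an uncompressed sharded array might look like with the zarr-python 3 API. The `zarr.create_array` call and its `shards=`/`compressors=` keywords follow the v3 documentation as I understand it and may differ between beta releases, so treat this as illustrative rather than definitive:

```python
import numpy as np
import zarr

# Sketch: an uncompressed sharded 1-D array. Each shard file packs many smaller,
# independently addressable inner chunks.
z = zarr.create_array(
    store='zarr_sharded.zarr',
    shape=(10**8,),
    shards=(10**7,),     # one shard file per 10**7 elements
    chunks=(10**4,),     # inner chunks addressable within each shard
    dtype='float64',
    compressors=None,    # leave the inner chunks uncompressed
)

# Today a small write like this still goes through the read-modify-write path;
# the idea discussed above is that, with uncompressed inner chunks, it could
# become a seek-and-write into the shard file.
z[5_000_000:5_000_010] = np.random.rand(10)
```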