Unnecessarily slow writes to uncompressed arrays #2518

Open
jbuckman opened this issue Nov 26, 2024 · 2 comments
Labels
performance: Potential issues with Zarr performance (I/O, memory, etc.); V2: Affects the v2 branch

Comments

jbuckman commented Nov 26, 2024

When writing some contiguous elements to an uncompressed array (or chunk of an array), it is possible to use a simple seek-and-write directly on disk. This operation should be nearly free, and nearly independent of the chunk size. However, the cost seems to scale with the chunk size, just as for compressed arrays; I assume this means that under the hood, zarr is reading the entire chunk into memory and writing it all back to disk again.
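
For comparison, here is a minimal sketch (with a hypothetical raw chunk file) of what a direct seek-and-write on an uncompressed chunk could look like; only the bytes being updated are touched, regardless of how large the chunk is:

import numpy as np

# Hedged sketch: overwrite a few float64 elements in a raw, uncompressed chunk
# file by seeking to the right byte offset. The chunk file path is hypothetical.
def seek_and_write(chunk_path, start, values, dtype=np.dtype("float64")):
    with open(chunk_path, "r+b") as f:
        f.seek(start * dtype.itemsize)                      # jump to the element offset
        f.write(np.asarray(values, dtype=dtype).tobytes())  # overwrite only these bytes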

Here is a simple script to showcase the effect:

import zarr
import numpy as np
import time
import os
import shutil

# Sizes to test
array_sizes = [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]

# Data to write: an array of 10 random floats
data_to_write = np.random.rand(10)

# Results storage
results = []

for size in array_sizes:
    print(f"Testing array size: {size}")
    # Define paths for Zarr arrays
    uncompressed_path = f'zarr_uncompressed_{size}.zarr'
    compressed_path = f'zarr_compressed_{size}.zarr'
    # Remove existing arrays if they exist
    for path in [uncompressed_path, compressed_path]:
        if os.path.exists(path):
            shutil.rmtree(path)
    # Create uncompressed Zarr array
    uncompressed_array = zarr.open(
        uncompressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk covering the entire array
        dtype='float64',
        compressor=None  # No compression
    )
    # Create compressed Zarr array (uses the default compressor)
    compressed_array = zarr.open(
        compressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk
        dtype='float64'  # Using default compression
    )
    # Initialize arrays with zeros
    uncompressed_array[:] = 0.0
    compressed_array[:] = 0.0
    # Define the index range for writing
    start_index = size // 2  # Middle of the array
    end_index = start_index + 10
    # Measure write speed for uncompressed array
    start_time = time.time()
    uncompressed_array[start_index:end_index] = data_to_write
    uncompressed_write_time = time.time() - start_time
    # Measure write speed for compressed array
    start_time = time.time()
    compressed_array[start_index:end_index] = data_to_write
    compressed_write_time = time.time() - start_time
    # Store the results
    results.append({
        'array_size': size,
        'uncompressed_write_time': uncompressed_write_time,
        'compressed_write_time': compressed_write_time
    })
    print(f"Uncompressed write time: {uncompressed_write_time:.6f} seconds")
    print(f"Compressed write time:   {compressed_write_time:.6f} seconds\n")
Output:

Testing array size: 1000
Uncompressed write time: 0.000219 seconds
Compressed write time:   0.000234 seconds

Testing array size: 10000
Uncompressed write time: 0.000454 seconds
Compressed write time:   0.000268 seconds

Testing array size: 100000
Uncompressed write time: 0.001391 seconds
Compressed write time:   0.000645 seconds

Testing array size: 1000000
Uncompressed write time: 0.015249 seconds
Compressed write time:   0.001800 seconds

Testing array size: 10000000
Uncompressed write time: 0.196500 seconds
Compressed write time:   0.029215 seconds

Testing array size: 100000000
Uncompressed write time: 1.841548 seconds
Compressed write time:   0.242806 seconds
d-v-b (Contributor) commented Nov 26, 2024

> I assume this means that under the hood, zarr is reading the entire chunk into memory and writing it all back to disk again.

Yes, this is what zarr is doing, and as you discovered, it's leaving a lot of performance on the table. I think you ran these tests with zarr-python version 2, is that correct? Since zarr-python 3 is about to be released, I don't think we will be applying performance optimizations to zarr-python 2, but this kind of improvement would absolutely be in scope for the v3 codebase.
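
In other words, a partial write to an uncompressed chunk currently amounts to a whole-chunk round trip, roughly like this illustrative sketch (not zarr's actual code; the chunk file path is hypothetical):

import numpy as np

# Illustrative sketch only: read the whole chunk, modify a few elements, and
# write the whole chunk back, which is why the cost scales with chunk size.
def roundtrip_partial_write(chunk_path, start, values, dtype=np.dtype("float64")):
    chunk = np.fromfile(chunk_path, dtype=dtype)   # read the entire chunk from disk
    chunk[start:start + len(values)] = values      # modify only the requested elements
    chunk.tofile(chunk_path)                       # write the entire chunk back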

Would it be possible for you to try this with zarr-python 3 (you can get the beta release with pip install zarr==3.0.0b2)? zarr-python 3 adds an additional degree of freedom in the form of the sharding codec, which can pack multiple independently addressable chunks inside a single file (a "shard"). This storage layout is designed for parallel reads, but if the chunks are not compressed, I think it could support parallel writes as well, which would be of great interest for many applications. Nobody is actively working on this now, but I think it would be very exciting.
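
As a rough illustration, and purely as an assumption about the 3.0.0b2 API (the exact function and keyword names may differ), an array whose shards hold uncompressed inner chunks might be configured along these lines:

import zarr
from zarr.codecs import ShardingCodec, BytesCodec

# Hedged sketch, not a verified recipe: one shard per outer chunk, with smaller
# uncompressed inner chunks inside each shard. Keyword names are assumptions
# based on the zarr-python 3 beta and may need adjusting.
arr = zarr.create(
    store="sharded_uncompressed.zarr",
    shape=(10_000_000,),
    chunks=(1_000_000,),          # outer chunk = one shard file on disk
    dtype="float64",
    zarr_format=3,
    codecs=[
        ShardingCodec(
            chunk_shape=(10_000,),    # inner (sub-)chunk size within each shard
            codecs=[BytesCodec()],    # raw bytes only, no compression codec
        )
    ],
)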

jbuckman (Author) commented:

The issue is still around in 3.0.0b2:

zarr.__version__ = '3.0.0b2'
Testing array size: 1000
Uncompressed write time: 0.001760 seconds
Compressed write time:   0.001034 seconds

Testing array size: 10000
Uncompressed write time: 0.001367 seconds
Compressed write time:   0.001034 seconds

Testing array size: 100000
Uncompressed write time: 0.002127 seconds
Compressed write time:   0.002159 seconds

Testing array size: 1000000
Uncompressed write time: 0.010875 seconds
Compressed write time:   0.008503 seconds

Testing array size: 10000000
Uncompressed write time: 0.060695 seconds
Compressed write time:   0.080951 seconds

Testing array size: 100000000
Uncompressed write time: 1.151494 seconds
Compressed write time:   1.073868 seconds

Is there some re-configuring I need to do of the zarr arrays to enable the sharding behavior?

dstansby added the V2 and performance labels on Dec 6, 2024