Unnecessarily slow writes to uncompressed arrays #2518

Open
jbuckman opened this issue Nov 26, 2024 · 2 comments
Labels
performance: Potential issues with Zarr performance (I/O, memory, etc.); V2: Affects the v2 branch

Comments

jbuckman commented Nov 26, 2024

When writing some contiguous elements to an uncompressed array (or chunk of an array), it is possible to use a simple seek-and-write directly on disk. This operation should be nearly free, and nearly independent of the chunk size. However, the cost seems to scale with the chunk size, just as for compressed arrays; I assume this means that under the hood, zarr is reading the entire chunk into memory and writing it all back to disk again.
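
For comparison, here is a minimal sketch (with a hypothetical raw chunk file) of what a direct seek-and-write on an uncompressed chunk could look like; only the bytes being updated are touched, regardless of how large the chunk is:

import numpy as np

# Hedged sketch: overwrite a few float64 elements in a raw, uncompressed chunk
# file by seeking to the right byte offset. The chunk file path is hypothetical.
def seek_and_write(chunk_path, start, values, dtype=np.dtype("float64")):
    with open(chunk_path, "r+b") as f:
        f.seek(start * dtype.itemsize)                      # jump to the element offset
        f.write(np.asarray(values, dtype=dtype).tobytes())  # overwrite only these bytes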

Here is a simple script to showcase the effect:

import zarr
import numpy as np
import time
import os
import shutil

# Sizes to test
array_sizes = [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]

# Data to write: an array of 10 random floats
data_to_write = np.random.rand(10)

# Results storage
results = []

for size in array_sizes:
    print(f"Testing array size: {size}")
    # Define paths for Zarr arrays
    uncompressed_path = f'zarr_uncompressed_{size}.zarr'
    compressed_path = f'zarr_compressed_{size}.zarr'
    # Remove existing arrays if they exist
    for path in [uncompressed_path, compressed_path]:
        if os.path.exists(path):
            shutil.rmtree(path)
    # Create uncompressed Zarr array
    uncompressed_array = zarr.open(
        uncompressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk covering the entire array
        dtype='float64',
        compressor=None  # No compression
    )
    # Create compressed Zarr array (uses the default compressor)
    compressed_array = zarr.open(
        compressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk
        dtype='float64'  # Using default compression
    )
    # Initialize arrays with zeros
    uncompressed_array[:] = 0.0
    compressed_array[:] = 0.0
    # Define the index range for writing
    start_index = size // 2  # Middle of the array
    end_index = start_index + 10
    # Measure write speed for uncompressed array
    start_time = time.time()
    uncompressed_array[start_index:end_index] = data_to_write
    uncompressed_write_time = time.time() - start_time
    # Measure write speed for compressed array
    start_time = time.time()
    compressed_array[start_index:end_index] = data_to_write
    compressed_write_time = time.time() - start_time
    # Store the results
    results.append({
        'array_size': size,
        'uncompressed_write_time': uncompressed_write_time,
        'compressed_write_time': compressed_write_time
    })
    print(f"Uncompressed write time: {uncompressed_write_time:.6f} seconds")
    print(f"Compressed write time:   {compressed_write_time:.6f} seconds\n")
Output:

Testing array size: 1000
Uncompressed write time: 0.000219 seconds
Compressed write time:   0.000234 seconds

Testing array size: 10000
Uncompressed write time: 0.000454 seconds
Compressed write time:   0.000268 seconds

Testing array size: 100000
Uncompressed write time: 0.001391 seconds
Compressed write time:   0.000645 seconds

Testing array size: 1000000
Uncompressed write time: 0.015249 seconds
Compressed write time:   0.001800 seconds

Testing array size: 10000000
Uncompressed write time: 0.196500 seconds
Compressed write time:   0.029215 seconds

Testing array size: 100000000
Uncompressed write time: 1.841548 seconds
Compressed write time:   0.242806 seconds
d-v-b (Contributor) commented Nov 26, 2024

> I assume this means that under the hood, zarr is reading the entire chunk into memory and writing it all back to disk again.

Yes, this is what zarr is doing, and as you discovered, it's leaving a lot of performance on the table. I think you ran these tests with zarr-python version 2, is that correct? Since zarr-python 3 is about to be released, I don't think we will be applying performance optimizations to zarr-python 2, but this kind of improvement would absolutely be in scope for the v3 codebase.
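
In other words, a partial write to an uncompressed chunk currently amounts to a whole-chunk round trip, roughly like this illustrative sketch (not zarr's actual code; the chunk file path is hypothetical):

import numpy as np

# Illustrative sketch only: read the whole chunk, modify a few elements, and
# write the whole chunk back, which is why the cost scales with chunk size.
def roundtrip_partial_write(chunk_path, start, values, dtype=np.dtype("float64")):
    chunk = np.fromfile(chunk_path, dtype=dtype)   # read the entire chunk from disk
    chunk[start:start + len(values)] = values      # modify only the requested elements
    chunk.tofile(chunk_path)                       # write the entire chunk back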

Would it be possible for you to try this with zarr-python 3 (you can get the beta release with pip install zarr==3.0.0b2)? zarr-python 3 adds an additional degree of freedom in the form of the sharding codec, which can pack multiple independently addressable chunks inside a single file (a "shard"). This storage layout is designed for parallel reads, but if the chunks are not compressed, I think it could support parallel writes as well, which would be of great interest for many applications. Nobody is actively working on this now, but I think it would be very exciting.
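
As a rough illustration, and purely as an assumption about the 3.0.0b2 API (the exact function and keyword names may differ), an array whose shards hold uncompressed inner chunks might be configured along these lines:

import zarr
from zarr.codecs import ShardingCodec, BytesCodec

# Hedged sketch, not a verified recipe: one shard per outer chunk, with smaller
# uncompressed inner chunks inside each shard. Keyword names are assumptions
# based on the zarr-python 3 beta and may need adjusting.
arr = zarr.create(
    store="sharded_uncompressed.zarr",
    shape=(10_000_000,),
    chunks=(1_000_000,),          # outer chunk = one shard file on disk
    dtype="float64",
    zarr_format=3,
    codecs=[
        ShardingCodec(
            chunk_shape=(10_000,),    # inner (sub-)chunk size within each shard
            codecs=[BytesCodec()],    # raw bytes only, no compression codec
        )
    ],
)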

jbuckman (Author) commented:

The issue is still around in 3.0.0b2:

zarr.__version__ = '3.0.0b2'
Testing array size: 1000
Uncompressed write time: 0.001760 seconds
Compressed write time:   0.001034 seconds

Testing array size: 10000
Uncompressed write time: 0.001367 seconds
Compressed write time:   0.001034 seconds

Testing array size: 100000
Uncompressed write time: 0.002127 seconds
Compressed write time:   0.002159 seconds

Testing array size: 1000000
Uncompressed write time: 0.010875 seconds
Compressed write time:   0.008503 seconds

Testing array size: 10000000
Uncompressed write time: 0.060695 seconds
Compressed write time:   0.080951 seconds

Testing array size: 100000000
Uncompressed write time: 1.151494 seconds
Compressed write time:   1.073868 seconds

Is there some re-configuring I need to do of the zarr arrays to enable the sharding behavior?

dstansby added the V2 and performance labels on Dec 6, 2024