While working on a different project with @mfranzon and @jluethi, we noticed an unexpected increase in RAM usage during subsequent runs of cellpose segmentation with the nuclei model. I'll report here an example which is as self-contained as possible, but other pieces of information are scattered in our original issues.
The question is whether this behavior looks expected/normal, or whether we could try to mitigate it. We are also wondering whether it comes from cellpose or from torch.
Context
Our goal is to perform segmentation of 3D images with the cellpose pre-trained nuclei model. We need to segment a certain number of arrays (say 20 of them), and each array may have a shape like (30, 2160, 2560) and dtype uint16. The different arrays (i.e. the different cellpose calls) are processed sequentially, on a node which has 64 GB of memory and access to a GPU. The GPU memory stays under control throughout the entire run (around 4 GiB out of 16 are used), while this issue concerns the standard RAM usage (which we monitor via mprof).
Code and results
As a minimal working example, we load a single array of shape (30, 2160, 2560) and compute the corresponding labels several times in a loop. If needed, we can find the best way to share the image folder, or use other data which are already easily available for testing. The code looks like the following.
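This is a sketch rather than the verbatim script: the file path and the diameter/anisotropy values are placeholders, and `run_cellpose` wraps the standard cellpose Python API (pre-trained nuclei model, 3D evaluation).

```python
import numpy as np
from cellpose import core, models


def run_cellpose(img, anisotropy=2.0, diameter=40.0):
    # Instantiate the pre-trained nuclei model (on GPU, if one is available)
    model = models.Cellpose(gpu=core.use_gpu(), model_type="nuclei")
    # channels=[0, 0] -> single grayscale channel; do_3D=True -> full 3D labels
    masks, flows, styles, diams = model.eval(
        img,
        channels=[0, 0],
        do_3D=True,
        anisotropy=anisotropy,
        diameter=diameter,
    )
    return masks


# Load a single (30, 2160, 2560) uint16 array (path is a placeholder)
img = np.load("/path/to/single_3D_image.npy")

# Repeat the same segmentation several times, while monitoring RAM via mprof
for iteration in range(5):
    labels = run_cellpose(img)
    print(f"Iteration {iteration}: found {labels.max()} labels")
```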
This code runs through, and each segmentation takes approximately 320 seconds (finding around 3k labels). The memory trace during the first few iterations of the loop is shown below, and we notice that subsequent runs use more and more memory, until this saturates after a few iterations. Looking at the plateau regions in the memory trace, for instance, their values (in GiB) are 12, 13.8, 14.1, 14.1, ... The memory-usage peaks at the end of each cellpose call also shift up by a similar amount, accumulating about 2 GiB over the first 2-3 iterations.
The simplest explanation would be that cellpose or torch is caching something, but we couldn't identify what is being cached. Is this actually happening? If so, is there a way to deactivate this caching mechanism?
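(For anyone who wants to reproduce the measurement without mprof, one way is to log the resident set size around each call; below is a minimal sketch based on psutil, which is not part of our actual code.)

```python
import psutil


def log_rss(tag):
    # Print the current resident set size (RSS) of this process, in GiB
    rss_gib = psutil.Process().memory_info().rss / 2**30
    print(f"[{tag}] RSS = {rss_gib:.2f} GiB")


for iteration in range(5):
    log_rss(f"before iteration {iteration}")
    labels = run_cellpose(img)
    log_rss(f"after iteration {iteration}")
```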
Expected behavior and why it matters
We would expect subsequent runs on the same exact input to require a very similar amount of memory, unless some caching is in place. The relevance of this issue (for us) is that even if the memory accumulation seems mild (only about 2 GiB more than expected), in more complex/heavy use cases (including additional parallelism) it may lead to memory errors (as we found in fractal-analytics-platform/fractal-client#109 (comment)). For this reason we'd really like to keep it under control, possibly by deactivating caching options (if any).
Environment
The Python code above is submitted to a SLURM queue, and it runs on a node with a GPU available.
Relevant details on the Python environment:
sys.version='3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]'
numpy.__version__='1.23.1'
torch.__version__='1.12.0+cu102'
I have no idea. Have you tried any garbage collecting? You could call cellpose as a process and then it will clean up (all those options are available on the CLI), but then you have to re-read the saved masks.
I confirm that adding gc.collect() here and there (both within the run_cellpose function, and especially right after each call to this function within the loop) does not lead to any relevant change in the memory trace.
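For reference, the placement was roughly as follows (sketched on top of the loop above; the same call was also added inside `run_cellpose`):

```python
import gc

for iteration in range(5):
    labels = run_cellpose(img)
    print(f"Iteration {iteration}: found {labels.max()} labels")
    # Drop the reference and force a collection right after each call;
    # the memory trace is essentially unchanged with or without this.
    del labels
    gc.collect()
```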
At the moment we cannot go for the CLI path, since this labeling task is part of a more complex platform for processing bio-images (https://github.com/fractal-analytics-platform/fractal), where tasks need to be Python functions.
For now we'll just keep this issue in mind, and apply mitigation strategies (e.g. working at a lower resolution) if/when needed.
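For concreteness, "working at a lower resolution" could look like the following sketch (the factor of 2 is just an example, with diameter and anisotropy rescaled from the placeholder values above):

```python
# Downsample in XY by a factor of 2 before segmentation; with half the XY
# resolution, the expected diameter (in pixels) halves and the anisotropy
# (z spacing / xy spacing) halves as well.
img_lowres = img[:, ::2, ::2]
labels_lowres = run_cellpose(img_lowres, anisotropy=1.0, diameter=20.0)
```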
I have the same issue. It'll run and then eventually eat all the available RAM. I have to mark where it left off and restart the kernel. gc.collect() did not help.