spykingcircus2: optimization and speedup for in vitro HD-MEA recordings with low SNR #3543

Djoels opened this issue Nov 19, 2024 · 16 comments

Djoels commented Nov 19, 2024

Running SC2 on 10-minute / 37 GB in vitro HD-MEA recordings (from hiPSC cultures) yields some surprisingly good results without too much fine-tuning (using SI 0.101 for this).
I have a couple of questions about getting it to run more smoothly:

It takes about 10,000 s (roughly 3 hours) to run. Is there any way I can configure it to run faster?
I tried changing the number of jobs to 80% of the cores (28 jobs) and setting the chunk size so that it would take bigger chunks. It seems that many cores are used, but their individual memory usage is very low.
Should I preferably spin up a system with more cores (I'm running in Azure currently)?
I have access to a GPU, but SC2 doesn't use GPU acceleration, right?
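For context, this is roughly how I set the parallelization (a sketch; the exact n_jobs / chunk values below are just placeholders for what I experimented with):

import spikeinterface.full as si

# Global job kwargs picked up by all parallel steps (peak detection, matching, ...).
# n_jobs=28 is ~80% of the cores on my VM; chunk_duration controls the chunk size.
si.set_global_job_kwargs(n_jobs=28, chunk_duration="1s", progress_bar=True)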

If I were to try to fine-tune a limited set of parameters for the low-SNR, 1023-readout-channel use case, which ones should I focus on?
I'm assuming the following, but I may have missed some (see also the sketch after this list):

  • detect_threshold,
  • radius_um,
  • I'm not sure about the impact of the matching engine method; is there a clear choice for the use case I'm describing?
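Since I'm not sure about the exact nesting of these parameters, here is a sketch of what I do to discover them, assuming the standard SpikeInterface API (recording stands for the already-loaded recording object; the folder name is a placeholder):

from spikeinterface.sorters import get_default_sorter_params, run_sorter

# Inspect the default SC2 parameters to see where detect_threshold,
# radius_um and the matching method actually live in the parameter tree.
defaults = get_default_sorter_params("spykingcircus2")
print(defaults)

sorting = run_sorter(
    "spykingcircus2",
    recording,             # the preprocessed recording object
    folder="sc2_output",   # placeholder output folder
    verbose=True,          # also helps with the logging question below
    # detection={"detect_threshold": 5},  # example override; take the exact keys from `defaults`
)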

I'm also having trouble identifying from the logging information which step is in progress at a given point in time.
I made a chart based on the code to try to understand the flow.
Is there a way to enable logging that reflects the high-level steps (or would it be OK if I try to add logging)?
[attached image: flow chart of the SpykingCircus2 pipeline]

Djoels changed the title from "spykingcircus2: optimization and speedup for HD-MEA with low SNR" to "spykingcircus2: optimization and speedup for in vitro HD-MEA recordings with low SNR" on Nov 19, 2024
alejoe91 added the question (General question regarding SI sorters) and sorters-module labels on Nov 19, 2024
yger (Collaborator) commented Nov 19, 2024

Thanks for your interest. In fact, I have a working branch that should speed up the whole algorithm quite drastically, making use of recent changes discussed with @samuelgarcia. Currently no GPU is used, but with a powerful machine and tens of cores this should be much faster than the numbers you are reporting.
The best would be, if you are willing, to share your 37 GB file (or an even smaller one if you want) so that I can validate everything on it before merging everything into master. Then we can discuss the optimization of the parameters. Your pipeline is correct; this is the generic workflow of the algorithm.

Djoels (Author) commented Nov 20, 2024

Thank you very kindly for being willing to have a look at this. I'm eager to learn how you troubleshoot this, as I'm still quite a novice at this kind of work.

I've created a separate blob container in Azure with the recording and minimal code:
https://storageczispikesort.blob.core.windows.net/safesharecont01?sv=2023-01-03&spr=https%2Chttp&st=2024-11-21T00%3A00%3A00Z&se=2024-11-28T00%3A00%3A00Z&sr=c&sp=rl&sig=o%2FNBVdjwvdN%2F8quqQ2kjrVMgkKBp3eNp3l87UKG3KdM%3D

You can download this using the azcopy command:

azcopy copy "<URL_FROM_ABOVE>" <local_dir> --recursive

It contains a hybrid ground-truth recording under the hybridgt_20241011 directory, and the code to read the recording and ground-truth sorting data and perform a basic sorting run is in the read_minimal.py file.
I also added a pip freeze (freeze.txt) of an environment in which the run took me approximately 10,000 seconds.

Update: @yger let me know if you have access issues; it should be working until 28/11, if I set it up right.

yger (Collaborator) commented Nov 27, 2024

Ok, so I've made some tests; thanks for sharing the data. I have a working branch called "sc2_recording_slices" that will be merged soon, and with the default params in this branch the code takes 3000 s to run on my machine with 28 jobs. An important point is that because I have enough RAM, the file is written to memory, which might also speed everything up; you need to double-check whether this is also the case for you. So we have an improvement, but it is still (a bit) long.

However, given that the longest step is the template matching, one other possibility to speed everything up further is to switch the default template-matching engine from circus-omp-svd to wobble. I won't discuss the differences in depth, but the results should be broadly similar and the matching should be faster (x1.5, I would say). I'll test that. You can try it on your side by updating the params of SC2:

params = {"matching": {"method": "wobble"}}
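For example, passed through run_sorter this would look roughly like the following (a sketch; recording and the output folder are placeholders):

from spikeinterface.sorters import run_sorter

params = {"matching": {"method": "wobble"}}

# recording: the (preprocessed) recording object you already load in read_minimal.py
sorting = run_sorter(
    "spykingcircus2",
    recording,
    folder="sc2_wobble",  # placeholder output folder
    **params,             # expands to matching={"method": "wobble"}
)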

I'll keep digging a bit, and keep you posted

yger (Collaborator) commented Nov 27, 2024

I should mention that in this new branch you now have the option to use the GPU during the fitting procedure (both with circus-omp-svd and wobble). You can do so by setting:

params = {"matching" : {'engine' : 'torch', 'torch_device' : 'cuda'}}
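If you try this, a small sanity check before requesting CUDA might help (a sketch, assuming PyTorch is installed in your environment):

import torch

# Fall back to the CPU if no CUDA device is visible to PyTorch.
device = "cuda" if torch.cuda.is_available() else "cpu"
params = {"matching": {"engine": "torch", "torch_device": device}}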

However, this has not yet been properly benchmarked, and I don't know what works best: a few cores with a GPU, or lots of cores without a GPU. I should also have asked: are you using Linux or Windows? That might play an important role as well...

Djoels (Author) commented Nov 27, 2024

Thank you so much for looking into this: these are some exciting developments!

I am working on a Linux distribution. The RAM can be tailored to a given situation; some example (pure CPU) configurations:

  • Standard_D15_v2: 20 cores, 140 GB RAM, 1000 GB disk
  • Standard_D32s_v3: 32 cores, 128 GB RAM, 256 GB storage
  • Standard_D64_v3: 64 cores, 256 GB RAM, 1600 GB storage

About caching the recording in memory, I'm not so sure how to set these parameters (I left them at the default values):
'cache_preprocessing': {'mode': 'memory', 'memory_limit': 0.5, 'delete_cache': True},
If I have 140 GB of RAM at my disposal, it should suffice to have 50%, i.e. 70 GB, allocated for the recording, right?

Note: when trying to rerun, I get this notification right before the "detect peaks using locally_exclusive" step:
Recording too large to be preloaded in RAM...

yger (Collaborator) commented Nov 28, 2024

Ok, good to know that you are on Linux, because this is what I'm using as well, and the multiprocessing mode of spikeinterface is known to work better there than on Windows.

The caching can be optimized given your amount of RAM. What the code will do is try to fit the preprocessed recording into RAM (in float32, so its size might be larger than the original if the raw data are in int16), provided that the fraction of your RAM given by memory_limit (0.5) is free, available, and big enough to receive the recording. Given that you see the warning, the recording is not being preloaded in RAM. You can try to increase memory_limit if you are willing to devote more RAM to SC2.

Not preloading into RAM is not a "major" deal, and it will be like that anyway for very long recordings, but because there are multiple passes over the data (to find peaks, to match peaks, ...), make sure you have an SSD drive there, because this is the main bottleneck. Also, without caching, every time chunks are reloaded the preprocessing steps are re-applied. This is fine as long as you do not have overly complicated preprocessing steps, but otherwise it is good to know that there might be some speed gain there. If there is not enough RAM, you can still cache the preprocessed file to a folder, but that requires you to have enough disk space. I'll keep playing and we'll push the PR into main; I'll let you know.
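As a rough illustration of that check (a back-of-envelope sketch with psutil, not the exact internal code; recording is your loaded recording object):

import psutil

# Preprocessed data are float32, i.e. 4 bytes per sample, regardless of the raw dtype.
n_bytes_float32 = recording.get_total_samples() * recording.get_num_channels() * 4

memory_limit = 0.5  # the default 'memory_limit' of cache_preprocessing
ram_budget = psutil.virtual_memory().available * memory_limit

if n_bytes_float32 < ram_budget:
    print("preprocessed recording should fit in RAM")
else:
    print(f"too large: needs {n_bytes_float32 / 1e9:.1f} GB, budget is {ram_budget / 1e9:.1f} GB")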

Djoels (Author) commented Nov 28, 2024

When I'm running SC2 again on a similar recording, I see this in the command line:
write_memory_recording: 0%| | 0/602 [00:00<?, ?it/s]

However, it doesn't appear to be moving. Maybe this is an issue with visualizing the progress; I can't recall ever seeing it go to completion, all of a sudden it is just done, I guess.

CPU usage is at 100% (on all but 2 cores) during this time:
[screenshot: per-core CPU usage]

Update: I also tried a much bigger cluster:
[screenshot: specs of the larger cluster]
with, as it stands, the same issue...

yger (Collaborator) commented Nov 28, 2024

Then I guess this is just a display issue with the progress bar. Weird, because I have never seen that, but I'll look into it.

Djoels (Author) commented Nov 28, 2024

Four hours later the output still hasn't changed; the system seems completely stalled on this.
Maybe I should try the zarr approach?

Update: I tried the zarr approach and exactly the same thing occurred, on the 64-core system. All cores are fully used, there is no output in the command line, and nothing seems to happen.

Trying folder mode, as it seems to be the only remaining way to go about it...
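Concretely, what I'm trying is roughly this (a sketch; I take the exact keys and mode names from the sorter defaults rather than guessing them):

from spikeinterface.sorters import get_default_sorter_params, run_sorter

# Check which cache modes / keys this SpikeInterface version actually exposes.
print(get_default_sorter_params("spykingcircus2")["cache_preprocessing"])

# Switch the cache from "memory" to "folder" (use the mode string reported above).
params = {"cache_preprocessing": {"mode": "folder"}}
sorting = run_sorter("spykingcircus2", recording, folder="sc2_folder_cache", **params)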

yger (Collaborator) commented Nov 29, 2024

Ok, then forget about the caching, but this is strange. As said, caching is a plus, but the speedup should not be major; I'll redo some more benchmarks with and without it to test that. While waiting for the new branch, you can also change the fitting engine to wobble, which would already be a gain.

samuelgarcia (Member) commented:

@Djoels: congrats on making this high-quality diagram. We should add it to the docs at some point to describe the sorting components.

Djoels (Author) commented Jan 8, 2025

About the chart: you can adapt it to your needs on the draw.io website by importing the attached SVG (I originally made it using a draw.io extension in Confluence).
spykingcircus2_algorithm.drawio.svg.zip

About the issue: I'll soon start some experiments on SLURM, which should be a more standardized Linux environment and hopefully fix the issues with caching.

samuelgarcia (Member) commented:

Thank you.

On SLURM, something is badly estimating the memory used, because the shared memory is counted for every process.
I experience this in my lab; it depends on the system and on some SLURM options.
In short, sometimes a simple top shows high RAM usage and SLURM kills the job because it thinks it uses too much RAM. Tell me if it works for you.
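To illustrate the double counting, a quick sketch one can run on a worker node (the PID here is hypothetical):

import psutil

# On Linux, memory_info() reports both resident (rss) and shared memory.
# Accounting that sums rss over all workers counts the shared pages once per
# process, even though they are the same physical memory.
p = psutil.Process(12345)  # hypothetical PID of one SC2 worker
mem = p.memory_info()
print(f"rss = {mem.rss / 1e9:.2f} GB, shared = {mem.shared / 1e9:.2f} GB")
print(f"unique (rss - shared) ~ {(mem.rss - mem.shared) / 1e9:.2f} GB")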

yger (Collaborator) commented Jan 16, 2025

One key PR for SC2 has been merged. There are some remaining ones in the pipeline, but you should already get quite a speedup. Let me know how it goes; I'll also do some tests on my side. The clustering will evolve slightly in the coming days, but we'll see.

Djoels (Author) commented Jan 20, 2025

It saddens me to say that I appear to have exactly the same condition on SLURM:
ample resources, but somehow the multiprocessing part gets completely stuck. All processes have enough resources but are stuck in the "sleeping" status, waiting for some condition I don't know.
The main command is also stuck in status S...
I would expect this error to be specific to one environment, but since it happens in both Azure and SLURM, I guess it has to do with my conda environment? Is there a specific requirement (other than hdbscan) for the Python version?
For instance, I run Python 3.12; is this too old/new?
I also find that the iteration counter of write_memory_recording doesn't update, and it takes ages to even get there.

I will try a minimal version of my code soon; hopefully I'll figure out from that what is causing this.
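One thing I plan to test in that minimal version, in case the hang is related to the multiprocessing start method (a sketch; I'm not sure yet that this is the cause, and mp_context should be double-checked against the job kwargs of the installed SI version):

import spikeinterface.full as si

# Force the "spawn" start method for the parallel workers instead of the platform
# default, in case fork interacts badly with libraries preloaded in the environment.
si.set_global_job_kwargs(n_jobs=28, mp_context="spawn", progress_bar=True)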

Djoels (Author) commented Jan 22, 2025

I figured out that it had to do with my environment. In Azure, the default environments start from a number of pre-installed libraries. After creating an environment from scratch, I succeeded in getting spykingcircus2 (and TDC2, which also suffered from this problem) working.

The environment I created:

name: si_minimal
channels:
 - conda-forge
 - pytorch
 - nvidia
dependencies:
 - python=3.12
 - matplotlib
 - pytorch
 - torchvision
 - torchaudio
 - pyyaml
 - optuna
 - line_profiler
 - natsort
 - pynvml
 - ipython
 - pandas
 - h5py
 - hdbscan
 - python-dotenv
 - pip
 - pip:
    - kilosort==4.0.12
    - shybrid
    - docker
    - git+https://github.com/SpikeInterface/spikeinterface.git
    - herdingspikes
    - mountainsort5

I did, however, stumble upon the issue of leaked shared_memory objects and decided to use the "no-cache" mode, because the leak caused the sorting run to stop:

/anaconda/envs/azureml_py312/lib/python3.12/multiprocessing/resource_tracker.py:254: 
UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown

TDC2 runs now take roughly 20 minutes with 14 available cores (37 GB, 600 s recording).
SC2 runs take some 15 minutes with 45 cores (also a 37 GB, 600 s recording).
