Performance investigation #112

Open
neworderofjamie opened this issue Aug 27, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@neworderofjamie
Contributor

neworderofjamie commented Aug 27, 2024

Because EventProp is really fast, Amdahl's law once again strikes and CPU-side overheads start to become problematic, especially when training on large datasets like SSC. Training one batch takes approximately 25 ms but there are roughly 2 ms 'gaps' between batches. With 2359 batches in the training set, this corresponds to about 1 minute of actual training computation and around 5 s spent between batches per epoch. Examining this period in Nsight Systems shows the following (the memcpy coming in from the left is the readout and only appears massive because it was added to the command queue long before it ran - the actual copy time is the tiny purple bar):

[Nsight Systems timeline screenshot of the gap between two batches]
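Just to spell out the arithmetic behind those per-epoch numbers (all figures are the measurements quoted above):

```python
batches = 2359        # SSC training set at the batch size used here
batch_time = 25e-3    # s of GPU computation per batch
gap_time = 2e-3       # s of CPU-side 'gap' between batches

print(f"compute per epoch:  {batches * batch_time:.1f} s")   # ~59 s
print(f"gaps per epoch:     {batches * gap_time:.1f} s")      # ~4.7 s
print(f"overhead fraction:  {gap_time / (batch_time + gap_time):.1%}")  # ~7%
```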

Biggest blocks of GPU time are:

  • Batch reduction (300µs)
  • Spike time memcpy (350µs)

Biggest blocks of CPU time (i.e. GPU idle time) are:

  • Around asynchronous memcpys to symbol (Alpha, MomentScale1 and MomentScale2 being set on the Adam optimisers associated with 3 connections).

  • Between the end of the batch custom updates and the spike time memcpy. A standard Python profile of SpikeInput.set_input shows:

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      2671    0.408    0.000    2.543    0.001  spike_input.py:50(set_input)
      2671    0.751    0.000    1.374    0.001  data.py:186(batch_spikes)
      8013    0.711    0.000    0.711    0.000  model_preprocessor.py:56(push_to_device)
      2671    0.216    0.000    0.216    0.000  data.py:211()
      2673    0.070    0.000    0.179    0.000  shape_base.py:219(vstack)
      2673    0.093    0.000    0.108    0.000  shape_base.py:81(atleast_2d)

    suggesting that, overall, this function accounts for around half of the 5 s inter-batch time (matching the Nsight Systems profile) and that the Python processing of the PreprocessedSpikes data structure is more expensive than the synchronous CUDA memcpys themselves (a sketch of how this profile can be reproduced follows below).
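For reference, a profile like the one above can be collected with something along these lines - a minimal sketch where `run_epoch` is a stand-in for the real training loop that calls SpikeInput.set_input once per batch:

```python
import cProfile
import pstats

def run_epoch():
    """Stand-in for one epoch of the EventProp training loop.

    In the real benchmark this would iterate over the SSC training set and
    call SpikeInput.set_input once per batch, followed by the custom updates
    and readout copy.
    """
    for _ in range(2359):
        pass  # spike_input.set_input(batch) etc. would go here

profiler = cProfile.Profile()
profiler.enable()
run_epoch()
profiler.disable()

# Sort by cumulative time so set_input and its callees float to the top,
# matching the table above
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```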

Possible ways to improve these overheads include:

  • Make spike memcpy asynchronous: this could save around 60µs per-batch by overlapping copying of spike times (first big block) and calling utils.data.calc_start_spikes.
  • Use CUDA streams and double-buffer spike memcpying: while this could save around 290µs per-batch, the Python processing of PreprocessedSpikes data structures would still be problematic (see the double-buffering sketch after this list).
  • Copy multiple batches to the GPU at a time: this is what Thomas's code does and would obviously help.
  • Inspired by Spyx, we could replace the spike source array with a simpler model that uses a dense 1-bit tensor to store spikes. Based on 700 input channels, 1000 timesteps and a maximum of 20295 spikes per-example (from SSC), both data formats use around 85KByte per example but 1-bit tensors do not require the same level of processing as the current data structure and could be copied in a single memcpy (see the packing sketch after this list).
  • More thought is probably required about how to make dynamic parameter setting more efficient.
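A rough sketch of the double-buffering idea, using CuPy purely for illustration (the real copies would go through GeNN's own runtime; buffer sizes, batch counts and the host-side "prepare" step are stand-ins):

```python
import numpy as np
import cupy as cp

NUM_BATCHES = 8
BATCH_BYTES = 85_000          # ~85 KByte of spike data per example (see above)

def pinned_empty(nbytes):
    """Page-locked host buffer so cudaMemcpyAsync can overlap with compute."""
    mem = cp.cuda.alloc_pinned_memory(nbytes)
    return np.frombuffer(mem, np.uint8, nbytes)

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
host_bufs = [pinned_empty(BATCH_BYTES) for _ in range(2)]
device_bufs = [cp.empty(BATCH_BYTES, dtype=cp.uint8) for _ in range(2)]

for i in range(NUM_BATCHES):
    buf = i % 2
    # Wait for the copy issued two iterations ago before overwriting its host buffer
    streams[buf].synchronize()
    host_bufs[buf][:] = i % 256                                 # stand-in for building spike data
    device_bufs[buf].set(host_bufs[buf], stream=streams[buf])   # async H2D copy
    # ... batch i's kernels would be enqueued on streams[buf] here, so the
    # spike data for batch i+1 can be prepared on the CPU while batch i runs ...

for s in streams:
    s.synchronize()
```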
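And a rough NumPy sketch of the Spyx-style dense 1-bit encoding (channel count and timestep count match the SSC numbers above; the random spike_times / spike_channels arrays are just a stand-in for one example's sparse spike data):

```python
import numpy as np

NUM_TIMESTEPS = 1000
NUM_CHANNELS = 700

# Stand-in sparse representation of one example: parallel arrays of
# spike timesteps and channel indices
rng = np.random.default_rng(1234)
num_spikes = 20000
spike_times = rng.integers(0, NUM_TIMESTEPS, num_spikes)
spike_channels = rng.integers(0, NUM_CHANNELS, num_spikes)

# Dense boolean (timestep, channel) raster, then pack 8 entries per byte so
# the whole example becomes one contiguous buffer
dense = np.zeros((NUM_TIMESTEPS, NUM_CHANNELS), dtype=bool)
dense[spike_times, spike_channels] = True
packed = np.packbits(dense, axis=1)   # shape (1000, 88), uint8

print(packed.nbytes)                  # 88000 bytes, ~86 KiB
# This buffer could be uploaded with a single memcpy and unpacked with a
# bit test per (timestep, channel) inside the input neuron model
```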

I think, when balancing performance against the desire to maintain backward compatibility, adding support for copying multiple batches to the GPU at a time while keeping the current data structure would probably be the best option.

@neworderofjamie neworderofjamie added the enhancement New feature or request label Aug 27, 2024