Performance investigation #112

Open
neworderofjamie opened this issue Aug 27, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@neworderofjamie
Contributor

neworderofjamie commented Aug 27, 2024

Because EventProp is really fast, Amdahl's law once again strikes and CPU-side overheads start to become problematic, especially when training on large datasets like SSC. Training one batch takes approximately 25 ms but there are roughly 2 ms 'gaps' between batches. With 2359 batches in the training set, this corresponds to about 1 minute of actual training computation and around 5 s spent between batches per epoch. Examining this period in Nsight Systems shows the following (the memcpy coming in from the left is the readout and only appears massive because it was added to the command queue long before it ran - the actual copy time is the tiny purple bar):

[Nsight Systems timeline screenshot of the gap between two batches]
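Just to spell out the arithmetic behind those per-epoch numbers (all figures are the measurements quoted above):

```python
batches = 2359        # SSC training set at the batch size used here
batch_time = 25e-3    # s of GPU computation per batch
gap_time = 2e-3       # s of CPU-side 'gap' between batches

print(f"compute per epoch:  {batches * batch_time:.1f} s")   # ~59 s
print(f"gaps per epoch:     {batches * gap_time:.1f} s")      # ~4.7 s
print(f"overhead fraction:  {gap_time / (batch_time + gap_time):.1%}")  # ~7%
```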

Biggest blocks of GPU time are:

  • Batch reduction (300µs)
  • Spike time memcpy (350µs)

Biggest blocks of CPU time (i.e. GPU idle time) are:

  • Around asynchronous memcpys to symbol (Alpha, MomentScale1 and MomentScale2 being set on the Adam optimisers associated with 3 connections).

  • Between the end of the batch custom updates and the spike time memcpy. A standard Python profile of SpikeInput.set_input shows:

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      2671    0.408    0.000    2.543    0.001  spike_input.py:50(set_input)
      2671    0.751    0.000    1.374    0.001  data.py:186(batch_spikes)
      8013    0.711    0.000    0.711    0.000  model_preprocessor.py:56(push_to_device)
      2671    0.216    0.000    0.216    0.000  data.py:211()
      2673    0.070    0.000    0.179    0.000  shape_base.py:219(vstack)
      2673    0.093    0.000    0.108    0.000  shape_base.py:81(atleast_2d)

    suggesting that, overall, this function accounts for around half of the 5 s inter-batch time (matching the Nsight Systems profile) and that the Python processing of the PreprocessedSpikes data structure is more expensive than the synchronous CUDA memcpys themselves (a sketch of how this profile can be reproduced follows below).
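For reference, a profile like the one above can be collected with something along these lines - a minimal sketch where `run_epoch` is a stand-in for the real training loop that calls SpikeInput.set_input once per batch:

```python
import cProfile
import pstats

def run_epoch():
    """Stand-in for one epoch of the EventProp training loop.

    In the real benchmark this would iterate over the SSC training set and
    call SpikeInput.set_input once per batch, followed by the custom updates
    and readout copy.
    """
    for _ in range(2359):
        pass  # spike_input.set_input(batch) etc. would go here

profiler = cProfile.Profile()
profiler.enable()
run_epoch()
profiler.disable()

# Sort by cumulative time so set_input and its callees float to the top,
# matching the table above
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```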

Possible ways to improve these overheads include:

  • Make spike memcpy asynchronous: this could save around 60µs per-batch by overlapping copying of spike times (first big block) and calling utils.data.calc_start_spikes.
  • Use CUDA streams and double-buffer spike memcpying: while this could save around 290µs per-batch, the Python processing of PreprocessedSpikes data structures would still be problematic (see the double-buffering sketch after this list).
  • Copy multiple batches to the GPU at a time: this is what Thomas's code does and would obviously help.
  • Inspired by Spyx, we could replace the spike source array with a simpler model that uses a dense 1-bit tensor to store spikes. Based on 700 input channels, 1000 timesteps and a maximum of 20295 spikes per-example (from SSC), both data formats use around 85KByte per example but 1-bit tensors do not require the same level of processing as the current data structure and could be copied in a single memcpy (see the packing sketch after this list).
  • More thought is probably required about how to make dynamic parameter setting more efficient.
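A rough sketch of the double-buffering idea, using CuPy purely for illustration (the real copies would go through GeNN's own runtime; buffer sizes, batch counts and the host-side "prepare" step are stand-ins):

```python
import numpy as np
import cupy as cp

NUM_BATCHES = 8
BATCH_BYTES = 85_000          # ~85 KByte of spike data per example (see above)

def pinned_empty(nbytes):
    """Page-locked host buffer so cudaMemcpyAsync can overlap with compute."""
    mem = cp.cuda.alloc_pinned_memory(nbytes)
    return np.frombuffer(mem, np.uint8, nbytes)

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
host_bufs = [pinned_empty(BATCH_BYTES) for _ in range(2)]
device_bufs = [cp.empty(BATCH_BYTES, dtype=cp.uint8) for _ in range(2)]

for i in range(NUM_BATCHES):
    buf = i % 2
    # Wait for the copy issued two iterations ago before overwriting its host buffer
    streams[buf].synchronize()
    host_bufs[buf][:] = i % 256                                 # stand-in for building spike data
    device_bufs[buf].set(host_bufs[buf], stream=streams[buf])   # async H2D copy
    # ... batch i's kernels would be enqueued on streams[buf] here, so the
    # spike data for batch i+1 can be prepared on the CPU while batch i runs ...

for s in streams:
    s.synchronize()
```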
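And a rough NumPy sketch of the Spyx-style dense 1-bit encoding (channel count and timestep count match the SSC numbers above; the random spike_times / spike_channels arrays are just a stand-in for one example's sparse spike data):

```python
import numpy as np

NUM_TIMESTEPS = 1000
NUM_CHANNELS = 700

# Stand-in sparse representation of one example: parallel arrays of
# spike timesteps and channel indices
rng = np.random.default_rng(1234)
num_spikes = 20000
spike_times = rng.integers(0, NUM_TIMESTEPS, num_spikes)
spike_channels = rng.integers(0, NUM_CHANNELS, num_spikes)

# Dense boolean (timestep, channel) raster, then pack 8 entries per byte so
# the whole example becomes one contiguous buffer
dense = np.zeros((NUM_TIMESTEPS, NUM_CHANNELS), dtype=bool)
dense[spike_times, spike_channels] = True
packed = np.packbits(dense, axis=1)   # shape (1000, 88), uint8

print(packed.nbytes)                  # 88000 bytes, ~86 KiB
# This buffer could be uploaded with a single memcpy and unpacked with a
# bit test per (timestep, channel) inside the input neuron model
```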

I think, when balancing performance against the desire to maintain backward compatibility, adding support for copying multiple batches to the GPU at a time while keeping the current data structure would probably be the best option.

@neworderofjamie neworderofjamie added the enhancement New feature or request label Aug 27, 2024