@FabianSchubert and @tnowotny - our discussion got me thinking: these operations would be pretty simple to implement.
Batch reduction

Max reduction currently generates code something like this (thread per synapse/neuron):
```c++
scalar lrMax = SCALAR_MIN;
for(unsigned int batch = 0; batch < 512; batch++) {
    const unsigned int batchOffset = size * batch;
    const scalar lValue = group->Value[batchOffset + lid];
    lrMax = fmax(lrMax, lValue);
}
group->Max[lid] = lrMax;
```
Argmax would simply look like:
```c++
unsigned int lrArgMax = 0;
scalar lrMax = SCALAR_MIN;
for(unsigned int batch = 0; batch < 512; batch++) {
    const unsigned int batchOffset = size * batch;
    const scalar lValue = group->Value[batchOffset + lid];
    if(lValue > lrMax) {
        lrMax = lValue;
        lrArgMax = batch;
    }
}
group->ArgMax[lid] = lrArgMax;
```
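Purely as a sanity check outside of GeNN's generated code, a standalone version of this pattern could look like the following (all illustrative assumptions: `scalar` is `float`, `SCALAR_MIN` is `-FLT_MAX`, and the group structure is replaced by a flat array with small fixed sizes):

```cuda
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Standalone sketch of the thread-per-neuron batch argmax pattern.
// numBatches and size are compile-time constants purely for this test.
constexpr unsigned int numBatches = 8;
constexpr unsigned int size = 64;

__global__ void batchArgMax(const float *value, unsigned int *argMax)
{
    const unsigned int lid = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(lid < size) {
        unsigned int lrArgMax = 0;
        float lrMax = -FLT_MAX;
        for(unsigned int batch = 0; batch < numBatches; batch++) {
            const float lValue = value[(size * batch) + lid];
            if(lValue > lrMax) {
                lrMax = lValue;
                lrArgMax = batch;
            }
        }
        argMax[lid] = lrArgMax;
    }
}

int main()
{
    float hValue[numBatches * size];
    for(unsigned int b = 0; b < numBatches; b++) {
        for(unsigned int i = 0; i < size; i++) {
            // Plant the maximum for neuron i in batch (i % numBatches)
            hValue[(b * size) + i] = (b == (i % numBatches)) ? 1.0f : 0.0f;
        }
    }

    float *dValue;
    unsigned int *dArgMax;
    cudaMalloc(&dValue, sizeof(hValue));
    cudaMalloc(&dArgMax, size * sizeof(unsigned int));
    cudaMemcpy(dValue, hValue, sizeof(hValue), cudaMemcpyHostToDevice);

    batchArgMax<<<1, size>>>(dValue, dArgMax);

    unsigned int hArgMax[size];
    cudaMemcpy(hArgMax, dArgMax, sizeof(hArgMax), cudaMemcpyDeviceToHost);
    for(unsigned int i = 0; i < size; i++) {
        if(hArgMax[i] != (i % numBatches)) {
            printf("Mismatch at %u: %u\n", i, hArgMax[i]);
        }
    }
    cudaFree(dValue);
    cudaFree(dArgMax);
    return 0;
}
```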
Neuron reductions

These are a little gnarlier, but max currently looks like this (32-thread warp per batch):
```cuda
scalar lrMax = SCALAR_MIN;
for(unsigned int idx = lane; idx < group->size; idx += 32) {
    const scalar lVal = group->Val[batchOffset + idx];
    lrMax = fmax(lrMax, lVal);
}
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 16));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 8));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 4));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 2));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 1));
if(lane == 0) {
    group->Max[batch] = lrMax;
}
```
Argmax would look like (totally untested):
```cuda
unsigned int lrLaneArgMax = 0;
scalar lrLaneMax = SCALAR_MIN;
for(unsigned int idx = lane; idx < group->size; idx += 32) {
    const scalar lVal = group->Val[batchOffset + idx];
    if(lVal > lrLaneMax) {
        lrLaneMax = lVal;
        lrLaneArgMax = idx;
    }
}
// Reduce per-lane maxima across the warp
scalar lrMax = lrLaneMax;
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 16));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 8));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 4));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 2));
lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, 1));
// After __shfl_down_sync, only lane 0 holds the true maximum, so broadcast it to the whole warp
lrMax = __shfl_sync(0xFFFFFFFF, lrMax, 0);
// Find which lane(s) provided max value
const unsigned int ballot = __ballot_sync(0xFFFFFFFF, lrMax == lrLaneMax);
// If this is the first such lane (arbitrary decision), use the index of the max within the neurons we've considered
if(lane == (__ffs(ballot) - 1)) {
    group->ArgMax[batch] = lrLaneArgMax;
}
```
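Since the above is untested, here is a minimal standalone sketch of the same ballot-based lane selection that can actually be compiled and run (again assuming `scalar` is `float` and replacing the group structure with a flat array; all names and sizes here are illustrative). Note the extra `__shfl_sync` broadcast - after the `__shfl_down_sync` reduction only lane 0 holds the true maximum, so it has to be shared before every lane compares against its own value:

```cuda
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// One 32-thread warp scans one "batch" of size elements
constexpr unsigned int size = 100;

__global__ void warpArgMax(const float *val, unsigned int *argMax)
{
    const unsigned int lane = threadIdx.x;
    unsigned int lrLaneArgMax = 0;
    float lrLaneMax = -FLT_MAX;
    for(unsigned int idx = lane; idx < size; idx += 32) {
        const float lVal = val[idx];
        if(lVal > lrLaneMax) {
            lrLaneMax = lVal;
            lrLaneArgMax = idx;
        }
    }

    // Warp-reduce the per-lane maxima
    float lrMax = lrLaneMax;
    for(unsigned int offset = 16; offset > 0; offset /= 2) {
        lrMax = fmax(lrMax, __shfl_down_sync(0xFFFFFFFF, lrMax, offset));
    }
    // Broadcast the true maximum from lane 0 to the whole warp
    lrMax = __shfl_sync(0xFFFFFFFF, lrMax, 0);

    // Find which lane(s) held the maximum and let the first one write its index
    const unsigned int ballot = __ballot_sync(0xFFFFFFFF, lrMax == lrLaneMax);
    if(lane == (__ffs(ballot) - 1)) {
        *argMax = lrLaneArgMax;
    }
}

int main()
{
    float hVal[size];
    for(unsigned int i = 0; i < size; i++) {
        hVal[i] = (float)i;
    }
    hVal[42] = 1000.0f;  // plant the maximum at a known index

    float *dVal;
    unsigned int *dArgMax;
    cudaMalloc(&dVal, sizeof(hVal));
    cudaMalloc(&dArgMax, sizeof(unsigned int));
    cudaMemcpy(dVal, hVal, sizeof(hVal), cudaMemcpyHostToDevice);

    warpArgMax<<<1, 32>>>(dVal, dArgMax);

    unsigned int hArgMax;
    cudaMemcpy(&hArgMax, dArgMax, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("argmax = %u (expected 42)\n", hArgMax);

    cudaFree(dVal);
    cudaFree(dArgMax);
    return 0;
}
```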
Most of this will be trivial and will involve:

- Adding a new `VarAccessMode` and corresponding `VarAccess` to https://github.com/genn-team/genn/blob/master/include/genn/genn/varAccess.h
- Extending `getReductionOperation` and `getReductionInitialValue` in https://github.com/genn-team/genn/blob/master/src/genn/genn/code_generator/codeGenUtils.cc to generate and update multiple values; see the sketch after this list (the convention is that, when the code being generated is potentially a bit longer, you take a `CodeStream` by reference and write to that, as it will respect indentation etc.)
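Purely to illustrate the multiple-value point (all names and signatures below are hypothetical, not GeNN's actual API, and a plain `std::ostream` stands in for `CodeStream`), an argmax reduction operation could be emitted through a stream rather than returned as a single update expression, since both the running maximum and its index need updating:

```c++
#include <iostream>
#include <string>

// Hypothetical sketch only - not the real getReductionOperation signature.
// Emits a two-value update that can't be expressed as a single assignment.
void genArgMaxReductionOp(std::ostream &os, const std::string &maxVar,
                          const std::string &argMaxVar,
                          const std::string &value, const std::string &index)
{
    os << "if(" << value << " > " << maxVar << ") {" << std::endl;
    os << "    " << maxVar << " = " << value << ";" << std::endl;
    os << "    " << argMaxVar << " = " << index << ";" << std::endl;
    os << "}" << std::endl;
}

int main()
{
    // Would emit the inner update used in the batch argmax example above
    genArgMaxReductionOp(std::cout, "lrMax", "lrArgMax", "lValue", "batch");
    return 0;
}
```

Taking the stream by reference keeps nesting and indentation consistent with the surrounding generated kernel, matching the convention described above.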