Description
I have an All2All context manager in which I call std::move() on a TKParallelTensor, and it throws this error:
terminate called after throwing an instance of std::out_of_range
std::out_of_range
what(): map::at
I believe the error comes from the destroy method in parallel_tensor.cuh when it is called on a stale TKParallelTensor that has been moved from. In that case, the stale tensor's local_rank and local_world_size values are set to -1, so this line
brokers_.at({local_rank_, local_world_size_}).sync(); // must sync before destroying the tensor
will throw an out-of-range error, since each rank only has a broker for its current local_rank and local_world_size, which isn't {-1, -1} in any reasonable multi-GPU setup. A simple way to reproduce the error is to create a TKParallelTensor in one scope, std::move() it so that another object owns it, and then leave the scope so that the moved-from tensor gets destroyed and throws the error.
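To make the failure mode concrete without a GPU, here is a minimal sketch using a hypothetical stand-in class. MockBroker, MockParallelTensor, destroy_unguarded and destroy_guarded are names invented for illustration and are not the library's API; only the brokers_ lookup and the -1 values after a move follow the behaviour described above. It assumes C++17 (for the inline static member). The mock exposes the destroy step as an ordinary method so that the exception can be caught; when the same lookup runs during destruction, an exception escaping the destructor ends in std::terminate, which would match the "terminate called" output.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a broker; not the real class.
struct MockBroker {
    void sync() {}
};

// Hypothetical stand-in for TKParallelTensor, reduced to the two fields
// mentioned in this report.
struct MockParallelTensor {
    // One broker per {local_rank, local_world_size} key.
    static inline std::map<std::pair<int, int>, MockBroker> brokers_;

    int local_rank_;
    int local_world_size_;

    MockParallelTensor(int local_rank, int local_world_size)
        : local_rank_(local_rank), local_world_size_(local_world_size) {
        // Registers a broker for this key if none exists yet.
        brokers_[std::make_pair(local_rank_, local_world_size_)];
    }

    // The move leaves the source with {-1, -1}, as described above.
    MockParallelTensor(MockParallelTensor&& other) noexcept
        : local_rank_(std::exchange(other.local_rank_, -1)),
          local_world_size_(std::exchange(other.local_world_size_, -1)) {}

    MockParallelTensor(const MockParallelTensor&) = delete;

    // Same lookup as the line quoted above: throws std::out_of_range
    // when the key is {-1, -1}.
    void destroy_unguarded() {
        brokers_.at({local_rank_, local_world_size_}).sync();
    }

    // One possible guard (an assumption, not the library's fix):
    // skip the sync for a moved-from tensor.
    void destroy_guarded() {
        if (local_rank_ < 0 || local_world_size_ < 0) return;
        brokers_.at({local_rank_, local_world_size_}).sync();
    }
};

// Returns true if destroying the moved-from mock throws std::out_of_range.
bool unguarded_destroy_throws_after_move() {
    MockParallelTensor a(0, 8);
    MockParallelTensor b(std::move(a));  // a now holds {-1, -1}
    try {
        a.destroy_unguarded();
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}

// Returns true if the guarded variant handles both objects without throwing.
bool guarded_destroy_is_noop_after_move() {
    MockParallelTensor a(0, 8);
    MockParallelTensor b(std::move(a));
    try {
        a.destroy_guarded();  // moved-from: returns early
        b.destroy_guarded();  // owner: finds its broker and syncs
    } catch (...) {
        return false;
    }
    return true;
}
```

Calling unguarded_destroy_throws_after_move() from a test or main exercises the same map::at lookup on the moved-from object, while guarded_destroy_is_noop_after_move() shows that an early return for the {-1, -1} state would avoid it. Whether such a guard is appropriate for the real class depends on the answer to the question below.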
Is this an intentional design choice so that TKParallelTensor objects shouldn't be moved?