Move + Destroy bug in TKParallelTensor #180

@kingsleykimm

I have an All2All context manager in which I call std::move() on a TKParallelTensor, and it throws this error:

terminate called after throwing an instance of std::out_of_range
std::out_of_range
  what():  map::at

I believe the error comes from the destroy method in parallel_tensor.cuh when it is called on a stale TKParallelTensor that has been moved from. In that case, the stale tensor's local_rank and local_world_size values are set to -1, so this line

brokers_.at({local_rank_, local_world_size_}).sync(); // must sync before destroying the tensor

will throw an out-of-range error, since each rank only has a broker for its actual local_rank and local_world_size, which is never {-1, -1} in a reasonable multi-GPU setup. A simple way to reproduce the error: create a TKParallelTensor in one scope, std::move() it so that another object owns it, and then leave the scope so that the stale tensor is destroyed and throws.
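To make the failure mode concrete, here is a minimal standalone sketch of the pattern as I understand it. It does not use the real library: MockBroker, MockParallelTensor, brokers(), destroy_unguarded() and destroy_guarded() are all hypothetical names invented for illustration, and the only behavior assumed from the real code is that a move resets both fields to -1 and that destroy looks the broker up with map::at. If the real destroy runs from the destructor, the exception escapes an implicitly noexcept function, which would explain the "terminate called" output. The guarded variant shows one possible fix, not necessarily the intended one.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a per-rank broker; sync() only counts calls.
struct MockBroker {
    int sync_calls = 0;
    void sync() { ++sync_calls; }
};

using BrokerKey = std::pair<int, int>;  // {local_rank, local_world_size}

// Function-local static so the sketch needs no out-of-class definition.
inline std::map<BrokerKey, MockBroker>& brokers() {
    static std::map<BrokerKey, MockBroker> instance;
    return instance;
}

// Hypothetical mock of the ownership pattern, not the real TKParallelTensor.
class MockParallelTensor {
public:
    MockParallelTensor(int local_rank, int local_world_size)
        : local_rank_(local_rank), local_world_size_(local_world_size) {
        assert(local_rank >= 0 && local_world_size > 0);
        // Register a broker for this {rank, world size} key if none exists.
        brokers()[BrokerKey(local_rank_, local_world_size_)];
    }

    MockParallelTensor(const MockParallelTensor&) = delete;
    MockParallelTensor& operator=(const MockParallelTensor&) = delete;

    // Moving marks the source as stale by resetting both fields to -1.
    MockParallelTensor(MockParallelTensor&& other) noexcept
        : local_rank_(std::exchange(other.local_rank_, -1)),
          local_world_size_(std::exchange(other.local_world_size_, -1)) {}

    ~MockParallelTensor() { destroy_guarded(); }

    // Mirrors the reported behavior: map::at throws for the stale key {-1, -1}.
    void destroy_unguarded() {
        brokers().at(BrokerKey(local_rank_, local_world_size_)).sync();
        local_rank_ = local_world_size_ = -1;  // make repeated calls harmless
    }

    // Possible fix: a moved-from tensor owns nothing, so skip the sync.
    void destroy_guarded() {
        if (local_rank_ < 0 || local_world_size_ < 0) return;
        destroy_unguarded();
    }

    bool is_stale() const { return local_rank_ < 0; }

private:
    int local_rank_;
    int local_world_size_;
};

// Returns true if destroying a moved-from tensor without the guard throws.
bool unguarded_destroy_throws() {
    MockParallelTensor a(0, 8);
    MockParallelTensor b(std::move(a));  // a is now stale: {-1, -1}
    try {
        a.destroy_unguarded();
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

In this mock, unguarded_destroy_throws() returns true, while the destructor (which goes through destroy_guarded()) lets both the stale and the live object leave scope cleanly. Whether an early return like this is acceptable in the real destroy method depends on what else it releases, which I have not checked.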

Is this an intentional design choice, meaning that TKParallelTensor objects are not supposed to be moved?
