Move + Destroy bug in TKParallelTensor

I have an All2All context manager that calls std::move() on a TKParallelTensor, and it fails with this error:
```
terminate called after throwing an instance of std::out_of_range
std::out_of_range
  what():  map::at
```

I believe the error comes from the destroy method in [parallel_tensor.cuh](https://github.com/HazyResearch/ThunderKittens/blob/main/include/pyutils/parallel_tensor.cuh), when it is called on a stale TKParallelTensor that has been moved from. In that case, the stale tensor's local_rank and local_world_size values are set to -1, so this line
```
brokers_.at({local_rank_, local_world_size_}).sync(); // must sync before destroying the tensor
```

will throw an out-of-range error, since each rank only has a broker for its current local_rank and local_world_size, which is never {-1, -1} in a reasonable multi-GPU setup. A simple way to reproduce the error is to create a TKParallelTensor in one scope, std::move() it so that another object owns it, and then leave the scope so the moved-from tensor gets destroyed and throws.
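To make the failure mode concrete, here is a minimal sketch that does not use ThunderKittens at all. `MockBroker`, `MockParallelTensor`, `brokers()`, `destroy_unguarded()`, `destroy_guarded()` and `throws_out_of_range()` are hypothetical names invented for illustration; only the `{-1, -1}` moved-from state and the `brokers_.at(...)` lookup are taken from the report above. The guarded variant is one possible fix, not necessarily what the library intends. The mock deliberately has no destructor that calls destroy: destructors are implicitly `noexcept`, so an exception escaping one ends in `std::terminate`, which would match the "terminate called" output above.

```cpp
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the per-rank broker; sync() only counts calls.
struct MockBroker {
    int sync_calls = 0;
    void sync() { ++sync_calls; }
};

// Hypothetical stand-in for TKParallelTensor. This is NOT the real class.
struct MockParallelTensor {
    // Shared broker table keyed by {local_rank, local_world_size}.
    static std::map<std::pair<int, int>, MockBroker>& brokers() {
        static std::map<std::pair<int, int>, MockBroker> table;
        return table;
    }

    int local_rank_;
    int local_world_size_;

    MockParallelTensor(int local_rank, int local_world_size)
        : local_rank_(local_rank), local_world_size_(local_world_size) {
        // Register a broker for this key if none exists yet.
        brokers().emplace(std::make_pair(local_rank_, local_world_size_), MockBroker{});
    }

    // Move constructor: steal the identity and mark the source as stale.
    MockParallelTensor(MockParallelTensor&& other) noexcept
        : local_rank_(other.local_rank_), local_world_size_(other.local_world_size_) {
        other.local_rank_ = -1;
        other.local_world_size_ = -1;
    }

    bool is_moved_from() const {
        return local_rank_ == -1 && local_world_size_ == -1;
    }

    // Mirrors the quoted line: throws std::out_of_range on a moved-from object.
    void destroy_unguarded() {
        brokers().at({local_rank_, local_world_size_}).sync();
    }

    // Assumed fix: skip the broker sync when there is nothing left to destroy.
    // Returns true if the sync actually ran.
    bool destroy_guarded() {
        if (is_moved_from()) return false;
        brokers().at({local_rank_, local_world_size_}).sync();
        return true;
    }
};

// Helper: reports whether the unguarded destroy path throws for this object.
bool throws_out_of_range(MockParallelTensor& t) {
    try {
        t.destroy_unguarded();
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

With this mock, constructing `MockParallelTensor a(0, 8)`, moving it into `b`, and then calling `throws_out_of_range(a)` reports the same `map::at` failure, while `a.destroy_guarded()` returns without touching the broker table.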

Is this an intentional design choice, meaning that TKParallelTensor objects are not supposed to be moved?


Move + Destroy bug in TKParallelTensor #180
