Support a context-manager syntax that partitions resources into groups. For example, it could divide all thread blocks into two groups, or split GPU resources into two streams that perform different tasks (specialization):
```python
with T.group("compute"):
    ...
with T.group("communication"):
    ...
```
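To make the intended partitioning semantics concrete, here is a plain-Python model. It is not TileLang code. It assumes that blocks are split into contiguous ranges of block ids, one range per named group, and that each group's body runs only on the blocks that group owns. `partition_blocks` and `launch_grouped` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model of T.group semantics: each "thread block" is an integer id.
def partition_blocks(num_blocks: int, group_sizes: Dict[str, int]) -> Dict[str, List[int]]:
    """Assign contiguous ranges of block ids to each named group, in insertion order."""
    if sum(group_sizes.values()) != num_blocks:
        raise ValueError("group sizes must sum to num_blocks")
    assignment, start = {}, 0
    for name, size in group_sizes.items():
        assignment[name] = list(range(start, start + size))
        start += size
    return assignment

def launch_grouped(
    num_blocks: int,
    groups: Dict[str, Tuple[int, Callable[[int, int], str]]],
) -> List[str]:
    """Run each group's body only on the blocks it owns.

    Each body receives (global block id, rank within its group).
    """
    sizes = {name: size for name, (size, _) in groups.items()}
    assignment = partition_blocks(num_blocks, sizes)
    outputs: List[str] = [""] * num_blocks
    for name, block_ids in assignment.items():
        body = groups[name][1]
        for rank, bid in enumerate(block_ids):
            outputs[bid] = body(bid, rank)
    return outputs
```

For example, `launch_grouped(4, {"compute": (3, ...), "communication": (1, ...)})` runs the compute body on blocks 0 to 2 and the communication body on block 3. Passing the in-group rank lets each group index its own work independently of the global block id.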
In addition, it should support inserting barrier operations to guarantee correct synchronization between these groups.
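The following sketch shows why a cross-group barrier is needed. It is a CPU model that uses `threading.Barrier`, with each thread standing in for one block. It is not a GPU implementation, and cross-block synchronization on a real GPU would need a grid-level or similar mechanism. It assumes a producer/consumer split: "communication" blocks fill a shared buffer, and "compute" blocks may read it only after every producer has reached the barrier. `run_with_barrier` is a hypothetical name.

```python
import threading
from typing import List

def run_with_barrier(num_comm: int, num_compute: int) -> List[int]:
    """Hypothetical model: communication blocks write a shared buffer, then all
    blocks meet at one barrier before compute blocks read the buffer."""
    buffer = [0] * num_comm
    results = [0] * num_compute
    # One barrier shared by every block in both groups.
    barrier = threading.Barrier(num_comm + num_compute)

    def comm_block(rank: int) -> None:
        buffer[rank] = (rank + 1) * 10  # stand-in for a data transfer
        barrier.wait()                  # signal that this transfer is complete

    def compute_block(rank: int) -> None:
        barrier.wait()                  # block until every transfer is complete
        results[rank] = sum(buffer) + rank  # every buffer slot is now filled

    threads = [threading.Thread(target=comm_block, args=(r,)) for r in range(num_comm)]
    threads += [threading.Thread(target=compute_block, args=(r,)) for r in range(num_compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With `run_with_barrier(2, 3)` the buffer holds `[10, 20]` before any compute block reads it, so the result is always `[30, 31, 32]`. Without the barrier, a compute block could read a partially filled buffer.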
Proposal: the design could follow the existing `T.Persistent` primitive in TileLang.