- Implement a barrier mechanism and support efficient barrier operations, including:
T.create_barrier(num_arrive=num_arrive)
T.arrive(barrier)
T.wait(barrier)
T.sync(barrier)
so that it can be used to do p2p synchronization or group synchronization.
- Implement memory fence to ensure correct memory ordering and access semantics.