Epoch Allocation And Scoping
The current epoch allocation/bit layout follows this pattern:
Baseline epoch layout:
```
 w-1 .......... w-h-1 ........... w-h-c-1 .................. 0
| <EpochHeader> | <EpochCategory> |       <SeqEpochID>       |
```
where
  h = epoch_header_num_bits (~2 bits)
  c = epoch_category_num_bits (1 bit)
  w = sizeof(EpochType) * 8 (i.e., 64 for the 64-bit/int64_t epoch field)
Rooted epoch extension breakdown:
```
 w-h-c-1 ...... w-h-c-n-1 ....................... 0
|  <NodeType>   |         <SeqEpochID>           |
```
where
  n = sizeof(NodeType) * 8 (configurable; 16 bits by default)
Thus, with the default sizes:
- a collective epoch has 61 bits for the sequential epoch ID;
- a rooted epoch has 16 fewer bits, i.e., 45 bits for the sequential epoch ID (assuming the number of nodes fits in 16 bits).
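To make the arithmetic concrete, here is a minimal packing sketch (illustrative only, not vt's actual code; the helper names and constants are assumptions based on the defaults above):
```cpp
#include <cassert>
#include <cstdint>

// Assumed widths matching the defaults above (not vt's actual code).
using EpochType = uint64_t;
using NodeType  = uint16_t;

constexpr int w = sizeof(EpochType) * 8;  // 64
constexpr int h = 2;                      // epoch_header_num_bits
constexpr int c = 1;                      // epoch_category_num_bits
constexpr int n = sizeof(NodeType) * 8;   // 16

// A collective epoch: header and category in the high bits,
// SeqEpochID in the low w-h-c = 61 bits.
constexpr EpochType makeCollective(uint64_t header, uint64_t category,
                                   uint64_t seq) {
  return (header << (w - h)) | (category << (w - h - c)) | seq;
}

// A rooted epoch additionally carves out n = 16 bits for the node,
// leaving w-h-c-n = 45 bits for SeqEpochID.
constexpr EpochType makeRooted(uint64_t header, uint64_t category,
                               NodeType node, uint64_t seq) {
  return (header << (w - h)) | (category << (w - h - c)) |
         (static_cast<uint64_t>(node) << (w - h - c - n)) | seq;
}

int main() {
  EpochType e = makeRooted(/*header*/1, /*category*/0, /*node*/3, /*seq*/42);
  // Recover the sequential ID: mask off everything above bit 44.
  uint64_t seq = e & ((uint64_t{1} << (w - h - c - n)) - 1);
  assert(seq == 42);
}
```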
Each EpochWindow manages allocation/deallocation for epochs of a particular type. An epoch's type is given by all the high bits excluding the SeqEpochID (with the SeqEpochID bits set to zero). EpochWindow uses an IntegralSet to compactly store terminated epochs. As epochs complete out of order the data structure grows in size, but it is generally very efficient.
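For intuition, here is a minimal interval-set sketch in the spirit of IntegralSet (this is not vt's implementation; it only illustrates why out-of-order completions grow the structure while contiguous runs stay compact):
```cpp
#include <cstdint>
#include <map>

// Minimal interval-set sketch (not vt's IntegralSet): terminated epochs
// are stored as disjoint, sorted [lo, hi] ranges, so a long contiguous
// run of completed epochs costs O(1) space.
struct IntervalSet {
  std::map<uint64_t, uint64_t> ranges;  // maps lo -> hi

  bool contains(uint64_t x) const {
    auto it = ranges.upper_bound(x);    // first range with lo > x
    if (it == ranges.begin()) return false;
    --it;                               // now: the largest lo <= x
    return x <= it->second;
  }

  void insert(uint64_t x) {
    if (contains(x)) return;
    uint64_t lo = x, hi = x;
    auto next = ranges.find(x + 1);     // merge with the following range
    if (next != ranges.end()) { hi = next->second; ranges.erase(next); }
    auto it = ranges.lower_bound(x);    // merge with the preceding range
    if (it != ranges.begin() && (--it)->second + 1 == x) {
      lo = it->first;
      ranges.erase(it);
    }
    ranges[lo] = hi;                    // out-of-order inserts add ranges
  }
};
```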
Epoch Scoping Motivation
Correct collective epoch allocation relies on ordered execution across nodes/ranks. Specifically, within a particular type of epoch, each time a new epoch is allocated the sequential epoch ID is incremented by 1 repeatedly until a free epoch is found. This presents two problems:
- Work units that create collective epochs must be ordered (even across "modules").
- There is a potential race between the discovery of epoch termination and re-use. For this to be a problem, an application would have to process an enormous number of epochs and wrap around. The conscious design choice here was to not use the first free epoch returned from the IntegralSet; instead, use the next one in the sequence until a free one is found (sketched below). This means that as long as the program is generally reasonable, the race is highly improbable.
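A sketch of that allocation policy (the function name and signature are assumptions, not vt's API):
```cpp
#include <cstdint>
#include <functional>

// "Advance until free": rather than reusing the first free epoch the
// set could return, keep incrementing the sequential ID, wrapping
// within the SeqEpochID bits, until a free slot is found.
uint64_t allocateNextSeq(uint64_t& next_seq, uint64_t seq_mask,
                         std::function<bool(uint64_t)> const& isFree) {
  do {
    next_seq = (next_seq + 1) & seq_mask;  // wrap within the field
  } while (!isFree(next_seq));
  return next_seq;
}
```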
Epoch scoping has the potential to solve both of these problems (and make races even more unlikely). We could also attempt to completely eradicate the possibility of a race, but that might not be worth the effort or communication cost.
Initially, I proposed a design in #1052/#1053 with a low number of bits (~5-8) dedicated to epoch scopes. However, even a simple program that uses epoch scopes can fail with a couple of layers of nesting.
For collective epochs:
```
 w-h-c-1 ........ w-h-c-s-1 ..................... 0
| <EpochScopeID>  |         <SeqEpochID>         |
```
where s = the number of bits allocated for the epoch scope
Note: I believe that rooted epochs do not need a scope in any use case, since they are created in a single location.
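The packing for this layout mirrors the baseline sketch above; the scope field is carved out of the high end of the former SeqEpochID bits (the widths, especially s, are placeholders, not settled values):
```cpp
#include <cstdint>

// Sketch of the scoped collective layout (widths are assumptions; the
// right size for s is exactly what is under discussion here).
constexpr int w = 64, h = 2, c = 1;
constexpr int s = 8;  // example scope width only

constexpr uint64_t makeScopedCollective(uint64_t header, uint64_t category,
                                        uint64_t scope, uint64_t seq) {
  return (header << (w - h)) |
         (category << (w - h - c)) |
         (scope << (w - h - c - s)) |  // EpochScopeID
         seq;                          // SeqEpochID in the low w-h-c-s bits
}
```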
Fundamental Problem:
The key here is that once a scope is allocated, there is a logical "task" that uses that scope along with a "continuation" following it that might also use it. If these are concurrent and both create collective epochs, the execution order will cause non-determinism. Note that in the implementation epoch scopes propagate: if a handler is running within an epoch scope and creates a new collective epoch, the new epoch will inherit that epoch scope (similar semantics to how it would inherit the epoch).
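To see the non-determinism, consider this toy model (not vt code) of two scheduling orders of a task and its continuation sharing one scope:
```cpp
#include <cstdint>
#include <cstdio>

// Toy model: a scope hands out sequential IDs, so a "task" and its
// "continuation" sharing one scope receive IDs that depend on which
// happens to execute first.
struct Scope {
  uint64_t next = 0;
  uint64_t makeEpoch() { return next++; }
};

int main() {
  Scope run1;  // scheduling order 1: task first, then continuation
  uint64_t task_a = run1.makeEpoch();  // task gets 0
  uint64_t cont_a = run1.makeEpoch();  // continuation gets 1

  Scope run2;  // scheduling order 2: continuation happens to run first
  uint64_t cont_b = run2.makeEpoch();  // continuation gets 0
  uint64_t task_b = run2.makeEpoch();  // task gets 1

  // The same logical epoch gets different IDs under the two orderings,
  // so ranks that observe different orders would disagree.
  std::printf("task: %llu vs %llu; continuation: %llu vs %llu\n",
              (unsigned long long)task_a, (unsigned long long)task_b,
              (unsigned long long)cont_a, (unsigned long long)cont_b);
  return 0;
}
```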
Proposed Solution:
- Instead of allocating a small number of bits for the epoch scope, increase it to a large chunk (maybe 50% or more?) of the bits available for the SeqEpochID. (Thus, we may need to expect wrap-around for some programs?)
- Once the user allocates a collective epoch scope, each epoch it returns (e.g., through scope.makeEpochCollective()) should actually come with a new, free epoch scope for each call (modeled in the sketch below).
- Epoch propagation works as it is currently implemented (the scope propagates onto newly created collective epochs and does not change unless a new epoch scope is allocated).
- It's technically possible for the user to run out of bits for the epoch scope, or even for the epoch sequential ID. If the user has a well-structured program, we can probably spin until allocation is possible for the next scope. If arbitrary dependencies are allowed across epoch scopes, that could cause a deadlock.
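Finally, a self-contained model of the second bullet (all names and widths are assumptions, not vt's API): each makeEpochCollective() call draws from a fresh, free scope, so epochs created downstream in different returned scopes occupy disjoint ID spaces:
```cpp
#include <cstdint>

// Model of the proposal (names/widths assumed): every call consumes a
// brand-new scope; handlers running inside the returned epoch inherit
// that scope by propagation and allocate within its disjoint ID space.
struct EpochScope {
  static constexpr int seq_bits = 32;  // illustrative: ~50% of the ID bits
  uint64_t next_free_scope = 1;        // 0 could be reserved for "no scope"

  uint64_t makeEpochCollective() {
    uint64_t fresh = next_free_scope++;  // new, free scope per call
    return fresh << seq_bits;            // first epoch of that scope
  }
};

int main() {
  EpochScope scope;
  uint64_t a = scope.makeEpochCollective();  // scope 1, seq 0
  uint64_t b = scope.makeEpochCollective();  // scope 2, seq 0
  return a == b ? 1 : 0;  // the two calls never collide
}
```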