You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Map-side join is a special derivation of the Join operator, which can be turned into a plain MapElements operation, looking up the other side of the data to join in through a random-access API.
We'd like the write down such join operations through the existing Join operator. The special derivation mentioned above is merely a runtime characteristic which does not alter the semantics of the operation. In other words, a map-side join might be more efficient in certain situations, but the operator would still produce the results if carried out through a conventional (hash or sort based) join implementation. Prerequisites:
We'd need to provide support for data sets that can be accessed randomly.
We'd need to provide some way of hinting an executor to either do or not to do a map-side join "optimization" when it's possible.
The join operation would be effectively only a left or a right join (with zero or exactly one joined-value to be practical) - inner join semantics can be implemented through both of these.
On the other hand, if we are about to join two data sets which are already ordered by the "join key" and one or the other would provide the possibility to seek, we'd be able to optimize such a map-side join even more efficiently by turning the look-up into a re-seek. for a left-join, this would naturally support N join-values, which the random-access approach doesn't. As we can see, this is approach is a super-set of the random-access-approach. The canonical use-case would be to join to distinct databases with the same "primary key" (a typical key/value store delivers items ordered by the "key").
We clearly need more elaboration here, before starting with it. TBD
The text was updated successfully, but these errors were encountered:
The main issue here is how to ensure that the operation is guaranteed to have the same outcome as if it would be executed in classical reduce-state-by-key manner. There is no problem with this when there is no windowing (i.e. batch windowing), but with some other windowing, it might be tricky, as it would probably require that the random access storage would "understand" the windowing and that it would be able to store multiple windowed values for the same key.
I'd suggest, that we restrict the scope of this issue only to non-windowed (batch windowed) joins. Then it might solve part of the comments in #38.
Map-side join is a special derivation of the
Join
operator, which can be turned into a plainMapElements
operation, looking up the other side of the data to join in through a random-access API.We'd like the write down such join operations through the existing
Join
operator. The special derivation mentioned above is merely a runtime characteristic which does not alter the semantics of the operation. In other words, a map-side join might be more efficient in certain situations, but the operator would still produce the results if carried out through a conventional (hash or sort based) join implementation. Prerequisites:On the other hand, if we are about to join two data sets which are already ordered by the "join key" and one or the other would provide the possibility to seek, we'd be able to optimize such a map-side join even more efficiently by turning the look-up into a re-seek. for a left-join, this would naturally support N join-values, which the random-access approach doesn't. As we can see, this is approach is a super-set of the random-access-approach. The canonical use-case would be to join to distinct databases with the same "primary key" (a typical key/value store delivers items ordered by the "key").
We clearly need more elaboration here, before starting with it. TBD
The text was updated successfully, but these errors were encountered: