Execution engines can optimize much better when they transparently know the types they are working with. Efforts in both the Spark and the Flink communities demonstrate the optimization potential in this regard. As a layer above such execution engines, Euphoria must provide type-specific information to executors in order to stay relevant in terms of performance. In this ticket we'll focus on Flink.
Motivation
This ticket was triggered by an attempt to shave off some of the overhead mentioned in #13 and #14.
Experiment
Using opaque data types with general-purpose serialization has considerable negative effects on the optimizations that Flink tries to apply by default. I was able to see the effect in an experiment where the "core" operation of the flow is the following (basically just a windowed word-count):
Changing Euphoria's Flink batch executor so that it uses Flink's native and POJO-based serializers for the types involved in the reduce operation cut the program's execution time roughly in half. The amount of data shuffled was approximately the same. Unfortunately, this approach required me to explicitly provide the return type of the .keyBy function to the internals of Euphoria's FlinkExecutor. I was not able to derive the return type of a lambda automatically, without the user having to state it explicitly.
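The difficulty with lambdas can be reproduced with plain Java reflection. The following is a minimal sketch, not code from Euphoria or Flink: the class `LambdaTypeDemo` and the helper `declaredReturnType` are hypothetical names introduced only for illustration. It shows that an anonymous class records its generic interface arguments, while a lambda implementing the same interface does not.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

// Hypothetical demo class; not part of Euphoria or Flink.
public class LambdaTypeDemo {

    /**
     * Returns the second type argument (the return type R) of Function<T, R>
     * if the function's class records it in its generic interfaces, else null.
     */
    public static Type declaredReturnType(Function<?, ?> fn) {
        for (Type iface : fn.getClass().getGenericInterfaces()) {
            if (iface instanceof ParameterizedType) {
                ParameterizedType p = (ParameterizedType) iface;
                if (p.getRawType() == Function.class) {
                    return p.getActualTypeArguments()[1];
                }
            }
        }
        // The class only implements the raw interface: nothing to recover.
        return null;
    }

    public static void main(String[] args) {
        // Same logic written twice: once as a lambda, once as an anonymous class.
        Function<String, Integer> lambda = s -> s.length();
        Function<String, Integer> anonymous = new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };

        System.out.println("lambda:    " + declaredReturnType(lambda));
        System.out.println("anonymous: " + declaredReturnType(anonymous));
    }
}
```

With the anonymous class the return type is recoverable by reflection; with the lambda the helper finds no generic signature, which is consistent with the observation above that the user has to state the type explicitly.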
What to do next
Find a non-verbose way for clients to provide the required type information to the executor translators, i.e. the return types of UDFs. (We might not really succeed here and might end up with some verbose construct; maybe we can make the verbose constructs and the implied optimizations merely an opt-in?)
Leverage the types in Euphoria's Flink batch and stream executors.
Leverage type information for objects used internally in both Flink executors to avoid general-purpose serialization by Kryo when possible.
Using TupleX instead of Pair and Triple is a good start.
For internal types, utilizing POJOs instead of opaque data types is another easy gain.
Note: it will not help much if we switch over to Tuple but fail to provide type information for the fields actually used.
Attention: proper delegation of the window types from one operator to another due to attached windowing may require substantial changes to the current executor code.
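One possible shape for the opt-in idea from the first item is a "super type token". The following is only a sketch under assumptions: `TypeHint` and `KeyExtractor` are hypothetical classes defined here for illustration and are not Euphoria's API; the serializer names are placeholders. A client who does nothing gets the generic fallback, while a client who passes a hint lets the executor pick a type-specific serializer.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical demo class; none of these types exist in Euphoria.
public class OptInTypeHintDemo {

    /** Super type token: `new TypeHint<Foo>() {}` records Foo in class metadata. */
    public abstract static class TypeHint<T> {
        public final Type type;

        protected TypeHint() {
            Type superType = getClass().getGenericSuperclass();
            if (!(superType instanceof ParameterizedType)) {
                throw new IllegalStateException(
                    "TypeHint must be created as an anonymous subclass with a type argument");
            }
            this.type = ((ParameterizedType) superType).getActualTypeArguments()[0];
        }
    }

    /** A key extractor UDF plus an optional, user-supplied key type. */
    public static final class KeyExtractor<IN, KEY> {
        public final Function<IN, KEY> fn;
        public final Optional<Type> keyType;

        private KeyExtractor(Function<IN, KEY> fn, Optional<Type> keyType) {
            this.fn = fn;
            this.keyType = keyType;
        }

        /** Default, non-verbose form: no type information available. */
        public static <IN, KEY> KeyExtractor<IN, KEY> of(Function<IN, KEY> fn) {
            return new KeyExtractor<>(fn, Optional.empty());
        }

        /** Opt-in form: the caller states the key type explicitly. */
        public static <IN, KEY> KeyExtractor<IN, KEY> of(Function<IN, KEY> fn, TypeHint<KEY> hint) {
            return new KeyExtractor<>(fn, Optional.of(hint.type));
        }

        /** Placeholder for the decision an executor translator would make. */
        public String serializerChoice() {
            return keyType
                .map(t -> "typed serializer for " + t.getTypeName())
                .orElse("generic fallback serializer");
        }
    }

    public static void main(String[] args) {
        KeyExtractor<String, String> plain = KeyExtractor.of(s -> s);
        KeyExtractor<String, String> hinted =
            KeyExtractor.of(s -> s, new TypeHint<String>() {});

        System.out.println(plain.serializerChoice());
        System.out.println(hinted.serializerChoice());
    }
}
```

The design keeps the short form as the default, so existing flows stay unchanged and only users who want the optimization pay the verbosity cost.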
We shouldn't forget that, for Flink to operate fully on serialized data, it needs to determine the right offset of the desired data. Flink supports this by letting the user specify the position of the field(s) in tuples and the names of the field(s) in POJOs. E.g.:
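To illustrate why known field offsets matter, here is a minimal sketch in plain Java. It is not Flink's implementation: the class `SerializedKeyDemo`, the fixed-width record layout and all constants are assumptions made for this example. With a transparent layout, two records can be compared on their key fields directly in their serialized form, which is not possible when a record is an opaque blob produced by a general-purpose serializer.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; the record layout is an assumption for illustration.
public class SerializedKeyDemo {

    // Assumed fixed-width layout: (int wordId, long windowStart, long count).
    static final int WORD_ID_OFFSET = 0;
    static final int WINDOW_OFFSET = Integer.BYTES;
    static final int COUNT_OFFSET = Integer.BYTES + Long.BYTES;
    static final int RECORD_SIZE = Integer.BYTES + 2 * Long.BYTES;

    /** Writes one record into a byte array using the layout above. */
    public static byte[] serialize(int wordId, long windowStart, long count) {
        return ByteBuffer.allocate(RECORD_SIZE)
            .putInt(wordId)
            .putLong(windowStart)
            .putLong(count)
            .array();
    }

    /**
     * Compares two serialized records by the key (wordId, windowStart),
     * reading only the key fields at their known offsets.
     */
    public static int compareKeys(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int byWord = Integer.compare(x.getInt(WORD_ID_OFFSET), y.getInt(WORD_ID_OFFSET));
        if (byWord != 0) {
            return byWord;
        }
        return Long.compare(x.getLong(WINDOW_OFFSET), y.getLong(WINDOW_OFFSET));
    }

    /** Reads the non-key count field, again by offset only. */
    public static long readCount(byte[] record) {
        return ByteBuffer.wrap(record).getLong(COUNT_OFFSET);
    }

    public static void main(String[] args) {
        byte[] first = serialize(1, 10L, 5L);
        byte[] second = serialize(1, 10L, 99L);
        byte[] third = serialize(2, 0L, 0L);

        System.out.println(compareKeys(first, second)); // same key, different count
        System.out.println(compareKeys(first, third));  // different wordId
    }
}
```

Sorting or grouping on such keys never has to materialize the full objects, which is the kind of optimization that field positions and field names make available to the engine.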