-
Notifications
You must be signed in to change notification settings - Fork 11
Concepts
xitep edited this page May 10, 2017
·
11 revisions
Euphoria client API consists of the following main items:
-
Dataset: Represents a logical view on a - typically distributed - set of data. Such a data set may be files persisted on disk, data read from a Kafka topic, or transient data being the result of a computation. An important characteristic of a data set is:
- bounded: A data set can be said to be bounded if it is limited in its size, i.e. the data set "has an end".
- unbounded: A data set which is not bounded, i.e. doesn't have an end is said to be unbouded.
-
DataSource: A
DataSource
is the unit delivering data from external storages to Euphoria programs for processing. Hadoop'sInputFormat
s are such sources and can be leveraged in Euphoria through a wrapper calledHadoopSource
. Another such a source isKafkaSource
to consume data from a Kafka topic. -
DataSink: In contrast to a
DataSource
aDataSink
is the unit responsible for writing out a data set to an application external destination. For example, Hadoop'sOutputFormat
s can be leveraged in Euphoria throughHadoopSink
to persist data to disk or HDFS. -
Operator: An operator defines a transformation of a data set into another one. The exact form of the translformation itself is defined by the specific operator. There are four basic operators:
-
FlatMap
: Takes a user provided function and applies it to every element of the input data set, while allowing the user code to produce zero, one, or more outputs for each of the processed input elements. The type of the output elements can - and typically are - distinct from the type of the input elements. -
Repartition
: Allows changing the parallelism of a data set by assigning each input element a target partition on which it is consequently to be processed. User code rarely needs to use this operator directly since partitioning is often provided as part of other, higher-level operators. -
ReduceStateByKey
: Reduces its input elements of a particular group/key to zero, one, or many values. This is a rather complex operator rarely used directly by user code. Nevertheless, it is the base of other higher-level operators. -
Union
: A union allows viewing two data sets as one in subsequent processing.Unions
are restricted two input data sets of the same element type.
-
-
Flow: A chain of transformations/operators of data sets. The output of one operator typically serves as the input to another. Such chains form a directed acyclic graph (DAG) starting with one or more
DataSources
while ending in one or moreDataSinks
. One chain of transformations or simply flow is considered independend of another.
- A walk through a simple, non-windowed word count
- A walk through a windowed, word-count-like program