Concepts

xitep edited this page May 10, 2017 · 11 revisions

Terms

Euphoria client API consists of the following main items:

  • Dataset: Represents a logical view on a - typically distributed - set of data. Such a data set may be files persisted on disk, data read from a Kafka topic, or transient data produced as the result of a computation. An important characteristic of a data set is whether it is bounded or unbounded:
    • bounded: A data set can be said to be bounded if it is limited in its size, i.e. the data set "has an end".
    • unbounded: A data set which is not bounded, i.e. doesn't have an end, is said to be unbounded.
  • DataSource: A DataSource is the unit delivering data from external storage to Euphoria programs for processing. Hadoop's InputFormats are such sources and can be leveraged in Euphoria through a wrapper called HadoopSource. Another such source is KafkaSource, which consumes data from a Kafka topic.
  • DataSink: In contrast to a DataSource, a DataSink is the unit responsible for writing out a data set to a destination external to the application. For example, Hadoop's OutputFormats can be leveraged in Euphoria through HadoopSink to persist data to disk or HDFS.
  • Operator: An operator defines a transformation of a data set into another one. The exact form of the transformation itself is defined by the specific operator. There are four basic operators:
    • FlatMap: Takes a user-provided function and applies it to every element of the input data set, while allowing the user code to produce zero, one, or more outputs for each of the processed input elements. The type of the output elements can be, and typically is, distinct from the type of the input elements.
    • Repartition: Allows changing the parallelism of a data set by assigning each input element a target partition on which it is subsequently processed. User code rarely needs to use this operator directly since partitioning is often provided as part of other, higher-level operators.
    • ReduceStateByKey: Reduces its input elements of a particular group/key to zero, one, or many values. This is a rather complex operator rarely used directly by user code. Nevertheless, it is the base of other higher-level operators.
    • Union: A union allows viewing two data sets as one in subsequent processing. Unions are restricted to input data sets of the same element type.
  • Flow: A chain of transformations/operators of data sets. The output of one operator typically serves as the input to another. Such chains form a directed acyclic graph (DAG) starting with one or more DataSources and ending in one or more DataSinks. One chain of transformations, or simply flow, is considered independent of another.
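
To make the operator semantics above more concrete, here is a minimal sketch in plain Java that mimics a union, a flat-map and a per-key reduction on small in-memory lists. It deliberately does not use the Euphoria API: the class OperatorSketch and its methods flatMap, union, countByKey and splitWords are hypothetical names invented for this illustration, and a bounded java.util.List stands in for a data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustration only: plain JDK collections, NOT the Euphoria API.
public class OperatorSketch {

    // FlatMap-like: every input element yields zero, one, or more outputs,
    // possibly of a different type than the input.
    public static <I, O> List<O> flatMap(List<I> input, Function<I, List<O>> fn) {
        List<O> out = new ArrayList<>();
        for (I element : input) {
            out.addAll(fn.apply(element));
        }
        return out;
    }

    // Union-like: view two data sets of the same element type as one.
    public static <T> List<T> union(List<T> left, List<T> right) {
        List<T> out = new ArrayList<>(left);
        out.addAll(right);
        return out;
    }

    // A simple per-key reduction: all elements sharing a key collapse to one count.
    public static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // User function for flatMap: one line in, zero or more words out.
    public static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("to be or", ""); // the empty line yields no words
        List<String> second = Arrays.asList("not to be");

        // A tiny "flow": union -> flat-map -> reduce by key.
        List<String> words = flatMap(union(first, second), OperatorSketch::splitWords);
        System.out.println(countByKey(words)); // prints {be=2, not=1, or=1, to=2}
    }
}
```

The chain in main has the shape of a flow: each step consumes the output of the previous one. In an actual Euphoria program the inputs would come from DataSources and the result would be written to a DataSink rather than printed.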

Examples

  1. A walk through a simple, non-windowed word count
  2. A walk through a windowed, word-count-like program