%rubber: module pdflatex
\documentclass[11pt,a4paper]{article}
\usepackage{microtype}\usepackage{mathptmx}
\usepackage{sdp_doc} % SDP style file
\usepackage[english]{babel}
\usepackage{listings}
\usepackage[pdfborderstyle={/S/U/W 1}]{hyperref}
\usepackage{amsthm}
\usepackage{thmtools}
\usepackage[dvipsnames]{xcolor}
\usepackage{algpseudocode}
\usepackage{longtable}
\declaretheoremstyle[spaceabove=6pt, spacebelow=6pt,
headfont=\normalfont\bfseries,
notefont=\mdseries, notebraces={(}{)},
bodyfont=\normalfont,
postheadspace=1em,
headpunct=\\,
notebraces=\ \ ,
qed=]{ExampleStyle}
\declaretheorem[style=ExampleStyle,shaded={rulecolor=Lavender,
rulewidth=1pt, margin=10pt, bgcolor={rgb}{0.98,0.98,1}}]{Example}
\definecolor{antiquewhite}{rgb}{0.98, 0.92, 0.86}
\lstset{ %
backgroundcolor=\color{antiquewhite}, % choose the background color; you must add \usepackage{color} or \usepackage{xcolor}
basicstyle=\footnotesize, % the size of the fonts that are used for the code
captionpos=b, % sets the caption-position to bottom
frame=single}
%%%%%%%%% START OF USER SETTINGS %%%%%%%%%%%%%%%%%%
% Enter here some information needed to fill in the template Title to appear
% on the front pages (will be filled in via the \sdpfrontpage command)
\newcommand{\bigdoctitle}{SDP Dataflow Architecture Study\xspace}
% Title to go in the "Document Status Sheet" Document number
\newcommand{\docnr}{SKA-TEL-SDP-0000083\_A0999\xspace}
% Context
\newcommand{\context}{(REP)}
% Revision
\newcommand{\revision}{0.92\xspace}
% Author(s)
\newcommand{\docauthor}{P.\ Braam, S.\ Zefirov\xspace}
% Lead author (goes in the footer)
\newcommand{\leadauthor}{P.\ Braam\xspace}
% Release
\newcommand{\release}{C\xspace}
% Date of the document release, format: Month YYYY (e.g., August 2008)
\newcommand{\docudate}{2016-03-01\xspace}
% Document classification
\newcommand{\classification}{Unrestricted}
% Status of the document (draft/final/etc.)
\newcommand{\docstatus}{Draft\xspace}
% Table with signatures
\newcommand{\signaturetable}{
\begin{tabularx}{\textwidth}{|X|X|X|}
\hline
Name & Designation & Affiliation\\
\hline
\\
\hline
Signature \& Date: & & \\
& & \\
& & \\
\hline
Name & Designation & Affiliation\\
\hline
& & \\
\hline
Signature \& Date: & & \\
& & \\
& & \\
\hline
\end{tabularx}
}
% Table with version numbers
\newcommand{\versiontable}{
\begin{tabularx}{\textwidth}{|X|X|X|X|}
\hline
\bf{Version} & {\bf Date of issue} & {\bf Prepared by} & {\bf Comments}\\
\hline
C & & & \\
\hline
\end{tabularx}
}
% Table with affiliations
\newcommand{\organisationtable}{
\begin{center}
\sffamily{\bf ORGANISATION DETAILS}\end{center}
\begin{table}[htbp]
\centering
\begin{tabular}[htbp]{|l|l|}
\hline
Name & Science Data Processor Consortium\\
\hline
\end{tabular}
\end{table}
}
%%%%%%%%%%%%% END OF USER SETTINGS %%%%%%%%%%%%%%%%%%%
\begin{document}
% load automatic pages
\sdpfrontpage
\sdptableofcontents
\sdplistoffigures
\sdplistoftables
% Add here the executive summary
\sdpsummary
This document refines a section of the SDP Summary Architecture Document by providing more information about data flow. After giving definitions, the requirements are summarized in a table (external to this document), and some 20 of these requirements are studied in more detail through sample discussions of possible implementations: (i) conceptually, using the architectural concepts, (ii) in Regent / Legion, and (iii) in the DNA DSL. We end the paper by mapping the requirements to some of the components in the SDP system.
Section~\ref{sec:dataflow-definitions} contains further definitions used in our description of data flow requirements and programs.
Section~\ref{sec:dataflow-requirements} refers to the requirements table. We present a significant set of sample data flow programs in
Section~\ref{sec:dataflow-examples}. The final section, Section~\ref{sec:dataflow-sdp-application}, describes how our work is relevant to elements of the SDP dataflow.
\sdpreferencedocs
\iffalse
\subsection*{Applicable Documents}
The following documents are applicable to the extent stated herein. In the
event of conflict between the contents of the applicable documents and this
document, \emph{the applicable documents} shall take precedence.
\begin{center}{
\begin{tabularx}{\textwidth}{|X|X|}
\hline
\bf{Reference} & \bf{Reference}\\
\bf{Number} & \\
\hline
AD01 & Science Data Processor Architecture\\\hline
\end{tabularx}}
\end{center}
\subsection*{Reference Documents}
The following documents are referenced in this document. In the event of
conflict between the contents of the referenced documents and this document,
\emph{this document} shall take precedence.
\begin{center}{
\begin{tabularx}{\textwidth}{|X|X|}
\hline
\bf{Reference} & \bf{Reference}\\
\bf{Number} & \\
\hline
RD01 & Science Data Processor Pipelines Element Design Report\\\hline
RD02 & COMP Element Design Report\\\hline
\end{tabularx}}
\end{center}
\fi
% The actual content goes here
\newpage
\section{Structure of this document}
We present an overview of the architecture of a data flow system and its use in Section~\ref{sec:dataflow-definitions}.
\section{Architectural overview of data flow}
\label{sec:dataflow-definitions}
\subsection{Overview of data flow programming}
A data flow description uses a directed graph in which edges are called channels and nodes are called actors. Channels represent data movement; possibly multiple channels transport data to an actor, and the readiness of the data triggers a computation by the actor. The outputs of the actor are messages, which are sent to other actors.
Some actors are initial actors, and the runtime system sends them a {\em start} message. Similarly, some actors inform the runtime system that they have received their messages, indicating completion of the computation.
Many variations on the data flow model exist. In some cases actors can perform actions as soon as data has arrived on just one of their input channels; in other cases they require all inputs to be ready. If the system is to offer high availability, detecting that no data has been received or that an actor has crashed requires intervention by the runtime system.
Pipeline programs are examples of programs that map directly to a data flow description.
\subsection{Example}
XXX Should we rewrite it in Regent?
\begin{Example}[A simple dataflow program]
A simple example is a program that reads two vectors and computes their
dot product (the sum of the elementwise products), i.e., at a high level:
\begin{algorithmic}
\State $a\gets \mathrm{readFile}(\textrm{``fileA''})$; $b\gets \mathrm{readFile}(\textrm{``fileB''})$; $c\gets \sum_i a_i \times b_i$
\end{algorithmic}
\begin{lstlisting}
masterActor(input:fileName1,
            input:fileName2,
            input:clusterArchitecture)
  -- schedules the work and starts it
  ch1, ch2 = open(param:fileName1, N), open(param:fileName2, N)
  fork(resource:computeNode,
       actor:computeActor,
       channel:ch1, channel:ch2,
       output:collectorActor,
       crash:ignore)               -- gets its input upon fork
  fork(resource:collectorNode,
       actor:collectorActor,
       input:computeNodes,
       crash:fail)
  startDataFlow
  result = wait(collectorNode)

computeActor
  in1 = fork(nodes:local, actor:chFileRead, input:ch1, input:offset)
  in2 = fork(nodes:local, actor:chFileRead, input:ch2, input:offset)
  (A1, A2) = wait(in1, in2)
  res = dotProductLocal(A1, A2)
  join(output, res)   -- sends the res message to the collector and
                      -- notifies the parent of a clean exit

collectorActor
  [partialDP] = wait(input)
  res = sum([partialDP])
  join(res, parent)   -- default case: join the parent and exit
\end{lstlisting}
\end{Example}
\subsubsection{Data flow and sequential programming}
SZTODO: Explain the example above, compare it to a sequential program, explain the iteration space, explain iteration space partitioning, and demonstrate possible schedulings.
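A minimal sketch of that comparison follows (in Python, purely illustrative; random data stands in for the two input files, and the partition count $N$ is arbitrary). The sequential program traverses the whole iteration space at once; the partitioned version splits it into chunks, each of which could become a \texttt{computeActor}, with the final sum playing the role of the \texttt{collectorActor}.
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1_000_000)   # stands in for "fileA"
b = rng.standard_normal(1_000_000)   # stands in for "fileB"

# Sequential version: one pass over the whole iteration space.
c_seq = np.dot(a, b)

# Partitioned version: each chunk corresponds to one computeActor;
# the final sum plays the role of the collectorActor.
N = 4
chunks = np.array_split(np.arange(len(a)), N)
partials = [np.dot(a[idx], b[idx]) for idx in chunks]
c_par = sum(partials)

assert np.isclose(c_seq, c_par)
\end{lstlisting}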
\subsection{Dataflow languages}
\begin{enumerate}
\item Regent / Legion
\item the DNA DSL
\end{enumerate}
\section{Dataflow Requirements}
\label{sec:dataflow-requirements}
\section{The SKA SDP Dataflow Programs}
\label{sec:dataflow-examples}
\subsection{Slurm Interoperability}
reference: {\bf RUNTIME.SLURM}
The command-line arguments shall indicate:
\begin{enumerate}
\item number of nodes
\item number of processes per node
\item number of threads per process
\item memory per process
\item if a GPU accelerator is used
\item what network conduit is used, --net=
\end{enumerate}
If \texttt{--net} is missing, the number of nodes must be 1; this is also the assumed value when the \texttt{--nodes} parameter is missing, and in that case the program shall run on a single node.
SZ: a shell script to call SLURM and a Regent program to display stats.
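A sketch of how such a script could map these parameters onto SLURM follows (in Python; the program name \texttt{dataflow\_prog} and the helper \texttt{launch} are hypothetical, while the \texttt{srun} flags are standard SLURM).
\begin{lstlisting}
import subprocess

def launch(nodes=1, procs_per_node=1, threads=1, mem_mb=1024,
           gpu=False, net=None):
    # per-node memory budget = per-process memory x processes per node
    cmd = ["srun",
           f"--nodes={nodes}",
           f"--ntasks-per-node={procs_per_node}",
           f"--cpus-per-task={threads}",
           f"--mem={mem_mb * procs_per_node}M"]
    if gpu:
        cmd.append("--gres=gpu:1")
    cmd.append("./dataflow_prog")        # hypothetical program name
    if net is not None:
        cmd.append(f"--net={net}")
    elif nodes != 1:
        raise ValueError("--net is required for more than one node")
    return subprocess.run(cmd, check=True)
\end{lstlisting}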
\subsection{Cluster awareness, Static Scheduling}
The data flow system shall make programs aware of the cluster architecture. A sample data flow program will
\begin{enumerate}
\item show that it has found the nodes in the cluster;
\item define a tree structure and name the top of the tree the coordinator node, where it will run a process at level 0;
\item start processes on all level 1 nodes;
\item have the level 1 processes start island groups of further processes at level 2;
\item give the processes at level 2 access to $1 \ldots N$ cores per process and $K$ GB of RAM per process;
\item have the coordinator print out the cluster architecture and the process tree with its resources.
\end{enumerate}
XXX Here I \textit{almost} understand how to do that. Can you give me a use case from the real world? EACH OF 10 NODES HAS E.G. 16GB OF RAM AND 12 CORES - CREATE 4 ACTORS ON EACH NODE, EACH ACTOR USING 4GB AND 3 CORES.
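A sketch of that carving follows (in Python; the node count and sizes are the hypothetical figures from the note above, and \texttt{carve} is an illustrative helper, not part of the system).
\begin{lstlisting}
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    ram_gb: int
    cores: int

def carve(nodes, actor_ram_gb, actor_cores):
    # one (node, actor index) slot per actor the resources allow
    plan = []
    for node in nodes:
        n = min(node.ram_gb // actor_ram_gb, node.cores // actor_cores)
        plan.extend((node.name, i) for i in range(n))
    return plan

cluster = [Node(f"node{i:02d}", ram_gb=16, cores=12) for i in range(10)]
plan = carve(cluster, actor_ram_gb=4, actor_cores=3)
assert len(plan) == 40   # 4 actors on each of the 10 nodes
\end{lstlisting}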
\subsection{Messaging - bandwidth and latency}
The data flow system shall use a high performance interconnect between processes. A latency and bandwidth study will demonstrate the utilization of the link.
\begin{enumerate}
\item The data flow system will spawn two processes that communicate in a client server fashion.
\item The client will send buffers of $K$ bytes each at least 3 times, where $K=1...2^{32}$.
\item When the buffer has been received, the server process will send a response to the client, upon which the client will send the next buffer.
\item The program will print out bandwidth and latency of the communication process.
\end{enumerate}
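The measurement loop itself is independent of the transport. The sketch below (in Python) uses a local pipe between two processes purely to show the ping-pong protocol; the real study would run the same pattern over the cluster interconnect.
\begin{lstlisting}
import time
from multiprocessing import Pipe, Process

def server(conn):
    # echo server: acknowledge each buffer, stop on an empty one
    while True:
        buf = conn.recv_bytes()
        if not buf:
            break
        conn.send_bytes(b"ack")

def measure(k, repeats=3):
    parent, child = Pipe()
    proc = Process(target=server, args=(child,))
    proc.start()
    payload = bytes(k)
    t0 = time.perf_counter()
    for _ in range(repeats):
        parent.send_bytes(payload)   # send the K-byte buffer ...
        parent.recv_bytes()          # ... and wait for the response
    rtt = (time.perf_counter() - t0) / repeats
    parent.send_bytes(b"")           # tell the server to exit
    proc.join()
    print(f"K={k}: rtt {rtt*1e6:.1f} us, {k/rtt/1e6:.2f} MB/s")

if __name__ == "__main__":
    for k in (1, 1024, 1024 * 1024):
        measure(k)
\end{lstlisting}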
\subsection{Profiling information}
Event and debug logs shall be handled as follows. Under \texttt{\textasciitilde/.df-logs} the data flow system will create multiple directories. Filename components surrounded by square brackets will be substituted by values:
\begin{itemize}
\item slurm/[jobid]/ - global parameters and output of the job here - like the nodes on which it is running
\item slurm/[jobid]/[network nodeid]/ - log files here
\item slurm/[jobid]/[task rank] - a symbolic link to the corresponding network-nodeid directory
\item local/pid’s/ - global parameters and output of the job here
\item local/pid’s/nodeid’s/logs
\item names/symbolic-links contain a string with [time-job-name-job params] and point to slurm jobid or process id directories
\end{itemize}
In this milestone we can rely on a shared file system to unify this into a global directory of log / debug / profiling information for the job.
The log files will include CPU performance counters and a similar set of data for the GPU.
The debug target and level can be amended dynamically.
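A sketch of creating the per-job part of this layout (in Python; the helper name and its arguments are illustrative):
\begin{lstlisting}
from pathlib import Path

def make_log_dirs(jobid, nodeid, rank, base=Path.home() / ".df-logs"):
    job_dir = base / "slurm" / str(jobid)
    node_dir = job_dir / str(nodeid)
    node_dir.mkdir(parents=True, exist_ok=True)   # log files go here
    link = job_dir / str(rank)                    # [task rank] -> node dir
    if not link.exists():
        link.symlink_to(node_dir, target_is_directory=True)
    return node_dir
\end{lstlisting}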
\subsection{Dynamic Scheduling}
A program executes a computation repeatedly using a set of actors, each of which consumes input sets. While the program is running, the input sets are adjusted to account for runtime differences among the running actors.
XXX Would gridding (Romein style) be a good example? Partitions of visibility data can be seen as input sets. NO - GRIDDING IS WAY WAY TOO COMPLICATED. INPUT SETS ARE ACTUALLY NOT RELEVANT - WHAT IS RELEVANT IS THAT A COORDINATING PROCESS SCHEDULES THE NEXT TASK WHERE RESOURCES ARE AVAILABLE.
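A sketch of the coordinator-side behaviour (in Python; \texttt{concurrent.futures} stands in for the data flow runtime, and the per-task sleep models the runtime differences):
\begin{lstlisting}
import random, time
from concurrent.futures import ProcessPoolExecutor, as_completed

def task(i):
    time.sleep(random.uniform(0.01, 0.1))  # runtime differs per task
    return i * i

if __name__ == "__main__":
    # the pool hands the next task to whichever worker is free,
    # rather than fixing the assignment statically up front
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(task, i) for i in range(32)]
        total = sum(f.result() for f in as_completed(futures))
    print(total)
\end{lstlisting}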
\subsection{Binning irregular visibility data for parallelism}
A data flow program reads an HDF5 file with OSKAR simulated data. While mapping the file into a region, it creates a partition of the region indexed by bins $[u_i, u_i + \delta_i] \times [v_j, v_j + \epsilon_j]$.
The program demonstrates, through further partitioning, access privileges or other means, that it can perform computations on a first subset of the bins (viz.\ the white tiles of a checkerboard) in parallel, and in a subsequent task on a second set of bins. The purpose of this demonstration is to acquire knowledge of how to avoid atomic additions during gridding.
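A sketch of the two-phase checkerboard traversal (in Python; random $(u,v)$ samples stand in for the OSKAR data, and the bin count is arbitrary):
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.uniform(0, 1, 10_000), rng.uniform(0, 1, 10_000)
nbins = 8
bi = np.minimum((u * nbins).astype(int), nbins - 1)
bj = np.minimum((v * nbins).astype(int), nbins - 1)

for phase in (0, 1):             # white tiles first, then black tiles
    mask = ((bi + bj) % 2) == phase
    # all bins of one colour can be gridded in parallel: no two
    # edge-adjacent tiles are active at once, so no atomic additions
    print(f"phase {phase}: {mask.sum()} visibilities")
\end{lstlisting}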
\subsection{HDF5 inter-operability}
A data flow program can start an actor upon completion of reading an HDF5 indexed set. The HDF5 fields will become Region fields. The program shall achieve I/O rates of at least 95\% of the iozone benchmark.
SZ: This is different from the HDF5 support in the FFI section. Here we measure throughput; there we prepare data for subsequent use.
\subsection{Resource Management}
The data flow system shall handle resource management in its scheduling.
\begin{enumerate}
\item A program will start 4 actors, two senders running on separate nodes, and two receivers running on a single node.
\item Each processes will be allowed to use $K$ bytes of memory.
\item Each sender will send $K$ bytes to each receiver, which acknowledges receipt and exits.
\item The program will demonstrate that if $K$ is more than $\frac{1}{2}$ of available RAM on the receiving node, the two receiving actors will be scheduled sequentially (to avoid deadlock). \item When $K$ is less than $\frac{1}{2}$ available RAM on the receiver, the two receiving actors will be scheduled concurrently.
\end{enumerate}
SZTODO: use privileges for this.
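The scheduling decision itself can be sketched as follows (in Python; the function is illustrative and the thresholds follow the enumeration above):
\begin{lstlisting}
def schedule_receivers(k_bytes, node_ram_bytes, n_receivers=2):
    if n_receivers * k_bytes <= node_ram_bytes:
        return "concurrent"   # both receive buffers fit at once
    if k_bytes <= node_ram_bytes:
        return "sequential"   # one receiver at a time avoids deadlock
    raise MemoryError("a single transfer already exceeds node RAM")

assert schedule_receivers(4 << 30, 16 << 30) == "concurrent"
assert schedule_receivers(10 << 30, 16 << 30) == "sequential"
\end{lstlisting}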
\subsection{Algorithmic Expressiveness}
A data flow program will perform a data-dependent decision, invoke different actors in the branches of the decision, and exit. Another data flow program will approximate a square root using Newton's algorithm and exit when the square of the result is sufficiently close to its argument.
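A sketch of the second program's logic (in Python; the tolerance is arbitrary, and the sign test on the input provides the data-dependent branch):
\begin{lstlisting}
def newton_sqrt(a, tol=1e-12):
    if a < 0:
        raise ValueError("negative input")  # data-dependent branch
    if a == 0:
        return 0.0
    x = a if a >= 1 else 1.0                # starting guess
    while abs(x * x - a) > tol * a:         # exit when x*x is close to a
        x = 0.5 * (x + a / x)               # Newton step
    return x

assert abs(newton_sqrt(2.0) - 2.0 ** 0.5) < 1e-9
\end{lstlisting}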
\subsection{Scalability and bisection bandwidth}
The program shall start an actor on each core of a subset of the cluster, for a total of at least 15,000 cores. A communication scheme among the actors will transfer data at a rate that should approximate the bisection bandwidth of the cluster.
\subsection{Failout Actors}
An actor named $\bf C$ that collects data from $k$ other actors to compute $\bf C$'s output can be made aware of the liveness of its $k$ inputs and, upon failure of up to $m$ of the $k$ inputs, compute a partial answer without raising an error condition.
XXX The hardest program to implement using Legion. I have some ideas, but it will certainly require modification of the Legion code. SKIP FOR NOW.
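Although the Legion implementation is deferred, the collector's contract can be sketched conceptually (in Python, with futures standing in for the $k$ input actors and a timeout standing in for liveness detection):
\begin{lstlisting}
import time
from concurrent.futures import ThreadPoolExecutor, wait

def collect(inputs, m, timeout=1.0):
    # wait for the k inputs; tolerate up to m of them not reporting
    done, not_done = wait(inputs, timeout=timeout)
    if len(not_done) > m:
        raise RuntimeError("too many failed inputs")
    return sum(f.result() for f in done)   # partial answer, no error

def produce(delay):
    time.sleep(delay)
    return 1

with ThreadPoolExecutor() as pool:
    futs = [pool.submit(produce, d) for d in (0.1, 0.1, 5.0)]
    print(collect(futs, m=1))   # prints 2: the slow input is dropped
\end{lstlisting}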
\subsection{Data Flow inside MPI}
An MPI program can use data flow.
XXX I read this as ``an MPI program controls computations with data flow''. A PROGRAM CAN USE MPI COMMUNICATIONS AND LEGION-STYLE TASKS (DO NOT TRY THIS IN REGENT).
\subsection{Foreign function interfaces: C++/OpenMP, and MPI programs execute inside Actors}
An actor in a data flow program shall be able to call C++ functions parallelized with OpenMP, and to leverage MPI programs.
XXX The ``C/C++ function interface'' part should implement an FFTW call. XXX FFTW IS JUST ONE EXAMPLE; KERNELS USING MPI MESSAGING AND OPENMP \& OPENACC ARE MORE IMPORTANT. WE NEED ALL OF THESE, PREFERABLY.
\subsubsection{HDF5 Support}
SZTODO: Legion supports ``HDF5 attachment'': an HDF5 group can become a data hierarchy in Legion. The C++ code: \url{https://github.com/StanfordLegion/legion/tree/stable/test/hdf_attach}. Write a Regent program for generating data for FFTW (a sine wave).
SZ: Here we prepare data for subsequent use in the programs below. This is different from the throughput measurement in ``HDF5 inter-operability'' and sets up the context for the examples that follow.
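A sketch of the data generation step (in Python with \texttt{h5py} rather than Regent; the file name, dataset name, sample count and frequency are arbitrary):
\begin{lstlisting}
import numpy as np
import h5py

n, freq = 4096, 37                  # samples and cycles per window
t = np.arange(n) / n
signal = np.sin(2 * np.pi * freq * t)

with h5py.File("sine.h5", "w") as f:
    f.create_dataset("signal", data=signal)   # input for the FFT below
\end{lstlisting}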
\subsubsection{Calling FFTW}
SZTODO: Regent program reads HDF5 data and calls FFTW routine, then finds and reports peak frequency.
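A sketch of this step (in Python; \texttt{numpy.fft} stands in here for the FFTW call, and the file written in the previous sketch is assumed):
\begin{lstlisting}
import numpy as np
import h5py

with h5py.File("sine.h5", "r") as f:
    signal = f["signal"][...]

spectrum = np.abs(np.fft.rfft(signal))
peak = int(np.argmax(spectrum[1:])) + 1   # skip the DC bin
print(f"peak frequency bin: {peak}")      # 37 for the sine wave above
\end{lstlisting}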
\subsubsection{OpenMP kernel in a Regent program}
SZTODO: Dot product using OpenMP-parallelized loops and the HDF5 data from the HDF5 support section.
\subsubsection{MPI from data flow}
SZTODO: a Regent program reading HDF5 data and calling MPI code for a distributed dot product.
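A sketch of the distributed dot product itself (in Python with \texttt{mpi4py} rather than Regent; for simplicity the data is generated identically on every rank instead of being read from HDF5):
\begin{lstlisting}
# run with e.g.: mpiexec -n 4 python dot_mpi.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1 << 20
rng = np.random.default_rng(0)      # same seed: same data on all ranks
a = rng.standard_normal(n)
b = rng.standard_normal(n)

lo, hi = rank * n // size, (rank + 1) * n // size
partial = float(np.dot(a[lo:hi], b[lo:hi]))  # this rank's slice
total = comm.allreduce(partial, op=MPI.SUM)  # combine partial products
if rank == 0:
    print(total)
\end{lstlisting}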
\subsubsection{OpenACC}
SZTODO: is OpenACC available on the cluster?
\subsection{Support for SMP CPU and GPU}
A data flow program can be created that optionally runs one actor on the GPU and optionally runs another actor on an SMP CPU.
XXX There are already examples of this functionality in Legion (apps/circuit, for one). MAKE A SUPER SIMPLE EXAMPLE.
\subsection{Distributed Partial Data Transposition}
A first actor creates a Region with a two-dimensional index space, ordered row-major (i.e., the elements of a row are contiguous in memory, and successive rows form contiguous memory segments). A set of $k$ second actors performs work on $k$ sets of columns of the region. The data is transferred to the second actors in such a manner that it is contiguous.
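A sketch of the layout manipulation (in Python with NumPy arrays standing in for Regions; the region shape and $k$ are arbitrary):
\begin{lstlisting}
import numpy as np

region = np.arange(6 * 8).reshape(6, 8)   # row-major by default
k = 4
for cols in np.array_split(np.arange(8), k):
    # copy this actor's column block into contiguous memory
    block = np.ascontiguousarray(region[:, cols])
    assert block.flags["C_CONTIGUOUS"]    # safe to ship as one segment
\end{lstlisting}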
\section{Application of Data Flow to SDP}
\label{sec:dataflow-sdp-application}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: