#LyX 2.0 created this file. For more info see http://www.lyx.org/
\lyxformat 413
\begin_document
\begin_header
\textclass article
\begin_preamble
\usepackage{/accounts/gen/vis/paciorek/latex/paciorek-asa,times,graphics}
\input{/accounts/gen/vis/paciorek/latex/paciorekMacros}
%\renewcommand{\baselinestretch}{1.5}
\usepackage[unicode=true]{hyperref}
\hypersetup{unicode=true, pdfusetitle,
bookmarks=true,bookmarksnumbered=false,bookmarksopen=false,
 breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=true,}
\end_preamble
\use_default_options false
\begin_modules
knitr
\end_modules
\maintain_unincluded_children false
\language english
\language_package default
\inputencoding auto
\fontencoding global
\font_roman default
\font_sans default
\font_typewriter default
\font_default_family default
\use_non_tex_fonts false
\font_sc false
\font_osf false
\font_sf_scale 100
\font_tt_scale 100

\graphics default
\default_output_format default
\output_sync 0
\bibtex_command default
\index_command default
\paperfontsize 12
\spacing onehalf
\use_hyperref false
\papersize letterpaper
\use_geometry true
\use_amsmath 1
\use_esint 0
\use_mhchem 1
\use_mathdots 1
\cite_engine basic
\use_bibtopic false
\use_indices false
\paperorientation portrait
\suppress_date false
\use_refstyle 0
\index Index
\shortcut idx
\color #008000
\end_index
\leftmargin 1in
\topmargin 1in
\rightmargin 1in
\bottommargin 1in
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\paragraph_indentation default
\quotes_language english
\papercolumns 1
\papersides 1
\paperpagestyle default
\tracking_changes false
\output_changes false
\html_math_output 0
\html_css_as_file 0
\html_be_strict false
\end_header

\begin_body

\begin_layout Title
An Introduction to Parallel Processing in R, 
\begin_inset Newline newline
\end_inset

Including Use on Clusters and in the Cloud
\end_layout

\begin_layout Author
Chris Paciorek
\begin_inset Newline newline
\end_inset

Department of Statistics
\begin_inset Newline newline
\end_inset

University of California, Berkeley
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Plain Layout

<<setup, include=FALSE, cache=FALSE>>=
\end_layout

\begin_layout Plain Layout

options(replace.assign=TRUE, width=55)
\end_layout

\begin_layout Plain Layout

@
\end_layout

\end_inset


\end_layout

\begin_layout Chunk
<<read-chunk, echo=FALSE>>= 
\end_layout

\begin_layout Chunk
read_chunk('parallel.R')
\end_layout

\begin_layout Chunk
read_chunk('doMPIexample.R')
\end_layout

\begin_layout Chunk
read_chunk('RmpiExample.R')
\end_layout

\begin_layout Chunk
read_chunk('parallel.sh')
\end_layout

\begin_layout Chunk
read_chunk('starcluster.sh')
\end_layout

\begin_layout Chunk
read_chunk('pbd-mpi.R')
\end_layout

\begin_layout Chunk
read_chunk('pbd-apply.R')
\end_layout

\begin_layout Chunk
read_chunk('pbd-linalg.R')
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Section
How to follow and try out this material
\end_layout

\begin_layout Standard
Note that my examples here will be silly toy examples for the purpose of
 keeping things simple and focused on the parallelization approaches.
 The demo code embedded in this document is available in the various .R and
 .sh files in the GitHub repository, 
\begin_inset CommandInset href
LatexCommand href
target "https://github.com/berkeley-scf/parallelR-biostat-2015"

\end_inset

.
 This document was created using knitr.
\end_layout

\begin_layout Standard
I will do demos on an Ubuntu Linux virtual machine (VM), on the Biostat
 cluster, and on Amazon's AWS.
 
\end_layout

\begin_layout Standard
We'll use the 
\begin_inset CommandInset href
LatexCommand href
name "BCE Virtual Machine"
target "bce.berkeley.edu"

\end_inset

.
 You can run this on your laptop (see the 
\begin_inset CommandInset href
LatexCommand href
name "BCE installation instructions"
target "bce.berkeley.edu/install.html"

\end_inset

), and I encourage you to do so to follow along.
 Note: to allow for parallelization, before starting the BCE VM, go to 
\family typewriter
Machine > Settings
\family default
, select 
\family typewriter
System
\family default
 and increase the number of processors.
 You may also want to increase the amount of memory.
\end_layout

\begin_layout Standard
We'll also use BCE as the basis for the virtual machines we start on Amazon's
 AWS.
 
\end_layout

\begin_layout Section
Overview of parallel processing computers
\end_layout

\begin_layout Standard
There are two basic flavors of parallel processing (leaving aside GPUs):
 distributed memory and shared memory.
 With shared memory, multiple processors (which I'll call cores) share the
 same memory.
 With distributed memory, you have multiple nodes, each with their own memory.
 You can think of each node as a separate computer connected by a fast network.
 
\end_layout

\begin_layout Subsection
Some useful terminology:
\end_layout

\begin_layout Itemize

\emph on
cores
\emph default
: We'll use this term to mean the different processing units available on
 a single node.
\end_layout

\begin_layout Itemize

\emph on
nodes
\emph default
: We'll use this term to mean the different computers, each with their own
 distinct memory, that make up a cluster or supercomputer.
\end_layout

\begin_layout Itemize

\emph on
processes
\emph default
: computational tasks executing on a machine.
 A given program may start up multiple processes at once.
 Ideally we have no more processes than cores on a node.
\end_layout

\begin_layout Itemize

\emph on
thread
\emph default
s: multiple paths of execution within a single process; the OS sees the
 threads as a single process, but one can think of them as 'lightweight'
 processes.
 Ideally when considering the processes and their threads, we would have
 no more processes and threads combined than cores on a node.
\end_layout

\begin_layout Itemize

\emph on
forking
\emph default
: child processes are spawned that are identical to the parent, but with
 different process IDs and their own memory.
\end_layout

\begin_layout Itemize

\emph on
sockets
\emph default
: some of R's parallel functionality involves creating new R processes (e.g.,
 starting processes via 
\emph on
Rscript
\emph default
) and communicating with them via a communication technology called sockets.
\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
[see George's pdf for graphical representation, p.
 23]
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Shared memory
\end_layout

\begin_layout Standard
For shared memory parallelism, each core is accessing the same memory so
 there is no need to pass information (in the form of messages) between
 different machines.
 But in some programming contexts one needs to be careful that activity
 on different cores doesn't mistakenly overwrite places in memory that are
 used by other cores.
\end_layout

\begin_layout Standard
The shared memory parallelism approaches that we'll cover are:
\end_layout

\begin_layout Enumerate
threaded linear algebra and 
\end_layout

\begin_layout Enumerate
simple parallelization of embarrassingly parallel computations.
\end_layout

\begin_layout Paragraph*
Threading
\end_layout

\begin_layout Standard
Threads are multiple paths of execution within a single process.
 Using 
\emph on
top
\emph default
 to monitor a job that is executing threaded code, you'll see the process
 using more than 100% of CPU.
 When this occurs, the process is using multiple cores, although it appears
 as a single process rather than as multiple processes.
 In general, threaded code will detect the number of cores available on
 a machine and make use of them.
 However, you can also explicitly control the number of threads available
 to a process.
 
\end_layout

\begin_layout Subsection
Distributed memory
\end_layout

\begin_layout Standard
Parallel programming for distributed memory parallelism requires passing
 messages between the different nodes.
 The standard protocol for doing this is MPI, of which there are various
 versions, including 
\emph on
openMPI
\emph default
.
 The Python package 
\emph on
mpi4py
\emph default
 implements MPI in Python and the R package 
\emph on
Rmpi
\emph default
 implements MPI in R.
 
\end_layout

\begin_layout Standard
Some of the distributed memory approaches that we'll cover are:
\end_layout

\begin_layout Enumerate
simple parallelization of embarrassingly parallel computations,
\end_layout

\begin_layout Enumerate
using MPI for explicit distributed memory processing, and
\end_layout

\begin_layout Enumerate
distributed linear algebra.
\end_layout

\begin_layout Subsection
Other types of parallel processing
\end_layout

\begin_layout Standard
We won't cover either of the following in this material.
\end_layout

\begin_layout Subsubsection
GPUs
\end_layout

\begin_layout Standard
GPUs (Graphics Processing Units) are processing units originally designed
 for rendering graphics on a computer quickly.
 This is done by having a large number of simple processing units for massively
 parallel calculation.
 The idea of general purpose GPU (GPGPU) computing is to exploit this capability
 for general computation.
 
\end_layout

\begin_layout Standard
In spring 2014, I gave a 
\begin_inset CommandInset href
LatexCommand href
name "workshop on using GPUs"
target "http://statistics.berkeley.edu/computing/gpu"

\end_inset

.
 One easy way to use a GPU is on an Amazon EC2 virtual machine.
 
\end_layout

\begin_layout Subsubsection
Spark and Hadoop
\end_layout

\begin_layout Standard
Spark and Hadoop are systems for implementing computations in a distributed
 memory environment, using the MapReduce approach.
 In fall 2014 I gave a 
\begin_inset CommandInset href
LatexCommand href
name "workshop on Spark"
target "http://statistics.berkeley.edu/computing/spark"

\end_inset

.
 One easy way to use Spark is on a cluster of Amazon EC2 virtual machines.
 
\end_layout

\begin_layout Section
Basic suggestions for parallelizing your code
\end_layout

\begin_layout Standard
The easiest situation is when your code is embarrassingly parallel, which
 means that the different tasks can be done independently and the results
 collected.
 When the tasks need to interact, things get much harder.
 Much of the material here is focused on embarrassingly parallel computation.
\end_layout

\begin_layout Standard
The following are some basic principles/suggestions for how to parallelize
 your computation.
\end_layout

\begin_layout Itemize
If you can do your computation on the cores of a single node using shared
 memory, that will be faster than using the same number of cores (or even
 somewhat more cores) across multiple nodes.
 Similarly, jobs with a lot of data/high memory requirements that one might
 think of as requiring Hadoop may in some cases be much faster if you can
 find a single machine with a lot of memory.
\end_layout

\begin_deeper
\begin_layout Itemize
That said, if you would run out of memory on a single node, then you'll
 need to use distributed memory.
\end_layout

\end_deeper
\begin_layout Itemize
If you have nested loops, you generally only want to parallelize at one
 level of the code.
 That said, there may be cases in which it is helpful to do both.
 Keep in mind whether your linear algebra is being threaded.
 Often you will want to parallelize over a loop and not use threaded linear
 algebra.
 (That said, if you have multiple nodes, you might have one task per node
 and use threaded linear algebra to exploit the cores on each node.)
\end_layout

\begin_layout Itemize
Often it makes sense to parallelize the outer loop when you have nested
 loops.
\end_layout

\begin_layout Itemize
You generally want to parallelize in such a way that your code is load-balanced
 and does not involve too much communication.
 
\end_layout

\begin_deeper
\begin_layout Itemize
If you have very few tasks, particularly if the tasks take different amounts
 of time, often some processors will be idle and your code poorly load-balanced.
\end_layout

\begin_layout Itemize
If you have very many tasks and each one takes little time, the communication
 overhead of starting and stopping the tasks will reduce efficiency.
\end_layout

\end_deeper
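\begin_layout Standard
As a rough illustration of the last point, here's a minimal sketch (not
 part of the demo code) of grouping many tiny tasks into one block per core,
 so that the start-up cost is paid once per block rather than once per task.
 The number of tasks, the number of cores, and the squaring function are
 arbitrary choices for illustration; splitIndices() is a helper in the parallel
 package.
\end_layout

\begin_layout Chunk
<<sketch-blocking, eval=FALSE>>=
\end_layout

\begin_layout Chunk
library(parallel)
\end_layout

\begin_layout Chunk
nTasks <- 1000
\end_layout

\begin_layout Chunk
nCores <- 2  # illustrative value
\end_layout

\begin_layout Chunk
blocks <- splitIndices(nTasks, nCores)  # list of index vectors, one per core
\end_layout

\begin_layout Chunk
res <- mclapply(blocks, function(idx) sapply(idx, function(i) i^2), mc.cores = nCores)
\end_layout

\begin_layout Chunk
res <- unlist(res)  # results come back in the original task order
\end_layout

\begin_layout Chunk
@
\end_layout
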
\begin_layout Standard
I'm happy to help discuss specific circumstances, so just email consult@stat.berk
eley.edu.
 The new Berkeley Research Computing (BRC) initiative is also providing
 consulting on efficient parallelization strategies.
 If my expertise is not sufficient, I can help you get assistance from BRC.
\end_layout

\begin_layout Paragraph
Static vs.
 dynamic assignment of tasks
\end_layout

\begin_layout Standard
Some of R's parallel functions allow you to say whether the tasks can be
 divided up and allocated to the workers at the beginning or whether tasks
 should be assigned individually as previous tasks complete.
 E.g., the 
\emph on
mc.preschedule
\emph default
 argument in 
\emph on
mclapply()
\emph default
 and the 
\emph on
.scheduling
\emph default
 argument in 
\emph on
clusterMap()
\emph default
.
\end_layout

\begin_layout Standard
Basically if you have many tasks that each take similar time, you want to
 preschedule to reduce communication.
 If you have few tasks or tasks with highly variable completion times, you
 don't want to preschedule, to improve load-balancing.
\end_layout
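
\begin_layout Standard
Here's a small sketch of the two choices using mclapply() on a UNIX-like
 machine.
 The toy function slowSquare() and the use of two cores are assumptions made
 only for illustration.
 Both calls return the same results; what differs is whether the tasks are
 split among the workers up front or handed out one at a time.
\end_layout

\begin_layout Chunk
<<sketch-preschedule, eval=FALSE>>=
\end_layout

\begin_layout Chunk
library(parallel)
\end_layout

\begin_layout Chunk
slowSquare <- function(i) { Sys.sleep(0.01 * i); i^2 }  # toy task with uneven run times
\end_layout

\begin_layout Chunk
resStatic <- mclapply(1:8, slowSquare, mc.cores = 2, mc.preschedule = TRUE)  # static
\end_layout

\begin_layout Chunk
resDynamic <- mclapply(1:8, slowSquare, mc.cores = 2, mc.preschedule = FALSE)  # dynamic
\end_layout

\begin_layout Chunk
identical(resStatic, resDynamic)
\end_layout

\begin_layout Chunk
@
\end_layout
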

\begin_layout Section
Threaded linear algebra and the BLAS 
\end_layout

\begin_layout Standard
The BLAS is the library of basic linear algebra operations (written in Fortran
 or C).
 A fast BLAS can greatly speed up linear algebra relative to the default
 BLAS on a machine.
 Some fast BLAS libraries are Intel's 
\emph on
MKL
\emph default
, AMD's 
\emph on
ACML
\emph default
, and the open source (and free) 
\emph on
openBLAS
\emph default
 (formerly 
\emph on
GotoBLAS
\emph default
).
 For the Mac, there is 
\emph on
vecLib
\emph default
 BLAS.
 All of these BLAS libraries are now threaded: if your computer has multiple
 cores and there are free resources, your linear algebra will use multiple
 cores, provided your program is linked against the specific BLAS and provided
 OMP_NUM_THREADS is not set to one.
 (Macs make use of VECLIB_MAXIMUM_THREADS rather than OMP_NUM_THREADS.)
\end_layout

\begin_layout Subsection
Fixing the number of threads (cores used)
\end_layout

\begin_layout Standard
In general, if you want to limit the number of threads used, you can set
 the OMP_NUM_THREADS environment variable on a UNIX machine.
 This can be used in the context of R or C code that uses BLAS or your own
 threaded C code, but this does not work with Matlab.
 In the UNIX shell, you'd do this as follows (e.g.
 to limit to 3 cores):
\end_layout

\begin_layout Standard

\family typewriter
export OMP_NUM_THREADS=3 # bash
\end_layout

\begin_layout Standard

\family typewriter
setenv OMP_NUM_THREADS 3 # tcsh
\end_layout

\begin_layout Standard
If you are running R, you'd need to do this in your shell session before
 invoking R.
\end_layout
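
\begin_layout Standard
To confirm from within R that the setting was picked up, one simple check
 (a sketch, not part of the demo code) is to query the environment of the
 R process:
\end_layout

\begin_layout Chunk
<<sketch-check-threads, eval=FALSE>>=
\end_layout

\begin_layout Chunk
nThreads <- Sys.getenv('OMP_NUM_THREADS', unset = NA)  # NA if the variable is not set
\end_layout

\begin_layout Chunk
if (is.na(nThreads)) message('OMP_NUM_THREADS is not set') else message('OMP_NUM_THREADS = ', nThreads)
\end_layout

\begin_layout Chunk
@
\end_layout
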

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
warning off MATLAB:maxNumCompThreads:Deprecated 
\end_layout

\begin_layout Plain Layout
nslots = getenv('NSLOTS') 
\end_layout

\begin_layout Plain Layout
maxNumCompThreads(nslots); 
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Using threading in R
\end_layout

\begin_layout Standard
Threading in R is limited to linear algebra, for which R calls external
 BLAS and LAPACK libraries.
\end_layout

\begin_layout Standard
Here's some code that when run in an R executable linked to a threaded BLAS
 illustrates the speed of using a threaded BLAS:
\end_layout

\begin_layout Chunk
<<RlinAlg, cache=TRUE, eval=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Subsubsection
Setting up R with a fast BLAS
\end_layout

\begin_layout Standard
R on the Biostat cluster is linked to openBLAS.
 So you should be able to use OMP_NUM_THREADS to control the number of threads
 used for linear algebra on the Biostat cluster.
 
\end_layout

\begin_layout Standard
In general, the 
\begin_inset CommandInset href
LatexCommand href
name "R installation manual"
target "http://cran.r-project.org/manuals.html"

\end_inset

 gives information on how to link R to a fast BLAS.
 On Ubuntu Linux, if you install openBLAS as follows, the 
\emph on
/etc/alternatives
\emph default
 system will set 
\emph on
/usr/lib/libblas.so
\emph default
 to point to openBLAS.
 This is what is done in the BCE virtual machine.
 By default on Ubuntu, R will use the system BLAS, so it will as a result
 use openBLAS.
 
\end_layout

\begin_layout Chunk
<<etc-alternatives, eval=FALSE, engine='bash'>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
To use a fast, threaded BLAS enabled on your own Mac, do the following as
 the administrative user:
\end_layout

\begin_layout Chunk
<<MacBLAS, eval=FALSE, engine='bash'>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
this tests fine on Arwen and as SGE job on cluster
\end_layout

\end_inset


\end_layout

\begin_layout Subsubsection
Important warnings about use of threaded BLAS
\end_layout

\begin_layout Paragraph
Conflict between openBLAS and some parallel functionality in R
\end_layout

\begin_layout Standard
There are conflicts between forking in R and threaded BLAS that in some
 cases affect 
\emph on
foreach
\emph default
 (when using the 
\emph on
multicore
\emph default
 and 
\emph on
parallel
\emph default
 backends), 
\emph on
mclapply()
\emph default
, and (only if 
\emph on
makeCluster()
\emph default
 is set up with forking (not the default)) 
\emph on
par{L,S,}apply()
\emph default
.
 The result is that if linear algebra is used within your parallel code,
 R hangs.
 This affects (under somewhat different circumstances) both ACML and openBLAS.
\end_layout

\begin_layout Standard
To address this, before running an R job that does linear algebra, you can
 set OMP_NUM_THREADS to 1 to prevent the BLAS from doing threaded calculations.
 Alternatively, you can use MPI as the parallel backend (via 
\emph on
doMPI
\emph default
 in place of 
\emph on
doMC
\emph default
 or 
\emph on
doParallel
\emph default
 -- see Section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:Distributed-memory"

\end_inset

).
 You may also be able to convert your code to use 
\emph on
par{L,S,}apply() 
\emph default
[with the default PSOCK type] and avoid 
\emph on
foreach
\emph default
 entirely.
\end_layout
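
\begin_layout Standard
As a sketch of the last suggestion, the following uses a socket (PSOCK)
 cluster, which is the default type created by makeCluster(), so no forking
 is involved.
 The matrix size and the computation inside the function are arbitrary choices
 for illustration.
\end_layout

\begin_layout Chunk
<<sketch-psock-linalg, eval=FALSE>>=
\end_layout

\begin_layout Chunk
library(parallel)
\end_layout

\begin_layout Chunk
cl <- makeCluster(2)  # PSOCK cluster by default: new R processes, not forks
\end_layout

\begin_layout Chunk
res <- parLapply(cl, 1:4, function(i) { x <- matrix(rnorm(100), 10); sum(diag(crossprod(x))) })
\end_layout

\begin_layout Chunk
stopCluster(cl)
\end_layout

\begin_layout Chunk
@
\end_layout
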

\begin_layout Paragraph
Conflict between threaded BLAS and R profiling
\end_layout

\begin_layout Standard
There is also a conflict between threaded BLAS and R profiling, so if you
 are using 
\emph on
Rprof()
\emph default
, you may need to set OMP_NUM_THREADS to one.
 This has definitely occurred with openBLAS; I'm not sure about other threaded
 BLAS libraries.
\end_layout

\begin_layout Paragraph
Speed and threaded BLAS
\end_layout

\begin_layout Standard
In many cases, using multiple threads for linear algebra operations will
 outperform using a single thread, but there is no guarantee that this will
 be the case, in particular for operations with small matrices and vectors.
 Testing with openBLAS suggests that sometimes a job may take more time
 when using multiple threads; this seems to be less likely with ACML.
 This presumably occurs because openBLAS is not doing a good job in detecting
 when the overhead of threading outweighs the gains from distributing the
 computations.
 You can compare speeds by setting OMP_NUM_THREADS to different values.
 In cases where threaded linear algebra is slower than unthreaded, you would
 want to set OMP_NUM_THREADS to 1.
 
\end_layout

\begin_layout Standard
Therefore I recommend that you test any large jobs to compare performance
 with a single thread vs.
 multiple threads.
 Only if you see a substantive improvement with multiple threads does it
 make sense to have OMP_NUM_THREADS be greater than one.
\end_layout
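
\begin_layout Standard
One way to make this comparison is sketched below.
 Save the code in a script and run it twice, once with OMP_NUM_THREADS set
 to 1 and once with a larger value, then compare the elapsed times.
 The matrix dimension here is an arbitrary choice; it's best to use sizes
 typical of your real job, since the benefit of threading depends on the
 size of the matrices.
\end_layout

\begin_layout Chunk
<<sketch-blas-timing, eval=FALSE>>=
\end_layout

\begin_layout Chunk
n <- 1000  # illustrative size
\end_layout

\begin_layout Chunk
x <- matrix(rnorm(n^2), n)
\end_layout

\begin_layout Chunk
system.time(xtx <- crossprod(x))  # matrix multiplication, handled by the BLAS
\end_layout

\begin_layout Chunk
system.time(U <- chol(xtx))  # Cholesky decomposition, handled by LAPACK/BLAS
\end_layout

\begin_layout Chunk
@
\end_layout
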

\begin_layout Section
Basic shared memory parallel programming in R
\end_layout

\begin_layout Subsection
foreach
\end_layout

\begin_layout Standard
A simple way to exploit parallelism in R when you have an embarrassingly
 parallel problem (one where you can split the problem up into independent
 chunks) is to use the 
\emph on
foreach
\emph default
 package to do a for loop in parallel.
 For example, bootstrapping, random forests, simulation studies, cross-validatio
n and many other statistical methods can be handled in this way.
 You would not want to use 
\emph on
foreach
\emph default
 if the iterations were not independent of each other.
\end_layout

\begin_layout Standard
The 
\emph on
foreach
\emph default
 package provides a 
\emph on
foreach
\emph default
 command that allows you to do this easily.
 
\emph on
foreach
\emph default
 can use a variety of parallel 
\begin_inset Quotes eld
\end_inset

back-ends
\begin_inset Quotes erd
\end_inset

.
 It can use 
\emph on
Rmpi
\emph default
 to access cores in a distributed memory setting as discussed in Section
 
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:Distributed-memory"

\end_inset

 or the 
\emph on
parallel
\emph default
 or 
\emph on
multicore
\emph default
 packages to use shared memory cores.
 When using 
\emph on
parallel
\emph default
 or 
\emph on
multicore
\emph default
 as the back-end (these are equivalent to each other from the user perspective),
 you should see multiple processes (as many as you registered; ideally each
 at 100%) when you look at 
\emph on
top
\emph default
.
 The multiple processes are created by forking or using sockets; this is
 discussed a bit more later in this document.
\end_layout

\begin_layout Chunk
<<foreach, eval=FALSE, tidy=FALSE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
The result of 
\emph on
foreach
\emph default
 will generally be a list, unless 
\emph on
foreach
\emph default
 is able to put it into a simpler R object.
 Note that 
\emph on
foreach
\emph default
 also provides some additional functionality for collecting and managing
 the results, which means that you don't have to do some of the bookkeeping
 you would need to do if writing your own for loop.
\end_layout

\begin_layout Standard
You can debug by running serially using 
\emph on
%do%
\emph default
 rather than 
\emph on
%dopar%
\emph default
.
 Note that you may need to load packages within the 
\emph on
foreach
\emph default
 construct to ensure a package is available to all of the calculations.
\end_layout
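
\begin_layout Standard
Here's a minimal sketch of both points, assuming the foreach, doParallel
 and MASS packages are installed.
 The .packages argument asks foreach to load a package on each worker, and
 the second call runs serially with %do%.
 The loop bodies and the use of two cores are toy choices for illustration.
\end_layout

\begin_layout Chunk
<<sketch-foreach-packages, eval=FALSE>>=
\end_layout

\begin_layout Chunk
library(foreach)
\end_layout

\begin_layout Chunk
library(doParallel)
\end_layout

\begin_layout Chunk
registerDoParallel(cores = 2)
\end_layout

\begin_layout Chunk
out <- foreach(i = 1:4, .combine = c, .packages = 'MASS') %dopar% { mean(mvrnorm(100, mu = rep(i, 2), Sigma = diag(2))) }
\end_layout

\begin_layout Chunk
outSerial <- foreach(i = 1:4, .combine = c) %do% { i^2 }  # serial run for debugging
\end_layout

\begin_layout Chunk
@
\end_layout
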

\begin_layout Standard

\series bold
Caution
\series default
: Note that I didn't pay any attention to possible danger in generating
 random numbers in separate processes.
 More on this issue in the section on RNG (Section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:RNG"

\end_inset

).
\end_layout

\begin_layout Subsection
parallel apply (parallel package)
\end_layout

\begin_layout Standard
The 
\emph on
parallel
\emph default
 package has the ability to parallelize the various 
\emph on
apply()
\emph default
 functions (
\emph on
apply
\emph default
, 
\emph on
lapply
\emph default
, 
\emph on
sapply
\emph default
, etc.) and parallelize vectorized functions.
 It's a bit hard to find the 
\begin_inset CommandInset href
LatexCommand href
name "vignette"
target "http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf"

\end_inset

 for the parallel package because parallel is not listed as one of the contribut
ed packages on CRAN.
\end_layout

\begin_layout Standard
Here are examples of some of the parallel apply-style commands.
\end_layout

\begin_layout Chunk
<<parallelApply, eval=TRUE, tidy=FALSE, cache=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
Note that some R packages can directly interact with the parallelization
 packages to work with multiple cores.
 E.g., the 
\emph on
boot
\emph default
 package can make use of the 
\emph on
multicore
\emph default
 package directly.
 
\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
see help on mcparallel and test if it applies to foreach
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Using mcparallel() to manually parallelize individual tasks
\end_layout

\begin_layout Standard
One can use 
\emph on
mcparallel()
\emph default
 in the 
\emph on
parallel
\emph default
 package to send different chunks of code to different processes.
 Here we would need to manage the number of tasks so that we don't have
 more tasks than available cores.
\end_layout

\begin_layout Chunk
<<mcparallel, eval=TRUE, cache=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
Note that 
\emph on
mcparallel()
\emph default
 also allows the use of the 
\emph on
mc.set.seed
\emph default
 argument as with 
\emph on
mclapply()
\emph default
.
\end_layout

\begin_layout Standard
Note that on the cluster, one should create only as many parallel blocks
 of code as were requested when submitting the job.
\end_layout

\begin_layout Subsection
Running on the Biostat cluster
\end_layout

\begin_layout Standard
Here's how you would submit an R job that uses shared memory on a single
 node (in this case requesting four cores out of a maximum of eight on a given node):
\end_layout

\begin_layout Standard

\family typewriter
qsub -pe smp 4 job.sh
\end_layout

\begin_layout Standard

\emph on
job.sh
\emph default
 would have the various lines starting with #$ as in the Biostat cluster
 documentation and your R command would look like:
\end_layout

\begin_layout Standard

\family typewriter
R CMD BATCH --no-save file.R file.out
\end_layout

\begin_layout Standard
You can make use of the UNIX environment variable NSLOTS in your code to
 programmatically control the number of cores your code uses
 (e.g., with foreach, mclapply, parLapply, etc.).
 To read this variable in R do:
\end_layout

\begin_layout Standard

\family typewriter
nCores <- as.integer(Sys.getenv('NSLOTS'))
\end_layout
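
\begin_layout Standard
As a sketch of how this value might then be used: Sys.getenv() returns a
 character string, so it is safest to convert it to an integer, and supplying
 a fallback lets the same script run outside the queueing system, where
 NSLOTS is not set.
 The fallback of one core and the toy task are assumptions for illustration.
\end_layout

\begin_layout Chunk
<<sketch-nslots, eval=FALSE>>=
\end_layout

\begin_layout Chunk
library(parallel)
\end_layout

\begin_layout Chunk
nCores <- as.integer(Sys.getenv('NSLOTS', unset = '1'))  # fall back to 1 core
\end_layout

\begin_layout Chunk
res <- mclapply(1:10, function(i) i^2, mc.cores = nCores)
\end_layout

\begin_layout Chunk
@
\end_layout
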

\begin_layout Section
Distributed memory
\begin_inset CommandInset label
LatexCommand label
name "sec:Distributed-memory"

\end_inset


\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
batch.R has code to be executed as R batch job to test
\end_layout

\end_inset


\end_layout

\begin_layout Subsection
Use of MPI as a black box for embarrassingly parallel computation
\end_layout

\begin_layout Subsubsection
foreach with MPI in R on a single node
\end_layout

\begin_layout Standard
One reason to use Rmpi on a single node is that in some contexts the threaded
 BLAS (particularly openBLAS) conflicts with 
\emph on
multicore
\emph default
/
\emph on
parallel
\emph default
's forking functionality.
 This can cause 
\emph on
foreach
\emph default
 to hang when used with the 
\emph on
doParallel
\emph default
 or 
\emph on
doMC
\emph default
 parallel back ends (also with 
\emph on
mclapply()
\emph default
) and OMP_NUM_THREADS set to a value greater than one.
\end_layout

\begin_layout Standard
Here's R code for using 
\emph on
Rmpi
\emph default
 as the back-end to 
\emph on
foreach
\emph default
.
 
\end_layout

\begin_layout Chunk
<<Rmpi-foreach-oneNode, cache=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
One can also call 
\emph on
mpirun
\emph default
 to start R.
 Here one requests just a single process, as R will manage the task of spawning
 slaves.
\end_layout

\begin_layout Plain Layout

\family typewriter
mpirun -np 1 R --no-save
\end_layout

\begin_layout Plain Layout
or
\end_layout

\begin_layout Plain Layout

\family typewriter
mpirun -np 1 R CMD BATCH --no-save example.R example.out
\end_layout

\begin_layout Plain Layout
\begin_inset Note Note
status open

\begin_layout Plain Layout
[Ask Ryan if he has thoughts as to what the difference is, but not clear
 you'd really want to do this] 
\end_layout

\end_inset

Note that running R this way through 
\emph on
mpirun
\emph default
 is clunky as the usual tab completion and command history are not available
 and errors cause R to quit, because passing things through 
\emph on
mpirun
\emph default
 causes R to think it is not running interactively.
 Given this, at the moment it's not clear to me why you would want to invoke
 R using 
\emph on
mpirun
\emph default
, but I include it for completeness.
\end_layout

\end_inset


\end_layout

\begin_layout Standard
A caution concerning 
\emph on
Rmpi/doMPI
\emph default
: when you invoke 
\emph on
startMPIcluster()
\emph default
, all the slave R processes become 100% active and stay active until the
 cluster is closed.
 In addition, when 
\emph on
foreach
\emph default
 is actually running, the master process also becomes 100% active.
 So using this functionality involves some inefficiency in CPU usage.
 This inefficiency is not seen with a sockets cluster (see next) nor when
 using other 
\emph on
Rmpi
\emph default
 functionality - i.e., starting slaves with 
\emph on
mpi.spawn.Rslaves()
\emph default
 and then issuing commands to the slaves.
\end_layout

\begin_layout Subsubsection
foreach in R on multiple nodes
\end_layout

\begin_layout Standard

\emph on
[Note that at this point in the demo, we'll switch to use of a BCE-based
 EC2 virtual cluster, as we need access to multiple nodes.
 Note to self - just do starcluster sshmaster -u oski mycluster from laptop
 BCE.]
\end_layout

\begin_layout Standard
For this, we need to start R through the 
\emph on
mpirun
\emph default
 command so that inter-node communication is handled by MPI.
 The 
\family typewriter
-machinefile
\family default
 argument tells MPI what machines (nodes) to use for the computation.
 
\end_layout

\begin_layout Standard

\family typewriter
mpirun -machinefile .hosts -np 4 R CMD BATCH --no-save doMPIexample.R doMPIexample.
Rout
\end_layout

\begin_layout Standard

\family typewriter
mpirun -machinefile .hosts -np 4 R --no-save # for interactive use
\end_layout

\begin_layout Standard
When running on the Biostat cluster you can simply do:
\end_layout

\begin_layout Standard

\family typewriter
mpirun -np 1 R CMD BATCH --no-save doMPIexample.R doMPIexample.Rout
\end_layout

\begin_layout Standard
Here's the R code for using 
\emph on
Rmpi
\emph default
 as the back-end to 
\emph on
foreach
\emph default
.
 If you call 
\emph on
startMPIcluster()
\emph default
 with no arguments, it will start up one fewer worker process than the
 value given by the -np flag (one of the processes acts as the master), so
 your R code will be more portable.
 Also, you can leave off the -np flag and mpirun will start up as many processes
 as there are lines in the .hosts file.
\end_layout

\begin_layout Chunk
<<Rmpi-foreach-multipleNodes, eval=FALSE, cache=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
CAUTION: Note that the above with 
\family typewriter
-np 4 
\family default
should in general work with 
\family typewriter
-np 1
\family default
 as well (and does on the Biostat cluster).
 However 
\family typewriter
-np 1
\family default
 does not work on the Amazon cluster or on the SCF (though it has in the past).
 I'm in the process of investigating this.
 
\end_layout

\begin_layout Subsubsection
Sockets in R example
\end_layout

\begin_layout Standard
One can also set up a cluster via sockets.
 You just need to specify a character vector with the machine names as the
 input to 
\emph on
makeCluster()
\emph default
.
 Here we can just start R as we normally would, without using 
\emph on
mpirun
\emph default
.
\end_layout

\begin_layout Chunk
<<sockets-multipleNodes, eval=FALSE, cache=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Subsection
MPI basics
\end_layout

\begin_layout Standard
There are multiple MPI implementations, of which 
\emph on
Open MPI
\emph default
 and 
\emph on
MPICH
\emph default
 are very common.
\end_layout

\begin_layout Standard
In MPI programming, the same code runs on all the machines.
 This is called SPMD (single program, multiple data).
 As we saw above, one invokes the same code (same program) multiple times,
 but the behavior of the code can be different based on querying the rank
 (ID) of the process.
 Since MPI operates in a distributed fashion, any transfer of information
 between processes must be done explicitly via send and receive calls (e.g.,
 
\emph on
MPI_Send
\emph default
, 
\emph on
MPI_Recv
\emph default
, 
\emph on
MPI_Isend
\emph default
, and 
\emph on
MPI_Irecv
\emph default
).
 (The 
\begin_inset Quotes eld
\end_inset

MPI_
\begin_inset Quotes erd
\end_inset

 is for C code; C++ just has 
\emph on
Send
\emph default
, 
\emph on
Recv
\emph default
, etc.)
\end_layout

\begin_layout Standard
The latter two of these functions (
\emph on
MPI_Isend
\emph default
 and 
\emph on
MPI_Irecv
\emph default
) are so-called non-blocking calls.
 The distinction matters: a blocking call does not return until the operation
 has completed (for a send, until the data buffer can safely be reused),
 while a non-blocking call returns immediately and lets the code continue
 while the communication proceeds.
 Non-blocking calls can be more efficient because computation and communication
 can overlap, but they make it easier to introduce synchronization problems
 between processes.
 
\end_layout

\begin_layout Standard
In addition to send and receive calls to transfer to and from specific processes
, there are calls that send out data to all processes (
\emph on
MPI_Scatter
\emph default
), gather data back (
\emph on
MPI_Gather
\emph default
) and perform reduction operations (
\emph on
MPI_Reduce
\emph default
).
\end_layout
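
\begin_layout Standard
To make the semantics of these collective operations concrete, here is a
 serial sketch in plain R that mimics what scatter, gather and reduce do.
 It involves no MPI calls at all: the list 
\family typewriter
chunks
\family default
 merely stands in for the pieces of data held by four hypothetical processes,
 and all variable names are illustrative assumptions.
\end_layout

\begin_layout Chunk
<<sketch-collectives, eval=FALSE>>=
\end_layout

\begin_layout Chunk
nProc <- 4    # hypothetical number of processes
\end_layout

\begin_layout Chunk
x <- 1:20     # data initially held by the root process
\end_layout

\begin_layout Chunk
# "scatter": split x into one contiguous piece per process
\end_layout

\begin_layout Chunk
chunks <- split(x, rep(1:nProc, each = length(x) / nProc))
\end_layout

\begin_layout Chunk
# each process works only on its own piece
\end_layout

\begin_layout Chunk
partial <- lapply(chunks, function(piece) sum(piece^2))
\end_layout

\begin_layout Chunk
# "gather": collect the per-process results back on the root
\end_layout

\begin_layout Chunk
gathered <- unlist(partial, use.names = FALSE)
\end_layout

\begin_layout Chunk
# "reduce": combine the per-process results with an operation (here, a sum)
\end_layout

\begin_layout Chunk
total <- Reduce("+", partial)
\end_layout

\begin_layout Chunk
# the distributed computation agrees with the serial one
\end_layout

\begin_layout Chunk
stopifnot(total == sum(x^2), length(gathered) == nProc)
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
In real MPI code each piece would live in the memory of a different process,
 and the scatter, gather and reduce steps would each be a single collective
 call made by every process.
\end_layout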

\begin_layout Standard
Debugging MPI/
\emph on
Rmpi
\emph default
 code can be tricky because communication can hang, error messages from
 the workers may not be seen or readily accessible and it can be difficult
 to assess the state of the worker processes.
 
\end_layout

\begin_layout Subsubsection
Using Rmpi to write code for distributed computation
\end_layout

\begin_layout Standard
R users can use 
\emph on
Rmpi
\emph default
 to interface with MPI.
 
\end_layout

\begin_layout Standard
To start your R job do the following.
 Here one requests just a single process, as R will manage the task of spawning
 slaves.
\end_layout

\begin_layout Standard

\family typewriter
mpirun -machinefile .hosts -np 1 R CMD BATCH --no-save RmpiExample.R RmpiExample.Ro
ut
\end_layout

\begin_layout Standard
\begin_inset Note Note
status open

\begin_layout Plain Layout
[just mpirun -np and then print out mpi.comm.rank() does not print out for
 more than one process]
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Here's some example code that uses actual 
\emph on
Rmpi
\emph default
 syntax (as opposed to 
\emph on
foreach
\emph default
 with Rmpi as the back-end).
 This code runs in a master-slave paradigm where the master starts the slaves
 and invokes commands on them.
 
\end_layout

\begin_layout Chunk
<<Rmpi-usingMPIsyntax, cache=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
CAUTION: for some reason the 
\emph on
mpirun
\emph default
 syntax above is not distributing workers to the other nodes listed in the
 hosts file on the Amazon cluster and the SCF, but it should work on the
 Biostat cluster.
 This has worked in the past for me, so I'm not sure what is going on, but
 it may be a difference in the MPI version or something in the latest Ubuntu
 (Ubuntu 14 vs.
 Ubuntu 12).
 
\end_layout

\begin_layout Standard
You can use Rmpi on a single node, for example for debugging or prototyping.
 In this case you can just start R without using mpirun.
\end_layout

\begin_layout Standard
Note that if you instead start R in interactive mode via 
\emph on
mpirun
\emph default
, some of the usual functionality of command line R (tab completion, scrolling
 for history) is not enabled and errors will cause R to quit.
 This occurs because passing things through 
\emph on
mpirun
\emph default
 causes R to think it is not running interactively.
\end_layout

\begin_layout Subparagraph
Note on supercomputers
\end_layout

\begin_layout Standard
In some cases a cluster or supercomputer will be set up so that 
\emph on
Rmpi
\emph default
 is loaded and the worker processes are already started when you start R.
 In this case you wouldn't need to load 
\emph on
Rmpi
\emph default
 or use 
\emph on
mpi.spawn.Rslaves()
\emph default
.
 You can always check 
\emph on
mpi.comm.size()
\emph default
 to see if the workers are already set up.
\end_layout

\begin_layout Subsubsection
Using 
\emph on
pbdR
\emph default
 for distributed computation
\end_layout

\begin_layout Standard
There is a relatively new effort to enhance R's capability for distributed
 memory processing called 
\emph on
pbdR
\emph default
.
 
\emph on
pbdR
\emph default
 is designed for SPMD processing in batch mode, which means that you start
 up multiple processes in a non-interactive fashion using mpirun.
 The same code runs in each R process so you need to have the code behavior
 depend on the process ID, as we saw with 
\emph on
Rmpi
\emph default
 above.
\end_layout

\begin_layout Standard

\emph on
pbdR
\emph default
 provides the following capabilities:
\end_layout

\begin_layout Itemize
an alternative to 
\emph on
Rmpi
\emph default
 for interfacing with MPI,
\end_layout

\begin_layout Itemize
the ability to do some parallel apply-style computations, and
\end_layout

\begin_layout Itemize
the ability to do distributed linear algebra by interfacing to 
\emph on
ScaLAPACK
\emph default
.
\end_layout

\begin_layout Standard
Personally, I think the last of the three is the most exciting as it's a
 functionality not readily available in R or even more generally in other
 readily-accessible software.
\end_layout

\begin_layout Standard

\emph on
pbdR
\emph default
 is installed on the nodes in our example EC2 cluster, as detailed in Section
 
\begin_inset CommandInset ref
LatexCommand ref
reference "sub:Starting-a-Cluster"

\end_inset

.
\end_layout

\begin_layout Standard
Let's see some basic examples of using pbdR.
\end_layout

\begin_layout Standard
One starts pbdR via mpirun as follows:
\end_layout

\begin_layout Standard

\family typewriter
mpirun -machinefile .hosts -np 4 Rscript file.R > file.out
\end_layout

\begin_layout Paragraph
Interfacing with MPI
\end_layout

\begin_layout Standard
Here's an example of distributing an embarrassingly parallel calculation
 (estimating an integral via Monte Carlo, in this case estimating the value
 of 
\begin_inset Formula $\pi$
\end_inset

).
\end_layout

\begin_layout Chunk
<<pbd-mpi, cache=TRUE, eval=FALSE>>=
\end_layout

\begin_layout Chunk
@
\end_layout
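
\begin_layout Standard
The structure of such a calculation can also be sketched without MPI.
 In the serial mock below, a loop over a hypothetical rank stands in for
 the separate processes that 
\emph on
mpirun
\emph default
 would start, and the final sum plays the role of a reduction.
 The function and variable names are illustrative assumptions, and the naive
 per-rank seeding is used only to keep the sketch short (see Section 
\begin_inset CommandInset ref
LatexCommand ref
reference "sec:RNG"

\end_inset

 for better approaches).
\end_layout

\begin_layout Chunk
<<sketch-spmd-pi, eval=FALSE>>=
\end_layout

\begin_layout Chunk
nProc <- 4          # hypothetical number of processes
\end_layout

\begin_layout Chunk
nPerProc <- 25000   # points simulated by each process
\end_layout

\begin_layout Chunk
# work done by one process, identified by its rank:
\end_layout

\begin_layout Chunk
# count random points in the unit square that fall inside the quarter circle
\end_layout

\begin_layout Chunk
countInside <- function(rank, n) {
\end_layout

\begin_layout Chunk
  set.seed(rank)    # naive per-rank seed, for illustration only
\end_layout

\begin_layout Chunk
  sum(runif(n)^2 + runif(n)^2 <= 1)
\end_layout

\begin_layout Chunk
}
\end_layout

\begin_layout Chunk
# serial stand-in for running the same code on every rank
\end_layout

\begin_layout Chunk
counts <- sapply(0:(nProc - 1), countInside, n = nPerProc)
\end_layout

\begin_layout Chunk
# "reduce" step: combine the counts and form the estimate of pi
\end_layout

\begin_layout Chunk
piHat <- 4 * sum(counts) / (nProc * nPerProc)
\end_layout

\begin_layout Chunk
@
\end_layout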

\begin_layout Paragraph
Parallel apply
\end_layout

\begin_layout Standard
Here's some basic syntax for doing a distributed 
\emph on
apply()
\emph default
 on a matrix that is on one of the workers (i.e., the matrix is not distributed).
\end_layout

\begin_layout Chunk
<<pbd-apply, cache=TRUE, eval=FALSE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Paragraph
Distributed linear algebra
\end_layout

\begin_layout Standard
And here's how you would set up a distributed matrix and do linear algebra
 on it.
 Note that when working with large matrices, you would generally want to
 construct the matrices (or read from disk) in a parallel fashion rather
 than creating the full matrix on one worker.
\end_layout

\begin_layout Chunk
<<pbd-linalg, cache=TRUE, eval=FALSE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
As a quick, completely non-definitive point of comparison, doing the crossproduct
 and Cholesky for the 8192x8192 matrix on 3 EC2 nodes (2 cores per node)
 with 
\family typewriter
-np 6
\family default
 took 39 seconds for each operation, while doing the same with two threads
 on the master node took 64 seconds (crossproduct) and 23 seconds (Cholesky).
\end_layout

\begin_layout Subsection
Running on the Biostat cluster
\end_layout

\begin_layout Standard
Here's how you would run distributed R jobs based on 
\emph on
mpirun
\emph default
 on the Biostat cluster:
\end_layout

\begin_layout Standard

\family typewriter
qsub -pe orte 12 job.sh
\end_layout

\begin_layout Standard

\emph on
job.sh
\emph default
 would contain the various directive lines starting with #$, as described
 in the Biostat cluster documentation.
\end_layout

\begin_layout Standard
For Rmpi/doMPI your R command inside 
\emph on
job.sh
\emph default
 would look like this, where R will start the worker processes (hence the
 
\family typewriter
-np 1
\family default
):
\end_layout

\begin_layout Standard

\family typewriter
mpirun -np 1 R CMD BATCH file.R file.out
\end_layout

\begin_layout Standard
or for 
\emph on
pbdR
\emph default
 where 
\emph on
mpirun
\emph default
 starts the workers in SPMD fashion:
\end_layout

\begin_layout Standard

\family typewriter
mpirun -np 12 Rscript file.R > file.out
\end_layout

\begin_layout Standard
In general on the Biostat cluster, you do not need to pass anything via
 
\family typewriter
-machinefile
\family default
 as the queueing software takes care of this.
\end_layout

\begin_layout Section
Using R on Cloud VMs and Cloud-based Clusters
\end_layout

\begin_layout Standard
For Amazon-based machines, we'll use the BCE image as our starting point
 and add software to it.
\end_layout

\begin_layout Standard
Amazon's service that provides cloud-based virtual machines is called EC2.
 To use EC2, you'll need an Amazon account.
 This might be an account you pay for with your own credit card, or an account
 that Biostat pays for through a grant.
 
\end_layout

\begin_layout Standard
Once you get an account, you'll receive authentication information in the
 form of two long strings of letters/numbers, called your access key ID
 and secret access key.

\series bold
 WARNING: do not share your Amazon authentication info (AWS_ACCESS_KEY_ID
 or AWS_SECRET_ACCESS_KEY) in any public setting, such as a public Github
 account.

\series default
 A user in my class did this and we ended up with thousands of dollars in
 charges from hackers starting up instances (probably to mine Bitcoin).
\end_layout

\begin_layout Subsection
Starting an Amazon EC2 Instance
\end_layout

\begin_layout Paragraph
Point-and-click via the EC2 console
\end_layout

\begin_layout Standard
To start up an EC2 virtual machine based on BCE, follow the instructions
 at 
\begin_inset CommandInset href
LatexCommand href
target "bce.berkeley.edu/install.html"

\end_inset

.
\end_layout

\begin_layout Standard
To log on, you'll need to make use of SSH keys.
 Here's how to get things going with BCE such that you can log on as the
 
\emph on
oski
\emph default
 user.
 You obtain the <IP_address> by clicking on the 
\begin_inset Quotes eld
\end_inset

Connect
\begin_inset Quotes erd
\end_inset

 button on the EC2 console.
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
begin{verbatim}
\end_layout

\begin_layout Plain Layout

# get the modify-for-aws.sh and add-parallel-tools.sh
\end_layout

\begin_layout Plain Layout

# scripts from the BCE webpage
\end_layout

\begin_layout Plain Layout

# run on the VM to do some setup
\end_layout

\begin_layout Plain Layout

ssh -i ~/.ssh/your_private_key_file ubuntu@<IP_address> 
\backslash

\end_layout

\begin_layout Plain Layout

   sudo bash modify-for-aws.sh  
\end_layout

\begin_layout Plain Layout

# now install software for parallel R functionality
\end_layout

\begin_layout Plain Layout

ssh -i ~/.ssh/your_private_key_file ubuntu@<IP_address> 
\backslash

\end_layout

\begin_layout Plain Layout

   sudo bash add-parallel-tools.sh
\end_layout

\begin_layout Plain Layout

# Now you can ssh in directly as the oski user.
\end_layout

\begin_layout Plain Layout

ssh -i ~/.ssh/your_private_key_file oski@<IP_address> 
\end_layout

\begin_layout Plain Layout


\backslash
end{verbatim}
\end_layout

\end_inset


\end_layout

\begin_layout Standard
Now you can log on to the instance as the 
\emph on
oski
\emph default
 user (or just do 
\family typewriter
su - oski
\family default
 if you are logged on as root) and do your work.
\end_layout

\begin_layout Standard
Once you are logged in to the VM, running parallel R jobs on a single node
 should be exactly the same as we've already seen on the BCE VM running
 on your laptop.
\end_layout

\begin_layout Paragraph
By scripting
\end_layout

\begin_layout Standard
For those of you in PH290, Aaron Culich of Berkeley Research Computing has
 set up a 
\begin_inset CommandInset href
LatexCommand href
name "scripted approach"
target "https://github.com/ucberkeley/ph290-bce-2015-spring"

\end_inset

 to start and access an EC2 instance from the command line using a tool
 called 
\emph on
vagrant
\emph default
.
\end_layout

\begin_layout Standard
The Python 
\emph on
boto
\emph default
 package is also handy for working with EC2.
\end_layout

\begin_layout Subsection
Starting a Cluster of EC2 Instances
\begin_inset CommandInset label
LatexCommand label
name "sub:Starting-a-Cluster"

\end_inset


\end_layout

\begin_layout Standard
We'll use a tool called 
\begin_inset CommandInset href
LatexCommand href
name "StarCluster"
target "http://star.mit.edu/cluster/index.html"

\end_inset

 to start EC2 clusters.
 This manages setting up the networking between the nodes.
 You'll need StarCluster installed on your local machine.
 
\end_layout

\begin_layout Standard
The file 
\emph on
starcluster.config
\emph default
 contains some basic settings you can use to start up an EC2 cluster based
 on BCE.
 In particular it has the AMI IDs for the BCE EC2 images.
 You'll need to put your own AWS authentication information in the file
 (or set these as environment variables in your shell session).
 Rename the file as 
\emph on
config
\emph default
 and put it in a directory called 
\emph on
.starcluster
\emph default
 in your home directory.
 Then you can start an EC2 cluster and login as follows:
\end_layout

\begin_layout Chunk
<<starcluster-start, eval=FALSE, engine='bash'>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
When you start the cluster via 
\family typewriter
starcluster start
\family default
, you should see logging information that looks like this, followed by some
 instructions about how to restart, stop, and login to your cluster.
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Plain Layout


\backslash
begin{verbatim}
\end_layout

\begin_layout Plain Layout

StarCluster - (http://star.mit.edu/cluster) (v.
 0.95.6)
\end_layout

\begin_layout Plain Layout

Software Tools for Academics and Researchers (STAR)
\end_layout

\begin_layout Plain Layout

Please submit bug reports to starcluster@mit.edu
\end_layout

\begin_layout Plain Layout

\end_layout

\begin_layout Plain Layout

>>> Validating cluster template settings...
\end_layout

\begin_layout Plain Layout

>>> Cluster template settings are valid
\end_layout

\begin_layout Plain Layout

>>> Starting cluster...
\end_layout

\begin_layout Plain Layout

>>> Launching a 4-node cluster...
\end_layout

\begin_layout Plain Layout

>>> Creating security group @sc-mycluster...
\end_layout

\begin_layout Plain Layout

Reservation:r-e344eee9
\end_layout

\begin_layout Plain Layout

>>> Waiting for instances to propagate...
\end_layout

\begin_layout Plain Layout

4/4 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Waiting for cluster to come up...
 (updating every 30s)
\end_layout

\begin_layout Plain Layout

>>> Waiting for all nodes to be in a 'running' state...
\end_layout

\begin_layout Plain Layout

4/4 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Waiting for SSH to come up on all nodes...
\end_layout

\begin_layout Plain Layout

4/4 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Waiting for cluster to come up took 1.591 mins
\end_layout

\begin_layout Plain Layout

>>> The master node is ec2-52-10-188-152.us-west-2.compute.amazonaws.com
\end_layout

\begin_layout Plain Layout

>>> Configuring cluster...
\end_layout

\begin_layout Plain Layout

>>> Running plugin starcluster.clustersetup.DefaultClusterSetup
\end_layout

\begin_layout Plain Layout

>>> Configuring hostnames...
\end_layout

\begin_layout Plain Layout

4/4 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Creating cluster user: oski (uid: 1001, gid: 1001)
\end_layout

\begin_layout Plain Layout

4/4 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Configuring scratch space for user(s): oski
\end_layout

\begin_layout Plain Layout

4/4 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Configuring /etc/hosts on each node
\end_layout

\begin_layout Plain Layout

4/4 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Starting NFS server on master
\end_layout

\begin_layout Plain Layout

>>> Configuring NFS exports path(s):
\end_layout

\begin_layout Plain Layout

/home
\end_layout

\begin_layout Plain Layout

>>> Mounting all NFS export path(s) on 3 worker node(s)
\end_layout

\begin_layout Plain Layout

3/3 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 100%  
\end_layout

\begin_layout Plain Layout

>>> Setting up NFS took 0.070 mins
\end_layout

\begin_layout Plain Layout

>>> Configuring passwordless ssh for root
\end_layout

\begin_layout Plain Layout

>>> Configuring passwordless ssh for oski
\end_layout

\begin_layout Plain Layout

>>> Configuring cluster took 0.227 mins
\end_layout

\begin_layout Plain Layout

>>> Starting cluster took 1.860 mins
\end_layout

\begin_layout Plain Layout


\backslash
end{verbatim}
\end_layout

\end_inset


\end_layout

\begin_layout Standard
The other commands after the 
\family typewriter
starcluster start
\family default
 customize the EC2 cluster with some parallel R (and Python) tools, as well
 as MPI.
 In particular note that I've created a 
\emph on
.hosts
\emph default
 file for use with 
\emph on
mpirun
\emph default
.
 The host file just contains lines with 
\emph on
master
\emph default
, 
\emph on
node001
\emph default
, 
\emph on
node002
\emph default
, ..., which are the internal names of the nodes (as named by StarCluster)
 in the cluster.
\end_layout

\begin_layout Section
Efficiency and Profiling
\end_layout

\begin_layout Subsection
General tips on efficient R code
\end_layout

\begin_layout Standard
See 
\begin_inset CommandInset href
LatexCommand href
name "Section 1 of Unit 6 from Statistics 243"
target "https://github.com/berkeley-stat243/stat243-fall-2014/blob/master/units/unit6-Rprog.pdf?raw=true"

\end_inset

 on efficient R code.
\end_layout

\begin_layout Standard
Here are a few high-level principles:
\end_layout

\begin_layout Enumerate
Write vectorized code rather than for loops or apply-style calls.
\end_layout

\begin_layout Enumerate
Pre-allocate storage rather than allocating into a position in a vector
 or list that is beyond the length of that vector/list.
\end_layout

\begin_layout Enumerate
Make sure R is linked to a fast BLAS.
\end_layout

\begin_layout Enumerate
To get one or more values from a vector or list, use numeric indices rather
 than looking up/subsetting by name or boolean values.
\end_layout
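
\begin_layout Standard
The first two principles can be illustrated with a small sketch.
 The function names and the vector length are arbitrary choices made for
 illustration, and the timings will vary by machine and R version (recent
 versions of R are smarter about growing vectors, so the gap for the first
 comparison may be modest).
\end_layout

\begin_layout Chunk
<<sketch-prealloc, eval=FALSE>>=
\end_layout

\begin_layout Chunk
n <- 50000
\end_layout

\begin_layout Chunk
# growing a vector one element at a time can force repeated reallocation
\end_layout

\begin_layout Chunk
grow <- function(n) { x <- c(); for (i in 1:n) x[i] <- i^2; x }
\end_layout

\begin_layout Chunk
# pre-allocating the full vector avoids that cost
\end_layout

\begin_layout Chunk
prealloc <- function(n) { x <- numeric(n); for (i in 1:n) x[i] <- i^2; x }
\end_layout

\begin_layout Chunk
# the vectorized version avoids the loop entirely
\end_layout

\begin_layout Chunk
vectorized <- function(n) (1:n)^2
\end_layout

\begin_layout Chunk
system.time(r1 <- grow(n))
\end_layout

\begin_layout Chunk
system.time(r2 <- prealloc(n))
\end_layout

\begin_layout Chunk
system.time(r3 <- vectorized(n))
\end_layout

\begin_layout Chunk
# all three approaches give the same answer
\end_layout

\begin_layout Chunk
stopifnot(all.equal(r1, r2), all.equal(r2, r3))
\end_layout

\begin_layout Chunk
@
\end_layout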

\begin_layout Subsection
Benchmarking and profiling
\end_layout

\begin_layout Standard
Again, my class notes (see the link above) provide more details on the topics
 below.
\end_layout

\begin_layout Standard
The 
\emph on
system.time()
\emph default
 function allows you to time R code.
 The rbenchmark package provides a nice wrapper function, 
\emph on
benchmark()
\emph default
, that eases timing of code and comparison of timing of different code chunks.
 
\end_layout

\begin_layout Standard
The 
\emph on
Rprof()
\emph default
 function allows you to see how much time is spent in different parts of
 your code, helping to identify bottlenecks.
 
\emph on
Rprof()
\emph default
 works as follows.
 At set time intervals, it queries the R process to see what function is
 being evaluated and writes this info to disk.
 Then when the code is finished you can call 
\emph on
summaryRprof()
\emph default
 to collect those statistics.
 Note that if your code runs very quickly, you may not accumulate enough
 samples to get useful information.
 In this case you can shorten the sampling interval via the 
\emph on
interval
\emph default
 argument to 
\emph on
Rprof()
\emph default
 or write a loop that repeatedly evaluates your code.
 
\end_layout
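
\begin_layout Standard
Here is a minimal sketch of timing and profiling using only base R.
 The function 
\family typewriter
slowFun()
\family default
 and the use of a temporary file are illustrative assumptions; the loop
 inside the function simply makes the code run long enough for the profiler
 to collect samples.
\end_layout

\begin_layout Chunk
<<sketch-rprof, eval=FALSE>>=
\end_layout

\begin_layout Chunk
# hypothetical function whose run time we want to understand
\end_layout

\begin_layout Chunk
slowFun <- function(reps = 20) { for (i in 1:reps) x <- sort(runif(5e5)); invisible(x) }
\end_layout

\begin_layout Chunk
# overall timing
\end_layout

\begin_layout Chunk
system.time(slowFun())
\end_layout

\begin_layout Chunk
# profiling: samples are written to a temporary file
\end_layout

\begin_layout Chunk
profFile <- tempfile()
\end_layout

\begin_layout Chunk
Rprof(profFile, interval = 0.01)
\end_layout

\begin_layout Chunk
slowFun()
\end_layout

\begin_layout Chunk
Rprof(NULL)   # turn profiling off
\end_layout

\begin_layout Chunk
prof <- summaryRprof(profFile)
\end_layout

\begin_layout Chunk
head(prof$by.self)   # functions ranked by time spent in them
\end_layout

\begin_layout Chunk
@
\end_layout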

\begin_layout Standard
I haven't used it, but Hadley Wickham's 
\begin_inset CommandInset href
LatexCommand href
name "lineprof package"
target "https://github.com/hadley/lineprof"

\end_inset

 provides a nice interface that allows you to see timing (as with 
\emph on
Rprof()
\emph default
) but also memory use by lines of your code.
 Some details are in 
\begin_inset CommandInset href
LatexCommand href
name "his book"
target "http://adv-r.had.co.nz/memory.html#memory-profiling"

\end_inset

.
\end_layout

\begin_layout Section
RNG 
\begin_inset CommandInset label
LatexCommand label
name "sec:RNG"

\end_inset


\end_layout

\begin_layout Standard
The key thing when thinking about random numbers in a parallel context is
 that you want to avoid having the same 'random' numbers occur on multiple
 processes.
 On a computer, random numbers are not actually random but are generated
 as a sequence of pseudo-random numbers designed to mimic true random numbers.
 The sequence is finite (but very long) and eventually repeats itself.
 When one sets a seed, one is choosing a position in that sequence to start
 from.
 Subsequent random numbers are based on that subsequence.
 All random numbers can be generated from one or more random uniform numbers,
 so we can just think about a sequence of values between 0 and 1.
 
\end_layout

\begin_layout Standard
The worst thing that could happen is that one sets things up in such a way
 that every process is using the same sequence of random numbers.
 This could happen if you mistakenly set the same seed in each process,
 e.g., using 
\family typewriter
set.seed(mySeed)
\family default
 in R on every process.
\end_layout
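
\begin_layout Standard
The following sketch shows the problem in miniature, with repeated calls
 to a function standing in for separate processes.
 The function name 
\family typewriter
worker()
\family default
 and the seed value are hypothetical.
\end_layout

\begin_layout Chunk
<<sketch-same-seed, eval=FALSE>>=
\end_layout

\begin_layout Chunk
mySeed <- 1
\end_layout

\begin_layout Chunk
# stand-in for the code run by one process
\end_layout

\begin_layout Chunk
worker <- function(seed) { set.seed(seed); rnorm(3) }
\end_layout

\begin_layout Chunk
draws1 <- worker(mySeed)       # 'process' 1
\end_layout

\begin_layout Chunk
draws2 <- worker(mySeed)       # 'process' 2, mistakenly using the same seed
\end_layout

\begin_layout Chunk
identical(draws1, draws2)      # TRUE: the 'independent' draws are duplicates
\end_layout

\begin_layout Chunk
draws3 <- worker(mySeed + 1)   # a different seed gives different draws
\end_layout

\begin_layout Chunk
identical(draws1, draws3)      # FALSE
\end_layout

\begin_layout Chunk
@
\end_layout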

\begin_layout Standard
The naive approach is to use a different seed for each process.
 E.g., if your processes are numbered 
\begin_inset Formula $id=1,\ldots,p$
\end_inset

, with 
\emph on
id
\emph default
 unique to a process, using 
\family typewriter
set.seed(id)
\family default
 on each process.
 This is unlikely to cause problems, but raises the danger that two (or
 more) sequences might overlap.
 For an algorithm with dependence on the full sequence, such as an MCMC,
 this probably won't cause big problems (though you likely wouldn't know
 if it did), but for something like simple simulation studies, some of your
 'independent' samples could be exact replicates of a sample on another
 process.
 Given the period length of the default generators in R, Matlab and Python,
 this is actually quite unlikely, but it is a bit sloppy.
\end_layout

\begin_layout Standard
One approach to avoid the problem is to do all your RNG on one process and
 distribute the random deviates, but this can be infeasible with many random
 numbers.
\end_layout

\begin_layout Standard
More generally to avoid this problem, the key is to use an algorithm that
 ensures sequences that do not overlap.
\end_layout

\begin_layout Subsection
Ensuring separate sequences in R
\end_layout

\begin_layout Standard
In R, there are two packages that deal with this, 
\emph on
rlecuyer
\emph default
 and 
\emph on
rsprng
\emph default
.
 We'll go over 
\emph on
rlecuyer
\emph default
, as I've heard that 
\emph on
rsprng
\emph default
 is deprecated (though there is no evidence of this on CRAN) and 
\emph on
rsprng
\emph default
 is (at the moment) not available for the Mac.
\end_layout

\begin_layout Standard
The L'Ecuyer algorithm has a period of 
\begin_inset Formula $2^{191}$
\end_inset

, which it divides into subsequences of length 
\begin_inset Formula $2^{127}$
\end_inset

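, giving 
\begin_inset Formula $2^{191}/2^{127}=2^{64}$
\end_inset

 non-overlapping subsequences (streams).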
 
\end_layout

\begin_layout Subsubsection
With the parallel package
\end_layout

\begin_layout Standard
Here's how you initialize independent sequences on different processes when
 using the 
\emph on
parallel
\emph default
 package's parallel apply functionality (illustrated here with 
\emph on
parSapply()
\emph default
).
\end_layout

\begin_layout Chunk
<<RNG-apply, eval=FALSE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
If you want to explicitly move from stream to stream, you can use 
\emph on
nextRNGStream()
\emph default
.
 For example:
\end_layout

\begin_layout Chunk
<<RNGstream, eval=FALSE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
When using 
\emph on
mclapply()
\emph default
, you can use the 
\emph on
mc.set.seed
\emph default
 argument as follows (note that 
\emph on
mc.set.seed
\emph default
 is TRUE by default, so you should get different seeds for the different
 processes by default), but one needs to invoke 
\family typewriter
RNGkind("L'Ecuyer-CMRG")
\family default
 to get independent streams via the L'Ecuyer algorithm.
\end_layout

\begin_layout Chunk
<<RNG-mclapply, eval=FALSE, tidy=FALSE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
The documentation for 
\emph on
mcparallel()
\emph default
 gives more information about reproducibility based on 
\emph on
mc.set.seed
\emph default
.
\end_layout

\begin_layout Subsubsection
With foreach
\end_layout

\begin_layout Paragraph*
Getting independent streams
\end_layout

\begin_layout Standard
One question is whether 
\emph on
foreach
\emph default
 deals with RNG correctly.
 This is not documented, but the developers (Revolution Analytics) are well
 aware of RNG issues.
 Digging into the underlying code reveals that the 
\emph on
doMC
\emph default
 and 
\emph on
doParallel
\emph default
 backends both invoke 
\emph on
mclapply()
\emph default
 and set 
\emph on
mc.set.seed
\emph default
 to TRUE by default.
 This suggests that the discussion above regarding 
\emph on
mclapply()
\emph default
 holds for 
\emph on
foreach
\emph default
 as well, so you should do 
\family typewriter
RNGkind("L'Ecuyer-CMRG")
\family default
 before your foreach call.
 For 
\emph on
doMPI
\emph default
, as of version 0.2, you can do something like this, which uses L'Ecuyer
 behind the scenes:
\begin_inset Newline newline
\end_inset


\family typewriter
cl <- startMPIcluster(nSlots)
\begin_inset Newline newline
\end_inset

setRngDoMPI(cl, seed=0)
\end_layout

\begin_layout Paragraph*
Ensuring reproducibility
\end_layout

\begin_layout Standard
While using 
\emph on
foreach
\emph default
 as just described should ensure that the streams on each worker are
 distinct, it does not ensure reproducibility because task chunks may be
 assigned to workers differently in different runs and the substreams are
 specific to workers, not to tasks.
 
\end_layout

\begin_layout Standard
For 
\emph on
doMPI
\emph default
, you can specify a different RNG substream for each task chunk in a way
 that ensures reproducibility.
 Basically you provide a list called 
\emph on
.options.mpi
\emph default
 as an argument to 
\emph on
foreach
\emph default
, with 
\emph on
seed
\emph default
 as an element of the list:
\end_layout

\begin_layout Chunk
<<RNG-doMPI, eval=TRUE>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
That single seed then initializes the RNG for the first task, and subsequent
 tasks get separate substreams, using the L'Ecuyer algorithm, based on 
\emph on
nextRNGStream()
\emph default
.
 Note that the 
\emph on
doMPI
\emph default
 developers also suggest using the 
\emph on
chunkSize
\emph default
 option (also specified as an element of 
\emph on
.options.mpi
\emph default
) when using 
\emph on
seed
\emph default
.
 See 
\family typewriter
?
\begin_inset Quotes erd
\end_inset

doMPI-package
\begin_inset Quotes erd
\end_inset


\family default
 for more details.
\end_layout

\begin_layout Standard
For other backends, such as 
\emph on
doParallel
\emph default
, there is a package called 
\emph on
doRNG
\emph default
 that ensures that 
\emph on
foreach
\emph default
 loops are reproducible.
 Here's how you do it:
\end_layout

\begin_layout Chunk
<<RNG-doRNG>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\begin_layout Standard
Alternatively, you can do:
\end_layout

\begin_layout Chunk
<<RNG-doRNG2>>=
\end_layout

\begin_layout Chunk
@
\end_layout

\end_body
\end_document