<!DOCTYPE html>
<html>
<head>
<title>Parallel Computing</title>
<meta charset="utf-8">
<meta name="Description" content="R Language Tutorials for Advanced Statistics">
<meta name="Keywords" content="R, Tutorial, Machine learning, Statistics, Data Mining, Analytics, Data science, Linear Regression, Logistic Regression, Time series, Forecasting">
<meta name="Distribution" content="Global">
<meta name="Author" content="Selva Prabhakaran">
<meta name="Robots" content="index, follow">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="shortcut icon" href="/screenshots/iconb-64.png" type="image/x-icon" />
<link href="www/bootstrap.min.css" rel="stylesheet">
<link href="www/highlight.css" rel="stylesheet">
<link href='http://fonts.googleapis.com/css?family=Inconsolata:400,700'
rel='stylesheet' type='text/css'>
<!-- Color Script -->
<style type="text/css">
a {
color: #3675C5;
color: rgb(25, 145, 248);
color: #4582ec;
color: #3F73D8;
}
li {
line-height: 1.65;
}
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
</style>
<!-- Add Google search -->
<script language="Javascript" type="text/javascript">
function my_search_google()
{
var query = document.getElementById("my-google-search").value;
window.open("http://google.com/search?q=" + query
+ "%20site:" + "http://r-statistics.co");
}
</script>
</head>
<body>
<div class="container">
<div class="masthead">
<!--
<ul class="nav nav-pills pull-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Table of contents<b class="caret"></b>
</a>
<ul class="dropdown-menu pull-right" role="menu">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</ul>
-->
<ul class="nav nav-pills pull-right">
<div class="input-group">
<form onsubmit="my_search_google()">
<input type="text" class="form-control" id="my-google-search" placeholder="Search..">
</form>
</div><!-- /input-group -->
</ul><!-- /.col-lg-6 -->
<h3 class="muted"><a href="/">r-statistics.co</a><small> by Selva Prabhakaran</small></h3>
<hr>
</div>
<div class="row">
<div class="col-xs-12 col-sm-3" id="nav">
<div class="well">
<li>
<ul class="list-unstyled">
<li class="dropdown-header"></li>
<li class="dropdown-header">Tutorial</li>
<li><a href="R-Tutorial.html">R Tutorial</a></li>
<li class="dropdown-header">ggplot2</li>
<li><a href="ggplot2-Tutorial-With-R.html">ggplot2 Short Tutorial</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part1-With-R-Code.html">ggplot2 Tutorial 1 - Intro</a></li>
<li><a href="Complete-Ggplot2-Tutorial-Part2-Customizing-Theme-With-R-Code.html">ggplot2 Tutorial 2 - Theme</a></li>
<li><a href="Top50-Ggplot2-Visualizations-MasterList-R-Code.html">ggplot2 Tutorial 3 - Masterlist</a></li>
<li><a href="ggplot2-cheatsheet.html">ggplot2 Quickref</a></li>
<li class="dropdown-header">Foundations</li>
<li><a href="Linear-Regression.html">Linear Regression</a></li>
<li><a href="Statistical-Tests-in-R.html">Statistical Tests</a></li>
<li><a href="Missing-Value-Treatment-With-R.html">Missing Value Treatment</a></li>
<li><a href="Outlier-Treatment-With-R.html">Outlier Analysis</a></li>
<li><a href="Variable-Selection-and-Importance-With-R.html">Feature Selection</a></li>
<li><a href="Model-Selection-in-R.html">Model Selection</a></li>
<li><a href="Logistic-Regression-With-R.html">Logistic Regression</a></li>
<li><a href="Environments.html">Advanced Linear Regression</a></li>
<li class="dropdown-header">Advanced Regression Models</li>
<li><a href="adv-regression-models.html">Advanced Regression Models</a></li>
<li class="dropdown-header">Time Series</li>
<li><a href="Time-Series-Analysis-With-R.html">Time Series Analysis</a></li>
<li><a href="Time-Series-Forecasting-With-R.html">Time Series Forecasting </a></li>
<li><a href="Time-Series-Forecasting-With-R-part2.html">More Time Series Forecasting</a></li>
<li class="dropdown-header">High Performance Computing</li>
<li><a href="Parallel-Computing-With-R.html">Parallel computing</a></li>
<li><a href="Strategies-To-Improve-And-Speedup-R-Code.html">Strategies to Speedup R code</a></li>
<li class="dropdown-header">Useful Techniques</li>
<li><a href="Association-Mining-With-R.html">Association Mining</a></li>
<li><a href="Multi-Dimensional-Scaling-With-R.html">Multi Dimensional Scaling</a></li>
<li><a href="Profiling.html">Optimization</a></li>
<li><a href="Information-Value-With-R.html">InformationValue package</a></li>
</ul>
</li>
</div>
<div class="well">
<p>Stay up-to-date. <a href="https://docs.google.com/forms/d/1xkMYkLNFU9U39Dd8S_2JC0p8B5t6_Yq6zUQjanQQJpY/viewform">Subscribe!</a></p>
<p><a href="https://docs.google.com/forms/d/13GrkCFcNa-TOIllQghsz2SIEbc-YqY9eJX02B19l5Ow/viewform">Chat!</a></p>
</div>
<h4>Contents</h4>
<ul class="list-unstyled" id="toc"></ul>
<!--
<hr>
<p><a href="/contribute.html">How to contribute</a></p>
<p><a class="btn btn-primary" href="">Edit this page</a></p>
-->
</div>
<div id="content" class="col-xs-12 col-sm-8 pull-right">
<h1>Parallel Computing</h1>
<blockquote>
<p>R provides a number of convenient facilities for parallel computing. The method below shows how to set up and run a parallel process on your current multi-core machine, without the need for additional hardware.</p>
</blockquote>
<h2>Setting up for parallelization</h2>
<p>The number of parallel processes you can usefully run at the same time depends on the number of cores in your machine. If you are on a Windows PC, open ‘Task Manager’ => ‘Performance’ tab and count the number of boxes below “CPU Usage History”. That is the maximum number of processes that can run truly in parallel on your computer. You can use all of them for R computations, but it is a good idea to leave a core or two free for background system processes. Here is how you can set up your R session for parallel processing:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Registering cores for parallel process</span>
<span class="kw">library</span>(doSNOW)
cl <-<span class="st"> </span><span class="kw">makeCluster</span>(<span class="dv">4</span>, <span class="dt">type=</span><span class="st">"SOCK"</span>) <span class="co"># 4 – number of cores</span>
<span class="kw">registerDoSNOW</span>(cl) <span class="co"># Register back end Cores for Parallel Computing</span></code></pre></div>
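<p>If you would rather not count cores by hand, the <code>parallel</code> package that ships with R can report them. The snippet below is a minimal sketch, not part of the original setup: <code>nWorkers</code> is a name chosen here for illustration, and leaving one core free simply follows the advice above.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(parallel)
nCores <- detectCores()  # number of logical cores; can be NA on some platforms
nWorkers <- max(1, nCores - 1, na.rm = TRUE)  # leave one core free, but keep at least one worker
nWorkers</code></pre></div>
<p>The resulting value could then be passed to <code>makeCluster()</code> in place of the hard-coded 4.</p>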
<h2>Running the parallel process</h2>
<p>Once the cores are set up to run computations in parallel, the ‘foreach’ loop (from the foreach package) can run your functions in parallel by opening as many parallel R sessions as the number of cores you have registered. A regular for-loop runs serially, i.e. it processes one value of the loop counter (‘i’) at a time. In a foreach loop, the values you supply to the loop counter (‘i’ in this case) are distributed across the registered workers, so up to {number_of_cores_initialised} iterations run at the same time. After evaluating the code inside the loop, foreach combines the returned values using the function supplied to the ‘.combine’ argument; if no ‘.combine’ is given, the results are returned as a list.</p>
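<p>To make the comparison concrete, here is a small sketch that writes the same computation as a regular for-loop and as a foreach loop. It uses <code>%do%</code>, so it runs sequentially and needs no registered cores; the variable names are chosen here purely for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(foreach)

# regular for-loop: results are stored one at a time in a pre-allocated vector
squares_for <- numeric(5)
for (i in 1:5) {
  squares_for[i] <- i^2
}

# foreach: each iteration returns a value, and .combine = c joins them into a vector
squares_foreach <- foreach(i = 1:5, .combine = c) %do% {
  i^2
}

# without .combine, the returned values come back as a list
squares_list <- foreach(i = 1:5) %do% {
  i^2
}

identical(squares_for, squares_foreach)  # both loops should give the same vector
is.list(squares_list)</code></pre></div>
<p>Unlike a for-loop, a foreach loop returns its result, so the output has to be assigned to a variable rather than built up inside the loop body.</p>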
<h2>Parallel processing: some simple examples</h2>
<p>In the examples below, replace %dopar% with %do% to make it run as a non-parallel process.</p>
<h4>Example 1</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(foreach)
<span class="kw">foreach</span>(<span class="dt">i =</span> <span class="dv">1</span>:<span class="dv">28</span>) %dopar%<span class="st"> </span>{<span class="kw">sqrt</span>(i)} <span class="co"># example 1</span></code></pre></div>
<h4>Example 2</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># returned output values of the parallel process are combined using 'c()' function</span>
<span class="kw">foreach</span>(<span class="dt">i =</span> <span class="dv">1</span>:<span class="dv">28</span>,<span class="dt">.combine =</span> <span class="st">"c"</span>) %dopar%<span class="st"> </span>{<span class="kw">sqrt</span>(i)} <span class="co"># example 2</span></code></pre></div>
<h4>Example 3</h4>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># returned output values of the parallel process are combined using 'cbind()' function</span>
<span class="kw">foreach</span>(<span class="dt">i =</span> <span class="dv">1</span>:<span class="dv">28</span>,<span class="dt">.combine =</span> <span class="st">"cbind"</span>) %dopar%<span class="st"> </span>{letters[<span class="dv">1</span>:<span class="dv">4</span>]} <span class="co"># example 3 </span></code></pre></div>
<h4>Example 4</h4>
<p>You can also create your own combining function as you wish.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># combine using your custom defined function: "myCustomFunc()" and store in 'output' variable</span>
output <-<span class="st"> </span><span class="kw">foreach</span>(<span class="dt">i =</span> <span class="dv">1</span>:<span class="dv">28</span>, <span class="dt">.combine =</span> <span class="st">"myCustomFunc"</span>) %dopar%<span class="st"> </span>{
<span class="kw">sqrt</span>(i)
}</code></pre></div>
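<p>As one possible illustration of a custom combining function, the sketch below defines a hypothetical <code>addUp()</code> that adds each new result to the running total. The name and the choice of addition are assumptions made for this example. It uses <code>%do%</code> so that it runs without a registered backend; with a backend registered, <code>%dopar%</code> could be used instead.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(foreach)

# hypothetical combining function: takes the accumulated result and the next value
addUp <- function(accumulated, nextValue) {
  accumulated + nextValue
}

total <- foreach(i = 1:28, .combine = addUp) %do% {
  sqrt(i)
}

all.equal(total, sum(sqrt(1:28)))  # should agree with the vectorised sum</code></pre></div>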
<p><code>myCustomFunc</code> above is just a placeholder.</p>
<h2>Further customizing for packages and output aggregation</h2>
<p>You are nearly there, just a couple more things left. If you use functions from packages loaded in your main R session, they may not work inside the foreach loop, because a separate R session is started for each parallel process. So you need to list the packages you need in the ‘.packages’ argument of foreach. Additionally, if the values to iterate over are held in a separate R object (such as the row indices of a data frame), you can pass that object as the iterating variable (‘allRowIndices’ in this case) in the foreach statement. Here is a sample of how the code might look.</p>
<h2>Structure of a typical parallel processing code</h2>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">allRowIndices <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">1</span>:<span class="kw">nrow</span>(inputData)) <span class="co"># assign row indices of inputData, that will be processed in parallel</span>
output <-<span class="st"> </span><span class="kw">foreach</span> (<span class="dt">rowNum =</span> allRowIndices, <span class="dt">.combine =</span> rbind, <span class="dt">.packages =</span> <span class="kw">c</span>(<span class="st">"caret"</span>, <span class="st">"ggplot2"</span>, <span class="st">"Hmisc"</span>)) %dopar%<span class="st"> </span>{
<span class="co"># code to process each rowNum goes within this block.</span>
<span class="co"># up to 'n' rows are processed simultaneously, where 'n' is the number of registered cores.</span>
<span class="co"># after all rows are processed, the returned values are combined using the function given in `.combine` (`rbind` in this case) and stored in `output`.</span>
<span class="co"># the packages required by the functions in this block have to be listed in the `.packages` argument.</span>
}
<span class="kw">stopCluster</span>(cl) <span class="co"># undo the parallel processing setup</span></code></pre></div>
<p>In the above code, the main component of parallelisation is the foreach loop and the three arguments that go along with it. The first argument (‘rowNum’) is a row counter that iterates through all the values in ‘allRowIndices’. The second one, ‘.combine’, is the function used to aggregate the results of the computations from all the rows. In this case, ‘rbind’ stacks the results as rows. Finally, the third one, ‘.packages’, lists the packages needed by the functions used within the ‘foreach’ block. Note that even if you have already loaded the packages before calling ‘foreach’, you need to specify them again here, since new R sessions are opened for the parallel processing. With all these defined, the computations are done in parallel on the cores you registered earlier, and the results are combined and stored in <code>output</code>.</p>
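<p>The following is a minimal, self-contained sketch of the same structure, meant only to show the ‘.packages’ argument at work. The file names are made-up input values, and the <code>tools</code> package is picked for illustration because it ships with R but is not attached by default, so the workers need it listed in order to call <code>file_ext()</code> directly.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(doSNOW)  # also attaches foreach and snow
cl <- makeCluster(2, type = "SOCK")  # 2 workers, to keep the sketch light
registerDoSNOW(cl)

fileNames <- c("sales.csv", "notes.txt", "model.rds")  # hypothetical input values

# file_ext() comes from 'tools'; listing it in .packages loads it on every worker
extensions <- foreach(f = fileNames, .combine = c, .packages = "tools") %dopar% {
  file_ext(f)
}

stopCluster(cl)  # release the worker processes
extensions</code></pre></div>
<p>Here the iterating variable runs over the values themselves rather than over row indices; either style fits the same foreach structure.</p>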
<h2>A comparison between parallel and non-parallel process</h2>
<p>To demonstrate the processing times, a simple math operation is performed on each row of the four-column matrix created below. The time taken by the parallel and non-parallel versions is compared as the number of rows in inputData is gradually increased.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">inputData <-<span class="st"> </span><span class="kw">matrix</span>(<span class="dv">1</span>:<span class="dv">800000</span>, <span class="dt">ncol=</span><span class="dv">4</span>) <span class="co"># prepare input data</span>
<span class="kw">head</span>(inputData)
<span class="co">#> [,1] [,2] [,3] [,4]</span>
<span class="co">#> [1,] 1 200001 400001 600001</span>
<span class="co">#> [2,] 2 200002 400002 600002</span>
<span class="co">#> [3,] 3 200003 400003 600003</span>
<span class="co">#> [4,] 4 200004 400004 600004</span>
<span class="co">#> [5,] 5 200005 400005 600005</span>
<span class="co">#> [6,] 6 200006 400006 600006</span>
<span class="co"># For each row of inputData, we'll compute the output as follows:</span>
<span class="co"># row output = col1 * col2 + col3 / col4</span></code></pre></div>
<h3>1. Non-parallel version</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">output_serial <-<span class="st"> </span><span class="kw">numeric</span>() <span class="co"># initialize output</span>
for (rowNum in <span class="kw">c</span>(<span class="dv">1</span>:<span class="kw">nrow</span>(inputData))) {
calculatedOutput <-<span class="st"> </span>inputData[rowNum, <span class="dv">1</span>] *<span class="st"> </span>inputData[rowNum, <span class="dv">2</span>] +<span class="st"> </span>inputData[rowNum, <span class="dv">3</span>] /<span class="st"> </span>inputData[rowNum, <span class="dv">4</span>] <span class="co"># compute output</span>
output_serial <-<span class="st"> </span><span class="kw">c</span>(output_serial, calculatedOutput) <span class="co"># append to output variable</span>
}</code></pre></div>
<h3>2. Parallel version</h3>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(doSNOW)
cl <-<span class="st"> </span><span class="kw">makeCluster</span>(<span class="dv">4</span>, <span class="dt">type=</span><span class="st">"SOCK"</span>) <span class="co"># 4 – number of cores</span>
<span class="kw">registerDoSNOW</span>(cl) <span class="co"># Register Backend Cores for Parallel Computing</span>
allRowIndices <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">1</span>:<span class="kw">nrow</span>(inputData)) <span class="co"># row numbers of inputData, that will be processed in parallel</span>
output_parallel <-<span class="st"> </span><span class="kw">foreach</span> (<span class="dt">rowNum =</span> allRowIndices, <span class="dt">.combine =</span> c) %dopar%<span class="st"> </span>{
calculatedOutput <-<span class="st"> </span>inputData[rowNum, <span class="dv">1</span>] *<span class="st"> </span>inputData[rowNum, <span class="dv">2</span>] +<span class="st"> </span>inputData[rowNum, <span class="dv">3</span>] /<span class="st"> </span>inputData[rowNum, <span class="dv">4</span>] <span class="co"># compute output</span>
<span class="kw">return</span> (calculatedOutput)
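}
<span class="kw">stopCluster</span>(cl) <span class="co"># release the worker processes once the loop is done</span></code></pre></div>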
<p><img src='screenshots/Parallel-vs-Non-Parallel-Processing-times.png' alt='Parallel vs non-parallel processing times' width='360' height='284' /></p>
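<p>The original does not show how the timings were collected. One way to measure them yourself is with <code>system.time()</code>, as in the sketch below. The smaller matrix <code>smallInput</code>, the two-worker cluster and all variable names are assumptions made for this example. Results will vary by machine and input size, and for very small inputs the cost of starting and communicating with the workers can outweigh the gain.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(doSNOW)  # also attaches foreach and snow

smallInput <- matrix(as.numeric(1:8000), ncol = 4)  # hypothetical, smaller test matrix
rowIndices <- seq_len(nrow(smallInput))

# time the serial loop
serialTime <- system.time({
  out_serial <- numeric(length(rowIndices))
  for (r in rowIndices) {
    out_serial[r] <- smallInput[r, 1] * smallInput[r, 2] + smallInput[r, 3] / smallInput[r, 4]
  }
})

# time the parallel loop on 2 workers
cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)
parallelTime <- system.time({
  out_parallel <- foreach(r = rowIndices, .combine = c) %dopar% {
    smallInput[r, 1] * smallInput[r, 2] + smallInput[r, 3] / smallInput[r, 4]
  }
})
stopCluster(cl)

all.equal(out_serial, out_parallel)  # both versions should compute the same values
c(serial = serialTime[["elapsed"]], parallel = parallelTime[["elapsed"]])  # elapsed seconds</code></pre></div>
<p>Repeating the measurement for several matrix sizes gives the kind of comparison described in this section.</p>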
<h2>References</h2>
<ol style="list-style-type: decimal">
<li><a href="http://cran.r-project.org/web/packages/foreach/foreach.pdf">foreach pdf</a></li>
<li><a href="http://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf">foreach vignette</a></li>
<li><a href="http://cran.r-project.org/web/packages/foreach/vignettes/nested.pdf">nested vignette</a></li>
<li><a href="http://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf">doParallel vignette</a></li>
</ol>
</div>
</div>
<div class="footer">
<hr>
<p>© 2016-17 Selva Prabhakaran. Powered by <a href="http://jekyllrb.com/">jekyll</a>,
<a href="http://yihui.name/knitr/">knitr</a>, and
<a href="http://johnmacfarlane.net/pandoc/">pandoc</a>.
This work is licensed under the <a href="http://creativecommons.org/licenses/by-nc/3.0/">Creative Commons License.</a>
</p>
</div>
</div> <!-- /container -->
<script src="//code.jquery.com/jquery.js"></script>
<script src="www/bootstrap.min.js"></script>
<script src="www/toc.js"></script>
<!-- MathJax Script -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<!-- Google Analytics Code -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-69351797-1', 'auto');
ga('send', 'pageview');
</script>
<style type="text/css">
/* reduce spacing around math formula*/
.MathJax_Display {
margin: 0em 0em;
}
body {
font-family: 'Helvetica Neue', Roboto, Arial, sans-serif;
font-size: 16px;
line-height: 27px;
font-weight: 400;
}
blockquote p {
line-height: 1.75;
color: #717171;
}
.well li{
line-height: 28px;
}
li.dropdown-header {
display: block;
padding: 0px;
font-size: 14px;
}
</style>
</body>
</html>