Home
This is a central place to keep track of things that I come across while implementing quad ropes.
Sparseness and flattening seem to go hand in hand. A sparse quad rope must not allocate an array for its entire size, so we need two variants of each function: one that preserves structure when the quad rope contains sparse elements, so as to maintain the memory benefits, and one that does not preserve structure (i.e. allocates a new underlying array) to maintain fast performance.
So it does not seem possible to implement a sparse quad rope variant and a flattening variant completely independently. However, it would be possible to first implement a quad rope variant with two types of each function, one that allocates a new underlying array for the entire size and one that does not.
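To make the two flavours concrete, here is a minimal sketch; toArray2D and fromArray2D are the conversions used elsewhere on this page, while mapStructural merely stands in for a structure-preserving traversal and is not an actual function name:

/// Sketch only: the structure-preserving variant keeps the tree (and any
/// sparse leaves) intact, while the flattening variant allocates one dense
/// array of the entire size.
let mapPreserving f rope = mapStructural f rope
let mapFlattening f rope = fromArray2D (Array2D.map f (toArray2D rope))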
Flattening quad ropes seems to have surprisingly little benefit; maybe the implementation is too high-level. In any case, there is no immediate performance improvement from flattening quad ropes into single arrays during e.g. map, and index-based tiling seems ineffective.
A possible reason is that tiling is currently implemented by slicing, which allocates new memory on the heap at each recursive step. Solutions that do not allocate new memory do not perform well either, but in the current implementation the loop is not very tight and involves some arithmetic.
Other functions on ArraySlice, however, also involve additional offset arithmetic, so I don't think this is the root cause. The function ArraySlice.fastGet is inlined and performs offset arithmetic on integers, which should be fast enough.
Update: After increasing s_max to 32, quad ropes perform convincingly:
C:\Users\fbie\src\rad-trees>benchmark -t 16
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.6.2 or later
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date 2016-08-23T08:00:24
# Threads 16
QuadRope.init 4450000,0 ns 39528,47 64
QuadRope.map 4578125,0 ns 67507,72 64
QuadRope.reduce 2258593,8 ns 99262,83 128
QuadRope.zip 5629687,5 ns 99382,32 64
QuadRope.hfold 3516406,3 ns 46504,61 128
QuadRope.vfold 3571875,0 ns 45673,25 128
QuadRope.hreduce 3134375,0 ns 71583,80 128
QuadRope.vreduce 3153906,3 ns 71227,65 128
Array2D.init 3205468,8 ns 26054,68 128
Array2D.map 4062500,0 ns 41666,67 64
Array2D.reduce 3885937,5 ns 19557,27 64
Array2D.zip 4901562,5 ns 89319,50 64
Array2D.hfold 4076562,5 ns 399479,51 64
Array2D.vfold 4639062,5 ns 667416,98 64
Array2D.hreduce 1975000,0 ns 10925,09 128
Array2D.vreduce 2143750,0 ns 11170,63 128
C:\Users\fbie\src\rad-trees>benchmark -t 16 -m mmult -s 20
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.6.2 or later
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date 2016-08-23T08:02:20
# Threads 16
mmult Array2D 912695,3 ns 2069,04 512
mmult Array2D.Parallel 39325000,0 ns 6040718,32 8
mmult QuadRope 766406,3 ns 1647,02 512
mmult QuadRope.Parallel 208007,8 ns 2604,17 2048
Performance on the p3 server with 16 cores is not convincing. It seems that instantiating objects is much more expensive there: parallel reduce on quad ropes, which does not re-build the quad rope tree structure, is even slightly faster than on arrays.
C:\Users\fbie\src\rad-trees>benchmark -t 1
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.6.2 or later
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date 2016-08-23T06:33:36
# Threads 1
QuadRope.init 22112500,0 ns 64549,72 16
QuadRope.map 40987500,0 ns 231615,70 8
QuadRope.reduce 35550000,0 ns 64549,72 8
QuadRope.zip 57262500,0 ns 216105,04 8
QuadRope.hfold 42200000,0 ns 64549,72 8
QuadRope.vfold 41362500,0 ns 92233,10 8
QuadRope.hreduce 55675000,0 ns 64549,72 8
QuadRope.vreduce 57350000,0 ns 184466,20 8
Array2D.init 21718750,0 ns 370634,31 16
Array2D.map 38387500,0 ns 92233,10 8
Array2D.reduce 33050000,0 ns 64549,72 8
Array2D.zip 53725000,0 ns 79056,94 8
Array2D.hfold 33250000,0 ns 0,00 8
Array2D.vfold 34487500,0 ns 319776,40 8
Array2D.hreduce 33312500,0 ns 135015,43 8
Array2D.vreduce 36575000,0 ns 64549,72 8
C:\Users\fbie\src\rad-trees>benchmark -t 16
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.6.2 or later
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date 2016-08-23T06:35:18
# Threads 16
QuadRope.init 7393750,0 ns 69409,71 64
QuadRope.map 7796875,0 ns 433576,15 32
QuadRope.reduce 3323437,5 ns 36680,87 128
QuadRope.zip 8209375,0 ns 160057,77 32
QuadRope.hfold 4376562,5 ns 68880,09 64
QuadRope.vfold 4439062,5 ns 70821,84 64
QuadRope.hreduce 7226562,5 ns 92950,88 64
QuadRope.vreduce 7140625,0 ns 109746,39 64
Array2D.init 3428125,0 ns 46116,55 128
Array2D.map 4240625,0 ns 56192,10 64
Array2D.reduce 3836718,8 ns 80303,82 128
Array2D.zip 4953125,0 ns 27559,91 64
Array2D.hfold 5259375,0 ns 360783,77 64
Array2D.vfold 4757812,5 ns 293266,63 64
Array2D.hreduce 1925390,6 ns 8528,40 256
Array2D.vreduce 2110156,3 ns 19295,45 128
Computing a van der Corput sequence is much slower on quad ropes than on arrays, for unknown reasons. One reason might be balancing, so here is a shot at improving the balancing algorithm:
/// Insert a quad rope into a list of quad ropes. Adjacent quad ropes
/// are merged using <code>merge</code> if their depths are equal. If
/// a merge is triggered, the function recurses on the list until
/// either the merge condition is not met anymore or there are no quad
/// ropes left to merge with.
let rec insert merge x xs =
    match xs with
    | [] -> [x]
    | y :: xs when depth x = depth y -> insert merge (merge y x) xs
    | _ -> x :: xs

/// Merge all quad ropes in the list using <code>merge</code>,
/// assuming that logical order is reversed, i.e. the west-most node
/// is at the head.
let mergeAll merge xs =
    match xs with
    | [] -> invalidArg "xs" "List of nodes cannot be empty."
    | [x] -> x
    | _ -> List.reduce (fun r l -> merge l r) xs

/// Balance rope horizontally.
let hbalance rope =
    let rec hbalance rope =
        if isBalancedH rope then
            rope
        else
            mergeAll flatNode (collect rope [])
    and collect rope rs =
        match rope with
        | _ when isBalancedH rope -> insert flatNode rope rs
        | Empty -> rs
        | Node (_, _, _, ne, nw, Empty, Empty) -> collect ne (collect nw rs)
        | Node (_, _, _, ne, nw, sw, se) ->
            insert flatNode (node (hbalance ne) (hbalance nw) (hbalance sw) (hbalance se)) rs
        | _ -> insert flatNode rope rs
    hbalance rope

/// Balance rope vertically.
let vbalance rope =
    let rec vbalance rope =
        if isBalancedV rope then
            rope
        else
            mergeAll thinNode (collect rope [])
    and collect rope rs =
        match rope with
        | _ when isBalancedV rope -> insert thinNode rope rs
        | Empty -> rs
        | Node (_, _, _, Empty, nw, sw, Empty) -> collect sw (collect nw rs)
        | Node (_, _, _, ne, nw, sw, se) ->
            insert thinNode (node (vbalance ne) (vbalance nw) (vbalance sw) (vbalance se)) rs
        | _ -> insert thinNode rope rs
    vbalance rope
This algorithm is much closer to Boehm et al.'s variant, which means that the resulting ropes are less compact. Worse still, the performance is slightly lower. We therefore return to the balancing variant that is already implemented and conclude that its performance is sufficient.
>benchmark -m vdc -s 15
# ...
vdc Array2D 1644140,6 ns 26329,79 256
vdc Array2D.Parallel 2775000,0 ns 58787,30 128
vdc QuadRope 2033593,8 ns 39330,68 128
vdc QuadRope.Parallel 2659375,0 ns 83268,20 128
Hence, the root cause for the slow performance of the vdc benchmark must be located elsewhere.
It is straightforward to avoid the copying overhead of array slicing notation by (re-)introducing views on arrays. We call these 'a ArraySlice. Reallocation happens only if we absolutely must create a new array, for instance because of a call to map.
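Here is a minimal, self-contained sketch of what such a view type could look like; the field names and functions are illustrative and may differ from the actual ArraySlice module:

/// A view on a shared 2D array, identified by an offset and an extent.
type 'a ArraySlice =
    { arr : 'a [,];          // shared backing array
      i : int;               // row offset into arr
      j : int;               // column offset into arr
      h : int;               // number of rows in the view
      w : int }              // number of columns in the view

module ArraySlice =

    /// Read without copying; only integer offset arithmetic, no allocation.
    let inline fastGet (slc : 'a ArraySlice) r c =
        slc.arr.[slc.i + r, slc.j + c]

    /// Slicing produces a new view on the same backing array in O(1).
    let slice r c h w (slc : 'a ArraySlice) =
        { slc with i = slc.i + r; j = slc.j + c; h = h; w = w }

    /// map must allocate a fresh backing array, since the elements change.
    let map f (slc : 'a ArraySlice) =
        { arr = Array2D.init slc.h slc.w (fun r c -> f (fastGet slc r c));
          i = 0; j = 0; h = slc.h; w = slc.w }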
Most operations are still slower than on arrays. However, of the four simple algorithms we use as examples, i.e. van der Corput sequences, Fibonacci sequences, matrix multiplication and prime factorization, quad ropes now outperform arrays in three, if only by a small margin for matrix multiplication.
In particular, the task-based approach of quad ropes deals much better with nested parallelism than default arrays.
>benchmark
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-08-03T09:25:52
QuadRope.init 26400000,0 ns 356243,91 16
QuadRope.map 47187500,0 ns 575090,57 8
QuadRope.reduce 41000000,0 ns 527046,28 8
QuadRope.zip 66675000,0 ns 1598827,70 4
QuadRope.hfold 47162500,0 ns 565347,73 8
QuadRope.vfold 49800000,0 ns 687689,37 8
QuadRope.hreduce 61537500,0 ns 1078595,41 8
QuadRope.vreduce 64875000,0 ns 1029090,75 4
Array2D.init 25462500,0 ns 218898,76 16
Array2D.map 41862500,0 ns 418537,47 8
Array2D.reduce 38762500,0 ns 748725,77 8
Array2D.zip 61462500,0 ns 944740,56 8
Array2D.hfold 38687500,0 ns 1227930,17 8
Array2D.vfold 48937500,0 ns 1532347,96 8
Array2D.hreduce 39237500,0 ns 365386,02 8
Array2D.vreduce 49762500,0 ns 1168822,41 8
>benchmark -t 4
QuadRope.init 14912500,0 ns 795876,96 32
QuadRope.map 18950000,0 ns 925994,21 16
QuadRope.reduce 13771875,0 ns 907033,90 32
QuadRope.zip 25037500,0 ns 1362697,49 16
QuadRope.hfold 20081250,0 ns 1769742,31 16
QuadRope.vfold 18631250,0 ns 1586468,93 16
QuadRope.hreduce 18168750,0 ns 1508094,48 16
QuadRope.vreduce 17637500,0 ns 1415685,94 16
Array2D.init 12218750,0 ns 323420,31 32
Array2D.map 16375000,0 ns 485912,66 16
Array2D.reduce 10546875,0 ns 166177,67 32
Array2D.zip 21325000,0 ns 263193,53 16
Array2D.hfold 11625000,0 ns 117851,13 32
Array2D.vfold 14056250,0 ns 232364,06 32
Array2D.hreduce 10487500,0 ns 95697,37 32
Array2D.vreduce 13262500,0 ns 233946,37 32
>benchmark -m mmult -s 20
mmult Array2D 1048046,9 ns 23441,12 256
mmult Array2D.Parallel 111525000,0 ns 861603,80 4
mmult QuadRope 1291015,6 ns 17876,99 256
mmult QuadRope.Parallel 875976,6 ns 13818,35 512
>benchmark -m primes -s 100
primes Array2D.Parallel 332950000,0 ns 19254797,38 2
primes QuadRope.Parallel 64350000,0 ns 2285825,89 4
>benchmark -m fibseq -s 1000
fibseq Array2D 5031250,0 ns 79330,97 64
fibseq QuadRope 1242968,8 ns 16654,46 256
>benchmark -m vdc -s 15
vdc Array2D 1612500,0 ns 15494,24 256
vdc Array2D.Parallel 2855468,8 ns 127164,95 128
vdc QuadRope 1997656,3 ns 35905,41 128
vdc QuadRope.Parallel 2632812,5 ns 77602,42 128
Allocating many small arrays is much slower than allocating a single large array. Chopping this array into smaller pieces, however, seems to be acceptably fast. I guess the compiler is performing some optimizations in the background, which I will need to investigate further before I can explain this speedup.
Anyway, the improvement is clearly visible:
small arrays seq 63200000,0 ns 1116791,04 4
large array seq 38337500,0 ns 1044246,80 8
small arrays par 40587500,0 ns 2575963,95 8
large array par 29287500,0 ns 4054875,32 16
The relative speedup from parallelizing the operations, however, is much smaller.
As a consequence, we can drastically improve the performance of reallocate. We can implement init in terms of fromArray2D and then implement reallocate as toArray2D >> fromArray2D, which yields a more than four-fold improvement.
Old:
QuadRope.reallocate 263800000,0 ns 4151305,02 2
New:
QuadRope.reallocate 62100000,0 ns 2125245,08 4
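As a sketch of this change (toArray2D and fromArray2D as used above; the exact signatures in the repository may differ):

/// init builds one dense array and then chops it into leaves.
let init h w f = fromArray2D (Array2D.init h w f)

/// reallocate becomes a round trip through a single dense array: materialize
/// the rope once and rebuild a compact tree from the result.
let reallocate rope = (toArray2D >> fromArray2D) rope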
(Note: shame on me for not keeping this updated.)
E:\rad-trees>benchmark
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T10:04:51
QuadRope.init 61237500,0 ns 693346,35 8
QuadRope.map 44725000,0 ns 901002,53 8
QuadRope.reduce 40725000,0 ns 299304,75 8
QuadRope.zip 65200000,0 ns 1046156,99 4
QuadRope.hfold 50687500,0 ns 497389,02 8
QuadRope.vfold 48237500,0 ns 602223,89 8
QuadRope.hreduce 63425000,0 ns 667187,30 4
QuadRope.vreduce 64800000,0 ns 1005540,21 4
Array2D.init 23868750,0 ns 295289,99 16
Array2D.map 40362500,0 ns 557306,27 8
Array2D.reduce 39375000,0 ns 403973,32 8
Array2D.zip 62462500,0 ns 759317,13 8
Array2D.hfold 38750000,0 ns 549621,08 8
Array2D.vfold 50587500,0 ns 704770,45 8
Array2D.hreduce 39362500,0 ns 480342,76 8
Array2D.vreduce 49037500,0 ns 870125,31 8
E:\rad-trees>benchmark -t 4
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T10:06:43
# Threads 4
QuadRope.init 39712500,0 ns 3765269,84 8
QuadRope.map 20193750,0 ns 1672762,29 16
QuadRope.reduce 12987500,0 ns 689737,52 32
QuadRope.zip 27518750,0 ns 1924136,26 16
QuadRope.hfold 19450000,0 ns 1595838,77 16
QuadRope.vfold 19587500,0 ns 1941237,43 16
QuadRope.hreduce 17243750,0 ns 1484424,34 16
QuadRope.vreduce 17493750,0 ns 1381125,87 16
Array2D.init 17356250,0 ns 1173732,32 16
Array2D.map 20156250,0 ns 402088,73 16
Array2D.reduce 10746875,0 ns 51979,06 32
Array2D.zip 27625000,0 ns 1497973,17 16
Array2D.hfold 12075000,0 ns 262367,69 32
Array2D.vfold 14993750,0 ns 902686,96 32
Array2D.hreduce 11659375,0 ns 748472,11 32
Array2D.vreduce 13896875,0 ns 310427,15 32
Looking at the benchmark results, quad ropes are now quite competitive with 2D arrays. However, init is a weak spot: it is roughly three times slower than on 2D arrays. It seems that allocating many small arrays, in addition to the tree structure, is much more costly than allocating a single large array. If we remove the actual allocation from the leaves and only execute the initialization function appropriately many times, we can improve performance by about a third.
What is interesting is that map, which in principle performs the same operation, is just as fast as on 2D arrays, so there might be something going on under the hood that I am not aware of.
Unfortunately, in most "real-life" benchmarks, quad ropes still can't keep up:
E:\rad-trees>benchmark -m vdc -s 20
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T10:10:10
vdc Array2D 64075000,0 ns 1986237,37 4
vdc Array2D.Parallel 1916850000,0 ns 31358190,49 2
vdc QuadRope 156550000,0 ns 13573769,64 2
vdc QuadRope.Parallel 183350000,0 ns 13725584,38 2
E:\rad-trees>benchmark -m mmult -s 100
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T10:12:42
mmult Array2D 95125000,0 ns 1008643,20 4
mmult QuadRope 295350000,0 ns 5467327,20 2
mmult QuadRope.Parallel 139000000,0 ns 11628509,03 2
E:\rad-trees>benchmark -m fibseq -s 1000
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T10:18:02
fibseq Array2D 5154687,5 ns 54848,47 64
fibseq QuadRope 1440625,0 ns 11617,04 256
E:\rad-trees>benchmark -m primes -s 100
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T10:23:48
primes Array2D.Parallel 300450000,0 ns 23147414,06 2
primes QuadRope.Parallel 66300000,0 ns 2213594,36 4
Since init seems to be so slow, it may be an interesting idea to defer leaf initialization and to apply fusion to the quad rope leaves, so that we save as many intermediate materializations as possible. Such an experiment is implemented on fbie/lazy-leaves.
Every leaf is a record of two starting indices (for efficient slicing), two size values, an initialization function and a lazy array, which will be initialized using the initialization function when forced:
type 'a LazyLeaf =
    { i : int;                // row offset, for efficient slicing
      j : int;                // column offset, for efficient slicing
      h : int;                // height of the leaf
      w : int;                // width of the leaf
      f : int -> int -> 'a;   // initialization function
      arr : 'a [,] Lazy }     // backing array, materialized only when forced
The idea is that function calls on a quad rope produce a new quad rope whose leaves are lazily initialized. This immediately gives us some parallelism for evaluating all leaves and also avoids materializing all leaves if we only read from part of the quad rope.
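A sketch of how construction and map could work on such leaves, composing with the stored initialization function instead of forcing the array (illustrative only; the code on fbie/lazy-leaves may differ):

module LazyLeaf =

    /// Build a lazy leaf; the backing array is materialized only when forced.
    let init i j h w f =
        { i = i; j = j; h = h; w = w; f = f;
          arr = lazy (Array2D.init h w (fun r c -> f (i + r) (j + c))) }

    /// Force the backing array and read from it.
    let get (leaf : 'a LazyLeaf) r c =
        leaf.arr.Value.[r, c]

    /// map fuses with the stored initializer, so no intermediate array is
    /// allocated until the resulting leaf is forced.
    let map g (leaf : 'a LazyLeaf) =
        init leaf.i leaf.j leaf.h leaf.w (fun r c -> g (leaf.f r c))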
The overhead from this seems rather significant, however. These benchmarks run on a quad rope with already materialized arrays:
E:\rad-trees>benchmark
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T09:49:37
QuadRope.init 64200000,0 ns 1432751,82 4
QuadRope.map 62900000,0 ns 561496,02 8
QuadRope.reduce 41537500,0 ns 404188,14 8
QuadRope.zip 105100000,0 ns 603232,04 4
QuadRope.hfold 73075000,0 ns 1236313,97 4
QuadRope.vfold 74150000,0 ns 2986543,90 4
QuadRope.hreduce 86175000,0 ns 1080444,87 4
QuadRope.vreduce 85325000,0 ns 1086853,26 4
Array2D.init 25612500,0 ns 261539,25 16
Array2D.map 42437500,0 ns 598754,49 8
Array2D.reduce 39875000,0 ns 475073,09 8
Array2D.zip 63375000,0 ns 944648,67 4
Array2D.hfold 37987500,0 ns 578701,85 8
Array2D.vfold 48600000,0 ns 772352,11 8
Array2D.hreduce 39600000,0 ns 411636,30 8
Array2D.vreduce 48937500,0 ns 453573,77 8
E:\rad-trees>benchmark -t 4
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T09:51:12
# Threads 4
QuadRope.init 26087500,0 ns 1210558,30 16
QuadRope.map 28243750,0 ns 2604325,00 16
QuadRope.reduce 13278125,0 ns 885655,61 32
QuadRope.zip 40850000,0 ns 2799925,59 8
QuadRope.hfold 28418750,0 ns 1999576,78 16
QuadRope.vfold 27668750,0 ns 1166685,27 16
QuadRope.hreduce 27325000,0 ns 1415992,49 16
QuadRope.vreduce 26468750,0 ns 1734116,91 16
Array2D.init 16059375,0 ns 637964,50 32
Array2D.map 20387500,0 ns 798109,75 16
Array2D.reduce 11171875,0 ns 710054,59 32
Array2D.zip 25412500,0 ns 537806,76 16
Array2D.hfold 12100000,0 ns 790130,09 32
Array2D.vfold 15415625,0 ns 500444,68 32
Array2D.hreduce 10756250,0 ns 50603,99 32
Array2D.vreduce 14109375,0 ns 551640,94 32
Also, the more complex benchmarks that sparked this idea do not perform any faster:
E:\rad-trees>benchmark -m vdc -s 20
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T09:32:05
vdc Array2D 63125000,0 ns 2144922,95 4
vdc Array2D.Parallel 1919200000,0 ns 40053575,23 2
vdc QuadRope 748350000,0 ns 11671546,60 2
vdc QuadRope.Parallel 534650000,0 ns 17496110,68 2
E:\rad-trees>benchmark -m mmult -s 100
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T09:15:35
mmult Array2D 98650000,0 ns 1297433,36 4
mmult QuadRope 336200000,0 ns 5638163,61 2
mmult QuadRope.Parallel 143650000,0 ns 12563284,25 2
E:\rad-trees>benchmark -m fibseq -s 1000
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T09:16:15
fibseq Array2D 5275000,0 ns 195183,99 64
fibseq QuadRope 1659375,0 ns 17056,79 256
E:\rad-trees>benchmark -m primes -s 100
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date 2016-07-29T10:36:01
primes Array2D.Parallel 305250000,0 ns 25173012,44 2
primes QuadRope.Parallel 289950000,0 ns 31296476,55 2
Therefore, I won't continue developing lazy leaves.
Nevertheless, we need a faster initialization function. Parallel initialization does not quite cut it either; parallel array initialization is still faster.
Scan was broken because we used inclusive scan for substructures, which led to value duplication. We can implement inclusive scan by simply concatenating the result of the exclusive scan to the initial state, but that is slightly inefficient.
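As a 1D illustration of how the two scan flavours relate (only to fix terminology; the quad rope implementation works on two-dimensional prefix states):

// F#'s List.scan returns the initial state followed by all intermediate
// results, so both flavours can be carved out of its output.
let exclusiveScan f state xs =
    List.scan f state xs |> List.take (List.length xs)   // [state; s1; ...; s(n-1)]

let inclusiveScan f state xs =
    List.scan f state xs |> List.tail                     // [s1; s2; ...; sn]

// exclusiveScan (+) 0 [1; 2; 3] = [0; 1; 3]
// inclusiveScan (+) 0 [1; 2; 3] = [1; 3; 6]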
More surprisingly, vscan seems to be quite a bit faster on quad ropes than on 2D arrays:
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date 2016-04-14T15:24:17
...
QuadRope.hscan 124700000,0 ns 2931628,29 4
Array2D.hscan 97975000,0 ns 4301889,12 4
QuadRope.vscan 102325000,0 ns 1258581,65 4
Array2D.vscan 151950000,0 ns 4591356,61 2
...
Based on some benchmarking, it seems that we should go forward with inlining all utility functions. Inlining functions on quad ropes themselves has a slightly negative effect and also forces us to make some functions public that would better remain private.
Some inlining:
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date 2016-04-13T11:37:14
QuadRope.init 61475000,0 ns 758745,31 8
Array2D.init 24381250,0 ns 280206,40 16
QuadRope.map 46237500,0 ns 505009,63 8
Array2D.map 43312500,0 ns 970484,56 8
QuadRope.hfold 147400000,0 ns 3016620,63 2
Array2D.hfold 136600000,0 ns 3373096,17 2
QuadRope.vfold 141200000,0 ns 1766981,10 2
Array2D.vfold 126400000,0 ns 1663329,99 2
QuadRope.hreduce 161650000,0 ns 3223610,81 2
Array2D.hreduce 139600000,0 ns 2401388,49 2
QuadRope.vreduce 151800000,0 ns 3544949,46 2
Array2D.vreduce 155900000,0 ns 1197219,00 2
QuadRope.hcat 131,2 ns 2,01 2097152
QuadRope.hcat + hbalance 496093,8 ns 7322,41 1024
QuadRope.hcat + reallocate 955150000,0 ns 10729114,08 2
Array2D.hcat 20362500,0 ns 1483649,31 16
QuadRope.vcat 128,7 ns 1,56 2097152
QuadRope.vcat + vbalance 452441,4 ns 6890,75 1024
QuadRope.vcat + reallocate 964350000,0 ns 13922503,77 2
Array2D.vcat 19362500,0 ns 619279,38 16
QuadRope.index 452,7 ns 10,21 1048576
Array2D.index 6,1 ns 0,05 67108864
QuadRope.set 4306,0 ns 102,43 65536
Array2D.set 9293750,0 ns 223023,73 32
All utility functions inlined:
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date 2016-04-13T11:42:11
QuadRope.init 58125000,0 ns 677003,20 8
Array2D.init 22337500,0 ns 222829,03 16
QuadRope.map 43712500,0 ns 1012165,58 8
Array2D.map 41112500,0 ns 757944,04 8
QuadRope.hfold 148250000,0 ns 4547587,88 2
Array2D.hfold 136800000,0 ns 4263540,52 2
QuadRope.vfold 138950000,0 ns 1964264,07 2
Array2D.vfold 125225000,0 ns 982132,03 4
QuadRope.hreduce 157800000,0 ns 3029484,74 2
Array2D.hreduce 124650000,0 ns 1550985,35 4
QuadRope.vreduce 148200000,0 ns 1960725,49 2
Array2D.vreduce 134300000,0 ns 2679137,51 2
QuadRope.hcat 156,1 ns 2,27 2097152
QuadRope.hcat + hbalance 488183,6 ns 3780,81 1024
QuadRope.hcat + reallocate 912700000,0 ns 5452828,01 2
Array2D.hcat 18850000,0 ns 508777,13 16
QuadRope.vcat 160,7 ns 2,44 2097152
QuadRope.vcat + vbalance 443066,4 ns 2394,29 1024
QuadRope.vcat + reallocate 928050000,0 ns 14808874,97 2
Array2D.vcat 18775000,0 ns 313470,71 16
QuadRope.index 453,1 ns 4,06 1048576
Array2D.index 6,1 ns 0,07 67108864
QuadRope.set 4225,2 ns 57,30 65536
Array2D.set 9240625,0 ns 150987,78 32
Aggressively inlined:
# OS Microsoft Windows NT 6.2.9200.0
# .NET vers. 4.0.30319.42000
# 64-bit OS True
# 64-bit proc False
# CPU Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date 2016-04-13T11:31:20
QuadRope.init 60875000,0 ns 1611589,97 8
Array2D.init 24068750,0 ns 1654762,17 16
QuadRope.map 46500000,0 ns 1699673,17 8
Array2D.map 42175000,0 ns 957789,70 8
QuadRope.hfold 152850000,0 ns 6936417,58 2
Array2D.hfold 133300000,0 ns 2699794,23 2
QuadRope.vfold 141800000,0 ns 2382808,80 2
Array2D.vfold 126000000,0 ns 2527625,15 2
QuadRope.hreduce 164550000,0 ns 2948163,27 2
Array2D.hreduce 131350000,0 ns 1453921,90 2
QuadRope.vreduce 156600000,0 ns 1822696,42 2
Array2D.vreduce 135300000,0 ns 3001851,28 2
QuadRope.hcat 161,7 ns 2,57 2097152
QuadRope.hcat + hbalance 605078,1 ns 5026,11 512
QuadRope.hcat + reallocate 1037400000,0 ns 80268923,00 2
Array2D.hcat 19975000,0 ns 1108364,66 16
QuadRope.vcat 170,3 ns 2,26 2097152
QuadRope.vcat + vbalance 875195,3 ns 11180,11 512
QuadRope.vcat + reallocate 1055400000,0 ns 31931175,99 2
Array2D.vcat 20212500,0 ns 951606,83 16
QuadRope.index 520,7 ns 21,41 524288
Array2D.index 6,7 ns 0,34 67108864
QuadRope.set 4753,1 ns 181,87 65536
Array2D.set 8881250,0 ns 461588,82 32
Most functions on quad ropes are recursive and can therefore not be inlined.
During lazy tree splitting, if threads become available, we want to pause traversal, split the remaining work, that is, the part of the quad rope that has not yet been processed, and continue in parallel on the sub-trees of the split.
To do this efficiently by splitting trees at their child nodes, intermediate trees must be allowed to be non-regular, such that e.g. the NW sub-tree may be an empty node.
NB: In Bergstrom et al., splitting must be guaranteed to produce ropes that differ in length by at most one. If we want to split efficiently (i.e. constant time vs. O(n log n)), we have to sacrifice this guarantee. Moreover, simpler splitting also results in simpler subsequent merging.
We also need a way to remember where we originally split the "context" of the sequential execution, because we must later re-insert the already processed part of the rope into its old location. Smells like we want another zipper there.
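A rough sketch of what such a context could look like, using a simplified quad rope type without the size metadata that the real one carries (names and shape are illustrative only):

type 'a QuadRope =
    | Empty
    | Leaf of 'a [,]
    | Node of 'a QuadRope * 'a QuadRope * 'a QuadRope * 'a QuadRope   // ne, nw, sw, se

/// One constructor per quadrant that currently holds the "hole"; the three
/// sibling quadrants are kept so that the processed part can later be
/// re-inserted into its original location.
type 'a Path =
    | Top
    | NE of 'a Path * 'a QuadRope * 'a QuadRope * 'a QuadRope
    | NW of 'a Path * 'a QuadRope * 'a QuadRope * 'a QuadRope
    | SW of 'a Path * 'a QuadRope * 'a QuadRope * 'a QuadRope
    | SE of 'a Path * 'a QuadRope * 'a QuadRope * 'a QuadRope

/// Plug a processed sub-rope back into the position recorded by the path.
let rec plug rope path =
    match path with
    | Top -> rope
    | NE (p, nw, sw, se) -> plug (Node (rope, nw, sw, se)) p
    | NW (p, ne, sw, se) -> plug (Node (ne, rope, sw, se)) p
    | SW (p, ne, nw, se) -> plug (Node (ne, nw, rope, se)) p
    | SE (p, ne, nw, sw) -> plug (Node (ne, nw, sw, rope)) p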
$ ./benchmark # on mono
# OS Unix 14.5.0.0
# .NET vers. 4.0.30319.17020
# 64-bit OS True
# 64-bit proc True
# CPU ; 8 "cores"
# Date 2016-04-12T09:47:06
...
QuadRope.hcat 99.7 ns 0.70 4194304
QuadRope.hcat + hbalance 561328.1 ns 4801.85 512
QuadRope.hcat + reallocate 647650000.0 ns 3742028.56 2
Array2D.hcat 16025000.0 ns 224768.40 16
QuadRope.vcat 100.1 ns 1.14 4194304
QuadRope.vcat + vbalance 501562.5 ns 5109.74 512
QuadRope.vcat + reallocate 647500000.0 ns 2943920.29 2
...
While re-balancing large quad ropes seems to be no problem, re-allocation is a big deal (in the literal sense of the word). Re-allocation of the data structure might nevertheless be necessary if a user decides to concatenate small arrays in an "adversarial" manner.
A useful re-balancing/re-allocation strategy might be to allow re-allocation only on sufficiently small arrays and to later fall back to simple balancing.
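A sketch of that strategy; rows, cols and the threshold are assumed names, reallocate and hbalance as above:

/// Rebuild small ropes densely; for larger ropes only re-balance the tree.
let smartRebuild rope =
    if rows rope * cols rope <= smallThreshold then
        reallocate rope      // small enough to pay for a dense rebuild
    else
        hbalance rope        // keep the tree and only re-balance it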
Filtering is a problem because the resulting 2D array might not be regular. If we cannot guarantee regularity, we need to prohibit the use of functions that may produce irregular structures.
We can still allow filtering on arrays of width or height 1. This is implemented now.
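A sketch of how the height-1 restriction keeps the result regular; rows, cols, get and fromArray2D are assumed helpers, not the actual implementation:

/// Filtering a single row keeps the result regular: the height stays 1 and
/// only the number of columns changes.
let hfilter p rope =
    if rows rope <> 1 then
        invalidArg "rope" "hfilter requires a quad rope of height 1."
    let kept = [| for c in 0 .. cols rope - 1 do
                      let x = get rope 0 c
                      if p x then yield x |]
    fromArray2D (Array2D.init 1 kept.Length (fun _ c -> kept.[c]))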
Reducing empty lists should return nothing.
There are still problems with fold, but they are related to splitting.