Faster disjoint/isSubsetOf for Set via unbalanced splitting. #865

alexfmpe · 2022-10-29T03:26:05Z

Unlike, say, union/intersection, disjoint doesn't return a new structure. It can skip the re-balancing work because it immediately inspects and discards the trees produced by each split. This allows significant constant-factor speedups.

  member:                    OK (0.14s)
    133  μs ±  12 μs,   0 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  insert:                    OK (0.18s)
    678  μs ±  47 μs, 2.6 MB allocated,  55 KB copied, 8.0 MB peak memory,       same as baseline
  map:                       OK (0.22s)
    106  μs ± 6.0 μs, 557 KB allocated,  16 KB copied, 8.0 MB peak memory,       same as baseline
  filter:                    OK (0.17s)
    80.2 μs ± 6.4 μs,  80 KB allocated, 1.2 KB copied, 8.0 MB peak memory,       same as baseline
  partition:                 OK (0.16s)
    148  μs ±  15 μs, 239 KB allocated, 4.0 KB copied, 8.0 MB peak memory,       same as baseline
  fold:                      OK (0.31s)
    34.9 ns ± 1.5 ns, 695 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  delete:                    OK (0.12s)
    454  μs ±  44 μs, 1.3 MB allocated, 372 B  copied, 8.0 MB peak memory,       same as baseline
  findMin:                   OK (0.27s)
    15.7 ns ± 784 ps,  31 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  findMax:                   OK (0.30s)
    17.1 ns ± 842 ps,  31 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  deleteMin:                 OK (0.25s)
    114  ns ± 7.0 ns, 471 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  deleteMax:                 OK (0.25s)
    117  ns ± 7.0 ns, 511 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  unions:                    OK (0.31s)
    147  μs ± 7.2 μs, 472 KB allocated,  10 KB copied, 8.0 MB peak memory,       same as baseline
  union:                     OK (0.16s)
    148  μs ±  11 μs, 467 KB allocated,  10 KB copied, 8.0 MB peak memory,       same as baseline
  difference:                OK (0.21s)
    102  μs ± 7.2 μs, 239 KB allocated, 2.4 KB copied, 8.0 MB peak memory,       same as baseline
  intersection:              OK (0.20s)
    49.8 μs ± 3.5 μs,  80 KB allocated, 835 B  copied, 8.0 MB peak memory,       same as baseline
  fromList:                  OK (0.28s)
    67.4 μs ± 4.2 μs, 271 KB allocated, 6.9 KB copied, 8.0 MB peak memory,       same as baseline
  fromList-desc:             OK (0.30s)
    600  μs ±  25 μs, 2.6 MB allocated,  56 KB copied, 8.0 MB peak memory,       same as baseline
  fromAscList:               OK (0.27s)
    127  μs ± 8.9 μs, 414 KB allocated, 8.3 KB copied, 8.0 MB peak memory,       same as baseline
  fromDistinctAscList:       OK (0.21s)
    50.2 μs ± 2.9 μs, 159 KB allocated, 3.2 KB copied, 8.0 MB peak memory,       same as baseline
  disjoint:false:            OK (0.22s)
    24.9 ns ± 1.7 ns,  31 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  disjoint:true:             OK (0.12s)
    111  μs ±  10 μs, 153 KB allocated,  61 B  copied, 8.0 MB peak memory, 23% less than baseline
  isSubsetOf:true:           OK (0.16s)
    18.6 μs ± 1.6 μs,   0 B  allocated,   0 B  copied, 8.0 MB peak memory,       same as baseline
  isSubsetOf:false:          OK (0.43s)
    100  ns ± 4.6 ns, 463 B  allocated,   0 B  copied, 8.0 MB peak memory, 63% less than baseline
  null.intersection:false:   OK (0.21s)
    49.5 μs ± 3.6 μs,  80 KB allocated, 834 B  copied, 8.0 MB peak memory,       same as baseline
  null.intersection:true:    OK (0.15s)
    154  μs ±  13 μs, 293 KB allocated, 174 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:member:             OK (0.21s)
    801  μs ±  48 μs, 2.4 MB allocated, 402 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:insert:             OK (0.23s)
    864  μs ±  52 μs, 3.6 MB allocated, 106 KB copied, 8.0 MB peak memory,       same as baseline
  alterF:delete:             OK (0.14s)
    536  μs ±  46 μs, 1.9 MB allocated, 432 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:four:               OK (0.20s)
    813  μs ±  74 μs, 2.4 MB allocated, 388 B  copied, 8.0 MB peak memory,       same as baseline
  alterF:four:strings:       OK (0.22s)
    1.68 ms ± 120 μs, 2.5 MB allocated, 413 B  copied, 8.0 MB peak memory,       same as baseline
  alterF_naive:four:         OK (0.17s)
    1.39 ms ±  88 μs, 2.0 MB allocated, 327 B  copied, 8.0 MB peak memory,       same as baseline
  alterF_naive:four:strings: OK (0.27s)
    4.16 ms ± 351 μs, 2.1 MB allocated, 397 B  copied, 8.0 MB peak memory,       same as baseline

All 32 tests passed (7.14s)
Benchmark set-benchmarks: FINISH
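
To make the idea concrete, here is a minimal sketch on a toy binary search tree. The `Tree` type and the unbalanced `fromList` are assumptions for illustration only; they are not the size-annotated `Set` from Data.Set.Internal. The split rebuilds the pieces with the plain constructor where the real `splitMember` would call `link`, and `disjoint` only inspects the pieces before dropping them.

```haskell
-- Toy model (assumption): a plain BST without size or balance information.
data Tree a = Tip | Bin a (Tree a) (Tree a)

-- Split around x without rebalancing: the pieces are valid search trees
-- but may be deeper than a balanced tree of the same size.
splitMemberUnbalanced :: Ord a => a -> Tree a -> (Tree a, Bool, Tree a)
splitMemberUnbalanced _ Tip = (Tip, False, Tip)
splitMemberUnbalanced x (Bin y l r) = case compare x y of
  LT -> let (lt, found, gt) = splitMemberUnbalanced x l
        in (lt, found, Bin y gt r)   -- plain constructor, no 'link'
  GT -> let (lt, found, gt) = splitMemberUnbalanced x r
        in (Bin y l lt, found, gt)
  EQ -> (l, True, r)

-- The split pieces are only inspected and then dropped,
-- so nothing downstream depends on their balance.
disjoint :: Ord a => Tree a -> Tree a -> Bool
disjoint Tip _ = True
disjoint _ Tip = True
disjoint (Bin x l r) t =
  let (lt, found, gt) = splitMemberUnbalanced x t
  in not found && disjoint l lt && disjoint r gt

-- Hypothetical helper: naive, unbalanced insertion, just to build inputs.
fromList :: Ord a => [a] -> Tree a
fromList = foldr insert Tip
  where
    insert x Tip = Bin x Tip Tip
    insert x t@(Bin y l r) = case compare x y of
      LT -> Bin y (insert x l) r
      GT -> Bin y l (insert x r)
      EQ -> t
```

This only shows the shape of the change. It says nothing about whether the asymptotic bounds survive once one side is allowed to become deep.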

alexfmpe · 2022-10-29T03:30:13Z

The same should be applicable to Map, but there disjoint and isSubmapOf use different splits (splitMember and splitLookup, respectively). Should I duplicate both to replace link with bin?
FWIW, replacing splitMember with splitMember k0 m = let (a,b,c) = splitLookup k0 m in (a, isJust b, c) seems to cause no change in the benchmarks, but I don't know how fragile the optimizations involved are.
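
For reference, the replacement mentioned above can be written against the public Data.Map API roughly as follows. The name `splitMemberViaLookup` is hypothetical; the real `splitMember` for maps lives in Data.Map.Internal, and this sketch does not claim to reproduce how GHC optimizes it there.

```haskell
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Maybe (isJust)

-- splitMember expressed through splitLookup: keep both halves and
-- reduce the looked-up value to a membership flag.
splitMemberViaLookup :: Ord k => k -> Map k a -> (Map k a, Bool, Map k a)
splitMemberViaLookup k0 m =
  let (lt, found, gt) = Map.splitLookup k0 m
  in (lt, isJust found, gt)
```

Defining it this way would leave only one split to duplicate for an unbalanced variant, provided the intermediate `Maybe` is reliably optimized away.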

treeowl · 2022-10-29T03:31:09Z

I'm going to need to see a proof that the stated time bounds still hold. Or if they don't, some reasonably tight new bound. We need to be sure that this doesn't introduce some nasty cases that our benchmarks don't happen to catch.

alexfmpe · 2022-10-29T03:31:46Z

containers/src/Data/Set/Internal.hs

+-- Same as 'splitMember' but skips re-balancing by using 'bin' instead of 'link'.
+-- Attempting to build new trees out of these will error when re-balancing but
+-- this can improve performance when the resulting trees are disposable.
+splitMemberUnbalanced :: Ord a => a -> Set a -> (Set a,Bool,Set a)


Tried to factor out bin/link to avoid code duplication but that made the involved functions 2-3x slower.

treeowl · 2022-10-29T23:00:11Z

Did you see my question? The bounds we have for these operations lean on the highly nontrivial published proofs for bounds on intersection and difference. My concern is that by allowing one set of trees to become unbalanced (hence potentially deep for their size), you could break those proofs (and bounds).

alexfmpe · 2022-10-31T02:20:05Z

> Did you see my question?

My bad, hadn't noticed you had already replied by the time I finished commenting.

> My concern is that by allowing one set of trees to become unbalanced (hence potentially deep for their size), you could break those proofs (and bounds).

Right. The idea here is that the balancing that link performs has a cost, since it must traverse the unbalanced parts to rebalance them, but we never get to amortize that cost: we do a single operation on the tree and then throw it away. It's a bit like a linear scan of an array being faster than building a binomial queue in O(n) and then doing a single lookup.

That said, I don't currently have a proof, only a gut feeling backed by no more than a specific benchmark. I'll peek at the published proofs.

meooow25 · 2023-01-07T15:56:48Z

I think this is interesting, and can be applied to union/intersection/difference/others too. I implemented it and can see some improvements in set-operations-set:

Results

  union-block_nn:              OK (0.39s)
    419  μs ±  25 μs, 32% less than baseline
  union-block_nn_swap:         OK (0.39s)
    410  μs ±  32 μs, 35% less than baseline
  union-block_ns:              OK (0.42s)
    38.5 μs ± 3.0 μs, 44% less than baseline
  union-block_sn_swap:         OK (0.47s)
    48.0 μs ± 3.5 μs, 35% less than baseline
  union-common_nn:             OK (0.77s)
    512  μs ±  34 μs,       same as baseline
  union-common_nn_swap:        OK (0.84s)
    1.23 ms ± 113 μs, 13% more than baseline
  union-common_ns:             OK (0.42s)
    404  μs ±  28 μs, 43% less than baseline
  union-common_nt:             OK (0.55s)
    34.7 μs ± 1.8 μs, 27% less than baseline
  union-common_sn_swap:        OK (0.32s)
    1.49 ms ± 146 μs,       same as baseline
  union-common_tn_swap:        OK (0.42s)
    95.5 μs ± 5.5 μs, 14% more than baseline
  union-disj_nn:               OK (0.53s)
    2.97 μs ± 174 ns, 36% less than baseline
  union-disj_nn_swap:          OK (0.54s)
    2.82 μs ± 197 ns, 43% less than baseline
  union-disj_ns:               OK (0.49s)
    2.11 μs ± 177 ns, 40% less than baseline
  union-disj_nt:               OK (0.54s)
    1.23 μs ±  89 ns, 44% less than baseline
  union-disj_sn_swap:          OK (0.51s)
    2.41 μs ± 195 ns, 38% less than baseline
  union-disj_tn_swap:          OK (0.47s)
    1.72 μs ± 166 ns, 32% less than baseline
  union-mix_nn:                OK (1.16s)
    16.7 ms ± 624 μs,  6% less than baseline
  union-mix_nn_swap:           OK (0.58s)
    16.6 ms ± 565 μs,       same as baseline
  union-mix_ns:                OK (0.48s)
    1.18 ms ±  42 μs, 31% less than baseline
  union-mix_nt:                OK (0.36s)
    66.1 μs ± 5.6 μs, 16% less than baseline
  union-mix_sn_swap:           OK (0.46s)
    2.25 ms ± 111 μs, 16% more than baseline
  union-mix_tn_swap:           OK (0.46s)
    97.9 μs ± 7.5 μs, 14% more than baseline
  difference-block_nn:         OK (0.40s)
    191  μs ±  16 μs, 56% less than baseline
  difference-block_nn_swap:    OK (0.42s)
    188  μs ±  11 μs, 57% less than baseline
  difference-block_ns:         OK (0.45s)
    18.4 μs ± 1.5 μs, 64% less than baseline
  difference-block_sn_swap:    OK (0.42s)
    17.7 μs ± 1.5 μs, 65% less than baseline
  difference-common_nn:        OK (0.53s)
    3.03 ms ± 189 μs, 14% less than baseline
  difference-common_nn_swap:   OK (0.35s)
    577  μs ±  44 μs, 17% less than baseline
  difference-common_ns:        OK (0.31s)
    1.33 ms ± 104 μs, 46% less than baseline
  difference-common_nt:        OK (0.44s)
    92.2 μs ± 7.3 μs, 29% less than baseline
  difference-common_sn_swap:   OK (0.42s)
    453  μs ±  25 μs, 55% less than baseline
  difference-common_tn_swap:   OK (0.45s)
    43.4 μs ± 3.4 μs, 49% less than baseline
  difference-disj_nn:          OK (0.56s)
    1.53 μs ±  90 ns, 56% less than baseline
  difference-disj_nn_swap:     OK (0.56s)
    1.55 μs ±  85 ns, 47% less than baseline
  difference-disj_ns:          OK (0.51s)
    1.19 μs ±  84 ns, 55% less than baseline
  difference-disj_nt:          OK (0.58s)
    772  ns ±  49 ns, 54% less than baseline
  difference-disj_sn_swap:     OK (0.51s)
    1.19 μs ±  87 ns, 50% less than baseline
  difference-disj_tn_swap:     OK (0.56s)
    736  ns ±  42 ns, 55% less than baseline
  difference-mix_nn:           OK (0.31s)
    3.11 ms ± 213 μs, 49% less than baseline
  difference-mix_nn_swap:      OK (0.58s)
    3.23 ms ± 147 μs, 47% less than baseline
  difference-mix_ns:           OK (0.37s)
    833  μs ±  55 μs, 40% less than baseline
  difference-mix_nt:           OK (0.36s)
    68.5 μs ± 6.1 μs, 28% less than baseline
  difference-mix_sn_swap:      OK (0.33s)
    562  μs ±  47 μs, 61% less than baseline
  difference-mix_tn_swap:      OK (0.45s)
    50.1 μs ± 3.1 μs, 44% less than baseline
  intersection-block_nn:       OK (0.40s)
    191  μs ±  16 μs, 66% less than baseline
  intersection-block_nn_swap:  OK (0.42s)
    189  μs ±  13 μs, 66% less than baseline
  intersection-block_ns:       OK (0.44s)
    18.4 μs ± 1.5 μs, 73% less than baseline
  intersection-block_sn_swap:  OK (0.41s)
    17.8 μs ± 1.5 μs, 74% less than baseline
  intersection-common_nn:      OK (0.27s)
    1.06 ms ±  90 μs, 32% less than baseline
  intersection-common_nn_swap: OK (0.20s)
    545  μs ±  42 μs, 33% less than baseline
  intersection-common_ns:      OK (0.26s)
    975  μs ±  87 μs, 46% less than baseline
  intersection-common_nt:      OK (0.38s)
    77.0 μs ± 6.2 μs, 40% less than baseline
  intersection-common_sn_swap: OK (0.43s)
    430  μs ±  22 μs, 67% less than baseline
  intersection-common_tn_swap: OK (0.45s)
    43.3 μs ± 3.6 μs, 61% less than baseline
  intersection-disj_nn:        OK (0.58s)
    1.54 μs ±  85 ns, 62% less than baseline
  intersection-disj_nn_swap:   OK (0.55s)
    1.55 μs ±  92 ns, 65% less than baseline
  intersection-disj_ns:        OK (0.51s)
    1.20 μs ±  85 ns, 64% less than baseline
  intersection-disj_nt:        OK (0.58s)
    780  ns ±  47 ns, 66% less than baseline
  intersection-disj_sn_swap:   OK (0.51s)
    1.19 μs ±  84 ns, 65% less than baseline
  intersection-disj_tn_swap:   OK (0.57s)
    750  ns ±  47 ns, 65% less than baseline
  intersection-mix_nn:         OK (0.54s)
    3.16 ms ± 183 μs, 60% less than baseline
  intersection-mix_nn_swap:    OK (0.31s)
    3.06 ms ± 300 μs, 62% less than baseline
  intersection-mix_ns:         OK (0.40s)
    839  μs ±  54 μs, 58% less than baseline
  intersection-mix_nt:         OK (0.51s)
    65.5 μs ± 3.1 μs, 47% less than baseline
  intersection-mix_sn_swap:    OK (0.31s)
    585  μs ±  48 μs, 65% less than baseline
  intersection-mix_tn_swap:    OK (0.48s)
    56.1 μs ± 3.0 μs, 53% less than baseline

union and intersection are changed only to use the unbalanced split. For difference I had to make a larger change, so it is not a good direct comparison. There are also a handful of regressions in union; I'm not sure why.

Anyway, this seems useful, so I'll also try to understand the proofs and see if they still apply with this change.

treeowl · 2023-01-07T16:25:04Z

One option to consider is to switch to an unbalanced split (or the "hedge" algorithms we used to use) when the sets/maps get small enough (below some fixed size). That will avoid breaking big O while getting a lot of the performance benefits in the cases where it's good.
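
A rough sketch of what such a size cutoff could look like, on a toy size-annotated tree. Everything here is an assumption for illustration: `threshold` is an arbitrary constant, and `linkToy` is a placeholder that rebuilds a balanced tree in O(n), standing in for the much cheaper `link` in containers.

```haskell
-- Toy size-annotated search tree (assumption, not Data.Set.Internal).
data Tree a = Tip | Bin !Int a (Tree a) (Tree a)

size :: Tree a -> Int
size Tip = 0
size (Bin n _ _ _) = n

-- Plain constructor: keeps sizes correct but does no rebalancing.
bin :: a -> Tree a -> Tree a -> Tree a
bin x l r = Bin (size l + size r + 1) x l r

toAscList :: Tree a -> [a]
toAscList t = go t []
  where
    go Tip acc = acc
    go (Bin _ x l r) acc = go l (x : go r acc)

-- Build a balanced tree from an ascending list.
build :: [a] -> Tree a
build [] = Tip
build xs =
  let (ls, m : rs) = splitAt (length xs `div` 2) xs
  in bin m (build ls) (build rs)

-- Placeholder for 'link': correct result, unrealistic cost.
linkToy :: a -> Tree a -> Tree a -> Tree a
linkToy x l r = build (toAscList l ++ x : toAscList r)

-- Split parameterized by the function used to rejoin pieces.
splitMemberWith :: Ord a
                => (a -> Tree a -> Tree a -> Tree a)
                -> a -> Tree a -> (Tree a, Bool, Tree a)
splitMemberWith _ _ Tip = (Tip, False, Tip)
splitMemberWith joinT x (Bin _ y l r) = case compare x y of
  LT -> let (lt, found, gt) = splitMemberWith joinT x l
        in (lt, found, joinT y gt r)
  GT -> let (lt, found, gt) = splitMemberWith joinT x r
        in (joinT y l lt, found, gt)
  EQ -> (l, True, r)

-- Arbitrary cutoff, chosen only for the example.
threshold :: Int
threshold = 16

-- Balanced split for large trees, unbalanced split below the cutoff.
disjoint :: Ord a => Tree a -> Tree a -> Bool
disjoint Tip _ = True
disjoint _ Tip = True
disjoint (Bin _ x l r) t = not found && disjoint l lt && disjoint r gt
  where
    joinT | size t <= threshold = bin
          | otherwise = linkToy
    (lt, found, gt) = splitMemberWith joinT x t
```

Passing the joining function as an argument keeps the sketch short. In this PR, factoring out bin/link was reported to make the functions 2-3x slower, so a real implementation would more likely keep two specialized splits.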

alexfmpe · 2023-01-09T19:40:34Z

I'm surprised it's not always worse for operations that return sets, since they must return balanced sets in the end to preserve invariants, no? Are you doing a single call to balance at the very end?

I haven't been in the headspace to look at this in a while, but one thing I'd been meaning to do is try to make this allocation-free. It sounds plausible to me, since without re-balancing the triple that's returned is immediately consumed. I had tried to do this via CPS, but having functions as arguments seemed to kill performance, and I'm always a bit lost when trying to reason about the Core that comes out.
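
For readers unfamiliar with the CPS approach mentioned here, a minimal sketch on a toy tree follows. The `Tree` type and all names are assumptions, not code from this PR. The continuation receives the three results directly, so no tuple is built, but closures and `Bin` nodes are still allocated, so this is not allocation-free as written.

```haskell
-- Toy model (assumption): a plain BST without size information.
data Tree a = Tip | Bin a (Tree a) (Tree a)

-- Continuation-passing split: results go to 'k' instead of into a triple.
splitMemberCPS :: Ord a => a -> Tree a -> (Tree a -> Bool -> Tree a -> r) -> r
splitMemberCPS _ Tip k = k Tip False Tip
splitMemberCPS x (Bin y l r) k = case compare x y of
  LT -> splitMemberCPS x l (\lt found gt -> k lt found (Bin y gt r))
  GT -> splitMemberCPS x r (\lt found gt -> k (Bin y l lt) found gt)
  EQ -> k l True r

disjointCPS :: Ord a => Tree a -> Tree a -> Bool
disjointCPS Tip _ = True
disjointCPS _ Tip = True
disjointCPS (Bin x l r) t =
  splitMemberCPS x t $ \lt found gt ->
    not found && disjointCPS l lt && disjointCPS r gt

-- Hypothetical helper: naive, unbalanced insertion, just to build inputs.
fromList :: Ord a => [a] -> Tree a
fromList = foldr insert Tip
  where
    insert x Tip = Bin x Tip Tip
    insert x t@(Bin y l r) = case compare x y of
      LT -> Bin y (insert x l) r
      GT -> Bin y l (insert x r)
      EQ -> t
```

Whether GHC turns this into something faster than the tuple-returning version depends on inlining and specialization, which matches the difficulty described above.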

treeowl · 2023-01-09T20:46:28Z

Not all the reconstructed pieces end up getting incorporated. For intersection, none of them do. (I wish that were true of difference as well, but we follow the exact algorithm proved optimal; I don't know how to adapt the proof to work with difference going the other way.)

meooow25 · 2023-01-10T16:03:41Z

> Are you doing a single call to balance at the very end?

I had to, but only for union. For difference I changed the algorithm so that we split t2 instead.
To avoid confusion, here are the modified versions I tested.

-- A possibly unbalanced set.
-- Invariant: A Bin with non-zero size is balanced.
--            To construct an unbalanced set: Unbalanced (Bin 0 x l r)
newtype Unbalanced a = Unbalanced (Set a)

fromUnbalanced :: Unbalanced a -> Set a
fromUnbalanced (Unbalanced s0) = go s0
  where
    go (Bin 0 x l r) = link x (go l) (go r)
    go s = s

splitSUnbalanced :: Ord a => a -> Unbalanced a -> StrictPair (Unbalanced a) (Unbalanced a)
splitMemberUnbalanced :: Ord a => a -> Unbalanced a -> (Unbalanced a,Bool,Unbalanced a)

union :: Ord a => Set a -> Set a -> Set a
union t10 t20 = go t10 (Unbalanced t20)
  where
    go t1 (Unbalanced Tip) = t1
    go t1 (Unbalanced (Bin _ x Tip Tip)) = insertR x t1
    go (Bin 1 x _ _) t2 = insert x (fromUnbalanced t2)
    go Tip t2 = fromUnbalanced t2
    go t1@(Bin _ x l1 r1) t2 = case splitSUnbalanced x t2 of
      (l2 :*: r2)
        | l1l2 `ptrEq` l1 && r1r2 `ptrEq` r1 -> t1
        | otherwise -> link x l1l2 r1r2
        where !l1l2 = go l1 l2
              !r1r2 = go r1 r2

difference :: Ord a => Set a -> Set a -> Set a
difference t10 t20 = go t10 (Unbalanced t20)
  where
    go Tip _ = Tip
    go t1 (Unbalanced Tip) = t1
    go t1@(Bin _ x l1 r1) t2 = case splitMemberUnbalanced x t2 of
      (l2,b,r2)
        | b -> merge l1l2 r1r2
        | l1l2 `ptrEq` l1 && r1r2 `ptrEq` r1 -> t1
        | otherwise -> link x l1l2 r1r2
        where !l1l2 = go l1 l2
              !r1r2 = go r1 r2

intersection :: Ord a => Set a -> Set a -> Set a
intersection t10 t20 = go t10 (Unbalanced t20)
  where
    go Tip _ = Tip
    go _ (Unbalanced Tip) = Tip
    go t1@(Bin _ x l1 r1) t2
      | b = if l1l2 `ptrEq` l1 && r1r2 `ptrEq` r1
            then t1
            else link x l1l2 r1r2
      | otherwise = merge l1l2 r1r2
      where
        !(l2, b, r2) = splitMemberUnbalanced x t2
        !l1l2 = go l1 l2
        !r1r2 = go r1 r2

Faster disjoint/isSubsetOf for Set via unbalanced splitting.

d6f7f95
