
Rolling out our own Tape #28

Open
homonoidian opened this issue Aug 29, 2022 · 8 comments
Labels
discussion (A question by the author(s) of the language), future (Not urgent), optimization (Improvements to the speed)

Comments

@homonoidian
Collaborator

homonoidian commented Aug 29, 2022

Maybe rolling out our own custom Tape implementation (i.e. not using an Array underneath) would be beneficial in terms of performance: we use an extremely small subset of Array features, where e.g. the index is always guaranteed to be in bounds (at least for the tape), and there are often lots of 1-element insertions, which means we'd want a heuristic that allocates more up front (I think that would make sense). Memory is cheap these days, electricity isn't :)

This should also reduce the number of objects we create. Currently there are, at worst, 3 objects per block (Block itself -> RealSubstrate -> Array); if no copy is made, thanks to CoW we have only Block. Having Block -> (copy-on-write Tape < Value) -> Pointer (somehow!) may be a bit better.
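For context, here is a very rough Crystal sketch of what a pointer-backed, cursor-based tape could look like. This is purely illustrative: the growth heuristic, the field names and the whole layout are assumptions, not the actual Novika implementation; only the add/to?/drop?/count names mirror the benchmarks further down.

# Purely hypothetical sketch (not the real Novika Tape): a cursor-based
# tape over a raw buffer that doubles its capacity, so that long runs of
# 1-element insertions at the cursor don't realloc every time.
class Tape(T)
  getter count : Int32
  getter cursor : Int32

  def initialize(@capacity : Int32 = 8)
    @buffer = Pointer(T).malloc(@capacity)
    @count = 0
    @cursor = 0
  end

  # Insert an element at the cursor and advance the cursor past it.
  def add(element : T)
    grow if @count == @capacity
    # Shift whatever is after the cursor one slot to the right (memmove).
    (@buffer + @cursor + 1).move_from(@buffer + @cursor, @count - @cursor)
    @buffer[@cursor] = element
    @count += 1
    @cursor += 1
    self
  end

  # Move the cursor, or return nil if the index is out of bounds.
  def to?(index : Int32)
    return if index < 0 || index > @count
    @cursor = index
    self
  end

  # Drop the element before the cursor, or return nil if there is none.
  def drop?
    return if @cursor.zero?
    (@buffer + @cursor - 1).move_from(@buffer + @cursor, @count - @cursor)
    @count -= 1
    @cursor -= 1
    self
  end

  private def grow
    @capacity *= 2
    @buffer = @buffer.realloc(@capacity)
  end
end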

These are just my speculations & a bit of a reminder to myself, so don't take it too seriously.

@homonoidian added the optimization (Improvements to the speed) and future (Not urgent) labels on Aug 29, 2022
@homonoidian changed the title from "Rolling our own Tape" to "Rolling out our own Tape" on Aug 29, 2022
@homonoidian added the discussion (A question by the author(s) of the language) label on Aug 29, 2022
@homonoidian
Collaborator Author

Spoiler alert ...

require "benchmark"

Benchmark.ips do |x|
  x.report("tape") do
    t = Tape(Int32).new

    # Linear insertion: grow the tape to a million elements.
    1_000_000.times do |x|
      t = t.add(x)
    end

    # Sweep the cursor across the whole tape.
    1_000_000.times do |x|
      t = t.to?(x).not_nil!
    end

    # Move the cursor, then insert at that position.
    100_000.times do |x|
      t = t.to?(x).not_nil!
      t = t.add(x)
    end

    t = t.map! do |i|
      i + 1 if i.even?
    end

    # Move the cursor to the end and drop everything.
    t = t.to?(t.count).not_nil!

    t.count.times do |x|
      t = t.drop?.not_nil!
    end
  end

  x.report("array") do
    # Same workload expressed with Array; `ac` plays the role of the cursor.
    a = [] of Int32
    ac = 0

    1_000_000.times do |x|
      a << x
      ac += 1
    end

    1_000_000.times do |x|
      ac = x
    end

    100_000.times do |x|
      ac = x
      a.insert(ac, x)
      ac += 1
    end

    a.map! do |i|
      i.even? ? i + 1 : i
    end

    ac = a.size

    a.size.times do |x|
      ac -= 1
      a.delete_at(ac)
    end
  end
end
 tape  24.45  ( 40.91ms) (± 1.00%)  17.4MB/op         fastest
array  47.82m ( 20.91s ) (± 0.00%)  27.6MB/op  511.25× slower

@homonoidian
Collaborator Author

I guess Novika's tape is going to be extremely fast after all.

@homonoidian
Collaborator Author

Despite these amazing results with linear insertion, the tape still struggles with non-linear insertion, because it uses two buffers (one before and one after the cursor) unless the cursor is at the end. It is 17x slower than Array right now, because it reallocs/copies the two huge buffers a lot. Using a heuristic to detect semi-random insertion and switching to a contiguous buffer would be beneficial. Detecting non-linear cursor movement via some sort of cursor-neighborhood heuristic may help. Alternatively, reducing allocation after cursor movement may help.

require "benchmark"

Benchmark.ips do |x|
  # 10_000 random cursor positions within the million-element tape.
  probe = (0..1_000_000).sample(10_000)

  x.report("random insertion tape") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end

    # Jump the cursor to a random position, then insert there.
    probe.each_with_index do |x, i|
      t = t.to?(x)
      t = t.not_nil!.add(x)
    end
  end

  x.report("random insertion ary") do
    t = [] of Int32
    1_000_000.times do |x|
      t << x
    end

    # Same random insertions, done directly with Array#insert.
    probe.each_with_index do |x, i|
      t.insert(x, x)
    end
  end
end
random insertion tape  80.19m ( 12.47s ) (± 0.00%)  56.0GB/op  17.57× slower
 random insertion ary   1.41  (709.63ms) (± 2.31%)  22.0MB/op        fastest
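To illustrate where the cost comes from, here is a hypothetical sketch of the two-halves idea (not the actual Tape internals; TwoHalves and its methods are made up for illustration). Insertion at the cursor is cheap, but every cursor jump rebuilds both halves, which is roughly what the 17x slowdown above reflects.

require "benchmark" # not needed for the sketch itself, shown for context

# Hypothetical two-halves tape: `before` holds elements before the
# cursor, `after` holds the rest.
record TwoHalves(T), before : Array(T), after : Array(T) do
  def count
    before.size + after.size
  end

  # Move the cursor to `index`: both halves are rebuilt, so a million
  # random jumps copy on the order of the whole tape each time.
  def to?(index : Int32)
    return if index < 0 || index > count
    all = before + after
    TwoHalves.new(all[0, index], all[index..])
  end

  # Insertion at the cursor is cheap: push onto `before`.
  def add(element : T)
    TwoHalves.new(before.dup << element, after)
  end
end

In this sketch, to? is O(count) in both time and fresh allocations, whereas Array#insert shifts only the suffix in place; a detection heuristic (or reduced allocation after cursor moves) would attack exactly that to? path.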

@homonoidian
Collaborator Author

homonoidian commented Aug 29, 2022

We would also like to retain these two buffers for as long as possible because then Tape#slice has no cost.

UPD: if integrated with CoW of course
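Continuing the hypothetical TwoHalves sketch above, keeping the halves alive means slicing at the cursor costs nothing, provided copy-on-write makes the sharing safe (again, made-up names, not the real Tape#slice):

# Slicing at the cursor just hands out the two existing halves; no
# copying happens. Only safe if both sides are treated as CoW.
struct TwoHalves(T)
  def slice
    {before, after}
  end
end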

@homonoidian
Collaborator Author

homonoidian commented Aug 29, 2022

After a cursor movement followed by a tape mutation, we can concatenate the two buffers into a single buffer of size N + 1, add the element, and then compute where the first and second buffers start. Realloc may not be possible in that case.
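A sketch of that merge, again assuming the hypothetical two-halves layout (the function name and signature are made up): both halves plus the new element go into one buffer sized N + 1, and the split point is recomputed instead of being kept as two live arrays.

# Hypothetical: fold both halves and the new element into one buffer of
# size N + 1, and remember where the "after" half now begins.
def merge_and_add(before : Array(Int32), after : Array(Int32), element : Int32)
  merged = Array(Int32).new(before.size + after.size + 1) # exact capacity up front
  merged.concat(before)
  merged << element
  merged.concat(after)
  {merged, before.size + 1} # the index where the "after" half starts
end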

@homonoidian
Collaborator Author

As nothing is free in this world, we'll have to make some compromises (depending, of course, on whether the new implementation of Tape improves Novika's performance sufficiently).

  • Millions of random cursor movements followed by mutation (e.g. move 100 forward, write 100, move 1234 backward, write 123) are expensive in terms of computation and memory. They create N + (N' < N) + (N'' < N) new bytes, where N is the number of elements in the tape.
  • Mass-writing (meaning millions of writes) to ahead is expensive in terms of computation and memory unless you are (a) writing to the end, or (b) have set a single write point and begun writing, i.e., without changing the cursor during the process of writing.
  • The former may be fixed by making eject and inject native words and implementing a fast version of them, since they are the words that most often mutate ahead, and ahead is the entity most often mutated at random points in Novika.

Note that at certain points the code below makes tens if not hundreds of millions of iterations (and even more reallocations). Note also that the memory output seems to be broken, or maybe it just doesn't decrease when garbage is collected. I may have other ideas; these are not the final results. It would be better, though, to plug this implementation of Tape into Novika some time later, to see how much it improves the situation.

require "benchmark"

Benchmark.ips do |x|
  x.report("tape") do # Does ~1000000000000 iterations, each making ~8 ptrs of N, N' < N
    1_000_000.times do
      t = Tape(Int32).new
      1000.times do |x|
        t = t.add(x)
      end
      t
        .to?(4).not_nil!
        .add(5).not_nil!
        .to?(2).not_nil!
        .add(100).not_nil!
        .add(200).not_nil!
        .add(300).not_nil!
        .to?(2).not_nil!
        .add(100).not_nil!
        .add(200).not_nil!
        .add(300).not_nil!
        .to?(t.count - 3).not_nil!
        .add(680).not_nil!
        .to?(0).not_nil!
        .add(100).not_nil!
        .add(100).not_nil!
        .drop?.not_nil!
        .to?(t.count).not_nil!
        .add(1000).not_nil!
      t.count.times { t = t.drop?.not_nil! }
    end
  end
  x.report("array") do
    1_000_000.times do
      ary = [] of Int32
      cur = 0
      1_000.times do |i|
        ary.insert(cur, i)
        cur += 1
      end

      cur = 4
      ary.insert(cur, 5)
      cur += 1

      cur = 2

      ary.insert(cur, 100)
      cur += 1

      ary.insert(cur, 200)
      cur += 1

      ary.insert(cur, 300)
      cur += 1

      cur = 2

      ary.insert(cur, 100)
      cur += 1

      ary.insert(cur, 200)
      cur += 1

      ary.insert(cur, 300)
      cur += 1

      cur = ary.size - 3
      ary.insert(cur, 680)
      cur += 1

      cur = 0
      ary.insert(cur, 800)
      cur += 1

      ary.insert(cur, 100)
      cur += 1

      ary.delete_at(cur - 1)
      cur -= 1

      cur = ary.size
      ary.insert(cur, 1000)
      cur += 1

      ary.size.times { ary.delete_at(cur - 1); cur -= 1 }
    end
  end

  probe = (0..1_000_000).sample(100_000)
  x.report("random insertion tape") do
    t = Tape(Int32).new
    1_000_000.times do |x|
      t = t.add(x)
    end

    probe.each_with_index do |x, i|
      t = t.to?(x)
      t = t.not_nil!.add(x)
    end
  end

  x.report("random insertion ary") do
    t = [] of Int32
    1_000_000.times do |x|
      t << x
    end

    probe.each_with_index do |x, i|
      t.insert(x, x)
    end
  end

  x.report("tape test") do
    t = Tape(Int32).new

    1_000_000.times do |x|
      t = t.add(x)
    end

    1_000_000.times do |x|
      t = t.to?(x).not_nil!
    end

    100_000.times do |x|
      t = t.to?(x).not_nil!
      t = t.add(x)
    end

    t = t.map! do |i|
      i + 1 if i.even?
    end

    t = t.to?(t.count).not_nil!

    t.count.times do |x|
      t = t.drop?.not_nil!
    end
  end

  x.report("array test") do
    a = [] of Int32
    ac = 0

    1_000_000.times do |x|
      a << x
      ac += 1
    end

    1_000_000.times do |x|
      ac = x
    end

    100_000.times do |x|
      ac = x
      a.insert(ac, x)
      ac += 1
    end

    a.map! do |i|
      i.even? ? i + 1 : i
    end

    ac = a.size

    a.size.times do |x|
      ac -= 1
      a.delete_at(ac)
    end
  end
end
                 tape  65.99m ( 15.15s ) (± 0.00%)  38.9GB/op           fastest
                array  59.73m ( 16.74s ) (± 0.00%)   9.9GB/op     1.10×  slower
random insertion tape   5.53m (180.92s ) (± 0.00%)   778GB/op  7775.89×  slower
 random insertion ary 125.91m (  7.94s ) (± 0.00%)  27.6MB/op   341.35×  slower
            tape test  42.98  ( 23.27ms) (± 0.68%)  25.0MB/op           fastest
           array test  44.63m ( 22.41s ) (± 0.00%)  27.6MB/op   963.12×  slower

@homonoidian
Collaborator Author

homonoidian commented Dec 30, 2022

Another approach would be to have a buffer and a claims array.

A claim is a range. Whenever a copy of a substrate is made, a new claim is added to the claims array (or the refcount of an existing claim is incremented, if there is already a claim with the same begin and end).

A claim is basically the bounds of a particular substrate. Upon a mutation, a claim conflict check is performed (how this is done depends on the kind of mutation, e.g. insert/push/map/etc.).

If a claim conflict is detected, the refcount of the current claim is decremented; if its refcount reaches 0, it is removed from the claims array. Then a copy of the buffer is made, with the current claim attached. The buffer could (and probably should) be copied according to the bounds of the claim ± a power of two or something, rather than in its entirety.

This allows inserting and dropping in refs without copying in some cases -- that is, whichever of any number of same-bound copies pushes first wins in performance over everyone else. If you have two generations of substrates, A and its copy a, and they point to the same buffer, then N pushes to a are free (copy-less), and the N drops are free for a as well.

However, drops that conflict with A's claim (e.g. a drop when A = a) or inserts that conflict with A's claim (e.g. an insert in the middle of a) make a personal copy of the buffer for a.

Note that A and a could be swapped in the explanation above.
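A minimal sketch of this claim bookkeeping, with everything (Claim, Substrate, the conflict rule, the copy policy) being assumptions made up for illustration rather than a description of the real code:

# Hypothetical claim bookkeeping. A claim records which slice of the
# shared buffer a substrate considers its own, and how many substrates
# share exactly that slice.
class Claim
  property low : Int32      # begin index, inclusive
  property high : Int32     # end index, exclusive
  property refcount : Int32

  def initialize(@low, @high, @refcount = 1)
  end

  def covers?(index : Int32)
    low <= index && index < high
  end
end

class Substrate(T)
  def initialize(@buffer : Array(T), @claims : Array(Claim), @claim : Claim)
  end

  # Copies share the buffer: bump an existing claim with the same
  # bounds, or register a fresh one.
  def copy
    twin = @claims.find { |c| c.low == @claim.low && c.high == @claim.high }
    if twin
      twin.refcount += 1
    else
      twin = Claim.new(@claim.low, @claim.high)
      @claims << twin
    end
    Substrate(T).new(@buffer, @claims, twin)
  end

  # Called before a mutation at `index`: if another claim also covers
  # that position, release our claim and retreat to a private copy of
  # our claimed slice; otherwise keep mutating the shared buffer.
  private def detach_if_conflict(index : Int32)
    return unless @claims.any? { |c| !c.same?(@claim) && c.covers?(index) }
    @claim.refcount -= 1
    @claims.delete(@claim) if @claim.refcount.zero?
    @buffer = @buffer[@claim.low...@claim.high]
    @claim = Claim.new(0, @buffer.size)
    @claims = [@claim]
  end

  # Push: free (no copy) as long as nobody else claims the slot just
  # past our end -- the "whoever pushes first wins" case.
  def push(element : T)
    detach_if_conflict(@claim.high)
    @buffer.insert(@claim.high, element)
    @claim.high += 1
    self
  end
end

In this sketch, the first of any number of same-bound copies to push pays nothing, and everyone else takes a private copy on their first conflicting mutation, which is the "whoever pushes first wins" behaviour described above. The subtleties of same-bound copies sharing one Claim object and then diverging are glossed over.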

I don't know how practical all of this would be for Novika in particular, though. Even with the most naive implementation I've tried writing, it seems to pay off only for large and very large arrays, where the current implementation degrades more and more, up to the point where the one I'm describing here is 20x or so faster. The benchmarks I've done are disputable, though, because I'm carefully designing them to hit the "sweet spot" rather than do the dirty, massively repetitive work of everyday Novika code.

It should also be noted that the average size of a Novika substrate is ~3.56 elements (running the tests); 2793793 substrates are made. Small and heavily copied substrates rule in the Novika universe. Doing e.g. 100 enclose results in [] 100 <<, which does make a copy, but not for the most obvious reason. The block is created with an empty tape; a copy is made due to the block open of enclose (which copies the entire definition of enclose, including []), leaving the refsubstrate for []; and then a mutation is introduced, thereby copying the empty substrate and inserting 100 into the copy. The approach I've described here seems to prevent a copy in this case, because it is completely unnecessary as no conflicts with other claims exist. ALL OF THIS IS DISPUTABLE THOUGH, AND I'D NEED TO GIVE THIS A BIG LONG THOUGHT.

Also, I don't know how easy it would be to realloc, etc.

Also, I don't know how fast this would be on real Novika code.

Also, I think there cannot exist a buffer with no claims; that is simply impossible, or an invalid state. Most often, buffers have exactly one claim. We could introduce a StaticArray of claims with some claim threshold after which new claims make a copy and attach to the copy, but this needs a live implementation where stuff like this could be tested and the magic numbers found, not the simplistic speculation I'm doing here and now. What I see most clearly is that we don't need to create a claims array initially -- only when more than one claim is introduced. Otherwise, we'll simply store the "master claim" or something like that and wait until new claims arrive.
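The "master claim" idea could look roughly like this, reusing the hypothetical Claim from the sketch above (all names made up); the claims array is only allocated once a second claim actually appears:

# Hypothetical: the common case is exactly one claim per buffer, so
# store that "master claim" inline and only allocate the claims array
# once a second claim shows up.
class Buffer(T)
  @claims : Array(Claim)?

  def initialize(@data : Array(T), @master : Claim)
  end

  # Register an extra claim, lazily creating the claims array.
  def add_claim(claim : Claim)
    (@claims ||= [@master]) << claim
  end

  # Walk the claims without allocating anything in the single-claim case.
  def each_claim
    if claims = @claims
      claims.each { |claim| yield claim }
    else
      yield @master
    end
  end
end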

A lower-level optimization would be to split the buffer into chunks and copy only those, but this would require some mind-boggling pointer jugglery and linked-listery which does not necessarily result in good performance. Another idea is to store the changes (mutations) themselves rather than make a full copy and apply them to it: e.g. an "invalid range" (from the viewpoint of a particular buffer) and a "replacement sub-buffer", used when the operation index falls into the "invalid range" (this is kind of a mirror-side version of the linked-listery I was just talking about; a reading sketch follows below). This paragraph is about some very complex stuff, though, and maybe I'm trying to be smart here more than practical...
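A tiny sketch of that "invalid range + replacement sub-buffer" reading path, assuming the range is expressed in base-buffer coordinates (all names hypothetical): reads that fall into the invalidated region are served from the replacement, everything else from the untouched base.

# Hypothetical overlay view: @invalid is the (exclusive) range of the
# base buffer that a mutation replaced, @patch is what replaced it.
# E.g. base = [1, 2, 3, 4, 5], invalid = 1...3, patch = [20, 30, 40]
# reads as [1, 20, 30, 40, 4, 5] without copying the base.
struct Overlay(T)
  def initialize(@base : Array(T), @invalid : Range(Int32, Int32), @patch : Array(T))
  end

  def size
    @base.size - @invalid.size + @patch.size
  end

  def [](index : Int32) : T
    if index < @invalid.begin
      @base[index]                               # untouched prefix
    elsif index < @invalid.begin + @patch.size
      @patch[index - @invalid.begin]             # replaced region
    else
      @base[index - @patch.size + @invalid.size] # untouched suffix
    end
  end
end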

What needs to be understood is that reducing the amount of block copying is one of the most important pillars of optimizing Novika. The less space is used, the faster mallocs and reallocs are. The smaller the blobs of stuff are, the faster the GC can look at them and free them. The less copying is done, the less time is spent on copying!

@homonoidian
Collaborator Author

homonoidian commented Dec 30, 2022

An illustration of the second-to-last paragraph above.

  • a and b are the claims of substrates A and B (where B is an extended, pushed-into copy of A)
  • xa and xb are the invalidations of the buffer by a and b, respectively

Again, I have no idea whether this will work and how much of an effect it would result in. Maybe it won't work. I. Don't. Know.

(attached illustration)
