-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rolling out our own Tape
#28
Comments
Spoiler alert ... require "benchmark"
Benchmark.ips do |x|
x.report("tape") do
t = Tape(Int32).new
1_000_000.times do |x|
t = t.add(x)
end
1_000_000.times do |x|
t = t.to?(x).not_nil!
end
100_000.times do |x|
t = t.to?(x).not_nil!
t = t.add(x)
end
t = t.map! do |i|
i + 1 if i.even?
end
t = t.to?(t.count).not_nil!
t.count.times do |x|
t = t.drop?.not_nil!
end
end
x.report("array") do
a = [] of Int32
ac = 0
1_000_000.times do |x|
a << x
ac += 1
end
1_000_000.times do |x|
ac = x
end
100_000.times do |x|
ac = x
a.insert(ac, x)
ac += 1
end
a.map! do |i|
i.even? ? i + 1 : i
end
ac = a.size
a.size.times do |x|
ac -= 1
a.delete_at(ac)
end
end
end
|
I guess Novika's tape is going to be extremely fast after all. |
Giving these amazing results with linear insertion, it still struggles with non-linear insertion, for it uses two buffers (one before & after cursor) unless cursor is at the end. 17x slower than array right now, because it realloc's/copies the two huge buffers a lot. Using a heuristic to figure out semi-randomness of insertion & turning to contiguous buffer is going to be beneficial. Detecting non-linear cursor movement via some sort of cursor neighborhood heuristic may help. Alternatively, reducing allocation after cursor movement may help. require "benchmark"
Benchmark.ips do |x|
probe = (0..1_000_000).sample(10_000)
x.report("random insertion tape") do
t = Tape(Int32).new
1_000_000.times do |x|
t = t.add(x)
end
probe.each_with_index do |x, i|
t = t.to?(x)
t = t.not_nil!.add(x)
end
end
x.report("random insertion ary") do
t = [] of Int32
1_000_000.times do |x|
t << x
end
probe.each_with_index do |x, i|
t.insert(x, x)
end
end
end
|
We would also like to retain these two buffers for as long as possible because then UPD: if integrated with CoW of course |
After cursor movement & tape mod we can concat two buffers + 1, add element, then by math figure out when first & second buffer starts. Realloc may not be possible. |
As nothing is free in this world, we'll have to make some compromises (depending, of course, on whether the new implementation of Tape increases the performance of Novika sufficiently enough).
Note that at certain points, the code below makes tens if not hundreds of millions of iterations (and even more reallocations). Note: memory output seems to be broken, or maybe it doesn't decrease when garbage is cleared. Note: I may have other ideas, these are not the final results. It is better though to plug this implementation of Tape into Novika some time later, to see how much it improves the situation. require "benchmark"
Benchmark.ips do |x|
x.report("tape") do # Does ~1000000000000 iterations, each making ~8 ptrs of N, N' < N
1_000_000.times do
t = Tape(Int32).new
1000.times do |x|
t = t.add(x)
end
t
.to?(4).not_nil!
.add(5).not_nil!
.to?(2).not_nil!
.add(100).not_nil!
.add(200).not_nil!
.add(300).not_nil!
.to?(2).not_nil!
.add(100).not_nil!
.add(200).not_nil!
.add(300).not_nil!
.to?(t.count - 3).not_nil!
.add(680).not_nil!
.to?(0).not_nil!
.add(100).not_nil!
.add(100).not_nil!
.drop?.not_nil!
.to?(t.count).not_nil!
.add(1000).not_nil!
t.count.times { t = t.drop?.not_nil! }
end
end
x.report("array") do
1_000_000.times do
ary = [] of Int32
cur = 0
1_000.times do |i|
ary.insert(cur, i)
cur += 1
end
cur = 4
ary.insert(cur, 5)
cur += 1
cur = 2
ary.insert(cur, 100)
cur += 1
ary.insert(cur, 200)
cur += 1
ary.insert(cur, 300)
cur += 1
cur = 2
ary.insert(cur, 100)
cur += 1
ary.insert(cur, 200)
cur += 1
ary.insert(cur, 300)
cur += 1
cur = ary.size - 3
ary.insert(cur, 680)
cur += 1
cur = 0
ary.insert(cur, 800)
cur += 1
ary.insert(cur, 100)
cur += 1
ary.delete_at(cur - 1)
cur -= 1
cur = ary.size
ary.insert(cur, 1000)
cur += 1
ary.size.times { ary.delete_at(cur - 1); cur -= 1 }
end
end
probe = (0..1_000_000).sample(100_000)
x.report("random insertion tape") do
t = Tape(Int32).new
1_000_000.times do |x|
t = t.add(x)
end
probe.each_with_index do |x, i|
t = t.to?(x)
t = t.not_nil!.add(x)
end
end
x.report("random insertion ary") do
t = [] of Int32
1_000_000.times do |x|
t << x
end
probe.each_with_index do |x, i|
t.insert(x, x)
end
end
x.report("tape test") do
t = Tape(Int32).new
1_000_000.times do |x|
t = t.add(x)
end
1_000_000.times do |x|
t = t.to?(x).not_nil!
end
100_000.times do |x|
t = t.to?(x).not_nil!
t = t.add(x)
end
t = t.map! do |i|
i + 1 if i.even?
end
t = t.to?(t.count).not_nil!
t.count.times do |x|
t = t.drop?.not_nil!
end
end
x.report("array test") do
a = [] of Int32
ac = 0
1_000_000.times do |x|
a << x
ac += 1
end
1_000_000.times do |x|
ac = x
end
100_000.times do |x|
ac = x
a.insert(ac, x)
ac += 1
end
a.map! do |i|
i.even? ? i + 1 : i
end
ac = a.size
a.size.times do |x|
ac -= 1
a.delete_at(ac)
end
end
end
|
Another approach would be to have a buffer and a claims array. A claim is a range. Whenever a copy of substrate is made, a new claim is added to the claims array (or the refcount of an existing claim is incremented, if there is already a claim with the same begin&end). A claim is basically the bounds of a particular substrate. Upon a mutation, claim conflict is checked (the way this is done depends on the kind of mutation, e.g., insert/push/map/etc.) If a claim conflict is detected, refcount of the current claim is decremented; if its refcount = 0, then its is removed from the claims array. If conflict was detected, a copy of the buffer is made with the current claim attached. Buffer could (and probably should) be copied according to the bounds of the claim ± power 2 or smth. rather than in its entirety. This allows to insert and drop in refs without copying in some cases -- that is, the one who is pushed first out of any number of same-bound copies wins in performance over everyone else. If you have two generations of substrates, However, drops that conflict with Note that I don't know how practical all of this would be for Novika in particular, though. Even with the most naive implementation I've tried writing, it seems to pay off only when having large and very large arrays, the current implementation degrading more and more up to points where the one I'm describing here is 20x or so faster. The benchmarks I've done are disputable, though, because I'm carefully designing them to hit the "sweet spot" rather than do dirty, massively repetitive work of everyday Novika code. It should also be noted that the average size of a Novika substrate is ~3.56 elements (running the tests); 2793793 substrates are made. Small and heavily copied substrates rule in the Novika universe. Doing e.g. Also, Idk how easy it'd be to realloc etc. Also, Idk how fast this'd be on real Novika code. Also, I think there cannot exist a buffer with no claims. This is simply impossible or is an invalid state. Most often ,buffers have exactly one claim. We could introduce a staticarray of claims with some claim threshold after which new claims make a copy and attach to the copy, but this needs a live implementation where stuff like this could be tested & magic numbers found, not simplistic speculation as I do here, now. What I see most clearly is that we don't need to create a claims array initially -- only when more than one claim is introduced. Otherwise, we'll simply store the "master claim" or something like that and wait out until new claims arrive. A lower-level optimization would be to split buffer into chunks and copy them only, but this'd require some mind-boggling pointer jugglery and linked-listery which does not necessarily result in good performance. Another idea is to store changes (mutations) themselves rather than make a full copy & apply them on it, e.g. an "invalid range" (from the viewpoint of a particular buffer) and a "replacement sub-buffer", used when operation index falls in the "invalid range" (this is kind of a mirror-side version of the linked-listery I was just talking about). This paragraph is about some very complex stuff though, and maybe I'm trying to be smart here more than practical... What needs to be understood is that reducing the amount of block copies is one of the most important pillars in optimizing Novika. The less space is used, the faster are mallocs and reallocs. The less the blobs of stuff are, the faster the GC can look at them and free then. The less copying is done, the less time is spent on copying! |
An illustration to the 2nd from the end paragraph.
Again, I have no idea whether this will work and how much of an effect it would result in. Maybe it won't work. I. Don't. Know. |
Maybe rolling out our own, custom Tape implementation (i.e. not using an Array underneath) would be beneficial in terms of performance, because we use an extremely small subset of Array features where e.g. index is always guaranteed to be in bounds (at least for tape), plus there are often a lot of 1-element insertions which means we'd want to have a heuristic to allocate more initially (I think that would make sense). Memory is cheap these days, electricity isn't :)
This should also reduce the number of objects we create. Currently there are (at worst, if no copy is made thanks to CoW we have only Block) 3 objects per block (Block itself -> RealSubstrate -> Array), having Block -> (copy-on-write Tape < Value) -> Pointer (somehow!) may be a bit better.
These are just my speculations & a bit of a reminder to myself, so don't think I'm too serious with it.
The text was updated successfully, but these errors were encountered: