
Squeezing out performance. #1

l1mey112 opened this issue Sep 25, 2024 · 7 comments
Comments

@l1mey112 (Owner) commented Sep 25, 2024

20 H/s on a release build is okay, only 5 times slower than the reference implementation. Squeezing out 25 H/s is likely; an unlikely 30 H/s would be amazing. I have not done huge amounts of testing, that comes later. But the improvements I can see right now are listed here:

Meaningful optimisations:

  • API: RandomX batching API, where it hashes the scratchpad and generates the program at the same time.
  • JIT: Re-enable the use of fused multiply-add instructions. Figure out why it isn't as deterministic as it seems, even after doing a runtime feature detect.

Possibly meaningful optimisations:

  • JIT: Adjust how semifloat is dispatched based on fprc; currently it is just a simple table implementation.
  • JIT: Precompute scratchpad addresses as much as possible in JIT code.
  • CRYPTO: Optimise Blake2b to use vector instructions. (Small improvement, since Blake2b is only called once in the hot path, to hash the register file at a fixed 256 bytes.)
  • JIT: Improve scheduling of virtual machine instructions. It is possible to reorder instructions so related ones sit close to each other, without using any local.* instructions. This will improve a baseline JIT's register allocation (RA), and may improve a subpar JIT's RA.
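For readers unfamiliar with the current dispatch scheme, here is a minimal sketch of what "a simple table implementation" keyed on fprc could look like. All names and structure are illustrative, not the library's actual code; the directed-rounding emulation itself is elided:

```javascript
// Illustrative sketch only: one implementation per RandomX rounding mode,
// selected by indexing a table with the current fprc value. The point is
// that every FP op pays a table load plus an indirect call.
const fpOps = {
  add: [
    (a, b) => a + b, // fprc = 0: round-to-nearest, the hardware default
    (a, b) => a + b, // fprc = 1: toward -inf (semifloat emulation elided)
    (a, b) => a + b, // fprc = 2: toward +inf (semifloat emulation elided)
    (a, b) => a + b, // fprc = 3: toward zero (semifloat emulation elided)
  ],
};

let fprc = 0; // set at runtime by CFROUND

function vmAdd(a, b) {
  // Table dispatch on every FP instruction.
  return fpOps.add[fprc](a, b);
}
```

The alternative discussed later in this thread is to eliminate this per-op dispatch entirely by specialising whole VM bodies per rounding mode.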

Holy grail optimisations:

  • CRYPTO: Fast AES. It is unlikely improvements will be found using the host JS runtime; crypto.subtle doesn't give us what we need. Possible improvements could be found in the software implementation. It is probably more effective to lobby for v128.aes.* instructions in the WASM spec, no matter how long that takes.

Any improvement to the WASM code that lets the host JIT produce quality code sooner is important, because we incur a cost going from WASM to native code that RandomX, which compiles straight to native, never has to pay. Inspection of the JS runtimes' JITs is needed to actually understand what they are doing.

This document will track my progress with further optimisations. Feel free to add to it if needed or to voice criticisms and improvements.

@l1mey112 (Owner, Author) commented
  • JIT: A possibly meaningful optimisation:

Trade longer host JIT compilation time for the removal of all branching on fprc in floating-point operations. How? Generate four versions of the virtual machine, one per rounding mode, then perform a continuation-passing-style (CPS) transformation to jump to an entirely different implementation of the virtual machine dedicated to that rounding mode.

CPS transformation

Because CFROUND instructions can appear inside branches, CPS is necessary to implement this effectively. We can perform a feature detection to check whether the JS host supports return_call (tail call) instructions, and use those when JITting.
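The feature detection can be done the way the wasm-feature-detect library does it: ask the host to validate a tiny hand-encoded module whose only instruction is a return_call. Validation succeeds only on hosts implementing the tail-call proposal. A sketch:

```javascript
// Detect WebAssembly tail-call support from JS by validating a minimal
// module containing a single return_call (opcode 0x12).
const supportsTailCall = WebAssembly.validate(new Uint8Array([
  0x00, 0x61, 0x73, 0x6d,             // "\0asm" magic
  0x01, 0x00, 0x00, 0x00,             // version 1
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00, // type section: one type, () -> ()
  0x03, 0x02, 0x01, 0x00,             // function section: one func of type 0
  0x0a, 0x06, 0x01, 0x04, 0x00,       // code section: one body, no locals
  0x12, 0x00,                         // return_call $0
  0x0b,                               // end
]));
```

If this returns false, the JIT would have to fall back to plain call plus return, or skip the CPS specialisation entirely.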

The overhead of dispatching a call, even a tail call, could be exacerbated by the fact that most JITs probably won't do what we want with the calling convention. There are a lot of registers in the RandomX VM, and shuffling them across function calls would probably result in loads and stores (with subpar RA, they are probably doing that already in the current implementation).

@SChernykh commented
Fused multiply-add instructions can't be used because they don't do rounding between MUL and ADD, so the end result will be different.

@l1mey112 (Owner, Author) commented Sep 26, 2024

> Fused multiply-add instructions can't be used because they don't do rounding between MUL and ADD, so the end result will be different.

@SChernykh I use fused multiply-add instructions in the implementation of the directed-rounding floating-point operations. For example, when emulating the multiplicative instructions, an FMA can be used to calculate the error term of an operation efficiently (redo the multiply and subtract the rounded product in a single fused operation, leaving the exact rounding error), then branch on that term to adjust the final result. These slides explain error-free transformations perfectly.

The problem is that FMA in WASM isn't exactly deterministic, and even with feature detection the JIT just "does what it wants". It is disabled for now, but with it a single FP multiplicative instruction would carry only a couple cycles of overhead instead of ~10 extra FP operations (superscalar, out-of-order CPUs mean this isn't as slow as it sounds).
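For reference, a sketch of the fallback path being compared against: Dekker/Veltkamp twoProduct, an error-free transformation that recovers the exact rounding error of a double multiply in roughly the ~10 extra FP operations mentioned above. With a trustworthy FMA the error term collapses to a single e = fma(a, b, -p). This is an illustration of the technique, not the library's actual semifloat code:

```javascript
// Veltkamp splitting with 2^27 + 1: a = hi + lo, both halves fit in 26-27
// bits, so all the partial products below are exact in double precision.
function split(a) {
  const t = a * 134217729;
  const hi = t - (t - a);
  return [hi, a - hi];
}

// Dekker twoProduct: returns [p, e] with p = fl(a*b) and, barring
// overflow, a*b = p + e exactly in real arithmetic.
function twoProduct(a, b) {
  const p = a * b; // the rounded product
  const [ah, al] = split(a);
  const [bh, bl] = split(b);
  const e = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
  return [p, e];
}
```

The sign of e tells you which way the hardware rounded, which is what the directed-rounding emulation branches on.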

@l1mey112 (Owner, Author) commented
  • Possibly meaningful: Minimise the number of inter-language calls; keep only one inter-language call in the hot loop of virtual machine instantiation.
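A sketch of the shape this optimisation aims for (function and export names here are hypothetical, not the library's API): the JS side makes exactly one call across the language boundary per VM run, and everything else stays inside WASM:

```javascript
// Illustrative only: one JS -> WASM boundary crossing per VM instantiation.
// Crossing the boundary per instruction or per scratchpad access would
// multiply the call overhead by thousands.
function hashMany(wasmExports, seeds) {
  const out = [];
  for (const seed of seeds) {
    // Single inter-language call; the whole program runs behind it.
    out.push(wasmExports.run_vm(seed));
  }
  return out;
}
```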


@l1mey112 (Owner, Author) commented Sep 28, 2024

  • Possibly meaningful: Extract the hot parts of the VM body into a separate function to give the runtime a chance to optimise it. We are spending ~10 ms per virtual machine execution; we can surely spare some time for the runtime to hot-swap the body.

Execution time is dominated by VM execution, which is what you want. A JS runtime cannot replace a function with an optimised one during its execution, so we're left with absolutely zero optimised VM executions (in the prof output, a function prefixed with a star * is optimised).
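The experiment can be sketched like this (a toy stand-in, not the real randomx_body): the hot loop is extracted into its own small function so the runtime can tier it up independently of the cold setup code around it:

```javascript
// Hot loop lives in a small, monomorphic function: the JIT sees the same
// object shapes every call and can swap in optimised code between calls.
function randomxBody(state, iterations) {
  for (let i = 0; i < iterations; i++) {
    state.acc = (state.acc + state.step) | 0;
  }
  return state.acc;
}

// Cold setup stays outside the hot function's body.
function runVm(step, iterations) {
  const state = { acc: 0, step };
  return randomxBody(state, iterations);
}
```

As the update below notes, in practice the runtime still rarely got around to optimising the extracted body before it was discarded.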

Update: implemented and tested successfully, but not good enough to commit to the library. In a sea of a thousand unoptimised randomx_body functions, only a couple were optimised, without actually increasing the hashrate.


Very rarely V8 is able to optimise the body, and you can see it here. I do think it's correct to ignore the possibility that the compiler can optimise the JITted VM code.


@TheScreechingBagel commented Oct 3, 2024

> Mining with an initialised dataset (2 GiB allocation) is not supported (though easy to implement), no one on earth would give a webpage multiple gigabytes of memory.

is that really true though? Chrome might haha

maybe some basic environment checks could be used to decide whether to attempt using an initialized dataset or not

edit: maybe relevant article?
