
Squeezing out performance. #1

l1mey112 opened this issue Sep 25, 2024 · 7 comments
Comments

@l1mey112 (Owner) commented Sep 25, 2024

20 H/s on a release build is okay, only 5 times slower than the reference implementation. Squeezing out 25 H/s is likely; an unlikely 30 H/s would be amazing. I have not done huge amounts of testing, that comes later. But the improvements I can see right now are listed here:

Meaningful optimisations:

  • API: RandomX batching API, where it hashes the scratchpad and generates the program at the same time.
  • JIT: Re-enable the use of fused multiply-add instructions. Figure out why it isn't as deterministic as it seems, even after doing a runtime feature detect.

Possibly meaningful optimisations:

  • JIT: Adjust how semifloat is dispatched based on fprc; currently it is just a simple table implementation.
  • JIT: Precompute scratchpad addresses as much as possible in JIT code.
  • CRYPTO: Optimise Blake2b to use vector instructions. (Small improvement, since Blake2b is only called once in the hot path, to hash the register file at a fixed 256 bytes.)
  • JIT: Improve scheduling of virtual machine instructions. It is possible to reorder instructions so related ones sit close to each other, without using any local.* instructions. This will improve a baseline JIT's register allocation (RA), and may improve a subpar JIT's RA.
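For readers unfamiliar with the current dispatch scheme, here is a minimal sketch of what "a simple table implementation" keyed on fprc could look like. All names and structure are illustrative, not the library's actual code; the directed-rounding emulation itself is elided:

```javascript
// Illustrative sketch only: one implementation per RandomX rounding mode,
// selected by indexing a table with the current fprc value. The point is
// that every FP op pays a table load plus an indirect call.
const fpOps = {
  add: [
    (a, b) => a + b, // fprc = 0: round-to-nearest, the hardware default
    (a, b) => a + b, // fprc = 1: toward -inf (semifloat emulation elided)
    (a, b) => a + b, // fprc = 2: toward +inf (semifloat emulation elided)
    (a, b) => a + b, // fprc = 3: toward zero (semifloat emulation elided)
  ],
};

let fprc = 0; // set at runtime by CFROUND

function vmAdd(a, b) {
  // Table dispatch on every FP instruction.
  return fpOps.add[fprc](a, b);
}
```

The alternative discussed later in this thread is to eliminate this per-op dispatch entirely by specialising whole VM bodies per rounding mode.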

Holy grail optimisations:

  • CRYPTO: Fast AES. It is unlikely improvements will be found using the host JS runtime; crypto.subtle doesn't give us what we need. Possible improvements could be found in the software implementation. It is probably more effective to lobby for v128.aes.* instructions in the WASM spec, no matter how long that takes.

Any improvement to the WASM code that lets the host JIT produce quality code sooner is important, because we incur a cost going from WASM to native code that RandomX, which compiles straight to native, never has to pay. Inspection of the JS runtimes' JITs is needed to actually understand what they are doing.

This document will track my progress with further optimisations. Feel free to add to it if needed or to voice criticisms and improvements.

@l1mey112 (Owner, Author) commented
  • JIT: A possibly meaningful optimisation:

Trade longer host JIT compilation time for the removal of all branching on fprc in floating-point operations. How? Generate four versions of the virtual machine, one per rounding mode, then perform a continuation-passing-style (CPS) transformation to jump to an entirely different implementation of the virtual machine dedicated to that rounding mode.

CPS transformation

Because CFROUND instructions can appear inside branches, CPS is necessary to implement this effectively. We can perform a feature detection to check whether the JS host supports return_call (tail call) instructions, and use those when JITting.
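The feature detection can be done the way the wasm-feature-detect library does it: ask the host to validate a tiny hand-encoded module whose only instruction is a return_call. Validation succeeds only on hosts implementing the tail-call proposal. A sketch:

```javascript
// Detect WebAssembly tail-call support from JS by validating a minimal
// module containing a single return_call (opcode 0x12).
const supportsTailCall = WebAssembly.validate(new Uint8Array([
  0x00, 0x61, 0x73, 0x6d,             // "\0asm" magic
  0x01, 0x00, 0x00, 0x00,             // version 1
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00, // type section: one type, () -> ()
  0x03, 0x02, 0x01, 0x00,             // function section: one func of type 0
  0x0a, 0x06, 0x01, 0x04, 0x00,       // code section: one body, no locals
  0x12, 0x00,                         // return_call $0
  0x0b,                               // end
]));
```

If this returns false, the JIT would have to fall back to plain call plus return, or skip the CPS specialisation entirely.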

The overhead of dispatching a call, even a tail call, could be exacerbated by the fact that most JITs probably won't do what we want with the calling convention. There are a lot of registers in the RandomX VM, and shuffling them across function calls would probably result in loads and stores (with subpar RA, they are probably doing that already in the current implementation).

@SChernykh commented
Fused multiply-add instructions can't be used because they don't do rounding between MUL and ADD, so the end result will be different.

@l1mey112 (Owner, Author) commented Sep 26, 2024

> Fused multiply-add instructions can't be used because they don't do rounding between MUL and ADD, so the end result will be different.

@SChernykh I use fused multiply-add instructions in the implementation of the directed-rounding floating-point operations. For example, when emulating the multiplicative instructions, an FMA can be used to calculate the error term of an operation efficiently (redo the multiply and subtract the rounded product in a single fused operation, leaving the exact rounding error), then branch on that term to adjust the final result. These slides explain error-free transformations perfectly.

The problem is that FMA in WASM isn't exactly deterministic, and even with feature detection the JIT just "does what it wants". It is disabled for now, but with it a single FP multiplicative instruction would carry only a couple cycles of overhead instead of ~10 extra FP operations (superscalar, out-of-order CPUs mean this isn't as slow as it sounds).
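For reference, a sketch of the fallback path being compared against: Dekker/Veltkamp twoProduct, an error-free transformation that recovers the exact rounding error of a double multiply in roughly the ~10 extra FP operations mentioned above. With a trustworthy FMA the error term collapses to a single e = fma(a, b, -p). This is an illustration of the technique, not the library's actual semifloat code:

```javascript
// Veltkamp splitting with 2^27 + 1: a = hi + lo, both halves fit in 26-27
// bits, so all the partial products below are exact in double precision.
function split(a) {
  const t = a * 134217729;
  const hi = t - (t - a);
  return [hi, a - hi];
}

// Dekker twoProduct: returns [p, e] with p = fl(a*b) and, barring
// overflow, a*b = p + e exactly in real arithmetic.
function twoProduct(a, b) {
  const p = a * b; // the rounded product
  const [ah, al] = split(a);
  const [bh, bl] = split(b);
  const e = ((ah * bh - p) + ah * bl + al * bh) + al * bl;
  return [p, e];
}
```

The sign of e tells you which way the hardware rounded, which is what the directed-rounding emulation branches on.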

@l1mey112 (Owner, Author) commented
  • Possibly meaningful: Minimise the number of inter-language calls; keep only one inter-language call in the hot loop of virtual machine instantiation.
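A sketch of the shape this optimisation aims for (function and export names here are hypothetical, not the library's API): the JS side makes exactly one call across the language boundary per VM run, and everything else stays inside WASM:

```javascript
// Illustrative only: one JS -> WASM boundary crossing per VM instantiation.
// Crossing the boundary per instruction or per scratchpad access would
// multiply the call overhead by thousands.
function hashMany(wasmExports, seeds) {
  const out = [];
  for (const seed of seeds) {
    // Single inter-language call; the whole program runs behind it.
    out.push(wasmExports.run_vm(seed));
  }
  return out;
}
```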


@l1mey112 (Owner, Author) commented Sep 28, 2024

  • Possibly meaningful: Extract the hot parts of the VM body into a separate function to give the runtime a chance to optimise it. We are spending ~10 ms per virtual machine execution; we can surely spare some time for the runtime to hot-swap the body.

Execution time is dominated by VM execution, which is what you want. A JS runtime cannot replace a function with an optimised one during its execution, so we're left with absolutely zero optimised VM executions (in the prof output, a function prefixed with a star * is optimised).
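The experiment can be sketched like this (a toy stand-in, not the real randomx_body): the hot loop is extracted into its own small function so the runtime can tier it up independently of the cold setup code around it:

```javascript
// Hot loop lives in a small, monomorphic function: the JIT sees the same
// object shapes every call and can swap in optimised code between calls.
function randomxBody(state, iterations) {
  for (let i = 0; i < iterations; i++) {
    state.acc = (state.acc + state.step) | 0;
  }
  return state.acc;
}

// Cold setup stays outside the hot function's body.
function runVm(step, iterations) {
  const state = { acc: 0, step };
  return randomxBody(state, iterations);
}
```

As the update below notes, in practice the runtime still rarely got around to optimising the extracted body before it was discarded.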

Update: implemented and tested successfully, but not good enough to commit to the library. In a sea of a thousand unoptimised randomx_body functions, only a couple were optimised, without actually increasing the hashrate.


Very rarely V8 is able to optimise the body, and you can see it here. I do think it's correct to ignore the possibility that the compiler can optimise the JITted VM code.


@TheScreechingBagel commented Oct 3, 2024

> Mining with an initialised dataset (2 GiB allocation) is not supported (though easy to implement), no one on earth would give a webpage multiple gigabytes of memory.

is that really true though? Chrome might haha

maybe some basic environment checks could be used to decide whether to attempt using an initialized dataset or not

edit: maybe relevant article?
