Squeezing out performance. #1
Take a tradeoff: longer host JIT compilation time, but in return all branching for floating point operations based on the rounding mode (`fprc`) is removed. Why? Because the overhead of dispatching a call, even a tail call, could be exacerbated by the fact that most JITs probably won't do what we want with the calling convention. The RandomX VM has a lot of registers, and shuffling them among function calls would probably result in loads and stores (with subpar RA, they're probably doing that already in the current implementation).
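One possible shape of that tradeoff is sketched below, purely for illustration: pre-compile one specialised module per rounding mode, paying more JIT time up front for branch-free FP code. Note that RandomX's CFROUND changes `fprc` at run time, so a real implementation would need to specialise at a finer granularity than whole programs; `emitProgram` and the `run` export are invented names.

```ts
// Hedged sketch: one specialised variant per rounding mode. emitProgram() is
// hypothetical; WebAssembly.Module / WebAssembly.Instance are real synchronous APIs.
declare function emitProgram(fprc: number): Uint8Array; // hypothetical emitter

const specialised = new Map<number, WebAssembly.Instance>();
for (const fprc of [0, 1, 2, 3]) {
  const bytes = emitProgram(fprc); // rounding mode baked in, no fprc branches
  specialised.set(fprc, new WebAssembly.Instance(new WebAssembly.Module(bytes)));
}

// Select the pre-compiled variant once, instead of branching (or dispatching
// a call) on every floating point instruction.
function run(fprc: number): void {
  (specialised.get(fprc)!.exports.run as () => void)();
}
```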
Fused multiply-add instructions can't be used because they don't do rounding between MUL and ADD, so the end result will be different.
@SChernykh I use fused multiply-add instructions in the implementation of the directed-rounding floating point operations. For example, when emulating the multiplicative instructions, an FMA can be used to calculate the error term of an operation efficiently (compute the operation and subtract it from itself in a single rounding), then branch on that to adjust the final result. These slides explain error-free transformations perfectly. The problem is that FMA in WASM isn't exactly deterministic, and even with feature detection the JIT just "does what it wants". It is disabled for now, but with it a single FP multiplicative instruction would only have a couple cycles of overhead instead of ~10 extra FP operations (superscalar, out-of-order CPUs mean this isn't as slow as it seems).
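For readers unfamiliar with error-free transformations, here is a minimal TypeScript sketch of the FMA-free fallback alluded to above (Dekker's TwoProduct with Veltkamp splitting): it yields the rounded product plus an exactly representable error term, which a directed-rounding emulation can branch on to nudge the result by one ulp. With FMA the whole transformation collapses to `p = a * b; e = fma(a, b, -p)`. All names here are illustrative, not the library's API.

```ts
// Error-free transformation of a product without FMA (Dekker's TwoProduct).
// Invariant, barring overflow/underflow: a * b === p + e exactly.
const SPLITTER = 134217729; // 2**27 + 1, Veltkamp splitting constant for f64

function twoProduct(a: number, b: number): [p: number, e: number] {
  const p = a * b; // round-to-nearest product
  // Split each input into high/low halves whose partial products are exact.
  const ca = SPLITTER * a, ah = ca - (ca - a), al = a - ah;
  const cb = SPLITTER * b, bh = cb - (cb - b), bl = b - bh;
  // Recover the rounding error committed by p.
  const e = al * bl - (((p - ah * bh) - al * bh) - ah * bl);
  return [p, e];
}

// Step to the previous representable double (finite, non-NaN input assumed).
const buf = new DataView(new ArrayBuffer(8));
function nextDown(x: number): number {
  if (x === 0) return -Number.MIN_VALUE;
  buf.setFloat64(0, x);
  buf.setBigUint64(0, buf.getBigUint64(0) + (x > 0 ? -1n : 1n));
  return buf.getFloat64(0);
}

// Emulate a round-toward-negative-infinity multiply: if the exact product
// lies below the round-to-nearest result, nudge down one ulp.
function mulRoundDown(a: number, b: number): number {
  const [p, e] = twoProduct(a, b);
  return e < 0 ? nextDown(p) : p;
}
```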
Execution time is dominated by VM execution, which is what you want. A JS runtime cannot replace a function with an optimised one during its execution, so we're left with absolutely zero VM executions that are optimised (in the prof output, a star marks optimised functions). Update: implemented and tested successfully, but not good enough to commit to the library. In a sea of a thousand unoptimised functions, V8 is very rarely able to optimise the body, and you'll be able to see it here. I do think it's correct to ignore the possibility that a compiler can optimise the JIT VM code.
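If the runtime only swaps in optimised code between calls, one mitigation consistent with that observation is to re-enter the hot function often. A hypothetical sketch, with `executeChunk` invented for illustration:

```ts
// Hypothetical: slice one long-running VM call into many shorter calls so a
// tiered JIT has the chance to substitute optimised code between invocations.
declare function executeChunk(vm: unknown, iterations: number): void; // invented

function runSliced(vm: unknown, total: number, chunk = 256): void {
  for (let done = 0; done < total; done += chunk) {
    executeChunk(vm, Math.min(chunk, total - done));
  }
}
```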
Is that really true though? Chrome might, haha. Maybe some basic environment checks could be used to decide whether to attempt using an initialised dataset or not. Edit: maybe a relevant article?
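As a concrete (and entirely illustrative) version of those environment checks, assuming the decision is whether the ~2 GiB RandomX dataset is worth initialising: `navigator.deviceMemory` is a Chrome-only hint and the thresholds below are made up.

```ts
// Heuristic sketch: attempt full-dataset initialisation only when the
// environment plausibly has the memory and cores for it.
function shouldInitialiseDataset(): boolean {
  const cores = navigator.hardwareConcurrency ?? 1;
  const memGiB = (navigator as { deviceMemory?: number }).deviceMemory ?? 0;
  return cores >= 4 && memGiB >= 8; // illustrative thresholds
}
```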
20 H/s on release is okay, only 5 times slower than the reference implementation. Squeezing out 25 H/s (likely) or even 30 H/s (unlikely) would be amazing. I have not done huge amounts of testing; that comes later. But the improvements that I can see right now, I will list here:
Meaningful optimisations:

Possibly meaningful optimisations:

- `fprc`: currently it is just a simple table implementation (see the sketch after this list).
- `local.*` instructions: these will improve a baseline JIT's register allocation (RA), and may improve a subpar JIT's RA.

Holy grail optimisations:

- `crypto.subtle` doesn't give us what we need. Possible improvements could be found in the software implementation. Probably more effective to lobby for `v128.aes.*` instructions in the WASM spec, no matter how long it takes.

Any improvement to the WASM code that lets the host JIT produce quality code quicker is important, as we incur a cost going from WASM to native code that RandomX, which goes straight to native code, never has to pay. Inspection of JS runtimes' JITs is needed to actually understand what they are doing.
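To make the `fprc` bullet concrete, here is roughly what a "simple table implementation" looks like (names are hypothetical, not the library's actual code): every FP operation dispatches through a table indexed by the current rounding mode, which is the per-operation branching the first comment proposes to compile away.

```ts
// Rounding-mode dispatch table, in RandomX fprc order:
// 0 = nearest, 1 = toward -inf, 2 = toward +inf, 3 = toward zero.
type BinOp = (a: number, b: number) => number;
declare const mulNearest: BinOp, mulDown: BinOp, mulUp: BinOp, mulZero: BinOp; // hypothetical variants

const mulTable: readonly BinOp[] = [mulNearest, mulDown, mulUp, mulZero];

function fmul(fprc: number, a: number, b: number): number {
  return mulTable[fprc](a, b); // indirect dispatch on every multiply
}
```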
This document will track my progress with further optimisations. Feel free to add to it if needed or to voice criticisms and improvements.