My personal experiment with building clang sharpened for building godot #11
Thank you for sharing your project! If your aim is to provide the fastest possible LLVM toolchain, I can recommend checking the comments in this Phoronix thread, especially the part about "static vs dynamic Clang builds". Since a static Clang is faster, if you are able to use the static version instead of the dynamic one, you can probably improve performance even more.
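For reference, the static vs dynamic distinction above comes down to a couple of CMake switches in LLVM's build. A minimal sketch, assuming an upstream LLVM monorepo checkout (paths and generator are examples):

```shell
# Dynamic: clang links against libLLVM.so / libclang-cpp.so
cmake -S llvm -B build-dynamic -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_LINK_LLVM_DYLIB=ON \
  -DCLANG_LINK_CLANG_DYLIB=ON

# Static (the default): everything is linked into the clang binary,
# avoiding PLT/GOT indirection on hot cross-library calls
cmake -S llvm -B build-static -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang"
```

The dynamic variant saves disk space when many tools share libLLVM, while the static one tends to be faster at runtime, which is the trade-off discussed in that thread.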
You got a 1% improvement with BOLT after applying PGO, right? Honestly, I would have expected a bit more improvement. Did you apply BOLT via sampling (aka external
You are welcome! I would be happy if you shared these materials with other people. The more people know about the positive effects of PGO on performance, the higher the chances of getting PGO-optimized software here and there.
I already use static binaries, though applying (instrumented) BOLT to static binaries isn't possible. I use instrumentation mode, and it requires dynamic linking. It looks like it inserts some logic between libc and the binary, judging by the error message (I'll try to remember to post its output next time). That said, I personally didn't see a difference with the static build. The numbers dropped by about a minute at some point, from 28:30 to 27:30; maybe it was because of the static build, tbh I don't know.
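For readers following along, the instrumentation-mode workflow being discussed looks roughly like this. A sketch, assuming a clang binary linked with -Wl,--emit-relocs (BOLT needs relocations to rearrange code); the file paths and training workload are placeholders:

```shell
# Insert instrumentation counters into the binary; this step links in
# BOLT's runtime, which is where the dynamic-linking requirement comes from
llvm-bolt ./clang -instrument -o ./clang.inst \
  -instrumentation-file=/tmp/clang.fdata

# Run a representative training workload with the instrumented binary
# (e.g. compile part of Godot), which writes the profile on exit
./clang.inst -O2 -c training_file.cpp -o /dev/null

# Apply the collected profile with common layout optimizations
llvm-bolt ./clang -o ./clang.bolt -data=/tmp/clang.fdata \
  -reorder-blocks=ext-tsp -reorder-functions=hfsort \
  -split-functions -icf=1
```

The sampling alternative mentioned above replaces the first two steps with perf-based collection on an unmodified binary, which sidesteps the instrumentation runtime entirely.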
I've applied PGO + Full LTO and then instrumented BOLT, and BOLT on top of that gave me less than 1%. I ran only one training pass with BOLT (3 with PGO), but the control pass was the same as the BOLT training pass, so I am not sure that more passes with bigger coverage will help. I am also trying PGO + Thin LTO + BOLT right now (with more passes; I store the intermediate profiles from each pass, so I will also be able to compare PGO + Full LTO + BOLT and PGO + Thin LTO + BOLT on the same data).

I also have an idea for how to use BOLT's instrumentation mode with static binaries, but I haven't tried it yet: compile two versions, one static and one dynamic, perform instrumentation and profile gathering on the dynamic one, and use the profile on the static binary. Maybe it will work.

I can also tell you that in my case adding CSIR on top of IR PGO gave me the same result as BOLT, less than 1%, and in fact exactly the same number. I've also tried compiling LLVM with Polly, but it didn't give me any results. Though, as with BOLT, I am not sure I used either of them correctly.

I haven't thoroughly read your article yet and don't know if it is there, but I want to highlight one thing from my experience: the %Nm parameter for PGO profiles works for me, and it makes it possible to place them in RAM through /dev/shm. No need to wear out 40 GB of disk for a single run. It doesn't work with BOLT profiles though... Placing profiles in RAM is exactly what I did, so anyone building this way should avoid %p and storing the files on a real disk... At least it worked for me.
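To make the %Nm point above concrete, here is a sketch of keeping PGO profiles in RAM, assuming a clang built with -fprofile-generate. %4m is the "merge pool" specifier: up to 4 .profraw files shared by all processes and merged online, instead of one file per PID as with %p (directory names are examples):

```shell
# Keep raw profiles on a tmpfs instead of real disk
mkdir -p /dev/shm/pgo
export LLVM_PROFILE_FILE="/dev/shm/pgo/clang-%4m.profraw"

# ...run the training workload here, e.g. the Godot build...

# Merge the small set of pooled raw profiles into the final
# .profdata, which can live on real disk
llvm-profdata merge -o clang.profdata /dev/shm/pgo/*.profraw
```

With %p, every compiler invocation writes its own multi-megabyte .profraw, which is where the tens of gigabytes of disk traffic come from; the merge pool keeps the count bounded regardless of how many processes run.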
Got it. Additionally, I can recommend joining the CachyOS Discord channel (the invite link). These people have also experimented a lot with PGO, BOLT, Propeller and other stuff, so they can share their experience with you - possibly it will be helpful too.
Hmmm... I would not expect any improvements from doing more training passes in this case, since they will execute the same code paths. So probably in your setup you reached your "performance sweet spot" with just LTO + PGO. However, even a 1% improvement is a good result for software like compilers, which is executed frequently by build machines, developers' work machines, etc. You did a good job even with this 1% - it's definitely worth it for users!
In theory, the former variant - PGO + Full LTO + BOLT - should result in more efficient code than with ThinLTO (this comes from ThinLTO's nature - I recently wrote a brief comment about it). In practice, it depends a lot on the project. So it would be very interesting to see the actual results for the LLVM project itself. In this case, I would expect pretty much the same performance level between Full and Thin LTO but smaller binaries in the Full LTO case - but that's just an assumption for now.
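For anyone wanting to run this comparison themselves: in LLVM's own build, the LTO mode is a single CMake switch (LLVM_ENABLE_LTO accepts Off, On, Thin, or Full). A sketch, with build directories and generator as examples:

```shell
# Full LTO: all modules merged and optimized as one unit
# (slower link, more cross-module optimization)
cmake -S llvm -B build-full -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_LTO=Full

# ThinLTO: summary-based, parallel cross-module optimization
# (much faster link, usually close in performance)
cmake -S llvm -B build-thin -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_LTO=Thin
```

Both configurations produce a clang that can then be fed to BOLT, so the PGO + LTO + BOLT pipelines discussed above differ only in this flag.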
Ohh, be careful with this approach. Such techniques can be quite sensitive to dynamic vs static compilation, so a profile from the dynamic version can be mostly invalid for the static case. However, it would be interesting to see the actual evidence. I am not a huge expert in the BOLT nuances, so probably asking this question in the LLVM Discord in the
I can confirm your findings - in my experiments I also wasn't able to get a performance difference between IR PGO and CSIR PGO. I tried to get numbers from the Google people, but they didn't provide any good benchmark results, so... At least for now, I consider IR PGO to have the same efficiency as CSIR PGO in practice. Maybe more tests with other projects are required.
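For context, here is how the two PGO flavours being compared differ at the flag level. A sketch using plain clang invocations on a placeholder app.c (LLVM's own build wraps the same mechanism behind its CMake cache files):

```shell
# IR PGO: instrument, train, merge
clang -O2 -fprofile-generate=/dev/shm/pgo app.c -o app.inst
./app.inst                                  # training run(s)
llvm-profdata merge -o ir.profdata /dev/shm/pgo/*.profraw

# CSIR PGO: a second, context-sensitive instrumentation pass is
# layered on top of the first profile, after inlining decisions
clang -O2 -fprofile-use=ir.profdata \
  -fcs-profile-generate=/dev/shm/cspgo app.c -o app.cs-inst
./app.cs-inst                               # training run(s) again
llvm-profdata merge -o cs.profdata ir.profdata /dev/shm/cspgo/*.profraw

# Final build consumes the combined profile
clang -O2 -fprofile-use=cs.profdata app.c -o app
```

CSIR's selling point is more accurate counts for code duplicated by inlining, which is also why its benefit overlaps so heavily with what BOLT does, matching the "same number" observation above.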
I played with Polly several years ago and my experience wasn't good either. I can confirm that Polly is pretty buggy - and these results match my findings (my colleagues also had the same experience). So maybe Polly actually works somewhere, but I haven't met such a project in my practice.
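For completeness, "compiling LLVM with Polly" involves two separate steps, and a common pitfall is doing only the first. A sketch, assuming an upstream monorepo checkout:

```shell
# Step 1: build Polly alongside clang
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;polly"

# Step 2: Polly is opt-in per compilation; without -mllvm -polly
# it is built but never runs, which can look like "no results"
clang -O3 -mllvm -polly app.c -o app
```

Polly only transforms loops that fit its polyhedral model (affine bounds and accesses), so on irregular code like a compiler's own sources it frequently finds nothing to do even when enabled correctly.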
No-no, I am pretty sure that in this case you are doing everything the right way - these options just don't bring performance improvements in your case. That's it ;)
Actually, that's a clever option! I was thinking about it but didn't use it in my experiments. I am sure this "trick" is worth documenting in the article. Thanks for mentioning it!
Yeah. That's because, AFAIK, PGO and BOLT currently use different internal infrastructure for working with profiles - that's why the options are not fully compatible between PGO and BOLT profiling.
Noticed some updates in the article and wanted to kindly remind you about this, in case you had forgotten about it :)

And one more thing. If people lack RAM for LTO and use Linux, they can use zram. It is really a holy grail when you need that RAM to produce an optimized binary but don't have enough... I've seen a compression ratio of 1:4 with default settings (zstd, level 3 I think, not sure), so 4 GB of RAM fit inside 1 physical GB. I think it should be mentioned too, because few people know about it.
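A minimal sketch of the zram setup described above, assuming root and a kernel with the zram module; the 16G size is an example and should be tuned to the machine:

```shell
# Load the module and grab a free /dev/zramN device with zstd,
# the compressor the ~1:4 ratio above was observed with
modprobe zram
DEV=$(zramctl --find --size 16G --algorithm zstd)

# Turn it into high-priority swap so the LTO link spills there
# before touching any disk-backed swap
mkswap "$DEV"
swapon --priority 100 "$DEV"

# ...run the link-heavy build...

# Tear down afterwards
swapoff "$DEV"
zramctl --reset "$DEV"
```

Note that --size is the uncompressed capacity; actual physical RAM used is that divided by the achieved compression ratio, which is where the "4 GB inside 1 GB" figure comes from.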
Thanks! I didn't forget about your suggestions ;) I am just busy with other activities at the moment, so adding your advice is just in the backlog for now.
Hello! You might be interested in this in some way, so I am giving you a link: https://github.com/dustdfg/godot-devkit. Maybe reading your article will help me, because BOLT gave me only a 1% speedup... Btw, thank you for collecting all of this in one place. I wish I'd found it a bit earlier...