
My personal experiment with building clang sharpened for building godot #11

dustdfg opened this issue Jan 12, 2025 · 5 comments
@dustdfg commented Jan 12, 2025

Hello! This might interest you, so I'm sharing a link: https://github.com/dustdfg/godot-devkit. Maybe reading your article will help me, because BOLT gave me only a 1% speedup... Btw, thank you for collecting all this in one place. I wish I had found it a bit earlier...

dustdfg changed the title from "My personal experiment with buildingclang sharpened for building godot" to "My personal experiment with building clang sharpened for building godot" on Jan 12, 2025
@zamazan4ik (Owner)

> Hello! This might interest you, so I'm sharing a link: https://github.com/dustdfg/godot-devkit

Thank you for sharing your project! If your aim is to provide the fastest possible LLVM toolchain, I recommend checking the comments in this Phoronix thread, especially the discussion of "static vs dynamic Clang builds". Since a static Clang is faster, if you are able to use the static version instead of the dynamic one, you can probably improve performance even more.
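For reference, a statically linked Clang build can be configured roughly like this (a sketch; the exact option set depends on the LLVM version, and `LLVM_BUILD_STATIC` additionally requires a libc suitable for fully static linking):

```shell
# Sketch: build clang without the shared libLLVM/libclang-cpp,
# so the resulting binary is statically linked against LLVM.
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_LINK_LLVM_DYLIB=OFF \
  -DCLANG_LINK_CLANG_DYLIB=OFF \
  -DLLVM_BUILD_STATIC=ON    # optional: also link libc statically
ninja -C build clang
```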

> Maybe reading your article will help me, because BOLT gave me only a 1% speedup...

You got the 1% improvement from BOLT after applying PGO, right? Honestly, I would have expected a bit more improvement. Did you apply BOLT via sampling (i.e., an external perf-based BOLT profile) or via instrumentation?
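For context, the two BOLT profiling modes mentioned here look roughly like this (binary and file names are illustrative):

```shell
# Sampling mode: record LBR samples with perf on an ordinary
# (non-instrumented) clang run, then convert them for BOLT.
perf record -e cycles:u -j any,u -o perf.data -- \
    ./clang -O2 -c some-workload.cpp
perf2bolt -p perf.data -o clang.fdata ./clang

# Instrumentation mode: BOLT inserts counters into the binary itself.
llvm-bolt ./clang -instrument -o ./clang.inst
./clang.inst -O2 -c some-workload.cpp   # writes /tmp/prof.fdata by default

# Either kind of profile is then consumed the same way:
llvm-bolt ./clang -o ./clang.bolt -data=clang.fdata
```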

> Btw, thank you for collecting all this in one place. I wish I had found it a bit earlier...

You are welcome! I would be happy if you shared these materials with other people. The more people know about the positive performance effects of PGO, the higher the chances of getting PGO-optimized software here and there.

@dustdfg (Author) commented Jan 13, 2025

> Thank you for sharing your project! If your aim is to provide the fastest possible LLVM toolchain, I recommend checking the comments in this Phoronix thread, especially the discussion of "static vs dynamic Clang builds". Since a static Clang is faster, if you are able to use the static version instead of the dynamic one, you can probably improve performance even more.

I already use static binaries. However, applying (instrumented) BOLT to static binaries isn't possible: I use instrumentation mode, and it requires dynamic linking. It looks like BOLT inserts some logic between libc and the binary, judging by the error message (I will try to remember it and post the output next time). That said, I personally didn't see a difference with the static build. The build time dropped by about a minute at some point, from 28:30 to 27:30; maybe that was because of the static build, but tbh I don't know.

> You got the 1% improvement from BOLT after applying PGO, right? Honestly, I would have expected a bit more improvement. Did you apply BOLT via sampling (i.e., an external perf-based BOLT profile) or via instrumentation?

I applied PGO + Full LTO and then instrumented BOLT, and BOLT on top of it gave me less than 1%. I ran only one training pass with BOLT (3 with PGO), but the control pass was the same as the BOLT training pass, so I am not sure that more passes with bigger coverage would help.

I am also trying PGO + Thin LTO + BOLT right now (plus more passes; I store the intermediate profiles from each pass, so I will also be able to compare PGO + Full LTO + BOLT and PGO + Thin LTO + BOLT on the same data).

I also have an idea for how to use the instrumentation mode of BOLT for static binaries, but I haven't tried it yet: compile two versions, one static and one dynamic. Perform the instrumentation and profile gathering on the dynamic one, and apply the profile to the static binary. Maybe it will work.
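Sketched out, this dynamic-to-static transfer idea would look something like the following (untested; binary names are illustrative, and the profile may not map cleanly between the two layouts):

```shell
# Untested sketch of the idea above: gather a BOLT profile on a
# dynamically linked clang, then apply it to the static one.
llvm-bolt ./clang-dynamic -instrument -o ./clang-dynamic.inst \
    -instrumentation-file=/tmp/clang.fdata
./clang-dynamic.inst -O2 -c training-workload.cpp   # writes the profile

# Apply the dynamic-build profile to the static binary. BOLT matches
# functions by name, but layout differences between the two builds
# may make much of the profile stale.
llvm-bolt ./clang-static -o ./clang-static.bolt -data=/tmp/clang.fdata \
    -reorder-blocks=ext-tsp -split-functions
```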

I can also tell you that in my case adding CSIR PGO on top of IR PGO gave me the same kind of result as BOLT: less than 1%, and in fact exactly the same number. I've also tried compiling LLVM with Polly, but it didn't give me any results. Though I am not sure about the correct usage of either of them, just as with BOLT.

I haven't thoroughly read your article yet and don't know if it's covered there, but I want to highlight one thing from my experience: the %Nm parameter for PGO profile paths works for me, and it is possible to place the profiles in RAM through /dev/shm. No need to wear out 40GB of disk when it can just live in RAM. It doesn't work with BOLT profiles though... Placing the profiles in RAM is exactly what I did, so anyone doing such a build should avoid using %p and storing the files on a real disk... At least it worked for me.
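Concretely, the trick looks like this (paths are illustrative; `%4m` keeps a small pool of continuously merged raw profiles instead of one file per process as `%p` does):

```shell
# Keep raw PGO profiles in tmpfs and let %Nm merge them in place,
# instead of %p writing one large .profraw per compiler process.
mkdir -p /dev/shm/llvm-pgo
export LLVM_PROFILE_FILE="/dev/shm/llvm-pgo/clang-%4m.profraw"

# ...run the training build with the instrumented clang...

# Merge the small profile pool into an indexed profile on disk.
llvm-profdata merge -o clang.profdata /dev/shm/llvm-pgo/*.profraw
```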

@zamazan4ik (Owner)

> I already use static binaries. However, applying (instrumented) BOLT to static binaries isn't possible: I use instrumentation mode, and it requires dynamic linking. It looks like BOLT inserts some logic between libc and the binary, judging by the error message (I will try to remember it and post the output next time). That said, I personally didn't see a difference with the static build. The build time dropped by about a minute at some point, from 28:30 to 27:30; maybe that was because of the static build, but tbh I don't know.

Got it. Additionally, I can recommend joining the CachyOS Discord channel (the invite link). These people have also experimented a lot with PGO, BOLT, Propeller and other stuff, so they can share their experience with you - possibly it will be helpful too.

> I applied PGO + Full LTO and then instrumented BOLT, and BOLT on top of it gave me less than 1%. I ran only one training pass with BOLT (3 with PGO), but the control pass was the same as the BOLT training pass, so I am not sure that more passes with bigger coverage would help.

Hmmm... I would not expect any improvements from doing more training passes in this case, since they will execute the same code paths. So probably in your setup you hit your "performance sweet spot" just with LTO + PGO. However, even a 1% improvement is a good result for software like compilers, which is executed frequently by build machines, developer workstations, etc. You did a good job even with this 1% - it's definitely worth it for users!

> I am also trying PGO + Thin LTO + BOLT right now (plus more passes; I store the intermediate profiles from each pass, so I will also be able to compare PGO + Full LTO + BOLT and PGO + Thin LTO + BOLT on the same data).

In theory, the former variant - PGO + Full LTO + BOLT - should result in more efficient code than with ThinLTO (this comes from ThinLTO's nature - I recently wrote a brief comment about it). In practice, it depends a lot on the project, so it would be very interesting to see the actual results for the LLVM project itself. In this case, I would expect pretty much the same performance level between Full and Thin LTO, but smaller binaries in the Full LTO case - though that's just an assumption for now.

> I also have an idea for how to use the instrumentation mode of BOLT for static binaries, but I haven't tried it yet: compile two versions, one static and one dynamic. Perform the instrumentation and profile gathering on the dynamic one, and apply the profile to the static binary. Maybe it will work.

Ohh, be careful with this approach. Such techniques can be quite sensitive to dynamic vs static compilation, so a profile from the dynamic version can be mostly invalid for the static case. However, it would be interesting to see actual evidence. I am not a huge expert in BOLT nuances, so asking this question in the LLVM Discord, in the #bolt channel, would probably be a good time-saving option.

> I can also tell you that in my case adding CSIR PGO on top of IR PGO gave me the same kind of result as BOLT: less than 1%, and in fact exactly the same number.

I can confirm your findings - in my experiments I also wasn't able to measure a performance difference between IR PGO and CSIR PGO. I tried to get numbers from the Google people, but they didn't provide any good benchmark results, so... At least for now, I consider IR PGO to have the same practical efficiency as CSIR PGO. Maybe more tests with other projects are required.
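For reference, the CSIR-on-top-of-IR workflow with Clang looks roughly like this (a sketch; `main.c` and the profile directory names are illustrative):

```shell
# Round 1: plain IR instrumentation.
clang -O2 -fprofile-generate=ir-prof main.c -o app
./app && llvm-profdata merge -o ir.profdata ir-prof

# Round 2: optimize with the IR profile while adding
# context-sensitive instrumentation on top of it.
clang -O2 -fprofile-use=ir.profdata -fcs-profile-generate=cs-prof \
    main.c -o app-cs
./app-cs && llvm-profdata merge -o final.profdata cs-prof ir.profdata

# Final optimized build with the combined profile.
clang -O2 -fprofile-use=final.profdata main.c -o app-final
```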

> I've also tried compiling LLVM with Polly, but it didn't give me any results.

I played with Polly several years ago and my experience wasn't good either. I can confirm that Polly is pretty buggy - and these results match my findings (my colleagues had the same experience too). So maybe somewhere Polly actually works, but I haven't met such a project in my practice.

> Though I am not sure about the correct usage of either of them, just as with BOLT.

No-no, I am pretty sure that in this case you are doing everything the right way - these options just don't bring performance improvements in your case. That's it ;)

> I haven't thoroughly read your article yet and don't know if it's covered there, but I want to highlight one thing from my experience: the %Nm parameter for PGO profile paths works for me, and it is possible to place the profiles in RAM through /dev/shm.

Actually, that's a clever option! I was thinking about it but didn't use it in my experiments. I am sure this "trick" is worth documenting in the article. Thanks for mentioning it!

> It doesn't work with BOLT profiles though...

Yeah. That's because, AFAIK, PGO and BOLT currently use different internal infrastructure for working with profiles - that's why the options are not fully compatible between PGO and BOLT profiling.

@dustdfg (Author) commented Jan 25, 2025

> Actually, that's a clever option! I was thinking about it but didn't use it in my experiments. I am sure this "trick" is worth documenting in the article. Thanks for mentioning it!

I noticed some updates in the article and wanted to kindly remind you about this, in case you had forgotten :)

And one more thing: if people lack RAM for LTO and they use Linux, they can use zram. It is really a holy grail when you need RAM to produce an optimized binary but don't have enough... I've seen a compression ratio of 1:4 with the default settings (zstd, level 3 I think, but I'm not sure), so 4GB of data fit inside 1 physical GB of RAM. I think it should be mentioned too, because few people know about it.
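For anyone who wants to try it, a minimal manual setup looks like this (needs root; size and priority are illustrative):

```shell
# Create a zstd-compressed swap device backed by RAM.
modprobe zram
dev=$(zramctl --find --size 16G --algorithm zstd)   # e.g. /dev/zram0
mkswap "$dev"
swapon --priority 100 "$dev"    # prefer zram over disk swap

# During the LTO link, compare the DATA (uncompressed) and COMPR
# (compressed) columns to see the actual compression ratio:
zramctl
```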

@zamazan4ik (Owner)

Thanks! I didn't forget about your suggestions ;) I am just busy with other activities at the moment, so adding your advice is sitting in the backlog for now.
