Optimized SSE2 alpha blit algorithm #2766
Hey, this is a great avenue for a PR. The SSE alpha blitters are some of the oldest SIMD code we have, and we've learned new strategies since they were written. Because the old algorithm went one pixel at a time in the odd path, it had a fast path where it would just copy over the pixel data if certain conditions were met. How does the performance now stack up against that fast path? Also, tell me more about the changes. Did you make any algorithmic changes, or are they just changes to increase the parallelism?
Anyway, about this:
The changes mainly focus on improving parallelism, as I'm following the SDL formula exactly. I tried to save a few unpacks here and there, but it's really not much different from before.
Test program:

from sys import stdout
from pstats import Stats
from cProfile import Profile

import pygame

pygame.init()


def speed_test_blits():
    # chosen because they contain edge cases.
    nums = [0, 1, 65, 126, 127, 199, 254, 255]
    results = {}
    for iterations in range(0, 10000):
        for dst_r, dst_b, dst_a in zip(nums, reversed(nums), reversed(nums)):
            for src_r, src_b, src_a in zip(nums, reversed(nums), nums):
                src_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
                src_surf.fill((src_r, 255, src_b, src_a))
                dest_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
                dest_surf.fill((dst_r, 255, dst_b, dst_a))
                dest_surf.blit(src_surf, (0, 0))
                key = ((dst_r, dst_b, dst_a), (src_r, src_b, src_a))
                results[key] = dest_surf.get_at((65, 33))


if __name__ == "__main__":
    profiler = Profile()
    profiler.runcall(speed_test_blits)
    stats = Stats(profiler, stream=stdout)
    stats.strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats()

Results on my Raspberry Pi:
Looks like a solid improvement here when the SSE2 code is converted to NEON code.
OK, LGTM 👍
I see a solid performance improvement heading down this code path and it passes all the unit tests. The SIMD code is a lot simpler now and more in tune with what we've been doing in recent blitters rather than these older versions. Much has been learned since!
Good job! 👍 🎈 🍾
Remake of #2680. Sort of a continuation of #2601. This should solve some issues with CI and clear the mess that was the previous PR.
The old strategy of processing 2 pixels at once for even widths or 1 pixel at a time for odd widths has been replaced. Now, each row is processed 4 pixels at a time for (width // 4) iterations, and the remaining 1, 2, or 3 pixels are then handled as needed.
This change speeds up the algorithm by approximately 55% for even-width blits and 170% for odd-width blits.
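The row-splitting strategy described above can be sketched in plain Python. This is only an illustration of the loop structure, not the SSE2 code from this PR: the names `split_width`, `blend_pixel`, and `blend_row` are hypothetical, and `blend_pixel` is a generic integer "source over" blend assumed for the example, not the exact arithmetic pygame or SDL use.

```python
def split_width(width):
    """Return (number of 4-pixel chunks, leftover pixel count in 0..3)."""
    return divmod(width, 4)


def blend_pixel(src, dst):
    """Generic integer 'source over' blend of two RGBA tuples.

    Assumption for illustration only; not pygame's exact formula.
    """
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    inv = 255 - sa
    return (
        (sr * sa + dr * inv) // 255,
        (sg * sa + dg * inv) // 255,
        (sb * sa + db * inv) // 255,
        sa + (da * inv) // 255,
    )


def blend_row(src_row, dst_row):
    """Blend one row: 4 pixels per main-loop step, then the 1-3 leftovers."""
    chunks, leftover = split_width(len(src_row))
    out = []
    # Main loop: in the SIMD version, each step handles 4 pixels at once.
    for i in range(chunks):
        start = 4 * i
        out.extend(
            blend_pixel(s, d)
            for s, d in zip(src_row[start:start + 4], dst_row[start:start + 4])
        )
    # Remainder: 0 to 3 pixels left at the end of the row.
    tail = 4 * chunks
    out.extend(
        blend_pixel(s, d)
        for s, d in zip(src_row[tail:tail + leftover], dst_row[tail:tail + leftover])
    )
    return out
```

For the 66-pixel-wide surfaces in the test program, `split_width(66)` gives 16 four-pixel steps plus 2 leftover pixels per row, so both the main loop and the remainder path are exercised.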