Optimized SSE2 alpha blit algorithm #2766

itzpr3d4t0r · 2024-03-24T16:18:33Z

Remake of #2680. Sort of a continuation of #2601. This should solve some issues with CI and clear the mess that was the previous PR.

The old strategy of processing 2 pixels at once for even widths or 1 pixel at a time for odd widths has been replaced. Now, it processes 4 pixels at once for (width // 4) iterations, and 1, 2, or 3 pixels as needed.

This change speeds up the algorithm by approximately 55% for even blits and 170% for odd blits.

Starbuck5 · 2024-03-28T07:25:25Z

Hey, this is a great avenue for a PR, the SSE alpha blitters are some of the oldest SIMD code we have, and we've learned new strategies since they were written.

In the old algorithm, since it was going one pixel at a time in the odd path, it had a fast path where it would just copy over the pixel data if certain conditions were met. How does the performance now stack up against that fast path?

Also, tell me more about the changes. Did you do any algorithmic changes or is it just changes to increase the parallelism?

itzpr3d4t0r · 2024-03-28T10:06:10Z

I can give a partial answer because I'm in a hurry but thanks for the comment, i didn't test that. Using the same test i showed in the previous PR with a surface set to 0 alpha or 255 alpha i get the folowing result:

Basically the new strategy is faster (60%) when the surface has odd width and a bit slower (25-30%) in the even case. This may suggest further changes are needed but since these are edge cases (i mean when does someone blit something to a 0 alpha destination or a full 255 alpha surface onto another using alpha?) I don't think this is too bad.

itzpr3d4t0r · 2024-03-31T12:37:09Z

Random test fails aside (not caused by this PR) I've managed to improve performance by 15-20% by removing some unnecessary instructions and now the situation is much better compared to the old strategy in the dst_alpha=0 case as well as src=255:

itzpr3d4t0r · 2024-05-15T16:57:52Z

Anyway about this:

Also, tell me more about the changes. Did you do any algorithmic changes or is it just changes to increase the parallelism?

The changes are mainly focusing on improving parallelism as I'm following the sdl formula exactly. I tried to save a few unpack here and there but it's really not much different from before.

MyreMylar · 2024-05-25T15:22:49Z

Test program:

from sys import stdout
from pstats import Stats
from cProfile import Profile


import pygame

pygame.init()


def speed_test_blits():
    # chosen because they contain edge cases.
    nums = [0, 1, 65, 126, 127, 199, 254, 255]
    results = {}
    for iterations in range(0, 10000):
        for dst_r, dst_b, dst_a in zip(nums, reversed(nums), reversed(nums)):
            for src_r, src_b, src_a in zip(nums, reversed(nums), nums):
                src_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
                src_surf.fill((src_r, 255, src_b, src_a))
                dest_surf = pygame.Surface((66, 66), pygame.SRCALPHA, 32)
                dest_surf.fill((dst_r, 255, dst_b, dst_a))

                dest_surf.blit(src_surf, (0, 0))
                key = ((dst_r, dst_b, dst_a), (src_r, src_b, src_a))
                results[key] = dest_surf.get_at((65, 33))


if __name__ == "__main__":
    profiler = Profile()
    profiler.runcall(speed_test_blits)
    stats = Stats(profiler, stream=stdout)
    stats.strip_dirs()
    stats.sort_stats("cumulative")
    stats.print_stats()

Results on my Raspberry Pi:


Current main branch:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.606    2.606   19.145   19.145 blitter_speed_test.py:11(speed_test_blits)
   640000   13.319    0.000   13.319    0.000 {method 'blit' of 'pygame.surface.Surface' objects}
  1280000    2.997    0.000    2.997    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
   640000    0.223    0.000    0.223    0.000 {method 'get_at' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

With PR:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.642    2.642   13.446   13.446 blitter_speed_test.py:11(speed_test_blits)
   640000    7.507    0.000    7.507    0.000 {method 'blit' of 'pygame.surface.Surface' objects}
  1280000    3.056    0.000    3.056    0.000 {method 'fill' of 'pygame.surface.Surface' objects}
   640000    0.241    0.000    0.241    0.000 {method 'get_at' of 'pygame.surface.Surface' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Looks like a solid improvement here when the SSE2 code is converted to NEON code.

MyreMylar

OK, LGTM 👍

I see a solid performance improvement heading down this code path and it passes all the unit tests. The SIMD code is a lot simpler now and more in tune with what we've been doing in recent blitters rather than these older versions. Much has been learned since!

Good job! 👍 🎈 🍾

optimized alphablit_alpha_sse2_argb_no_surf_alpha algorithm.

a61f6f1

itzpr3d4t0r added Performance Related to the speed or resource usage of the project SIMD Surface pygame.Surface labels Mar 24, 2024

itzpr3d4t0r requested a review from a team as a code owner March 24, 2024 16:18

itzpr3d4t0r requested review from MyreMylar and Starbuck5 March 24, 2024 16:37

itzpr3d4t0r added 2 commits March 31, 2024 12:50

tentative to improve performance

efd89a1

remove unused function

e71d400

itzpr3d4t0r closed this Mar 31, 2024

itzpr3d4t0r reopened this Mar 31, 2024

itzpr3d4t0r closed this Apr 1, 2024

itzpr3d4t0r reopened this Apr 1, 2024

MyreMylar closed this May 11, 2024

MyreMylar reopened this May 11, 2024

MyreMylar approved these changes May 25, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized SSE2 alpha blit algorithm #2766

Optimized SSE2 alpha blit algorithm #2766

itzpr3d4t0r commented Mar 24, 2024 •

edited

Loading

Starbuck5 commented Mar 28, 2024

itzpr3d4t0r commented Mar 28, 2024

itzpr3d4t0r commented Mar 31, 2024

itzpr3d4t0r commented May 15, 2024

MyreMylar commented May 25, 2024 •

edited

Loading

MyreMylar left a comment

Optimized SSE2 alpha blit algorithm #2766

Are you sure you want to change the base?

Optimized SSE2 alpha blit algorithm #2766

Conversation

itzpr3d4t0r commented Mar 24, 2024 • edited Loading

Starbuck5 commented Mar 28, 2024

itzpr3d4t0r commented Mar 28, 2024

itzpr3d4t0r commented Mar 31, 2024

itzpr3d4t0r commented May 15, 2024

MyreMylar commented May 25, 2024 • edited Loading

MyreMylar left a comment

Choose a reason for hiding this comment

itzpr3d4t0r commented Mar 24, 2024 •

edited

Loading

MyreMylar commented May 25, 2024 •

edited

Loading